Calculating a Network Map by analyzing Tweets

[Image: network map]

In this article I demonstrate how to generate a network map of SMRT stations just by analyzing the Twitter public stream. I am using data from Singapore, but you could calculate the network map for any city by changing the station names that are tracked.

Steps to generate the map

  1. Collect tweets that mention any of the station names you want to track
  2. For geocoded tweets, extract the longitude and latitude
  3. Filter as required (I use a supervised classifier to remove unwanted tweets)
  4. Aggregate the data
  5. Write an HTML page that overlays the aggregated data on a map

Steps 1 & 2

I wrote a separate article on how to collect tweets using Python.
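As a rough illustration of steps 1 and 2, here is a minimal sketch that takes one JSON-encoded tweet per line, checks which tracked station names it mentions, and pulls out the coordinates. It assumes the classic tweet JSON layout, where a geocoded tweet carries a GeoJSON "coordinates" object with longitude first. The STATIONS list and the function names are hypothetical and are not taken from my collection script.

```python
import json

# Hypothetical list of tracked station names; replace with your own.
STATIONS = ["Expo", "Bedok", "Nicoll Highway"]


def extract_point(tweet):
    """Return (longitude, latitude) for a geocoded tweet, else None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order: longitude first
    return (lon, lat)


def match_stations(text, stations=STATIONS):
    """Return the tracked station names mentioned in the tweet text."""
    lowered = text.lower()
    return [name for name in stations if name.lower() in lowered]


def parse_line(line):
    """Parse one JSON tweet into (user, stations, point), or None to skip it."""
    tweet = json.loads(line)
    point = extract_point(tweet)
    names = match_stations(tweet.get("text", ""))
    if point is None or not names:
        return None  # not geocoded, or no tracked station mentioned
    return (tweet["user"]["screen_name"], names, point)
```

Tweets without coordinates are dropped here because the map needs a position for every data point.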

Step 3: Filter tweets

Another article I wrote shows how to write a supervised classifier to label tweets. This makes it easy to remove unwanted tweets. In this case I keep only ‘personal’ tweets; news and business tweets are removed.
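To give a feel for what such a filter does, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. This is an illustrative stand-in and not necessarily the classifier from my other article; the labels, the training sentences and the helper names are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of tweets
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def keep_personal(tweets, model):
    """Keep only the tweets the model labels as 'personal'."""
    return [t for t in tweets if model.classify(t) == "personal"]


# Hypothetical hand-labelled training data.
TRAINING = [
    ("on my way home from bedok mrt", "personal"),
    ("waiting for my friend at expo mrt", "personal"),
    ("train service disrupted between bedok and expo", "news"),
    ("smrt announces new train service schedule", "news"),
    ("great deals at our store near expo mrt", "business"),
    ("visit our new outlet at bedok today", "business"),
]
```

A real classifier needs far more labelled tweets than this, but the filtering step itself stays the same: classify each tweet and keep only the label you care about.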

 

Step 4: Aggregate data

I use Apache Pig to aggregate the tweet data. Pig can run in local (standalone) mode, so you can run it straight from the command line.

This step assumes you have tweet data stored in CSV format. If not, a conversion script will need to be written first. We calculate the number of unique tweeters by station and by station/language. This is done across all tweets, and a daily total is calculated as well.
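If you do not want to use Pig, the same aggregation is easy to express in Python. The sketch below is a rough equivalent and not a translation of aggregate.pig; the CSV column names (date, user, station, lang) are assumptions about the input layout.

```python
import csv
import io
from collections import defaultdict


def aggregate(rows):
    """Count unique tweeters per station and per (station, language)."""
    by_station = defaultdict(set)
    by_station_lang = defaultdict(set)
    for row in rows:
        by_station[row["station"]].add(row["user"])
        by_station_lang[(row["station"], row["lang"])].add(row["user"])
    return (
        {key: len(users) for key, users in by_station.items()},
        {key: len(users) for key, users in by_station_lang.items()},
    )


def aggregate_for_day(rows, date):
    """Same counts, restricted to one day (date formatted as YYYYMMDD)."""
    return aggregate(row for row in rows if row["date"] == date)


# Hypothetical sample input; alice tweets twice but is counted once.
SAMPLE = """date,user,station,lang
20140125,alice,Expo,English
20140125,alice,Expo,English
20140125,bob,Expo,French
20140126,carol,Bedok,English
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
totals, totals_lang = aggregate(rows)
daily, daily_lang = aggregate_for_day(rows, "20140125")
```

Using a set per key is what makes the counts "unique tweeters" rather than raw tweet counts.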

Pig script: aggregate.pig

I use a wrapper shell script to invoke the Pig script. Here is the shell script:

Running the aggregate script:

$ sh aggregate.sh 20140125
pig -l /tmp -x local -p DATE=20140125 -f aggregate.pig

...

2014-01-27 09:16:41,119 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-01-27 09:16:41,119 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2014-01-27 09:16:41,120 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
0.20.2-cdh3u3   0.12.1-SNAPSHOT mark    2014-01-27 09:16:31     2014-01-27 09:16:41     GROUP_BY,FILTER

Success!

...

2014-01-27 09:16:41,129 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
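Judging from the command echoed at the top of the log, the wrapper does little more than pass the date through to Pig. A Python equivalent might look like the sketch below. The function names and the date check are my own additions for illustration; only the pig flags are taken from the echoed command.

```python
import re
import subprocess


def build_pig_command(date, script="aggregate.pig"):
    """Build the pig command line for one day's aggregation."""
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError("DATE must look like YYYYMMDD, got %r" % date)
    # -l: log directory, -x local: local mode, -p: parameter, -f: script file
    return ["pig", "-l", "/tmp", "-x", "local", "-p", "DATE=" + date, "-f", script]


def run(date):
    """Echo the command, run it, and return pig's exit code."""
    cmd = build_pig_command(date)
    print(" ".join(cmd))
    return subprocess.call(cmd)
```

Calling run("20140125") would echo and execute the same command shown in the log, provided pig is on the PATH.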

Step 5: Write an HTML page that overlays the aggregated data on a map

Now that we have aggregated data, the next step is to make the data accessible from a web page. I do this by converting the aggregated data into a JavaScript array and loading it into the web page with a SCRIPT tag. Alternatively, you could generate JSON data and then write some JavaScript to load it.

Here is the Python script that converts the aggregated output into JavaScript:

import sys
import mrt  # my own utility functions (not shown here)

# Station data keyed by e.g. 'expo mrt'; each entry holds
# longitude, latitude and station code, in that order.
stations = mrt.get_map()

# The Pig output is tab-separated: station name, language, tweeter count
with open('totals/by_station_lang/part-r-00000') as f:
    data = [line.strip().split('\t') for line in f]

print('var pnts = [')
for st_name, lang, cnt in data:
    st_key = (st_name + ' mrt').lower()
    try:
        st = stations[st_key]
    except KeyError:
        # Unknown station: log it and carry on with the next row
        sys.stderr.write('Failed to process key "%s"\n' % st_key)
        continue
    print("  [%f,%f,'%s','%s','%s',%s ]," % (st[0], st[1], st_name, st[2], lang, cnt))

print('];')

I invoke it like this:

python summarize_lang.py 2>/tmp/errors.log > tweets_lang.js

The output looks like:

var pnts = [
  [103.961542,1.334504,'Expo','CG1','French',3 ],
  [103.961542,1.334504,'Expo','CG1','German',1 ],
  [103.961542,1.334504,'Expo','CG1','English',67 ],
  [103.961542,1.334504,'Expo','CG1','Tagalog',4 ],
  [103.961542,1.334504,'Expo','CG1','Indonesian',2 ],
  [103.930063,1.324185,'Bedok','EW5','Dutch',4 ],
  [103.930063,1.324185,'Bedok','EW5','Danish',1 ],
  [103.930063,1.324185,'Bedok','EW5','French',2 ],
  [103.930063,1.324185,'Bedok','EW5','German',1 ],
...
  [103.732551,1.342428,'Chinese Garden','EW25','Indonesian',2 ],
  [103.863121,1.300963,'Nicoll Highway','CC5','Korean',1 ],
  [103.863121,1.300963,'Nicoll Highway','CC5','English',28 ],
  [103.863121,1.300963,'Nicoll Highway','CC5','Tagalog',1 ],
  [103.863121,1.300963,'Nicoll Highway','CC5','Indonesian',1 ],
  [103.815007,1.322576,'Botanic Gardens','CC19','English',23 ],
  [103.815007,1.322576,'Botanic Gardens','CC19','Japanese',1 ],
  [103.815007,1.322576,'Botanic Gardens','CC19','Slovenian',1 ],
  [103.795738,1.310735,'Holland Village','CC21','English',23 ],
];
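For the JSON alternative mentioned earlier, the conversion is almost the same. This is a minimal sketch, assuming the same tab-separated input and a station lookup shaped like the one returned by mrt.get_map(); the rows_to_json name and the JSON field names are hypothetical.

```python
import json


def rows_to_json(lines, stations):
    """Convert tab-separated 'station, language, count' lines to a JSON string."""
    points = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        st_name, lang, cnt = line.rstrip("\n").split("\t")
        st = stations.get((st_name + " mrt").lower())
        if st is None:
            continue  # unknown station: skip rather than guess a position
        lon, lat, code = st
        points.append({
            "lon": lon,
            "lat": lat,
            "station": st_name,
            "code": code,
            "lang": lang,
            "count": int(cnt),
        })
    return json.dumps(points, indent=2)
```

The page would then fetch this file and parse it, instead of pulling in a generated .js file with a SCRIPT tag.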

For the HTML page showing the tweet data overlaid onto a map I use:

  • CloudMade for the map tiles
  • Leaflet.js as the JavaScript mapping library; it is easy to use and works with CloudMade map tiles

Tweeters by Station

[Image: network map]

The colours represent the station lines:

  • Green = East West Line
  • Red = North South Line
  • Purple = North East Line
  • Orange = Circle Line

Each circle is a station, and the size of the circle represents the number of tweeters that have tweeted from that station.
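The styling logic can be sketched in a few lines. The colour lookup below follows the legend above, keyed on the station code prefix. Treating the CG prefix as part of the East West Line, the grey fallback, and the square-root scaling (so that circle area grows with the tweeter count) are my assumptions for this sketch and not necessarily what the page does.

```python
import math

# Station code prefix -> colour, following the legend above.
LINE_COLOURS = {
    "EW": "green",   # East West Line
    "CG": "green",   # assumption: Changi branch drawn with the East West Line
    "NS": "red",     # North South Line
    "NE": "purple",  # North East Line
    "CC": "orange",  # Circle Line
}


def station_style(code, tweeters, scale=2.0):
    """Return (colour, radius) for one station circle."""
    prefix = code.rstrip("0123456789")  # 'EW25' -> 'EW'
    colour = LINE_COLOURS.get(prefix, "grey")
    radius = scale * math.sqrt(tweeters)  # area proportional to tweeters
    return colour, radius
```

For example, station_style("EW5", 16) gives a green circle of radius 8.0. Interchange stations that carry more than one code would need extra handling.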


I use a heatmap plugin for Leaflet to display tweeters that tweet in Indonesian and Tagalog (spoken in the Philippines). These are the only two languages other than English with a big enough sample size to generate a heat map.
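Preparing the heat map input is mostly a matter of filtering the rows by language. The sketch below assumes rows shaped like the pnts array (longitude, latitude, name, code, language, count) and swaps the coordinates, because Leaflet expects latitude first. The exact point format depends on the heatmap plugin you choose, so treat the [lat, lon, count] triples as an assumption.

```python
def heat_points(pnts, language):
    """Return [lat, lon, count] triples for one language."""
    return [
        [lat, lon, count]
        for lon, lat, _name, _code, lang, count in pnts
        if lang == language
    ]


# A few rows copied from the generated pnts array.
PNTS = [
    [103.961542, 1.334504, "Expo", "CG1", "English", 67],
    [103.961542, 1.334504, "Expo", "CG1", "Indonesian", 2],
    [103.732551, 1.342428, "Chinese Garden", "EW25", "Indonesian", 2],
]

indonesian = heat_points(PNTS, "Indonesian")
```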

HTML File: heatmap.html

Tweeters By Station and Language

Indonesian Tweeters

Sample Size: 1167 tweeters

[Image: Indonesian tweeters]


Tagalog (Spoken in the Philippines) Tweeters

Sample Size: 788 tweeters

 

[Image: Tagalog tweeters]