Calculating a Network Map by analyzing Tweets
In this article I demonstrate generating the network map of SMRT stations just by analyzing the Twitter public stream. I am using data from Singapore, but you could calculate the network map for any city by changing the station names that are tracked.
Steps to generate the map
- Collect tweets that mention any of the station names you want to track
- For geocoded tweets, extract the longitude and latitude
- Filter as required (I use a supervised classifier to remove unwanted tweets)
- Aggregate data
- Write an HTML page that overlays the aggregated data on a map
Steps 1 & 2
I wrote a separate article on how to collect tweets using Python.
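As a rough illustration of steps 1 and 2, here is a minimal sketch of matching a station name in a tweet and pulling out its coordinates. It assumes each tweet is a JSON object in the Twitter API format, where a geocoded tweet carries a GeoJSON point under "coordinates" with longitude first. The station list and the helper names (STATIONS, extract_point, match_station) are hypothetical, and the sample tweet text is made up.

```python
import json

# Hypothetical station list; replace with the names you track.
STATIONS = ["Expo", "Bedok", "Chinese Garden"]


def extract_point(tweet):
    """Return (longitude, latitude) for a geocoded tweet, else None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lng, lat = coords["coordinates"]  # GeoJSON order: longitude first
    return lng, lat


def match_station(text, stations=STATIONS):
    """Return the first tracked station named in the tweet text, else None."""
    lowered = text.lower()
    for name in stations:
        if name.lower() in lowered:
            return name
    return None


# Made-up sample tweet, trimmed to the two fields used here.
raw = ('{"text": "Stuck at Bedok MRT again", '
       '"coordinates": {"type": "Point", "coordinates": [103.930063, 1.324185]}}')
tweet = json.loads(raw)
print(match_station(tweet["text"]), extract_point(tweet))
```

Tweets without a geotag have a null "coordinates" field, so extract_point returns None for them and they can be skipped.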
Step 3: Filter tweets
A separate article I wrote shows how to write a supervised classifier to label tweets. This makes it easy to remove unwanted tweets: in this case I keep only ‘personal’ tweets and remove news and business tweets.
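To make the filtering step concrete, here is a minimal sketch of a supervised text classifier. It uses a small multinomial Naive Bayes model with add-one smoothing, which is my choice for illustration and not necessarily the classifier from that article. The training tweets are invented, and the label names other than ‘personal’ are assumptions based on the categories mentioned above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labelled):
    """labelled: list of (text, label) pairs. Returns the counts the model needs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training tweets
    vocab = set()
    for text, label in labelled:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest Naive Bayes log score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus log likelihood of each word, with add-one smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples; a real model needs far more labelled tweets.
training = [
    ("so tired waiting at bedok mrt", "personal"),
    ("late for work again stuck at expo mrt", "personal"),
    ("train service delayed between bedok and expo", "news"),
    ("new shop opening near expo mrt visit us today", "business"),
]
model = train(training)
tweets = ["stuck at bedok mrt so tired", "train service delayed at expo"]
personal = [t for t in tweets if classify(t, model) == "personal"]
print(personal)
```

Only the tweets labelled ‘personal’ are passed on to the aggregation step.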
Step 4: Aggregate data
I use Apache Pig to aggregate the tweet data. You can use Pig in standalone (local) mode and run it from the command line.
This step assumes you have tweet data stored in CSV format; if not, you will need to write a conversion script first. We calculate unique tweeters by station and by station/language. This is done for all tweets, and a daily total is calculated as well.
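The aggregation itself is simple to state: count distinct users per station and per station/language pair. The sketch below shows that logic in Python as a stand-in for the Pig script, not a copy of it. The CSV column names (date, user, station, lang) and the sample rows are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def unique_tweeters(rows):
    """Count distinct users per station and per (station, language)."""
    by_station = defaultdict(set)
    by_station_lang = defaultdict(set)
    for row in rows:
        by_station[row["station"]].add(row["user"])
        by_station_lang[(row["station"], row["lang"])].add(row["user"])
    return (
        {key: len(users) for key, users in by_station.items()},
        {key: len(users) for key, users in by_station_lang.items()},
    )


# Made-up sample data with an assumed column layout.
sample = io.StringIO(
    "date,user,station,lang\n"
    "20140125,alice,Expo,English\n"
    "20140125,alice,Expo,English\n"
    "20140125,bob,Expo,Tagalog\n"
    "20140126,carol,Bedok,English\n"
)
rows = list(csv.DictReader(sample))

# Totals over all tweets.
station_totals, station_lang_totals = unique_tweeters(rows)
# Daily totals: apply the same function to one day's rows.
daily_totals, _ = unique_tweeters(r for r in rows if r["date"] == "20140125")

print(station_totals)
print(station_lang_totals)
print(daily_totals)
```

Using a set per key means a user who tweets several times from the same station is counted once.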
Pig script: aggregate.pig
I use a wrapper shell script to invoke the Pig script. Here is the shell script:
Running the aggregate script:
$ sh aggregate.sh 20140125
pig -l /tmp -x local -p DATE=20140125 -f aggregate.pig
...
2014-01-27 09:16:41,119 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-01-27 09:16:41,119 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2014-01-27 09:16:41,120 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2-cdh3u3 0.12.1-SNAPSHOT mark 2014-01-27 09:16:31 2014-01-27 09:16:41 GROUP_BY,FILTER
Success!
...
2014-01-27 09:16:41,129 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Step 5: Write an HTML page that overlays the aggregated data on a map
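import sys

import mrt  # utility functions

# Station data, including longitude and latitude, keyed by '<station name> mrt' in lower case
stations = mrt.get_map()

# Each input line is: station name <TAB> language <TAB> unique tweeter count
with open('totals/by_station_lang/part-r-00000') as f:
    data = [x.strip().split('\t') for x in f]

print('var pnts = [')
for (st_name, lang, cnt) in data:
    st_key = (st_name + ' mrt').lower()
    try:
        st = stations[st_key]
        # Assumption: each station record is a (longitude, latitude, station code)
        # sequence; adjust the indices to match what mrt.get_map() returns.
        print(" [%f,%f,'%s','%s','%s',%s ]," % (st[0], st[1], st_name, st[2], lang, cnt))
    except Exception:
        # Unknown station name or unexpected record: log it and carry on
        sys.stderr.write('Failed to process key "%s"\n' % st_key)
print('];')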
I invoke it like this:
python summarize_lang.py 2>/tmp/errors.log > tweets_lang.js
The output looks like:
var pnts = [
 [103.961542,1.334504,'Expo','CG1','French',3 ],
 [103.961542,1.334504,'Expo','CG1','German',1 ],
 [103.961542,1.334504,'Expo','CG1','English',67 ],
 [103.961542,1.334504,'Expo','CG1','Tagalog',4 ],
 [103.961542,1.334504,'Expo','CG1','Indonesian',2 ],
 [103.930063,1.324185,'Bedok','EW5','Dutch',4 ],
 [103.930063,1.324185,'Bedok','EW5','Danish',1 ],
 [103.930063,1.324185,'Bedok','EW5','French',2 ],
 [103.930063,1.324185,'Bedok','EW5','German',1 ],
 ...
 [103.732551,1.342428,'Chinese Garden','EW25','Indonesian',2 ],
 [103.863121,1.300963,'Nicoll Highway','CC5','Korean',1 ],
 [103.863121,1.300963,'Nicoll Highway','CC5','English',28 ],
 [103.863121,1.300963,'Nicoll Highway','CC5','Tagalog',1 ],
 [103.863121,1.300963,'Nicoll Highway','CC5','Indonesian',1 ],
 [103.815007,1.322576,'Botanic Gardens','CC19','English',23 ],
 [103.815007,1.322576,'Botanic Gardens','CC19','Japanese',1 ],
 [103.815007,1.322576,'Botanic Gardens','CC19','Slovenian',1 ],
 [103.795738,1.310735,'Holland Village','CC21','English',23 ],
];
For the HTML page showing the tweet data overlaid on a map I use:
- CloudMade for the maps
Tweeters by Station
The colours represent the station lines:
- Green = East West Line
- Red = North South Line
- Purple = North East Line
- Orange = Circle Line
Each circle is a station; the size of the circle represents the number of tweeters who have tweeted from that station.
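One way to derive a circle's colour and size from the aggregated data is sketched below. The colours follow the list above and are keyed on the station code prefix. Treating the CG codes (the Changi Airport branch) as East West Line, scaling the radius with the square root of the count so that area tracks the number of tweeters, and the names LINE_COLOURS, marker_style and scale are all my assumptions, not taken from the page itself.

```python
import math

# Line colours keyed by station code prefix.
# Assumption: CG (Changi Airport branch) is drawn as East West Line.
LINE_COLOURS = {
    "EW": "green",
    "CG": "green",
    "NS": "red",
    "NE": "purple",
    "CC": "orange",
}


def marker_style(station_code, tweeters, scale=2.0):
    """Return (colour, radius) for a station circle.

    The radius grows with the square root of the tweeter count, so the
    circle's area is proportional to the count. 'scale' is a hypothetical
    tuning factor.
    """
    prefix = station_code.rstrip("0123456789")  # 'CC19' -> 'CC'
    colour = LINE_COLOURS.get(prefix, "grey")   # unknown lines fall back to grey
    radius = scale * math.sqrt(tweeters)
    return colour, radius


print(marker_style("CC19", 25))
print(marker_style("EW5", 4))
```

Scaling the radius linearly with the count would also work, but it makes busy stations look disproportionately large.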
I use a heatmap plugin for Leaflet to display tweeters who tweet in Indonesian and Tagalog (spoken in the Philippines). These are the only two languages other than English with a large enough sample size to generate a heat map.
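Preparing the heat map input amounts to filtering the aggregated rows by language and reordering the coordinates. The sketch below assumes rows in the same layout as the pnts array (longitude, latitude, station name, station code, language, count) and a plugin that expects [latitude, longitude, intensity] triples, which matches Leaflet's latitude-first convention but may differ from the exact plugin used here. The function name heat_points is hypothetical.

```python
def heat_points(pnts, language):
    """Return [latitude, longitude, count] triples for one language."""
    return [
        [lat, lng, cnt]
        for (lng, lat, _name, _code, lang, cnt) in pnts
        if lang == language
    ]


# A few rows taken from the output shown earlier.
pnts = [
    [103.961542, 1.334504, 'Expo', 'CG1', 'English', 67],
    [103.961542, 1.334504, 'Expo', 'CG1', 'Tagalog', 4],
    [103.930063, 1.324185, 'Bedok', 'EW5', 'French', 2],
    [103.863121, 1.300963, 'Nicoll Highway', 'CC5', 'Tagalog', 1],
]

print(heat_points(pnts, 'Tagalog'))
```

Note the swap: the aggregated rows store longitude first, while the heat points put latitude first.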
HTML File: heatmap.html
Tweeters By Station and Language
Sample Size: 1167 tweeters
Tagalog (Spoken in the Philippines) Tweeters
Sample Size: 788 tweeters