Collecting tweets using Python

MRT, Singapore

This article demonstrates how to collect tweets from the Twitter Streaming API with Python. The data is stored in a flat file for later analysis.

I am using Ubuntu 13.10 for this article, but if you have a working Python 2 installation and a Python package manager such as pip, you should be able to follow along.

Setting up tweepy, a Python wrapper for the Twitter API

To collect tweets we will use the tweepy library, which is a wrapper around the Twitter API. To install:

pip install tweepy

Before you can access the Twitter Streaming API, you need to register your client application (the Python script) with Twitter. The tweepy authentication tutorial explains step by step how to set up your app for OAuth authentication.

Once you have registered with Twitter, go to the My Applications page and you should see your newly created application. Click on its link to open the application details page:

[Screenshot: Twitter application details page]

Let’s write a simple script to test the authentication.
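
A minimal sketch of such a test script follows. It assumes the four credentials from the application details page are exported as environment variables. The variable names and the helpers load_credentials and build_api are my own illustration, not part of tweepy.

```python
import os

# Hypothetical environment variable names; use whatever suits your setup.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four OAuth values in order, or raise if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return tuple(env[key] for key in REQUIRED_KEYS)


def build_api(credentials):
    """Create an authenticated tweepy API object from the four values."""
    import tweepy  # imported lazily so load_credentials works without tweepy

    consumer_key, consumer_secret, access_token, access_secret = credentials
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return tweepy.API(auth)
```

With valid keys, api = build_api(load_credentials()) followed by print(api.verify_credentials().name) should print the display name of the authenticated account.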

Here is the output when I run the script in IPython:

Python 2.7.3 |CUSTOM| (default, Apr 12 2012, 11:28:34)
Type "copyright", "credits" or "license" for more information.

IPython 1.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: MacOSX

In [1]: %run
Mark Kay

In [2]:

To get an idea of what you can do with the library, type ‘api.’ and press Tab to see the available methods:

In [2]: api.
api.add_list_member                  api.mentions_timeline
api.api_root                         api.parser
api.auth                             api.rate_limit_status
api.blocks                           api.related_results
api.blocks_ids                       api.remove_list_member
api.cache                            api.report_spam
api.compression                      api.retry_count
api.create_block                     api.retry_delay
api.create_favorite                  api.retry_errors
api.create_friendship                api.retweet
api.create_list                      api.retweeted_by
api.create_saved_search              api.retweeted_by_ids
api.destroy_block                    api.retweets
api.destroy_direct_message           api.retweets_of_me
api.destroy_favorite                 api.reverse_geocode
api.destroy_friendship               api.saved_searches
api.destroy_saved_search             api.search_host
api.destroy_status                   api.search_root
api.direct_messages                  api.search_users
api.followers                        api.send_direct_message
api.followers_ids                    api.sent_direct_messages
api.friends                          api.set_delivery_device
api.friends_ids                      api.show_friendship
api.friendships_incoming             api.show_list_member
api.friendships_outgoing             api.show_list_subscriber
api.geo_id                           api.subscribe_list
api.geo_search                       api.suggested_categories
api.geo_similar_places               api.suggested_users
api.get_direct_message               api.suggested_users_tweets
api.get_list                         api.test
api.get_oembed                       api.timeout
api.get_saved_search                 api.trends_available
api.get_status                       api.trends_closest
api.get_user                         api.trends_daily
api.home_timeline                    api.trends_place
                                     api.trends_weekly
api.last_response                    api.unsubscribe_list
api.list_members                     api.update_list
api.list_subscribers                 api.update_profile
api.list_timeline                    api.update_profile_background_image
api.lists_all                        api.update_profile_colors
api.lists_memberships                api.update_profile_image
api.lists_subscriptions              api.update_status
api.lookup_friendships               api.user_timeline
api.lookup_users                     api.verify_credentials
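
Outside IPython, a similar list can be produced with Python's built-in dir(). The sketch below uses a small stand-in class (Example, purely hypothetical) so that it runs without tweepy; passing the real api object instead should give a list like the one above.

```python
def public_names(obj):
    """Return the sorted public attribute names of obj."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


class Example(object):
    """Stand-in object so the sketch runs without tweepy."""

    timeout = 60

    def update_status(self):
        pass

    def _private(self):
        pass


print(public_names(Example()))  # ['timeout', 'update_status']
```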

What to collect?

Now that OAuth is working, we can write the script to collect tweets and store them in a file for later analysis.

Since I live in Singapore, I am going to collect tweets from people at MRT stations.


Why? No reason other than that I can think of many ways to extract useful information from the data. For example:

  • A proxy for population density
  • Geocoded tweets can be used to build a network map
  • Home/Work cycle can be calculated from data
  • Commute patterns?
  • Sentiment analysis, are people happy/sad/frustrated/mad at particular stations?

There are probably many other things that can be extracted from the data, limited only by your imagination (and programming skills).

You can, of course, change the query to analyze anything you are interested in. The articles I plan to write on data science and analysis will be based on this data.

Collecting tweets

We will use one of the public streams available through the Twitter API to listen for tweets that mention a known MRT station. The track parameter lets us follow up to 400 keywords, which is more than enough for our requirements. If you need to track more keywords, Twitter lists a number of partner data providers.
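
As a rough sketch, the keyword list can be built from a table of station names and checked against the 400-keyword limit before it is handed to the stream. The STATIONS table below contains only the stations that appear in the sample data in this article, and build_track_list is a hypothetical helper of mine.

```python
# Limit on track keywords for the public stream, as stated above.
MAX_TRACK_KEYWORDS = 400

# Station names and codes taken from the sample data in this article;
# a real run would list every station.
STATIONS = {
    "kembangan mrt": "EW6",
    "sengkang mrt": "NE16",
    "bishan mrt": "NS17",
    "admiralty mrt": "NS10",
    "bugis mrt": "EW12",
    "ang mo kio mrt": "NS16",
    "outram park mrt": "EW16",
}


def build_track_list(stations, limit=MAX_TRACK_KEYWORDS):
    """Return a sorted, de-duplicated keyword list, enforcing the limit."""
    keywords = sorted(set(name.strip().lower() for name in stations))
    if len(keywords) > limit:
        raise ValueError(
            "%d keywords exceeds the limit of %d" % (len(keywords), limit)
        )
    return keywords


print(build_track_list(STATIONS))
```

The result is what you would pass as the track argument of tweepy's stream filter call. Note that a multi-word phrase such as "bishan mrt" matches a tweet when all of its words are present, regardless of order or case.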

First of all, here is the script:

The script writes each tweet it receives to an hourly tweet file, storing the complete tweet in JSON format. In addition, if a tweet is geocoded and contains a station name and number (such as EW6), it is written to a separate file that can be used to generate a network map.

Retweets and tweets without user information are ignored.
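
The behaviour just described can be sketched as a handful of small functions. This is an illustration under my own assumptions, not the original script: the function names, the hourly file name pattern and the two-entry STATIONS table are hypothetical, and the station match is a plain substring test.

```python
import json
from datetime import datetime

# Station names and codes taken from the sample data in this article.
STATIONS = {"kembangan mrt": "EW6", "sengkang mrt": "NE16"}


def hourly_filename(now, prefix="tweets"):
    """Name of the file that collects all tweets for the hour of `now`."""
    return "%s-%s.json" % (prefix, now.strftime("%Y%m%d-%H"))


def should_keep(tweet):
    """Skip tweets without user information and skip retweets."""
    return bool(tweet.get("user")) and "retweeted_status" not in tweet


def find_station(text, stations):
    """Return (name, code) if the text mentions both, else None."""
    lowered = text.lower()
    for name, code in stations.items():
        if name in lowered and code.lower() in lowered:
            return name, code
    return None


def station_line(tweet, stations):
    """Build a tab-separated 'lon lat name code' line, or None."""
    point = tweet.get("coordinates")  # GeoJSON point: [longitude, latitude]
    match = find_station(tweet.get("text", ""), stations)
    if not point or not match:
        return None
    lon, lat = point["coordinates"]
    return "%.6f\t%.6f\t%s\t%s" % (lon, lat, match[0], match[1])


def process(raw_json, now, stations=STATIONS):
    """Return (hourly file name, raw JSON, station line), or None if ignored."""
    tweet = json.loads(raw_json)
    if not should_keep(tweet):
        return None
    return hourly_filename(now), raw_json.strip(), station_line(tweet, stations)
```

The caller would append the raw JSON to the hourly file and, when a station line is present, append that line to the station file. In the tweepy versions that provide a StreamListener class, the listener's on_data method is a natural place to call process.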

Example of output (details changed):

Afvel Thompson
'Samurai' man jumped gate, boarded train at Paya Lebar MRT


Found geo coords for MRT Station (EW6) 'kembangan mrt': (103.912830, 1.320984)


<a href="" rel="nofollow">foursquare</a>
I'm at Kembangan MRT Station (EW6) - @smrt_singapore (Singapore)


Found geo coords for MRT Station (NE16) 'sengkang mrt': (103.895441, 1.391683)


<a href="" rel="nofollow">foursquare</a>
I'm at Sengkang MRT/LRT Interchange (NE16/STC) (Singapore)


Example of tweet data in JSON format (abridged excerpt, details changed):

    "created_at":"Wed Dec 18 09:34:43 +0000 2013",
    "text":"I'm at Sengkang MRT\/LRT Interchange (NE16\/STC) (Singapore) http:\/\/\/c1zUxPiLad",
    "source":"\u003ca href=\"http:\/\/\" rel=\"nofollow\"\u003efoursquare\u003c\/a\u003e",
        "description":"One For All, All For One",
        "created_at":"Thu Mar 11 11:24:04 +0000 2013",
        "coordinates":[1.39168285, 103.89544129]},
            "coordinates":[103.80000000, 1.30000000]
        "name":"Johor Bahru",
        "full_name":"Johor Bahru, Johore",
                            [103.5363541, 1.3416253],
                            [103.5363541, 1.673503],
                            [104.0161667, 1.673503],
                            [104.0161667, 1.3416253]
                "indices":[59, 81]
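
Two details of this format are easy to trip over: the timestamp is a string that needs parsing, and the two coordinate arrays use opposite orders. The sketch below assumes the usual layout of a Twitter API v1.1 tweet, where the geo field holds [latitude, longitude] and the coordinates field holds [longitude, latitude]. The key names are that assumption, the values are copied from the excerpt, and the helper names are hypothetical.

```python
import json
from datetime import datetime

# A cut-down tweet using values from the excerpt above.
RAW = json.dumps({
    "created_at": "Wed Dec 18 09:34:43 +0000 2013",
    "text": "I'm at Sengkang MRT/LRT Interchange (NE16/STC) (Singapore)",
    "geo": {"type": "Point", "coordinates": [1.39168285, 103.89544129]},
    "coordinates": {"type": "Point", "coordinates": [103.8, 1.3]},
})


def parse_created_at(tweet):
    """Parse the UTC timestamp string into a naive datetime."""
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S +0000 %Y")


def lon_lat(tweet):
    """Return (longitude, latitude), or None if the tweet is not geocoded."""
    point = tweet.get("coordinates")
    if point:
        lon, lat = point["coordinates"]  # already longitude first
        return lon, lat
    geo = tweet.get("geo")
    if geo:
        lat, lon = geo["coordinates"]  # latitude first, so swap
        return lon, lat
    return None


tweet = json.loads(RAW)
print(parse_created_at(tweet))  # 2013-12-18 09:34:43
print(lon_lat(tweet))
```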

This is what the geocoded station data file looks like:

103.848310  1.350732    bishan mrt  NS17
103.848310  1.350732    bishan mrt  NS17
103.800749  1.440672    admiralty mrt   NS10
103.856120  1.300588    bugis mrt   EW12
103.849468  1.369641    ang mo kio mrt  NS16
103.839126  1.281345    outram park mrt EW16
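
As a first taste of the analysis, here is a small sketch that reads lines in this format and counts tweets per station. It assumes the columns are longitude, latitude, station name and station code separated by whitespace, and that the name may contain several words. The helper names are my own.

```python
from collections import Counter

# The sample lines shown above.
SAMPLE = """\
103.848310  1.350732    bishan mrt  NS17
103.848310  1.350732    bishan mrt  NS17
103.800749  1.440672    admiralty mrt   NS10
103.856120  1.300588    bugis mrt   EW12
103.849468  1.369641    ang mo kio mrt  NS16
103.839126  1.281345    outram park mrt EW16
"""


def parse_station_line(line):
    """Split one line into (lon, lat, name, code)."""
    parts = line.split()
    lon, lat = float(parts[0]), float(parts[1])
    name = " ".join(parts[2:-1])  # everything between the numbers and the code
    return lon, lat, name, parts[-1]


def count_by_station(lines):
    """Count how many geocoded tweets were recorded for each station code."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[parse_station_line(line)[3]] += 1
    return counts


counts = count_by_station(SAMPLE.splitlines())
print(counts.most_common(1))  # [('NS17', 2)]
```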

Final Note

Twitter's API Terms of Service restrict the public release of tweet datasets. However, you can share derivative analysis of tweets, such as content analysis and aggregate statistics.