Labelling tweets using Supervised Classification

In this article I use Python and the NLTK library to show how tweets can be categorized using a technique called Supervised Classification.

Reason to Categorize

Depending on the type of analysis needed, it can be beneficial to include or exclude certain categories of tweets. Some companies send large numbers of tweets that can skew analysis if not identified and removed. Manually removing tweets is not practical when dealing with a large amount of data, so an automatic method is needed. Supervised Classification is a method that can be used to quickly identify and remove undesirable data.


If you have a working version of Python 2, the only setup needed is to install the Natural Language Toolkit (NLTK):

pip install nltk

Quick overview of Supervised Classification


The steps needed to classify a tweet are:

  • Decide what properties of the data to pass to the classifier
  • Collect and manually label sample tweets (the more representative the sample is the better the accuracy you can achieve)
  • Train the classifier with a subset of the data
  • Test the accuracy of the classifier with the remaining data
  • Repeat until an acceptable accuracy is achieved
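The train/test step in the list above can be sketched as a simple shuffle-and-split. This is a minimal illustration, not the script used for this article: the function name `split_labelled`, the 75% training fraction, and the fixed seed are all assumptions made for the example.

```python
import random

def split_labelled(samples, train_fraction=0.75, seed=0):
    """Shuffle labelled samples and split them into (train, test) lists."""
    shuffled = list(samples)
    # a fixed seed keeps the split repeatable between runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# hypothetical placeholder data: (text, label) pairs
samples = [("tweet %d" % i, "personal") for i in range(8)]
train, test = split_labelled(samples)
```

Shuffling before the split matters when the labelled file is ordered by category; otherwise the test portion could end up containing only one label.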

For this example three categories will be defined:

  • personal – default category, the tweet is personal
  • business – tweet from a company
  • news – tweet from a news organisation or a news-related tweet

Next we write a script that will process the sample tweets and write out only the text property and the default label ‘personal’.

import json
import re

DEFAULT_LABEL = "personal"

# the output file receives the features extracted from the tweet data
# that will be used to train the classifier
with open('', 'w') as f:
    # the input file is a sample of tweets, one JSON object per line, that need to be labelled
    for tweet in open(''):
        data = json.loads(tweet)

        text = data['text'].strip().encode('ascii', errors='ignore')

        text = re.sub(r"\n", " ", text) # remove newlines from text

        # output features and a default category of 'personal'
        f.write("\n".join([ text, DEFAULT_LABEL, '\n' ]))

Now comes the manual step of assigning the correct label to each tweet.

Once all tweets have been labelled, the classifier can be trained.
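Before training, the labelled file has to be read back in. The script above writes each record as a text line, a label line, and a blank line, so a reader only needs to pair up the non-blank lines. The sketch below is one possible way to do that; `read_labelled` is a hypothetical helper name, and it assumes no tweet text is empty after cleaning.

```python
def read_labelled(lines):
    """Return (text, label) pairs from records of text line, label line, blank line."""
    # assumption: every record has a non-empty text line followed by a label line
    non_blank = [line.strip() for line in lines if line.strip()]
    # pair each text line with the label line that follows it
    return list(zip(non_blank[0::2], non_blank[1::2]))

# in-memory stand-in for the labelled file
sample = "some tweet text\npersonal\n\nanother tweet\nnews\n\n"
pairs = read_labelled(sample.splitlines())
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.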

Training the classifier

For each tweet we generate a tuple containing a dictionary of features and a label. For example this labelled tweet:

    'tweet': 'Currently stuck at paya lebar mrt till my dad transfers me $ to top up my ezlink.',
    'label': 'business'

Is first normalized into this:

    'tweet': 'currently stuck at MRT_STATION till my dad transfers me $ to top up my ezlink',
    'label': 'business'
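The normalization rules are not listed in full here, so the following is only a sketch of how such a step could look. It assumes lowercasing, replacing known station names with `MRT_STATION`, URLs with `LINK`, and standalone numbers with `NUM` (tokens that appear in the classifier output later on), and dropping sentence punctuation. The `STATIONS` list and the function name `normalize` are hypothetical.

```python
import re

# hypothetical, deliberately tiny list; a real list would cover every station
STATIONS = ["paya lebar", "eunos", "potong pasir", "raffles place"]

STATION_RE = re.compile(
    r"\b(?:%s)(?:\s+mrt)?\b" % "|".join(re.escape(name) for name in STATIONS))
LINK_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\b\d+\b")

def normalize(text):
    """Lowercase the text and replace stations, links and numbers with tokens."""
    text = text.lower()
    text = LINK_RE.sub("LINK", text)            # replace links before touching punctuation
    text = STATION_RE.sub("MRT_STATION", text)
    text = NUM_RE.sub("NUM", text)
    text = re.sub(r"[.,!?]+", " ", text)        # drop sentence punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

normalized = normalize(
    "Currently stuck at paya lebar mrt till my dad transfers me $ to top up my ezlink.")
```

Collapsing many different station names into one token lets the classifier learn from "a station was mentioned" rather than from each individual name.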

And then the features are extracted, one for each pair of adjacent words:

        'contains(up,my)': True,
        'contains(my,dad)': True,
        'contains(MRT_STATION,till)': True,
        'contains(dad,transfers)': True,
        'contains($,to)': True,
        'contains(me,$)': True,
        'contains(stuck,at)': True,
        'contains(till,my)': True,
        'contains(my,ezlink)': True,
        'contains(top,up)': True,
        'contains(to,top)': True,
        'contains(currently,stuck)': True,
        'contains(at,MRT_STATION)': True,
        'contains(transfers,me)': True
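The feature names above are word bigrams, so a feature extractor along these lines would produce them. This is a sketch rather than the article's own `tweet_features` function; `bigram_features` is a hypothetical name and it expects text that has already been normalized.

```python
def bigram_features(text):
    """Map each pair of adjacent words in the text to a True-valued feature."""
    words = text.split()
    # zip(words, words[1:]) yields (first, second) for every adjacent pair
    return dict(("contains(%s,%s)" % pair, True) for pair in zip(words, words[1:]))

features = bigram_features(
    "currently stuck at MRT_STATION till my dad transfers me $ to top up my ezlink")
```

A tweet of 15 words yields 14 such features, matching the list above.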

Putting the classifier to use on real data

I will use tweet data collected by filtering the Twitter public stream; I have already blogged about capturing this data.

Refer to this article if you need to set up a Python development environment.

Python file:

Running the classifier script and testing with unlabelled tweets


$ python
Most Informative Features
contains(MRT_STATION,LINK) = True           busine : person =     33.3 : 1.0
 contains(MRT_STATION,@) = True             news : person =     22.3 : 1.0
        contains(@,LINK) = True             news : person =      7.4 : 1.0
  contains(MRT_STATION,) = True             news : person =      7.1 : 1.0
          contains(in,a) = True             news : person =      6.6 : 1.0
   contains(bus,service) = True             news : person =      6.6 : 1.0
   contains(service,NUM) = True             news : person =      6.6 : 1.0
        contains(on,the) = True             news : person =      6.6 : 1.0
contains(to,MRT_STATION) = True             news : busine =      4.4 : 1.0
      contains(LINK,via) = True             news : person =      3.9 : 1.0
         contains(was,a) = True             news : person =      3.9 : 1.0
         contains(via,@) = True             news : person =      3.9 : 1.0
    contains(station,is) = True             news : person =      3.9 : 1.0
contains(at,MRT_STATION) = None           busine : news   =      3.0 : 1.0
contains(MRT_STATION,LINK) = None           person : busine =      2.9 : 1.0
   contains(samurai,man) = None           person : news   =      2.8 : 1.0
 contains(boarded,train) = None           person : news   =      2.8 : 1.0
   contains(jumped,gate) = None           person : news   =      2.8 : 1.0
      contains(train,at) = None           person : news   =      2.8 : 1.0
  contains(gate,boarded) = None           person : news   =      2.8 : 1.0
Total errors: 0
Accuracy:  1.0

We get 100% accuracy, which is a good start! For a production-quality classifier you would need to label many more tweets to get a representative sample.
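As a rough sketch of what the train-and-test step might look like, the example below assumes NLTK's `NaiveBayesClassifier`, whose "Most Informative Features" report has the same layout as the output above; the original script may differ. The sample tweets, the `bigram_features` helper, and the 75/25 split are hypothetical and exist only to make the example self-contained.

```python
import nltk

def bigram_features(text):
    """Map each pair of adjacent words to a True-valued feature."""
    words = text.split()
    return dict(("contains(%s,%s)" % pair, True) for pair in zip(words, words[1:]))

# hypothetical, already-normalized samples standing in for the labelled tweets
labelled = [
    ("stuck at MRT_STATION with my dad", "personal"),
    ("train fault at MRT_STATION LINK via @", "news"),
    ("room for rent near MRT_STATION LINK", "business"),
    ("meet me at MRT_STATION tonight", "personal"),
    ("bus service NUM delayed near MRT_STATION LINK via @", "news"),
    ("shop for sale near MRT_STATION LINK", "business"),
    ("waiting at MRT_STATION for my friend", "personal"),
    ("new station opens near MRT_STATION LINK via @", "news"),
]

featuresets = [(bigram_features(text), label) for text, label in labelled]

# train on the first 75% of the samples and test on the rest
cut = int(len(featuresets) * 0.75)
train_set, test_set = featuresets[:cut], featuresets[cut:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)

# count misclassified test samples and compute the overall accuracy
errors = sum(1 for features, label in test_set
             if classifier.classify(features) != label)
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Total errors: %d" % errors)
print("Accuracy: %s" % accuracy)
```

With so few samples the reported accuracy says little; the point is only to show how the training set, the test set, and the accuracy figure relate to each other.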

To use the classifier on unlabelled data:

$ ipython
...(omitted, same output as above)...
In [2]: classifier.classify(tweet_features({'tweet':'meet you at eunos mrt tonight'}))
Out[2]: 'personal'

In [3]: classifier.classify(tweet_features({'tweet':'Room for rent near Potong Pasir MRT'}))
Out[3]: 'business'

In [4]: classifier.classify(tweet_features({'tweet':'new bus terminal opens near raffles place mrt'}))
Out[4]: 'news'


See Also

  • Natural Language Processing with Python (2009) – Excellent book that will help you understand how to use the NLTK library for natural language processing
  • NLTK – Visit the NLTK site for the latest news and documentation