Using Chi-Squared and TF-IDF Vectorization


We will analyze a collection of tweets gathered from Twitter's public stream that mention SMRT station names. I wrote about collecting this data in a previous blog article, Collecting Tweets using Python. To determine which words are important we use the term frequency-inverse document frequency (tf-idf) method. To measure how well a word fits a class (station name) we use the chi-squared statistic, which lets us weed out words that are independent of class and therefore irrelevant for classification.

This is also a good method for finding spam, because spam tweets can be highly correlated with a class: a business advertising its location near an SMRT station, for example, will have tweets that correspond closely to that station name. We can then label those tweets as spam and re-run our classifier to eliminate them.
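To make the tf-idf idea concrete, here is a minimal sketch of the textbook formula in plain Python. It is not the code used for this article: the function name tfidf and the three sample documents are made up for illustration, and libraries usually apply smoothing and normalization, so their weights will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = relative term frequency * log(inverse document frequency).
        weights.append({t: (count / len(doc)) * math.log(n_docs / df[t])
                        for t, count in tf.items()})
    return weights

# Hypothetical sample documents, already tokenized.
docs = [
    "well played the tuckshop".split(),
    "train delay at the interchange".split(),
    "the tuckshop near dakota".split(),
]
w = tfidf(docs)
```

A word that occurs in every document, such as "the" here, gets a weight of zero, while a word found in only one document is weighted highest.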

The tweet data is stored in a CSV file in which each line holds the text of the tweet and the station name, separated by a tab. The data has already been classified so that it contains only personal tweets; business and news tweets should have been removed automatically. See the blog article Labelling Tweets using Supervised Classification to see how to do that.
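A file in that layout could be read roughly as follows. This is a sketch under assumptions: the helper load_tweets is hypothetical, the sample lines are shortened stand-ins for real data, and the column order (tweet text, then station) is taken from the description above.

```python
def load_tweets(lines):
    """Split tab-separated lines into parallel tweet and station lists."""
    tweets, stations = [], []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        # Split on the last tab so any tabs inside the tweet text survive.
        text, station = line.rsplit("\t", 1)
        tweets.append(text.strip())
        stations.append(station.strip())
    return tweets, stations

# Made-up sample lines in the assumed "text<TAB>station" layout.
sample = [
    'Well played, The Tuckshop (A short stroll from the Dakota MRT).\tDakota\n',
    '"@mrbrown: Well played, The Tuckshop" too good\tDakota\n',
]
tweets, stations = load_tweets(sample)
```

For the real file, passing an open file object (for example the result of open('tweets.csv')) in place of sample should work the same way. Splitting by hand rather than using the csv module with default settings avoids tweets that begin with a double quote being parsed as quoted fields.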

Python file: chi_squared.py

Each bar shows the computed chi-squared value: the higher the value, the more relevant the term is for its category, in this case the station the tweet originated from. In this example we can see that the terms tuckshop and antoncasey have the highest relevance.

[Image: terms]
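As a rough illustration of what a chi-squared score measures, the sketch below computes the statistic for one term and one station from a 2x2 table of counts (term present or absent, tweet inside or outside the class). It is one common formulation and not the contents of chi_squared.py; the function name and the eight sample tweets are invented, and a library may compute the statistic differently (for example from tf-idf weights rather than presence counts), so absolute values will not match the chart.

```python
def chi2_term_class(tweets, stations, term, station):
    """Chi-squared statistic for a 2x2 term-presence vs. class table."""
    a = b = c = d = 0
    for text, st in zip(tweets, stations):
        has_term = term in text.lower().split()
        in_class = st == station
        if has_term and in_class:
            a += 1      # term present, tweet in class
        elif has_term:
            b += 1      # term present, tweet in another class
        elif in_class:
            c += 1      # term absent, tweet in class
        else:
            d += 1      # term absent, tweet in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero denominator means the term or the class never varies.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical data: "tuckshop" clusters at Dakota, "train" is spread evenly.
tweets = ["well played tuckshop", "tuckshop stroll train", "tuckshop lunch",
          "train delay", "train crowded", "train late",
          "shopping today", "long queue"]
stations = ["Dakota", "Dakota", "Dakota", "Dakota",
            "Bugis", "Bugis", "Bugis", "Bugis"]

tuckshop_score = chi2_term_class(tweets, stations, "tuckshop", "Dakota")
train_score = chi2_term_class(tweets, stations, "train", "Dakota")
```

In this toy data the score for "tuckshop" is well above zero, while "train", which appears equally often at both stations, scores exactly zero: it is independent of the class and tells a classifier nothing.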

For each high-scoring word, find the class (station name) it belongs to.

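import re
from collections import Counter, defaultdict

# topchi2[0] is taken to hold the high-scoring words from the chi-squared step.
words = topchi2[0]
# Map each word to a count of the stations whose tweets contain it.
station_words = defaultdict(Counter)
for word in words:
    # Match the word only as a whole token; escape it so it is taken literally.
    pattern = re.compile(r'(^|\W)' + re.escape(word) + r'($|\W)')
    for i, st in enumerate(stations):
        if pattern.search(tweets[i]):
            station_words[word][st] += 1

# Report words found in at least 10 tweets, with their five most common stations.
for word in sorted(station_words):
    if sum(station_words[word].values()) < 10:
        continue
    print('%s: %s' % (word, station_words[word].most_common(5)))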

The output for the highest-rated terms:

animated: [('Outram Park', 12)]
antoncasey: [('Dakota', 16)]
armed: [('Dhoby Ghaut', 29), ('Redhill', 1), ('Somerset', 1), ('Bugis', 1)]
arts: [('Outram Park', 14), ('Tiong Bahru', 2), ('Kent Ridge', 1), ('Bras Basah', 1), ('Mountbatten', 1)]
downtown: [('Chinatown', 20), ('Bugis', 13), ('Downtown', 8), ('Telok Ayer', 4), ('Dhoby Ghaut', 2)]
hong: [('Choa Chu Kang', 6), ('Redhill', 1), ('Nicoll Highway', 1), ('Bedok', 1), ('Queenstown', 1)]
interchange: [('Dhoby Ghaut', 75), ('Bugis', 64), ('Tanah Merah', 44), ('City Hall', 42), ('Chinatown', 40)]
line: [('Downtown', 29), ('Chinatown', 19), ('Bugis', 16), ('Outram Park', 8), ('Telok Ayer', 4)]
lrt: [('Choa Chu Kang', 8), ('Sengkang', 6), ('Punggol', 4), ('Hougang', 1), ('Joo Koon', 1)]
mnc: [('Pasir Panjang', 5), ('Expo', 3), ('Jurong East', 1), ('Tanjong Pagar', 1)]
played: [('Dakota', 14), ('Kent Ridge', 1), ('Bedok', 1), ('Telok Blangah', 1)]
power: [('Ang Mo Kio', 7), ('Mountbatten', 2), ('Outram Park', 2), ('Potong Pasir', 1), ('Orchard', 1)]
short: [('Dakota', 14), ('Jurong East', 3), ('Kranji', 2), ('Ang Mo Kio', 1), ('Orchard', 1)]
stroll: [('Dakota', 14), ('Serangoon', 2), ('Admiralty', 1), ('Woodlands', 1), ('Marsiling', 1)]
sword: [('Dhoby Ghaut', 32)]
tenancy: [('Tanjong Pagar', 4), ('Tanah Merah', 4), ('Braddell', 4)]
tuckshop: [('Dakota', 16)]

Let’s look at individual tweets for the term ‘antoncasey’:

# Print every line of the data file that mentions the term.
for tweet in open('tweets.csv'):
    if 'antoncasey' in tweet.lower():
        print(tweet.strip())

Gives the output:

Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/fyYdvvThSM Dakota
"@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/eCu7a5b0sd" too good    Dakota
Lol! "@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/egyVvqI7S8"    Dakota
Going! @mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/nWLq1bCAqM    Dakota
@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/pFvMdAE9lF   Dakota
Haha good one!  "@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/mL53ssi6NE" Dakota
"like" @mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/Lzb82mfUav    Dakota
"@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/cT2CUR1K9D"  to that "rich" fella...    Dakota
aha! Sucker! @mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/o9on4DmPql  Dakota
HAHAHA WIN "@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/rycjfP4ZfE"  Dakota
@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/8SzvWMqdeO Brilliant!    Dakota
@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/91YVH4cfwq -- bahahahaha! b(&gt;_&lt;)   Dakota
This is absolutely GOLD. Thank you, The Tuckshop (off Dakota MRT), big tips ahead! #AntonCasey http://t.co/UXW5QR17Jx"  Dakota
@mrbrown: Well played, The Tuckshop  (A short stroll from the Dakota MRT). #AntonCasey http://t.co/ykr5r2zQgesmelly ones $1.20? Dakota
"@galrocker: This is absolutely GOLD. Thank you, The Tuckshop (off Dakota MRT), big tips ahead! #AntonCasey http://t.co/oIsEyWFFFA""    Dakota

Here is the image referenced in the tweet and retweets:

[Image: tuckshop]
