Using Chi-Square and TF-IDF Vectorization


We will analyze a collection of tweets gathered from Twitter's public stream that mention SMRT station names. I wrote about collecting this data in a previous blog article, Collecting Tweets using Python. To determine which words are important, we will use the term frequency-inverse document frequency (tf-idf) method. To measure how well a word fits a class (a station name), we will use the chi-squared statistic to weed out words that are independent of the class and therefore irrelevant for classification. This is also a good method for finding spam, because spam tweets can be highly correlated with a class: for example, a business advertising its location near an SMRT station will have tweets that correspond closely to that station name. We can then label those tweets as spam and re-run our classifier to eliminate them.
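To make the two ideas concrete, here is a minimal sketch in plain Python. It is not the code used for the analysis: the toy tweets, the helper names (tokenize, tf_idf, chi_square) and the exact formulas are assumptions chosen for illustration. It uses the textbook tf * log(N / df) weighting and a 2x2 chi-squared test on whether a word is present in a tweet, which may differ from what a given library computes.

```python
import math
from collections import Counter

# Hypothetical toy data: (tweet text, station label). Not from the real dataset.
tweets = [
    ("delay at bishan again train fault", "bishan"),
    ("bishan train delay this morning", "bishan"),
    ("great coffee near orchard", "orchard"),
    ("shopping at orchard today", "orchard"),
]
docs = [text for text, _ in tweets]
labels = [label for _, label in tweets]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def chi_square(documents, classes, term, target):
    """Chi-squared statistic for the 2x2 table: term present/absent vs. class == target."""
    a = b = c = d = 0
    for text, label in zip(documents, classes):
        has_term = term in tokenize(text)
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero denominator means the term or the class never varies; treat as independent.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


weights = tf_idf(docs)
delay_score = chi_square(docs, labels, "delay", "bishan")
at_score = chi_square(docs, labels, "at", "bishan")
```

In this toy data "delay" appears only in the Bishan tweets, so it scores high against that class, while "at" is spread evenly across both stations and scores zero, which is the kind of word the chi-squared filter is meant to drop.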