The Glowing Python: Terms selection with chi-square

Wednesday, February 26, 2014

Terms selection with chi-square

In Natural Language Processing, identifying the most relevant terms in a collection of documents is a common task. It can produce meaningful insights about the data, and it can also help improve classification performance and computational efficiency. A popular measure of relevance for terms is the χ² statistic. To compute it, we turn the terms of our document collection into the features of a vector space model; χ² can then be computed as follows:

χ²(f, t) = N (AD - CB)² / ((A + C)(B + D)(A + B)(C + D))

where f is a feature (a term in this case), t is a target variable that we usually want to predict, A is the number of times that f and t co-occur, B is the number of times that f occurs without t, C is the number of times that t occurs without f, D is the number of times that neither t nor f occurs, and N is the number of observations.
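
The definitions of A, B, C, D and N translate directly into a few lines of Python. The sketch below assumes the usual 2×2 contingency-table form of the statistic, N(AD - CB)² / ((A + C)(B + D)(A + B)(C + D)); the function name chi_square and the counts are made up for illustration:

```python
def chi_square(a, b, c, d):
    """Chi-square between a feature f and a target t from a 2x2 table.

    a: f and t co-occur, b: f without t, c: t without f, d: neither occurs.
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # a row or a column of the table is empty: the statistic is undefined
        return float('nan')
    return n * (a * d - c * b) ** 2 / denominator


# hypothetical counts: 100 documents, 40 contain the term, 50 belong to the class
score = chi_square(30, 10, 20, 40)
# a term spread evenly over both classes carries no information
independent = chi_square(25, 25, 25, 25)
print(round(score, 2), independent)
```

A high value means that the term and the target occur together more (or less) often than independence would predict, while a value of zero means the term says nothing about the target. The undefined case shows up, for instance, when every observation belongs to the same class.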

Let's see how χ² can be used through a simple example. We load some posts from four different newsgroup categories using the sklearn interface:

from sklearn.datasets import fetch_20newsgroups
# newsgroup categories
categories = ['alt.atheism','talk.religion.misc',
              'comp.graphics','sci.space']

posts = fetch_20newsgroups(subset='train', categories=categories,
                           shuffle=True, random_state=42,
                           remove=('headers','footers','quotes'))

From the loaded posts, we build a bag-of-words model using all the terms in the document collection except the stop words:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(posts.data)
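
To make the structure of the resulting matrix concrete, here is a hand-rolled sketch of the same idea on a toy corpus. The three sentences are invented, and the tokenization is a plain whitespace split, so it only approximates what CountVectorizer does (for example, it does not drop stop words such as "the"):

```python
from collections import Counter

# hypothetical toy corpus, not taken from the newsgroups data
docs = ["the rocket launch was delayed",
        "nasa confirmed the launch",
        "render the image with the graphics card"]

# vocabulary: every distinct term, sorted so that the column order is stable
vocabulary = sorted({term for doc in docs for term in doc.split()})

# matrix[i][j] = number of times term j appears in document i
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

# column of the term "the": one count per document
j = vocabulary.index("the")
the_column = [row[j] for row in matrix]
print(the_column)
```

Each row describes one document and each column one term, which is the layout that the rest of the post relies on.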

Now, X is a document-term matrix where the element X_i,j is the frequency of term j in document i. The features are given by the columns of X, and we want to compute χ² between the categories of interest and each feature in order to find the most relevant terms. This can be done as follows:

from sklearn.feature_selection import chi2
# compute chi2 for each feature
chi2score = chi2(X,posts.target)[0]
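
The call above returns a pair of arrays, the χ² scores and the corresponding p-values, which is why only element [0] is kept. As far as I understand it, scikit-learn works directly on term counts: for each class it compares the observed total count of a term with the count expected if the term were independent of the class, and sums (observed - expected)² / expected over the classes. The following pure-Python sketch mimics that computation under this assumption; chi2_counts and the toy data are hypothetical names and values, not part of the library:

```python
def chi2_counts(X, y):
    """Chi-square score for each column of X (term counts) against labels y."""
    classes = sorted(set(y))
    n_docs, n_terms = len(X), len(X[0])
    scores = []
    for j in range(n_terms):
        total = sum(row[j] for row in X)   # occurrences of term j overall
        if total == 0:
            # a term that never occurs has no defined score
            scores.append(float('nan'))
            continue
        score = 0.0
        for c in classes:
            class_size = sum(1 for label in y if label == c)
            # occurrences of term j inside the documents of class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # occurrences expected if term j were independent of the class
            expected = total * class_size / n_docs
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# toy document-term matrix: 4 documents, 3 terms, 2 classes
toy_X = [[2, 0, 1],
         [3, 0, 1],
         [0, 2, 1],
         [0, 3, 1]]
toy_y = [0, 0, 1, 1]
toy_scores = chi2_counts(toy_X, toy_y)
print(toy_scores)
```

The first two terms appear in one class only and get the same high score, while the third term is spread evenly over both classes and scores zero.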

To have a visual insight, we can plot a bar chart where each bar shows the χ² value computed above:

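from matplotlib.pyplot import barh, plot, yticks, show, xlabel, figure
figure(figsize=(6,6))
# pair each term with its chi2 score
# (get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names)
wscores = zip(vectorizer.get_feature_names_out(), chi2score)
wchi2 = sorted(wscores, key=lambda x: x[1])
# keep the 25 terms with the highest score; list() is needed in Python 3,
# where zip returns an iterator that cannot be indexed
topchi2 = list(zip(*wchi2[-25:]))
x = range(len(topchi2[1]))
labels = topchi2[0]
barh(x, topchi2[1], align='center', alpha=.2, color='g')
plot(topchi2[1], x, '-o', markersize=2, alpha=.8, color='g')
yticks(x, labels)
xlabel(r'$\chi^2$')
show()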

We can observe that the terms with a high χ² can be considered relevant for the newsgroup categories we are analyzing. For example, the terms space, nasa and launch can be considered relevant for the group sci.space. The terms god, jesus and atheism can be considered relevant for the groups alt.atheism and talk.religion.misc. Finally, the terms image, graphics and jpeg can be considered relevant for the category comp.graphics.
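
The χ² score itself does not say which category a term is relevant for; the associations above come from reading the chart. One simple heuristic, which is an assumption of this sketch and not something scikit-learn provides, is to assign a term to the class where its observed count exceeds the expected count by the largest margin. The function name and the toy data below are hypothetical:

```python
def most_associated_class(X, y, j):
    """Class in which term j occurs most in excess of its expected count."""
    n_docs = len(X)
    total = sum(row[j] for row in X)   # occurrences of term j overall
    best_class, best_excess = None, float('-inf')
    for c in sorted(set(y)):
        class_size = sum(1 for label in y if label == c)
        observed = sum(row[j] for row, label in zip(X, y) if label == c)
        # count expected if term j were independent of the class
        expected = total * class_size / n_docs
        if observed - expected > best_excess:
            best_class, best_excess = c, observed - expected
    return best_class


# toy data: columns are the terms below, rows are documents
terms = ['launch', 'image']
toy_X = [[2, 0],
         [3, 0],
         [0, 2],
         [0, 3]]
toy_y = ['sci.space', 'sci.space', 'comp.graphics', 'comp.graphics']

for j, term in enumerate(terms):
    print(term, '->', most_associated_class(toy_X, toy_y, j))
```

On real data the same idea can be applied only to the top-scoring terms, so that each bar of the chart can be labelled with a category.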

8 comments:

priyanka, December 2, 2015 at 11:46 AM
Hello, can you please tell me how to compute information gain for text mining? I'm using information gain as a feature selector. Please tell me how to use it.
Anonymous, January 13, 2016 at 7:43 AM
When I have got the tokens with high scores, how can I know which class they belong to? I also want to know whether I can compute the chi-square of a single token with scikit-learn.
Unknown, January 13, 2016 at 3:21 PM
If I have got the terms that are relevant in the category, how can I use them to improve my prediction accuracy? I mean, how can I use the terms to convert my text to vectors?
Anonymous, March 9, 2016 at 7:14 AM
Hi, basically you can choose the top K terms to build the vector for each document instead of using all the terms.
Farhan Ramadhani, November 6, 2017 at 12:19 PM
I got NaN values for chi2score when I only have 1 category/class in the predictor. Any help?
Anonymous, August 16, 2019 at 2:08 PM
If I understand correctly, what you are doing is computing the occurrences of each word (term). Next, you calculate the chi-square of each term?
Chi-square is the sum of (expected - observed)^2 / expected over all terms.
How do you plot this table?
Based on what values?
Can you explain?
Can you give more details about the procedure?
