Tuesday, July 23, 2013

Combining Scikit-Learn and NTLK

In Chapter 6 of the book Natural Language Processing with Python there is a nice example where is showed how to train and test a Naive Bayes classifier that can identify the dialogue act types of instant messages. Th classifier is trained on the NPS Chat Corpus which consists of over 10,000 posts from instant messaging sessions labeled with one of 15 dialogue act types.
The implementation of the Naive Bayes classifier used in the book is the one provided in the NTLK library. Here we will see how to use use the Support Vector Machine (SVM) classifier implemented in Scikit-Learn without touching the features representation of the original example.
Here is the snippet to extract the features (equivalent to the one in the book):
import nltk

def dialogue_act_features(sentence):
        Extracts a set of features from a message.
    features = {}
    tokens = nltk.word_tokenize(sentence)
    for t in tokens:
        features['contains(%s)' % t.lower()] = True    
    return features

# data structure representing the XML annotation for each post
posts = nltk.corpus.nps_chat.xml_posts() 
# label set
cls_set = ['Emotion', 'ynQuestion', 'yAnswer', 'Continuer',
'whQuestion', 'System', 'Accept', 'Clarify', 'Emphasis', 
'nAnswer', 'Greet', 'Statement', 'Reject', 'Bye', 'Other']
featuresets = [] # list of tuples of the form (post, features)
for post in posts: # applying the feature extractor to each post
 # post.get('class') is the label of the current post
After the feature extraction we can split the data we obtained in training and testing set:
from random import shuffle
size = int(len(featuresets) * .1) # 10% is used for the test set
train = featuresets[size:]
test = featuresets[:size]
Now we can instantiate the model that implements classifier using the scikitlearn interface provided by NLTK and train it:
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
# SVM with a Linear Kernel and default parameters 
classif = SklearnClassifier(LinearSVC())
In order to use the batch_classify method provided by scikitlearn we have to organize the test set in two lists, the first one with the train data and the second one with the target labels:
test_skl = []
t_test_skl = []
for d in test:
Then we can run the classifier on the test set and print a full report of its performances:
# run the classifier on the train test
p = classif.batch_classify(test_skl)
from sklearn.metrics import classification_report
# getting a full report
print classification_report(t_test_skl, p, labels=list(set(t_test_skl)),target_names=cls_set)
The report will look like this:
              precision    recall  f1-score   support

    Emotion       0.83      0.85      0.84       101
 ynQuestion       0.78      0.78      0.78        58
    yAnswer       0.40      0.40      0.40         5
  Continuer       0.33      0.15      0.21        13
 whQuestion       0.78      0.72      0.75        50
     System       0.99      0.98      0.98       259
     Accept       0.80      0.59      0.68        27
    Clarify       0.00      0.00      0.00         6
   Emphasis       0.59      0.59      0.59        17
    nAnswer       0.73      0.80      0.76        10
      Greet       0.94      0.91      0.93       160
  Statement       0.76      0.86      0.81       311
     Reject       0.57      0.31      0.40        13
        Bye       0.94      0.68      0.79        25
      Other       0.00      0.00      0.00         1

avg / total       0.84      0.85      0.84      1056

Friday, July 12, 2013

Hopalong fractals

Have you ever wondered what happen if you pick a point (x0,y0) and compute hundreds of point using these equations?

Well, you get a hopalong fractal.
Let's plot this fractal using Pylab. The following function computes n points using the equations above:
from __future__ import division
from numpy import sqrt,power

def hopalong(x0,y0,n,a=-55,b=-1,c=-42):
 def update(x,y):
  x1 = y-x/abs(x)*sqrt(abs(b*x+c))
  y1 = a-x
  return x1,y1
 xx = []
 yy = []
 for _ in xrange(n):
  x0,y0 = update(x0,y0) 
 return xx,yy
and this snippet computes 40000 points starting from (-1,10):
from pylab import scatter,show, cm, axis
from numpy import array,mean
x = -1
y = 10
n = 40000
xx,yy = hopalong(x,y,n)
cr = sqrt(power(array(xx)-mean(xx),2)+power(array(yy)-mean(yy),2))
scatter(xx, yy, marker='.', c=cr/max(cr), 
        edgecolor='w', cmap=cm.Dark2, s=50)
Here we have one of the possible hopalong fractals:

Varying the starting point and the values of a, b and c we have different fractals. Here are some of them: