The implementation of the Naive Bayes classifier used in the book is the one provided in the NLTK library. Here we will see how to use the Support Vector Machine (SVM) classifier implemented in scikit-learn without touching the feature representation of the original example.
Here is the snippet to extract the features (equivalent to the one in the book):
    import nltk

    def dialogue_act_features(sentence):
        """
        Extracts a set of features from a message.
        """
        features = {}
        tokens = nltk.word_tokenize(sentence)
        for t in tokens:
            features['contains(%s)' % t.lower()] = True
        return features

    # data structure representing the XML annotation for each post
    posts = nltk.corpus.nps_chat.xml_posts()
    # label set
    cls_set = ['Emotion', 'ynQuestion', 'yAnswer', 'Continuer',
               'whQuestion', 'System', 'Accept', 'Clarify', 'Emphasis',
               'nAnswer', 'Greet', 'Statement', 'Reject', 'Bye', 'Other']
    featuresets = [] # list of tuples of the form (features, label index)
    for post in posts:
        # applying the feature extractor to each post
        # post.get('class') is the label of the current post
        featuresets.append((dialogue_act_features(post.text),
                            cls_set.index(post.get('class'))))

After the feature extraction we can split the data into a training set and a test set:
    from random import shuffle

    shuffle(featuresets)
    size = int(len(featuresets) * .1) # 10% is used for the test set
    train = featuresets[size:]
    test = featuresets[:size]

Now we can instantiate the classifier through the scikit-learn interface provided by NLTK and train it:
    from sklearn.svm import LinearSVC
    from nltk.classify.scikitlearn import SklearnClassifier

    # SVM with a Linear Kernel and default parameters
    classif = SklearnClassifier(LinearSVC())
    classif.train(train)

In order to use the batch_classify method provided by the NLTK wrapper we have to organize the test set in two lists, the first one with the test features and the second one with the target labels:
    test_skl = []
    t_test_skl = []
    for d in test:
        test_skl.append(d[0])
        t_test_skl.append(d[1])

Then we can run the classifier on the test set and print a full report of its performance:
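    # run the classifier on the test set
    # (NLTK 3 renamed batch_classify to classify_many)
    p = classif.batch_classify(test_skl)

    from sklearn.metrics import classification_report

    # getting a full report; one label per entry of cls_set, so that
    # labels and target_names always line up, even when a rare class
    # is missing from the test set
    print(classification_report(t_test_skl, p,
                                labels=list(range(len(cls_set))),
                                target_names=cls_set))

The report will look like this: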
                precision    recall  f1-score   support

        Emotion      0.83      0.85      0.84       101
     ynQuestion      0.78      0.78      0.78        58
        yAnswer      0.40      0.40      0.40         5
      Continuer      0.33      0.15      0.21        13
     whQuestion      0.78      0.72      0.75        50
         System      0.99      0.98      0.98       259
         Accept      0.80      0.59      0.68        27
        Clarify      0.00      0.00      0.00         6
       Emphasis      0.59      0.59      0.59        17
        nAnswer      0.73      0.80      0.76        10
          Greet      0.94      0.91      0.93       160
      Statement      0.76      0.86      0.81       311
         Reject      0.57      0.31      0.40        13
            Bye      0.94      0.68      0.79        25
          Other      0.00      0.00      0.00         1

    avg / total      0.84      0.85      0.84      1056
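The last row of the report can be checked by hand. As a minimal sketch, assuming that "avg / total" is the support-weighted mean of the per-class scores (which is how the scikit-learn versions that print this row compute it), the following recomputes it from the rounded values in the report, so small rounding differences are possible. The variable and function names are illustrative only.

```python
# (label, precision, recall, f1-score, support) copied from the report
rows = [
    ("Emotion",    0.83, 0.85, 0.84, 101),
    ("ynQuestion", 0.78, 0.78, 0.78, 58),
    ("yAnswer",    0.40, 0.40, 0.40, 5),
    ("Continuer",  0.33, 0.15, 0.21, 13),
    ("whQuestion", 0.78, 0.72, 0.75, 50),
    ("System",     0.99, 0.98, 0.98, 259),
    ("Accept",     0.80, 0.59, 0.68, 27),
    ("Clarify",    0.00, 0.00, 0.00, 6),
    ("Emphasis",   0.59, 0.59, 0.59, 17),
    ("nAnswer",    0.73, 0.80, 0.76, 10),
    ("Greet",      0.94, 0.91, 0.93, 160),
    ("Statement",  0.76, 0.86, 0.81, 311),
    ("Reject",     0.57, 0.31, 0.40, 13),
    ("Bye",        0.94, 0.68, 0.79, 25),
    ("Other",      0.00, 0.00, 0.00, 1),
]

# total number of test posts (the "support" of the last row)
total = sum(r[4] for r in rows)

def weighted_mean(column):
    """Support-weighted mean of one score column (1=precision, 2=recall, 3=f1)."""
    return sum(r[column] * r[4] for r in rows) / float(total)

avg_precision = weighted_mean(1)
avg_recall = weighted_mean(2)
avg_f1 = weighted_mean(3)

print("avg / total  %.2f  %.2f  %.2f  %d" % (avg_precision, avg_recall, avg_f1, total))
```

Because the average is weighted by support, the large classes (System, Statement, Greet) dominate it, and the zero scores of rare classes such as Clarify and Other barely move the total.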
The link to the NLTK book is broken.
Also, you can use the train_test_split function to do the random splitting into train/test data in one line. scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html#sklearn.cross_validation.train_test_split
thanks rolisz!
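The suggestion above could look roughly like this. It is a sketch that assumes a recent scikit-learn, where train_test_split lives in sklearn.model_selection (the sklearn.cross_validation module in the linked URL was removed in later releases). The toy featuresets list is made up for illustration and stands in for the one built in the post.

```python
from sklearn.model_selection import train_test_split

# toy stand-in for the (features, label) tuples built in the post
featuresets = [({'contains(word%d)' % i: True}, i % 3) for i in range(20)]

# shuffle and hold out 10% of the data for testing in a single call;
# random_state is an arbitrary seed that makes the split repeatable
train, test = train_test_split(featuresets, test_size=0.1, random_state=0)

print(len(train), len(test))
```

This replaces the manual shuffle and slicing, and fixing random_state makes the reported scores reproducible between runs.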
Thank you very much for you example. It was very helpful for getting me started with my experiments.
You left a minor error, however: you should witch the order of 'p' and 't_test_skl' when asking for the classification report. The API lists the true labels first and then the predicted labels second:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
Oh dear, I have been typing for too long today... Should have been:
"for *your example"
"you should *switch"
I hope I caught all of my errors..
Thank you Ruben, I fixed the code and the report.
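To see why the argument order matters, here is a small self-contained sketch in plain Python. It is not scikit-learn's implementation; the helpers precision and recall and the toy labels are made up for illustration. Precision for a class is computed over the posts predicted as that class, recall over the posts that truly belong to it, so swapping the true and predicted labels swaps the two columns of the report.

```python
def precision(y_true, y_pred, label):
    """Fraction of the items predicted as `label` that really are `label`."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    return sum(1 for t in predicted if t == label) / float(len(predicted))

def recall(y_true, y_pred, label):
    """Fraction of the items that really are `label` that were predicted as such."""
    actual = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(1 for p in actual if p == label) / float(len(actual))

# made-up labels: three greetings and two goodbyes
y_true = ['Greet', 'Greet', 'Greet', 'Bye', 'Bye']
y_pred = ['Greet', 'Bye', 'Bye', 'Bye', 'Bye']

# correct order: true labels first, predictions second
p_ok = precision(y_true, y_pred, 'Bye')
r_ok = recall(y_true, y_pred, 'Bye')

# swapped order: the two measures trade places
p_swapped = precision(y_pred, y_true, 'Bye')
r_swapped = recall(y_pred, y_true, 'Bye')

print(p_ok, r_ok, p_swapped, r_swapped)
```

The f1-score is symmetric in precision and recall, which is why only the first two columns change when the arguments are swapped.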
How would you do this with a Random Forest classifier?
Initializing the classifier this way should work:
classif = SklearnClassifier(RandomForestClassifier())
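A self-contained sketch of that suggestion, assuming both NLTK and scikit-learn are installed. The import from sklearn.ensemble is needed for the line above to run. The tiny training set and the n_estimators and random_state values are made up for illustration, and classify_many is the NLTK 3 name of batch_classify.

```python
from sklearn.ensemble import RandomForestClassifier
from nltk.classify.scikitlearn import SklearnClassifier

# made-up (features, label) pairs in the same format used in the post
train = [
    ({'contains(hi)': True}, 'Greet'),
    ({'contains(hello)': True}, 'Greet'),
    ({'contains(bye)': True}, 'Bye'),
    ({'contains(goodbye)': True}, 'Bye'),
]

# n_estimators and random_state are arbitrary choices for repeatability
classif = SklearnClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
classif.train(train)

# classify two unseen feature dictionaries
predictions = classif.classify_many([{'contains(hi)': True},
                                     {'contains(bye)': True}])
print(predictions)
```

Nothing else in the pipeline has to change, since the NLTK wrapper takes care of converting the feature dictionaries for whichever scikit-learn estimator it is given.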
How do I output the probability of the predicted instead of the classes?
Hi Hock, you can't get the probability with LinearSVC. But there are other classifiers, the ones in sklearn.naive_bayes or sklearn.svm.SVC (created with probability=True) for example, that expose the method predict_proba that gives you what you need.
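As a rough illustration of that answer, the sketch below uses scikit-learn directly rather than the NLTK wrapper. The toy data is made up, and DictVectorizer is used here only to turn the feature dictionaries into a numeric matrix; MultinomialNB is just one of the classifiers in sklearn.naive_bayes.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# made-up feature dictionaries in the same format used in the post
features = [
    {'contains(hi)': True},
    {'contains(hello)': True},
    {'contains(bye)': True},
    {'contains(goodbye)': True},
]
labels = ['Greet', 'Greet', 'Bye', 'Bye']

# turn the dictionaries into a numeric matrix
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(features)

model = MultinomialNB()
model.fit(X, labels)

# one row per input, one column per class (ordered as in model.classes_)
probabilities = model.predict_proba(vectorizer.transform([{'contains(hi)': True}]))
by_class = dict(zip(model.classes_, probabilities[0]))
print(by_class)

# LinearSVC does not offer this method
has_proba = hasattr(LinearSVC(), 'predict_proba')
print(has_proba)
```

Each row of the result sums to one, so the values can be read as the model's confidence in each dialogue act for that post.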
Hi, thanks for posting your work online. I want to use your code to detect outliers in time-series data, so it is a 1D clustering problem. Please help me use your code for such a one-dimensional problem.
ReplyDeleteThanks