I use IPython almost every day, so I am very happy to review Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant, published by Packt Publishing.
The book introduces the IPython basics and then focuses on how to combine IPython with some of the most useful libraries for data analysis, such as NumPy, Matplotlib, Basemap and Pandas. Every topic is covered with examples, and the code presented is also available online. The references proposed are always up to date and give the reader the opportunity to discover resources not covered in the book.
Favorite chapter
Chapter 5 is a little gem. Here, you can find an introduction to writing high performance code with IPython through Cython and the parallel programming facilities of IPython. The attention the author pays to writing efficient code is remarkable.
Conclusions
This book definitely achieves its goal of providing a technical introduction to IPython. It is intended for Python users who want an easy-to-follow introduction to IPython, but experienced users will also find it useful. It is worth noting that, at the moment, this is the only book about IPython.
Thursday, November 7, 2013
Sunday, September 15, 2013
Self Organizing Maps
Notice: For an updated tutorial on how to use MiniSom, refer to the examples in the official documentation.

Self Organizing Maps (SOM), also known as Kohonen maps, are a type of Artificial Neural Network able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. In a SOM the neurons are organized in a two-dimensional lattice and each neuron is fully connected to all the source nodes in the input layer. An illustration of the SOM by Haykin (1999) is the following:
Each neuron n has an associated vector of weights w_n. The process for training a SOM involves stepping through several training iterations until the items in your dataset are learnt by the SOM. For each pattern x one neuron n will "win" (which means that w_n is the weight vector most similar to x) and this winning neuron will have its weights adjusted so that it will have a stronger response to the input the next time it sees it (which means that the distance between x and w_n will be smaller). As different neurons win for different patterns, their ability to recognize those particular patterns will increase. The training algorithm can be summarized as follows:
1. Initialize the weights of each neuron.
2. Initialize t = 0.
3. Randomly pick an input x from the dataset.
4. Determine the winning neuron i as the neuron whose weight vector w_i is the closest to x.
5. Adapt the weights of each neuron n so that w_n moves toward x, with the largest change for the winner and its neighbors on the lattice.
6. Increment t by 1.
7. If t < t_max, go to step 3.
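The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration and not MiniSom's actual code: the function name train_som_sketch, the random initialization, the decay schedule and the Gaussian neighborhood are all assumptions of mine. The update used here is the common form w_n(t+1) = w_n(t) + eta(t) h(n, i, t) (x - w_n(t)), where eta is the learning rate and h is the neighborhood function centered on the winner i.

```python
import numpy as np

def train_som_sketch(data, rows, cols, t_max, sigma0=1.0, eta0=0.5, seed=0):
    """Toy SOM training loop following the steps above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the weights of each neuron (assumed: uniform random values).
    weights = rng.random((rows, cols, data.shape[1]))
    # Lattice coordinates of every neuron, used by the neighborhood function.
    gx, gy = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Steps 2, 6 and 7: t runs from 0 to t_max - 1.
    for t in range(t_max):
        # Step 3: randomly pick an input x from the dataset.
        x = data[rng.integers(len(data))]
        # Step 4: the winner i is the neuron whose weights are closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        i = np.unravel_index(np.argmin(dists), dists.shape)
        # Assumed schedule: learning rate and spread both shrink as t grows.
        decay = 1.0 / (1.0 + t / (t_max / 2.0))
        eta, sigma = eta0 * decay, sigma0 * decay
        # Assumed Gaussian neighborhood centered on the winner.
        h = np.exp(-((gx - i[0]) ** 2 + (gy - i[1]) ** 2) / (2.0 * sigma ** 2))
        # Step 5: move every neuron toward x, scaled by eta and h.
        weights += eta * h[:, :, np.newaxis] * (x - weights)
    return weights

# Usage on random data with the same shape as the Iris features.
data = np.random.default_rng(1).random((150, 4))
weights = train_som_sketch(data, rows=7, cols=7, t_max=100)
print(weights.shape)  # (7, 7, 4)
```

Because each update is a weighted average of the old weights and the input, the weights always stay within the range spanned by the initial values and the data.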
MiniSom is a minimalistic, NumPy-based implementation of the SOM. I wrote it during the experiments for my thesis in order to have a fully hackable SOM algorithm, and I recently decided to release it on GitHub. The next part of this post shows how to train MiniSom on the Iris dataset and how to visualize the result. The first step is to import and normalize the data:
    from numpy import genfromtxt,array,linalg,zeros,apply_along_axis

    # reading the iris dataset in the csv format
    # (downloaded from http://aima.cs.berkeley.edu/data/iris.csv)
    data = genfromtxt('iris.csv', delimiter=',',usecols=(0,1,2,3))
    # normalization to unity of each pattern in the data
    data = apply_along_axis(lambda x: x/linalg.norm(x),1,data)

The snippet above reads the dataset from a CSV file and creates a matrix where each row corresponds to a pattern. In this case, each pattern has 4 dimensions. (Note that only the first 4 columns of the file are used because the fifth column contains the labels.) The training process can be started as follows:
    from minisom import MiniSom

    ### Initialization and training ###
    som = MiniSom(7,7,4,sigma=1.0,learning_rate=0.5)
    som.random_weights_init(data)
    print("Training...")
    som.train_random(data,100) # training with 100 iterations
    print("\n...ready!")

Now we have a 7-by-7 SOM trained on our dataset. MiniSom uses a Gaussian as the neighborhood function and its initial spread is specified with the parameter sigma, while the parameter learning_rate specifies the initial learning rate. The training algorithm implemented decreases both parameters as training progresses. This allows rapid initial training of the neural network, which is then "fine tuned" as training progresses.
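To get a feeling for what decreasing sigma and learning_rate does, here is a small sketch that prints both values at a few iterations, together with the number of neurons on a 7-by-7 map that still receive a noticeable update. The helper names and the decay formula are assumptions made for illustration, and the schedule MiniSom actually uses may differ.

```python
import numpy as np

def decayed(value, t, t_max):
    """Assumed decay schedule: the value is halved when t reaches t_max / 2."""
    return value / (1.0 + t / (t_max / 2.0))

def gaussian_neighborhood(winner, shape, sigma):
    """Gaussian centered on the winner, evaluated on every neuron of the map."""
    gx, gy = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    d2 = (gx - winner[0]) ** 2 + (gy - winner[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Same initial values as in the training snippet: sigma=1.0, learning_rate=0.5.
for t in (0, 50, 99):
    sigma = decayed(1.0, t, 100)
    eta = decayed(0.5, t, 100)
    h = gaussian_neighborhood((3, 3), (7, 7), sigma)
    # Count the neurons whose update is scaled by more than 0.1.
    print(t, round(sigma, 3), round(eta, 3), int((h > 0.1).sum()))
```

At the beginning the winner drags a wide patch of the map along with it. Later on, only the winner and its closest neighbors move, and they move by smaller amounts.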
To visualize the result of the training we can plot the average distance map of the weights on the map and the coordinates of the associated winning neuron for each pattern:
    from pylab import plot,axis,show,pcolor,colorbar,bone

    bone()
    pcolor(som.distance_map().T) # distance map as background
    colorbar()

    # loading the labels
    target = genfromtxt('iris.csv', delimiter=',',usecols=(4),dtype=str)
    t = zeros(len(target),dtype=int)
    t[target == 'setosa'] = 0
    t[target == 'versicolor'] = 1
    t[target == 'virginica'] = 2

    # use different colors and markers for each label
    markers = ['o','s','D']
    colors = ['r','g','b']
    for cnt,xx in enumerate(data):
        w = som.winner(xx) # getting the winner
        # place a marker on the winning position for the sample xx
        plot(w[0]+.5,w[1]+.5,markers[t[cnt]],markerfacecolor='None',
             markeredgecolor=colors[t[cnt]],markersize=12,markeredgewidth=2)
    axis([0,som.weights.shape[0],0,som.weights.shape[1]])
    show() # show the figure

The result should look like the following:
For each pattern in the dataset the corresponding winning neuron has been marked. Each type of marker represents a class of the iris data (the classes are setosa, versicolor and virginica, and they are represented with red, green and blue respectively). The average distance map of the weights is used as background (the values are shown in the colorbar on the right). As expected from previous studies on this dataset, the patterns are grouped according to the class they belong to, and a small fraction of Iris virginica is mixed with Iris versicolor.
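The idea behind the distance map can be sketched as follows: for each neuron, take the average distance between its weights and the weights of the adjacent neurons, then scale the result so that the largest value is 1. This is my own simplified version written for illustration (distance_map_sketch is a hypothetical name), and the details of MiniSom's distance_map may differ.

```python
import numpy as np

def distance_map_sketch(weights):
    """Average distance of each neuron's weights from those of its lattice neighbors."""
    rows, cols, _ = weights.shape
    um = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            # Visit the (up to 8) neurons adjacent to position (r, c).
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= rr < rows and 0 <= cc < cols:
                        dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            um[r, c] = np.mean(dists) if dists else 0.0
    # Scale to [0, 1]; a completely flat map is returned unchanged.
    peak = um.max()
    return um / peak if peak > 0 else um

# A 3-by-3 map where only the central neuron differs from the others.
weights = np.zeros((3, 3, 2))
weights[1, 1] = [3.0, 4.0]
print(distance_map_sketch(weights))
```

High values mark neurons that are far from their neighbors, which is why the borders between groups of similar patterns show up as lighter regions in the background of the plot.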
For a more detailed explanation of the SOM algorithm you can look at its inventor's paper.
Tuesday, July 23, 2013
Combining Scikit-Learn and NLTK
In Chapter 6 of the book Natural Language Processing with Python there is a nice example that shows how to train and test a Naive Bayes classifier that can identify the dialogue act types of instant messages. The classifier is trained on the NPS Chat Corpus, which consists of over 10,000 posts from instant messaging sessions labeled with one of 15 dialogue act types.
The implementation of the Naive Bayes classifier used in the book is the one provided in the NLTK library. Here we will see how to use the Support Vector Machine (SVM) classifier implemented in Scikit-Learn without touching the feature representation of the original example.
Here is the snippet to extract the features (equivalent to the one in the book):
    import nltk

    def dialogue_act_features(sentence):
        """
        Extracts a set of features from a message.
        """
        features = {}
        tokens = nltk.word_tokenize(sentence)
        for t in tokens:
            features['contains(%s)' % t.lower()] = True
        return features

    # data structure representing the XML annotation for each post
    posts = nltk.corpus.nps_chat.xml_posts()
    # label set
    cls_set = ['Emotion', 'ynQuestion', 'yAnswer', 'Continuer',
               'whQuestion', 'System', 'Accept', 'Clarify', 'Emphasis',
               'nAnswer', 'Greet', 'Statement', 'Reject', 'Bye', 'Other']
    featuresets = [] # list of tuples of the form (features, label index)
    for post in posts:
        # applying the feature extractor to each post
        # post.get('class') is the label of the current post
        featuresets.append((dialogue_act_features(post.text),cls_set.index(post.get('class'))))

After the feature extraction we can split the data we obtained into a training set and a test set:
    from random import shuffle

    shuffle(featuresets)
    size = int(len(featuresets) * .1) # 10% is used for the test set
    train = featuresets[size:]
    test = featuresets[:size]

Now we can instantiate the model that implements the classifier, using the scikitlearn interface provided by NLTK, and train it:
    from sklearn.svm import LinearSVC
    from nltk.classify.scikitlearn import SklearnClassifier

    # SVM with a Linear Kernel and default parameters
    classif = SklearnClassifier(LinearSVC())
    classif.train(train)

In order to use the batch_classify method provided by the scikitlearn interface we have to organize the test set in two lists, the first one with the test data and the second one with the target labels:
    test_skl = []
    t_test_skl = []
    for d in test:
        test_skl.append(d[0])
        t_test_skl.append(d[1])

Then we can run the classifier on the test set and print a full report of its performance:
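    # run the classifier on the test set
    # (in NLTK 3 and later this method is named classify_many)
    p = classif.batch_classify(test_skl)

    from sklearn.metrics import classification_report
    # getting a full report; the labels are listed explicitly so that they
    # always line up with the names in cls_set
    print(classification_report(t_test_skl, p, labels=list(range(len(cls_set))),
                                target_names=cls_set))

The report will look like this: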
                 precision    recall  f1-score   support

        Emotion       0.83      0.85      0.84       101
     ynQuestion       0.78      0.78      0.78        58
        yAnswer       0.40      0.40      0.40         5
      Continuer       0.33      0.15      0.21        13
     whQuestion       0.78      0.72      0.75        50
         System       0.99      0.98      0.98       259
         Accept       0.80      0.59      0.68        27
        Clarify       0.00      0.00      0.00         6
       Emphasis       0.59      0.59      0.59        17
        nAnswer       0.73      0.80      0.76        10
          Greet       0.94      0.91      0.93       160
      Statement       0.76      0.86      0.81       311
         Reject       0.57      0.31      0.40        13
            Bye       0.94      0.68      0.79        25
          Other       0.00      0.00      0.00         1

    avg / total       0.84      0.85      0.84      1056
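To make the columns of the report easier to read, here is a small sketch in plain Python 3 that computes the same four quantities for a single class. The function name per_class_report and the toy labels are made up for illustration; the real report is produced by scikit-learn.

```python
def per_class_report(y_true, y_pred, label):
    """Precision, recall, f1-score and support for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted
    fp = sum(1 for t, p in pairs if t != label and p == label)  # predicted by mistake
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Toy example with three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
for label in (0, 1, 2):
    print(label, per_class_report(y_true, y_pred, label))
```

Read this way, a row like Clarify (all zeros with a support of 6) means that none of the 6 Clarify posts in the test set was recognized, while System is recognized almost perfectly.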