Friday, May 23, 2014

Code parallelization with joblib

Recently I've been working on the parallelization of some Python code and I discovered Joblib. It is a library that supports pipelining and offers a good support for parallelization. In this post we will implement a (very naive) paraller matrix by matrix multiplication algorithm to show the parallelization capabilities of this library.
from joblib import Parallel, delayed

def parallel_dot(A,B,n_jobs=2):
"""
Computes A x B using more CPUs.
This works only when the number
of rows of A and the n_jobs are even.
"""
parallelizer = Parallel(n_jobs=n_jobs)
# this iterator returns the functions to execute for each task
for A_block in np.split(A,n_jobs) )
# merging the output of the jobs
return np.vstack(result)
This function spreads the computation across more precesses. The strategy applied to distribute the data is very simple. Each process has the full matrix B and a contiguous block of rows of A, so it can compute a block of rows A*B. In the end, the result of each process is stacked to build final matrix.

Let's compare the parallel version of the algorithm with the sequential one:
A = np.random.randint(0,high=10,size=(1000,1000))
B = np.random.randint(0,high=10,size=(1000,1000))
%time _ = np.dot(A,B)
CPU times: user 13.2 s, sys: 36 ms, total: 13.2 s
Wall time: 13.4 s
%time _ = parallel_dot(A,B,n_jobs=2)
CPU times: user 92 ms, sys: 76 ms, total: 168 ms
Wall time: 8.49 s
Wow, we had a speedup of 1.6X, not bad for a so naive algorithm. It's important to notice that the arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process. Which means that the last time that parallel_dot have been called, the matrix B have been entirely replicated two times in memory. To avoid this problem, we can dump the matrices on the filesystem and pass a reference to the worker to open them as memory map.
import tempfile
import os

# saving A and B to a local file for memmapping
temp_folder = tempfile.mkdtemp()
filenameA = os.path.join(temp_folder, 'A.mmap')
dump(A, filenameA)
filenameB = os.path.join(temp_folder, 'B.mmap')
dump(A, filenameB)
Now, when parallel_dot(A_memmap,B_memmap,n_jobs=2) is called, both the processes created will use only a reference to the matrix B..

Tuesday, April 22, 2014

Parameters selection with Cross-Validation

Most of the pattern recognition techniques have one or more free parameters and choose them for a given classification problem is often not a trivial task. In real applications we only have access to a finite set of examples, usually smaller than we wanted, and we need to test our model on samples not seen during the training process. A model that would just classify the samples that it has seen would have a very good score, but would definitely fail to predict unseen data. This situation is called overfitting and to avoid it we need to apply an appropriate validation procedure to select the parameters. A tool that can help us solve this problem is the Cross-Validation (CV). The idea behind CV is simple: the data are split into train and test sets several consecutive times and the averaged value of the prediction scores obtained with the different sets is the evaluation of the classifier.
Let's see a simple example where a smoothing parameter for a Bayesian classifier is select using the capabilities of the Sklearn library.
To begin we load one of the test datasets provided by sklearn (the same used here) and we hold 33% of the samples for the final evaluation:
from sklearn.cross_validation import train_test_split
X,X_test,y,y_test = train_test_split(data.data,data.target,
test_size=.33,
random_state=1899)
Now, we import the classifier we want to use (a Bernoullian Naive Bayes in this case), specify a set of values for the parameter we want to choose and run a grid search:
from sklearn.naive_bayes import BernoulliNB
# test the model for alpha = 0.1, 0.2, ..., 1.0
parameters = [{'alpha':np.linspace(0.1,1,10)}]

from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(BernoulliNB(), parameters, cv=10, scoring='f1')
clf.fit(X,y) # running the grid search
The grid search has evaluated the classifier for each value specified for the parameter alpha using the CV. We can visualize the results as follows:
res = zip(*[(f1m, f1s.std(), p['alpha'])
for p, f1m, f1s in clf.grid_scores_])
subplot(2,1,1)
plot(res[2],res[0],'-o')
subplot(2,1,2)
plot(res[2],res[1],'-o')
show()

The plots above show the average score (top) and the standard deviation of the score (bottom) for each values of alpha used. Looking at the graphs it seems plausible that a small alpha could be a good choice.
We can also see thet using the alpha value that gave us the best results on the the test set we selected at the beginning gives us results that are similar to the ones obtained during the CV stage:
from sklearn.metrics import f1_score
print 'Best alpha in CV = %0.01f' % clf.best_params_['alpha']
final = f1_score(y_test,clf.best_estimator_.predict(X_test))
print 'F1-score on the final testset: %0.5f' % final
Best alpha in CV = 0.1
F1-score on the final testset: 0.85861

Wednesday, February 26, 2014

Terms selection with chi-square

In Natural Language Processing, the identification the most relevant terms in a collection of documents is a common task. It can produce meaningful insights about the data and it can also be useful to improve classification performances and computational efficiency. A popular measure of relevance for terms is the χ2 statistic. To compute it we can convert the terms of our document collection and turn them into features of a vectorial model, then χ2 can be computed as follow:

Where f is a feature (a term in this case), t is a target variable that we, usually, want to predict, A is the number of times that f and t cooccur, B is the number of times that f occurs without t, C is the number of times that t occurs without f, D is the number of times neither t or f occur and N is the number of observations.

Let's see how χ2 can be used through a simple example. We load some posts from 4 different newsgroups categories using the sklearn interface:
from sklearn.datasets import fetch_20newsgroups
# newsgroups categories
categories = ['alt.atheism','talk.religion.misc',
'comp.graphics','sci.space']

posts = fetch_20newsgroups(subset='train', categories=categories,
shuffle=True, random_state=42,
From the posts loaded, we build a linear model using all the terms in the document collection but the stop words:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(posts.data)
Now, X is a document-term matrix where the element Xi,j is the frequency of the term j in the document i. Then, the features are given by the columns of X and we want to compute χ2 between the categories of interest and each feature in order to figure out what are the most relevant terms. This can be done as follows
from sklearn.feature_selection import chi2
# compute chi2 for each feature
chi2score = chi2(X,posts.target)[0]
To have a visual insight, we can plot a bar chart where each bar shows the χ2 value computed above:
from pylab import barh,plot,yticks,show,grid,xlabel,figure
figure(figsize=(6,6))
wscores = zip(vectorizer.get_feature_names(),chi2score)
wchi2 = sorted(wscores,key=lambda x:x[1])
topchi2 = zip(*wchi2[-25:])
x = range(len(topchi2[1]))
labels = topchi2[0]
barh(x,topchi2[1],align='center',alpha=.2,color='g')
plot(topchi2[1],x,'-o',markersize=2,alpha=.8,color='g')
yticks(x,labels)
xlabel('$\chi^2$')
show()

We can observe that the terms with a high χ2 can be considered relevant for the newsgroup categories we are analyzing. For example, the terms space, nasa and launch can be considered relevant for the group sci.space. The terms god, jesus and atheism can be considered relevant for the groups alt.atheism and talk.religion.misc. And, the terms image, graphics and jpeg can be considered relevant in the category comp.graphics.