Monday, May 14, 2012

Manifold learning on handwritten digits with Isomap

The Isomap algorithm is an approach to manifold learning. Isomap seeks a lower-dimensional embedding of a set of high-dimensional data points, estimating the intrinsic geometry of the data manifold from a rough estimate of each data point's neighbors.
The scikit-learn library provides a great implementation of the Isomap algorithm and a dataset of handwritten digits. In this post we'll see how to load the dataset and how to compute an embedding of the data onto a two-dimensional space.
Let's load the dataset and show some samples:
from pylab import scatter,text,show,cm,figure
from pylab import subplot,imshow,NullLocator
from sklearn import manifold, datasets

# load the digits dataset
# 901 samples, about 180 samples per class 
# the digits represented are 0,1,2,3,4
digits = datasets.load_digits(n_class=5)
X = digits.data
color = digits.target

# show some digits
figure(1)
for i in range(36):
 ax = subplot(6,6,i+1) # subplot indexes start from 1
 ax.xaxis.set_major_locator(NullLocator()) # remove ticks
 ax.yaxis.set_major_locator(NullLocator())
 imshow(digits.images[i], cmap=cm.gray_r) 
The result should be as follows:


Now X is a matrix where each row is a vector that represents a digit. Each vector has 64 elements, obtained by spatial resampling of the images above onto an 8x8 grid. We can apply the Isomap algorithm on this data and plot the result with the following lines:
# running Isomap
# 5 neighbours will be considered and the data
# will be reduced onto a 2D space
Y = manifold.Isomap(n_neighbors=5, n_components=2).fit_transform(X)

# plotting the result
figure(2)
scatter(Y[:,0], Y[:,1], c='k', alpha=0.3, s=10)
for i in range(Y.shape[0]):
 text(Y[i, 0], Y[i, 1], str(color[i]),
      color=cm.Dark2(color[i] / 5.),
      fontdict={'weight': 'bold', 'size': 11})
show()
The new embedding for the data will be as follows:


We computed a two-dimensional representation of each pattern in the dataset, and it's easy to see that the separation between the five classes in the new space is quite neat.
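To get a rough quantitative feel for that separation (this check is not part of the original post), we can train a simple nearest-neighbour classifier on the 2D coordinates and score it on a held-out part of the data. Note that the embedding was fitted on all the samples, so this is only a sanity check, not a proper evaluation:
from sklearn.neighbors import KNeighborsClassifier

# hold out the last 200 samples and classify them
# using only the 2D coordinates computed by Isomap
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(Y[:-200], color[:-200])
print 'accuracy on the held-out samples:', knn.score(Y[-200:], color[-200:])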

Friday, May 4, 2012

Analyzing your Gmail with Matplotlib

Recently I read this post about using Mathematica to analyze a Gmail account. I found it very interesting, so I worked a bit with imaplib and matplotlib to recreate two of the graphs shown there:
  • A diurnal plot, which shows the date and time each email was sent (or received), with years running along the x axis and times of day on the y axis.
  • And a daily distribution histogram, which represents the distribution of emails sent by time of day.
In order to plot those graphs I created three functions. The first one retrieves the headers of the emails we want to analyze:
from imaplib import IMAP4_SSL
from datetime import date,timedelta,datetime
from time import mktime
from email.utils import parsedate
from pylab import plot_date,show,xticks,date2num
from pylab import figure,hist,num2date
from matplotlib.dates import DateFormatter

def getHeaders(address,password,folder,d):
 """ retrieve the headers of the emails 
     from d days ago until now """
 # imap connection
 mail = IMAP4_SSL('imap.gmail.com')
 mail.login(address,password)
 mail.select(folder) 
 # retrieving the uids
 interval = (date.today() - timedelta(d)).strftime("%d-%b-%Y")
 result, data = mail.uid('search', None, 
                      '(SENTSINCE {date})'.format(date=interval))
 # retrieving the headers
 result, data = mail.uid('fetch', data[0].replace(' ',','), 
                         '(BODY[HEADER.FIELDS (DATE)])')
 mail.close()
 mail.logout()
 return data
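A side note (not in the original post): here all the headers are requested with a single fetch; on mailboxes with many thousands of messages that call may be slow or rejected by the server. A possible workaround is to fetch the headers in smaller blocks, along these lines (getHeadersInBlocks is a hypothetical helper; uids is the list returned by data[0].split()):
def getHeadersInBlocks(mail, uids, blocksize=500):
 """ fetch the date headers of the given uids in blocks """
 data = []
 for i in range(0, len(uids), blocksize):
  result, block = mail.uid('fetch', ','.join(uids[i:i+blocksize]),
                           '(BODY[HEADER.FIELDS (DATE)])')
  data.extend(block)
 return data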
The second one draws the diurnal plot:
def diurnalPlot(headers):
 """ diurnal plot of the emails, 
     with years running along the x axis 
     and times of day on the y axis.
 """
 xday = []
 ytime = []
 for h in headers: 
  if len(h) > 1:
   timestamp = mktime(parsedate(h[1][5:].replace('.',':')))
   mailstamp = datetime.fromtimestamp(timestamp)
   xday.append(mailstamp)
   # time the email arrived
   # note that the year, month and day are not important here
   y = datetime(2010,10,14, 
     mailstamp.hour, mailstamp.minute, mailstamp.second)
   ytime.append(y)

 plot_date(xday,ytime,'.',alpha=.7)
 xticks(rotation=30)
 return xday,ytime
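One caveat worth mentioning (not in the original post): parsedate ignores the time zone of the Date header, so each email's time is taken exactly as written in the header, without any conversion. If you prefer timestamps normalized to UTC, email.utils also provides parsedate_tz and mktime_tz; a minimal sketch:
from email.utils import parsedate_tz, mktime_tz
from datetime import datetime

def parseToUTC(datestring):
 """ parse a Date header and return the corresponding UTC datetime """
 return datetime.utcfromtimestamp(mktime_tz(parsedate_tz(datestring)))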
And this is the function for the daily distribution histogram:
def dailyDistributionPlot(ytime):
 """ draw the histogram of the daily distribution """
 # converting dates to numbers
 numtime = [date2num(t) for t in ytime] 
 # plotting the histogram
 ax = figure().gca()
 _, _, patches = hist(numtime, bins=24,alpha=.5)
 # adding the labels for the x axis
 tks = [num2date(p.get_x()) for p in patches] 
 xticks(tks,rotation=75)
 # formatting the dates on the x axis
 ax.xaxis.set_major_formatter(DateFormatter('%H:%M'))
Now we have everything we need to make the graphs. Let's analyze the emails received over the last 5 years:
print 'Fetching emails...'
headers = getHeaders('iamsupersexy@gmail.com',
                      'ofcourseiamsupersexy','inbox',365*5)

print 'Plotting some statistics...'
xday,ytime = diurnalPlot(headers)
dailyDistributionPlot(ytime)
print len(xday),'Emails analysed.'
show()
The result would appear as follows:



We can analyze the outgoing emails simply by selecting the folder '[Gmail]/Sent Mail':
print 'Fetching emails...'
headers = getHeaders('iamsupersexy@gmail.com',
                     'ofcourseiamsupersexy','[Gmail]/Sent Mail',365*5)

print 'Plotting some statistics...'
xday,ytime = diurnalPlot(headers)
dailyDistributionPlot(ytime)
print len(xday),'Emails analysed.'
show()
And this is the result:

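A final note: if you're not sure which folder names your account exposes (the labels may vary, for instance with the language of the Gmail interface), imaplib can list them for you. A quick sketch:
from imaplib import IMAP4_SSL

mail = IMAP4_SSL('imap.gmail.com')
mail.login('iamsupersexy@gmail.com', 'ofcourseiamsupersexy')
result, folders = mail.list() # list of the available folders
for f in folders:
 print f
mail.logout()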
Thursday, April 26, 2012

K-Nearest Neighbour Classifier

The Nearest Neighbour Classifier is one of the most straightforward classifiers in the arsenal of machine learning techniques. It performs classification by identifying the nearest neighbours of a query pattern and using those neighbours to determine the label of the query. The idea behind the algorithm is simple: assign the query pattern to the class that occurs most often among its k nearest neighbours. In this post we'll use the function knn_search(...) that we saw in the last post to implement a K-Nearest Neighbour Classifier. The implementation of the classifier is as follows:
from numpy import random,argsort,argmax,bincount,int_,round
from pylab import scatter,plot,show

def knn_classifier(x, D, labels, K):
 """ Classify the vector x
     D - data matrix (each column is a pattern).
     labels - class of each pattern.
     K - number of neighbours to use.
     Returns the class label and the neighbors indexes.
 """
 neig_idx = knn_search(x,D,K)
 counts = bincount(labels[neig_idx]) # voting
 return argmax(counts),neig_idx
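The function knn_search(...) comes from the previous post; if you don't have it at hand, a minimal version along these lines should work (it assumes, like the test below, that each column of D is a pattern and x is a column vector):
from numpy import sqrt, argsort

def knn_search(x, D, K):
 """ return the indexes of the K patterns of D closest to x """
 sqd = sqrt(((D - x)**2).sum(axis=0)) # euclidean distances from x
 idx = argsort(sqd) # sort the patterns by distance
 return idx[:K] # indexes of the K nearest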
Let's test the classifier on some random data:
# generating a random dataset with random labels
data = random.rand(2,150) # random points
labels = int_(round(random.rand(150))) # random labels, 0 or 1
x = random.rand(2,1) # random test point

# label assignment using k=5
result,neig_idx = knn_classifier(x,data,labels,5)
print 'Label assignment:', result

# plotting the data and the input pattern
# class 1, red points, class 0 blue points
scatter(data[0,:],data[1,:], c=labels,alpha=0.8)
scatter(x[0],x[1],marker='o',c='g',s=40)
# highlighting the neighbours
plot(data[0,neig_idx],data[1,neig_idx],'o',
  markerfacecolor='None',markersize=15,markeredgewidth=1)
show()
The script will show the following graph:



The query vector is represented by the green point, and we can see that 3 of its 5 nearest neighbours are red points (label 1) while the remaining 2 are blue (label 0).
The result of the classification will be printed on the console:
Label assignment: 1
As we expected, the green point has been assigned to the class with the red markers.
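One last detail, not covered above: with an even K the voting can end in a tie, and in that case argmax silently returns the smallest label. For a binary problem like this one, using an odd K avoids the ambiguity. A tiny example of the tie:
from numpy import bincount, argmax

votes = bincount([0, 0, 1, 1]) # two votes for class 0, two for class 1
print argmax(votes) # prints 0: the smallest label wins the tie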