Thursday, December 12, 2013

Multiple axes and subplots in Plotly

Some time ago we have seen how to visualize 2D histograms with Plotly and in this post we will see how to use one of the mostin interesting new features introduced by the Plotly guys: multiple axes into subplots. This features makes us able to couple subplots, so when you zoom or pan in one subplot, it zooms and pans in the other subplots. Just like the graphs produced by D3.
Here's how to plot a subplots matrix where each cell is a scatterplot between the features of the Iris dataset that we already used here. The first thing we need to do is to convert the data in the format required by the Plotly API:
from sklearn.datasets import load_iris
iris = load_iris()

attr = [f.replace(' (cm)', '') for f in iris.feature_names]
colors = {'setosa': 'rgb(31, 119, 180)', 
          'versicolor': 'rgb(255, 127, 14)', 
          'virginica': 'rgb(44, 160, 44)'}

data = []
for i in range(4):
    for j in range(4):
        for t,flower in enumerate(iris.target_names):
            data.append({"name": flower, 
                         "x":[ == t,i],
                         "y":[ == t,j],
                         "type":"scatter", "mode":"markers",
                         'marker': {'color': colors[flower], 
                         "xaxis": "x"+(str(i) if i!=0 else ''),
                         "yaxis": "y"+(str(j) if j!=0 else '')})
Then, we create a layout to adjust the look and feel:
d = 0.04; # padding
dms = [[i*d+i*(1-3*d)/4,i*d+((i+1)*(1-3*d)/4)] for i in range(4)]

layout = {
    "xaxis":{"domain":dms[0], "title":attr[0], 
    "yaxis":{"domain":dms[0], "title":attr[0], 
    "xaxis1":{"domain":dms[1], "title":attr[1], 
    "yaxis1":{"domain":dms[1], "title":attr[1], 
    "xaxis2":{"domain":dms[2], "title":attr[2], 
    "yaxis2":{"domain":dms[2], "title":attr[2], 
    "xaxis3":{"domain":dms[3], "title":attr[3], 
    "yaxis3":{"domain":dms[3], "title":attr[3], 
    "width": 500,
    "height": 550,
    "title":"Iris flower data set",
    "titlefont":{'color':'rgb(67,67,67)', 'size': 20}
Finally, we import the plotly module (see this page for more details about the installation) and we are read to invoke the Plotly remote service:
import plotly
p = plotly.plotly('supersexyusername', 'mysecretkey')
# iplot shows the graph in the ipython notebook
# use plot if you're outside of the notebook
p.iplot(data,layout=layout, width=500,height=550)
The result should be as follows: This interactive graph of the iris data set below was inspired by this wonderful D3 example by Mike Bostock. Find out more example of Plotly visualizations in Python inside the IPython notebook here.

Thursday, November 7, 2013

Book review: Learning IPython for Interactive Computing and Data Visualization

I use IPython almost every day and I am very happy to review Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant, and published by Packt Publishing.

The book introduces the IPython basics and then focuses on how to combine IPython with some of the most useful libraries for data analysis such as Numpy, Matplotlib, Basemap and Pandas. Every topic is covered with examples and the code presented is also available online. The references proposed are always up-to-date and give the reader the opportunity to discovery resources not covered in the book.

Favorite chapter

The Chapter 5 is a little gem. Here, you can find an introduction on how to use IPython to write high performance code through Cython and the parallel programming facilities of IPython. The attention paid by the author on how to write efficient code is remarkable.


This book definitely achieves its goal to provide a technical introduction to IPython. It is intended for Python users who want an easy to follow introduction to IPython, but also experienced users will find this book useful. It is to notice that, at the moment, this is the only book about IPython.

Sunday, September 15, 2013

Self Organizing Maps

The Self Organizing Maps (SOM), also known as Kohonen maps, are a type of Artificial Neural Networks able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. In a SOM the neurons are organized in a bidimensional lattice and each neuron is fully connected to all the source nodes in the input layer. An illustration of the SOM by Haykin (1999) is the following

Each neuron n has a vector wn of weights associated. The process for training a SOM involves stepping through several training iteration until the item in your dataset are learnt by the SOM. For each pattern x one neuron n will "win" (which means that wn is the weights vector more similar to x) and this winning neuron will have its weights adjusted so that it will have a stronger response to the input the next time it sees it (which means that the distance between x and wn will be smaller). As different neurons win for different patterns, their ability to recognize that particular pattern will increase. The training algorithm can be summarized as follows:
  1. Initialize the weights of each neuron.
  2. Initialize t = 0
  3. Randomly pick an input x from the dataset
  4. Determine the winning neuron i as the neuron such that

  5. Adapt the weights of each neuron n according to the following rule

  6. Increment t by 1
  7. if t < tmax go to step 3
We have that η(t) is called learning rate and that h(i) is called neighborhood function which has high values for i and the neurons close to i on the lattice (a Gaussian centered on i is a good example of neighborhood function). And, when t increases η also decrease and h decrease its spread. This way at each training step the weights of the neurons close to the winning one are adjusted to have a stronger response to the current pattern. After the training process, we have that the locations of the neurons become ordered and a meaningful coordinate system for the input features is created on the lattice. So, if we consider the coordinates of the associated winning neuron for each patter the SOM forms a topographic map of the input patterns.

MiniSom is a minimalistic and Numpy based implementation of the SOM. I made it during the experiments for my thesis in order to have fully hackable SOM algorithm and lately I decided to release it on GitHub. The next part of this post will show how to train MiniSom on the Iris Dataset and how to visualize the result. The first step is to import and normalize the data:
from numpy import genfromtxt,array,linalg,zeros,apply_along_axis

# reading the iris dataset in the csv format    
# (downloaded from
data = genfromtxt('iris.csv', delimiter=',',usecols=(0,1,2,3))
# normalization to unity of each pattern in the data
data = apply_along_axis(lambda x: x/linalg.norm(x),1,data)
The snippet above reads the dataset from a CSV and creates a matrix where each row corresponds to a pattern. In this case, we have that each pattern has 4 dimensions. (Note that only the first 4 columns of the file are used because the fifth column contains the labels). The training process can be started as follows:
from minisom import MiniSom
### Initialization and training ###
som = MiniSom(7,7,4,sigma=1.0,learning_rate=0.5)
som.train_random(data,100) # training with 100 iterations
Now we have a 7-by-7 SOM trained on our dataset. MiniSom uses a Gaussian as neighborhood function and its initial spread is specified with the parameter sigma. While with the parameter learning_rate we can specify the initial learning rate. The training algorithm implemented decreases both parameters as training progresses. This allows rapid initial training of the neural network that is then "fine tuned" as training progresses.
To visualize the result of the training we can plot the average distance map of the weights on the map and the coordinates of the associated winning neuron for each patter:
from pylab import plot,axis,show,pcolor,colorbar,bone
pcolor(som.distance_map().T) # distance map as background
# loading the labels
target = genfromtxt('iris.csv',
t = zeros(len(target),dtype=int)
t[target == 'setosa'] = 0
t[target == 'versicolor'] = 1
t[target == 'virginica'] = 2
# use different colors and markers for each label
markers = ['o','s','D']
colors = ['r','g','b']
for cnt,xx in enumerate(data):
 w = som.winner(xx) # getting the winner
 # palce a marker on the winning position for the sample xx
show() # show the figure
The result should be like the following:

For each pattern in the dataset the corresponding winning neuron have been marked. Each type of marker represents a class of the iris data ( the classes are setosa, versicolor and virginica and they are respectively represented with red, green and blue colors). The average distance map of the weights is used as background (the values are showed in the colorbar on the right). As expected from previous studies on this dataset, the patterns are grouped according to the class they belong and a small fraction of Iris virginica is mixed with Iris versicolor.

For a more detailed explanation of the SOM algorithm you can look at its inventor's paper.

Tuesday, July 23, 2013

Combining Scikit-Learn and NTLK

In Chapter 6 of the book Natural Language Processing with Python there is a nice example where is showed how to train and test a Naive Bayes classifier that can identify the dialogue act types of instant messages. Th classifier is trained on the NPS Chat Corpus which consists of over 10,000 posts from instant messaging sessions labeled with one of 15 dialogue act types.
The implementation of the Naive Bayes classifier used in the book is the one provided in the NTLK library. Here we will see how to use use the Support Vector Machine (SVM) classifier implemented in Scikit-Learn without touching the features representation of the original example.
Here is the snippet to extract the features (equivalent to the one in the book):
import nltk

def dialogue_act_features(sentence):
        Extracts a set of features from a message.
    features = {}
    tokens = nltk.word_tokenize(sentence)
    for t in tokens:
        features['contains(%s)' % t.lower()] = True    
    return features

# data structure representing the XML annotation for each post
posts = nltk.corpus.nps_chat.xml_posts() 
# label set
cls_set = ['Emotion', 'ynQuestion', 'yAnswer', 'Continuer',
'whQuestion', 'System', 'Accept', 'Clarify', 'Emphasis', 
'nAnswer', 'Greet', 'Statement', 'Reject', 'Bye', 'Other']
featuresets = [] # list of tuples of the form (post, features)
for post in posts: # applying the feature extractor to each post
 # post.get('class') is the label of the current post
After the feature extraction we can split the data we obtained in training and testing set:
from random import shuffle
size = int(len(featuresets) * .1) # 10% is used for the test set
train = featuresets[size:]
test = featuresets[:size]
Now we can instantiate the model that implements classifier using the scikitlearn interface provided by NLTK and train it:
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
# SVM with a Linear Kernel and default parameters 
classif = SklearnClassifier(LinearSVC())
In order to use the batch_classify method provided by scikitlearn we have to organize the test set in two lists, the first one with the train data and the second one with the target labels:
test_skl = []
t_test_skl = []
for d in test:
Then we can run the classifier on the test set and print a full report of its performances:
# run the classifier on the train test
p = classif.batch_classify(test_skl)
from sklearn.metrics import classification_report
# getting a full report
print classification_report(t_test_skl, p, labels=list(set(t_test_skl)),target_names=cls_set)
The report will look like this:
              precision    recall  f1-score   support

    Emotion       0.83      0.85      0.84       101
 ynQuestion       0.78      0.78      0.78        58
    yAnswer       0.40      0.40      0.40         5
  Continuer       0.33      0.15      0.21        13
 whQuestion       0.78      0.72      0.75        50
     System       0.99      0.98      0.98       259
     Accept       0.80      0.59      0.68        27
    Clarify       0.00      0.00      0.00         6
   Emphasis       0.59      0.59      0.59        17
    nAnswer       0.73      0.80      0.76        10
      Greet       0.94      0.91      0.93       160
  Statement       0.76      0.86      0.81       311
     Reject       0.57      0.31      0.40        13
        Bye       0.94      0.68      0.79        25
      Other       0.00      0.00      0.00         1

avg / total       0.84      0.85      0.84      1056

Friday, July 12, 2013

Hopalong fractals

Have you ever wondered what happen if you pick a point (x0,y0) and compute hundreds of point using these equations?

Well, you get a hopalong fractal.
Let's plot this fractal using Pylab. The following function computes n points using the equations above:
from __future__ import division
from numpy import sqrt,power

def hopalong(x0,y0,n,a=-55,b=-1,c=-42):
 def update(x,y):
  x1 = y-x/abs(x)*sqrt(abs(b*x+c))
  y1 = a-x
  return x1,y1
 xx = []
 yy = []
 for _ in xrange(n):
  x0,y0 = update(x0,y0) 
 return xx,yy
and this snippet computes 40000 points starting from (-1,10):
from pylab import scatter,show, cm, axis
from numpy import array,mean
x = -1
y = 10
n = 40000
xx,yy = hopalong(x,y,n)
cr = sqrt(power(array(xx)-mean(xx),2)+power(array(yy)-mean(yy),2))
scatter(xx, yy, marker='.', c=cr/max(cr), 
        edgecolor='w', cmap=cm.Dark2, s=50)
Here we have one of the possible hopalong fractals:

Varying the starting point and the values of a, b and c we have different fractals. Here are some of them:



Friday, June 28, 2013

Shape Matching Experiments

Often, in Computer Vision tasks we have a model of an interesting shape and we want to know if this model matches with a target shape found through the analysis of the edges of an image. The target shape is never identical to the model we have, because it could be a representation of the model in a different pose or it could match only partially our model, so we have to find a proper transformation of the model in order to match the target. In this post we will see how to find the closest shape model to a target shape using the fmin function provided by Scipy.
We can represent a shape with a matrix like the following:

where each pair (xi,yi) is a "landmark" point of the shape and we can transform every generic point (x,y) using this function:

In this transformation we have that θ is a rotation angle, tx and ty are the translation along the x and the y axis respectively, while s is a scaling coefficient. Applying this transformation to every point of the shape described by X we get a new shape which is a transformation of the original one according to the parameters used in T. Given a matrix as defined above, the following Python function is able to apply the transformation to every point of the shape represented by the matrix:
import numpy as np
import pylab as pl
from scipy.optimize import fmin

def transform(X,theta,tx=0,ty=0,s=1):
 """ Performs scaling, rotation and translation
     on the set of points in the matrix X 
      X, points (each row have to be a point [x, y])
      theta, rotation angle (in radiants)
      tx, translation along x
      ty, translation along y
      s, scaling coefficient
      Y, transformed set of points
 a = math.cos(theta)
 b = math.sin(theta)
 R = np.mat([[a*s, -b*s], [b*s, a*s]])
 Y = R*X # rotation and scaling
 Y[0,:] = Y[0,:]+tx # translation
 Y[1,:] = Y[1,:]+ty
 return Y
Now, we want to find the best pose (translation, scale and rotation) to match a model shape X to a target shape Y. We can solve this problem minimizing the sum of square distances between the points of X and the ones of Y, which means that we want to find

where T' is a function that applies T to all the point of the input shape. This task turns out to be pretty easy using the fmin function provided by Scipy:
def error_fun(p,X,Y):
 """ cost function to minimize """
 return np.linalg.norm(transform(X,p[0],p[1],p[2],p[3])-Y)

def match(X,Y,p0):
 """ Match the model X with the shape Y
     returns the set of parameters to transform
     X into Y and a set of partial solutions.
 p_opt = fmin(error_fun, p0, args=(X,Y),retall=True)
 return p_opt[0],p_opt[1] # optimal solutions, other solutions
In the code above we have defined a cost function which measures how close the two shapes are and another function that minimizes the difference between the two shapes and returns the transfomation parameters. Now, we can try to match the trifoil shape starting from one of its transformations using the functions defined above:
# generating a shape to match
t = np.linspace(0,2*np.pi,50)
x = 2*np.cos(t)-.5*np.cos( 4*t )
y = 2*np.sin(t)-.5*np.sin( 4*t )
A = np.array([x,y]) # model shape
# transformation parameters
p = [-np.pi/2,.5,.8,1.5] # theta,tx,ty,s
Ar = transform(A,p[0],p[1],p[2],p[3]) # target shape
p0 = np.random.rand(4) # random starting parameters
p_opt, allsol = match(A,Ar,p0) # matching
It's interesting to visualize the error at each iteration:
pl.plot([error_fun(s,A,Ar) for s in allsol])

and the value of the parameters of the transformation at each iteration:

From the graphs above we can observe that the the minimization process found its solution after 150 iterations. Indeed, after 150 iterations the error become very close to zero and there is almost no variation in the parameters. Now we can visualize the minimization process plotting some of the solutions in order to compare them with the target shape:
def plot_transform(A,p):
 Afound = transform(A,p[0],p[1],p[2],p[3])

for k,i in enumerate(range(0,len(allsol),len(allsol)/15)):
 pl.plot(Ar[0,:].T,Ar[1,:].T,'--',linewidth=2) # target shape

In the graph above we can see that the initial solutions (the green ones) are very different from the shape we are trying to match (the dashed blue ones) and that during the process of minimization they get closer to the target shape until they fits it completely. In the example above we used two shapes of the same form. Let's see what happen when we have a starting shape that has a different form from the target model:
t = np.linspace(0,2*np.pi,50)
x = np.cos(t)
y = np.sin(t)
Ac = np.array([x,y]) # the circle (model shape)
p0 = np.random.rand(4) # random starting parameters
# the target shape is the trifoil again
p_opt, allsol = match(Ac,Ar,p0) # matching
In the example above we used a circle as model shape and a trifoil as target shape. Let's see what happened during the minimization process:
for k,i in enumerate(range(0,len(allsol),len(allsol)/15)):
 pl.plot(Ar[0,:].T,Ar[1,:].T,'--',linewidth=2) # target shape

In this case, where the model shape can't match the target shape completely, we see that the minimization process is able to find the circle that is closer to the trifoil.

Sunday, June 16, 2013

2D Histrograms with Plotly

Plotly is an online tool that makes us able to create wonderful interactive visualizations of our data. It can plot data from csv files, spreadsheet, etc. but it also has a Python sandbox where we can put our Python snippets! In this post we will see a simple example that shows how to plot a 2D histogram in Plotly.

First, we need a snippet to generate some random sets of data:
from numpy import *
# generate some random sets of data
y0 = random.randn(100)/5. + 0.5 
x0 = random.randn(100)/5. + 0.5 
y1 = random.rayleigh(size=20)/7. + 0.1
x1 = random.rayleigh(size=20)/8. + 1.1
y2 = random.randn(50)/10. + 0.9
x2 = random.rayleigh(size=50)/10. + 0.1
y3 = random.randn(50)/8. + 0.1
x3 = random.randn(50)/8. + 0.1
y = concatenate([y0,y1,y2,y3])
x = concatenate([x0,x1,x2,x3])

The distribution of the variable x looks like:

The distribution of the variable y looks like: And the 2D histogram of both variables looks like this:

As showed in the colorbar, cells with lighter colors correspond to high density areas of the our distribution.

All the plots above were made with Plotly inside their Python sandbox using the following code:
## place the data into Plotly's dict format

# histograms
histx = {'x': x, 'type':'histogramx'}
histy = {'y': y, 'type':'histogramy'}
hist2d = {'x': x, 'y': y, 'type':'histogram2d'}

# scatter plots above the 1D histograms
# "jitter" the scatter plot points to make their distribution easier to distinguish
jitterx = {'x': x, 'y': 60+3*random.rand((len(x))), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

jittery = {'x': y, 'y': 35+3*random.rand((len(x))), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

# scatter points in the 2D histogram
xy = {'x': x, 'y': y, 'type':'scatter','mode':'markers','marker':{'size':5,'opacity':0.5,'symbol':'square'}}

# NOTE: the following lines plot all the graph above
plot([histx, jitterx], layout={'title': 'Distribution of Variable 1'})
plot([histy, jittery], layout={'title': 'Distribution of Variable 2'})
plot([hist2d,xy], layout={'title': 'Distribution of Variable 1 and Variable 2'})
Plots made with Plotly automatically provide interactions (click-drag to zoom, double-click to autoscale, shift-click to pan) and are very easy to embed in web page using the embedding snippet.

Thanks to the Plotly guys for providing the code of this post and this amazing tool :)

Saturday, May 25, 2013

A colorful Trifoil Knot

Some time ago I was looking for a cool way to visualize parametric equations and I started working on the the trefoil knot. It is the simplest example of a nontrivial knot and it can be defined with the following parametric equations:

After a while this snippet came out:
from numpy import linspace,pi,cos,sin
phi = linspace(0,2*pi,100)
x = sin(phi)+2*sin(2*phi)
y = cos(phi)-2*cos(2*phi)
z = -sin(3*phi)

from pylab import scatter,subplot,cm,show
from numpy import abs
So, here's my colorful version of the trifold knot:

Friday, May 3, 2013

A new RefCard from the GlowingPython!

Check out the DZone RefCard from the GlowingPython:

This Refcard is a collection of code examples that introduces the reader to the principal Data Mining tasks using Python. In the RefCard you will find the following contents:
  • How to import and visualize data.
  • How to classify and cluster data.
  • How to discover relationships in the data using regression and correlation measures.
  • How to reduce the dimensionality of the data in order to compress and visualize the information it brings.
  • How to analyze structured data with networkx.
Each topic is covered with code examples based on four of the major Python libraries for data analysis and manipulation: numpy, matplotlib,sklearn and networkx. Here is a preview of the first two pages:

Click on the preview to get the RefCard!

Thursday, April 11, 2013

Odd-Even sort visualized

The Odd/Even sort is a sorting algorithm which uses the concept of the Bubble Sort to move elements around. Unlike Bubble sort, the Odd/Even sort compares disjointed pairs by using alternating odd and even index values splitting the sorting in different phases. We have that in the odd phase all the element with an odd index i are compared with the adjacent even index i+1 and in the even phase all the element with an even index i are compared with the adjacent odd index i+1. These two phases are repeated until no exchanges are required. This is a Python snippet that make us able to visualize the behavior of the algorithm:
def oddeven_anim(a):
 imgidx = 0
 x = range(len(a))
 sort = False; 
 while not sort:
  sort = True;
  for i in range(1,len(a)-1,2): # odd phase
   if a[i] > a[i+1]:
    a[i+1], a[i] = a[i], a[i+1]
    sort = False;
  for i in range(0,len(a)-1,2): # even phase
   if a[i] > a[i+1]:
    a[i+1], a[i] = a[i], a[i+1]
    sort = False;
  pylab.savefig("oddevensort/img" + '%04d' % imgidx + ".png")
  imgidx =  imgidx + 1

# run the algorithm
a = range(300)
As in the other examples of sorting visualization in this blog, we have an image for each step of the algorithm. The following video have been produced joining all the images (here is explained how):

Monday, April 1, 2013

Real-time Twitter analysis

The twitter API is a great tool for analyze tweets by code. In particular, the streaming API gives real time access to the global stream of tweets and, unlike a conventional REST API, it is used through a continuous connection to the Twitter servers. So it requires a persistent HTTP connection open as long as you need to collect tweets. The typical workflow of an application which uses this API is the following:

The easiest way to handle an HTTP streaming in Python is to use PyCurl, the Python bindings for the famous Curl network library. PyCurl allows you to provide a callback function that will be executed every time a new block of data is available. The following code is a simple demonstration of how we can use HTTP streaming with PyCurl in order to analyzie a stream of tweet:
from __future__ import division
from collections import defaultdict
from pylab import barh,show,yticks
import pycurl
import simplejson
import sys
import nltk
import re

def plot_histogram(freq, mean):
 # using dict comprehensions to remove not frequent words
 topwords = {word : count 
             for word,count in freq.items() 
             if count > round(2*mean)}
 # plotting
 y = topwords.values()
 x = range(len(y))
 labels = topwords.keys()

class TwitterAnalyzer:
 def __init__(self):
  self.freq = defaultdict(int)
  self.cnt = 0
  self.mean = 0.0
  # composing the twitter stream url
  nyc_area = 'locations=-74,40,-73,41'
  self.url = ""+nyc_area

 def begin(self,usr,pws):
    init and start the connection with twitter using pycurl
    usr and pws must be valid twitter credentials
  self.conn = pycurl.Curl()
  # we use the basic authentication, 
  # in future oauth2 could be required
  self.conn.setopt(pycurl.USERPWD, "%s:%s" % (usr, pws))
  self.conn.setopt(pycurl.URL, self.url)
  self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)

 def on_receive(self,data):
  """ Handles the arrive of a single tweet """
  self.cnt += 1
  tweet = simplejson.loads(data)
  # a little bit of natural language processing
  tokens = nltk.word_tokenize(tweet['text']) # tokenize
  tagged_sent = nltk.pos_tag(tokens) # Part Of Speech tagging
  for word,tag in tagged_sent:
   # filter sigle chars words and symbols
   if len(word) > 1 and re.match('[A-Za-z0-9 ]+',word):
    # consider only adjectives and nouns
    if tag == 'JJ' or tag == 'NN':
     self.freq[word] += 1 # keep the count
  # print the statistics every 50 tweets
  if self.cnt % 50 == 0:

 def print_stats(self):
  maxc = 0
  sumc = 0
  for word,count in self.freq.items():
   if maxc < count:
    maxc = count
   sumc += count
  self.mean = sumc/len(self.freq)
  print '-------------------------------'
  print ' tweets analyzed:', self.cnt
  print ' words extracted:', len(self.freq)
  print '   max frequency:', maxc
  print '  mean frequency:', self.mean

 def close_and_plot(self,signal,frame):
  print ' Plotting...'  
In the constructor of this class we initialize a dictionary that will contain the frequency of each word, a string that contains the url of the service we need to call (composed in order to query twitter for the tweets in NYC) and the variables cnt and mean to keep track of the number of tweets analyzed and of the mean frequency over all the words.
In the method begin, we use the PyCurl library for the authentication to Twitter and start the connection. In particular, we set that the method on_receive is the callback function demanded to processing of the incoming. In this method the actual analysis is done, every tweet is split in tokens and a part of speech tagging is performed. Then, the frequency of all the words that are adjective or nouns is updated.
The method print_stats is used to print the our statistics on the console while close_and_plot plots an histogram using the frequencies in the dictionary and closes the program.
Let's use this class:
import signal
usr = 'supersexytwitteruser'
pws = "yessosexyiam"

ta = TwitterAnalyzer()
# invoke the close_and_plot() method when a Ctrl-D arrives
signal.signal(signal.SIGINT, ta.close_and_plot)
ta.begin(usr,pws) # run the analysis
In the code above, a TwitterAnalyzer object is instantiated, its method close_and_plot is registered as handler for the SIGINT signal and, finally, begin is invoked.
This code starts a program which analyzes all the tweet of the New York Area in real time and prints the statistics every 50 tweets, just like follows:
 tweets analyzed: 50
 words extracted: 110
   max frequency: 8
  mean frequency: 1.12727272727
 tweets analyzed: 100
 words extracted: 200
   max frequency: 22
  mean frequency: 1.235
 tweets analyzed: 150
 words extracted: 286
   max frequency: 29
  mean frequency: 1.26573426573
 tweets analyzed: 200
 words extracted: 407
   max frequency: 39
  mean frequency: 1.31203931204
 tweets analyzed: 250
 words extracted: 495
   max frequency: 49
  mean frequency: 1.37575757576
Pressing Cntrl-D we can stop the program and plot a bar chart of adjectives and nouns detected. This is what I got in the morning of March 21, 2013:

We see that is very common to post a link in a tweet (turned out that http is considered as a noun most of the time) and that the words day, today, good and morning were the most used during the analysis.

Monday, March 18, 2013

Binary Plots

A binary plot of an integer sequence is a plot of the binary representations of successive terms where each term is represented as a sequence of bits with 1s colored black and 0s colored white. Then each representation is stacked to form a table where each entry represents a bit in the sequence.
To make a binary plot of a sequence we need to convert each term of the sequence into its binary representation. Then we have to put this representation in a form which make us able to build the plot easily. The following function converts the sequence in a matrix where the element i-j represents is the bit j of the element i of the sequence:
from numpy import zeros,uint8
from pylab import imshow,subplot,show

def seq_to_bitmatrix(s):
 """ Returns a matrix where the row i 
     is the binary representation of s[i] 
     s must be a list of integers (also long) """
 # maximum number of bits used in this sequence
 maxbit = len(bin(max(s)))
 M = zeros((len(s),maxbit),dtype=uint8)
 for i,e in enumerate(s):
  bits = bin(e)[2:] # bin() -> 0b100, we drop 2 chars
  for j,b in enumerate(bits):
   M[i,maxbit-len(bits)+j] = int(b) # fill the matrix
 return M       
The matrix returned by the function above represent each bit with an integer. Of course, this is a waste of space but makes us able to use the function imshow in order to draw the binary plot:
def show_binary_matrix(M):
 imshow(M,cmap='Greys', interpolation='nearest')
Now we have all the necessary. Let's plot the factorial sequence:
from math import factorial
s = [factorial(n) for n in range(200)]
The following graph shows the result:

From this graph we can observe two interesting things. The first is that we need more than 1200 bits to represent the biggest number of the sequence we generated, which is 199!, and the second is that when n grows the bits on the right tend to be unused.
Let's see what happen with the famous Fibonacci sequence:
fib_cache = {0:0,1:1}
def fib_memo(n):
 """ Returns the first n Fibonacci numbers """
 if n not in fib_cache:
  fib_cache[n] = fib_memo(n-1)+fib_memo(n-2)
 return fib_cache[n]

s = [fib_memo(n) for n in range(1,150)]
In the snippet above we compute the first 149 Fibonacci numbers using memoization then we plotted the sequence in the same way of the factorial. This time the result is surprising:

Indeed, it is very interesting to observe that the graph reveals a fractal like pattern of hollow and filled triangles.
Another sequence that shows a fractal like series of patter is the sum of the first n natural numbers which we can plot as follows:
s = [n*(n+1)/2 for n in range(124)]
In this case the Gauss formula have been used to compute the terms of the sequence and this is the result:

This time the matrix have been transposed to have a vertical visualization. We have that each column is a binary representation of a term of the sequence. In conclusion, we plot the sequence np for p = 1,2,3,4:
# n^p
for p in range(1,5):
    s = [pow(n,p) for n in range(124)]
    imshow(seq_to_bitmatrix(s).T,cmap='Greys',  interpolation='nearest')
And we have this:

We can see that for p=1,2 we have something that looks like the graph we obtained with the sequence of the sum of the first n natural numbers.

Thursday, March 7, 2013

Insertion Sort animation

To continue the series of animated sorting algorithm this time we will deal with the Insertion Sort. We are still in the basic algorithms and the way it works is pretty simple. It inserts each element of the array into its proper position, leaving progressively larger stretches of the array sorted. What this means in practice is that the sort iterates down an array, and the part of the array already covered is in order; then, the current element of the array is inserted into the proper position at the head of the array while the rest of the elements are moved using the space just vacated by the element inserted. The Python snippet for creating the animation is the following:
def insertionsort_anim(a):
 x = range(len(a))
 for i in range(1,len(a)):
  val = a[i]
  j = i-1
  while j >= 0 and a[j] > val:
   a[j+1] = a[j] # creating the space for the insertion
   j = j-1
  a[j+1] = val # insertion
  pylab.savefig("insertionsort/img" + '%04d' % i + ".png")

# running the algorithm
a = range(300)
The strategy used to create the video is the usual and we can see the result in the following video:

It's interesting to see that the algorithm forms small subsets of sorted items that tend to join when the number of iteration grows and the holes between the them are filled by the insertion process.

Saturday, March 2, 2013

Bubble Sort visualized

The bubble sort algorithm is one of the simplest sorting methods to implement and in this post we will see that it is also one of the best to visualize. It works by repeatedly exchanging adjacent elements when necessary and stops when no exchanges are required. Thus, smaller elements will "bubble" to the front, (or bigger elements will be "bubbled" to the back). The following snippet is able to visualize the behavior of the bubble sort:
import pylab
from random import shuffle

def bubblesort_anim(a):
 x = range(len(a))
 imgidx = 0
 # bubble sort algorithm
 swapped = True
 while swapped: # until there's no swapping
  swapped = False
  for i in range(len(a)-1):
   if a[i] > a[i+1]:
    a[i+1], a[i] = a[i], a[i+1] # swap
    swapped = True
  pylab.savefig("bubblesort/img" + '%04d' % imgidx + ".png")
  pylab.clf() # figure clear
  imgidx = imgidx + 1

# running the algorithm
a = range(300)
The snipped is based same technique we have seen to visualize the insertion sort algorithm and to build the animation. At each iteration an image that represents the status of the array is saved to the disk and ffmpeg is used to build a video as showed in the last post. This time the result should be as follows:

The animation is easy on the eyes and we can observe that during the sorting process the smallest elements approach to their correct positions as the bubbles go up in a glass of soda.

Stay tuned to see the animation of the insertion sort algorithm!

Sunday, February 24, 2013

Selection Sort animated

Lately I bumped in the Wikipedia page dedicated to the Selection Sort algorithm and I found an interesting animation that shows its behavior. So I decided to write this post which shows how to recreate that animation using Python.
The idea behind the selection sort is straightforward: at each iteration the unsorted element with the smallest (or largest) value is moved to its proper position in the array. Assume that we wish to sort an array in increasing order. We begin by selecting the lowest element and moving it to the lowest index position. We can do this by swapping the element at the lowest index and the lowest element. We then reduce the effective size of the unsorted items by one element and repeat the process on the smaller unsorted (sub)array. The process stops when the effective number of the unsorted items becomes 1.
Let's see a Python implementation of the selection sort which is able to visualize with a graph the status of the sorting at each iteration:
import pylab

def selectionsort_anim(a):
 x = range(len(a)) 
 for j in range(len(a)-1):
  iMin = j
  for i in range(j+1,len(a)):
   if a[i] < a[iMin]: # find the smallest value
    iMin = i
  if iMin != j: # place the value into its proper location
   a[iMin], a[j] = a[j], a[iMin]
  # plotting
  pylab.savefig("selectionsort/img" + '%04d' % j + ".png")
  pylab.clf() # figure clear

# running the algorithm
a = range(300) # initialization of the array
shuffle(a)     # shuffle!
selectionsort_anim(a) # sorting
At each iteration the status of the algorithm is visualized plotting the indexes of the array versus its values. Every plot is saved as an image and we can easily join them as a video using ffmpeg:
$ cd selectionsort # the directory where the images are
$ ffmpeg -qscale 5 -r 20 -b 9600 -i img%04d.png movie.mp4
The result should be as follows

At each frame of the video we can see that the elements on the left form a subset of items already sorted and the rest of the items remain to be sorted. It's nice to see that the number unsorted elements decrease at each iterations while the subset of sorted items grows forming a straight line.

This is just the first of a series of posts about the visualization of sorting algorithms. Stay tuned for the animations of other sorting algorithms!

Monday, February 18, 2013

Visualizing the tangent

The tangent to a curve is the straight line that touches the curve at a given point. One reason that tangents are so important is that they give the slopes of straight lines. So, if your curve represents a time series you can tell the ratio of change of your values just looking at the tangent.
Suppose that a curve is given as the graph of a function, y = f(x). We have that the slope in the point (a, f(a)) is equal to its derivative in a

and the equation of the tangent line can be stated as follows

With this in mind we can write a snippet of code which visualize the tangent of a curve:
from numpy import sin,linspace,power
from pylab import plot,show

def f(x): # sample function
 return x*sin(power(x,2))

# evaluation of the function
x = linspace(-2,4,150)
y = f(x)

a = 1.4
h = 0.1
fprime = (f(a+h)-f(a))/h # derivative
tan = f(a)+fprime*(x-a)  # tangent

# plot of the function and the tangent
The result is as follows:

The f is plotted with a blue curve and the tangent is the dashed line. Looking at the graph it is actually easy to observe that the tangent gives us a way to visualize the slope of a curve in a point. Let's see how this can help us in a practical example. Consider the fresh potatoes consumer price index between the years 1949 and 2006:
from numpy import arange
# Fresh potatoes: Annual Consumer price index, 1949-2006
# obatined at
price_index = [21.0,17.6,19.3,28.9,21.1,20.5,22.1,26.4,22.3,24.4,24.6,28.0,24.7,24.9,25.7,31.6,39.1,31.3,31.3,32.1,34.4,38.0,36.7,39.6,58.8,71.8,57.7,62.6,63.8,66.3,63.6,81.0,109.5,92.7,91.3,116.0,101.5,96.1,116.0,119.1,153.5,162.6,144.6,141.5,154.6,174.3,174.7,180.6,174.1,185.2,193.1,196.3,202.3,238.6,228.1,231.1,247.7,273.1]
t = np.arange(1949,2007)
From the calculus we have that the derivative is positive when f is increasing, it is negative when f is decreasing and zero when f has a saddle point. So, if we look at the tangent of the curve of the consumer price index in a certain year we have that it has a positive slope when the price index is increasing, a negative slope when the price are decreasing and it is constant when the trend is going to change. Using an interpolation of the data we loaded above we can plot the tangent in each year we want:
from scipy import interpolate

def draw_tangent(x,y,a):
 # interpolate the data with a spline
 spl = interpolate.splrep(x,y)
 small_t = arange(a-5,a+5)
 fa = interpolate.splev(a,spl,der=0)     # f(a)
 fprime = interpolate.splev(a,spl,der=1) # f'(a)
 tan = fa+fprime*(small_t-a) # tangent



The graph shows the data contained in the array price_index and shows the tangent of the curve for the years 1991 and 1998. Using the tangent, this graph gives an emphasis about the fact that the price index is decreasing during the years around 1991 and increasing around 1998.

Find out more about derivative approximation in the post Finite differences with Toeplitz matrix.

Wednesday, February 6, 2013

Betweenness Centrality

In Network Analysis the identification of important nodes is a common task. We have various centrality measures that we can use and in this post we will focus on the Betweenness Centrality. We will see how this measure is computed and how to use the library networkx in order to create a visualization of the network where the nodes with the highest betweenness are highlighted. The betweenness focuses on the number of visits through the shortests paths. If a walker moves from one node to another node via the shortests path, then the nodes with a large number of visits have a higher centrality. The betweenness centrality is defined as

where s(s,t) is total number of shortest paths from node s to node t and sv(s,t) is the number of those paths that pass through v.
Let's see how to compute the betweenness with networkx. As first step we have to load a sample network (yes, it's the same of this post):
# read the graph (gml format)
G = nx.read_gml('lesmiserables.gml',relabel=True)
Now we have a representation G of our network and we can use the function betweenness_centrality() to compute the centrality of each node. This function returns a list of tuples, one for each node, and each tuple contains the label of the node and the centrality value. We can use this information in order to trim the original network and keep only the most important nodes:
def most_important(G):
 """ returns a copy of G with
     the most important nodes
     according to the pagerank """ 
 ranking = nx.betweenness_centrality(G).items()
 print ranking
 r = [x[1] for x in ranking]
 m = sum(r)/len(r) # mean centrality
 t = m*3 # threshold, we keep only the nodes with 3 times the mean
 Gt = G.copy()
 for k, v in ranking:
  if v < t:
 return Gt

Gt = most_important(G) # trimming
And we can use the original network and the trimmed one to visualize the network as follows:
from pylab import show
# create the layout
pos = nx.spring_layout(G)
# draw the nodes and the edges (all)

# draw the most important nodes with a different style
# also the labels this time
The graph should be like this one:

This graph is pretty interesting, indeed it highlights the nodes which are very influential on the way the information spreads over the network. In the sample network we used each node represents a character and the connection between two characters represent the coappearance in the same chapter of the book 'Les miserable'. Looking at the graph we can easily say what are the most important characters according to the Betweenness Centrality. We can also observe some interesting situations like the ones of Valjean and Myriel. They are to connected to groups of characters who don't have a direct connection with the main ones.

Monday, January 28, 2013

A toy Bloom Filter

A Bloom Filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is based on a probabilistic mechanism where false positive retrieval results are possible, but false negatives are not. In this post we will see a pure python implementation of the Bloom Filter and the end we will see how to tune the parameters in order to minimize the number of false positive results.
Let's begin with a little bit of theory. The idea behind the filter is to allocate a bit vector of length m, initially all set to 0, and then choose k independent hash functions, h1, h2, ..., hk, each with range [1 m]. When an element a is added to the set then the bits at positions h(a)1, h(a)2, ..., h(a)k in the bit vector are set to 1. Given a query element q we can test whether it is in the set using the bits at positions h(q)1, h(q)2, ..., h(q)k in the vector. If any of these bits is 0 we report that q is not in the set otherwise we report that q is. The thing we have to care about is that in the first case there remains some probability that q is not in the set which could lead us to a false positive response.
The following class is a naive implementation of a Bloom Filter (pay attention: this implementation is not supposed to be suitable for production. It is made just to show how a Bloom Filter works and to study its behavior):
class Bloom:
 """ Bloom Filter """
 def __init__(self,m,k,hash_fun):
   m, size of the vector
   k, number of hash fnctions to compute
   hash_fun, hash function to use
  self.m = m
  # initialize the vector 
  # (attention a real implementation 
  #  should use an actual bit-array)
  self.vector = [0]*m
  self.k = k
  self.hash_fun = hash_fun = {} # data structure to store the data
  self.false_positive = 0

 def insert(self,key,value):
  """ insert the pair (key,value) in the database """[key] = value
  for i in range(self.k):
   self.vector[self.hash_fun(key+str(i)) % self.m] = 1

 def contains(self,key):
  """ check if key is cointained in the database
      using the filter mechanism """
  for i in range(self.k):
   if self.vector[self.hash_fun(key+str(i)) % self.m] == 0:
    return False # the key doesn't exist
  return True # the key can be in the data set

 def get(self,key):
  """ return the value associated with key """
  if self.contains(key):
    return[key] # actual lookup
   except KeyError:
    self.false_positive += 1
The usage of this filter is pretty easy, we have to initialize the data structure with a hash function, a value for k and the size of the bit vector then we can start adding items as in this example:
import hashlib

def hash_f(x):
 h = hashlib.sha256(x) # we'll use sha256 just for this example
 return int(h.hexdigest(),base=16)

b = Bloom(100,10,hash_f)
b.insert('this is a key','this is a value')
print b.get('this is a key')
Now, the problem is to choose the parameters of the filter in order to minimize the number of false positive results. We have that after inserting n elements into a table of size m, the probability that a particular bit is still 0 is exactly

Hence, afer n insertions, the probability that a certain bit is 1 is

So, for fixed parameters m and n, the optimal value k that minimizes this probability is

With this in mind we can test our filter. The first thing we need is a function which tests the Bloom Filter for fixed values of m, n and k countinig the percentage of false positive:
import random

def rand_data(n, chars):
 """ generate random strings using the characters in chars """
 return ''.join(random.choice(chars) for i in range(n))

def bloomTest(m,n,k):
 """ return the percentage of false positive """
 bloom = Bloom(m,k,hash_f)
 # generating a random data
 rand_keys = [rand_data(10,'abcde') for i in range(n)]
 # pushing the items into the data structure
 for rk in rand_keys:
 # adding other elements to the dataset
 rand_keys = rand_keys + [rand_data(10,'fghil') for i in range(n)]
 # performing a query for each element of the dataset
 for rk in rand_keys:
 return float(bloom.false_positive)/n*100.0
If we fix m = 10000 and n = 1000, according to the equations above, we have that the value of k which minimizes the false positive number is around 6.9314. We can confirm that experimentally with the following test:
# testing the filter
m = 10000
n = 1000
k = range(1,64)
perc = [bloomTest(m,n,kk) for kk in k] # k is varying

# plotting the result of the test
from pylab import plot,show,xlabel,ylabel
ylabel('false positive %')
The result of the test should be as follows

Looking at the graph we can confirm that for k around 7 we have the lowest false positive percentage.

Monday, January 14, 2013

Box-Muller Transformation

The Box-Muller transform is a method for generating normally distributed random numbers from uniformly distributed random numbers. The Box-Muller transformation can be summarized as follows, suppose u1 and u2 are independent random variables that are uniformly distributed between 0 and 1 and let

then z1 and z2 are independent random variables with a standard normal distribution. Intuitively, the transformation maps each circle of points around the origin to another circle of points around the origin where larger outer circles are mapped to closely-spaced inner circles and inner circles to outer circles.
Let's see a Python snippet that implements the transformation:
from numpy import random, sqrt, log, sin, cos, pi
from pylab import show,hist,subplot,figure

# transformation function
def gaussian(u1,u2):
  z1 = sqrt(-2*log(u1))*cos(2*pi*u2)
  z2 = sqrt(-2*log(u1))*sin(2*pi*u2)
  return z1,z2

# uniformly distributed values between 0 and 1
u1 = random.rand(1000)
u2 = random.rand(1000)

# run the transformation
z1,z2 = gaussian(u1,u2)

# plotting the values before and after the transformation
subplot(221) # the first row of graphs
hist(u1)     # contains the histograms of u1 and u2 
subplot(223) # the second contains
hist(z1)     # the histograms of z1 and z2
The result should be similar to the following:

In the first row of the graph we can see, respectively, the histograms of u1 and u2 before the transformation and in the second row we can see the values after the transformation, respectively z1 and z2. We can observe that the values before the transformation are distributed uniformly while the histograms of the values after the transformation have the typical Gaussian shape.

Thursday, January 3, 2013

Book review: NumPy Cookbook

This year I have the chance to review the book NumPy Cookbook written by Ivan Idris and published by Packt Publishing. It introduces the numpy library by examples (which the author refers as recipes :). It is written with a simple language and it covers a wide range of topics, from the istallation of numpy to the combination with Cython.

My impression of the book was good and, in particular, I liked the structure of the book. Every chapter face a series of problem related to the a specific topic through examples. Each example comes with an introduction to the problem that will be solved, the code commented line by line and a short recap of the techniques applied to solve the problem. Most of the examples are about practical problems and the code is made to be adapted in your own projects.

Favorite chapters

Chapters 5 and 10 are my favorite. The first one is about audio end image processing and explains some basic operation about the manipulation, the generation and the filtering of audio and video signals. The second is about the combination of numpy with some scikits,like scikits-learn, scikits-statsmodels and pandas. I loved these chapters because they cover some topics related to complex fields, such as machine learning and data analysis, in a very straightforward fashion.

Favorite example

Some examples presented by the book kept my attention. In particular, I found very interesting the one about the generation of the Mandelbrot. This example contains an explanation of the mathematical formula behind the fractal and the combination of the image generated using the formula and a simpler one. It is my favorite because provides one of the most practical explanation of the Mandelbrot fractal I have ever seen.


This book could be a good starting point for who want to begin with numpy using a gentle approach. It can be used also as a manual which can help you in the development of small parts of more complex projects.