Wednesday, November 26, 2014

Comparing strikers statistics

Here we compare the scoring statistics of four of the best strikers of the recent football history: Del Piero, Trezeguet, Ronaldo and Vieri. The statistics that we will look at are the scoring trajectory, scoring rate and number of appearances.
To compute these values we need to scrape the career statistics (number of goals and appearances per season) on the Wikipedia pages of the players:
from bs4 import BeautifulSoup
from urllib2 import urlopen

def get_total_goals(url):
    Given the url of a wikipedia page about a football striker
    returns three numy arrays:
    - years, each element corresponds to a season
    - apprearances, contains the number of appearances each season
    - goals, contains the number of goal scored each season
    Unfortunately this function is able to parse 
    only the pages of few strikers.
    soup = BeautifulSoup(urlopen(url).read())
    table = soup.find("table", { "class" : "wikitable" })
    years = []
    apps = []
    goals = []
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        if len(cells) > 1:
    return np.array(years), 
           np.array(apps, dtype='float'), 

ronaldo = get_total_goals('')
vieri = get_total_goals('')
delpiero = get_total_goals('')
trezeguet = get_total_goals('')
Now we are ready to compute our statistics. For each statistics we will produce an interactive chart using plotly.

Scoring trajectory

import plotly.plotly as py
from plotly.graph_objs import *
py.sign_in("sexyusername", "mypassword")

data = Data([
            name='Del Piero', mode='lines'),
            name='Trezeguet', mode='lines'),
            name='Ronaldo', mode='lines'),
            name='Vieri', mode='lines'),

layout = Layout(
    title='Scoring Trajectory',
    yaxis=YAxis(title='Cumuative goal'),

fig = Figure(data=data, layout=layout)

py.iplot(fig, filename='cumulative-goals')
The scoring trajectory is given by the yearly cumulative totals of goals scored. From the scoring trajectories we can see that Ronaldo was a goal machine since his first professional season and his worse period was from 1999 to 2001. Del Piero and Trezeguet have the longest careers (and they're still playing!). Vieri had the shortest career but it's impressive to see that the number of goals he scored increased almost constantly from 1996 to 2004.

Scoring rate

data = Data([
        x=['Ronaldo', 'Vieri', 'Trezeguet', 'Del Piero'],
py.iplot(data, filename='goal-average')
The scoring rate is the number of goals scored divided by the number of appearances. Ronaldo has a terrific 0.67 scoring rate, meaning that, on average he scored more than three goals each five games. Vieri and Trezeguet have a very similar scoring rate, almost one goal each two games. While Del Piero has 0.40, two goals each five games.


data = Data([
        x=['Del Piero', 'Trezeguet', 'Ronaldo', 'Vieri'],
py.iplot(data, filename='appearances')
The number of Del Piero's appearances on a football field is impressive. At the moment I'm writing, he played 773 games. No one of the other players was able to play the 70% of the games played by the Italian numero 10.

Friday, October 17, 2014

Andrews curves

Andrews curves are a method for visualizing multidimensional data by mapping each observation onto a function. This function is defined as

It has been shown the Andrews curves are able to preserve means, distance (up to a constant) and variances. Which means that Andrews curves that are represented by functions close together suggest that the corresponding data points will also be close together. Now, we will demonstrate the effectiveness of the Andrew curves on the iris dataset (which we already used here). Let's create a function to compute the values of the functions give a single sample:
import numpy as np
def andrew_curve4(x,theta):
    # iris has 4 four dimensions
    base_functions = [lambda x : x[0]/np.sqrt(2.), 
                      lambda x : x[1]*np.sin(theta), 
                      lambda x : x[2]*np.cos(theta), 
                      lambda x : x[3]*np.sin(2.*theta)]
    curve = np.zeros(len(theta))
    for f in base_functions:
        curve = curve + f(x)
    return curve
At this point we can load the dataset and plot the curves for a subset of samples:
samples = np.loadtxt('iris.csv', usecols=[0,1,2,3], delimiter=',')
#samples = samples - np.mean(samples)
#samples = samples / np.std(samples)
classes = np.loadtxt('iris.csv', usecols=[4], delimiter=',',dtype=np.str)
theta = np.linspace(-np.pi,np.pi,100)
import pylab as pl
for s in samples[:20]: # setosa
    pl.plot(theta, andrew_curve4(s,theta), 'r')

for s in samples[50:70]: # versicolor
    pl.plot(theta, andrew_curve4(s,theta), 'b')

for s in samples[100:120]: # virginica
    pl.plot(theta, andrew_curve4(s,theta), 'g')


In the plot above, the each color used represents a class and we can easily note that the lines that represent samples from the same class have similar curves.

Wednesday, September 24, 2014

Text summarization with NLTK

The target of the automatic text summarization is to reduce a textual document to a summary that retains the pivotal points of the original document. The research about text summarization is very active and during the last years many summarization algorithms have been proposed.
In this post we will see how to implement a simple text summarizer using the NLTK library (which we also used in a previous post) and how to apply it to some articles extracted from the BBC news feed. The algorithm that we are going to see tries to extract one or more sentences that cover the main topics of the original document using the idea that, if a sentences contains the most recurrent words in the text, it probably covers most of the topics of the text. Here's the Python class that implements the algorithm:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class FrequencySummarizer:
  def __init__(self, min_cut=0.1, max_cut=0.9):
     Initilize the text summarizer.
     Words that have a frequency term lower than min_cut 
     or higer than max_cut will be ignored.
    self._min_cut = min_cut
    self._max_cut = max_cut 
    self._stopwords = set(stopwords.words('english') + list(punctuation))

  def _compute_frequencies(self, word_sent):
      Compute the frequency of each of word.
       word_sent, a list of sentences already tokenized.
       freq, a dictionary where freq[w] is the frequency of w.
    freq = defaultdict(int)
    for s in word_sent:
      for word in s:
        if word not in self._stopwords:
          freq[word] += 1
    # frequencies normalization and fitering
    m = float(max(freq.values()))
    for w in freq.keys():
      freq[w] = freq[w]/m
      if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
        del freq[w]
    return freq

  def summarize(self, text, n):
      Return a list of n sentences 
      which represent the summary of text.
    sents = sent_tokenize(text)
    assert n <= len(sents)
    word_sent = [word_tokenize(s.lower()) for s in sents]
    self._freq = self._compute_frequencies(word_sent)
    ranking = defaultdict(int)
    for i,sent in enumerate(word_sent):
      for w in sent:
        if w in self._freq:
          ranking[i] += self._freq[w]
    sents_idx = self._rank(ranking, n)    
    return [sents[j] for j in sents_idx]

  def _rank(self, ranking, n):
    """ return the first n sentences with highest ranking """
    return nlargest(n, ranking, key=ranking.get)
The FrequencySummarizer tokenizes the input into sentences then computes the term frequency map of the words. Then, the frequency map is filtered in order to ignore very low frequency and highly frequent words, this way it is able to discard the noisy words such as determiners, that are very frequent but don't contain much information, or words that occur only few times. And finally, the sentences are ranked according to the frequency of the words they contain and the top sentences are selected for the final summary.

To test the summarizer, let's create a function that extract the natural language from a html page using BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

def get_only_text(url):
  return the title and the text of the article
  at the specified url
 page = urllib2.urlopen(url).read().decode('utf8')
 soup = BeautifulSoup(page)
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 return soup.title.text, text
We can finally apply our summarizer on a set of articles extracted from the BBC news feed:
feed_xml = urllib2.urlopen('').read()
feed = BeautifulSoup(feed_xml.decode('utf8'))
to_summarize = map(lambda p: p.text, feed.find_all('guid'))

fs = FrequencySummarizer()
for article_url in to_summarize[:5]:
  title, text = get_only_text(article_url)
  print '----------------------------------'
  print title
  for s in fs.summarize(text, 2):
   print '*',s
And here are the results:
BBC News - Scottish independence: Campaigns seize on Scotland powers pledge
* Speaking ahead of a visit to apprentices at an engineering firm in 
Renfrew, Deputy First Minister Nicola Sturgeon said: Only a 'Yes' vote will 
ensure we have full powers over job creation - enabling us to create more 
and better jobs across the country.
* Asked if the move smacks of panic, Mr Alexander told BBC Breakfast: 
I don't think there's any embarrassment about placing policies on the 
front page of papers with just days two go.
BBC News - US air strike supports Iraqi troops under attack
* Gabriel Gatehouse reports from the front line of Peshmerga-held territory 
in northern Iraq The air strike south-west of Baghdad was the first taken as 
part of our expanded efforts beyond protecting our own people and humanitarian 
missions to hit Isil targets as Iraqi forces go on offence, as outlined in the 
president's speech last Wednesday, US Central Command said.
* But Iran's Supreme Leader Ayatollah Ali Khamenei said on Monday that the US 
had requested Iran's co-operation via the US ambassador to Iraq.
BBC News - Passport delay victims deserve refund, say MPs
* British adult passport costs Normal service - £72.50 Check  Send - 
Post Office staff check application correct and it is sent by Special Delivery 
- £81.25 Fast-Track - Applicant attends Passport Office in person and passport 
delivered within one week - £103 Premium - Passport available for collection 
on same day applicant attends Passport Office - £128 In mid-June it announced 
that - for people who could prove they were booked to travel within seven days 
and had submitted passport applications more than three weeks earlier - there 
would be a free upgrade to its fast-track service.
* The Passport Office has since cut the number of outstanding applications to 
around 90,000, but the report said: A number of people have ended up 
out-of-pocket due to HMPO's inability to meet its service standard.
BBC News - UK inflation rate falls to 1.5%
* Howard Archer, chief UK and European economist at IHS Global Insight, 
said: August's muted consumer price inflation is welcome news for consumers' 
purchasing power as they currently continue to be hampered by very 
low earnings growth.
* Consumer Price Index (CPI) inflation fell to 1.5% from 1.6% in August, 
the Office for National Statistics said.
BBC News - Thailand deaths: Police have 'number of suspects'
* The BBC's Jonathan Head, on Koh Tao, says police are focussing on the 
island's Burmese community BBC south-east Asia correspondent Jonathan Head 
said the police's focus on Burmese migrants would be quite controversial as 
Burmese people were often scapegoated for crimes in Thailand.
* By Jonathan Head, BBC south-east Asia correspondent The shocking death of 
the two young tourists has cast a pall over this scenic island resort Locals 
say they can remember nothing like it happening before.
Of course, the evaluation a text summarizer is not an easy task. But, from the results above we note that the summarizer often picked quoted text reported in the original article and that the sentences picked by the summarizer often represent decent insights if we consider the title of the article.

Wednesday, August 27, 2014

Visualizing electricity prices with Plotly

We have already mentioned plotly many times (here are other two posts about it) and this time we'll see how to use it in order to build an interactive visualization of the latest data about the domestic electricity prices provided by International Energy Agency (IEA).

In the chart that we are going to make, we will show the prices of the domestic electricity among the countries monitored by IEA in 2013 with a bar chart where each bar shows the electricity price and the fraction of the price represented by the taxes.

First, we import the data (the full data is available here, in this post we'll use only the Table 5.5.1 in cvs format) using pandas:
import pandas as pd
ieaprices = pd.read_csv('iea_prices.csv',
ieaprices = ieaprices.dropna()
countries = ieaprices.sort('2013_with_tax').index
Then, we arrange the data in order create a plotly bar chart:
from plotly.graph_objs import Bar,Data,Layout,Figure
from plotly.graph_objs import XAxis,YAxis,Marker,Scatter,Legend

prices_bars = []

# computing the taxes
taxes = ieaprices['2013_with_tax']-ieaprices['2013_no_tax']

# adding the prices to the chart
             name='price without taxes'))

# adding the taxes to the chart
And now we are ready to submit the data to the plotly server to render the chart:
import plotly.plotly as py

py.sign_in("SexyUser", "asexykeyforasexyuser")

meadian_line = Scatter(
    marker=Marker(color='rgb(40, 40, 40)'),

data = Data(prices_bars+[meadian_line])

layout = Layout(
    title='Domestic electricity prices in the IEA in 2013',
    yaxis=YAxis(title='Price (Pence per Kwh)'),

fig = Figure(data=data, layout=layout)

# this line will work only in ipython
# use py.plot() in other environments
plot_url = py.iplot(fig, filename='ieaprices2013') 
The result should look like this:

Looking at the chart we note that, during 2013, the average domestic electricity prices, including taxes, in Denmark and Germany were the highest in the IEA. We also note that in Denmark the fraction of taxes paid is higher than the actual electricity price whereas in Germany the actual electricity price and the taxes are almost the same. Interestingly, USA has the lowest price and the lowest taxation.

This post shows how to create one of the charts commented here, where a more insights about the IEA data are provided.

Wednesday, August 20, 2014

Quick HDF5 with Pandas

HDF5 is a format designed to store large numerical arrays of homogenous type. It cames particularly handy when you need to organize your data models in a hierarchical fashion and you also need a fast way to retrieve the data. Pandas implements a quick and intuitive interface for this format and in this post will shortly introduce how it works.

We can create a HDF5 file using the HDFStore class provided by Pandas:
import numpy as np
from pandas import HDFStore,DataFrame
# create (or open) an hdf5 file and opens in append mode
hdf = HDFStore('storage.h5')
Now we can store a dataset into the file we just created:
df = DataFrame(np.random.rand(5,3), columns=('A','B','C'))
# put the dataset in the storage
hdf.put('d1', df, format='table', data_columns=True)
The structure used to represent the hdf file in Python is a dictionary and we can access to our data using the name of the dataset as key:
print hdf['d1'].shape
(5, 3)
The data in the storage can be manipulated. For example, we can append new data to the dataset we just created:
hdf.append('d1', DataFrame(np.random.rand(5,3), 
           format='table', data_columns=True)
hdf.close() # closes the file
There are many ways to open a hdf5 storage, we could use again the constructor of the class HDFStorage, but the function read_hdf makes us also able to query the data:
from pandas import read_hdf
# this query selects the columns A and B
# where the values of A is greather than 0.5
hdf = read_hdf('storage.h5', 'd1',
               where=['A>.5'], columns=['A','B'])
At this point, we have a storage which contains a single dataset. The structure of the storage can be organized using groups. In the following example we add three different datasets to the hdf5 file, two in the same group and another one in a different one:
hdf = HDFStore('storage.h5')
hdf.put('tables/t1', DataFrame(np.random.rand(20,5)))
hdf.put('tables/t2', DataFrame(np.random.rand(10,3)))
hdf.put('new_tables/t1', DataFrame(np.random.rand(15,2)))
Our hdf5 storage now looks like this:
print hdf

File path: storage.h5
/d1             frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/new_tables/t1  frame        (shape->[15,2])                                                   
/tables/t1      frame        (shape->[20,5])                                                   
/tables/t2      frame        (shape->[10,3])  
On the left we can see the hierarchy of the groups added to the storage, in the middle we have the type of dataset and on the right there is the list of attributes attached to the dataset. Attributes are pieces of metadata you can stick on objects in the file and the attributes we see here are automatically created by Pandas in order to describe the information required to recover the data from the hdf5 storage system.

Friday, May 23, 2014

Code parallelization with joblib

Recently I've been working on the parallelization of some Python code and I discovered Joblib. It is a library that supports pipelining and offers a good support for parallelization. In this post we will implement a (very naive) paraller matrix by matrix multiplication algorithm to show the parallelization capabilities of this library.
from joblib import Parallel, delayed

def parallel_dot(A,B,n_jobs=2):
     Computes A x B using more CPUs.
     This works only when the number 
     of rows of A and the n_jobs are even.
    parallelizer = Parallel(n_jobs=n_jobs)
    # this iterator returns the functions to execute for each task
    tasks_iterator = ( delayed(,B) 
                      for A_block in np.split(A,n_jobs) )
    result = parallelizer( tasks_iterator )
    # merging the output of the jobs
    return np.vstack(result)
This function spreads the computation across more processes. The strategy applied to distribute the data is very simple. Each process has the full matrix B and a contiguous block of rows of A, so it can compute a block of rows A*B. In the end, the result of each process is stacked to build final matrix.

Let's compare the parallel version of the algorithm with the sequential one:
A = np.random.randint(0,high=10,size=(1000,1000))
B = np.random.randint(0,high=10,size=(1000,1000))
%time _ =,B)
CPU times: user 13.2 s, sys: 36 ms, total: 13.2 s
Wall time: 13.4 s
%time _ = parallel_dot(A,B,n_jobs=2)
CPU times: user 92 ms, sys: 76 ms, total: 168 ms
Wall time: 8.49 s
Wow, we had a speedup of 1.6X, not bad for a so naive algorithm. It's important to notice that the arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process. Which means that the last time that parallel_dot have been called, the matrix B have been entirely replicated two times in memory. To avoid this problem, we can dump the matrices on the filesystem and pass a reference to the worker to open them as memory map.
import tempfile
import os
from joblib import load, dump

# saving A and B to a local file for memmapping
temp_folder = tempfile.mkdtemp()
filenameA = os.path.join(temp_folder, 'A.mmap')
dump(A, filenameA)
filenameB = os.path.join(temp_folder, 'B.mmap')
dump(A, filenameB)
Now, when parallel_dot(A_memmap,B_memmap,n_jobs=2) is called, both the processes created will use only a reference to the matrix B..

Tuesday, April 22, 2014

Parameters selection with Cross-Validation

Most of the pattern recognition techniques have one or more free parameters and choose them for a given classification problem is often not a trivial task. In real applications we only have access to a finite set of examples, usually smaller than we wanted, and we need to test our model on samples not seen during the training process. A model that would just classify the samples that it has seen would have a very good score, but would definitely fail to predict unseen data. This situation is called overfitting and to avoid it we need to apply an appropriate validation procedure to select the parameters. A tool that can help us solve this problem is the Cross-Validation (CV). The idea behind CV is simple: the data are split into train and test sets several consecutive times and the averaged value of the prediction scores obtained with the different sets is the evaluation of the classifier.
Let's see a simple example where a smoothing parameter for a Bayesian classifier is select using the capabilities of the Sklearn library.
To begin we load one of the test datasets provided by sklearn (the same used here) and we hold 33% of the samples for the final evaluation:
from sklearn.datasets import load_digits
data = load_digits()
from sklearn.cross_validation import train_test_split
X,X_test,y,y_test = train_test_split(,,
Now, we import the classifier we want to use (a Bernoullian Naive Bayes in this case), specify a set of values for the parameter we want to choose and run a grid search:
from sklearn.naive_bayes import BernoulliNB
# test the model for alpha = 0.1, 0.2, ..., 1.0
parameters = [{'alpha':np.linspace(0.1,1,10)}]

from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(BernoulliNB(), parameters, cv=10, scoring='f1'),y) # running the grid search
The grid search has evaluated the classifier for each value specified for the parameter alpha using the CV. We can visualize the results as follows:
res = zip(*[(f1m, f1s.std(), p['alpha']) 
            for p, f1m, f1s in clf.grid_scores_])

The plots above show the average score (top) and the standard deviation of the score (bottom) for each values of alpha used. Looking at the graphs it seems plausible that a small alpha could be a good choice.
We can also see thet using the alpha value that gave us the best results on the the test set we selected at the beginning gives us results that are similar to the ones obtained during the CV stage:
from sklearn.metrics import f1_score
print 'Best alpha in CV = %0.01f' % clf.best_params_['alpha']
final = f1_score(y_test,clf.best_estimator_.predict(X_test))
print 'F1-score on the final testset: %0.5f' % final
Best alpha in CV = 0.1
F1-score on the final testset: 0.85861

Wednesday, February 26, 2014

Terms selection with chi-square

In Natural Language Processing, the identification the most relevant terms in a collection of documents is a common task. It can produce meaningful insights about the data and it can also be useful to improve classification performances and computational efficiency. A popular measure of relevance for terms is the χ2 statistic. To compute it we can convert the terms of our document collection and turn them into features of a vectorial model, then χ2 can be computed as follow:

Where f is a feature (a term in this case), t is a target variable that we, usually, want to predict, A is the number of times that f and t cooccur, B is the number of times that f occurs without t, C is the number of times that t occurs without f, D is the number of times neither t or f occur and N is the number of observations.

Let's see how χ2 can be used through a simple example. We load some posts from 4 different newsgroups categories using the sklearn interface:
from sklearn.datasets import fetch_20newsgroups
 # newsgroups categories
categories = ['alt.atheism','talk.religion.misc',

posts = fetch_20newsgroups(subset='train', categories=categories,
                           shuffle=True, random_state=42,
From the posts loaded, we build a linear model using all the terms in the document collection but the stop words:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(
Now, X is a document-term matrix where the element Xi,j is the frequency of the term j in the document i. Then, the features are given by the columns of X and we want to compute χ2 between the categories of interest and each feature in order to figure out what are the most relevant terms. This can be done as follows
from sklearn.feature_selection import chi2
# compute chi2 for each feature
chi2score = chi2(X,[0]
To have a visual insight, we can plot a bar chart where each bar shows the χ2 value computed above:
from pylab import barh,plot,yticks,show,grid,xlabel,figure
wscores = zip(vectorizer.get_feature_names(),chi2score)
wchi2 = sorted(wscores,key=lambda x:x[1]) 
topchi2 = zip(*wchi2[-25:])
x = range(len(topchi2[1]))
labels = topchi2[0]

We can observe that the terms with a high χ2 can be considered relevant for the newsgroup categories we are analyzing. For example, the terms space, nasa and launch can be considered relevant for the group The terms god, jesus and atheism can be considered relevant for the groups alt.atheism and talk.religion.misc. And, the terms image, graphics and jpeg can be considered relevant in the category

Tuesday, January 14, 2014

Review: Fundamentals of Data Analytics in Python

I massively use Python for data analysis and when I was offered to review the video tutorial with the title “Fundamentals of Data Analytics in Python LiveLessons”, I couldn't refuse.

The tutorial starts from the basics showing how to install Python and its data analysis libraries. Then it continues explaining the main uses that data scientists and engineers practice during their analysis: importing and cleaning data, vectorial computing, visualization and data summarization.

Most of the videos are commented sessions of IPython notebook sometimes supported by some slides. The authors go deep into the explanation of how to use the libraries for the manipulation of the data (Numpy, Scipy and Pandas), while they summarize the potential of the other complementary libraries. In particular, the last video is a survey of various visualization tools.

In conclusion, this video tutorial provides a solid introduction to the main tools for data analysis in Python and a clear view of the open source Python tools relevant to scientific and engineering programming. This tutorial seems perfect for people who need to learn the technical methodologies for data analysis and for people who already know Python but want to acquire skills about data analysis.