Wednesday, November 26, 2014

Comparing strikers' statistics

Here we compare the scoring statistics of four of the best strikers in recent football history: Del Piero, Trezeguet, Ronaldo and Vieri. The statistics we will look at are the scoring trajectory, the scoring rate and the number of appearances.
To compute these values we need to scrape the career statistics (number of goals and appearances per season) from the players' Wikipedia pages:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import numpy as np

def get_total_goals(url):
    """
    Given the url of a Wikipedia page about a football striker,
    returns three numpy arrays:
    - years, each element corresponds to a season
    - appearances, contains the number of appearances each season
    - goals, contains the number of goals scored each season
    Unfortunately this function is able to parse
    only the pages of a few strikers.
    """
    soup = BeautifulSoup(urlopen(url).read())
    table = soup.find("table", { "class" : "wikitable" })
    years = []
    apps = []
    goals = []
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        if len(cells) > 1:
    return (np.array(years),
            np.array(apps, dtype='float'),
            np.array(goals, dtype='float'))

ronaldo = get_total_goals('')
vieri = get_total_goals('')
delpiero = get_total_goals('')
trezeguet = get_total_goals('')
Now we are ready to compute our statistics. For each statistic we will produce an interactive chart using plotly.

Scoring trajectory

import plotly.plotly as py
from plotly.graph_objs import *
py.sign_in("sexyusername", "mypassword")

data = Data([
            name='Del Piero', mode='lines'),
            name='Trezeguet', mode='lines'),
            name='Ronaldo', mode='lines'),
            name='Vieri', mode='lines'),

layout = Layout(
    title='Scoring Trajectory',
    yaxis=YAxis(title='Cumulative goals'),
)

fig = Figure(data=data, layout=layout)

py.iplot(fig, filename='cumulative-goals')
The scoring trajectory is given by the yearly cumulative totals of goals scored. From the scoring trajectories we can see that Ronaldo was a goal machine from his first professional season and that his worst period was from 1999 to 2001. Del Piero and Trezeguet have the longest careers (and they're still playing!). Vieri had the shortest career, but it's impressive to see that his cumulative total grew at an almost constant pace from 1996 to 2004.
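As a minimal sketch of how a scoring trajectory can be computed, the snippet below takes a running total of per-season goals with `np.cumsum`. The goal counts are hypothetical values chosen for illustration, not figures scraped from Wikipedia.

```python
import numpy as np

# Hypothetical per-season goal counts, for illustration only.
goals = np.array([12, 30, 34, 25, 14])

# The scoring trajectory is the running total of goals, season by season.
trajectory = np.cumsum(goals)
print(trajectory)
```

The resulting array has one entry per season, so it can be plotted directly against the list of seasons.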

Scoring rate

data = Data([
        x=['Ronaldo', 'Vieri', 'Trezeguet', 'Del Piero'],
py.iplot(data, filename='goal-average')
The scoring rate is the number of goals scored divided by the number of appearances. Ronaldo has a terrific scoring rate of 0.67, meaning that on average he scored more than three goals every five games. Vieri and Trezeguet have very similar scoring rates, almost one goal every two games, while Del Piero has 0.40, that is, two goals every five games.
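The following sketch shows one way to compute such a rate, assuming it is taken over career totals (total goals divided by total appearances). The helper name `scoring_rate` and the sample numbers are hypothetical and only serve the illustration.

```python
import numpy as np

def scoring_rate(apps, goals):
    """Career goals divided by career appearances (assumed definition)."""
    return np.sum(goals) / float(np.sum(apps))

# Hypothetical per-season values, for illustration only.
apps = np.array([30.0, 28.0, 32.0])
goals = np.array([18.0, 20.0, 22.0])

rate = scoring_rate(apps, goals)   # 60 goals in 90 games
print(round(rate, 2), "goals per game,", round(5 * rate, 2), "every five games")
```

Multiplying the rate by five gives the "goals every five games" reading used in the text.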


Appearances

data = Data([
        x=['Del Piero', 'Trezeguet', 'Ronaldo', 'Vieri'],
py.iplot(data, filename='appearances')
The number of Del Piero's appearances on a football field is impressive. At the time of writing, he has played 773 games. None of the other players has played even 70% of the games played by the Italian numero 10.

Friday, October 17, 2014

Andrews curves

Andrews curves are a method for visualizing multidimensional data by mapping each observation onto a function. For an observation $x = (x_1, x_2, \dots, x_d)$, this function is defined as

$$f_x(\theta) = \frac{x_1}{\sqrt{2}} + x_2 \sin(\theta) + x_3 \cos(\theta) + x_4 \sin(2\theta) + x_5 \cos(2\theta) + \dots$$

It has been shown that Andrews curves preserve means, distances (up to a constant) and variances. This means that Andrews curves that are close together suggest that the corresponding data points are also close together. Now we will demonstrate the effectiveness of Andrews curves on the iris dataset (which we already used here). Let's create a function that computes the values of the curve given a single sample:
import numpy as np
def andrew_curve4(x,theta):
    # iris has four dimensions
    base_functions = [lambda x : x[0]/np.sqrt(2.), 
                      lambda x : x[1]*np.sin(theta), 
                      lambda x : x[2]*np.cos(theta), 
                      lambda x : x[3]*np.sin(2.*theta)]
    curve = np.zeros(len(theta))
    for f in base_functions:
        curve = curve + f(x)
    return curve
At this point we can load the dataset and plot the curves for a subset of samples:
samples = np.loadtxt('iris.csv', usecols=[0,1,2,3], delimiter=',')
#samples = samples - np.mean(samples)
#samples = samples / np.std(samples)
classes = np.loadtxt('iris.csv', usecols=[4], delimiter=',', dtype=str)
theta = np.linspace(-np.pi,np.pi,100)
import pylab as pl
for s in samples[:20]: # setosa
    pl.plot(theta, andrew_curve4(s,theta), 'r')

for s in samples[50:70]: # versicolor
    pl.plot(theta, andrew_curve4(s,theta), 'b')

for s in samples[100:120]: # virginica
    pl.plot(theta, andrew_curve4(s,theta), 'g')

pl.show()

In the plot above, each color represents a class, and we can easily see that the lines representing samples from the same class have similar curves.
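The function used in this post is specific to four dimensions. As a sketch under the standard definition of Andrews curves (terms alternating between sine and cosine, with the harmonic increasing every two terms), it could be generalized to samples of any length. The name `andrews_curve` is hypothetical and not part of the original post.

```python
import numpy as np

def andrews_curve(x, theta):
    """Evaluate the Andrews curve of sample x at the angles in theta."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # The first component contributes a constant term x_1 / sqrt(2).
    curve = np.full(theta.shape, x[0] / np.sqrt(2.0))
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                    # 1, 1, 2, 2, 3, 3, ...
        basis = np.sin if i % 2 == 1 else np.cos   # sin, cos, sin, cos, ...
        curve = curve + coeff * basis(harmonic * theta)
    return curve

theta = np.linspace(-np.pi, np.pi, 100)
sample = [5.1, 3.5, 1.4, 0.2]   # a four-dimensional sample in iris-like units
values = andrews_curve(sample, theta)
```

For a four-dimensional sample this evaluates the same four terms as the function above, so it can be swapped into the plotting loop without changing the picture.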

Wednesday, September 24, 2014

Text summarization with NLTK

The goal of automatic text summarization is to reduce a textual document to a summary that retains the pivotal points of the original document. Research on text summarization is very active, and in recent years many summarization algorithms have been proposed.
In this post we will see how to implement a simple text summarizer using the NLTK library (which we also used in a previous post) and how to apply it to some articles extracted from the BBC news feed. The algorithm that we are going to see tries to extract one or more sentences that cover the main topics of the original document, using the idea that if a sentence contains the most recurrent words in the text, it probably covers most of the topics of the text. Here's the Python class that implements the algorithm:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class FrequencySummarizer:
  def __init__(self, min_cut=0.1, max_cut=0.9):
    """
     Initialize the text summarizer.
     Words that have a normalized frequency lower than min_cut
     or higher than max_cut will be ignored.
    """
    self._min_cut = min_cut
    self._max_cut = max_cut 
    self._stopwords = set(stopwords.words('english') + list(punctuation))

  def _compute_frequencies(self, word_sent):
    """
      Compute the frequency of each word.
      Input:
       word_sent, a list of sentences already tokenized.
      Output:
       freq, a dictionary where freq[w] is the frequency of w.
    """
    freq = defaultdict(int)
    for s in word_sent:
      for word in s:
        if word not in self._stopwords:
          freq[word] += 1
    # frequencies normalization and filtering
    m = float(max(freq.values()))
    for w in list(freq.keys()):  # iterate over a copy, since entries are deleted
      freq[w] = freq[w]/m
      if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
        del freq[w]
    return freq

  def summarize(self, text, n):
    """
      Return a list of n sentences
      which represent the summary of text.
    """
    sents = sent_tokenize(text)
    assert n <= len(sents)
    word_sent = [word_tokenize(s.lower()) for s in sents]
    self._freq = self._compute_frequencies(word_sent)
    ranking = defaultdict(int)
    for i,sent in enumerate(word_sent):
      for w in sent:
        if w in self._freq:
          ranking[i] += self._freq[w]
    sents_idx = self._rank(ranking, n)    
    return [sents[j] for j in sents_idx]

  def _rank(self, ranking, n):
    """ return the first n sentences with highest ranking """
    return nlargest(n, ranking, key=ranking.get)
The FrequencySummarizer tokenizes the input into sentences and then computes the term frequency map of the words. The frequency map is then filtered to ignore both very rare and very frequent words; this discards noisy words such as determiners, which are very frequent but carry little information, as well as words that occur only a few times. Finally, the sentences are ranked according to the frequency of the words they contain, and the top sentences are selected for the final summary.
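To make the ranking step concrete, here is a small self-contained sketch of the same idea that does not depend on NLTK. It is a simplification, not the class above: sentence and word tokenization are done with naive regular expressions, and `STOPWORDS` is a tiny hypothetical list standing in for NLTK's English stopwords. The sample text is made up for the illustration.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stopword list (an assumption; the post uses NLTK's list).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "on", "in", "to", "it"}

def summarize(text, n, min_cut=0.1, max_cut=0.9):
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_sent = [re.findall(r"[a-z']+", s.lower()) for s in sents]
    # Count non-stopwords and normalize by the highest count.
    counts = Counter(w for s in word_sent for w in s if w not in STOPWORDS)
    top = float(max(counts.values()))
    # Keep only words whose normalized frequency lies strictly between the cuts.
    freq = {w: c / top for w, c in counts.items() if min_cut < c / top < max_cut}
    # Score each sentence by the summed frequency of the words it contains.
    scores = {i: sum(freq.get(w, 0.0) for w in s) for i, s in enumerate(word_sent)}
    return [sents[i] for i in nlargest(n, scores, key=scores.get)]

sample = ("Football fans love goals. Goals win football matches. "
          "Football strikers score goals. Fans love strikers.")
print(summarize(sample, 2))
```

Note that with `max_cut=0.9` the most frequent word always has a normalized frequency of 1.0 and is therefore discarded, so in this sample "football" and "goals" do not contribute to any sentence score.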

To test the summarizer, let's create a function that extracts the natural language text from an HTML page using BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

def get_only_text(url):
  """
   return the title and the text of the article
   at the specified url
  """
  page = urllib2.urlopen(url).read().decode('utf8')
  soup = BeautifulSoup(page)
  text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
  return soup.title.text, text
We can finally apply our summarizer to a set of articles extracted from the BBC news feed:
feed_xml = urllib2.urlopen('').read()
feed = BeautifulSoup(feed_xml.decode('utf8'))
to_summarize = map(lambda p: p.text, feed.find_all('guid'))

fs = FrequencySummarizer()
for article_url in to_summarize[:5]:
  title, text = get_only_text(article_url)
  print '----------------------------------'
  print title
  for s in fs.summarize(text, 2):
   print '*',s
And here are the results:
BBC News - Scottish independence: Campaigns seize on Scotland powers pledge
* Speaking ahead of a visit to apprentices at an engineering firm in Renfrew, Deputy First Minister Nicola Sturgeon said: Only a 'Yes' vote will ensure we have full powers over job creation - enabling us to create more and better jobs across the country.
* Asked if the move smacks of panic, Mr Alexander told BBC Breakfast: I don't think there's any embarrassment about placing policies on the front page of papers with just days two go.
BBC News - US air strike supports Iraqi troops under attack
* Gabriel Gatehouse reports from the front line of Peshmerga-held territory in northern Iraq The air strike south-west of Baghdad was the first taken as part of our expanded efforts beyond protecting our own people and humanitarian missions to hit Isil targets as Iraqi forces go on offence, as outlined in the president's speech last Wednesday, US Central Command said.
* But Iran's Supreme Leader Ayatollah Ali Khamenei said on Monday that the US had requested Iran's co-operation via the US ambassador to Iraq.
BBC News - Passport delay victims deserve refund, say MPs
* British adult passport costs Normal service - £72.50 Check  Send - Post Office staff check application correct and it is sent by Special Delivery - £81.25 Fast-Track - Applicant attends Passport Office in person and passport delivered within one week - £103 Premium - Passport available for collection on same day applicant attends Passport Office - £128 In mid-June it announced that - for people who could prove they were booked to travel within seven days and had submitted passport applications more than three weeks earlier - there would be a free upgrade to its fast-track service.
* The Passport Office has since cut the number of outstanding applications to around 90,000, but the report said: A number of people have ended up out-of-pocket due to HMPO's inability to meet its service standard.
BBC News - UK inflation rate falls to 1.5%
* Howard Archer, chief UK and European economist at IHS Global Insight, said: August's muted consumer price inflation is welcome news for consumers' purchasing power as they currently continue to be hampered by very low earnings growth.
* Consumer Price Index (CPI) inflation fell to 1.5% from 1.6% in August, the Office for National Statistics said.
BBC News - Thailand deaths: Police have 'number of suspects'
* The BBC's Jonathan Head, on Koh Tao, says police are focussing on the island's Burmese community BBC south-east Asia correspondent Jonathan Head said the police's focus on Burmese migrants would be quite controversial as Burmese people were often scapegoated for crimes in Thailand.
* By Jonathan Head, BBC south-east Asia correspondent The shocking death of the two young tourists has cast a pall over this scenic island resort Locals say they can remember nothing like it happening before.
Of course, evaluating a text summarizer is not an easy task. However, from the results above we note that the summarizer often picked quoted text reported in the original article, and that the sentences it picked often give decent insight into the article when read alongside its title.