Wednesday, September 11, 2019

Organizing movie covers with Neural Networks

In this post we will see how to organize a set of movie covers by similarity on a 2D grid using a particular type of Neural Network called Self Organizing Map (SOM). First, let's load the movie covers of the top 100 movies according to IMDB (the files can be downloaded here) and convert the images in samples that we can use to feed the Neural Network:
import numpy as np
import imageio
from glob import glob
from sklearn.preprocessing import StandardScaler

# covers of the top 100 movies on www.imdb.com/chart/top 
# (the 13th of August 2019)
# images downloaded from www.themoviedb.org
data = []
all_covers = glob('movie_covers/*.jpg')
for cover_jpg in all_covers:
    cover = imageio.imread(cover_jpg)
    data.append(cover.reshape(np.prod(cover.shape)))
    
original_shape = imageio.imread(all_covers[0]).shape

scaler = StandardScaler()
data = scaler.fit_transform(data)
In the snippet above we load every image and for each of them we stack the color values of each pixel in a one dimensional vector. After loading all the images a standard scaling is applied to have all the values with mean 0 and standard deviation equal to 1. This scaling strategies often turns out to be quite successful when working with SOMs. Now we can train our model:
from minisom import MiniSom

w = 10
h = 10
som = MiniSom(h, w, len(data[0]), learning_rate=0.5,
              sigma=3, neighborhood_function='triangle')

som.train_random(data, 2500, verbose=True)
win_map = som.win_map(data)
Here we use Minisom, a lean implementation of the SOM, to implement a 10-by-10 map of neurons. Each movie cover is mapped in a neuron and we can display the results as follows:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(figsize=(30, 20))
grid = ImageGrid(fig, 111,
                 nrows_ncols=(h, w), axes_pad=0)

def place_image(i, img):
    img = (scaler.inverse_transform(img)).astype(int)
    grid[i].imshow(img.reshape(original_shape))
    grid[i].axis('off')

to_fill = []
collided = []

for i in range(w*h):
    position = np.unravel_index(i, (h, w))
    if position in win_map:
        img = win_map[position][0]
        collided += win_map[position][1:]
        place_image(i, img)
    else:
        to_fill.append(i)

collided = collided[::-1]
for i in to_fill:
    position = np.unravel_index(i, (h, w))
    img = collided.pop()
    place_image(i, img)

plt.show()
Since some images can be mapped in the same neuron, we first draw all the covers picking only one per neuron, then we fill the empty spaces of the map with covers that have been mapped in nearby neurons but have not been plotted yet.

This is the result:



Where to go next:
  • Read more about how Self Organizing Maps work here.
  • Check out how to install Minisom here.

Sunday, August 11, 2019

Visualizing distributions with scatter plots in matplotlib

Let's say that we want to study the time between the end of a marked point and next serve in a tennis game. After gathering our data, the first thing that we can do is to draw a histogram of the variable that we are interested in:

import pandas as pd
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)

plt.hist(event.seconds_before_next_point, bins=10)
plt.xlabel('Seconds before next serve')
plt.show()


The histogram reveals some interesting aspects of the distribution, indeed we can see that data is slightly skewed to the right and that on average the server takes 20 seconds. However, we couldn't tell how many time the serves happens before 10 seconds or after 35. Of course, one could increase the bins of the histogram, but this would lead to a chart which is not particularly elegant and that might hide some other details.

To have a better understanding of the situation we can draw a scatter plot of the variable we are studying:
import numpy as np
from scipy.stats.kde import gaussian_kde

def distribution_scatter(x, symmetric=True, cmap=None, size=None):
    """
    Plot the distribution of x showing all the points.
    The x axis represents the samples in x
    and the y axis is function of the probability of x
    and random assignment.
    
    Returns the position on the y axis.
    """
    pdf = gaussian_kde(x)
    w = np.random.rand(len(x))
    if symmetric:
        w = w*2-1
    pseudo_y = pdf(x) * w
    if cmap:
        plt.scatter(x, pseudo_y, c=x, cmap=cmap, s=size)
    else:
        plt.scatter(x, pseudo_y, s=size)
    return pseudo_y


In this chart each sample is represented with a point and the spread of the points in the y direction depends on the probability of occurrence. In this case we can easily see that 4 serves happened before 10 seconds and 3 after 35.

Since we're not really interested on the values on y axis but only on the spread, we can remove the axis and add few details on the outliers to enrich the chart:

url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)

plt.figure(figsize=(7, 11))
title = 'Time in seconds between'
title += '\nend of marked point and next serve'
title += '\nat 2015 French Open'
plt.title(title, loc='left', fontsize=18, color='gray')
py = distribution_scatter(event.seconds_before_next_point, cmap='cool');


cut_h = np.percentile(event.seconds_before_next_point, 98)
outliers = event.seconds_before_next_point> cut_h


ha = {True: 'right', False: 'left'}
for x, y, c in zip(event[outliers].seconds_before_next_point,
                   py[outliers],
                   event[outliers].server):
    plt.text(x, y+.0005, c,
             ha=ha[x<0], va='bottom', fontsize=12)

plt.xlabel('Seconds before next serve', fontsize=15)
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.yticks([])
plt.xticks(np.arange(5, 41, 5))
plt.xlim([5, 40])
plt.show()


Where to go next:

Friday, June 7, 2019

Exporting Decision Trees in textual format with sklearn

In the past we have covered Decision Trees showing how interpretable these models can be (see the tutorials here). In the previous tutorials we have exported the rules of the models using the function export_graphviz from sklearn and visualized the output of this function in a graphical way with an external tool which is not easy to install in some cases. Luckily, since version 0.21.2, scikit-learn offers the possibility to export Decision Trees in a textual format (I implemented this feature personally ^_^) and in this post we will see an example how of to use this new feature.

Let's train a tree with 2 layers on the famous iris dataset using all the data and print the resulting rules using the brand new function export_text:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_text
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']
y = ['setosa']*50+['versicolor']*50+['virginica']*50
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)
|--- petal width (cm) <= 0.80
|   |--- class: setosa
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: versicolor
|   |--- petal width (cm) >  1.75
|   |   |--- class: virginica
Reading the them we note that if the feature petal width is less or equal than 80mm the samples are always classified as setosa. Otherwise if the petal width is less or equal than 1.75cm they're classified as versicolor or as virginica if the petal width is more than 1.75cm. This model might well suffer from overfitting but tells us some important details of the data. It's easy to note that the petal width is the only feature used, we could even say that the petal width is small for setosa samples, medium for versicolor and large for virginica.

To understand how the rules separate the labels we can also print the number of samples from each class (class weights) on the leaves:
r = export_text(decision_tree, feature_names=iris['feature_names'],
                decimals=0, show_weights=True)
print(r)
|--- petal width (cm) <= 1
|   |--- weights: [50, 0, 0] class: setosa
|--- petal width (cm) >  1
|   |--- petal width (cm) <= 2
|   |   |--- weights: [0, 49, 5] class: versicolor
|   |--- petal width (cm) >  2
|   |   |--- weights: [0, 1, 45] class: virginica
Here we have the number of samples per class among square brackets. Recalling that we have 50 samples per class, we see that all the samples labeled as setosa are correctly modelled by the tree while for 5 virginica and 1 versicolor the model fails to capture the information given by the label.

Check out the documentation of the function export_text to discover all its capabilities here.

Friday, May 17, 2019

Feelings toward immigration of people from other EU Member States in November 2018

In this post we will see a snippet about how to plot a part of the results of the eurobarometer survey released last March. In particular, we will focus on the responses to the following question:
Please tell me whether the following statement evokes a positive or negative feeling for you: Immigration of people from other EU Member States.
The data from the main spreadsheet reporting the results country by country was isolated in a csv file (then uploaded on github) so that it could be easily loaded in Pandas as follows:
import pandas as pd

# github gist
gist = 'https://gist.githubusercontent.com/JustGlowing/'
gist += '2c25b9b153192baf573ce3b744ea6a65/raw/'
gist += '5f3888f7f42caca58b2418ec5822425083b6d559/'
gist += 'immigration_from_EU_eurobarometer_2018.csv'
df = pd.read_csv(gist, index_col=0)
df = df[df.index.map(lambda x: not '\n' in x)]
df.sort_values(by=["Total 'Positive'"], inplace=True)

# from https://ec.europa.eu/eurostat/statistics-explained/index.php
country_names = {'BE' : 'Belgium',
'BG' : 'Bulgaria',
'CZ' : 'Czechia',
'DK' : 'Denmark',
'DE' : 'Germany',
'EE' : 'Estonia',
'IE' : 'Ireland',
'EL' : 'Greece',
'ES' : 'Spain',
'FR' : 'France',
'HR' : 'Croatia',
'IT' : 'Italy',
'CY' : 'Cyprus',
'LV' : 'Latvia',
'LT' : 'Lithuania',
'LU' : 'Luxembourg',
'HU' : 'Hungary',
'MT' : 'Malta',
'NL' : 'Netherlands',
'AT' : 'Austria',
'PL' : 'Poland',
'PT' : 'Portugal',
'RO' : 'Romania',
'SI' : 'Slovenia',
'SK' : 'Slovakia',
'FI' : 'Finland',
'SE' : 'Sweden',
'UK' : 'United Kingdom'}

df.index = df.index.map(country_names.get)
The idea is to create a bar chart with two sides, positive responses on the right and negative on the left. To do this, we can use the function barh and the attribute left can be used to stack the two subsets of responses ("Fairly positive/ negative" and "Very positive/negative"). The xticks also need to be adapted to reflect that the left side of the axis doesn't report values below zero. Here's the snippet:
import matplotlib.pyplot as plt
import numpy as np

country_idx = range(len(df))

plt.figure(figsize=(11, 14))
plt.barh(country_idx, df['Fairly positive'],
         color='deepskyblue',label='Fairly positive')
plt.barh(country_idx, df['Very positive'], left=df['Fairly positive'],
         color='dodgerblue', label='Very positive')
plt.barh(country_idx, -df['Fairly negative'],
         color='tomato', label='Fairly negative')
plt.barh(country_idx, -df['Very negative'], left=-df['Fairly negative'],
         color='firebrick', label='Very negative')

plt.yticks(country_idx, df.index)
plt.xlim([-100, 100])
plt.xticks(np.arange(-100, 101, 25), np.abs(np.arange(-100, 101, 25)))
plt.ylim([-.5, len(df)-.5])
title = 'Feelings toward immigration of people from\n'
title += 'other EU Member States in November 2018'
plt.title(title)
xlbl = 'negative            <<<       % responses       >>>            positive'
plt.xlabel(xlbl)
plt.legend(loc='lower right')

bbox_props = dict(fc="white", ec="k", lw=2) 
plt.text(-95, 27, 'twitter: @justglowing \nhttps://glowingpython.blogspot.com',
         ha="left", va="center", size=11, bbox=bbox_props)
plt.show()


From the chart we note that the percentage of positive responses per country is mostly above 50% while the negative ones reach 50% only in two cases. We also see that Ireland and Sweden are the countries with the most positive responses, while Czechia (yes, that's Chech Republic :) is the country with most negative responses, though Cypriots also gave a similar number of "Very negative" responses.

Wednesday, April 17, 2019

Visualizing atmospheric carbon dioxide

Let's have a look at how to create a visualization that shows how CO2 concentrations evolved in the atmosphere. First, we fetched from the Earth System Research Laboratory website like follows:
import pandas as pd

data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
                       names=['year', 'month', 'day', 'decimal', 'ppm', 
                       'days', '1_yr_ago',  '10_yr_ago', 'since_1800'])

co2_data['timestamp'] = co2_data.apply(lambda x: pd.Timestamp(year=int(x.year),
                                                             month=int(x.month),
                                                             day=int(x.day)),
                                       axis=1)
co2_data = co2_data[['timestamp', 'ppm']].set_index('timestamp').ffill()
Then, we group the it by year and month at the same time storing the result in a matrix where each element represents the concentration in a specific year and month:
import numpy as np
import matplotlib.pyplot as plt
from calendar import month_abbr

co2_data = co2_data['1975':'2018']
n_years = co2_data.index.year.max() - co2_data.index.year.min()
z = np.ones((n_years +1 , 12)) * np.min(co2_data.ppm)
for d, y in co2_data.groupby([co2_data.index.year, co2_data.index.month]):
  z[co2_data.index.year.max() - d[0], d[1] - 1] = y.mean()[0]
  
plt.figure(figsize=(10, 14))
plt.pcolor(np.flipud(z), cmap='hot_r')
plt.yticks(np.arange(0, n_years+1)+.5,
           range(co2_data.index.year.min(), co2_data.index.year.max()+1));
plt.xticks(np.arange(13)-.5, month_abbr)
plt.xlim((0, 12))
plt.colorbar().set_label('Atmospheric Carbon Dioxide in ppm')
plt.show()


This visualization makes us able to compare the CO2 levels month by month with single glance. For example, we see that the period from April to June gets dark quicker than other periods, meaning that it contains the highest levels every year. Conversely, the period that goes from September to October gets darker more slowly, meaning that it's the period with the lowest CO2 levels. Also, looking at the color bar we note that in 43 years there was a 80 ppm increase.

Is this bad for the planet earth? Reading Hansen et al. (2008) we can classify CO2 levels less than 300 ppm as safe, levels between 300 and 350 ppm as dangerous, while levels beyond 350 ppm are considered catastrophic. According to this, the chart is a sad picture of how the levels transitioned from dangerous to catastrophic!

Concerned by this situation I created the CO2 Forecast twitter account where I'll publish short and long term forecasts of CO2 levels in the atmosphere.

Thursday, March 28, 2019

Speeding up the Sieve of Eratosthenes with Numba

Lately, on invitation of my right honourable friend Michal, I've been trying to solve some problems from the Euler project and felt the need to have a good way to find prime numbers. So implemented the the Sieve of Eratosthenes. The algorithm is simple and efficient. It creates a list of all integers below a number n then filters out the multiples of all primes less than or equal to the square root of n, the remaining numbers are the eagerly-awaited primes. Here's the first version of the implementation I came up with:
def sieve_python(limit):
    is_prime = [True]*limit
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(limit**0.5) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime
This returns a list is_prime where is_prime[n] is True n is a prime number. The code is straightforward but it wasn't fast enough for my taste so I decided to time it:
from timeit import timeit

def elapse_time(s):
    s = timeit(s, number=100, globals=globals())
    return f'{s:.3f} seconds'

print(elapse_time('sieve_python(100000)'))
1.107 seconds
1.1 seconds to check 100000 values sounded indeed too slow so I decided to precompile the function with Numba:
from numba import njit

@njit
def sieve_python_jit(limit):
    is_prime = [True]*limit
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(limit**0.5) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime

sieve_python_jit(10) # compilation
print(elapse_time('sieve_python_jit(100000)'))
0.103 seconds
The only addition to the previous version is the decorator @njit and this simple change resulted in a whopping 10x speed up! However, Michal shared with me some code making me notice that combining Numba with the appropriate Numpy data structures leads to impressive results so this implementation materialized:
import numpy as np

@njit
def sieve_numpy_jit(limit):
    is_prime = np.full(limit, True)
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(np.sqrt(limit) + 1)):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime

sieve_numpy_jit(10) # compilation
print(elapse_time('sieve_numpy_jit(100000)'))
0.018 seconds
The speed up respect to the first version is 61x!

Lessons learned:
  • Using Numba is very straightforward and a Python function written in a decent manner can be speeded up with little effort.
  • Python lists are too heavy in some cases. Even with pre-allocation of the memory they can't beat Numpy arrays for this specific task.
  • Assigning types correctly is key. Using a Numpy array of integers instead of bools in the function sieve_numpy_jit would result in a slow down.
Update: Thanks to gwillicoder who made me realize the code could be speed up checking if the divisor is a prime and providing a very efficient numpy implementation here.

Saturday, March 23, 2019

Visualizing the trend of a time series with Pandas

The trend of time series is the general direction in which the values change. In this post we will focus on how to use rolling windows to isolate it. Let's download from Google Trends the interest of the search term Pancakes and see what we can do with it:
import pandas as pd
import matplotlib.pyplot as plt
url = './data/pancakes.csv' # downloaded from https://trends.google.com
data = pd.read_csv(url, skiprows=2, parse_dates=['Month'], index_col=['Month'])
plt.plot(data)


Looking at the data we notice that there's some seasonality (Pancakes day! yay!) and an increasing trend. What if we want to visualize just the trend of this curve? We only need to slide a rolling window through the data and compute the average at each step. This can be done in just one line if we use the method rolling:

y_mean = data.rolling('365D').mean()
plt.plot(y_mean)


The parameter passed to rolling '365D' means that our rolling window will have size 365 days. Check out the documentation of the method to know more.
We can also add highlight the variation each year adding to the chart a shade with the amplitude of the standard deviation:

y_std = data.rolling('365D').std()
plt.plot(y_mean)
plt.fill_between(y_mean.index,
                 (y_mean - y_std).values.T[0],
                 (y_mean + y_std).values.T[0], alpha=.5)


Warning: the visualization above assumes that the distribution of the data each year follows a normal distribution, which is not entirely true.

Wednesday, March 20, 2019

Ravel and unravel with numpy

Raveling and unraveling are common operations when working with matricies. With a ravel operation we go from matrix coordinate to index coordinates, while with an unravel operation we go the opposite way. In this post we will through an example how they can be done with numpy in a very easy way. Let's assume that we have a matrix of dimensions 4-by-4, and that we want to index of the element (1, 1) counting from the top right corner of the matrix. Using ravel_multi_index the solution is easy:
import numpy as np
coordinates = [[1], [1]]
shape = (4, 4)
idx = np.ravel_multi_index(coordinates, shape)
print(idx)

array([5])
What if we want to go back to the original coordinates? In this case we can use unravel_index:
np.unravel_index(idx, shape)

(array([1]), array([1]))
So now we know that the elements (1, 1) has index 5 ;-)

Tuesday, January 22, 2019

A visual introduction to the Gap Statistics

We have previously seen how to implement KMeans. However, the results of this algorithm strongly rely on the choice of the parameter K. According to statistical folklore the best K is located at the 'elbow' of the clusters inertia while K increases. This heuristic has been translated into a more formalized procedure by the Gap Statistics and in this post we'll see how to pick K in an optimal way using the Gap Statistics. The main idea of the methodology is to compare the clusters inertia on the data to cluster and a reference dataset. The optimal choice of K is given by k for which the gap between the two results is maximum. To illustrate this idea, let’s pick as reference dataset a uniformly distributed set of points and see the result of KMeans increasing K:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans


reference = np.random.rand(100, 2)
plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(reference)
    plt.subplot(1,5,k)
    plt.scatter(reference[:, 0], reference[:, 1], c=a)
    plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()


From the figure above we can see that the algorithm evenly splits the points K clusters even if there's no separation between them. Let’s now do the same on a target dataset with 3 natural clusters:

plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(X)
    plt.subplot(1,5,k)
    plt.scatter(X[:, 0], X[:, 1], c=a)
    plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()


Here we note that the algorithm, with K=2, correctly isolates one of the clusters grouping the other two together. Then, with K=3, correctly identifies the natural clusters. But, with K=4 and K=5 some of the natural clusters are split in two. If we plot the inertia in both cases we'll see something interesting:

def compute_inertia(a, X):
    W = [np.mean(pairwise_distances(X[a == c, :])) for c in np.unique(a)]
    return np.mean(W)

def compute_gap(clustering, data, k_max=5, n_references=5):
    if len(data.shape) == 1:
        data = data.reshape(-1, 1)
    reference = np.random.rand(*data.shape)
    reference_inertia = []
    for k in range(1, k_max+1):
        local_inertia = []
        for _ in range(n_references):
            clustering.n_clusters = k
            assignments = clustering.fit_predict(reference)
            local_inertia.append(compute_inertia(assignments, reference))
        reference_inertia.append(np.mean(local_inertia))
    
    ondata_inertia = []
    for k in range(1, k_max+1):
        clustering.n_clusters = k
        assignments = clustering.fit_predict(data)
        ondata_inertia.append(compute_inertia(assignments, data))
        
    gap = np.log(reference_inertia)-np.log(ondata_inertia)
    return gap, np.log(reference_inertia), np.log(ondata_inertia)

gap, reference_inertia, ondata_inertia = compute_gap(KMeans())


plt.plot(range(1, k_max+1), reference_inertia,
         '-o', label='reference')
plt.plot(range(1, k_max+1), ondata_inertia,
         '-o', label='data')
plt.xlabel('k')
plt.ylabel('log(inertia)')
plt.show()


On the reference dataset the inertia goes down’ very slowly while on the target dataset it assumes the shape of an elbow! We can now compute the Gap Statistics for each K computing the difference of the two curves showed above:

plt.plot(range(1, k_max+1), gap, '-o')
plt.ylabel('gap')
plt.xlabel('k')


It’s easy to see that the Gap is maximum for K=3, just the right choice for our target dataset.

For a more formal introduction you can check out the following paper: Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistics Society 2001.