
Monday, June 29, 2020

Solving the Travelling Salesman Problem with MiniSom

Have you ever heard of the Travelling Salesman Problem? I'm pretty sure you have, but let's refresh our minds by looking at its formulation: "Given a list of points and the distances between each pair of points, what is the shortest possible path that visits each point and returns to the starting point?".
What makes this problem so famous and so studied is the fact that it has no "quick" solution, as the complexity of calculating the best path increases as more points are added. And the complexity increases so fast that, even with modern hardware, it can be impossible to compute an exact solution in a reasonable time. In more rigorous terms, it is an NP-hard problem. Many heuristics are known to solve this problem and in this post we will see a solution based on Self-Organizing Maps (SOM). A SOM is a Neural Network that is capable of mapping an input point onto a two-dimensional space, placing points that are close to each other in the same area. Hence, the idea to solve our problem is to train a SOM with a single row of neurons so that it maps the points to visit onto a one-dimensional map, and then visit the points in order, from the one mapped to the first cell (the one on the left) to the one mapped to the last cell (on the right). Points that are mapped to the same cell are visited consecutively.


Let's generate a set of points to test this idea:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)  # seed the random generator for reproducibility
N_points = 20
N_neurons = N_points*2
t = np.linspace(0, np.pi*2, N_points)
x = np.cos(t)+(np.random.rand(N_points)-.5)*.3
y = np.sin(t)*.8+(np.random.rand(N_points)-.5)*.2
points = np.array([x,y]).T
plt.scatter(x, y)



We can now import MiniSom, our favorite implementation of the Self-Organizing Maps, and see what path it's able to produce:
from minisom import MiniSom

som = MiniSom(1, N_neurons*2, 2, sigma=10,
              neighborhood_function='gaussian', random_seed=50)
max_iter = 2000
som.pca_weights_init(points)

paths_x = []
paths_y = []
for i in np.arange(max_iter):
    i_point = i % len(points)
    som.update(points[i_point], som.winner(points[i_point]), i, max_iter)
    visit_order = np.argsort([som.winner(p)[1] for p in points])
    visit_order = np.concatenate((visit_order, [visit_order[0]]))
    paths_x.append(points[visit_order][:,0])
    paths_y.append(points[visit_order][:,1])
    
plt.scatter(x, y, label='point to visit')
plt.plot(paths_x[-1], paths_y[-1],
         'C3', linewidth=2, label='path')



In the snippet above we initialized the SOM and ran 2000 training iterations (check this out to discover how that works). At each iteration we saved the path found, and above we visualized the last solution. As we can see, the line covers all the points and it's easy to see at a glance that it's the best possible path. However, it's interesting to see how the solution evolves at each iteration:
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

fig, ax = plt.subplots()
plt.scatter(x, y, label='point to visit')
ln, = plt.plot([], [], 'C3', linewidth=2, label='path')
plt.legend()

def update(frame):
    ln.set_data(paths_x[frame], paths_y[frame])
    plt.title('iteration = %d' % frame)
    return ln,

ani = FuncAnimation(fig, update, frames=np.arange(max_iter),
                    interval=10, repeat=False, blit=False)
HTML(ani.to_html5_video())



Here we note that the initial path is very messy and presents various loops, and that the more the network is trained, the closer the solution gets to the optimum. Notice that the snippet above uses the object HTML from the IPython library, which automatically displays the video when a Jupyter notebook is used. The video can also be saved to a specific location using ani.save('filename.mp4').
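For completeness, here's a minimal sketch of how to save the animation to disk (the filename and frame rate are arbitrary choices, and writing an MP4 requires ffmpeg to be available on the system):
# save the animation instead of displaying it inline
# (assumes ffmpeg is installed; filename and fps are just examples)
ani.save('tsp_som.mp4', fps=30)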

Wednesday, September 11, 2019

Organizing movie covers with Neural Networks

In this post we will see how to organize a set of movie covers by similarity on a 2D grid using a particular type of Neural Network called Self Organizing Map (SOM). First, let's load the movie covers of the top 100 movies according to IMDB (the files can be downloaded here) and convert the images into samples that we can use to feed the Neural Network:
import numpy as np
import imageio
from glob import glob
from sklearn.preprocessing import StandardScaler

# covers of the top 100 movies on www.imdb.com/chart/top 
# (the 13th of August 2019)
# images downloaded from www.themoviedb.org
data = []
all_covers = glob('movie_covers/*.jpg')
for cover_jpg in all_covers:
    cover = imageio.imread(cover_jpg)
    data.append(cover.reshape(np.prod(cover.shape)))
    
original_shape = imageio.imread(all_covers[0]).shape

scaler = StandardScaler()
data = scaler.fit_transform(data)
In the snippet above we load every image and, for each of them, we stack the color values of each pixel in a one-dimensional vector. After loading all the images, a standard scaling is applied so that all the values have mean 0 and standard deviation equal to 1. This scaling strategy often turns out to be quite successful when working with SOMs. Now we can train our model:
from minisom import MiniSom

w = 10
h = 10
som = MiniSom(h, w, len(data[0]), learning_rate=0.5,
              sigma=3, neighborhood_function='triangle')

som.train_random(data, 2500, verbose=True)
win_map = som.win_map(data)
Here we use MiniSom, a lean implementation of the SOM, to implement a 10-by-10 map of neurons. Each movie cover is mapped to a neuron and we can display the results as follows:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(figsize=(30, 20))
grid = ImageGrid(fig, 111,
                 nrows_ncols=(h, w), axes_pad=0)

def place_image(i, img):
    # invert the scaling (inverse_transform expects a 2D array of samples)
    img = scaler.inverse_transform(img.reshape(1, -1))[0].astype(int)
    grid[i].imshow(img.reshape(original_shape))
    grid[i].axis('off')

to_fill = []
collided = []

for i in range(w*h):
    position = np.unravel_index(i, (h, w))
    if position in win_map:
        img = win_map[position][0]
        collided += win_map[position][1:]
        place_image(i, img)
    else:
        to_fill.append(i)

collided = collided[::-1]
for i in to_fill:
    position = np.unravel_index(i, (h, w))
    img = collided.pop()
    place_image(i, img)

plt.show()
Since multiple images can be mapped to the same neuron, we first draw all the covers picking only one per neuron, then we fill the empty cells of the map with covers that have been mapped to nearby neurons but have not been plotted yet.
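As a quick check (just a sketch on top of the snippet above), we can inspect how many covers ended up in each neuron by looking at the win_map dictionary returned by MiniSom:
# count how many covers were mapped to each neuron of the 10-by-10 grid
for position, covers in win_map.items():
    print(position, len(covers))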

This is the result:



Where to go next:
  • Read more about how Self Organizing Maps work here.
  • Check out how to install Minisom here.

Friday, June 7, 2019

Exporting Decision Trees in textual format with sklearn

In the past we have covered Decision Trees showing how interpretable these models can be (see the tutorials here). In the previous tutorials we exported the rules of the models using the function export_graphviz from sklearn and visualized its output in a graphical way with an external tool which is not easy to install in some cases. Luckily, since version 0.21.2, scikit-learn offers the possibility to export Decision Trees in a textual format (I implemented this feature personally ^_^) and in this post we will see an example of how to use this new feature.

Let's train a tree with 2 layers on the famous iris dataset using all the data and print the resulting rules using the brand new function export_text:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']
y = ['setosa']*50+['versicolor']*50+['virginica']*50
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)
|--- petal width (cm) <= 0.80
|   |--- class: setosa
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: versicolor
|   |--- petal width (cm) >  1.75
|   |   |--- class: virginica
Reading them we note that if the feature petal width is less than or equal to 0.8 cm, the samples are always classified as setosa. Otherwise, if the petal width is less than or equal to 1.75 cm they're classified as versicolor, or as virginica if the petal width is greater than 1.75 cm. This model might well suffer from overfitting but it tells us some important details of the data. It's easy to note that the petal width is the only feature used; we could even say that the petal width is small for setosa samples, medium for versicolor and large for virginica.
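As a quick sanity check (just a sketch; the input values are a typical setosa sample, not part of the original example), we can verify that the trained tree follows the first rule:
# a sample with petal width 0.2 cm falls in the first branch (<= 0.8)
print(decision_tree.predict([[5.1, 3.5, 1.4, 0.2]]))  # expected: ['setosa']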

To understand how the rules separate the labels we can also print the number of samples from each class (class weights) on the leaves:
r = export_text(decision_tree, feature_names=iris['feature_names'],
                decimals=0, show_weights=True)
print(r)
|--- petal width (cm) <= 1
|   |--- weights: [50, 0, 0] class: setosa
|--- petal width (cm) >  1
|   |--- petal width (cm) <= 2
|   |   |--- weights: [0, 49, 5] class: versicolor
|   |--- petal width (cm) >  2
|   |   |--- weights: [0, 1, 45] class: virginica
Here we have the number of samples per class among square brackets. Recalling that we have 50 samples per class, we see that all the samples labeled as setosa are correctly modelled by the tree, while for 5 virginica and 1 versicolor samples the model fails to capture the information given by the label.
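To double check these counts on the training data, here's a minimal sketch using the confusion matrix from sklearn (not part of the original example):
from sklearn.metrics import confusion_matrix

# rows are true labels, columns are predictions
# (classes in alphabetical order: setosa, versicolor, virginica)
print(confusion_matrix(y, decision_tree.predict(X)))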

Check out the documentation of the function export_text to discover all its capabilities here.

Tuesday, January 22, 2019

A visual introduction to the Gap Statistics

We have previously seen how to implement KMeans. However, the results of this algorithm strongly rely on the choice of the parameter K. According to statistical folklore, the best K is located at the 'elbow' of the curve of the clusters' inertia as K increases. This heuristic has been translated into a more formalized procedure by the Gap Statistics, and in this post we'll see how to pick K in an optimal way using the Gap Statistics. The main idea of the methodology is to compare the clusters' inertia on the data to cluster with the inertia on a reference dataset. The optimal choice of K is given by the k for which the gap between the two results is maximum. To illustrate this idea, let's pick as reference dataset a uniformly distributed set of points and see the result of KMeans as K increases:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans


reference = np.random.rand(100, 2)
plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(reference)
    plt.subplot(1,5,k)
    plt.scatter(reference[:, 0], reference[:, 1], c=a)
    plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()


From the figure above we can see that the algorithm evenly splits the points into K clusters even if there's no separation between them. Let's now do the same on a target dataset with 3 natural clusters:

X = make_blobs(n_samples=100, n_features=2,
               centers=3, cluster_std=.8,)[0]

plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(X)
    plt.subplot(1,5,k)
    plt.scatter(X[:, 0], X[:, 1], c=a)
    plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()


Here we note that, with K=2, the algorithm correctly isolates one of the clusters while grouping the other two together. Then, with K=3, it correctly identifies the natural clusters. But, with K=4 and K=5, some of the natural clusters are split in two. If we plot the inertia in both cases we'll see something interesting:

def compute_inertia(a, X):
    W = [np.mean(pairwise_distances(X[a == c, :])) for c in np.unique(a)]
    return np.mean(W)

def compute_gap(clustering, data, k_max=5, n_references=5):
    if len(data.shape) == 1:
        data = data.reshape(-1, 1)
    reference_inertia = []
    for k in range(1, k_max+1):
        local_inertia = []
        for _ in range(n_references):
            # generate a new uniform reference dataset for each repetition
            reference = np.random.rand(*data.shape)
            clustering.n_clusters = k
            assignments = clustering.fit_predict(reference)
            local_inertia.append(compute_inertia(assignments, reference))
        reference_inertia.append(np.mean(local_inertia))
    
    ondata_inertia = []
    for k in range(1, k_max+1):
        clustering.n_clusters = k
        assignments = clustering.fit_predict(data)
        ondata_inertia.append(compute_inertia(assignments, data))
        
    gap = np.log(reference_inertia)-np.log(ondata_inertia)
    return gap, np.log(reference_inertia), np.log(ondata_inertia)

k_max = 5
gap, reference_inertia, ondata_inertia = compute_gap(KMeans(), X, k_max)


plt.plot(range(1, k_max+1), reference_inertia,
         '-o', label='reference')
plt.plot(range(1, k_max+1), ondata_inertia,
         '-o', label='data')
plt.xlabel('k')
plt.ylabel('log(inertia)')
plt.show()


On the reference dataset the inertia goes down very slowly, while on the target dataset it assumes the shape of an elbow! We can now compute the Gap Statistics for each K by computing the difference of the two curves shown above:

plt.plot(range(1, k_max+1), gap, '-o')
plt.ylabel('gap')
plt.xlabel('k')


It’s easy to see that the Gap is maximum for K=3, just the right choice for our target dataset.
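Programmatically, a minimal sketch to pick this value from the gap array computed above:
# the optimal K maximizes the gap (+1 because the array starts at k=1)
optimal_k = np.argmax(gap) + 1
print('optimal number of clusters:', optimal_k)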

For a more formal introduction you can check out the following paper: Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 2001.
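Roughly, in the notation of that paper, the quantity computed in this post corresponds to

Gap(k) = E*[log(Wk)] - log(Wk)

where Wk is the clusters' inertia obtained with k clusters on the data and E* denotes the average of the same quantity over the uniformly distributed reference datasets.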

Monday, October 9, 2017

Spotting outliers with Isolation Forest using sklearn

Isolation Forest is an algorithm to detect outliers. It partitions the data using a set of trees and provides an anomaly score based on how isolated a point is in the resulting structure; the anomaly score is then used to tell apart outliers from normal observations. In this post we will see an example of how IsolationForest behaves in a simple case. First, we will generate 1-dimensional data from a bimodal distribution, then we will compare the anomaly score with the distribution of the data, highlighting the regions where the outliers fall.

To start, let's generate the data and plot the histogram:
import numpy as np
import matplotlib.pyplot as plt

x = np.concatenate((np.random.normal(loc=-2, scale=.5,size=500), 
                    np.random.normal(loc=2, scale=.5, size=500)))

plt.hist(x, density=True)  # normalize the histogram
plt.xlim([-5, 5])
plt.show()

Here we note that there are three regions where the data has low probability to appear: one on the right side of the distribution, another on the left and another around zero. Let's see if using IsolationForest we are able to identify these three regions:

from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(x.reshape(-1, 1))

xx = np.linspace(-6, 6, 100).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)

plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), 
                 where=outlier==-1, color='r', 
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('x')
plt.xlim([-5, 5])
plt.show()

In the snippet above we trained our IsolationForest using the data generated, computed the anomaly score for each observation and classified each observation as outlier or non-outlier. The chart shows the anomaly score and the regions where the outliers are. As expected, the anomaly score reflects the shape of the underlying distribution and the outlier regions correspond to low probability areas.
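As a further check (a quick sketch, not part of the original chart), we can also count how many of the original observations the model flags as outliers:
# predict returns -1 for outliers and 1 for inliers
labels = isolation_forest.predict(x.reshape(-1, 1))
print('outliers found:', np.sum(labels == -1))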

Thursday, April 27, 2017

Solving the Two Spirals problem with Keras

In this post we will see how to create a Multi Layer Perceptron (MLP), one of the most common Neural Network architectures, with Keras. Then, we'll train the MLP to tell apart points from two different spirals in the same space.
To get a sense of the problem, let's first generate the data to train the network:
import numpy as np
import matplotlib.pyplot as plt

def twospirals(n_points, noise=.5):
    """
     Returns the two spirals dataset.
    """
    n = np.sqrt(np.random.rand(n_points,1)) * 780 * (2*np.pi)/360
    d1x = -np.cos(n)*n + np.random.rand(n_points,1) * noise
    d1y = np.sin(n)*n + np.random.rand(n_points,1) * noise
    return (np.vstack((np.hstack((d1x,d1y)),np.hstack((-d1x,-d1y)))), 
            np.hstack((np.zeros(n_points),np.ones(n_points))))

X, y = twospirals(1000)

plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.', label='class 1')
plt.plot(X[y==1,0], X[y==1,1], '.', label='class 2')
plt.legend()
plt.show()

As we can see, this dataset contains two different spirals. This kind of dataset has been named the Worst Dataset Ever! Indeed, telling apart the points of the two spirals is not an easy task if your MLP is not sophisticated enough. Let's build a simple MLP with Keras and see what we can achieve:
from keras.models import Sequential
from keras.layers import Dense

mymlp = Sequential()
mymlp.add(Dense(12, input_dim=2, activation='tanh'))
mymlp.add(Dense(1, activation='sigmoid'))

mymlp.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# trains the model
mymlp.fit(X, y, epochs=150, batch_size=10,  verbose=0)
Here we created a Neural Network with the following structure: 2 inputs (the data is in a 2D space) fully connected to 12 hidden neurons and 1 output. Let's generate some test data and see if our model is able to classify them:
X_test, y_test = twospirals(1000)

yy = np.round(mymlp.predict(X_test).T[0])

plt.subplot(1,2,1)
plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.')
plt.plot(X[y==1,0], X[y==1,1], '.')
plt.subplot(1,2,2)
plt.title('Neural Network result')
plt.plot(X_test[yy==0,0], X_test[yy==0,1], '.')
plt.plot(X_test[yy==1,0], X_test[yy==1,1], '.')
plt.show()

We have the original train set on the left and the results of the Neural Network on the right. It's easy to note that the model misclassified most of the points on the test data. Let's add two hidden layers to our model and see what happens:
mymlp = Sequential()
mymlp.add(Dense(12, input_dim=2, activation='tanh'))
mymlp.add(Dense(12, activation='tanh'))
mymlp.add(Dense(12, activation='tanh'))
mymlp.add(Dense(1, activation='sigmoid'))

mymlp.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Fit the model
mymlp.fit(X, y, epochs=150, batch_size=10,  verbose=0)

yy = np.round(mymlp.predict(X_test).T[0])

plt.subplot(1,2,1)
plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.')
plt.plot(X[y==1,0], X[y==1,1], '.')
plt.subplot(1,2,2)
plt.title('Neural Network result')
plt.plot(X_test[yy==0,0], X_test[yy==0,1], '.')
plt.plot(X_test[yy==1,0], X_test[yy==1,1], '.')
plt.show()

The structure of our Network is now more suited to solve the problem and we see that most of the points used for the test were correctly classified.
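To put a number on this (a quick sketch, not in the original post), we can evaluate the deeper model on the test set using the accuracy metric declared at compile time:
# evaluate returns the loss and the metrics specified in compile()
loss, accuracy = mymlp.evaluate(X_test, y_test, verbose=0)
print('test accuracy: %.2f' % accuracy)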

Saturday, May 21, 2016

An intro to Regression Analysis with Decision Trees

It's been a while since the last posts on this blog, but the Glowing Python is still active and strong! I just decided to publish some of my posts on the Cambridge Coding Academy blog. Here are the links to a series of two posts about Regression Analysis with Decision Trees. In this introduction to Regression Analysis we will see how to use scikit-learn to train Decision Trees to solve a specific problem: "How to predict the number of bikes hired in a bike sharing system on a given day?"

In the first post, we will see how to train a simple Decision Tree to exploit the relation between temperature and bikes hired; this tree will be analysed to explain the result of the training process and gain insights about the data. In the second, we will see how to learn more complex decision trees and how to assess the accuracy of the prediction using cross validation.

Here's a sneak peek of the figures that we will generate: