The Glowing Python: data mining


Sunday, September 15, 2013

Self Organizing Maps

Notice: For an updated tutorial on how to use MiniSom, refer to the examples in the official documentation.

Self Organizing Maps (SOM), also known as Kohonen maps, are a type of Artificial Neural Network able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. In a SOM the neurons are organized in a bidimensional lattice and each neuron is fully connected to all the source nodes in the input layer. An illustration of the SOM by Haykin (1999) is the following:

Each neuron n has an associated vector of weights w_n. Training a SOM involves stepping through several training iterations until the items in your dataset are learnt by the SOM. For each pattern x one neuron n will "win" (which means that w_n is the weight vector most similar to x), and this winning neuron will have its weights adjusted so that it responds more strongly to the input the next time it sees it (which means that the distance between x and w_n will be smaller). As different neurons win for different patterns, their ability to recognize those particular patterns increases. The training algorithm can be summarized as follows:

1. Initialize the weights of each neuron.
2. Initialize t = 0.
3. Randomly pick an input x from the dataset.
4. Determine the winning neuron i as the neuron such that ||x - w_i|| <= ||x - w_n|| for every neuron n.
5. Adapt the weights of each neuron n according to the following rule: w_n(t+1) = w_n(t) + η(t) h(i) (x - w_n(t)).
6. Increment t by 1.
7. If t < t_max, go to step 3.

Here η(t) is called the learning rate and h(i) is called the neighborhood function, which has high values for i and for the neurons close to i on the lattice (a Gaussian centered on i is a good example of a neighborhood function). As t increases, η decreases and h shrinks its spread. This way, at each training step the weights of the neurons close to the winning one are adjusted to have a stronger response to the current pattern. After the training process, the locations of the neurons become ordered and a meaningful coordinate system for the input features is created on the lattice. So, if we consider the coordinates of the winning neuron associated with each pattern, the SOM forms a topographic map of the input patterns.
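The training loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not MiniSom's code: the function name train_som, the linear decay of η and of the Gaussian spread, and the initialization from random patterns are all assumptions made for the example.

```python
import numpy as np

def train_som(data, rows, cols, t_max, eta0=0.5, sigma0=1.0, seed=0):
    """Toy SOM training loop (hypothetical helper, not part of MiniSom)."""
    rng = np.random.default_rng(seed)
    # initialize the weights with randomly chosen patterns (an assumption)
    weights = data[rng.integers(len(data), size=rows * cols)].astype(float)
    weights = weights.reshape(rows, cols, -1)
    # lattice coordinates of every neuron, shape (rows, cols, 2)
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing='ij'), axis=-1)
    for t in range(t_max):
        # randomly pick an input x from the dataset
        x = data[rng.integers(len(data))]
        # the winner i is the neuron whose weights are closest to x
        dist = np.linalg.norm(weights - x, axis=-1)
        i = np.unravel_index(np.argmin(dist), dist.shape)
        # assumed schedule: eta and the spread shrink linearly with t
        decay = 1.0 - t / t_max
        eta = eta0 * decay
        sigma = sigma0 * decay + 1e-3  # small offset avoids a zero spread
        # Gaussian neighborhood centered on the winner
        d2 = ((grid - np.array(i)) ** 2).sum(axis=-1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        # move every neuron toward x, scaled by eta and by h
        weights += eta * h[..., None] * (x - weights)
    return weights
```

Because every update is a weighted average of the old weights and the current pattern, the weights never leave the range spanned by the data, which is an easy property to check when experimenting with the sketch.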

MiniSom is a minimalistic, NumPy-based implementation of the SOM. I made it during the experiments for my thesis in order to have a fully hackable SOM algorithm, and recently I decided to release it on GitHub. The next part of this post will show how to train MiniSom on the Iris Dataset and how to visualize the result. The first step is to import and normalize the data:

from numpy import genfromtxt,array,linalg,zeros,apply_along_axis

# reading the iris dataset in the csv format    
# (downloaded from http://aima.cs.berkeley.edu/data/iris.csv)
data = genfromtxt('iris.csv', delimiter=',',usecols=(0,1,2,3))
# normalization to unity of each pattern in the data
data = apply_along_axis(lambda x: x/linalg.norm(x),1,data)

The snippet above reads the dataset from a CSV and creates a matrix where each row corresponds to a pattern. In this case, we have that each pattern has 4 dimensions. (Note that only the first 4 columns of the file are used because the fifth column contains the labels). The training process can be started as follows:
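As a quick sanity check of the normalization step, the sketch below applies it to three rows typed in by hand (an assumed stand-in for the contents of iris.csv) and compares it with an equivalent vectorized form. After the normalization every row should have unit Euclidean norm.

```python
import numpy as np

# three hand-typed rows, assumed stand-in for the first four columns of iris.csv
sample = np.array([[5.1, 3.5, 1.4, 0.2],
                   [7.0, 3.2, 4.7, 1.4],
                   [6.3, 3.3, 6.0, 2.5]])

# same per-row normalization used in the snippet above
looped = np.apply_along_axis(lambda x: x / np.linalg.norm(x), 1, sample)
# vectorized form: divide each row by its Euclidean norm
vectorized = sample / np.linalg.norm(sample, axis=1, keepdims=True)

# both forms agree, and every row now has norm 1
print(np.allclose(looped, vectorized))
print(np.linalg.norm(vectorized, axis=1))
```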

from minisom import MiniSom
### Initialization and training ###
som = MiniSom(7,7,4,sigma=1.0,learning_rate=0.5)
som.random_weights_init(data)
print("Training...")
som.train_random(data,100) # training with 100 iterations
print("\n...ready!")

Now we have a 7-by-7 SOM trained on our dataset. MiniSom uses a Gaussian as the neighborhood function, and its initial spread is specified with the parameter sigma, while the parameter learning_rate specifies the initial learning rate. The training algorithm decreases both parameters as training progresses. This allows rapid initial training of the neural network, which is then "fine tuned" as training progresses.
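The post does not state the exact schedule used to shrink the two parameters, so the function below is only an assumed form, chosen because it halves the value at the midpoint of training. The name asymptotic_decay is hypothetical; check the MiniSom source for the schedule it actually applies.

```python
def asymptotic_decay(initial_value, t, max_iter):
    """Assumed decay: the value is halved when t reaches max_iter / 2."""
    return initial_value / (1 + t / (max_iter / 2))

# how sigma=1.0 and learning_rate=0.5 would evolve over 100 iterations
for t in (0, 50, 100):
    print(t, asymptotic_decay(1.0, t, 100), asymptotic_decay(0.5, t, 100))
```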
To visualize the result of the training we can plot the average distance map of the weights on the map and the coordinates of the winning neuron associated with each pattern:

from pylab import plot,axis,show,pcolor,colorbar,bone
bone()
pcolor(som.distance_map().T) # distance map as background
colorbar()
# loading the labels
target = genfromtxt('iris.csv',
                    delimiter=',',usecols=(4),dtype=str)
t = zeros(len(target),dtype=int)
t[target == 'setosa'] = 0
t[target == 'versicolor'] = 1
t[target == 'virginica'] = 2
# use different colors and markers for each label
markers = ['o','s','D']
colors = ['r','g','b']
for cnt,xx in enumerate(data):
 w = som.winner(xx) # getting the winner
 # place a marker on the winning position for the sample xx
 plot(w[0]+.5,w[1]+.5,markers[t[cnt]],markerfacecolor='None',
   markeredgecolor=colors[t[cnt]],markersize=12,markeredgewidth=2)
axis([0,som.weights.shape[0],0,som.weights.shape[1]])
show() # show the figure

The result should be like the following:

For each pattern in the dataset the corresponding winning neuron has been marked. Each type of marker represents a class of the iris data (the classes are setosa, versicolor and virginica, represented with red, green and blue respectively). The average distance map of the weights is used as the background (the values are shown in the colorbar on the right). As expected from previous studies on this dataset, the patterns are grouped according to the class they belong to, and a small fraction of Iris virginica is mixed with Iris versicolor.
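To give an idea of what an average distance map contains, here is a rough sketch of one common definition: for every neuron, the mean distance between its weights and those of its lattice neighbours, rescaled so that the largest value is 1. The function average_distance_map is hypothetical and may differ from what som.distance_map() computes internally.

```python
import numpy as np

def average_distance_map(weights):
    """weights has shape (rows, cols, dim); returns a (rows, cols) map.

    Assumes at least two neurons and weights that are not all identical.
    """
    rows, cols, _ = weights.shape
    umatrix = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            # visit the (up to 8) lattice neighbours of neuron (r, c)
            for rr in range(max(r - 1, 0), min(r + 2, rows)):
                for cc in range(max(c - 1, 0), min(c + 2, cols)):
                    if (rr, cc) != (r, c):
                        dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            umatrix[r, c] = np.mean(dists)
    # rescale so that the largest average distance is 1
    return umatrix / umatrix.max()
```

High values mark neurons whose weights differ a lot from those of their neighbours, which is why they tend to appear along the borders between groups of patterns.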

For a more detailed explanation of the SOM algorithm you can look at its inventor's paper.

Friday, May 3, 2013

A new RefCard from the GlowingPython!

Check out the DZone RefCard from the GlowingPython:

Discovering and Visualizing Patterns with Python!

This Refcard is a collection of code examples that introduces the reader to the principal Data Mining tasks using Python. In the RefCard you will find the following contents:

How to import and visualize data.
How to classify and cluster data.
How to discover relationships in the data using regression and correlation measures.
How to reduce the dimensionality of the data in order to compress and visualize the information it brings.
How to analyze structured data with networkx.

Each topic is covered with code examples based on four of the major Python libraries for data analysis and manipulation: numpy, matplotlib, sklearn and networkx. Here is a preview of the first two pages:

Click on the preview to get the RefCard!

Thursday, April 26, 2012

K-Nearest Neighbour Classifier

The Nearest Neighbour Classifier is one of the most straightforward classifiers in the arsenal of machine learning techniques. It performs the classification by identifying the nearest neighbours to a query pattern and using those neighbours to determine the label of the query. The idea behind the algorithm is simple: assign the query pattern to the class which occurs most often among the k nearest neighbours. In this post we'll use the function knn_search(...) that we have seen in the last post to implement a K-Nearest Neighbour Classifier. The implementation of the classifier is as follows:

from numpy import random,argsort,argmax,bincount,int_,array,vstack,round
from pylab import plot,scatter,show

def knn_classifier(x, D, labels, K):
 """ Classify the vector x
     D - data matrix (each column is a pattern).
     labels - class of each pattern.
     K - number of neighbours to use.
     Returns the class label and the neighbours' indexes.
 """
 neig_idx = knn_search(x,D,K)
 counts = bincount(labels[neig_idx]) # voting
 return argmax(counts),neig_idx

Let's test the classifier on some random data:

# generating a random dataset with random labels
data = random.rand(2,150) # random points
labels = int_(round(random.rand(150)*1)) # random labels 0 or 1
x = random.rand(2,1) # random test point

# label assignment using k=5
result,neig_idx = knn_classifier(x,data,labels,5)
print('Label assignment:', result)

# plotting the data and the input pattern
# class 1, red points, class 0 blue points
scatter(data[0,:],data[1,:], c=labels,alpha=0.8)
scatter(x[0],x[1],marker='o',c='g',s=40)
# highlighting the neighbours
plot(data[0,neig_idx],data[1,neig_idx],'o',
  markerfacecolor='None',markersize=15,markeredgewidth=1)
show()

The script will show the following graph:

The query vector is represented with a green point, and we can see that 3 out of the 5 nearest neighbors are red points (label 1) while the remaining 2 are blue (label 0).
The result of the classification will be printed on the console:

Label assignment: 1

As we expected, the green point has been assigned to the class with red markers.
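Since the data above is random, the outcome changes at every run. The sketch below repeats the same idea on a tiny hand-made dataset where the answer can be checked by eye. It is self-contained, so it does not reuse knn_search, and it counts the votes with collections.Counter so that labels do not have to be non-negative integers as bincount requires. The name knn_vote and the data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_vote(x, D, labels, K):
    """Majority vote among the K columns of D closest to the column vector x."""
    dist = np.sqrt(((D - x) ** 2).sum(axis=0))  # one distance per column
    neig_idx = np.argsort(dist)[:K]
    votes = Counter(labels[i] for i in neig_idx)
    return votes.most_common(1)[0][0], neig_idx

# hand-made data: each column is a pattern
D = np.array([[0.0, 0.1, 0.2, 1.0, 1.1, 0.9],
              [0.0, 0.2, 0.1, 1.0, 0.9, 1.2]])
labels = ['blue', 'blue', 'blue', 'red', 'red', 'red']
x = np.array([[0.8], [0.8]])  # query point, close to the 'red' group

label, neig_idx = knn_vote(x, D, labels, 3)
print(label, neig_idx)
```

Here the three nearest neighbours are the last three columns, all labelled 'red', so the query is assigned to that class.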

Saturday, April 14, 2012

k-nearest neighbor search

A k-nearest neighbor search identifies the top k nearest neighbors to a query. The problem is: given a dataset D of vectors in a d-dimensional space and a query point x in the same space, find the k points in D closest to x. The following function performs a k-nearest neighbor search using the Euclidean distance:

from numpy import random,argsort,sqrt
from pylab import plot,show

def knn_search(x, D, K):
 """ find the K nearest neighbours of x among the columns of D """
 ndata = D.shape[1]
 K = K if K < ndata else ndata
 # euclidean distances from the other points
 sqd = sqrt(((D - x[:,:ndata])**2).sum(axis=0))
 idx = argsort(sqd) # sorting
 # return the indexes of K nearest neighbours
 return idx[:K]

The function computes the Euclidean distance between x and every point of D, then returns the indexes of the K points with the smallest distances.
Now, we will test this function on a random bidimensional dataset:

# knn_search test
data = random.rand(2,200) # random dataset
x = random.rand(2,1) # query point

# performing the search
neig_idx = knn_search(x,data,10)

# plotting the data and the input point
plot(data[0,:],data[1,:],'ob',x[0,0],x[1,0],'or')
# highlighting the neighbours
plot(data[0,neig_idx],data[1,neig_idx],'o',
  markerfacecolor='None',markersize=15,markeredgewidth=1)
show()

The result is as follows:

The red point is the query vector and the blue ones represent the data. The blue points surrounded by a black circle are the nearest neighbors.
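The function above sorts all the distances even though only the K smallest are needed. As a possible variation (a sketch, not part of the original post), numpy's argpartition can select the K smallest entries first, so that only those K have to be sorted. The name knn_search_partial is made up for the example, and K is assumed to be at least 1.

```python
import numpy as np

def knn_search_partial(x, D, K):
    """Indexes of the K columns of D closest to x, nearest first."""
    K = min(K, D.shape[1])
    # squared distances give the same ranking as Euclidean distances
    sqd = ((D - x) ** 2).sum(axis=0)
    # argpartition puts the K smallest entries first, in arbitrary order
    candidates = np.argpartition(sqd, K - 1)[:K]
    # sort only the K candidates by distance
    return candidates[np.argsort(sqd[candidates])]

rng = np.random.default_rng(0)
data = rng.random((2, 200))  # each column is a point
x = rng.random((2, 1))       # query point
print(knn_search_partial(x, data, 10))
```

The square root is skipped because it does not change the ordering of the distances.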

Thursday, April 5, 2012

K-means clustering with scipy

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. Given an initial set of K centers, the K-means algorithm alternates the two steps:

for each center we identify the subset of training points (its cluster) that is closer to it than any other center;
the means of each feature for the data points in each cluster are computed, and this mean vector becomes the new center for that cluster.

These two steps are iterated until the centers no longer move or the assignments no longer change. Then, a new point x can be assigned to the cluster of the closest prototype.
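Before turning to a library, the two alternating steps can be sketched directly in NumPy. This is a bare-bones illustration under simple assumptions (initial centers passed in by the caller, an empty cluster leaves its center where it is); the name kmeans_sketch is hypothetical and this is not how scipy implements the algorithm.

```python
import numpy as np

def kmeans_sketch(data, centers, max_iter=100):
    """Alternate the assignment and update steps until the centers stop moving."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # step 1: assign every point (row of data) to its closest center
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        idx = dist.argmin(axis=1)
        # step 2: each center becomes the mean of the points in its cluster
        # (a center with an empty cluster is left where it is)
        new_centers = np.array([data[idx == k].mean(axis=0) if np.any(idx == k)
                                else centers[k] for k in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, idx

# two small, well separated groups of points
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
points = np.vstack((square, square + 10))
centers, idx = kmeans_sketch(points, [[0.0, 0.0], [10.0, 10.0]])
print(centers)
print(idx)
```

With these points the centers settle on the middle of each square after one update, and the second pass only confirms that nothing moves any more.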
The Scipy library provides a good implementation of the K-Means algorithm. Let's see how to use it:

from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()

The result should be as follows:

In this case we split the data into 2 clusters: the blue points have been assigned to the first and the red ones to the second. The squares are the centers of the clusters.
Let's try to split the data into 3 clusters:

# now with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or',
     data[idx==2,0],data[idx==2,1],'og') # third cluster points
plot(centroids[:,0],centroids[:,1],'sm',markersize=8)
show()

This time the result is as follows: