Thursday, April 5, 2012

K- means clustering with scipy

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. Given an initial set of K centers, the K-means algorithm alternates the two steps:
• for each center we identify the subset of training points (its cluster) that is closer to it than any other center;
• the means of each feature for the data points in each cluster are computed, and this mean vector becomes the new center for that cluster.
These two steps are iterated until the centers no longer move or the assignments no longer change. Then, a new point x can be assigned to the cluster of the closest prototype.
The Scipy library provides a good implementation of the K-Means algorithm. Let's see how to use it:
```from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()
```
The result should be as follows:

In this case we splitted the data in 2 clusters, the blue points have been assigned to the first and the red ones to the second. The squares are the centers of the clusters.
Let's see try to split the data in 3 clusters:
```# now with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'or',
data[idx==2,0],data[idx==2,1],'og') # third cluster points
plot(centroids[:,0],centroids[:,1],'sm',markersize=8)
show()
```
This time the the result is as follows:

1. Your blog is awesome and has helped me discover all kinds of useful tidbits in scipy and numpy. Thank you!

2. Thanks for the post. I have used kmeans to identify clusters (rings) in a matrix of sea surface height. The objective is to identify the rings and to determine their centroids. But kmeans requires as input parameter the number of clusters to be sought. That is a problem because I usually do not know previously how many rings will be present in the area. So, I was wondering how to avoid this kmeans limitation. Do you have any idea?
Regards,
Alex

1. Hello Alexandre,
how to find a good choice for k is a studied problem in machine learning. I suggest you to read this paper: http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf

3. from where did you get the module "from scipy.cluster.vq import kmeans,vq"? where can I get it?

this stuff looks great, btw.

1. It come with the scipy installation:

http://docs.scipy.org/doc/scipy/reference/cluster.vq.html#module-scipy.cluster.vq

4. Excellent blog, JG. I love it. I use it and recommend it to others. One question. Many of the k-means tutorials that I've found rely on self-made, perfectly configured data -- i.e., they'll "generate" numbers in just the form necessary to get an easy k-means clustering. Whereas in the real world many of us are using k-means to cluster documents (using NLP) or other information that requires a great deal of work/formatting before we can even use k-means. For example, on documents first one must create a vector (tf-idf for instance) and then complete a similarity measurement (euclidean for instance).

Thus, the greatest challenge to performing a k-means is often just getting the data into a format that the kmeans calls can work with. For instance, tutorials always seem to have their data in the following list-within-a-list format:[[number1, number2], [number1, number2], [number1, number2], [number1, number2]... ]. How would you recommend getting documents (i.e., natural language) into that format? Note that cosine similarity performed on a tf-idf vector always returns a single list, which isn't "clusterable" (to my knowledge).

Any help appreciated and thank you in advance for all of your work.

1. Hello,

You could take a look to the vector space models. Those models make you able to build a fixed length vector using the frequencies of some words you're interested in. This post is a good introduction:

http://pyevolve.sourceforge.net/wordpress/?p=1589

2. I'll check it out. Thanks much, JG.

5. i am not able to view the result what can be done..
show() command is not working

6. Excellent! Help me much to understand k-means clustering.

Is it possible to use this in a capacitated-VRP (vehicle routing problem) ?
In which, each node has "demand", and there is a fixed vehicle's "capacity".
Subject to: sum of all node's demands (in a cluster) is smaller-or-equal to vehicle's capacity.

Any helps very much appreciated. Thank you.

1. I mean as explained in here, "Cluster-First Route-Second Method"
http://neo.lcc.uma.es/vrp/solution-methods/heuristics/cluster-first-route-second-method/

Thanks...

7. This is excellent material, and your code explaining how to use the scipy implementation is beautiful and clear. Just recently I've implemented the algorithm in python myself, it's a lot of fun to play around with configurations to see the clustering in action: http://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

8. Could you please discuss a bit the role of 'whitening' which seems to be kind of highly reccommended by the scipy tutorial?

1. Hi, you can find some general information about the whitening transformation here: http://en.wikipedia.org/wiki/Whitening_transformation

Usually, when you have a data matrix, you use whitening in order to have unit variance across all the samples. This operation could improve the result of the k-means algorithm.

9. Thank you very much for your post, it's very helpful. Here, your dataset has two variables that you partition into 2 and then 3 clusters, so it makes sense to plot the k-means clusters like this, with 1 variable on X-axis and one on the Y-axis. but what about if your dataset has more dimensions? Do you have any suggestions about ways to look at the output under these circumstances?

Thanks,

SS

1. Hi, if you have more than two variables and you need to visualize your data in a 2D space, you need a dimensionality reduction method. Here's an example on how to use Isomap: http://glowingpython.blogspot.co.uk/2012/05/manifold-learning-on-handwritten-digits.html

2. Interesting, I'll check it out. Thanks for the speedy response!

10. This comment has been removed by the author.

11. This post has been very helpful. Thank you very much

12. If I would like say 100 clusters from k-means, too many to do the plotting by hand as you have in this example, how would you visualise/plot the results from the clustering?

1. You have to defined a list with a color for each cluster and plot each cluster separately with a for loop choosing the appropriate color each time.

13. Great post.

Do you know of anything similar that can find clusters in 3D space? For example, I have a set of particles each with x, y, z co-ordinates. How would I go about finding clusters (of various densities) in such a dataset?

Any thoughts or ideas?

1. kmeans works in 3D spaces. You just have to modify the input matrix.

2. Fantastic. Will investigate it further. Thanks.

14. Nice demonstration. But can you tell me how to use this scipy.cluster.vq module to generate codebook for an array of mfcc feature vectors. I've extracted the MFCC feature vectors (13 coefficients) ...now i wish to use vq to perform pattern matching stuff. any idea ?

1. Of course, you could cluster your MFCC vectors in order to find similarities in some parts of your signal. But the applications that you can realize are related to the kind of signal you have (speech, music, ...).

15. This comment has been removed by a blog administrator.

16. Hi I have a problem using k-mean clustering with scipy.

I have a set of data as x-axis and y-axis

[[-0.0365, 0.0121],
[ 0.0623, -0.0019],
[ 0.0352, -0.0007],
[ 0.0609, -0.0096]]

If i use the k-mean function from matlab it clusters it properly(i mean as it is expected) i.e 1st row and last row comes under one cluster and middle two rows comes under one cluster.

But, when i use scipy as it is told in this blog, results are not as expected i.e 1st row comes under one cluster and last 3 rows comes under another cluster. Can any one pls tell me why is it so?

1. Hi Shambhulingayya, just try to plot your data. You'll see that your first sample is an outlier (it's even the only one with a negative x and a positive y) and that the other 3 samples are close each other. The clustering provided by scipy makes more sense to me.

2. Well, BTW FYI, I am using it for pattern recognition. The data sets i mentioned above are the output from the PCA(reduced dimensional dataset).

But, in reality my patterns which i know by it's appearances are, (1st and 4th) are similar and (2nd and 4th are similar).

Why i am doubting this because, if i cluster the same data using matlab then, clustering is happening perfectly right i.e 1st and 4th under one cluster and 2nd and 3rd under another cluster.

Any thoughts in this...?

3. FYI, here r the K-mean output of matlab and python which differ each other for same data set

4. Try to use the same starting centroids and the same number of iterations on both the implementations.

17. This comment has been removed by the author.

18. Very Informative Blog! Could you please tell me how to identify which cluster stands for what? For eg, if I have 4 features for t-shirt sizes (age, weight, height, gender) and if I get 3 clusters, how to find which cluster out of the 3 stands for small size, medium size and large size?

1. The values of the centroids represent your variables. If you look at them you will be able to understand which centroid represents the size of the related cluster.

19. I am new to scipy. Can any one tell me how do I extract the data points belonging to each cluster? Here is the code-

data = vstack(arr1)
centroids,_ = kmeans(data,4)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',data[idx==1,0],data[idx==1,1],'or',data[idx==2,0],data[idx==2,1],'og',data[idx==3,0],data[idx==3,1],'oy')

plot(centroids[:,0],centroids[:,1],'sg',markersize=8)

20. Thank you so much for this writeup. It has helped me dip my toes into kmeans and scipy. I look forward to continuing.

I had one quick question about labeling points. I have k=8 as my best fit for my data, and can differentiate the clusters well. I imported data from a pandas dataframe with an index, then subsequently into a numpy array to perform clustering and plotting-by-idx. Can you suggest a method to take the index.values from the dataframe and label the plot accordingly so I can associate the specific points with their sample of origin?

1. I BDavis, DataFrame has the attribute index which gives what you probably want.

2. Thank you for the prompt reply!

I am able to obtain the ordered index, no problem. I am having difficulty integrating the sample titles contained within with the logical index plotting methods you implement. I was trying to iterate over it in a list, but associating it with the coordinates within the cluster data is elusive.

21. What is the similarity measure of this implementation of k-means? Thank you.

1. Hi, this algorithm uses the euclidian distance.

2. Hey.

Is it possible to use some other metrics like Pearson correlation?
Or can I use precalculated distance matrix for the clustering?

3. Hi L, not with this implementation.

22. Works perfectly for me. thanks

23. hey
Every time you perform the algo, the centroid number happens to change making the plots colouring different at every run. Does anybody knows how to fix that?
Thank you

1. The initial centroid are randomly chosen. You can fix the initial random seed using this function: numpy.random.seed

2. Thanks a lot, It's a nice trick.
Another question, when you get the centroids from the "whiten" data, how do you get back the real values for these centroids? ... maybe not clear as a question

3. It depends on how you "whithen" your data. If you simply removed the mean and divided by the standard deviation, you can just multiply by the old standard deviation and add the old mean again.

4. Ok, I used python whiten which just divide each column by its std ... I thought it subtract the mean firstly to get a standard score... Thats why I couldnt get back to my real data. Thanks a lot !

24. Can you teach me how to do a Texture-­based image segmentation using K­means clustering

1. Hi, have a look at this post: http://glowingpython.blogspot.co.uk/2012/07/color-quantization.html

2. is there any posts about how to create a filter bank by using Scipy function?

3. No, sorry :)

25. Is it possible to use Mahalanobis distance instead euclidean distance for K-means clustering ?

26. Is it possible to use Mahalanobis distance instead euclidean distance for K-means clustering ?

1. Hi pra, scipy doesn't allow you to specify a custom distance function, but you may want to give a look to this answer: http://stackoverflow.com/questions/16274788/k-means-and-mahalanobis-distance

27. Such a great example! Anyway, i have one question. Can we use k-means for clustering a connected undirected graph?

1. There are variants of k-means that work on graphs, I suggestion you to have a look at the library networkx.