Thursday, April 5, 2012

K-means clustering with scipy

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we might think of a cluster as a group of data points whose inter-point distances are small compared with the distances to points outside the cluster. Given an initial set of K centers, the K-means algorithm alternates between two steps:
  • for each center we identify the subset of training points (its cluster) that are closer to it than to any other center;
  • we compute the mean of each feature over the data points in each cluster, and this mean vector becomes the new center of that cluster.
These two steps are iterated until the centers no longer move or the assignments no longer change. Then, a new point x can be assigned to the cluster of the closest prototype.
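
To make these two steps concrete, here is a minimal NumPy sketch of the loop (illustrative only: the function name and the convergence test are mine, and it assumes no cluster ever becomes empty):

from numpy import array, argmin, allclose
from numpy.linalg import norm

def kmeans_sketch(data, centers, max_iter=100):
    for _ in range(max_iter):
        # step 1: assign each point to its closest center
        labels = array([argmin([norm(x - c) for c in centers]) for x in data])
        # step 2: move each center to the mean of its assigned points
        new_centers = array([data[labels == k].mean(axis=0)
                             for k in range(len(centers))])
        if allclose(new_centers, centers):  # centers no longer move
            break
        centers = new_centers
    return centers, labels
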
The SciPy library provides a good implementation of the K-means algorithm. Let's see how to use it:
from pylab import plot, show
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation: two overlapping clouds of 150 random points each
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2)))

# computing K-means with K = 2 (2 clusters); kmeans also returns
# the distortion, which we discard here
centroids, _ = kmeans(data, 2)
# assign each sample to a cluster
idx, _ = vq(data, centroids)

# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
The result should be as follows:

[figure: the two clusters plotted in blue and red, with the cluster centers drawn as green squares]
In this case we split the data into 2 clusters: the blue points have been assigned to the first cluster and the red ones to the second. The squares are the centers of the clusters.
Let's try to split the data into 3 clusters:
# now with K = 3 (3 clusters)
centroids, _ = kmeans(data, 3)
idx, _ = vq(data, centroids)

plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or',
     data[idx == 2, 0], data[idx == 2, 1], 'og')  # third cluster points
plot(centroids[:, 0], centroids[:, 1], 'sm', markersize=8)
show()
This time the result is as follows:

[figure: the three clusters plotted in blue, red and green, with the cluster centers drawn as magenta squares]
59 comments:

  1. Your blog is awesome and has helped me discover all kinds of useful tidbits in scipy and numpy. Thank you!

  2. Thanks for the post. I have used kmeans to identify clusters (rings) in a matrix of sea surface height. The objective is to identify the rings and to determine their centroids. But kmeans requires the number of clusters to be sought as an input parameter. That is a problem because I usually do not know beforehand how many rings will be present in the area. So, I was wondering how to get around this limitation of kmeans. Do you have any idea?
    Regards,
    Alex

    1. Hello Alexandre,
      how to find a good choice for k is a well-studied problem in machine learning. I suggest you read this paper: http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf
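
      A quick heuristic you can also try directly with scipy (separate from the paper above) is the elbow method: run kmeans for several values of k, look at the distortion it returns as its second output, and pick the k where the curve stops dropping quickly. A sketch, reusing data, plot and show from the post:

      distortions = []
      for k in range(1, 10):
          # kmeans returns (centroids, distortion); the distortion is the
          # mean distance between the samples and their closest centroid
          _, distortion = kmeans(data, k)
          distortions.append(distortion)
      plot(range(1, 10), distortions)  # look for the "elbow" in this curve
      show()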

      Thanks for your comment!

  3. Where does the module in "from scipy.cluster.vq import kmeans,vq" come from? Where can I get it?

    this stuff looks great, btw.

    1. It comes with the scipy installation:

      http://docs.scipy.org/doc/scipy/reference/cluster.vq.html#module-scipy.cluster.vq

  4. Excellent blog, JG. I love it. I use it and recommend it to others. One question. Many of the k-means tutorials I've found rely on self-made, perfectly configured data; that is, they "generate" numbers in just the form necessary to get an easy k-means clustering. Whereas in the real world many of us are using k-means to cluster documents (using NLP) or other information that requires a great deal of work/formatting before we can even use k-means. For example, on documents one must first create a vector (tf-idf for instance) and then compute a similarity measurement (Euclidean for instance).

    Thus, the greatest challenge in performing k-means is often just getting the data into a format the kmeans calls can work with. For instance, tutorials always seem to have their data in the following list-within-a-list format: [[number1, number2], [number1, number2], [number1, number2], [number1, number2]... ]. How would you recommend getting documents (i.e., natural language) into that format? Note that cosine similarity performed on a tf-idf vector always returns a single list, which isn't "clusterable" (to my knowledge).

    Any help appreciated and thank you in advance for all of your work.

    1. Hello,

      You could take a look at vector space models. They let you build a fixed-length vector from the frequencies of the words you're interested in. This post is a good introduction:

      http://pyevolve.sourceforge.net/wordpress/?p=1589
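
      To give the idea, here is a deliberately naive sketch that turns a few documents into fixed-length count vectors that kmeans can consume (the documents and the tiny vocabulary are made up; a real system would add tokenization, tf-idf weighting, etc.):

      from numpy import array
      from scipy.cluster.vq import kmeans, vq

      docs = ["the cat sat on the mat",
              "the dog chased the cat",
              "stock prices fell on monday"]
      vocabulary = ["cat", "dog", "stock", "monday", "mat"]

      # one row per document, one column per vocabulary word (raw counts)
      vectors = array([[doc.split().count(word) for word in vocabulary]
                       for doc in docs], dtype=float)

      centroids, _ = kmeans(vectors, 2)
      labels, _ = vq(vectors, centroids)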

    2. I'll check it out. Thanks much, JG.

  5. I am not able to view the result, what can be done?
    The show() command is not working.

  6. Excellent! This helped me a lot to understand k-means clustering.

    Is it possible to use this in a capacitated VRP (vehicle routing problem)?
    In which each node has a "demand" and there is a fixed vehicle "capacity",
    subject to: the sum of all node demands in a cluster is smaller than or equal to the vehicle's capacity.

    Any help is very much appreciated. Thank you.

    1. I mean as explained here, in the "Cluster-First Route-Second Method":
      http://neo.lcc.uma.es/vrp/solution-methods/heuristics/cluster-first-route-second-method/

      Thanks...

  7. This is excellent material, and your code explaining how to use the scipy implementation is beautiful and clear. Just recently I implemented the algorithm in python myself; it's a lot of fun to play around with configurations to see the clustering in action: http://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

  8. Could you please discuss a bit the role of 'whitening', which seems to be highly recommended by the scipy tutorial?

    1. Hi, you can find some general information about the whitening transformation here: http://en.wikipedia.org/wiki/Whitening_transformation

      Usually, when you have a data matrix, you use whitening to give each feature unit variance across the samples. This operation can improve the result of the k-means algorithm.
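
      For example, with the data from the post (a sketch; note that vq and any plotting then operate in the whitened space):

      from scipy.cluster.vq import whiten, kmeans, vq

      data_w = whiten(data)  # divide each column by its standard deviation
      centroids_w, _ = kmeans(data_w, 2)
      idx, _ = vq(data_w, centroids_w)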

  9. Thank you very much for your post, it's very helpful. Here, your dataset has two variables that you partition into 2 and then 3 clusters, so it makes sense to plot the k-means clusters like this, with one variable on the x-axis and one on the y-axis. But what if your dataset has more dimensions? Do you have any suggestions about ways to look at the output in that case?

    Thanks,

    SS

    1. Hi, if you have more than two variables and you need to visualize your data in a 2D space, you need a dimensionality reduction method. Here's an example of how to use Isomap: http://glowingpython.blogspot.co.uk/2012/05/manifold-learning-on-handwritten-digits.html
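
      If you prefer to stay within numpy, PCA is another common choice; a minimal sketch (the function name is mine) that projects the data onto its two leading principal components:

      from numpy import mean, cov, argsort
      from numpy.linalg import eigh

      def pca_2d(data):
          # center the data, then diagonalize its covariance matrix
          centered = data - mean(data, axis=0)
          eigvals, eigvecs = eigh(cov(centered.T))  # eigh: the covariance matrix is symmetric
          order = argsort(eigvals)[::-1]            # components sorted by explained variance
          return centered.dot(eigvecs[:, order[:2]])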

    2. Interesting, I'll check it out. Thanks for the speedy response!

  10. This comment has been removed by the author.

  11. This post has been very helpful. Thank you very much.

  12. If I wanted, say, 100 clusters from k-means (too many to plot by hand as in this example), how would you visualise/plot the results of the clustering?

    1. You have to define a list with a color for each cluster and plot each cluster separately in a for loop, choosing the appropriate color each time, as in the sketch below.
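
      A sketch using a matplotlib colormap, so you don't have to list 100 colors by hand (it assumes data and idx from a kmeans/vq run as in the post, with K clusters):

      from pylab import plot, show, cm

      K = 100
      for i in range(K):
          color = cm.jet(i / float(K))  # pick a distinct color from the colormap
          plot(data[idx == i, 0], data[idx == i, 1], 'o', color=color)
      show()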

  13. Great post.

    Do you know of anything similar that can find clusters in 3D space? For example, I have a set of particles each with x, y, z co-ordinates. How would I go about finding clusters (of various densities) in such a dataset?

    Any thoughts or ideas?

    1. kmeans works in 3D spaces too; you just have to modify the input matrix accordingly, as in the sketch below.
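
      For example, a sketch with random 3-column data standing in for the particle coordinates:

      from numpy.random import rand
      from scipy.cluster.vq import kmeans, vq

      points = rand(100, 3)             # one row per particle: x, y, z
      centroids, _ = kmeans(points, 4)  # 4 clusters in 3D
      idx, _ = vq(points, centroids)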

    2. Fantastic. Will investigate it further. Thanks.

  14. Nice demonstration. But can you tell me how to use the scipy.cluster.vq module to generate a codebook for an array of MFCC feature vectors? I've extracted the MFCC feature vectors (13 coefficients)... now I wish to use vq to perform pattern matching. Any ideas?

    1. Of course, you could cluster your MFCC vectors in order to find similarities in some parts of your signal. But the applications you can build depend on the kind of signal you have (speech, music, ...).

  15. This comment has been removed by a blog administrator.

  16. Hi, I have a problem using k-means clustering with scipy.

    I have a set of data with x and y values:

    [[-0.0365, 0.0121],
    [ 0.0623, -0.0019],
    [ 0.0352, -0.0007],
    [ 0.0609, -0.0096]]

    If I use the k-means function from matlab it clusters the data properly (I mean, as expected), i.e. the 1st row and the last row come under one cluster and the middle two rows come under another cluster.

    But when I use scipy as described in this blog, the results are not as expected, i.e. the 1st row comes under one cluster and the last 3 rows come under another. Can anyone please tell me why?

    Thanks in advance :)

    1. Hi Shambhulingayya, just try to plot your data. You'll see that your first sample is an outlier (it's even the only one with a negative x and a positive y) and that the other 3 samples are close to each other. The clustering provided by scipy makes more sense to me.

    2. Well, BTW FYI, I am using it for pattern recognition. The data set I mentioned above is the output of PCA (the reduced-dimensionality dataset).

      But in reality the patterns, which I know by their appearance, are: the 1st and 4th are similar and the 2nd and 3rd are similar.

      The reason I doubt this is that if I cluster the same data using matlab, the clustering comes out right, i.e. 1st and 4th under one cluster and 2nd and 3rd under another.

      Any thoughts on this...?

    3. FYI, here are the k-means outputs of matlab and python, which differ from each other for the same data set:

      Matlab - https://drive.google.com/file/d/0B4XLQnqhjPJiUWp2MFRLU3F4aWM/view?usp=sharing

      Python - https://drive.google.com/file/d/0B4XLQnqhjPJiaWFGc0hvNDdGM3c/view?usp=sharing

    4. Try to use the same starting centroids and the same number of iterations in both implementations.
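
      In scipy you can pass an array of starting centroids to kmeans instead of the number of clusters; a sketch (the starting values here are made up):

      from numpy import array

      guess = array([[-0.03, 0.01],
                     [ 0.05, 0.00]])
      # when the second argument is an array, kmeans uses it as the initial centroids
      centroids, _ = kmeans(data, guess)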

  17. This comment has been removed by the author.

  18. Very informative blog! Could you please tell me how to identify which cluster stands for what? For example, if I have 4 features for t-shirt sizes (age, weight, height, gender) and I get 3 clusters, how do I find which of the 3 clusters stands for the small, medium and large sizes?

    1. The values of the centroids are expressed in the same units as your variables. If you look at them, you will be able to tell which centroid (and hence which cluster) represents which size.
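
      For example, assuming the feature columns are ordered (age, weight, height, gender), you could rank the clusters by the height column of the centroid matrix:

      from numpy import argsort

      HEIGHT = 2  # assumed position of the height column
      order = argsort(centroids[:, HEIGHT])
      print(order)  # cluster indices sorted from smallest to largest height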

  19. I am new to scipy. Can anyone tell me how I can extract the data points belonging to each cluster? Here is the code:

    data = vstack(arr1)
    centroids,_ = kmeans(data,4)
    idx,_ = vq(data,centroids)

    plot(data[idx==0,0],data[idx==0,1],'ob',
         data[idx==1,0],data[idx==1,1],'or',
         data[idx==2,0],data[idx==2,1],'og',
         data[idx==3,0],data[idx==3,1],'oy')

    plot(centroids[:,0],centroids[:,1],'sg',markersize=8)

  20. Thank you so much for this writeup. It has helped me dip my toes into kmeans and scipy. I look forward to continuing.

    I had one quick question about labeling points. I have k=8 as my best fit for my data, and can differentiate the clusters well. I imported data from a pandas dataframe with an index, then subsequently into a numpy array to perform clustering and plotting-by-idx. Can you suggest a method to take the index.values from the dataframe and label the plot accordingly so I can associate the specific points with their sample of origin?

    1. Hi BDavis, DataFrame has the attribute index, which gives what you probably want.

    2. Thank you for the prompt reply!

      I am able to obtain the ordered index, no problem. I am having difficulty combining the sample titles it contains with the logical-index plotting method you implement. I was trying to iterate over it as a list, but associating it with the coordinates within the cluster data is elusive.

  21. What is the similarity measure of this implementation of k-means? Thank you.

    1. Hi, this algorithm uses the Euclidean distance.

    2. Hey.

      Is it possible to use some other metric, like Pearson correlation?
      Or can I use a precalculated distance matrix for the clustering?

    3. Hi L, not with this implementation.

  22. hey
    Every time you run the algorithm, the centroid numbering happens to change, making the plot colouring different at every run. Does anybody know how to fix that?
    Thank you

    1. The initial centroids are randomly chosen. You can fix the random seed using the function numpy.random.seed.
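
      For example (the seed value is arbitrary):

      from numpy.random import seed

      seed(42)  # fix the seed so every run picks the same initial centroids
      centroids, _ = kmeans(data, 2)
      idx, _ = vq(data, centroids)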

    2. Thanks a lot, it's a nice trick.
      Another question: when you get the centroids from the "whitened" data, how do you get back the real values for these centroids? ... maybe the question is not clear

    3. It depends on how you "whiten" your data. If you simply removed the mean and divided by the standard deviation, you can just multiply by the old standard deviation and add the old mean again.
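
      A sketch (assuming data is the original matrix and centroids_w came from clustering the standardized data):

      std = data.std(axis=0)
      mu = data.mean(axis=0)
      centroids_original = centroids_w * std + mu  # undo (data - mu) / std

      # scipy's whiten only divides by the std, so in that case it's just:
      # centroids_original = centroids_w * std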

    4. Ok, I used scipy's whiten, which just divides each column by its std... I thought it subtracted the mean first to get a standard score... That's why I couldn't get back to my real data. Thanks a lot!

  23. Can you teach me how to do texture-based image segmentation using K-means clustering?

    1. Hi, have a look at this post: http://glowingpython.blogspot.co.uk/2012/07/color-quantization.html

    2. Are there any posts about how to create a filter bank using scipy functions?

  24. Is it possible to use Mahalanobis distance instead of Euclidean distance for K-means clustering?

    1. Hi pra, scipy doesn't allow you to specify a custom distance function, but you may want to have a look at this answer: http://stackoverflow.com/questions/16274788/k-means-and-mahalanobis-distance

  25. Such a great example! Anyway, I have one question: can we use k-means for clustering a connected undirected graph?

    1. There are variants of k-means that work on graphs; I suggest you have a look at the library networkx.

