The Glowing Python: K-means clustering with scipy

Thursday, April 5, 2012

K-means clustering with scipy

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. Given an initial set of K centers, the K-means algorithm alternates between the following two steps:

for each center we identify the subset of training points (its cluster) that are closer to it than to any other center;
the mean of each feature is computed over the data points in each cluster, and this mean vector becomes the new center for that cluster.

These two steps are iterated until the centers no longer move or the assignments no longer change. Then, a new point x can be assigned to the cluster of the closest prototype.
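The two alternating steps can be written down in a few lines of NumPy. What follows is only a minimal sketch for illustration, not the code SciPy uses: the function name `kmeans_sketch`, its parameters, the toy data, and the choice of initial centers are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, centers, max_iter=100):
    """Illustrative K-means loop (hypothetical helper, not SciPy's code)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # step 1: assign each point to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # stop when the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 2: move each center to the mean of its cluster
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers, labels

# two well separated groups of three points each (toy data for this sketch)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
# use one point from each group as the initial centers
centers, labels = kmeans_sketch(points, centers=points[[0, 3]])
```

On this toy data the loop stops on the second pass, because the assignments computed from the updated centers are the same as before.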
The SciPy library provides a good implementation of the K-means algorithm. Let's see how to use it:

from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()

The result should be as follows:

In this case we split the data into 2 clusters: the blue points have been assigned to the first and the red ones to the second. The squares are the centers of the clusters.
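The same `vq` call can also assign points that were not part of the original data to the cluster of the closest center. The sketch below assumes a pair of made-up centroids (they stand in for the output of `kmeans(data,2)`, which differs from run to run) and two made-up new points.

```python
import numpy as np
from scipy.cluster.vq import vq

# hypothetical centroids, standing in for the output of kmeans(data, 2)
centroids = np.array([[0.5, 0.5], [1.0, 1.0]])

# new observations that were not used to compute the centroids
new_points = np.array([[0.4, 0.6], [1.2, 0.9]])

# vq returns, for each point, the index of the nearest centroid
# and the Euclidean distance to it
labels, distances = vq(new_points, centroids)
```

Here the first point falls into cluster 0 and the second into cluster 1. Keep in mind that which cluster ends up as "first" or "second" depends on the order of the centroids returned by `kmeans`, and that order can change between runs.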
Let's try to split the data into 3 clusters:

# now with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or',
     data[idx==2,0],data[idx==2,1],'og') # third cluster points
plot(centroids[:,0],centroids[:,1],'sm',markersize=8)
show()

This time the result is as follows:
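Writing one pair of plot arguments per cluster gets tedious as K grows. A possible way around it, sketched here under the assumption that the data is generated as in the example above (the seeded generator `rng` and the names `clusters` and `sizes` are introduced only for this sketch), is to collect the points of each cluster in a list that can then be looped over.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# same kind of data as in the post; the seed only fixes the data, not kmeans
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + np.array([.5, .5]),
                  rng.random((150, 2))))

K = 3
centroids, _ = kmeans(data, K)
idx, _ = vq(data, centroids)

# one array of points per centroid, in the order of the centroids;
# len(centroids) is used instead of K because kmeans may return fewer centers
clusters = [data[idx == k] for k in range(len(centroids))]
sizes = [len(c) for c in clusters]
```

Each element of `clusters` can then be passed to `plot` inside a loop, one call per cluster, instead of spelling out every cluster by hand.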

59 comments:

Emi, December 14, 2012 at 2:10 AM
Your blog is awesome and has helped me discover all kinds of useful tidbits in scipy and numpy. Thank you!

Alexandre Guerra, March 15, 2013 at 12:42 PM
Thanks for the post. I have used kmeans to identify clusters (rings) in a matrix of sea surface height. The objective is to identify the rings and to determine their centroids. But kmeans requires as input parameter the number of clusters to be sought. That is a problem because I usually do not know previously how many rings will be present in the area. So, I was wondering how to avoid this kmeans limitation. Do you have any idea?
Regards,
Alex

Anonymous, May 4, 2013 at 4:20 PM
Where did you get the module in "from scipy.cluster.vq import kmeans,vq"? Where can I get it?

this stuff looks great, btw.

Anonymous, May 13, 2013 at 11:10 PM
Excellent blog, JG. I love it. I use it and recommend it to others. One question. Many of the k-means tutorials that I've found rely on self-made, perfectly configured data -- i.e., they'll "generate" numbers in just the form necessary to get an easy k-means clustering. Whereas in the real world many of us are using k-means to cluster documents (using NLP) or other information that requires a great deal of work/formatting before we can even use k-means. For example, on documents first one must create a vector (tf-idf for instance) and then complete a similarity measurement (euclidean for instance).

Thus, the greatest challenge to performing a k-means is often just getting the data into a format that the kmeans calls can work with. For instance, tutorials always seem to have their data in the following list-within-a-list format:[[number1, number2], [number1, number2], [number1, number2], [number1, number2]... ]. How would you recommend getting documents (i.e., natural language) into that format? Note that cosine similarity performed on a tf-idf vector always returns a single list, which isn't "clusterable" (to my knowledge).

Any help appreciated and thank you in advance for all of your work.

Anonymous, June 7, 2013 at 11:12 AM
I am not able to view the result. What can be done?
The show() command is not working.

Anonymous, July 28, 2013 at 1:33 PM
Excellent! This helped me a lot in understanding k-means clustering.

Is it possible to use this in a capacitated VRP (vehicle routing problem)?
In that problem, each node has a "demand", and each vehicle has a fixed "capacity".
Subject to: the sum of all node demands in a cluster is smaller than or equal to the vehicle's capacity.

Any help is very much appreciated. Thank you.

Anonymous, December 15, 2013 at 9:55 PM
This is excellent material, and your code explaining how to use the scipy implementation is beautiful and clear. Just recently I've implemented the algorithm in python myself, it's a lot of fun to play around with configurations to see the clustering in action: http://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

Sergeant Hartman, March 25, 2014 at 9:17 AM
Could you please discuss a bit the role of 'whitening', which seems to be highly recommended by the scipy tutorial?

way112, April 8, 2014 at 6:03 PM
Thank you very much for your post, it's very helpful. Here, your dataset has two variables that you partition into 2 and then 3 clusters, so it makes sense to plot the k-means clusters like this, with one variable on the X-axis and one on the Y-axis. But what if your dataset has more dimensions? Do you have any suggestions about ways to look at the output under these circumstances?

Thanks,

SS

Anonymous, June 9, 2014 at 10:04 PM
This post has been very helpful. Thank you very much

Anonymous, August 21, 2014 at 5:54 PM
If I wanted, say, 100 clusters from k-means, too many to plot by hand as you have in this example, how would you visualise/plot the results of the clustering?

Carl Joseph, September 5, 2014 at 12:46 PM
Great post.

Do you know of anything similar that can find clusters in 3D space? For example, I have a set of particles each with x, y, z co-ordinates. How would I go about finding clusters (of various densities) in such a dataset?

Any thoughts or ideas?

Anonymous, October 7, 2014 at 7:00 PM
Nice demonstration. But can you tell me how to use this scipy.cluster.vq module to generate a codebook for an array of MFCC feature vectors? I've extracted the MFCC feature vectors (13 coefficients), and now I wish to use vq to perform pattern matching. Any ideas?

Shambhulingayya.N.D, October 24, 2014 at 7:03 PM
Hi, I have a problem using k-means clustering with scipy.

I have a set of data with x-axis and y-axis values:

[[-0.0365, 0.0121],
[ 0.0623, -0.0019],
[ 0.0352, -0.0007],
[ 0.0609, -0.0096]]

If I use the k-means function from MATLAB, it clusters the data as expected, i.e. the 1st and last rows fall into one cluster and the middle two rows fall into another.

But when I use scipy as described in this blog, the results are not as expected, i.e. the 1st row falls into one cluster and the last 3 rows fall into another. Can anyone please tell me why that is?

Thanks in advance :)

Anonymous, February 26, 2015 at 4:51 PM
Very informative blog! Could you please tell me how to identify which cluster stands for what? For example, if I have 4 features for t-shirt sizes (age, weight, height, gender) and I get 3 clusters, how do I find which of the 3 clusters stands for small, medium and large?

Unknown, February 28, 2015 at 6:43 AM
I am new to scipy. Can anyone tell me how to extract the data points belonging to each cluster? Here is the code:

data = vstack(arr1)
centroids,_ = kmeans(data,4)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',data[idx==1,0],data[idx==1,1],'or',data[idx==2,0],data[idx==2,1],'og',data[idx==3,0],data[idx==3,1],'oy')

plot(centroids[:,0],centroids[:,1],'sg',markersize=8)

BDavis, April 23, 2015 at 9:29 PM
Thank you so much for this writeup. It has helped me dip my toes into kmeans and scipy. I look forward to continuing.

I had one quick question about labeling points. I have k=8 as my best fit for my data, and can differentiate the clusters well. I imported data from a pandas dataframe with an index, then subsequently into a numpy array to perform clustering and plotting-by-idx. Can you suggest a method to take the index.values from the dataframe and label the plot accordingly so I can associate the specific points with their sample of origin?

Anonymous, July 17, 2015 at 10:47 PM
What is the similarity measure of this implementation of k-means? Thank you.

Unknown, September 20, 2015 at 12:15 AM
Works perfectly for me. thanks

Unknown, October 19, 2015 at 5:14 PM
hey
Every time you run the algorithm, the numbering of the centroids can change, so the colouring of the plot is different on every run. Does anybody know how to fix that?
Thank you

Unknown, February 25, 2016 at 6:42 PM
Can you teach me how to do texture-based image segmentation using K-means clustering?

pra-, July 8, 2016 at 5:02 PM
Is it possible to use Mahalanobis distance instead of Euclidean distance for K-means clustering?

Anonymous, June 15, 2017 at 10:47 AM
Such a great example! Anyway, I have one question. Can we use k-means for clustering a connected undirected graph?