import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_blobs from sklearn.metrics import pairwise_distances from sklearn.cluster import KMeans reference = np.random.rand(100, 2) plt.figure(figsize=(12, 3)) for k in range(1,6): kmeans = KMeans(n_clusters=k) a = kmeans.fit_predict(reference) plt.subplot(1,5,k) plt.scatter(reference[:, 0], reference[:, 1], c=a) plt.xlabel('k='+str(k)) plt.tight_layout() plt.show()

From the figure above we can see that the algorithm evenly splits the points K clusters even if there's no separation between them. Let’s now do the same on a target dataset with 3 natural clusters:

X = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=.8,)[0] plt.figure(figsize=(12, 3)) for k in range(1,6): kmeans = KMeans(n_clusters=k) a = kmeans.fit_predict(X) plt.subplot(1,5,k) plt.scatter(X[:, 0], X[:, 1], c=a) plt.xlabel('k='+str(k)) plt.tight_layout() plt.show()

Here we note that the algorithm, with K=2, correctly isolates one of the clusters grouping the other two together. Then, with K=3, correctly identifies the natural clusters. But, with K=4 and K=5 some of the natural clusters are split in two. If we plot the inertia in both cases we'll see something interesting:

def compute_inertia(a, X): W = [np.mean(pairwise_distances(X[a == c, :])) for c in np.unique(a)] return np.mean(W) def compute_gap(clustering, data, k_max=5, n_references=5): if len(data.shape) == 1: data = data.reshape(-1, 1) reference = np.random.rand(*data.shape) reference_inertia = [] for k in range(1, k_max+1): local_inertia = [] for _ in range(n_references): clustering.n_clusters = k assignments = clustering.fit_predict(reference) local_inertia.append(compute_inertia(assignments, reference)) reference_inertia.append(np.mean(local_inertia)) ondata_inertia = [] for k in range(1, k_max+1): clustering.n_clusters = k assignments = clustering.fit_predict(data) ondata_inertia.append(compute_inertia(assignments, data)) gap = np.log(reference_inertia)-np.log(ondata_inertia) return gap, np.log(reference_inertia), np.log(ondata_inertia) k_max = 5 gap, reference_inertia, ondata_inertia = compute_gap(KMeans(), X, k_max) plt.plot(range(1, k_max+1), reference_inertia, '-o', label='reference') plt.plot(range(1, k_max+1), ondata_inertia, '-o', label='data') plt.xlabel('k') plt.ylabel('log(inertia)') plt.show()

On the reference dataset the inertia goes down’ very slowly while on the target dataset it assumes the shape of an elbow! We can now compute the Gap Statistics for each K computing the difference of the two curves showed above:

plt.plot(range(1, k_max+1), gap, '-o') plt.ylabel('gap') plt.xlabel('k')

It’s easy to see that the Gap is maximum for K=3, just the right choice for our target dataset.

For a more formal introduction you can check out the following paper: Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistics Society 2001.

---------------------------------------------------------------------------

ReplyDeleteTypeError Traceback (most recent call last)

in ()

50 return gap, np.log(reference_inertia), np.log(ondata_inertia)

51

---> 52 gap, reference_inertia, ondata_inertia = compute_gap(KMeans())

53

54

TypeError: compute_gap() missing 1 required positional argument: 'data'

Hi Shreeram,

DeleteThanks for reporting this. I've updated the code and you should be now able to run it.

Let me know if you still have problems.

G

thank you so much,it helps me a lot

ReplyDeleteShouldn't the following line

ReplyDelete>> reference = np.random.rand(*data.shape)

be moved into inside of

>> for _ in range(n_references)

loop? Otherwise, taking a mean out of local_interia list is useless.

This was super helpful, thank you so much! Also, the only thing I've found that is easily convertible for other non-KMeans algorithms, as long as we change the n_clusters argument. This was great.

ReplyDeleteI think the reference_inertia should be log first and then mean rather than mean first then log?

ReplyDeleteThank you very much for this code and explanation !

ReplyDeleteI have been looking at the paper you are refering to (Tibshirani et al.), and it seems your code stops too early. The papers states "Finally choose the number of clusters via k such that Gapk(k) >= Gap(k+1) - sk+1." (page 5/13). But, according to your code and explanations, the reader should only find the maximum of the gap. This seems incorrect according to Tibshirani et al.

BR,

Assuming that the curve is elbow shaped, it's the same.

DeleteThank you for your answer.

Deleteyou are are right, when the curved in well "elbow shaped", it works, because it is an easy case where the data are not blured. But it seems to me gap statistics is sophisticated enough to target more complicated cases.

If you reduce the number of data per cluster, or increase the cluster_std, you might obtain elbows not so prominent. However, looking directly at the data will allow your eye to distinguish the clusters easily. The last term in "Gapk(k) >= Gap(k+1) - sk+1." seems to be the keystone of gap statistics, in enabling the algo to precisely distinguish clusters when they are a bit blured.

The meaning of this last term was not easy to get (for me, at least). Reading the answer on the following forum helped me a lot : https://stats.stackexchange.com/questions/95290/how-should-i-interpret-gap-statistic

Anyway, thank you very much for your article !

Glad it helped!

DeleteIt seems that the code is calculating log(average of pairwise distance), vs the paper (https://hastie.su.domains/Papers/gap.pdf) is calculating average(log of W), where W is the sum of square pairwise distance divided by 2*size of the clusters.

ReplyDelete