The Glowing Python: A visual introduction to the Gap Statistics

Tuesday, January 22, 2019

A visual introduction to the Gap Statistics

We have previously seen how to implement KMeans. However, the results of this algorithm strongly depend on the choice of the parameter K. According to statistical folklore, the best K is located at the 'elbow' of the cluster inertia curve as K increases. This heuristic has been translated into a more formal procedure by the Gap Statistics, and in this post we'll see how to use it to pick K. The main idea of the methodology is to compare the cluster inertia on the data to be clustered with the inertia on a reference dataset that has no cluster structure. The optimal choice of K is the value k for which the gap between the two results is maximum. To illustrate this idea, let’s pick as reference dataset a uniformly distributed set of points and see the result of KMeans as K increases:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans


reference = np.random.rand(100, 2)
plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(reference)
    plt.subplot(1,5,k)
    plt.scatter(reference[:, 0], reference[:, 1], c=a)
    plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()

From the figure above we can see that the algorithm evenly splits the points into K clusters even though there is no separation between them. Let’s now do the same on a target dataset with 3 natural clusters:

X = make_blobs(n_samples=100, n_features=2,
               centers=3, cluster_std=.8,)[0]

plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(X)
    plt.subplot(1,5,k)
    plt.scatter(X[:, 0], X[:, 1], c=a)
    plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()

Here we note that the algorithm, with K=2, correctly isolates one of the clusters while grouping the other two together. Then, with K=3, it correctly identifies the natural clusters. But with K=4 and K=5, some of the natural clusters are split in two. If we plot the inertia in both cases we'll see something interesting:

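def compute_inertia(a, X):
    # Dispersion proxy: mean pairwise distance inside each cluster,
    # averaged over clusters (not the sum of squares in KMeans.inertia_).
    W = [np.mean(pairwise_distances(X[a == c, :])) for c in np.unique(a)]
    return np.mean(W)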

def compute_gap(clustering, data, k_max=5, n_references=5):
    if len(data.shape) == 1:
        data = data.reshape(-1, 1)
    reference = np.random.rand(*data.shape)
    reference_inertia = []
    for k in range(1, k_max+1):
        local_inertia = []
        for _ in range(n_references):
            clustering.n_clusters = k
            assignments = clustering.fit_predict(reference)
            local_inertia.append(compute_inertia(assignments, reference))
        reference_inertia.append(np.mean(local_inertia))
    
    ondata_inertia = []
    for k in range(1, k_max+1):
        clustering.n_clusters = k
        assignments = clustering.fit_predict(data)
        ondata_inertia.append(compute_inertia(assignments, data))
        
    gap = np.log(reference_inertia)-np.log(ondata_inertia)
    return gap, np.log(reference_inertia), np.log(ondata_inertia)

k_max = 5
gap, reference_inertia, ondata_inertia = compute_gap(KMeans(), X, k_max)


plt.plot(range(1, k_max+1), reference_inertia,
         '-o', label='reference')
plt.plot(range(1, k_max+1), ondata_inertia,
         '-o', label='data')
plt.xlabel('k')
plt.ylabel('log(inertia)')
plt.legend()  # show the 'reference' and 'data' labels
plt.show()

On the reference dataset the inertia goes down very slowly, while on the target dataset it takes the shape of an elbow! We can now compute the Gap Statistics for each K by taking the difference between the two curves shown above:

plt.plot(range(1, k_max+1), gap, '-o')
plt.ylabel('gap')
plt.xlabel('k')
plt.show()

It’s easy to see that the Gap is maximum for K=3, just the right choice for our target dataset.
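Taking the maximum is the simplest reading of the plot. The paper cited below also estimates a simulation error from several reference datasets and picks the smallest k whose gap is within that error of the next one. The following is only a minimal sketch of that rule under a few assumptions of mine: the helper names (log_inertia, gap_with_std, choose_k) are hypothetical, the dispersion is the inertia_ attribute of KMeans (within-cluster sum of squares) rather than compute_inertia, every reference is drawn uniformly over the bounding box of the data, and the blob centers are fixed only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def log_inertia(data, k, seed=0):
    """Log of the within-cluster sum of squares found by KMeans (assumed dispersion)."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    return np.log(model.inertia_)


def gap_with_std(data, k_max=5, n_references=10, seed=0):
    """Return the gap and its simulation error s_k for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    low, high = data.min(axis=0), data.max(axis=0)
    # Draw a different uniform reference for every repetition.
    references = [rng.uniform(low, high, size=data.shape)
                  for _ in range(n_references)]
    gaps, errors = [], []
    for k in range(1, k_max + 1):
        # Log first, then average over the references.
        ref_logs = np.array([log_inertia(ref, k, seed) for ref in references])
        gaps.append(ref_logs.mean() - log_inertia(data, k, seed))
        errors.append(ref_logs.std() * np.sqrt(1 + 1 / n_references))
    return np.array(gaps), np.array(errors)


def choose_k(gaps, errors):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; argmax if none qualifies."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return i + 1
    return int(np.argmax(gaps)) + 1


# Three well separated blobs, fixed for reproducibility (illustrative values).
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [5, 9]],
                       cluster_std=0.5, random_state=0)
gaps, errors = gap_with_std(X_demo)
print('suggested k:', choose_k(gaps, errors))
```

On well separated data this rule and the plain maximum usually agree; they can differ when the gap curve is nearly flat around its peak, which is where the error term matters.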

For a more formal introduction you can check out the following paper: Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 2001.
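As a quick reference, this is how I read the definitions in that paper; the notation is theirs as far as I recall it, so check it against the original before relying on it. Note that the code in this post uses a simpler dispersion (the mean pairwise distance per cluster) and a single uniform reference in the unit square, so it is an approximation of the idea rather than the exact statistic.

```latex
% Within-cluster dispersion for k clusters C_1, ..., C_k with sizes n_r,
% where d_{ii'} is the squared Euclidean distance between points i and i'
D_r = \sum_{i, i' \in C_r} d_{ii'}, \qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r

% Gap estimated from B reference datasets with dispersions W^{*}_{kb}
\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W^{*}_{kb} - \log W_k

% Simulation error, with sd_k the standard deviation of the \log W^{*}_{kb}
s_k = \mathrm{sd}_k \sqrt{1 + \frac{1}{B}}

% Selection rule
\hat{k} = \text{smallest } k \text{ such that }
\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}
```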

11 comments:

Shreeram Pattanayak, March 13, 2020 at 7:29 AM
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
50 return gap, np.log(reference_inertia), np.log(ondata_inertia)
51
---> 52 gap, reference_inertia, ondata_inertia = compute_gap(KMeans())
53
54

TypeError: compute_gap() missing 1 required positional argument: 'data'

zilinn.wang, August 14, 2020 at 3:54 PM
Thank you so much, it helps me a lot.

Unknown, November 26, 2020 at 10:48 AM
Shouldn't the following line
>> reference = np.random.rand(*data.shape)
be moved inside the
>> for _ in range(n_references)
loop? Otherwise, taking the mean of the local_inertia list is useless.

Unknown, January 29, 2021 at 12:06 AM
This was super helpful, thank you so much! Also, the only thing I've found that is easily convertible for other non-KMeans algorithms, as long as we change the n_clusters argument. This was great.

Unknown, February 21, 2021 at 10:52 AM
I think reference_inertia should take the log first and then the mean, rather than the mean first and then the log?

Anonymous, May 3, 2021 at 6:21 PM
Thank you very much for this code and explanation!
I have been looking at the paper you are referring to (Tibshirani et al.), and it seems your code stops too early. The paper states "Finally choose the number of clusters via k such that Gap(k) >= Gap(k+1) - s_{k+1}." (page 5/13). But, according to your code and explanations, the reader should only find the maximum of the gap. This seems incorrect according to Tibshirani et al.
BR,

Dan, July 19, 2022 at 11:18 PM
It seems that the code is calculating log(average of pairwise distances), whereas the paper (https://hastie.su.domains/Papers/gap.pdf) calculates average(log of W), where W is the sum of squared pairwise distances divided by 2 times the size of the cluster.