Tuesday, January 22, 2019

A visual introduction to the Gap Statistics

We have previously seen how to implement KMeans. However, the results of this algorithm strongly rely on the choice of the parameter K. According to statistical folklore the best K is located at the 'elbow' of the clusters inertia while K increases. This heuristic has been translated into a more formalized procedure by the Gap Statistics and in this post we'll see how to pick K in an optimal way using the Gap Statistics. The main idea of the methodology is to compare the clusters inertia on the data to cluster and a reference dataset. The optimal choice of K is given by k for which the gap between the two results is maximum. To illustrate this idea, let’s pick as reference dataset a uniformly distributed set of points and see the result of KMeans increasing K:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans

reference = np.random.rand(100, 2)
plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(reference)
    plt.scatter(reference[:, 0], reference[:, 1], c=a)

From the figure above we can see that the algorithm evenly splits the points K clusters even if there's no separation between them. Let’s now do the same on a target dataset with 3 natural clusters:

X = make_blobs(n_samples=100, n_features=2,
               centers=3, cluster_std=.8,)[0]

plt.figure(figsize=(12, 3))
for k in range(1,6):
    kmeans = KMeans(n_clusters=k)
    a = kmeans.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=a)

Here we note that the algorithm, with K=2, correctly isolates one of the clusters grouping the other two together. Then, with K=3, correctly identifies the natural clusters. But, with K=4 and K=5 some of the natural clusters are split in two. If we plot the inertia in both cases we'll see something interesting:

def compute_inertia(a, X):
    W = [np.mean(pairwise_distances(X[a == c, :])) for c in np.unique(a)]
    return np.mean(W)

def compute_gap(clustering, data, k_max=5, n_references=5):
    if len(data.shape) == 1:
        data = data.reshape(-1, 1)
    reference = np.random.rand(*data.shape)
    reference_inertia = []
    for k in range(1, k_max+1):
        local_inertia = []
        for _ in range(n_references):
            clustering.n_clusters = k
            assignments = clustering.fit_predict(reference)
            local_inertia.append(compute_inertia(assignments, reference))
    ondata_inertia = []
    for k in range(1, k_max+1):
        clustering.n_clusters = k
        assignments = clustering.fit_predict(data)
        ondata_inertia.append(compute_inertia(assignments, data))
    gap = np.log(reference_inertia)-np.log(ondata_inertia)
    return gap, np.log(reference_inertia), np.log(ondata_inertia)

k_max = 5
gap, reference_inertia, ondata_inertia = compute_gap(KMeans(), X, k_max)

plt.plot(range(1, k_max+1), reference_inertia,
         '-o', label='reference')
plt.plot(range(1, k_max+1), ondata_inertia,
         '-o', label='data')

On the reference dataset the inertia goes down’ very slowly while on the target dataset it assumes the shape of an elbow! We can now compute the Gap Statistics for each K computing the difference of the two curves showed above:

plt.plot(range(1, k_max+1), gap, '-o')

It’s easy to see that the Gap is maximum for K=3, just the right choice for our target dataset.

For a more formal introduction you can check out the following paper: Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistics Society 2001.


  1. ---------------------------------------------------------------------------
    TypeError Traceback (most recent call last)
    in ()
    50 return gap, np.log(reference_inertia), np.log(ondata_inertia)
    ---> 52 gap, reference_inertia, ondata_inertia = compute_gap(KMeans())

    TypeError: compute_gap() missing 1 required positional argument: 'data'

    1. Hi Shreeram,

      Thanks for reporting this. I've updated the code and you should be now able to run it.

      Let me know if you still have problems.


  2. thank you so much,it helps me a lot

  3. Shouldn't the following line
    >> reference = np.random.rand(*data.shape)
    be moved into inside of
    >> for _ in range(n_references)
    loop? Otherwise, taking a mean out of local_interia list is useless.

  4. This was super helpful, thank you so much! Also, the only thing I've found that is easily convertible for other non-KMeans algorithms, as long as we change the n_clusters argument. This was great.

  5. I think the reference_inertia should be log first and then mean rather than mean first then log?

  6. Thank you very much for this code and explanation !
    I have been looking at the paper you are refering to (Tibshirani et al.), and it seems your code stops too early. The papers states "Finally choose the number of clusters via k such that Gapk(k) >= Gap(k+1) - sk+1." (page 5/13). But, according to your code and explanations, the reader should only find the maximum of the gap. This seems incorrect according to Tibshirani et al.

    1. Assuming that the curve is elbow shaped, it's the same.

    2. Thank you for your answer.
      you are are right, when the curved in well "elbow shaped", it works, because it is an easy case where the data are not blured. But it seems to me gap statistics is sophisticated enough to target more complicated cases.
      If you reduce the number of data per cluster, or increase the cluster_std, you might obtain elbows not so prominent. However, looking directly at the data will allow your eye to distinguish the clusters easily. The last term in "Gapk(k) >= Gap(k+1) - sk+1." seems to be the keystone of gap statistics, in enabling the algo to precisely distinguish clusters when they are a bit blured.
      The meaning of this last term was not easy to get (for me, at least). Reading the answer on the following forum helped me a lot : https://stats.stackexchange.com/questions/95290/how-should-i-interpret-gap-statistic
      Anyway, thank you very much for your article !

  7. It seems that the code is calculating log(average of pairwise distance), vs the paper (https://hastie.su.domains/Papers/gap.pdf) is calculating average(log of W), where W is the sum of square pairwise distance divided by 2*size of the clusters.