## Tuesday, January 22, 2019

### A visual introduction to the Gap Statistics

We have previously seen how to implement KMeans. However, the results of this algorithm strongly rely on the choice of the parameter K. According to statistical folklore the best K is located at the 'elbow' of the clusters inertia while K increases. This heuristic has been translated into a more formalized procedure by the Gap Statistics and in this post we'll see how to pick K in an optimal way using the Gap Statistics. The main idea of the methodology is to compare the clusters inertia on the data to cluster and a reference dataset. The optimal choice of K is given by k for which the gap between the two results is maximum. To illustrate this idea, let’s pick as reference dataset a uniformly distributed set of points and see the result of KMeans increasing K:

```import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans

reference = np.random.rand(100, 2)
plt.figure(figsize=(12, 3))
for k in range(1,6):
kmeans = KMeans(n_clusters=k)
a = kmeans.fit_predict(reference)
plt.subplot(1,5,k)
plt.scatter(reference[:, 0], reference[:, 1], c=a)
plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()
```

From the figure above we can see that the algorithm evenly splits the points K clusters even if there's no separation between them. Let’s now do the same on a target dataset with 3 natural clusters:

```X = make_blobs(n_samples=100, n_features=2,
centers=3, cluster_std=.8,)

plt.figure(figsize=(12, 3))
for k in range(1,6):
kmeans = KMeans(n_clusters=k)
a = kmeans.fit_predict(X)
plt.subplot(1,5,k)
plt.scatter(X[:, 0], X[:, 1], c=a)
plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()
```

Here we note that the algorithm, with K=2, correctly isolates one of the clusters grouping the other two together. Then, with K=3, correctly identifies the natural clusters. But, with K=4 and K=5 some of the natural clusters are split in two. If we plot the inertia in both cases we'll see something interesting:

```def compute_inertia(a, X):
W = [np.mean(pairwise_distances(X[a == c, :])) for c in np.unique(a)]
return np.mean(W)

def compute_gap(clustering, data, k_max=5, n_references=5):
if len(data.shape) == 1:
data = data.reshape(-1, 1)
reference = np.random.rand(*data.shape)
reference_inertia = []
for k in range(1, k_max+1):
local_inertia = []
for _ in range(n_references):
clustering.n_clusters = k
assignments = clustering.fit_predict(reference)
local_inertia.append(compute_inertia(assignments, reference))
reference_inertia.append(np.mean(local_inertia))

ondata_inertia = []
for k in range(1, k_max+1):
clustering.n_clusters = k
assignments = clustering.fit_predict(data)
ondata_inertia.append(compute_inertia(assignments, data))

gap = np.log(reference_inertia)-np.log(ondata_inertia)
return gap, np.log(reference_inertia), np.log(ondata_inertia)

k_max = 5
gap, reference_inertia, ondata_inertia = compute_gap(KMeans(), X, k_max)

plt.plot(range(1, k_max+1), reference_inertia,
'-o', label='reference')
plt.plot(range(1, k_max+1), ondata_inertia,
'-o', label='data')
plt.xlabel('k')
plt.ylabel('log(inertia)')
plt.show()
```

On the reference dataset the inertia goes down’ very slowly while on the target dataset it assumes the shape of an elbow! We can now compute the Gap Statistics for each K computing the difference of the two curves showed above:

```plt.plot(range(1, k_max+1), gap, '-o')
plt.ylabel('gap')
plt.xlabel('k')
```

It’s easy to see that the Gap is maximum for K=3, just the right choice for our target dataset.

For a more formal introduction you can check out the following paper: Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistics Society 2001.

1. ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
50 return gap, np.log(reference_inertia), np.log(ondata_inertia)
51
---> 52 gap, reference_inertia, ondata_inertia = compute_gap(KMeans())
53
54

TypeError: compute_gap() missing 1 required positional argument: 'data'

1. Hi Shreeram,

Thanks for reporting this. I've updated the code and you should be now able to run it.

Let me know if you still have problems.

G

2. thank you so much,it helps me a lot

3. Shouldn't the following line
>> reference = np.random.rand(*data.shape)
be moved into inside of
>> for _ in range(n_references)
loop? Otherwise, taking a mean out of local_interia list is useless.

4. This was super helpful, thank you so much! Also, the only thing I've found that is easily convertible for other non-KMeans algorithms, as long as we change the n_clusters argument. This was great.

5. I think the reference_inertia should be log first and then mean rather than mean first then log?

6. Thank you very much for this code and explanation !
I have been looking at the paper you are refering to (Tibshirani et al.), and it seems your code stops too early. The papers states "Finally choose the number of clusters via k such that Gapk(k) >= Gap(k+1) - sk+1." (page 5/13). But, according to your code and explanations, the reader should only find the maximum of the gap. This seems incorrect according to Tibshirani et al.
BR,

1. Assuming that the curve is elbow shaped, it's the same.

2. 3. 7. 