Friday, June 4, 2021

The Central Limit Theorem, a hands-on introduction

The central limit theorem can be informally summarized in a few words: the sum of n independent samples x1, x2, ... xn from the same distribution is approximately normally distributed, provided that n is big enough and that the distribution has a finite variance. To show this experimentally, let's define a function that sums n samples from the same distribution and repeats the operation 100000 times:
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt

def sum_random_variables(*args, sp_distribution, n):
    # returns an array of 100000 values, each one being the sum
    # of n random samples drawn from sp_distribution;
    # *args are the positional parameters of the distribution
    v = [sp_distribution.rvs(*args, size=100000) for _ in range(n)]
    return np.sum(v, axis=0)
This function takes as input the parameters of the distribution, the SciPy object that implements the distribution, and n. It returns an array of 100000 elements, where each element is the sum of n samples. Given the Central Limit Theorem, we expect the output values to be approximately normally distributed if n is big enough. To verify this, let's consider a beta distribution with parameters alpha=1 and beta=2, run our function for increasing values of n, and plot the histogram of the output values:
plt.figure(figsize=(9, 3))
N = 5
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(1, 2, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
On the far left we have the histogram with n=1, the one with n=2 right next to it, and so on until n=4. With n=1 we have the original distribution, which is heavily skewed. With n=2 the distribution is less skewed. When we reach n=4 the distribution is almost symmetrical, resembling a normal distribution.
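The loss of skewness described above can also be measured instead of judged by eye. The following is a minimal sketch, assuming that the sample skewness computed by scipy.stats.skew is an acceptable summary of asymmetry; the helper name skewness_of_sum and the fixed seed are illustrative choices, not part of the original experiment.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

def skewness_of_sum(n, size=100000):
    # hypothetical helper: draw n independent batches from Beta(1, 2),
    # sum them element-wise and return the sample skewness of the sums
    samples = sps.beta.rvs(1, 2, size=(n, size), random_state=rng)
    return sps.skew(samples.sum(axis=0))

skewness = [skewness_of_sum(n) for n in range(1, 5)]
for n, s in zip(range(1, 5), skewness):
    print(f"n={n}: skewness={s:.3f}")
```

The printed values should shrink toward 0 as n grows, since the skewness of a sum of n independent and identically distributed variables is the skewness of a single variable divided by the square root of n.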

Let's do the same experiment using a uniform distribution (a beta distribution with alpha=1 and beta=1 is uniform on the interval [0, 1]):
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(1, 1, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
Here, for n=2 the distribution is already symmetrical, resembling a triangle, and increasing n further we get closer to the shape of a Gaussian.
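Since the uniform distribution is symmetrical from the start, skewness says little here. A possible alternative check, sketched below, looks at the excess kurtosis, which is 0 for a Gaussian. The helper name uniform_sum and the seed are assumptions made for this illustration.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

def uniform_sum(n, size=100000):
    # hypothetical helper: sum of n independent Uniform(0, 1) samples,
    # drawn as Beta(1, 1) like in the experiment above
    return sps.beta.rvs(1, 1, size=(n, size), random_state=rng).sum(axis=0)

excess_kurtosis = []
for n in range(1, 5):
    s = uniform_sum(n)
    # scipy.stats.kurtosis returns the excess (Fisher) kurtosis by default
    excess_kurtosis.append(sps.kurtosis(s))
    print(f"n={n}: mean={s.mean():.3f} (theory {n / 2:.3f}), "
          f"var={s.var():.3f} (theory {n / 12:.3f}), "
          f"excess kurtosis={excess_kurtosis[-1]:.3f}")
```

The mean and variance should stay close to n/2 and n/12, while the excess kurtosis starts near -1.2 for a single uniform variable and moves toward 0 as n increases.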

The same behaviour can be shown for discrete distributions. Here's what happens if we use the Bernoulli distribution:
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(.5, sp_distribution=sps.bernoulli, n=n)
    plt.hist(s, bins=n+1, density=True, rwidth=.7)
We see again that for n=2 the distribution starts to be symmetrical and that the shape of a Gaussian is almost clear for n=4.
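For the Bernoulli case the distribution of the sum is known exactly: the sum of n Bernoulli samples with parameter p follows a Binomial(n, p) distribution. The sketch below, which uses illustrative variable names and a fixed seed, compares the observed frequencies for n=4 with the binomial probabilities and with the density of a Gaussian having the same mean and variance.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
n, p = 4, 0.5

# each column holds n Bernoulli samples, so each sum is a value in 0..n
s = sps.bernoulli.rvs(p, size=(n, 100000), random_state=rng).sum(axis=0)

values = np.arange(n + 1)
empirical = np.bincount(s, minlength=n + 1) / s.size
binomial = sps.binom.pmf(values, n, p)
# Gaussian with the mean and variance of the binomial distribution
gaussian = sps.norm.pdf(values, loc=n * p, scale=np.sqrt(n * p * (1 - p)))

for k in values:
    print(f"k={k}: empirical={empirical[k]:.3f}, "
          f"binomial={binomial[k]:.3f}, gaussian={gaussian[k]:.3f}")
```

The empirical frequencies should match the binomial probabilities closely, while the Gaussian density is only a rough approximation at such a small n.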

Wednesday, April 7, 2021

A Simple model that earned a Silver medal in predicting the results of the NCAAW tournament

This year I decided to join the March Machine Learning Mania 2021 - NCAAW challenge on Kaggle. The goal is to predict the outcome of each game in the NCAAW basketball tournament, which is the women's college tournament. Participants assign a probability to each outcome and are ranked on the leaderboard according to the accuracy of their predictions. One of the most attractive elements of the challenge is that the leaderboard is updated after each game throughout the tournament.

Since I have limited knowledge of basketball I decided to use a minimalistic model:
  • It uses three features that are easy to interpret: seed, percentage of victories, and the average score of each team.
  • It is based on Linear Regression, and it's tuned to predict extreme probability values only for games that are easy to predict.
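To give an idea of what such a model could look like, here is a minimal sketch on synthetic data. It is not the notebook's code: the feature construction (differences between the two teams), the generated outcomes, and the clipping bounds 0.05 and 0.95 are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
n_games = 500

# Synthetic stand-in data, NOT real tournament data: differences
# (team A minus team B) in seed, win percentage and average score.
seed_diff = rng.integers(-15, 16, size=n_games)
win_pct_diff = rng.normal(0, 0.15, size=n_games)
score_diff = rng.normal(0, 8, size=n_games)
X = np.column_stack([seed_diff, win_pct_diff, score_diff])

# Synthetic outcomes: a lower seed number makes a win for team A more likely.
p_true = 1 / (1 + np.exp(0.25 * seed_diff - 3 * win_pct_diff - 0.05 * score_diff))
y = (rng.random(n_games) < p_true).astype(float)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n_games), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# The output of a linear regression is unbounded, so it is clipped to keep
# it a valid probability and to avoid overconfident predictions.
proba = np.clip(A @ coef, 0.05, 0.95)
print("coefficients (intercept, seed, win pct, score):", np.round(coef, 3))
```

With this setup, only games with a large gap between the two teams end up with a probability near the clipping bounds, which is one possible way to reserve extreme predictions for easy games.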
The following visualizations give insight into how the model estimates the winning probability in a game between two teams:

Surprisingly, this model ranked 46th out of 451 submissions, placing itself in the top 11% of the leaderboard and earning a silver medal!

The notebook with the solution and some more charts can be found here.

Wednesday, November 11, 2020

Visualize the Dictionary of Obscure Words with T-SNE

I recently published a Python wrapper around The Dictionary of Obscure Words (originally from this website), and in this post we'll see how to create a visualization that highlights a few entries from the dictionary using the dimensionality reduction technique called T-SNE. The dictionary is available on GitHub at this address and can be installed as follows:
pip install git+
We can now import the dictionary and create a vectorial representation of each word:
import matplotlib.pyplot as plt
import numpy as np
from obscure_words import load_obscure_words
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.manifold import TSNE

obscure_dict = load_obscure_words()
words = np.array(list(obscure_dict.keys()))
definitions = np.array(list(obscure_dict.values()))

vectorizer = TfidfVectorizer(stop_words=None)
X = vectorizer.fit_transform(definitions)

projector = TSNE(random_state=0)
XX = projector.fit_transform(X)
In the snippet above, we compute a Tf-Idf representation using the definition of each word. This gives us a vector for each word in our dictionary, but each of these vectors has as many elements as the total number of distinct words used in all the definitions. Since we can't plot all the extracted features, we reduce our data to 2 dimensions using T-SNE. We now have a mapping that allows us to place each word at a point in a two-dimensional space. There's one problem remaining: how can we plot the words in a way that we can still read them? Here's a solution:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

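def textscatter(x, y, text, k=10):
    X = np.array([x, y]).T
    # standardize the coordinates and cluster them with K-Means
    scaler = StandardScaler()
    clustering = KMeans(n_clusters=k)
    clustering.fit(scaler.fit_transform(X))
    # map the cluster centers back to the original coordinates
    centers = scaler.inverse_transform(clustering.cluster_centers_)
    # for each center, index of the closest point
    selected = np.argmin(pairwise_distances(X, centers), axis=0)
    plt.scatter(x, y, s=6, c=clustering.predict(scaler.transform(X)), alpha=.05)
    for i in selected:
        plt.text(x[i], y[i], text[i], fontsize=10)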

plt.figure(figsize=(16, 16))
textscatter(XX[:, 0], XX[:, 1], 
            [w+'\n'+d for w, d in zip(words, definitions)], 20)
In the function textscatter we segment all the points created in the previous steps into k clusters using K-Means, then for each cluster we plot the word closest to its center (together with its definition). Given the properties of K-Means we know that the centers are distant from each other, and with the right choice of k we can maximize the number of words we can display. This is the result of the snippet above:
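The selection step used in textscatter, picking the point closest to each cluster center, can be tried in isolation. The sketch below uses three synthetic blobs instead of the T-SNE output; the data, the seed, and the parameters n_init=10 and random_state=0 are assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# three well separated synthetic blobs of 50 points each
blob_centers = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([c + rng.normal(size=(50, 2)) for c in blob_centers])

clustering = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# distances has one row per point and one column per center, so argmin
# along axis 0 gives, for each center, the index of the closest point
distances = pairwise_distances(points, clustering.cluster_centers_)
selected = np.argmin(distances, axis=0)

for i in selected:
    print(f"point {i} at {np.round(points[i], 2)} "
          f"represents cluster {clustering.labels_[i]}")
```

Each selected index identifies one representative point per cluster, which is the point that textscatter would label in the chart.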