Sunday, April 5, 2020

What makes a word beautiful?

What makes a word beautiful? Answering this question is not easy because of the inherent complexity and ambiguity in defining what it means to be beautiful. Let's tackle the question with a quantitative approach introducing the Aesthetic Potential, a metric that aims to quantify the beaty of a word w as follows:

where w+ is a word labelled as beautifu, w- as ugly and the function s is a similarity function between two words. In a nutshell, AP is the difference of the average similarity to beautiful words minus the average similarity to ugly words. This metric is positive for beautiful words and negative for ugly ones.
Before we can compute the Aesthetic Potential we need a similarity function s and a set of words labeled as beautiful and ugly. The similarity function that we will use considers the similarity of two words as the maximum Lin similarity between all the synonyms in WordNet of the two words in input (I will not introduce WordNet or the Lin similarity for brevity, but the curious reader is invited to follow the links above). Here's the Python implementation:
import numpy as np
from itertools import product
from nltk.corpus import wordnet, wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

def similarity(word1, word2):
    returns the similarity between word1 and word2 as the maximum
    Lin similarity between all the synsets of the two words.
    syns1 = wordnet.synsets(word1)
    syns2 = wordnet.synsets(word2)
    sims = []
    for sense1, sense2 in product(syns1, syns2):
        if sense1._pos == sense2._pos and not sense1._pos in ['a', 'r', 's']:
            d = wordnet.lin_similarity(sense1, sense2, brown_ic)
    if len(sims) > 0 or not np.all(np.isnan(sims)):        
        return np.nanmax(sims)
    return 0 # no similarity

print('s(cat, dog) =', similarity('cat', 'dog'))
print('s(cat, bean) = ', similarity('cat', 'bean'))
print('s(coffee, bean) = ', similarity('coffee', 'bean'))
s(cat, dog) = 0.8768009843733973
s(cat, bean) = 0.3079964716744931
s(coffee, bean) = 0.788150820826125
This function returns a value between 0 and 1. High values indicate that the two words are highly similar and low values indicate that there's no similarity. Looking at the output of the function three pairs of test words we note that the function considers "cat" and "dog" fairly similar while "dog" and "bean" not similar. Finally, "coffee" and "bean" are considered similar but not as similar as "cat" and "dog".
Now we need some words labeled as beautiful and some as ugly. Here I propose two lists of words inspired by the ones used in (Jacobs, 2017) for the German language:
beauty = ['amuse',  'art', 'attractive',
          'authentic', 'beautiful', 'beauty',
          'bliss', 'cheerful', 'culture',
          'delight', 'emotion', 'enjoyment',
          'enthusiasm', 'excellent', 'excited',
          'fascinate', 'fascination', 'flower',
          'fragrance', 'good', 'grace',
          'graceful', 'happy', 'heal',
          'health', 'healthy', 'heart',
          'heavenly', 'hope', 'inspire',
          'light', 'joy', 'love',
          'lovely', 'lullaby', 'lucent',
          'loving', 'luck', 'magnificent',
          'music', 'muse', 'life',
          'paradise', 'perfect', 'perfection',
          'picturesque', 'pleasure',
          'poetic', 'poetry', 'pretty',
          'protect', 'protection',
          'rich', 'spring', 'smile',
          'summer', 'sun', 'surprise',          
          'wealth', 'wonderful']

ugly = ['abuse', 'anger', 'imposition', 'anxiety',
        'awkward', 'bad', 'unlucky', 'blind',
        'chaotic', 'crash', 'crazy',
        'cynical', 'dark', 'disease',
        'deadly', 'decrepit', 'death',
        'despair', 'despise', 'disgust',
        'dispute', 'depression', 'dull',
        'evil', 'fail', 'hate',
        'hideous', 'horrible', 'horror',
        'haunted', 'illness', 'junk',
        'kill', 'less',
        'malicious', 'misery', 'murder',
        'nasty', 'nausea', 'pain',
        'piss', 'poor', 'poverty',
        'puke', 'punishment', 'rot',
        'scam', 'scare', 'shame',
        'spoil', 'spite', 'slaughter',
        'stink', 'terrible', 'trash',
        'trouble', 'ugliness', 'ugly',
        'unattractive', 'virus']
A remark is necessary here. The AP strongly depends on these two lists and the fact that I made them on my own strongly biases the results towards my personal preferences. If you're interested on a more general approach to label your data, the work published by Westbury et all in 2014 is a good place to start.
We now have all the pieces to compute our Aesthetic Potential:
def aesthetic_potential(word, beauty, ugly):
    returns the aesthetic potential of word
    beauty and ugly must be lists of words
    labelled as beautiful and ugly respectively
    b = np.nanmean([similarity(word, w) for w in beauty])
    u = np.nanmean([similarity(word, w) for w in ugly])
    return (b - u)*100

print('AP(smile) =', aesthetic_potential('smile', beauty, ugly))
print('AP(conjuncture) =', aesthetic_potential('conjuncture', beauty, ugly))
print('AP(hassle) =', aesthetic_potential('hassle', beauty, ugly))
AP(smile) = 2.6615214570040195
AP(conjuncture) = -3.418813636728729e-299
AP(hassle) = -2.7675826881674497
It is a direct implementation of the equation introduced above, the only difference is that the result is multiplied by 100 to have the metric in percentage for readability purposes. Looking at the results we see that the metric is positive for the word "smile", indicating that the word tends toward the beauty side. It's negative for "hassle", meaning it tends to the ugly side. It's 0 for "conjuncture", meaning that we can consider it a neutral word. To better understand these results we can compute the metric for a set of words and plot it agains the probability of a value of the metric:
test_words = ['hate', 'rain', #'snow', 
         'earth', 'love', 'child', #'clarinettist',
         'sun', 'patience', #'smile', 'touch',
         'coffee', 'regret', #'shepherd', 'man',
         'depression', 'obscure', 'bat', 'woman',
         'dull', 'nothing', 'disillusion',
         'abort', 'blurred', 'cruelness', #'hassle',
         'stalking', 'relevance', #'infected', 
         'conjuncture', 'god', 'moon', #'tortoise',
         'humorist', 'idea', 'poisoning']

ap = [aesthetic_potential(w.lower(), beauty, ugly) for w in test_words]

from scipy.stats import norm
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex, LinearSegmentedColormap, Normalize
%matplotlib inline

p_score = norm.pdf(ap, loc=0.0, scale=0.7) #params estimated on a larger sample
p_score = p_score / p_score.sum()

normalizer = Normalize(vmin=-10, vmax=10)
colors = ['crimson', 'crimson', 'silver', 'deepskyblue', 'deepskyblue']
cmap = LinearSegmentedColormap.from_list('beauty', colors=colors)

plt.figure(figsize=(8, 12))
plt.title('What makes a word beautiful?',
          loc='left', color='gray', fontsize=22)
plt.scatter(p_score, ap, c='gray', marker='.', alpha=.6)
for prob, potential, word in zip(p_score, ap, test_words):
    plt.text(prob, potential, word.lower(),
             fontsize=(np.log10(np.abs(potential)+2))*30, alpha=.8,
plt.text(-0.025, 6, 'beautiful', va='center',
         fontsize=20, rotation=90, color='deepskyblue')
plt.text(-0.025, -6, 'ugly', va='center',
         fontsize=20, rotation=90, color='crimson')
plt.xlabel('P(Aesthetic Potential)', fontsize=20)
plt.ylabel('Aesthetic Potential', fontsize=20)
plt.gca().tick_params(axis='both', which='major', labelsize=14)

This chart gives us a better insight on the meaning of the values we just computed. We note that high probability values are around 0, hence most words in the vocabulary are neutral. Values above 2 and below -2 have a quite low probability, this tells us that words associated with these values have a strong Aesthetic Potential. From this chart we can see that the words "idea" and "sun" are considered beautiful while "hate" and "poisoning" are ugly (who would disagree with that :).

Tuesday, March 17, 2020

Ridgeline plots in pure matplotlib

A Ridgeline plot (also called Joyplot) allows us to compare several statistical distributions. In this plot each distribution is shown with a density plot, and all the distributions are aligned to the same horizontal axis and, sometimes, presented with a slight overlap.

There are many options to make a Ridgeline plot in Python (joypy being one of them) but I decided to make my own function using matplotlib to have full flexibility and minimal dependencies:
from scipy.stats.kde import gaussian_kde
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

def ridgeline(data, overlap=0, fill=True, labels=None, n_points=150):
    Creates a standard ridgeline plot.

    data, list of lists.
    overlap, overlap between distributions. 1 max overlap, 0 no overlap.
    fill, matplotlib color to fill the distributions.
    n_points, number of points to evaluate each distribution function.
    labels, values to place on the y axis to describe the distributions.
    if overlap > 1 or overlap < 0:
        raise ValueError('overlap must be in [0 1]')
    xx = np.linspace(np.min(np.concatenate(data)),
                     np.max(np.concatenate(data)), n_points)
    curves = []
    ys = []
    for i, d in enumerate(data):
        pdf = gaussian_kde(d)
        y = i*(1.0-overlap)
        curve = pdf(xx)
        if fill:
            plt.fill_between(xx, np.ones(n_points)*y, 
                             curve+y, zorder=len(data)-i+1, color=fill)
        plt.plot(xx, curve+y, c='k', zorder=len(data)-i+1)
    if labels:
        plt.yticks(ys, labels)
The function takes in input a list of datasets where each dataset contains the values to derive a single distribution. Each distribution is estimated using Kernel Density Estimation, just as we've seen previously, and plotted increasing the y value.

Let's generate data from few normal distributions with different means and have a look at the output of the function:
data = [norm.rvs(loc=i, scale=2, size=50) for i in range(8)]
ridgeline(data, overlap=.85, fill='y')

Not too bad, we can clearly see that each distribution has a different mean. Let's apply the function on real world data:
import pandas as pd
data_url = ''
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
                       names=['year', 'month', 'day', 'decimal', 'ppm', 
                       'days', '1_yr_ago',  '10_yr_ago', 'since_1800'])
co2_data = co2_data[co2_data.year >= 2000]
co2_data = co2_data[co2_data.year != 2020]

plt.figure(figsize=(8, 10))
grouped = [(y, g.ppm.dropna().values) for y, g in co2_data.groupby('year')]
years, data = zip(*grouped)
ridgeline(data, labels=years, overlap=.85, fill='tomato')
plt.title('Distribution of CO2 levels per year since 2000',
          loc='left', fontsize=18, color='gray')
plt.xlim((co2_data.ppm.min(), co2_data.ppm.max()))
plt.ylim((0, 3.1))

In the snippet above we downloaded the measurements of the concentration of CO2 in the atmosphere, the same data was also used here, and grouped the values by year. Then, we generated a Ridgeline plot that shows the distribution of CO2 levels each year since 2000. We easily note that the average concentration went from 370ppm to 420pmm gradually increasing over the 19 years abserved. We also note that the span of each distribution is approximatively 10ppm.

Wednesday, September 11, 2019

Organizing movie covers with Neural Networks

In this post we will see how to organize a set of movie covers by similarity on a 2D grid using a particular type of Neural Network called Self Organizing Map (SOM). First, let's load the movie covers of the top 100 movies according to IMDB (the files can be downloaded here) and convert the images in samples that we can use to feed the Neural Network:
import numpy as np
import imageio
from glob import glob
from sklearn.preprocessing import StandardScaler

# covers of the top 100 movies on 
# (the 13th of August 2019)
# images downloaded from
data = []
all_covers = glob('movie_covers/*.jpg')
for cover_jpg in all_covers:
    cover = imageio.imread(cover_jpg)
original_shape = imageio.imread(all_covers[0]).shape

scaler = StandardScaler()
data = scaler.fit_transform(data)
In the snippet above we load every image and for each of them we stack the color values of each pixel in a one dimensional vector. After loading all the images a standard scaling is applied to have all the values with mean 0 and standard deviation equal to 1. This scaling strategies often turns out to be quite successful when working with SOMs. Now we can train our model:
from minisom import MiniSom

w = 10
h = 10
som = MiniSom(h, w, len(data[0]), learning_rate=0.5,
              sigma=3, neighborhood_function='triangle')

som.train_random(data, 2500, verbose=True)
win_map = som.win_map(data)
Here we use Minisom, a lean implementation of the SOM, to implement a 10-by-10 map of neurons. Each movie cover is mapped in a neuron and we can display the results as follows:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(figsize=(30, 20))
grid = ImageGrid(fig, 111,
                 nrows_ncols=(h, w), axes_pad=0)

def place_image(i, img):
    img = (scaler.inverse_transform(img)).astype(int)

to_fill = []
collided = []

for i in range(w*h):
    position = np.unravel_index(i, (h, w))
    if position in win_map:
        img = win_map[position][0]
        collided += win_map[position][1:]
        place_image(i, img)

collided = collided[::-1]
for i in to_fill:
    position = np.unravel_index(i, (h, w))
    img = collided.pop()
    place_image(i, img)
Since some images can be mapped in the same neuron, we first draw all the covers picking only one per neuron, then we fill the empty spaces of the map with covers that have been mapped in nearby neurons but have not been plotted yet.

This is the result:

Where to go next:
  • Read more about how Self Organizing Maps work here.
  • Check out how to install Minisom here.