    import pandas as pd
    import matplotlib.pyplot as plt

    # Serve times from the fivethirtyeight data repository
    url = 'https://raw.githubusercontent.com/fivethirtyeight'
    url += '/data/master/tennis-time/serve_times.csv'
    event = pd.read_csv(url)

    # Histogram of the time elapsed before the next serve
    plt.hist(event.seconds_before_next_point, bins=10)
    plt.xlabel('Seconds before next serve')
    plt.show()
The histogram reveals some interesting aspects of the distribution: the data is slightly skewed to the right and, on average, the server takes about 20 seconds. However, we can't tell how many serves happen before 10 seconds or after 35. Of course, one could increase the number of bins in the histogram, but this would lead to a chart that is not particularly elegant and that might hide other details.
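To see why the bins hide this information, here is a minimal sketch using NumPy alone. The sample is synthetic (an assumption made so the snippet runs without downloading anything), so the numbers it prints are not those of the real data set. It shows that the bin edges are derived from the minimum and maximum of the data, so thresholds such as 10 or 35 seconds usually fall inside a bin, while the exact tail counts are still available from the raw samples.

```python
import numpy as np

# Synthetic stand-in for the serve times (an assumption, not the real data).
rng = np.random.default_rng(0)
seconds = rng.gamma(shape=16.0, scale=1.25, size=120)

# Same binning rule that plt.hist(..., bins=10) relies on:
# 10 equal-width bins between the minimum and the maximum.
counts, edges = np.histogram(seconds, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} s: {n} serves")

# The bars only give per-bin totals; exact tail counts need the raw samples.
n_fast = int((seconds < 10).sum())
n_slow = int((seconds > 35).sum())
print("before 10 s:", n_fast, "| after 35 s:", n_slow)
```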
To have a better understanding of the situation we can draw a scatter plot of the variable we are studying:
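    import numpy as np
    # gaussian_kde is imported from scipy.stats; the old
    # scipy.stats.kde path is not available in recent SciPy releases
    from scipy.stats import gaussian_kde

    def distribution_scatter(x, symmetric=True, cmap=None, size=None):
        """
        Plot the distribution of x showing all the points.
        The x axis represents the samples in x
        and the y axis is a function of the probability of x
        and a random assignment.

        Returns the position on the y axis.
        """
        # Estimate the density of the samples
        pdf = gaussian_kde(x)
        # One random weight per sample, in [0, 1)
        w = np.random.rand(len(x))
        if symmetric:
            # Rescale the weights to [-1, 1) to spread points on both sides
            w = w*2-1
        # The higher the density at x, the wider the vertical spread
        pseudo_y = pdf(x) * w
        if cmap:
            plt.scatter(x, pseudo_y, c=x, cmap=cmap, s=size)
        else:
            plt.scatter(x, pseudo_y, s=size)
        return pseudo_y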
In this chart each sample is represented by a point, and the spread of the points in the y direction depends on the estimated probability density at that value. In this case we can easily see that 4 serves happened before 10 seconds and 3 after 35.
Since we're not really interested in the values on the y axis but only in the spread, we can remove the axis and add a few details about the outliers to enrich the chart:
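The way the vertical spread follows the density can be checked numerically. The sketch below repeats the core of the function without plotting, on a synthetic normal sample (an assumption chosen only for illustration). It verifies that every point stays inside the envelope given by the estimated density, which is why dense regions look wide and the tails look thin.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic sample (an assumption): 200 values centred on 20 seconds.
rng = np.random.default_rng(1)
x = rng.normal(loc=20.0, scale=5.0, size=200)

# Estimated density at each sample, then a random weight in [-1, 1).
density = gaussian_kde(x)(x)
w = rng.random(x.size) * 2 - 1
pseudo_y = density * w

# Each point lies between -density and +density.
envelope_ok = bool(np.all(np.abs(pseudo_y) <= density))

# The available vertical room is larger near the centre than in the tails.
center = np.abs(x - 20.0) < 2.5
tail = np.abs(x - 20.0) > 7.5
print(envelope_ok, density[center].mean(), density[tail].mean())
```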
    url = 'https://raw.githubusercontent.com/fivethirtyeight'
    url += '/data/master/tennis-time/serve_times.csv'
    event = pd.read_csv(url)

    plt.figure(figsize=(7, 11))
    title = 'Time in seconds between'
    title += '\nend of marked point and next serve'
    title += '\nat 2015 French Open'
    plt.title(title, loc='left', fontsize=18, color='gray')
    py = distribution_scatter(event.seconds_before_next_point, cmap='cool')

    # Samples above the 98th percentile are treated as outliers
    cut_h = np.percentile(event.seconds_before_next_point, 98)
    outliers = event.seconds_before_next_point > cut_h

    # Label each outlier with the name of the server.
    # x is never negative here, so the labels are always left-aligned.
    ha = {True: 'right', False: 'left'}
    for x, y, c in zip(event[outliers].seconds_before_next_point,
                       py[outliers],
                       event[outliers].server):
        plt.text(x, y+.0005, c,
                 ha=ha[x<0], va='bottom', fontsize=12)

    plt.xlabel('Seconds before next serve', fontsize=15)
    # Hide the y axis: only the spread matters, not the values
    plt.gca().spines['left'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.yticks([])
    plt.xticks(np.arange(5, 41, 5))
    plt.xlim([5, 40])
    plt.show()
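The outlier selection used for the labels can be tried in isolation. This is a small sketch on a hypothetical table: the player names and times below are made up for illustration and only the column names match the real data. It shows how the 98th percentile threshold and the boolean mask pick the rows to annotate.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table (invented values, same column names as the data).
event = pd.DataFrame({
    "server": ["Player A", "Player B", "Player C", "Player D", "Player E",
               "Player F", "Player G", "Player H", "Player I", "Player J"],
    "seconds_before_next_point": [18, 21, 19, 24, 22, 20, 38, 17, 23, 25],
})

# Threshold at the 98th percentile, then a boolean mask over the rows.
cut_h = np.percentile(event["seconds_before_next_point"], 98)
outliers = event["seconds_before_next_point"] > cut_h

# Names of the servers that would be written next to their points.
labels = event.loc[outliers, "server"].tolist()
print(cut_h, labels)
```

With so few rows only the slowest serve exceeds the threshold; on a larger data set the same mask selects roughly the top 2% of the samples.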
Where to go next:
- The data used in this post has also been presented in this article on fivethirtyeight.com.
- If you found this visualization technique useful, you can check out the seaborn documentation about swarm plots.