The Glowing Python: a collection of sloppy snippets for scientific computing and data visualization in Python.
<br/><br/>
<b>The Central Limit Theorem, a hands-on introduction</b>
<br/><br/>
The central limit theorem can be informally summarized in a few words: the sum of <i>x<sub>1</sub>, x<sub>2</sub>, ... x<sub>n</sub></i> samples from the same distribution is normally distributed, provided that <i>n</i> is big enough and that the distribution has a finite variance.
To show this in an experimental way, let's define a function that sums <i>n</i> samples from the same distribution, repeating the process 100,000 times:
<pre class="prettyprint">
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
def sum_random_variables(*args, sp_distribution, n):
    # returns the sum of n random samples
    # drawn from sp_distribution
    v = [sp_distribution.rvs(*args, size=100000) for _ in range(n)]
    return np.sum(v, axis=0)
</pre>
This function takes as input the parameters of the distribution, the scipy.stats object that implements the distribution, and <i>n</i>. It returns an array of 100,000 elements, where each element is the sum of <i>n</i> samples. Given the Central Limit Theorem, we expect the values in output to be normally distributed if <i>n</i> is big enough.
To verify this, let's consider a beta distribution with parameters alpha=1 and beta=2, run our function increasing n and plot the histogram of the values in output:
<pre class="prettyprint">
plt.figure(figsize=(9, 3))
N = 5
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(1, 2, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
plt.tight_layout()
</pre>
<div class="separator" style="clear: both;"><a href="https://1.bp.blogspot.com/-5IrDjChsBNs/YKTdjZB3ZLI/AAAAAAAAB9E/kgBA_HP15j09ixfEEn07837p5QNSQaN2ACLcBGAsYHQ/s0/clt_beta.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="208" data-original-width="568" src="https://1.bp.blogspot.com/-5IrDjChsBNs/YKTdjZB3ZLI/AAAAAAAAB9E/kgBA_HP15j09ixfEEn07837p5QNSQaN2ACLcBGAsYHQ/s0/clt_beta.png"/></a></div>
On the far left we have the histogram with n=1, the one with n=2 right next to it, and so on until n=4. With n=1 we have the original distribution, which is heavily skewed. With n=2 we have a distribution which is less skewed. When we reach n=4 we see that the distribution is almost symmetrical, resembling a normal distribution.
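<br/><br/>
To make this visual impression more quantitative, we can also check how the skewness of the sums shrinks as <i>n</i> grows (a quick sketch reusing the function defined above):
<pre class="prettyprint">
for n in range(1, 5):
    s = sum_random_variables(1, 2, sp_distribution=sps.beta, n=n)
    print('n=%d skewness=%.3f' % (n, sps.skew(s)))
</pre>
The Central Limit Theorem suggests that the skewness should get closer and closer to 0 (the skewness of a Gaussian) as <i>n</i> increases.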
<br/><br/>
Let's do the same experiment using a uniform distribution:
<pre class="prettyprint">
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    # a beta distribution with alpha=1 and beta=1 is the uniform distribution on [0, 1]
    s = sum_random_variables(1, 1, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
plt.tight_layout()
</pre>
<div class="separator" style="clear: both;"><a href="https://1.bp.blogspot.com/-XfAW-zGK6J0/YKTdoaNrUEI/AAAAAAAAB9I/kyMY_F47hFMBoaTWDUu9-w43BPn6m1JtACLcBGAsYHQ/s0/clt_uniform.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="208" data-original-width="568" src="https://1.bp.blogspot.com/-XfAW-zGK6J0/YKTdoaNrUEI/AAAAAAAAB9I/kyMY_F47hFMBoaTWDUu9-w43BPn6m1JtACLcBGAsYHQ/s0/clt_uniform.png"/></a></div>
Here, already for n=2 the distribution is symmetrical, resembling a triangle, and increasing n further we get closer to the shape of a Gaussian.
<br/><br/>
The same behaviour can be shown for discrete distributions. Here's what happens if we use the Bernoulli distribution:
<pre class="prettyprint">
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(.5, sp_distribution=sps.bernoulli, n=n)
    plt.hist(s, bins=n+1, density=True, rwidth=.7)
plt.tight_layout()
</pre>
<div class="separator" style="clear: both;"><a href="https://1.bp.blogspot.com/-h3E_TOEVC7c/YKTd2r3NlgI/AAAAAAAAB9Q/hJj0_35J3WMNsgCF77rfXtqOgIoSRophwCLcBGAsYHQ/s0/clt_bernoulli.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="208" data-original-width="568" src="https://1.bp.blogspot.com/-h3E_TOEVC7c/YKTd2r3NlgI/AAAAAAAAB9Q/hJj0_35J3WMNsgCF77rfXtqOgIoSRophwCLcBGAsYHQ/s0/clt_bernoulli.png"/></a></div>
We see again that for n=2 the distribution starts to be symmetrical and that the shape of a Gaussian is almost clear for n=4.
<br/><br/>
<b>A Simple model that earned a Silver medal in predicting the results of the NCAAW tournament</b>
<br/><br/>
This year I decided to join the <a href="https://www.kaggle.com/c/ncaaw-march-mania-2021/leaderboard" target="_blank">March Machine Learning Mania 2021 - NCAAW</a> challenge on Kaggle. The goal is to predict the outcome of each game of the NCAAW tournament, the women's college basketball tournament. Participants assign a probability to each outcome and are ranked on the leaderboard according to the accuracy of their predictions. One of the most attractive elements of the challenge is that the leaderboard is updated after each game throughout the tournament.
<br /><br />
Since I have limited knowledge of basketball I decided to use a minimalistic model:
<ul>
<li>It uses three features that are easy to interpret: seed, percentage of victories, and the average score of each team.</li>
<li>It is based on Linear Regression, and it's tuned to predict extreme probability values only for games that are easy to predict (a rough sketch of this kind of model is shown right after this list).</li>
</ul>
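To give an idea of the overall shape of such a model, here is a minimal sketch: a Linear Regression trained on the differences between the two teams, with the predictions squeezed into a sensible probability range. The feature values, the toy data and the clipping thresholds below are made up for illustration and are not the ones used in the actual submission:
<pre class="prettyprint">
import numpy as np
from sklearn.linear_model import LinearRegression

# each row describes a game through the difference between the two teams:
# seed difference, difference in percentage of victories, difference in average score
X_train = np.array([[-10,  0.30,  12.0],
                    [  3, -0.05,  -2.0],
                    [  8, -0.20,  -9.0],
                    [ -1,  0.02,   1.5]])
# 1 if the first team won, 0 otherwise
y_train = np.array([1, 0, 0, 1])

model = LinearRegression()
model.fit(X_train, y_train)

# squeeze the raw output into a valid probability range so that
# extreme values survive only for games that look easy to call
proba = np.clip(model.predict(X_train), 0.05, 0.95)
print(proba)
</pre>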
The following visualizations give insight into how the model estimates the winning probability in a game between two teams:
<table style="width: 100%;">
<tbody><tr>
<td>
<div class="separator" style="clear: both;"><a href="https://1.bp.blogspot.com/-tKJaHX_psoQ/YGxHA9z1A7I/AAAAAAAAB6s/47sMY0STH4AFGksE9pgG0AP-o4hlhDeVQCLcBGAsYHQ/s0/ncaa_model.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="442" data-original-width="581" height="191" src="https://1.bp.blogspot.com/-tKJaHX_psoQ/YGxHA9z1A7I/AAAAAAAAB6s/47sMY0STH4AFGksE9pgG0AP-o4hlhDeVQCLcBGAsYHQ/w251-h191/ncaa_model.png" width="251" /></a></div>
</td>
<td>
<div class="separator" style="clear: both;"><a href="https://1.bp.blogspot.com/-THFpHtk4euQ/YG1gyZLL0AI/AAAAAAAAB60/iVbrf_dM0_0xu8NbtvGJLnHKABaCrNk1ACLcBGAsYHQ/s0/ncaa_model2.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="438" data-original-width="567" height="190" src="https://1.bp.blogspot.com/-THFpHtk4euQ/YG1gyZLL0AI/AAAAAAAAB60/iVbrf_dM0_0xu8NbtvGJLnHKABaCrNk1ACLcBGAsYHQ/w246-h190/ncaa_model2.png" width="246" /></a></div>
</td>
</tr>
</tbody></table>
<br />
Surprisingly, this model ranked 46th out of 451 submissions, placing itself in the top 11% of the leaderboard and earning a silver medal!
<br /><br />
The notebook with the solution and some more charts can be found <a href="https://colab.research.google.com/drive/1GRhTlpxp2qePwGOKKDq4y9HAAqS4E_MC?usp=sharing" target="_blank">here</a>.
<br/><br/>
<b>Visualize the Dictionary of Obscure Words with T-SNE</b>
I recently published a wrapper around The Dictionary of Obscure Words (originally from this website <a href="http://phrontistery.info">http://phrontistery.info</a>) for Python, and in this post we'll see how to create a visualization that highlights a few entries from the dictionary using the dimensionality reduction technique called T-SNE. The dictionary is available on github at this address <a href="https://github.com/JustGlowing/obscure_words" target="_blank">https://github.com/JustGlowing/obscure_words</a> and can be installed as follows:
<pre>pip install git+https://github.com/JustGlowing/obscure_words
</pre>
We can now import the dictionary and create a vectorial representation of each word:
<pre class="prettyprint">import matplotlib.pyplot as plt
import numpy as np
from obscure_words import load_obscure_words
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.manifold import TSNE
obscure_dict = load_obscure_words()
words = np.array(list(obscure_dict.keys()))
definitions = np.array(list(obscure_dict.values()))
vectorizer = TfidfVectorizer(stop_words=None)
X = vectorizer.fit_transform(definitions)
projector = TSNE(random_state=0)
XX = projector.fit_transform(X)
</pre>
In the snippet above, we compute a <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf#:~:text=In%20information%20retrieval%2C%20tf%E2%80%93idf,in%20a%20collection%20or%20corpus.">Tf-Idf</a> representation using the definition of each word. This gives us a vector for each word in our dictionary, but each of these vectors has as many elements as the total number of words used in all the definitions. Since we can't plot all the features extracted, we reduce our data to 2 dimensions using <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">T-SNE</a>. We now have a mapping that allows us to place each word in a point of a bi-dimensional space. There's one problem remaining: how can we plot the words in a way that we can still read them? Here's a solution:
<pre class="prettyprint">from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
def textscatter(x, y, text, k=10):
    X = np.array([x, y]).T
    clustering = KMeans(n_clusters=k)
    scaler = StandardScaler()
    clustering.fit(scaler.fit_transform(X))
    centers = scaler.inverse_transform(clustering.cluster_centers_)
    selected = np.argmin(pairwise_distances(X, centers), axis=0)
    plt.scatter(x, y, s=6, c=clustering.predict(scaler.transform(X)), alpha=.05)
    for i in selected:
        plt.text(x[i], y[i], text[i], fontsize=10)

plt.figure(figsize=(16, 16))
textscatter(XX[:, 0], XX[:, 1],
            [w+'\n'+d for w, d in zip(words, definitions)], 20)
plt.show()
</pre>
In the function <b>textscatter</b> we segment all the points created at the previous step into k clusters using K-Means, then we plot the word related to the center of each cluster (and also its definition). Given the properties of K-Means we know that the centers are distant from each other and, with the right choice of k, we can maximize the number of words we can display. This is the result of the snippet above:
<div class="separator" style="clear: both;"><a href="https://1.bp.blogspot.com/-6dwQuHKewjo/X6qJS7xkrgI/AAAAAAAABy8/KNby6HPGqisBuWmS25J-vy1MgEpMpu2lQCLcBGAsYHQ/s1002/Screenshot%2B2020-11-10%2Bat%2B12.36.05.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="619" data-original-width="516" src="https://1.bp.blogspot.com/-wn6p4nYLJj4/X6qLAhVMuyI/AAAAAAAABzI/wSy7O3B8fdAlnxL2KfikJERvO2fzuw1CACLcBGAsYHQ/s0/Screenshot%2B2020-11-10%2Bat%2B12.43.26.png" /></a></div><div style="text-align: center;">(click on the figure to see the entire chart)</div>JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com1tag:blogger.com,1999:blog-1693014329567144872.post-14401452363791852232020-06-29T12:01:00.000+01:002020-06-29T12:05:42.235+01:00Solving the Travelling Salesman Problem with MiniSomHave you ever heard of the Travelling Salesman Problem? I'm pretty sure you do, but let's refresh our mind looking at its formulation: "Given a list of points and the distances between each pair of points, what is the shortest possible path that visits each point and returns to the starting point?". <br />
What makes this problem so famous and so studied is the fact that it has no "quick" solution, as the complexity of calculating the best path increases as more points are added. And the complexity increases so fast that, even with modern hardware, it can be impossible to compute an exact solution in a reasonable time. In more rigorous terms, it is an <a href="https://en.wikipedia.org/wiki/NP-hardness">NP-hard</a> problem.
Many heuristics are known to solve this problem and in this post we will see a solution based on <a href="https://glowingpython.blogspot.com/2013/09/self-organizing-maps.html">Self-organizing Maps</a> (SOM).
A SOM is a Neural Network that is capable of mapping an input point into a bi-dimensional space, placing points that are close to each other into the same area. Hence, the idea to solve our problem is to train the SOM to map the points to visit onto a one-dimensional map and to visit the points going from the one mapped to the first cell (the one on the left) to the one mapped to the last cell (on the right). Points that are mapped to the same cell are visited consecutively.
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-vAT3H88AyVw/XvhFZWHGsUI/AAAAAAAABtU/Q9OEsKhdotAHZ9mcIHl1XIJqHUV54YpgACLcBGAsYHQ/s1600/somTSP_diag.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="292" data-original-width="475" src="https://3.bp.blogspot.com/-vAT3H88AyVw/XvhFZWHGsUI/AAAAAAAABtU/Q9OEsKhdotAHZ9mcIHl1XIJqHUV54YpgACLcBGAsYHQ/s1600/somTSP_diag.png" /></a></div>
<br />
Let's generate a set of points to test this idea:
<br />
<pre class="prettyprint">import numpy as np
import matplotlib.pyplot as plt
np.random.seed(10)
N_points = 20
N_neurons = N_points*2
t = np.linspace(0, np.pi*2, N_points)
x = np.cos(t)+(np.random.rand(N_points)-.5)*.3
y = np.sin(t)*.8+(np.random.rand(N_points)-.5)*.2
points = np.array([x,y]).T
plt.scatter(x, y)
</pre>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-oSX_fZXQMHM/XvhF3U874dI/AAAAAAAABtc/cnNAmUwoUtgUedm4RiMInTW5KBMyjJ-iwCLcBGAsYHQ/s1600/somTSP_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="251" data-original-width="392" src="https://1.bp.blogspot.com/-oSX_fZXQMHM/XvhF3U874dI/AAAAAAAABtc/cnNAmUwoUtgUedm4RiMInTW5KBMyjJ-iwCLcBGAsYHQ/s1600/somTSP_1.png" /></a></div>
<br />
We can now import MiniSom, our favorite implementation of Self-Organizing Maps, and see what path it's able to produce:
<br />
<pre class="prettyprint">from minisom import MiniSom
som = MiniSom(1, N_neurons*2, 2, sigma=10,
neighborhood_function='gaussian', random_seed=50)
max_iter = 2000
som.pca_weights_init(points)
paths_x = []
paths_y = []
for i in np.arange(max_iter):
i_point = i % len(points)
som.update(points[i_point], som.winner(points[i_point]), i, max_iter)
visit_order = np.argsort([som.winner(p)[1] for p in points])
visit_order = np.concatenate((visit_order, [visit_order[0]]))
paths_x.append(points[visit_order][:,0])
paths_y.append(points[visit_order][:,1])
plt.scatter(x, y, label='point to visit')
plt.plot(paths_x[-1], paths_y[-1],
'C3', linewidth=2, label='path')
</pre>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-uyqoOmNuB3Q/XvhGggP8bGI/AAAAAAAABto/62j50ZK3W48f66vxSjB_00gFMyQJmwrLQCLcBGAsYHQ/s1600/somTSP_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="250" data-original-width="389" src="https://1.bp.blogspot.com/-uyqoOmNuB3Q/XvhGggP8bGI/AAAAAAAABto/62j50ZK3W48f66vxSjB_00gFMyQJmwrLQCLcBGAsYHQ/s1600/somTSP_2.png" /></a></div>
<br />
In the snippet above we initialized the SOM and ran 2000 training iterations (check <a href="https://glowingpython.blogspot.com/2013/09/self-organizing-maps.html">this</a> out to discover how that works). At each iteration we saved the path found and we visualized the last solution. As we can see, the line covers all the points and a glance is enough to tell that it's a very good path.
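<br/><br/>
As a quick sanity check, we can also measure the total length of the closed path found by the network (a small sketch that assumes <i>points</i> and <i>visit_order</i> from the snippet above):
<pre class="prettyprint">
# total length of the closed path defined by visit_order
final_path = points[visit_order]
steps = np.diff(final_path, axis=0)
path_length = np.sum(np.sqrt(np.sum(steps**2, axis=1)))
print('total path length: %.3f' % path_length)
</pre>
It's also interesting to see how the solution evolves at each iteration: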
<br />
<pre class="prettyprint">from matplotlib.animation import FuncAnimation
from IPython.display import HTML
fig, ax = plt.subplots()
plt.scatter(x, y, label='point to visit')
ln, = plt.plot([], [], 'C3', linewidth=2, label='path')
plt.legend()
def update(frame):
    ln.set_data(paths_x[frame], paths_y[frame])
    plt.title('iteration = %d' % frame)
    return ln,

ani = FuncAnimation(fig, update, frames=np.arange(max_iter),
                    interval=10, repeat=False, blit=False)
HTML(ani.to_html5_video())
</pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.blogger.com/video.g?token=AD6v5dxMTtvXqU6pE0ERo5F-2ses13EXiwP-8Fa0Kq0gN1lnRdU8qf5O7KTTAwPg7MWwnh4vVjW8i_PfbMS-DOxJOw' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<br />
Here we note that the initial path is very messy and presents various loops, and that the more the network is trained the better the solution becomes. Notice that the snippet above uses the object <i>HTML</i> from the IPython library and it will automatically display the video if a Jupyter notebook is used. The video can be saved in a specific location using <i>ani.save('filename.mp4')</i>.
<br/><br/>
<b>Neural Networks Regularization made easy with sklearn and matplotlib</b>
<br/><br/>
Using regularization has many benefits, the most common being the reduction of overfitting and the mitigation of multicollinearity issues. All of this is covered very well in the literature, especially in <a href="https://web.stanford.edu/~hastie/ElemStatLearn/">(Hastie et al.)</a>. However, without touching too many details, we can have a very straightforward interpretation of regularization: regularization is a way to constrain a model in order to learn less from the data. In this post we will experimentally show what the effects of regularization on a Neural Network (Multilayer Perceptron) are, validating this interpretation.
<br />
<br />
Let's define a goal for our Neural Network. We have a dataset <i>H</i> and we want to build a model able to reconstruct the same data. More formally, we want to build a function <i>f</i> that takes <i>H</i> in input and returns the same values, or an approximation as close as possible to <i>H</i>. We can say that we want <i>f</i> to minimize the following error
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-zwIJXB2kAWw/XqAgE7DfS_I/AAAAAAAABqY/SwzMKCvo4jwfRxrg61kQeshmVmKkURY7ACLcBGAsYHQ/s1600/Screenshot%2B2020-04-22%2Bat%2B11.33.44.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="90" data-original-width="336" height="53" src="https://1.bp.blogspot.com/-zwIJXB2kAWw/XqAgE7DfS_I/AAAAAAAABqY/SwzMKCvo4jwfRxrg61kQeshmVmKkURY7ACLcBGAsYHQ/s200/Screenshot%2B2020-04-22%2Bat%2B11.33.44.png" width="200" /></a></div>
<br />
<br />
To begin our experiment we build a data matrix <i>H</i> that contains the coordinate of a stylized star:
<br />
<pre class="prettyprint">import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x= [np.cos(np.pi/2), 2/5*np.cos(np.pi/2+np.pi/5), np.cos(np.pi/2+2*np.pi/5),
2/5*np.cos(np.pi/2+3*np.pi/5), np.cos(np.pi/2+4*np.pi/5),
2/5*np.cos(3*np.pi/2), np.cos(np.pi/2+6*np.pi/5),
2/5*np.cos(np.pi/2+7*np.pi/5), np.cos(np.pi/2+8*np.pi/5),
2/5*np.cos(np.pi/2+9*np.pi/5), np.cos(np.pi/2)]
y=[np.sin(np.pi/2), 2/5*np.sin(np.pi/2+np.pi/5), np.sin(np.pi/2+2*np.pi/5),
2/5*np.sin(np.pi/2+3*np.pi/5), np.sin(np.pi/2+4*np.pi/5),
2/5*np.sin(3*np.pi/2), np.sin(np.pi/2+6*np.pi/5),
2/5*np.sin(np.pi/2+7*np.pi/5), np.sin(np.pi/2+8*np.pi/5),
2/5*np.sin(np.pi/2+9*np.pi/5), np.sin(np.pi/2)]
# upsample the outline of the star via interpolation
# (a log transform is also applied to the x coordinates)
xp = np.linspace(0, 1, len(x))
x = np.log(np.interp(np.linspace(0, 1, len(x)*10), xp, x)+1)
yp = np.linspace(0, 1, len(y))
y = np.interp(np.linspace(0, 1, len(y)*10), yp, y)
H = np.array([x, y]).T
plt.plot(H[:,0], H[:,1], '-o')
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-vYSJvF1WHL8/XqAdv64ueXI/AAAAAAAABpY/Vi2HmlEwoporsIW9vKwSCPNU8ZoPQlqpwCLcBGAsYHQ/s1600/download-36.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="248" data-original-width="386" height="206" src="https://1.bp.blogspot.com/-vYSJvF1WHL8/XqAdv64ueXI/AAAAAAAABpY/Vi2HmlEwoporsIW9vKwSCPNU8ZoPQlqpwCLcBGAsYHQ/s320/download-36.png" width="320" /></a></div>
<br />
<br />
Now the matrix <i>H</i> contains the x coordinates of our star in the first column and the y coordinates in the second.
With the help of sklearn, we can now train a Neural Network and plot the result:
<br />
<pre class="prettyprint">from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import minmax_scale
H = scale(H)
plt.figure()
f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=0)
f.fit(H, H)
result = f.predict(H)
plt.plot(result[:,0], result[:,1], 'C3', linewidth=3, label='Neural Network')
plt.plot(H[:,0], H[:,1], 'o', alpha=.3, label='original')
plt.legend()
#plt.xlim([-0.1, 1.1])
#plt.ylim([-0.1, 1.1])
plt.show()
</pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-rp5GmX67TBw/XqAd7Xbu7NI/AAAAAAAABpg/P-Zo18EdsNQ8-zXsrUsHUyQvdoBpTrdbQCLcBGAsYHQ/s1600/download-37.png" imageanchor="1"><img border="0" data-original-height="248" data-original-width="380" src="https://4.bp.blogspot.com/-rp5GmX67TBw/XqAd7Xbu7NI/AAAAAAAABpg/P-Zo18EdsNQ8-zXsrUsHUyQvdoBpTrdbQCLcBGAsYHQ/s1600/download-37.png" /></a></div>
<br />
<br />
In the snippet above we created a Neural Network with 3 layers of 200 neurons. Then, we trained the model using <i>H</i> as both input and output data. In the chart we compare the original data with the estimation. It's easy to see that there are small differences between the two stars but they are very close.
<br />
Here it's important to notice that we initialized <i>MLPRegressor</i> using <i>alpha=0</i>. This parameter controls the regularization of the model: the higher its value, the more regularization we apply. To understand how <i>alpha</i> affects the learning of the model we need to add a term to the computation of the error that was just introduced:
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-rRlHznTTrtQ/XqAeIsUNaCI/AAAAAAAABpo/BeqOD55NgBsHqyNbb2gq-9Ki4fJ-1LY0gCLcBGAsYHQ/s1600/Screenshot%2B2020-04-22%2Bat%2B11.36.07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="84" data-original-width="506" height="52" src="https://2.bp.blogspot.com/-rRlHznTTrtQ/XqAeIsUNaCI/AAAAAAAABpo/BeqOD55NgBsHqyNbb2gq-9Ki4fJ-1LY0gCLcBGAsYHQ/s320/Screenshot%2B2020-04-22%2Bat%2B11.36.07.png" width="320" /></a></div>
<br />
where <i>W</i> is a matrix of all the weights in the network. The error not only takes into account the difference between the output of the function and the data, but also the size of the weights of the connections of the network. Hence, the higher alpha, the less the model is free to learn. If we set alpha to 20 in our experiment we have the following result:
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-vHN1IqxZh5g/XqAeTcyOUhI/AAAAAAAABpw/hfYCcwyPzlI9rIHI08kFTGI2VEpik-R9ACLcBGAsYHQ/s1600/download-38.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="248" data-original-width="380" src="https://4.bp.blogspot.com/-vHN1IqxZh5g/XqAeTcyOUhI/AAAAAAAABpw/hfYCcwyPzlI9rIHI08kFTGI2VEpik-R9ACLcBGAsYHQ/s1600/download-38.png" /></a></div>
<br />
<br />
We still achieve an approximation of the star, but the output of our model is smaller than before and the edges of the star are smoother.
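<br/><br/>
To verify that the regularization is really shrinking the weights, we can compare the overall norm of the weight matrices of the unregularized network <i>f</i> trained above with that of a new network trained with alpha=20 (a quick sketch, assuming the snippets above have been run):
<pre class="prettyprint">
f20 = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=20)
f20.fit(H, H)

# total L2 norm of the connection weights of each network
norm_0 = np.sqrt(sum(np.sum(W**2) for W in f.coefs_))
norm_20 = np.sqrt(sum(np.sum(W**2) for W in f20.coefs_))
print('||W|| with alpha=0:  %.2f' % norm_0)
print('||W|| with alpha=20: %.2f' % norm_20)
</pre>
The norm of the weights of the regularized network should turn out to be substantially smaller, which matches the interpretation given at the beginning of the post.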
<br />
Increasing alpha gradually is a good way to understand the effects of regularization:
<br />
<pre class="prettyprint">from matplotlib.animation import FuncAnimation
from IPython.display import HTML
fig, ax = plt.subplots()
ln, = plt.plot([], [], 'C3', linewidth=3, label='Neural Network')
plt.plot(H[:,0], H[:,1], 'o', alpha=.3, label='original')
plt.legend()
def update(frame):
    f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=frame)
    f.fit(H, H)
    result = f.predict(H)
    ln.set_data(result[:,0], result[:,1])
    plt.title('alpha = %.2f' % frame)
    return ln,

ani = FuncAnimation(fig, update, frames=np.linspace(0, 40, 100), blit=True)
HTML(ani.to_html5_video())
</pre>
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='420' height='300' src='https://www.blogger.com/video.g?token=AD6v5dytLCtfQm10ASzF9N46EDBZ-Ig4cSDB_16pH1MvwD7Ap1ZUaqZwAcpu-dYiiLx5GrXdtUwjbSFvi-_OEBMeAA' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<br />
<br />
Here we vary alpha from 0 to 40 and plot the result for each value. We notice that not only does the star get smaller and smoother as alpha increases, but also that the network tends to preserve long lines as much as possible, moving away from the edges, which weigh less on the error function. Finally, we see that the result degenerates into a point when alpha is too high.
<br/><br/>
<b>Tornado plots with matplotlib</b>
<br/><br/>
Lately there's been a bit of attention on charts where the values of a time series are plotted against the point-by-point change, thanks to this rather colorful and cluttered <a href="https://pbs.twimg.com/media/EWoCUylXsAcZn_p?format=jpg&name=large">Tornado plot</a>.
<br/><br/>
In this post we will see how to make one of those charts with our favorite plotting library, matplotlib, and we'll also try to understand how to read them.
<br/>
Let's start loading the records of the concentration of CO2 in the atmosphere and aggregate the values on month level. After that we can plot straight away the value of the concentration against the change compared to the previous month:
<pre class="prettyprint">
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
names=['year', 'month', 'day', 'decimal', 'ppm',
'days', '1_yr_ago', '10_yr_ago', 'since_1800'])
co2_data = co2_data.groupby([co2_data.year, co2_data.month]).mean()
idx = co2_data.index.to_flat_index()
co2_data['timestamp'] = [pd.Timestamp(year=y, month=m, day=1) for y, m in idx]
co2_data.set_index('timestamp', inplace=True)
co2_data = co2_data['2018-04-01':]
plt.plot(co2_data.ppm.diff(), co2_data.ppm, '-o')
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-iN5-FC-OsUg/XqvhWNNyycI/AAAAAAAABq8/Wrjw7zDcax8hi_sN3OUgSsKoitnQXYDgACLcBGAsYHQ/s1600/tornado_co2_bad.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-iN5-FC-OsUg/XqvhWNNyycI/AAAAAAAABq8/Wrjw7zDcax8hi_sN3OUgSsKoitnQXYDgACLcBGAsYHQ/s1600/tornado_co2_bad.png" data-original-width="390" data-original-height="260" /></a></div>
<br/><br/>
The result is nicely loopy. This is because we plotted data for 2 years and the time series presents a yearly seasonality. Still, apart from the loop, it's quite hard to understand anything without adding a few details. Let's print the names of the months and highlight the beginning and the end of the time series:
<pre class="prettyprint">
from calendar import month_abbr
import numpy as np

plt.figure(figsize=(8, 12))
diffs = co2_data.ppm.diff()
plt.plot(diffs, co2_data.ppm, '-o', alpha=.6, zorder=0)
plt.scatter([diffs[1]], [co2_data.ppm[1]],
            c='C0', s=140)
plt.scatter([diffs[-1]], [co2_data.ppm[-1]],
            c='C0', s=140)
for d, v, ts in zip(diffs,
                    co2_data.ppm,
                    co2_data.index):
    plt.text(d, v-.15, '%s\n%d' % (month_abbr[ts.month], ts.year),
             horizontalalignment='left', verticalalignment='top')
plt.xlim([-np.abs(diffs).max()-.1,
          np.abs(diffs).max()+.1])
plt.ylabel('CO2 concentrations (ppm)')
plt.xlabel('change from previous month (ppm)')
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-LKc89KzGr4U/XqvhlLlNXsI/AAAAAAAABrA/RSpsz_ad-7kCUqQtuMlE9ldOdnNbybdnwCLcBGAsYHQ/s1600/tornado_co2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-LKc89KzGr4U/XqvhlLlNXsI/AAAAAAAABrA/RSpsz_ad-7kCUqQtuMlE9ldOdnNbybdnwCLcBGAsYHQ/s1600/tornado_co2.png" data-original-width="538" data-original-height="721" /></a></div>
<br/><br/>
The situation is much improved. Now we can see that from June to September (values on the left) the concentrations decrease, while from October to May (values on the right) they increase. If we look at the points where the line of zero change is crossed, we can spot when there's a change of trend. We see that the trend changes from negative to positive from September to October and the opposite from May to June.
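<br/><br/>
To double-check these crossing points programmatically, we can list the months where the month-over-month change flips sign (a small sketch assuming <i>co2_data</i> and <i>diffs</i> from the snippet above):
<pre class="prettyprint">
# months where the sign of the change from the previous month flips
flips = np.sign(diffs).diff().fillna(0) != 0
print(co2_data.index[flips].strftime('%b %Y').tolist())
</pre>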
<br/><br/>
Would all of this have been easier to observe with a plain chart of the concentration over time? It's easy to judge looking at the chart below.
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-eBWFuP1Kbbg/XqvhxQljfWI/AAAAAAAABrQ/1i6OXrrQ4ZcnHHxUukUZAkNM3u8DiZxlACPcBGAYYCw/s1600/co2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-eBWFuP1Kbbg/XqvhxQljfWI/AAAAAAAABrQ/1i6OXrrQ4ZcnHHxUukUZAkNM3u8DiZxlACPcBGAYYCw/s400/co2.png" width="500" height="272" data-original-width="729" data-original-height="313" /></a></div>JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-86345240462215397882020-04-14T11:20:00.000+01:002020-04-14T11:20:56.800+01:00Recoloring NoIR images on the Raspberry Pi with OpenCVNot too long ago I've been gifted a Raspberry Pi camera, after taking some pictures I realized that it produced very weird colors and I discovered that it was a NoIR camera! It means that it has no infrared filter and that it can take pictures in the darkness using an infrared LED. Since I never found an application that required taking pictures without proper lighting I started wondering if I could recolor the images processing them with some Python magic.
It's clear that the problem is ill posed: once the camera takes a picture sensing the wrong colors, the original colors are lost. It's also true, however, that it's possible to transfer the coloring from one image to another. This old but gold <a href="https://www.cs.tau.ac.il/~turkel/imagepapers/ColorTransfer.pdf">paper</a> shows a technique that is simple enough to be implemented on a pi. Hence, the idea to create a script that recolors the images from the NoIR camera using the colors from images taken with a proper infrared filter.
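<br/><br/>
In a nutshell, the technique in the paper matches the per-channel mean and standard deviation of the two images in a decorrelated color space (the paper works in the lαβ space; the sketch below uses OpenCV's LAB conversion for simplicity and is only an illustration, not the exact implementation of the color_transfer package used later):
<pre class="prettyprint">
import cv2
import numpy as np

def simple_color_transfer(source, target):
    # move both images to the LAB color space
    src = cv2.cvtColor(source, cv2.COLOR_BGR2LAB).astype(np.float32)
    tgt = cv2.cvtColor(target, cv2.COLOR_BGR2LAB).astype(np.float32)
    # match mean and standard deviation of each channel
    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1))
    tgt_mean, tgt_std = tgt.mean(axis=(0, 1)), tgt.std(axis=(0, 1))
    result = (tgt - tgt_mean) / (tgt_std + 1e-8) * src_std + src_mean
    result = np.clip(result, 0, 255).astype(np.uint8)
    return cv2.cvtColor(result, cv2.COLOR_LAB2BGR)
</pre>
Here <i>source</i> is the image providing the colors and <i>target</i> is the image to recolor, both as BGR arrays as returned by OpenCV.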
<br/>
<br/>
Here are the elements I gathered to start experimenting:
<ul>
<li>A nice <a href="https://github.com/jrosebr1/color_transfer">implementation of the color transfer algorithm</a>, easy to install and run on the pi.</li>
<li>An installation of OpenCV on the pi. It's possible to have a fully optimized OpenCV installation for your pi <a href="https://www.learnopencv.com/install-opencv-4-on-raspberry-pi/">building it from the source</a> but for this project it's okay to install the library from binaries (this command will do the trick: <i>sudo apt-get install python-opencv</i>).</li>
<li>A camera stand that I built myself recycling the components of an unused usb fan.</li>
</ul>
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-f_N09NTQxS0/XpNOw6bXCZI/AAAAAAAABn4/8opd2Bg77TQUwQK1fkC7kZG5KMfCjnf2wCLcBGAsYHQ/s1600/camera_stand.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-f_N09NTQxS0/XpNOw6bXCZI/AAAAAAAABn4/8opd2Bg77TQUwQK1fkC7kZG5KMfCjnf2wCLcBGAsYHQ/s320/camera_stand.jpeg" width="320" height="148" data-original-width="1600" data-original-height="738" /></a></div>
<br/>
The Python script to acquire and recolor the images turned out to be pretty compact:
<pre class="prettyprint">
from picamera.array import PiRGBArray
from picamera import PiCamera
from sys import argv
# get this with: pip install color_transfer
from color_transfer import color_transfer
import time
import cv2
# init the camera
camera = PiCamera()
rawCapture = PiRGBArray(camera)
# camera to warmup
time.sleep(0.1)
# capture
camera.capture(rawCapture, format="bgr")
captured = rawCapture.array
# import the color source
color_source = cv2.imread(argv[1])
# transfer the color
result = color_transfer(color_source, captured,
clip=True, preserve_paper=False)
cv2.imwrite(argv[2], result)
</pre>
This script captures an image from the camera and reads another image, that will be the color source, from the disk. Then, it recolors the captured image and saves the result.
The script takes two input parameters: the color source and the name of the output file. Here's an example of how to run the script on the pi:
<pre>
$ python capture.py color_source.jpg result.jpg
</pre>
Here are some sample pictures that were recolored. In each of the figures below there is the color source on the left, the image from the NoIR camera in the middle and the final result on the right.
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/--3-1T9l2ZLA/XpNUcoGesFI/AAAAAAAABoE/NHCkggB7BqQzG1OpX90J8JjHP0hjBF7sgCLcBGAsYHQ/s1600/teddy_ir2color.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/--3-1T9l2ZLA/XpNUcoGesFI/AAAAAAAABoE/NHCkggB7BqQzG1OpX90J8JjHP0hjBF7sgCLcBGAsYHQ/s400/teddy_ir2color.png" width="400" height="103" data-original-width="1505" data-original-height="389" /></a></div>
<br/>
Here the source has vivid colors and the details are nice and sharp while the image from the NoIR camera is almost monochromatic. In the recolored image the color of the curtain and the wall were recovered, still the image has quite a low contrast.<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-0mohWhM4Rq0/XpRQ7SMwuCI/AAAAAAAABoc/Yo34nQdUYAEzkJrH-qUZ_9Z-W5-FfSyugCLcBGAsYHQ/s1600/Screenshot%2B2020-04-13%2Bat%2B12.45.13.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-0mohWhM4Rq0/XpRQ7SMwuCI/AAAAAAAABoc/Yo34nQdUYAEzkJrH-qUZ_9Z-W5-FfSyugCLcBGAsYHQ/s640/Screenshot%2B2020-04-13%2Bat%2B12.45.13.png" width="400" height="95" data-original-width="1600" data-original-height="378" /></a></div>
This time the result is much sharper and the colors are more intense, even more intense than in the source.<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-RZ0J-7hHbR0/XpNUdwB7MEI/AAAAAAAABoM/NRKZZR2qqC4NGwnfkokLub90TIgk61HwwCLcBGAsYHQ/s1600/outside_ir2color.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-RZ0J-7hHbR0/XpNUdwB7MEI/AAAAAAAABoM/NRKZZR2qqC4NGwnfkokLub90TIgk61HwwCLcBGAsYHQ/s400/outside_ir2color.png" width="400" height="95" data-original-width="1600" data-original-height="378" /></a></div>
<br/>
This result is particularly interesting because the NoIR image shows very nasty colors, as there was quite a lot of sunlight when the picture was taken. Recoloring the image I could recover the green of some trees and the blue of the sky; however, the walls and the ground got a greenish appearance while some plants look purple.
<br/><br/>
In conclusion, this turned out to be a fun experiment that also provided some encouraging results. Next step? Recoloring the images with more modern Deep Learning techniques.
<br/><br/>
<b>What makes a word beautiful?</b>
<br/><br/>
What makes a word beautiful? Answering this question is not easy because of the inherent complexity and ambiguity in defining what it means to be beautiful. Let's tackle the question with a quantitative approach, introducing the <i>Aesthetic Potential</i>, a metric that aims to quantify the beauty of a word <i>w</i> as follows:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-t0gJyy6CYA0/XodemCGWyQI/AAAAAAAABnA/3XnyL0PjGv4WQPUH_uAMy9YvPCaARCL4gCLcBGAsYHQ/s1600/Screenshot%2B2020-04-03%2Bat%2B17.04.21.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-t0gJyy6CYA0/XodemCGWyQI/AAAAAAAABnA/3XnyL0PjGv4WQPUH_uAMy9YvPCaARCL4gCLcBGAsYHQ/s1600/Screenshot%2B2020-04-03%2Bat%2B17.04.21.png" data-original-width="410" data-original-height="82" /></a></div>
<br/>
where <i>w</i><sup>+</sup> is a word labelled as beautiful, <i>w</i><sup>-</sup> as ugly, and the function <i>s</i> is a similarity function between two words. In a nutshell, AP is the average similarity to beautiful words minus the average similarity to ugly words. This metric is positive for beautiful words and negative for ugly ones.
<br/>
Before we can compute the Aesthetic Potential we need a similarity function <i>s</i> and a set of words labeled as beautiful and ugly.
The similarity function that we will use considers the similarity of two words as the maximum <a href="https://nlpforhackers.io/wordnet-sentence-similarity/">Lin similarity</a> between all the synsets in <a href="https://wordnet.princeton.edu/">WordNet</a> of the two words in input (I will not introduce WordNet or the Lin similarity for brevity, but the curious reader is invited to follow the links above). Here's the Python implementation:
<pre class="prettyprint">
import numpy as np
from itertools import product
from nltk.corpus import wordnet, wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
def similarity(word1, word2):
    """
    returns the similarity between word1 and word2 as the maximum
    Lin similarity between all the synsets of the two words.
    """
    syns1 = wordnet.synsets(word1)
    syns2 = wordnet.synsets(word2)
    sims = []
    for sense1, sense2 in product(syns1, syns2):
        if sense1._pos == sense2._pos and sense1._pos not in ['a', 'r', 's']:
            d = wordnet.lin_similarity(sense1, sense2, brown_ic)
            sims.append(d)
    if len(sims) > 0 and not np.all(np.isnan(sims)):
        return np.nanmax(sims)
    return 0  # no similarity
print('s(cat, dog) =', similarity('cat', 'dog'))
print('s(cat, bean) = ', similarity('cat', 'bean'))
print('s(coffee, bean) = ', similarity('coffee', 'bean'))
</pre>
<pre>
s(cat, dog) = 0.8768009843733973
s(cat, bean) = 0.3079964716744931
s(coffee, bean) = 0.788150820826125
</pre>
This function returns a value between 0 and 1. High values indicate that the two words are highly similar and low values indicate that there's no similarity. Looking at the output of the function for the three pairs of test words, we note that the function considers "cat" and "dog" fairly similar, while "cat" and "bean" are not similar. Finally, "coffee" and "bean" are considered similar, but not as similar as "cat" and "dog".
<br/>
Now we need some words labeled as beautiful and some as ugly. Here I propose two lists of words inspired by the ones used in <a href="https://www.frontiersin.org/articles/10.3389/fnhum.2017.00622/full#B23">(Jacobs, 2017)</a> for the German language:
<pre class="prettyprint">
beauty = ['amuse', 'art', 'attractive',
'authentic', 'beautiful', 'beauty',
'bliss', 'cheerful', 'culture',
'delight', 'emotion', 'enjoyment',
'enthusiasm', 'excellent', 'excited',
'fascinate', 'fascination', 'flower',
'fragrance', 'good', 'grace',
'graceful', 'happy', 'heal',
'health', 'healthy', 'heart',
'heavenly', 'hope', 'inspire',
'light', 'joy', 'love',
'lovely', 'lullaby', 'lucent',
'loving', 'luck', 'magnificent',
'music', 'muse', 'life',
'paradise', 'perfect', 'perfection',
'picturesque', 'pleasure',
'poetic', 'poetry', 'pretty',
'protect', 'protection',
'rich', 'spring', 'smile',
'summer', 'sun', 'surprise',
'wealth', 'wonderful']
ugly = ['abuse', 'anger', 'imposition', 'anxiety',
'awkward', 'bad', 'unlucky', 'blind',
'chaotic', 'crash', 'crazy',
'cynical', 'dark', 'disease',
'deadly', 'decrepit', 'death',
'despair', 'despise', 'disgust',
'dispute', 'depression', 'dull',
'evil', 'fail', 'hate',
'hideous', 'horrible', 'horror',
'haunted', 'illness', 'junk',
'kill', 'less',
'malicious', 'misery', 'murder',
'nasty', 'nausea', 'pain',
'piss', 'poor', 'poverty',
'puke', 'punishment', 'rot',
'scam', 'scare', 'shame',
'spoil', 'spite', 'slaughter',
'stink', 'terrible', 'trash',
'trouble', 'ugliness', 'ugly',
'unattractive', 'virus']
</pre>
A remark is necessary here. The AP strongly depends on these two lists, and the fact that I made them on my own strongly biases the results towards my personal preferences. If you're interested in a more general approach to label your data, the work published by <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.1078&rep=rep1&type=pdf">Westbury et al. in 2014</a> is a good place to start.
<br/>
We now have all the pieces to compute our Aesthetic Potential:
<pre class="prettyprint">
def aesthetic_potential(word, beauty, ugly):
    """
    returns the aesthetic potential of word
    beauty and ugly must be lists of words
    labelled as beautiful and ugly respectively
    """
    b = np.nanmean([similarity(word, w) for w in beauty])
    u = np.nanmean([similarity(word, w) for w in ugly])
    return (b - u)*100
print('AP(smile) =', aesthetic_potential('smile', beauty, ugly))
print('AP(conjuncture) =', aesthetic_potential('conjuncture', beauty, ugly))
print('AP(hassle) =', aesthetic_potential('hassle', beauty, ugly))
</pre>
<pre>
AP(smile) = 2.6615214570040195
AP(conjuncture) = -3.418813636728729e-299
AP(hassle) = -2.7675826881674497
</pre>
It is a direct implementation of the equation introduced above; the only difference is that the result is multiplied by 100 to express the metric as a percentage, for readability purposes. Looking at the results, we see that the metric is positive for the word "smile", indicating that the word tends toward the beauty side. It's negative for "hassle", meaning it tends to the ugly side. It's practically 0 for "conjuncture", meaning that we can consider it a neutral word.
To better understand these results we can compute the metric for a set of words and plot it against the probability of each value of the metric:
<pre class="prettyprint">
test_words = ['hate', 'rain',
'earth', 'love', 'child',
'sun', 'patience',
'coffee', 'regret',
'depression', 'obscure', 'bat', 'woman',
'dull', 'nothing', 'disillusion',
'abort', 'blurred', 'cruelness', #'hassle',
'stalking', 'relevance',
'conjuncture', 'god', 'moon',
'humorist', 'idea', 'poisoning']
ap = [aesthetic_potential(w.lower(), beauty, ugly) for w in test_words]
from scipy.stats import norm
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex, LinearSegmentedColormap, Normalize
%matplotlib inline
p_score = norm.pdf(ap, loc=0.0, scale=0.7) #params estimated on a larger sample
p_score = p_score / p_score.sum()
normalizer = Normalize(vmin=-10, vmax=10)
colors = ['crimson', 'crimson', 'silver', 'deepskyblue', 'deepskyblue']
cmap = LinearSegmentedColormap.from_list('beauty', colors=colors)
plt.figure(figsize=(8, 12))
plt.title('What makes a word beautiful?',
loc='left', color='gray', fontsize=22)
plt.scatter(p_score, ap, c='gray', marker='.', alpha=.6)
for prob, potential, word in zip(p_score, ap, test_words):
    plt.text(prob, potential, word.lower(),
             fontsize=(np.log10(np.abs(potential)+2))*30, alpha=.8,
             color=cmap(normalizer(potential)))
plt.text(-0.025, 6, 'beautiful', va='center',
fontsize=20, rotation=90, color='deepskyblue')
plt.text(-0.025, -6, 'ugly', va='center',
fontsize=20, rotation=90, color='crimson')
plt.xlabel('P(Aesthetic Potential)', fontsize=20)
plt.ylabel('Aesthetic Potential', fontsize=20)
plt.gca().tick_params(axis='both', which='major', labelsize=14)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.show()
</pre>
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-IdQkAny0liY/XodgguE6SXI/AAAAAAAABnU/P8Wbe65MYK4UsOV2A--GUl18Gnq8ZQc8QCLcBGAsYHQ/s1600/Screenshot%2B2020-04-03%2Bat%2B17.10.19.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-IdQkAny0liY/XodgguE6SXI/AAAAAAAABnU/P8Wbe65MYK4UsOV2A--GUl18Gnq8ZQc8QCLcBGAsYHQ/s1600/Screenshot%2B2020-04-03%2Bat%2B17.10.19.png" data-original-width="554" data-original-height="727" /></a></div>
<br/>
This chart gives us a better insight into the meaning of the values we just computed. We note that high probability values are around 0, hence most words in the vocabulary are neutral. Values above 2 and below -2 have quite a low probability; this tells us that words associated with these values have a strong Aesthetic Potential. From this chart we can see that the words "idea" and "sun" are considered beautiful while "hate" and "poisoning" are ugly (who would disagree with that :).
<br/><br/>
<b>Ridgeline plots in pure matplotlib</b>
<br/><br/>
A Ridgeline plot (also called Joyplot) allows us to compare several statistical distributions. In this plot each distribution is shown with a density plot, and all the distributions are aligned to the same horizontal axis and, sometimes, presented with a slight overlap.
<br/><br/>
There are many options to make a Ridgeline plot in Python (<a href="https://github.com/sbebo/joypy">joypy</a> being one of them) but I decided to make my own function using matplotlib to have full flexibility and minimal dependencies:
<pre class="prettyprint">
from scipy.stats.kde import gaussian_kde
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
def ridgeline(data, overlap=0, fill=True, labels=None, n_points=150):
    """
    Creates a standard ridgeline plot.

    data, list of lists.
    overlap, overlap between distributions. 1 max overlap, 0 no overlap.
    fill, matplotlib color to fill the distributions.
    n_points, number of points to evaluate each distribution function.
    labels, values to place on the y axis to describe the distributions.
    """
    if overlap > 1 or overlap < 0:
        raise ValueError('overlap must be in [0 1]')
    xx = np.linspace(np.min(np.concatenate(data)),
                     np.max(np.concatenate(data)), n_points)
    ys = []
    for i, d in enumerate(data):
        pdf = gaussian_kde(d)
        y = i*(1.0-overlap)
        ys.append(y)
        curve = pdf(xx)
        if fill:
            plt.fill_between(xx, np.ones(n_points)*y,
                             curve+y, zorder=len(data)-i+1, color=fill)
        plt.plot(xx, curve+y, c='k', zorder=len(data)-i+1)
    if labels:
        plt.yticks(ys, labels)
</pre>
The function takes as input a list of datasets, where each dataset contains the values to derive a single distribution. Each distribution is estimated using Kernel Density Estimation, just as we've seen <a href="https://glowingpython.blogspot.com/2012/08/kernel-density-estimation-with-scipy.html">previously</a>, and plotted at an increasing y value.<br/><br/>
Let's generate data from few normal distributions with different means and have a look at the output of the function:
<pre class="prettyprint">
data = [norm.rvs(loc=i, scale=2, size=50) for i in range(8)]
ridgeline(data, overlap=.85, fill='y')
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-bloEanNZ5ss/Xm32YGsly1I/AAAAAAAABmA/ekxTexxxDCoae2oNH02R9MuzdStfrOALgCLcBGAsYHQ/s1600/ridge_plot_matplotlib1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-bloEanNZ5ss/Xm32YGsly1I/AAAAAAAABmA/ekxTexxxDCoae2oNH02R9MuzdStfrOALgCLcBGAsYHQ/s1600/ridge_plot_matplotlib1.png" data-original-width="383" data-original-height="255" /></a></div>
<br/><br/>
Not too bad, we can clearly see that each distribution has a different mean. Let's apply the function on real world data:
<pre class="prettyprint">
import pandas as pd
data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
names=['year', 'month', 'day', 'decimal', 'ppm',
'days', '1_yr_ago', '10_yr_ago', 'since_1800'])
co2_data = co2_data[co2_data.year >= 2000]
co2_data = co2_data[co2_data.year != 2020]
plt.figure(figsize=(8, 10))
grouped = [(y, g.ppm.dropna().values) for y, g in co2_data.groupby('year')]
years, data = zip(*grouped)
ridgeline(data, labels=years, overlap=.85, fill='tomato')
plt.title('Distribution of CO2 levels per year since 2000',
loc='left', fontsize=18, color='gray')
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.xlabel('ppm')
plt.xlim((co2_data.ppm.min(), co2_data.ppm.max()))
plt.ylim((0, 3.1))
plt.grid(zorder=0)
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-Ccyn9LWhDLE/Xm32dMsBRmI/AAAAAAAABmE/qM9ECAXYmDEOXgBgT6QzFPgG940zHtsVwCLcBGAsYHQ/s1600/ridgeplot_matplotlib_co2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-Ccyn9LWhDLE/Xm32dMsBRmI/AAAAAAAABmE/qM9ECAXYmDEOXgBgT6QzFPgG940zHtsVwCLcBGAsYHQ/s1600/ridgeplot_matplotlib_co2.png" data-original-width="512" data-original-height="617" /></a></div>
<br/><br/>
In the snippet above we downloaded the measurements of the concentration of CO2 in the atmosphere (the same data was also used <a href="https://glowingpython.blogspot.com/2019/04/visualizing-atmospheric-carbon-dioxide.html">here</a>) and grouped the values by year. Then, we generated a Ridgeline plot that shows the distribution of CO2 levels each year since 2000. We easily note that the average concentration went from about 370ppm to more than 410ppm, gradually increasing over the years observed. We also note that the span of each distribution is approximately 10ppm.
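<br/><br/>
As a quick numeric check of this trend, we can also print the yearly averages directly (assuming <i>co2_data</i> from the snippet above):
<pre class="prettyprint">
yearly_mean = co2_data.groupby('year').ppm.mean().round(1)
print(yearly_mean)
</pre>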
<br/><br/>
<b>Organizing movie covers with Neural Networks</b>
<br/><br/>
In this post we will see how to organize a set of movie covers by similarity on a 2D grid using a particular type of Neural Network called <a href="https://glowingpython.blogspot.com/2013/09/self-organizing-maps.html">Self Organizing Map (SOM)</a>.
First, let's load the movie covers of the top 100 movies according to IMDB (the files can be downloaded <a href="https://github.com/JustGlowing/minisom/tree/master/examples/movie_covers">here</a>) and convert the images into samples that we can use to feed the Neural Network:
<pre class="prettyprint">
import numpy as np
import imageio
from glob import glob
from sklearn.preprocessing import StandardScaler
# covers of the top 100 movies on www.imdb.com/chart/top
# (the 13th of August 2019)
# images downloaded from www.themoviedb.org
data = []
all_covers = glob('movie_covers/*.jpg')
for cover_jpg in all_covers:
    cover = imageio.imread(cover_jpg)
    data.append(cover.reshape(np.prod(cover.shape)))
original_shape = imageio.imread(all_covers[0]).shape
scaler = StandardScaler()
data = scaler.fit_transform(data)
</pre>
In the snippet above we load every image and, for each of them, we stack the color values of each pixel in a one-dimensional vector. After loading all the images, a standard scaling is applied to have all the values with mean 0 and standard deviation equal to 1. This scaling strategy often turns out to be quite successful when working with SOMs.
Now we can train our model:
<pre class="prettyprint">
from minisom import MiniSom
w = 10
h = 10
som = MiniSom(h, w, len(data[0]), learning_rate=0.5,
sigma=3, neighborhood_function='triangle')
som.train_random(data, 2500, verbose=True)
win_map = som.win_map(data)
</pre>
Here we use <a href="https://github.com/JustGlowing/minisom/">Minisom</a>, a lean implementation of the SOM, to build a 10-by-10 map of neurons. Each movie cover is mapped to a neuron and we can display the results as follows:
<pre class="prettyprint">
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
fig = plt.figure(figsize=(30, 20))
grid = ImageGrid(fig, 111,
                 nrows_ncols=(h, w), axes_pad=0)

def place_image(i, img):
    img = (scaler.inverse_transform(img)).astype(int)
    grid[i].imshow(img.reshape(original_shape))
    grid[i].axis('off')

to_fill = []
collided = []

for i in range(w*h):
    position = np.unravel_index(i, (h, w))
    if position in win_map:
        img = win_map[position][0]
        collided += win_map[position][1:]
        place_image(i, img)
    else:
        to_fill.append(i)

collided = collided[::-1]
for i in to_fill:
    position = np.unravel_index(i, (h, w))
    img = collided.pop()
    place_image(i, img)
plt.show()
</pre>
Since some images can be mapped in the same neuron, we first draw all the covers picking only one per neuron, then we fill the empty spaces of the map with covers that have been mapped in nearby neurons but have not been plotted yet.
<br/><br/>
This is the result:
<br/><br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-_jLpzeBJalA/XVKw3F39doI/AAAAAAAABg8/1_yshhXwefc-l4qBsa4EPU5qrF6X-sVVACPcBGAYYCw/s1600/movie_covers_nn.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-_jLpzeBJalA/XVKw3F39doI/AAAAAAAABg8/1_yshhXwefc-l4qBsa4EPU5qrF6X-sVVACPcBGAYYCw/s640/movie_covers_nn.png" width="428" height="640" data-original-width="737" data-original-height="1101" /></a></div>
<br/><br/>
Where to go next:
<ul>
<li>Read more about how Self Organizing Maps work <a href="https://glowingpython.blogspot.com/2013/09/self-organizing-maps.html">here</a>.</li>
<li>Check out how to install Minisom <a href="https://github.com/JustGlowing/minisom">here</a>.</li>
</ul>
<br/><br/>
<b>Visualizing distributions with scatter plots in matplotlib</b>
<br/><br/>
Let's say that we want to study the time between the end of a marked point and the next serve in a tennis game. After gathering our data, the first thing that we can do is to draw a histogram of the variable that we are interested in:
<br/><br/>
<pre class="prettyprint">
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)
plt.hist(event.seconds_before_next_point, bins=10)
plt.xlabel('Seconds before next serve')
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-x__H73GeG7I/XU_kX4_LYXI/AAAAAAAABgQ/MWp6uXv7D3IaCsjzfIKClgilkmjOyz1ugCLcBGAs/s1600/download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-x__H73GeG7I/XU_kX4_LYXI/AAAAAAAABgQ/MWp6uXv7D3IaCsjzfIKClgilkmjOyz1ugCLcBGAs/s1600/download.png" data-original-width="372" data-original-height="266" /></a></div>
<br/><br/>
The histogram reveals some interesting aspects of the distribution: we can see that the data is slightly skewed to the right and that on average the server takes about 20 seconds. However, we can't tell how many times the serve happens before 10 seconds or after 35. Of course, one could increase the number of bins of the histogram, but this would lead to a chart which is not particularly elegant and that might hide some other details.
<br/><br/>
To have a better understanding of the situation we can draw a scatter plot of the variable we are studying:
<pre class="prettyprint">
import numpy as np
from scipy.stats.kde import gaussian_kde
def distribution_scatter(x, symmetric=True, cmap=None, size=None):
"""
Plot the distribution of x showing all the points.
The x axis represents the samples in x
and the y axis is function of the probability of x
and random assignment.
Returns the position on the y axis.
"""
pdf = gaussian_kde(x)
w = np.random.rand(len(x))
if symmetric:
w = w*2-1
pseudo_y = pdf(x) * w
if cmap:
plt.scatter(x, pseudo_y, c=x, cmap=cmap, s=size)
else:
plt.scatter(x, pseudo_y, s=size)
return pseudo_y
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-dQ2mm00itdg/XU_k2VG05xI/AAAAAAAABgY/KK4elbkSjQQ4NvmT8coxM4nqW8k0HW6awCLcBGAs/s1600/download-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-dQ2mm00itdg/XU_k2VG05xI/AAAAAAAABgY/KK4elbkSjQQ4NvmT8coxM4nqW8k0HW6awCLcBGAs/s1600/download-1.png" data-original-width="390" data-original-height="266" /></a></div>
<br/><br/>
In this chart each sample is represented with a point and the spread of the points in the y direction depends on the probability of occurrence. In this case we can easily see that 4 serves happened before 10 seconds and 3 after 35.
<br/><br/>
Since we're not really interested in the values on the y axis but only in the spread, we can remove the axis and add a few details about the outliers to enrich the chart:
<br/><br/>
<pre class="prettyprint">
url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)
plt.figure(figsize=(7, 11))
title = 'Time in seconds between'
title += '\nend of marked point and next serve'
title += '\nat 2015 French Open'
plt.title(title, loc='left', fontsize=18, color='gray')
py = distribution_scatter(event.seconds_before_next_point, cmap='cool');
cut_h = np.percentile(event.seconds_before_next_point, 98)
outliers = event.seconds_before_next_point> cut_h
ha = {True: 'right', False: 'left'}
for x, y, c in zip(event[outliers].seconds_before_next_point,
py[outliers],
event[outliers].server):
plt.text(x, y+.0005, c,
ha=ha[x<0], va='bottom', fontsize=12)
plt.xlabel('Seconds before next serve', fontsize=15)
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.yticks([])
plt.xticks(np.arange(5, 41, 5))
plt.xlim([5, 40])
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-c2SW1ouGUeQ/XU_tmFpznwI/AAAAAAAABgs/IyTSou5GTqomzjy0ydAWO50VbwTCv-MHgCLcBGAs/s1600/tennis_serve_sec%2B%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-c2SW1ouGUeQ/XU_tmFpznwI/AAAAAAAABgs/IyTSou5GTqomzjy0ydAWO50VbwTCv-MHgCLcBGAs/s1600/tennis_serve_sec%2B%25281%2529.png" data-original-width="504" data-original-height="504" /></a></div>
<br/><br/>
Where to go next:
<ul>
<li>The data used in this post has also been presented in <a href="https://fivethirtyeight.com/features/why-some-tennis-matches-take-forever/">this article on fivethirtyeight.com</a>.</li>
<li>If you found this visualization technique useful, you can check out the <a href="https://seaborn.pydata.org/generated/seaborn.swarmplot.html">seaborn documentation about swarm plots</a>.</li>
</ul>JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-30497685884350360492019-06-07T08:45:00.005+01:002019-06-07T13:06:14.680+01:00Exporting Decision Trees in textual format with sklearnIn the past we have covered Decision Trees showing how interpretable these models can be (see the tutorials <a href="https://glowingpython.blogspot.com/2016/05/an-intro-to-regression-analysis-with.html">here</a>). In the previous tutorials we have exported the rules of the models using the function <i>export_graphviz</i> from sklearn and visualized the output of this function in a graphical way with an external tool which is not easy to install in some cases. Luckily, since version 0.21.2, scikit-learn offers the possibility to export Decision Trees in a textual format (I implemented this feature personally ^_^) and in this post we will see an example of how to use this new feature.<br/><br/>
Let's train a tree with 2 layers on the famous <a href="https://archive.ics.uci.edu/ml/datasets/iris">iris dataset</a> using all the data and print the resulting rules using the brand new function <i>export_text</i>:
<pre class="prettyprint">
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_text
from sklearn.datasets import load_iris
iris = load_iris()
X = iris['data']
y = ['setosa']*50+['versicolor']*50+['virginica']*50
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)
</pre>
<pre>
|--- petal width (cm) <= 0.80
| |--- class: setosa
|--- petal width (cm) > 0.80
| |--- petal width (cm) <= 1.75
| | |--- class: versicolor
| |--- petal width (cm) > 1.75
| | |--- class: virginica
</pre>
Reading them we note that if the feature <i>petal width</i> is less than or equal to 0.8cm the samples are always classified as setosa. Otherwise, if the <i>petal width</i> is less than or equal to 1.75cm they're classified as versicolor, or as virginica if the <i>petal width</i> is greater than 1.75cm. This model might well suffer from overfitting but it tells us some important details of the data. It's easy to note that the <i>petal width</i> is the only feature used; we could even say that the petal width is small for setosa samples, medium for versicolor and large for virginica.
<br/><br/>
To understand how the rules separate the labels we can also print the number of samples from each class (class weights) on the leaves:
<pre class="prettyprint">
r = export_text(decision_tree, feature_names=iris['feature_names'],
decimals=0, show_weights=True)
print(r)
</pre>
<pre>
|--- petal width (cm) <= 1
| |--- weights: [50, 0, 0] class: setosa
|--- petal width (cm) > 1
| |--- petal width (cm) <= 2
| | |--- weights: [0, 49, 5] class: versicolor
| |--- petal width (cm) > 2
| | |--- weights: [0, 1, 45] class: virginica
</pre>
Here we have the number of samples per class between square brackets. Recalling that we have 50 samples per class, we see that all the samples labeled as setosa are correctly modelled by the tree, while for 5 virginica and 1 versicolor samples the model fails to capture the information given by the label.<br/><br/>
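As a quick sanity check, we can compare the leaf weights above with a confusion matrix computed on the training data. This is just a sketch using the <i>confusion_matrix</i> function from sklearn; the rows and columns follow the alphabetical order of the class names (setosa, versicolor, virginica) and the counts should match the weights shown in the leaves:
<pre class="prettyprint">
from sklearn.metrics import confusion_matrix
# rows are true labels, columns are predicted labels
print(confusion_matrix(y, decision_tree.predict(X)))
</pre>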
Check out the documentation of the function <i>export_text</i> to discover all its capabilities <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_text.html">here</a>.JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-91457736649832774032019-05-17T08:29:00.001+01:002019-05-17T08:29:02.571+01:00Feelings toward immigration of people from other EU Member States in November 2018In this post we will see a snippet about how to plot a part of the results of the <a href="https://en.wikipedia.org/wiki/Eurobarometer">eurobarometer</a> survey <a href="http://data.europa.eu/euodp/en/data/dataset/S2215_90_3_STD90_ENG">released</a> last March. In particular, we will focus on the responses to the following question:
<blockquote>Please tell me whether the following statement evokes a positive or negative feeling for you: Immigration of people from other EU Member States.</blockquote>
The data from the main spreadsheet reporting the results country by country was isolated in a csv file (then uploaded on github) so that it could be easily loaded in Pandas as follows:
<pre class="prettyprint">
import pandas as pd
# github gist
gist = 'https://gist.githubusercontent.com/JustGlowing/'
gist += '2c25b9b153192baf573ce3b744ea6a65/raw/'
gist += '5f3888f7f42caca58b2418ec5822425083b6d559/'
gist += 'immigration_from_EU_eurobarometer_2018.csv'
df = pd.read_csv(gist, index_col=0)
df = df[df.index.map(lambda x: not '\n' in x)]
df.sort_values(by=["Total 'Positive'"], inplace=True)
# from https://ec.europa.eu/eurostat/statistics-explained/index.php
country_names = {'BE' : 'Belgium',
'BG' : 'Bulgaria',
'CZ' : 'Czechia',
'DK' : 'Denmark',
'DE' : 'Germany',
'EE' : 'Estonia',
'IE' : 'Ireland',
'EL' : 'Greece',
'ES' : 'Spain',
'FR' : 'France',
'HR' : 'Croatia',
'IT' : 'Italy',
'CY' : 'Cyprus',
'LV' : 'Latvia',
'LT' : 'Lithuania',
'LU' : 'Luxembourg',
'HU' : 'Hungary',
'MT' : 'Malta',
'NL' : 'Netherlands',
'AT' : 'Austria',
'PL' : 'Poland',
'PT' : 'Portugal',
'RO' : 'Romania',
'SI' : 'Slovenia',
'SK' : 'Slovakia',
'FI' : 'Finland',
'SE' : 'Sweden',
'UK' : 'United Kingdom'}
df.index = df.index.map(country_names.get)
</pre>
The idea is to create a bar chart with two sides, positive responses on the right and negative on the left. To do this, we can use the function <i>barh</i>, whose parameter <i>left</i> can be used to stack the two subsets of responses ("Fairly positive/ negative" and "Very positive/negative"). The <i>xticks</i> also need to be adapted to reflect that the left side of the axis doesn't report values below zero. Here's the snippet:
<pre class="prettyprint">
import matplotlib.pyplot as plt
import numpy as np
country_idx = range(len(df))
plt.figure(figsize=(11, 14))
plt.barh(country_idx, df['Fairly positive'],
color='deepskyblue',label='Fairly positive')
plt.barh(country_idx, df['Very positive'], left=df['Fairly positive'],
color='dodgerblue', label='Very positive')
plt.barh(country_idx, -df['Fairly negative'],
color='tomato', label='Fairly negative')
plt.barh(country_idx, -df['Very negative'], left=-df['Fairly negative'],
color='firebrick', label='Very negative')
plt.yticks(country_idx, df.index)
plt.xlim([-100, 100])
plt.xticks(np.arange(-100, 101, 25), np.abs(np.arange(-100, 101, 25)))
plt.ylim([-.5, len(df)-.5])
title = 'Feelings toward immigration of people from\n'
title += 'other EU Member States in November 2018'
plt.title(title)
xlbl = 'negative <<< % responses >>> positive'
plt.xlabel(xlbl)
plt.legend(loc='lower right')
bbox_props = dict(fc="white", ec="k", lw=2)
plt.text(-95, 27, 'twitter: @justglowing \nhttps://glowingpython.blogspot.com',
ha="left", va="center", size=11, bbox=bbox_props)
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-R_4n3wTxMYw/XNp6Hr1SyYI/AAAAAAAABac/plBKHYxVc7MCGtJkaTt7Bkx8RS9q2ADkACLcBGAs/s1600/immigration_from_EU_eurobarometer_2018_.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-R_4n3wTxMYw/XNp6Hr1SyYI/AAAAAAAABac/plBKHYxVc7MCGtJkaTt7Bkx8RS9q2ADkACLcBGAs/s400/immigration_from_EU_eurobarometer_2018_.png" width="314" height="400" data-original-width="792" data-original-height="1008" /></a></div>
<br/><br/>
From the chart we note that the percentage of positive responses per country is mostly above 50% while the negative ones reach 50% only in two cases. We also see that Ireland and Sweden are the countries with the most positive responses, while Czechia (yes, that's the Czech Republic :) is the country with the most negative responses, though Cypriots also gave a similar number of "Very negative" responses.JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-4807795009023787272019-04-17T11:08:00.003+01:002019-04-17T11:24:34.580+01:00Visualizing atmospheric carbon dioxideLet's have a look at how to create a visualization that shows how CO2 concentrations evolved in the atmosphere. First, we fetch the data from the Earth System Research Laboratory <a href="https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html">website</a> as follows:
<pre class="prettyprint">
import pandas as pd
data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
names=['year', 'month', 'day', 'decimal', 'ppm',
'days', '1_yr_ago', '10_yr_ago', 'since_1800'])
co2_data['timestamp'] = co2_data.apply(lambda x: pd.Timestamp(year=int(x.year),
month=int(x.month),
day=int(x.day)),
axis=1)
co2_data = co2_data[['timestamp', 'ppm']].set_index('timestamp').ffill()
</pre>
Then, we group it by year and month at the same time, storing the result in a matrix where each element represents the concentration in a specific year and month:
<pre class="prettyprint">
import numpy as np
import matplotlib.pyplot as plt
from calendar import month_abbr
co2_data = co2_data['1975':'2018']
n_years = co2_data.index.year.max() - co2_data.index.year.min()
z = np.ones((n_years +1 , 12)) * np.min(co2_data.ppm)
for d, y in co2_data.groupby([co2_data.index.year, co2_data.index.month]):
z[co2_data.index.year.max() - d[0], d[1] - 1] = y.mean()[0]
plt.figure(figsize=(10, 14))
plt.pcolor(np.flipud(z), cmap='hot_r')
plt.yticks(np.arange(0, n_years+1)+.5,
range(co2_data.index.year.min(), co2_data.index.year.max()+1));
plt.xticks(np.arange(13)-.5, month_abbr)
plt.xlim((0, 12))
plt.colorbar().set_label('Atmospheric Carbon Dioxide in ppm')
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-f4Mqj9IP9Zw/XLWl9jhZ7zI/AAAAAAAABZc/GRIoJx6F_Is5oa90rCwwI8M_MOvShy-IgCLcBGAs/s1600/co2_heatmap_m.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-f4Mqj9IP9Zw/XLWl9jhZ7zI/AAAAAAAABZc/GRIoJx6F_Is5oa90rCwwI8M_MOvShy-IgCLcBGAs/s400/co2_heatmap_m.png" width="313" height="400" data-original-width="625" data-original-height="799" /></a></div>
<br/><br/>
This visualization enables us to compare the CO2 levels month by month at a single glance. For example, we see that the period from April to June gets dark quicker than other periods, meaning that it contains the highest levels every year. Conversely, the period that goes from September to October gets darker more slowly, meaning that it's the period with the lowest CO2 levels. Also, looking at the color bar we note that in 43 years there was an 80 ppm increase.
<br/><br/>
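To double-check that figure, here is a minimal sketch that averages the weekly readings per year and takes the difference between the last and the first year (it reuses the <i>co2_data</i> DataFrame filtered above; the exact number depends on the data available at download time, but it should be roughly 80 ppm):
<pre class="prettyprint">
# average concentration per year
annual_ppm = co2_data.ppm.groupby(co2_data.index.year).mean()
# increase over the whole period considered
print(annual_ppm.iloc[-1] - annual_ppm.iloc[0])
</pre>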
Is this bad for the planet earth? Reading <i>Hansen et al. (2008)</i> we can classify CO2 levels less than 300 ppm as safe, levels between 300 and 350 ppm as dangerous, while levels beyond 350 ppm are considered catastrophic. According to this, the chart is a sad picture of how the levels transitioned from dangerous to catastrophic!
<br/><br/>
Concerned by this situation I created the <a href="https://twitter.com/Co2Forecast">CO2 Forecast twitter account</a> where I'll publish short and long term forecasts of CO2 levels in the atmosphere.JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-50845475190137359122019-03-28T14:09:00.000+00:002019-03-29T10:21:22.610+00:00Speeding up the Sieve of Eratosthenes with NumbaLately, on invitation of my right honourable friend <a href="https://twitter.com/mikel87">Michal</a>, I've been trying to solve some problems from the <a href="https://projecteuler.net">Euler project</a> and felt the need to have a good way to find prime numbers. So I implemented <a href="https://primes.utm.edu/glossary/page.php?sort=SieveOfEratosthenes">the Sieve of Eratosthenes</a>. The algorithm is simple and efficient. It creates a list of all integers below a number <i>n</i>, then filters out the multiples of all primes less than or equal to the square root of <i>n</i>; the remaining numbers are the eagerly-awaited primes. Here's the first version of the implementation I came up with:
<pre class="prettyprint">
def sieve_python(limit):
is_prime = [True]*limit
is_prime[0] = False
is_prime[1] = False
for d in range(2, int(limit**0.5) + 1):
if is_prime[d]:
for n in range(d*d, limit, d):
is_prime[n] = False
return is_prime
</pre>
This returns a list <i>is_prime</i> where <i>is_prime[n]</i> is True if <i>n</i> is a prime number.
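For instance, here's a minimal sketch of how to turn the boolean list into the actual prime numbers:
<pre class="prettyprint">
primes = [n for n, prime in enumerate(sieve_python(30)) if prime]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
</pre>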
The code is straightforward but it wasn't fast enough for my taste so I decided to time it:
<pre class="prettyprint">
from timeit import timeit
def elapse_time(s):
s = timeit(s, number=100, globals=globals())
return f'{s:.3f} seconds'
print(elapse_time('sieve_python(100000)'))
</pre>
<pre>
1.107 seconds
</pre>
1.1 seconds to check 100000 values sounded indeed too slow so I decided to precompile the function with <a href="http://numba.pydata.org/">Numba</a>:
<pre class="prettyprint">
from numba import njit
@njit
def sieve_python_jit(limit):
is_prime = [True]*limit
is_prime[0] = False
is_prime[1] = False
for d in range(2, int(limit**0.5) + 1):
if is_prime[d]:
for n in range(d*d, limit, d):
is_prime[n] = False
return is_prime
sieve_python_jit(10) # compilation
print(elapse_time('sieve_python_jit(100000)'))
</pre>
<pre>
0.103 seconds
</pre>
The only addition to the previous version is the decorator <i>@njit</i> and this simple change resulted in a whopping 10x speed up! However, <a href="https://twitter.com/mikel87">Michal</a> shared with me some code showing that combining Numba with the appropriate Numpy data structures leads to impressive results, so this implementation materialized:
<pre class="prettyprint">
import numpy as np
@njit
def sieve_numpy_jit(limit):
is_prime = np.full(limit, True)
is_prime[0] = False
is_prime[1] = False
for d in range(2, int(np.sqrt(limit) + 1)):
if is_prime[d]:
for n in range(d*d, limit, d):
is_prime[n] = False
return is_prime
sieve_numpy_jit(10) # compilation
print(elapse_time('sieve_numpy_jit(100000)'))
</pre>
<pre>
0.018 seconds
</pre>
The speed up with respect to the first version is 61x!
<br/><br/>
Lessons learned:
<ul>
<li>Using Numba is very straightforward and a Python function written in a decent manner can be sped up with little effort.</li>
<li>Python lists are too heavy in some cases. Even with pre-allocation of the memory they can't beat Numpy arrays for this specific task.</li>
<li>Assigning types correctly is key. Using a Numpy array of integers instead of bools in the function <i>sieve_numpy_jit</i> would result in a slow down.</li>
</ul>
Update: Thanks to <a href="https://www.reddit.com/user/gwillicoder">gwillicoder</a> who made me realize the code could be sped up by checking if the divisor is a prime and who provided a very efficient numpy implementation <a href="https://www.reddit.com/r/Python/comments/b6j9pn/speeding_up_the_sieve_of_eratosthenes_with_numba/">here</a>.JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-10733261126907517252019-03-23T07:38:00.000+00:002019-03-23T07:38:27.383+00:00Visualizing the trend of a time series with PandasThe trend of a time series is the general direction in which the values change. In this post we will focus on how to use rolling windows to isolate it.
Let's download from Google Trends the interest of the search term <i>Pancakes</i> and see what we can do with it:
<pre class="prettyprint">
import pandas as pd
import matplotlib.pyplot as plt
url = './data/pancakes.csv' # downloaded from https://trends.google.com
data = pd.read_csv(url, skiprows=2, parse_dates=['Month'], index_col=['Month'])
plt.plot(data)
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-JUkuOFazXhQ/XJXdRvhkJyI/AAAAAAAABX4/NdeJa7VE6lISxe9kFmRPWg-iNhqn3dh3wCLcBGAs/s1600/pancackes_ts.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-JUkuOFazXhQ/XJXdRvhkJyI/AAAAAAAABX4/NdeJa7VE6lISxe9kFmRPWg-iNhqn3dh3wCLcBGAs/s1600/pancackes_ts.png" data-original-width="378" data-original-height="252" /></a></div>
<br/><br/>
Looking at the data we notice that there's some seasonality (Pancakes day! yay!) and an increasing trend. What if we want to visualize just the trend of this curve? We only need to slide a rolling window through the data and compute the average at each step. This can be done in just one line if we use the method rolling:
<br/><br/>
<pre class="prettyprint">
y_mean = data.rolling('365D').mean()
plt.plot(y_mean)
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-rVDMpI7VTyU/XJXdRgbKr9I/AAAAAAAABX8/XvlrVwZsJkITnkNkWFK_XI50Z6H4tTGxwCLcBGAs/s1600/panckackes_trend.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-rVDMpI7VTyU/XJXdRgbKr9I/AAAAAAAABX8/XvlrVwZsJkITnkNkWFK_XI50Z6H4tTGxwCLcBGAs/s1600/panckackes_trend.png" data-original-width="372" data-original-height="252" /></a></div>
<br/><br/>
The parameter passed to rolling, '365D', means that our rolling window will have a size of 365 days. Check out the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html">documentation of the method</a> to know more.
<br/>
We can also highlight the variation within each year by adding to the chart a shade with the amplitude of the standard deviation:
<br/><br/>
<pre class="prettyprint">
y_std = data.rolling('365D').std()
plt.plot(y_mean)
plt.fill_between(y_mean.index,
(y_mean - y_std).values.T[0],
(y_mean + y_std).values.T[0], alpha=.5)
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-cm64DMlFDuI/XJXdRjZvNKI/AAAAAAAABYA/QOMbHW2z5DA5F-sKJOQUbapvTeGuvtUmgCLcBGAs/s1600/pankaces_ci.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-cm64DMlFDuI/XJXdRjZvNKI/AAAAAAAABYA/QOMbHW2z5DA5F-sKJOQUbapvTeGuvtUmgCLcBGAs/s1600/pankaces_ci.png" data-original-width="372" data-original-height="252" /></a></div>
<br/><br/>
Warning: the visualization above assumes that the distribution of the data each year follows a normal distribution, which is not entirely true.
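If we prefer to avoid that assumption, a possible alternative (just a sketch, the choice of the quantiles is arbitrary) is to shade the area between two rolling quantiles instead of using the standard deviation:
<br/><br/>
<pre class="prettyprint">
y_low = data.rolling('365D').quantile(.25)
y_high = data.rolling('365D').quantile(.75)
plt.plot(y_mean)
plt.fill_between(y_mean.index,
                 y_low.values.T[0],
                 y_high.values.T[0], alpha=.5)
</pre>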
<br/><br/>JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com2tag:blogger.com,1999:blog-1693014329567144872.post-64340962654394346502019-03-20T13:59:00.000+00:002019-03-20T14:10:58.128+00:00Ravel and unravel with numpyRaveling and unraveling are common operations when working with matrices. With a ravel operation we go from matrix coordinates to a flat index, while with an unravel operation we go the opposite way. In this post we will see, through an example, how they can be done with numpy in a very easy way. Let's assume that we have a matrix of dimensions 4-by-4, and that we want the index of the element (1, 1), counting from the top left corner of the matrix. Using <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel_multi_index.html?highlight=ravel_#numpy.ravel_multi_index">ravel_multi_index</a> the solution is easy:
<pre class="prettyprint">
import numpy as np
coordinates = [[1], [1]]
shape = (4, 4)
idx = np.ravel_multi_index(coordinates, shape)
print(idx)
array([5])
</pre>
What if we want to go back to the original coordinates? In this case we can use <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.unravel_index.html?highlight=unravel#numpy.unravel_index">unravel_index</a>:
<pre class="prettyprint">
np.unravel_index(idx, shape)
(array([1]), array([1]))
</pre>
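A typical use of unravel_index, sketched below, is recovering the row and column of the maximum of a matrix from the flat index returned by argmax:
<pre class="prettyprint">
m = np.arange(16).reshape(4, 4)
# argmax returns a flat index, unravel_index turns it
# back into (row, column) coordinates
print(np.unravel_index(np.argmax(m), m.shape))
(3, 3)
</pre>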
So now we know that the element (1, 1) has index 5 ;-)JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-51139255438093397492019-01-22T15:22:00.001+00:002020-03-13T08:10:28.261+00:00A visual introduction to the Gap StatisticsWe have previously seen how to <a href="https://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html">implement KMeans</a>. However, the results of this algorithm strongly rely on the choice of the parameter K. According to statistical folklore the best K is located at the <a href="https://en.wikipedia.org/wiki/Elbow_method_(clustering)">'elbow'</a> of the clusters <a href="https://scikit-learn.org/stable/modules/clustering.html#k-means">inertia</a> curve as K increases. This heuristic has been translated into a more formalized procedure by the Gap Statistics and in this post we'll see how to pick K in an optimal way using the Gap Statistics. The main idea of the methodology is to compare the clusters inertia on the data to cluster with the inertia on a reference dataset. The optimal choice of K is given by the value for which the gap between the two results is maximum. To illustrate this idea, let's pick as reference dataset a uniformly distributed set of points and see the result of KMeans increasing K:
<br />
<br />
<pre class="prettyprint">import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans
reference = np.random.rand(100, 2)
plt.figure(figsize=(12, 3))
for k in range(1,6):
kmeans = KMeans(n_clusters=k)
a = kmeans.fit_predict(reference)
plt.subplot(1,5,k)
plt.scatter(reference[:, 0], reference[:, 1], c=a)
plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-3_nCsr-sLOM/XHj4eK001eI/AAAAAAAABXE/lCj_qKXxGdI7K8EWXXgVmnTFp09AD46-gCLcBGAs/s1600/gap_statistics.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="856" width="520" src="https://2.bp.blogspot.com/-3_nCsr-sLOM/XHj4eK001eI/AAAAAAAABXE/lCj_qKXxGdI7K8EWXXgVmnTFp09AD46-gCLcBGAs/s400/gap_statistics.png" /></a></div>
<br />
<br />
From the figure above we can see that the algorithm evenly splits the points into K clusters even if there's no separation between them. Let's now do the same on a target dataset with 3 natural clusters:
<br />
<br />
<pre class="prettyprint">
X = make_blobs(n_samples=100, n_features=2,
centers=3, cluster_std=.8,)[0]
plt.figure(figsize=(12, 3))
for k in range(1,6):
kmeans = KMeans(n_clusters=k)
a = kmeans.fit_predict(X)
plt.subplot(1,5,k)
plt.scatter(X[:, 0], X[:, 1], c=a)
plt.xlabel('k='+str(k))
plt.tight_layout()
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-4L8iDxlxQcY/XHj5Ulqf5AI/AAAAAAAABXM/OGcTRAIQnZ4Nv9X_vQ9y5FwUkY9Td75rQCLcBGAs/s1600/gap_statistics_3_clusters.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-4L8iDxlxQcY/XHj5Ulqf5AI/AAAAAAAABXM/OGcTRAIQnZ4Nv9X_vQ9y5FwUkY9Td75rQCLcBGAs/s1600/gap_statistics_3_clusters.png" data-original-width="856" data-original-height="209" width="520" /></a></div>
<br />
<br />
Here we note that the algorithm, with K=2, correctly isolates one of the clusters, grouping the other two together. Then, with K=3, it correctly identifies the natural clusters. But, with K=4 and K=5, some of the natural clusters are split in two.
If we plot the inertia in both cases we'll see something interesting:
<br />
<br />
<pre class="prettyprint">def compute_inertia(a, X):
W = [np.mean(pairwise_distances(X[a == c, :])) for c in np.unique(a)]
return np.mean(W)
def compute_gap(clustering, data, k_max=5, n_references=5):
if len(data.shape) == 1:
data = data.reshape(-1, 1)
reference = np.random.rand(*data.shape)
reference_inertia = []
for k in range(1, k_max+1):
local_inertia = []
for _ in range(n_references):
clustering.n_clusters = k
assignments = clustering.fit_predict(reference)
local_inertia.append(compute_inertia(assignments, reference))
reference_inertia.append(np.mean(local_inertia))
ondata_inertia = []
for k in range(1, k_max+1):
clustering.n_clusters = k
assignments = clustering.fit_predict(data)
ondata_inertia.append(compute_inertia(assignments, data))
gap = np.log(reference_inertia)-np.log(ondata_inertia)
return gap, np.log(reference_inertia), np.log(ondata_inertia)
k_max = 5
gap, reference_inertia, ondata_inertia = compute_gap(KMeans(), X, k_max)
plt.plot(range(1, k_max+1), reference_inertia,
'-o', label='reference')
plt.plot(range(1, k_max+1), ondata_inertia,
'-o', label='data')
plt.xlabel('k')
plt.ylabel('log(inertia)')
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-nR-_r1mYuMg/XEcSgrpOAeI/AAAAAAAABV4/WrugFEjF9Q8gw3uXjkeP5ji2WkQbNe_RACLcBGAs/s1600/gap_statistics1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="278" data-original-width="415" height="214" src="https://3.bp.blogspot.com/-nR-_r1mYuMg/XEcSgrpOAeI/AAAAAAAABV4/WrugFEjF9Q8gw3uXjkeP5ji2WkQbNe_RACLcBGAs/s320/gap_statistics1.png" width="320" /></a></div>
<br />
<br />
On the reference dataset the inertia goes down very slowly, while on the target dataset it assumes the shape of an elbow! We can now compute the Gap Statistics for each K by computing the difference of the two curves shown above:
<br />
<br />
<pre class="prettyprint">plt.plot(range(1, k_max+1), gap, '-o')
plt.ylabel('gap')
plt.xlabel('k')
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-j5LOVgdeH7g/XEcSunMgcoI/AAAAAAAABV8/1PWrMFrYVN8IkO3au4zAk1aCk8IXYPXuwCLcBGAs/s1600/gap_statistics_chart.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="278" data-original-width="403" height="221" src="https://4.bp.blogspot.com/-j5LOVgdeH7g/XEcSunMgcoI/AAAAAAAABV8/1PWrMFrYVN8IkO3au4zAk1aCk8IXYPXuwCLcBGAs/s320/gap_statistics_chart.png" width="320" /></a></div>
<br />
<br />
It’s easy to see that the Gap is maximum for K=3, just the right choice for our target dataset.
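To pick K programmatically we can simply take the value that maximizes the gap; here's a one-line sketch using the arrays computed above:
<br />
<br />
<pre class="prettyprint">optimal_k = 1 + np.argmax(gap)
print(optimal_k)  # 3 for this dataset
</pre>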
<br/><br/>
For a more formal introduction you can check out the following paper: <a href="https://statweb.stanford.edu/~gwalther/gap">Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistical Society, 2001.</a>JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com11tag:blogger.com,1999:blog-1693014329567144872.post-4706490403175040582018-06-29T12:59:00.001+01:002018-06-29T12:59:33.700+01:00Plotting a calendar in matplotlibAnd here's a function to plot a compact calendar with matplotlib:
<pre class="prettyprint">
import calendar
import numpy as np
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
def plot_calendar(days, months):
plt.figure(figsize=(9, 3))
# non days are grayed
ax = plt.gca().axes
ax.add_patch(Rectangle((29, 2), width=.8, height=.8,
color='gray', alpha=.3))
ax.add_patch(Rectangle((30, 2), width=.8, height=.8,
color='gray', alpha=.5))
ax.add_patch(Rectangle((31, 2), width=.8, height=.8,
color='gray', alpha=.5))
ax.add_patch(Rectangle((31, 4), width=.8, height=.8,
color='gray', alpha=.5))
ax.add_patch(Rectangle((31, 6), width=.8, height=.8,
color='gray', alpha=.5))
ax.add_patch(Rectangle((31, 9), width=.8, height=.8,
color='gray', alpha=.5))
ax.add_patch(Rectangle((31, 11), width=.8, height=.8,
color='gray', alpha=.5))
for d, m in zip(days, months):
ax.add_patch(Rectangle((d, m),
width=.8, height=.8, color='C0'))
plt.yticks(np.arange(1, 13)+.5, list(calendar.month_abbr)[1:])
plt.xticks(np.arange(1,32)+.5, np.arange(1,32))
plt.xlim(1, 32)
plt.ylim(1, 13)
plt.gca().invert_yaxis()
# remove borders and ticks
for spine in plt.gca().spines.values():
spine.set_visible(False)
plt.tick_params(top=False, bottom=False, left=False, right=False)
plt.show()
</pre>
You can use it to highlight the days with full moon in 2018:
<pre class="prettyprint">
full_moon_day = [2, 31, 2, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22]
full_moon_month = [1, 1, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
plot_calendar(full_moon_day, full_moon_month)
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-GJaEgO23BaA/Wx-OOTI-8YI/AAAAAAAABR4/zo2WrO-dvgE9q7PI6ASQtoja5NO-75hpQCLcBGAs/s1600/full_moon_calendar.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-GJaEgO23BaA/Wx-OOTI-8YI/AAAAAAAABR4/zo2WrO-dvgE9q7PI6ASQtoja5NO-75hpQCLcBGAs/s400/full_moon_calendar.png" width="400" height="144" data-original-width="547" data-original-height="197" /></a></div>
<br/>
Or just highlight all the weekends:
<pre class="prettyprint">
from datetime import datetime, timedelta
def get_weekends(year):
weekend_day = []
weekend_month = []
start = datetime(year, 1, 1)
for i in range(365):
day_of_the_year = start + timedelta(days=i)
if day_of_the_year.weekday() > 4:
weekend_day.append(day_of_the_year.day)
weekend_month.append(day_of_the_year.month)
return weekend_day, weekend_month
weekend_day, weekend_month = get_weekends(2018)
plot_calendar(weekend_day, weekend_month)
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-DXckmge9G3c/Wx-OaGgfrII/AAAAAAAABR8/gnRBgolPHbUvA8qy2M3ktplhRdaVduIbwCLcBGAs/s1600/weekends_calendar.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-DXckmge9G3c/Wx-OaGgfrII/AAAAAAAABR8/gnRBgolPHbUvA8qy2M3ktplhRdaVduIbwCLcBGAs/s400/weekends_calendar.png" width="400" height="144" data-original-width="547" data-original-height="197" /></a></div>JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-41315104267544421172018-05-30T10:27:00.000+01:002018-05-30T10:27:05.525+01:00Visualizing UK Carbon EmissionsHave you ever wanted to check carbon emissions in the UK and never had an easy way to do it? Now you can use the <a href="https://api.carbonintensity.org.uk/">Official Carbon Intensity API</a> developed by the National Grid. Let's see an example of how to use the API to summarize the emissions in the month of May.
First, we download the data with a request to the API:
<pre class="prettyprint">
import urllib.request
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
period = ('2018-05-01T00:00Z', '2018-05-28T00:00Z')
url = 'https://api.carbonintensity.org.uk/intensity/%s/%s'
url = url % period
response = urllib.request.urlopen(url)
data = json.loads(response.read())['data']
</pre>
We organize the result in a DataFrame indexed by timestamps:
<pre class="prettyprint">
carbon_intensity = pd.DataFrame()
carbon_intensity['timestamp'] = [pd.to_datetime(d['from']) for d in data]
carbon_intensity['intensity'] = [d['intensity']['actual'] for d in data]
carbon_intensity['classification'] = [d['intensity']['index'] for d in data]
carbon_intensity.set_index('timestamp', inplace=True)
</pre>
From the classification provided we extract the thresholds to label emissions as low, moderate or high:
<pre class="prettyprint">
thresholds = carbon_intensity.groupby(by='classification').min()
threshold_high = thresholds[thresholds.index == 'high'].values[0][0]
threshold_moderate = thresholds[thresholds.index == 'moderate'].values[0][0]
</pre>
Now we group the data by hour of the day and create a boxplot that shows some interesting facts about carbon emissions in May:
<pre class="prettyprint">
hour_group = carbon_intensity.groupby(carbon_intensity.index.hour)
plt.figure(figsize=(12, 6))
plt.title('UK Carbon Intensity in May 2018')
plt.boxplot([g.intensity for _,g in hour_group],
medianprops=dict(color='k'))
ymin, ymax = plt.ylim()
plt.fill_between(x=np.arange(26),
y1=np.ones(26)*threshold_high,
y2=np.ones(26)*ymax,
color='crimson',
alpha=.3, label='high')
plt.fill_between(x=np.arange(26),
y1=np.ones(26)*threshold_moderate,
y2=np.ones(26)*threshold_high,
color='skyblue',
alpha=.5, label='moderate')
plt.fill_between(x=np.arange(26),
y1=np.ones(26)*threshold_moderate,
y2=np.ones(26)*ymin,
color='palegreen',
alpha=.3, label='low')
plt.ylim(ymin, ymax)
plt.ylabel('carbon intensity (gCO_2/kWH)')
plt.xlabel('hour of the day')
plt.legend(loc='upper left', ncol=3,
shadow=True, fancybox=True)
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-HVaog38KQtw/Ww5pqBJPx1I/AAAAAAAABRE/H8_h8lHHDIU8M_Ymnb7zMSH5KdM-iirwQCLcBGAs/s1600/uk_carbon_emissions.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-HVaog38KQtw/Ww5pqBJPx1I/AAAAAAAABRE/H8_h8lHHDIU8M_Ymnb7zMSH5KdM-iirwQCLcBGAs/s400/uk_carbon_emissions.png" width="400" height="213" data-original-width="728" data-original-height="387" /></a></div>
<br/>
We notice that the medians almost always fall in the moderate emissions region and in two cases they even fall in the low region. In the early afternoon the medians reach their minimum, while the maximum is reached in the evening. It's nice to see that most of the hours present outliers in the low emissions region and only a few outliers are in the high region.
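To put numbers on this impression, here's a quick sketch that computes the share of readings falling in each band, based on the classification column built earlier:
<pre class="prettyprint">
# fraction of readings labelled low, moderate and high by the API
print(carbon_intensity.classification.value_counts(normalize=True))
</pre>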
<br/><br/>
Do you want to know more about boxplots? <a href="https://glowingpython.blogspot.com/2012/09/boxplot-with-matplotlib.html">Check this out</a>!JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-46194070626415110422017-10-09T13:00:00.000+01:002017-10-09T13:00:02.835+01:00Spotting outliers with Isolation Forest using sklearnIsolation Forest is an algorithm to detect outliers. It partitions the data using a set of trees and provides an anomaly score based on how isolated the point is in the structure found; the anomaly score is then used to tell apart outliers from normal observations. In this post we will see an example of how IsolationForest behaves in a simple case. First, we will generate 1-dimensional data from a bimodal distribution, then we will compare the anomaly score with the distribution of the data, highlighting the regions where the outliers fall.
<br/><br/>
To start, let's generate the data and plot the histogram:
<br />
<pre class="prettyprint">import numpy as np
import matplotlib.pyplot as plt
x = np.concatenate((np.random.normal(loc=-2, scale=.5,size=500),
np.random.normal(loc=2, scale=.5, size=500)))
plt.hist(x, density=True)
plt.xlim([-5, 5])
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-U59TuChgJQM/WVuMRjbKXcI/AAAAAAAABIc/d-oOgRwFcQEKyDgFuOCGs14M_Kq9bYeeQCLcBGAs/s1600/bimodal_dist.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="252" data-original-width="381" height="212" src="https://3.bp.blogspot.com/-U59TuChgJQM/WVuMRjbKXcI/AAAAAAAABIc/d-oOgRwFcQEKyDgFuOCGs14M_Kq9bYeeQCLcBGAs/s320/bimodal_dist.png" width="320" /></a></div>
<br/>
Here we note that there are three regions where the data has a low probability to appear: one on the right side of the distribution, another one on the left, and another around zero. Let's see if using IsolationForest we are able to identify these three regions:
<br/><br/>
<pre class="prettyprint">from sklearn.ensemble import IsolationForest
isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(x.reshape(-1, 1))
xx = np.linspace(-6, 6, 100).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
where=outlier==-1, color='r',
alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('x')
plt.xlim([-5, 5])
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-vPcoHoyKX_s/WVufChdxVbI/AAAAAAAABJM/O7mC4iyw7h0gEb7nzT7UWrg6CHmg7ogfgCLcBGAs/s1600/anomaly_score_outliers.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="266" data-original-width="404" height="211" src="https://4.bp.blogspot.com/-vPcoHoyKX_s/WVufChdxVbI/AAAAAAAABJM/O7mC4iyw7h0gEb7nzT7UWrg6CHmg7ogfgCLcBGAs/s320/anomaly_score_outliers.png" width="320" /></a></div>
<br/>
In the snippet above we have trained our IsolationForest using the data generated, computed the anomaly score for each observation and classified each observation as outlier or non outlier. The chart shows the anomaly scores and the regions where the outliers are. As expected, the anomaly score reflects the shape of the underlying distribution and the outlier regions correspond to low probability areas. JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com1tag:blogger.com,1999:blog-1693014329567144872.post-6304665586264657162017-07-13T13:42:00.000+01:002017-07-13T13:42:12.378+01:00Dates in Pandas CheatsheetLately I've been working a lot with dates in Pandas so I decided to make this little cheatsheet with the commands I use the most.<br/><br/>
<h3>Importing a csv using a custom function to parse dates</h3>
<pre class="prettyprint">
import pandas as pd
def parse_month(month):
"""
Converts a string from the format <YYYY>M<MM> in datetime format.
Example: parse_month("2007M02") returns datetime(2007, 2, 1)
"""
return pd.datetime(int(month[:4]), int(month[-2:]), 1)
temperature = pd.read_csv('TempUSA.csv', parse_dates=['Date'],
date_parser=parse_month,
index_col=['Date'], # will become an index
# use a subset of the columns
usecols=['Date',
'LosAngelesMax', 'LosAngelesMin'])
print(temperature)
</pre>
<pre>
LosAngelesMax LosAngelesMin
Date
2000-01-01 19.6 10.0
2000-02-01 18.9 10.1
2000-03-01 18.6 10.1
2000-04-01 20.2 12.5
2000-05-01 21.9 14.2
</pre>
<h3>Format the dates in a chart</h3>
<pre class="prettyprint">
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
plt.plot(temperature['LosAngelesMax'])
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-pFz9f6M1dsY/WWdPJllB2fI/AAAAAAAABJs/mD47xOymfOwdTHJEKvdJ9-VldQan9pfHgCLcBGAs/s1600/date_formatter.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-pFz9f6M1dsY/WWdPJllB2fI/AAAAAAAABJs/mD47xOymfOwdTHJEKvdJ9-VldQan9pfHgCLcBGAs/s320/date_formatter.png" width="320" height="212" data-original-width="380" data-original-height="252" /></a></div>
<br/>
Here's the <a href="https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior">reference of the date format directives</a>. ISO compliant format: %Y-%m-%dT%H:%M:%S.<br/><br/>
<h3>Group the DataFrame by month</h3>
<pre class="prettyprint">
print(temperature.groupby([temperature.index.month]).mean())
</pre>
<pre>
LosAngelesMax LosAngelesMin
Date
1 20.092308 8.992308
2 19.223077 9.276923
3 19.253846 10.492308
4 19.992308 11.461538
5 21.076923 13.761538
6 22.123077 15.800000
7 23.892308 17.315385
8 24.246154 17.530769
9 24.384615 16.846154
10 23.330769 14.630769
11 21.950000 11.241667
12 19.241667 8.683333
</pre>
The resulting DataFrame is indexed by month.<br/><br/>
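In the same way we can group by year to obtain annual averages (just a sketch based on the same temperature DataFrame):
<pre class="prettyprint">
print(temperature.groupby([temperature.index.year]).mean())
</pre>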
<h3>Merging two DataFrames indexed with timestamps that don't match exactly</h3>
<pre class="prettyprint">
import numpy as np
date_range_a = pd.date_range('2007-01-01 01:00',
'2007-01-01 3:00', freq='1h')
date_range_b = date_range_a + pd.Timedelta(10, 'm')
df_a = pd.DataFrame(np.arange(len(date_range_a)),
columns=['a'], index=date_range_a)
df_b = pd.DataFrame(['x', 'y', 'z'],
columns=['b'], index=date_range_b)
print('left DataFrame')
print(df_a)
print('\nright DataFrame')
print(df_b)
print('\nmerge_AsOf result')
print(pd.merge_asof(df_a, df_b, direction='nearest',
left_index=True, right_index=True))
</pre>
<pre>
left DataFrame
a
2007-01-01 01:00:00 0
2007-01-01 02:00:00 1
2007-01-01 03:00:00 2
right DataFrame
b
2007-01-01 01:10:00 x
2007-01-01 02:10:00 y
2007-01-01 03:10:00 z
merge_AsOf result
a b
2007-01-01 01:00:00 0 x
2007-01-01 02:00:00 1 y
2007-01-01 03:00:00 2 z
</pre>
The DataFrames have been aligned according to the index on the left.<br/><br/>
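If we want to avoid matching observations that are too far apart, merge_asof also accepts a tolerance parameter; here's a sketch where the 5 minute threshold is an arbitrary choice:
<pre class="prettyprint">
# with a 10 minute offset and a 5 minute tolerance
# no rows are matched and b is filled with NaN
print(pd.merge_asof(df_a, df_b, direction='nearest',
                    left_index=True, right_index=True,
                    tolerance=pd.Timedelta('5min')))
</pre>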
<h3>Aligning two DataFrames</h3>
<pre class="prettyprint">
aligned = df_a.align(df_b)
print('left aligned')
print(aligned[0])
print('\nright aligned')
print(aligned[1])
print('\ncombination')
aligned[0]['b'] = aligned[1]['b']
print(aligned[0])
</pre>
<pre>
left aligned
a b
2007-01-01 01:00:00 0.0 NaN
2007-01-01 01:10:00 NaN NaN
2007-01-01 02:00:00 1.0 NaN
2007-01-01 02:10:00 NaN NaN
2007-01-01 03:00:00 2.0 NaN
2007-01-01 03:10:00 NaN NaN
right aligned
a b
2007-01-01 01:00:00 NaN NaN
2007-01-01 01:10:00 NaN x
2007-01-01 02:00:00 NaN NaN
2007-01-01 02:10:00 NaN y
2007-01-01 03:00:00 NaN NaN
2007-01-01 03:10:00 NaN z
combination
a b
2007-01-01 01:00:00 0.0 NaN
2007-01-01 01:10:00 NaN x
2007-01-01 02:00:00 1.0 NaN
2007-01-01 02:10:00 NaN y
2007-01-01 03:00:00 2.0 NaN
2007-01-01 03:10:00 NaN z
</pre>
The timestamps are now aligned according to both DataFrames and the unknown values have been filled with NaNs. The missing values can be filled with interpolation when working with numeric values:
<pre class="prettyprint">
print(aligned[0].a.interpolate())
</pre>
<pre>
2007-01-01 01:00:00 0.0
2007-01-01 01:10:00 0.5
2007-01-01 02:00:00 1.0
2007-01-01 02:10:00 1.5
2007-01-01 03:00:00 2.0
2007-01-01 03:10:00 2.0
Name: a, dtype: float64
</pre>
The categorical values can be filled using the fillna method:
<pre class="prettyprint">
print(aligned[1].b.fillna(method='bfill'))
</pre>
<pre>
2007-01-01 01:00:00 x
2007-01-01 01:10:00 x
2007-01-01 02:00:00 y
2007-01-01 02:10:00 y
2007-01-01 03:00:00 z
2007-01-01 03:10:00 z
Name: b, dtype: object
</pre>
The method bfill propagates the next valid observation backward, while ffill propagates the last valid observation forward.<br/><br/>
<h3>Convert a Timedelta in hours</h3>
<pre class="prettyprint">
td = pd.Timestamp('2017-07-05 16:00') - pd.Timestamp('2017-07-05 12:00')
print(td / pd.Timedelta(1, unit='h'))
</pre>
<pre>
4.0
</pre>
To convert to days, hours, minutes and so on, one just needs to change the unit. Here are the accepted values: D,h,m,s,ms,us,ns.<br/><br/>
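For example, the same Timedelta expressed in minutes (a sketch reusing td from above):
<pre class="prettyprint">
print(td / pd.Timedelta(1, unit='m'))
</pre>
<pre>
240.0
</pre>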
<h3>Convert pandas timestamps in unix timestamps</h3>
<pre class="prettyprint">
unix_ts = pd.date_range('2017-01-01 1:00',
'2017-01-01 2:00',
freq='30min').astype(np.int64) // 10**9
print(unix_ts)
</pre>
<pre>
Int64Index([1483232400, 1483234200, 1483236000], dtype='int64')
</pre>
To convert to milliseconds, divide by 10**6 instead of 10**9.<br/><br/>
<h3>Convert unix timestamps in pandas timestamps</h3>
<pre class="prettyprint">
print(pd.to_datetime(unix_ts, unit='s'))
</pre>
<pre>
DatetimeIndex(['2017-01-01 01:00:00', '2017-01-01 01:30:00',
'2017-01-01 02:00:00'],
dtype='datetime64[ns]', freq=None)
</pre>
To convert from timestamps in milliseconds change the unit to 'ms'.JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com0tag:blogger.com,1999:blog-1693014329567144872.post-57082278036970238052017-06-16T16:27:00.000+01:002019-05-30T19:37:21.327+01:00A heatmap of male to female ratios with SeabornIn this post we will see how to create a heatmap with seaborn. We'll use a dataset from the <a href="http://www.wittgensteincentre.org/dataexplorer">Wittgenstein Centre Data Explorer</a>. The data extracted is also reported <a href="https://gist.github.com/JustGlowing/1f3d7ff0bba7f79651b00f754dc85bf1">here</a> in csv format. It contains the ratio of males to females in the population by age for 1970 to 2015 (data reported after this period is projected).
First, we import the data using Pandas:
<pre class="prettyprint">
import pandas as pd
import numpy as np
sex_ratios = pd.read_csv('m2f_ratios.csv', skiprows=8)
age_code = {a: i for i,a in enumerate(sex_ratios.Age.unique())}
age_label = {i: a for i,a in enumerate(sex_ratios.Age.unique())}
sex_ratios['AgeCode'] = sex_ratios.Age.apply(lambda x: age_code[x])
area_idx = sex_ratios.Area == \
'United Kingdom of Great Britain and Northern Ireland'
years_idx = sex_ratios.Year <= 2015
sex_ratios_uk = sex_ratios[np.logical_and(years_idx, area_idx)]
</pre>
Here we take care of the age coding and isolate the data for the United Kingdom and Northern Ireland. Now we can rearrange the data to see the ratio per year and age using a pivot table; we can then visualize the result using the heatmap function from seaborn:
<pre class="prettyprint">
import matplotlib.pyplot as plt
import seaborn as sns
pivot_uk = sex_ratios_uk.pivot_table(values='Ratio',
index='AgeCode',
columns='Year')
pivot_uk.index = [age_label[a] for a in pivot_uk.index.values]
plt.figure(figsize=(10, 8))
plt.title('Sex ratio per year and age groups')
sns.heatmap(pivot_uk, annot=True)
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-E8BigyvkcUs/WQiR0-FwO0I/AAAAAAAABGw/f4E-SvT6Dt83fBz3ihDJfuMs2YXtJhPJACLcB/s1600/sex_ration_heatmap_sns.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-E8BigyvkcUs/WQiR0-FwO0I/AAAAAAAABGw/f4E-SvT6Dt83fBz3ihDJfuMs2YXtJhPJACLcB/s1600/sex_ration_heatmap_sns.png" width="500" /></a></div>
<br/>
In each year we see that the ratio was above 1 (in favor of males) for young ages; it then becomes lower than 1 during adulthood and keeps decreasing with age. It also seems that with time the ratio decreases more slowly. For example, we see that the age group 70-74 had a ratio of 0.63 in 1970, while the ratio in 2015 was 0.9.JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com2tag:blogger.com,1999:blog-1693014329567144872.post-23064107543853471702017-04-27T09:45:00.000+01:002017-04-27T09:46:12.256+01:00Solving the Two Spirals problem with KerasIn this post we will see how to create a Multi Layer Perceptron (MLP), one of the most common Neural Network architectures, with Keras. Then, we'll train the MLP to tell apart points from two different spirals in the same space.
<br/>
To have a sense of the problem, let's first generate the data to train the network:
<pre class="prettyprint">
import numpy as np
import matplotlib.pyplot as plt
def twospirals(n_points, noise=.5):
"""
Returns the two spirals dataset.
"""
n = np.sqrt(np.random.rand(n_points,1)) * 780 * (2*np.pi)/360
d1x = -np.cos(n)*n + np.random.rand(n_points,1) * noise
d1y = np.sin(n)*n + np.random.rand(n_points,1) * noise
return (np.vstack((np.hstack((d1x,d1y)),np.hstack((-d1x,-d1y)))),
np.hstack((np.zeros(n_points),np.ones(n_points))))
X, y = twospirals(1000)
plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.', label='class 1')
plt.plot(X[y==1,0], X[y==1,1], '.', label='class 2')
plt.legend()
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-f59mE0OpFIc/WQGrB5jHu_I/AAAAAAAABGI/7m1cwaTpI94TjizUrvCr7xa2pF3WeBLEQCLcB/s1600/spirals_nn1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-f59mE0OpFIc/WQGrB5jHu_I/AAAAAAAABGI/7m1cwaTpI94TjizUrvCr7xa2pF3WeBLEQCLcB/s1600/spirals_nn1.png" /></a></div>
<br/>
As we can see, this dataset contains two different spirals. This kind of dataset has been named the <a href="http://www.benmargolis.com/compsci/ai/two_spirals_problem.htm">Worst Dataset Ever!</a>; indeed, telling apart the points from the two spirals is not an easy task if your MLP is not sophisticated enough.
Let's build a simple MLP with Keras and see what we can achieve:
<pre class="prettyprint">
from keras.models import Sequential
from keras.layers import Dense
mymlp = Sequential()
mymlp.add(Dense(12, input_dim=2, activation='tanh'))
mymlp.add(Dense(1, activation='sigmoid'))
mymlp.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# trains the model
mymlp.fit(X, y, epochs=150, batch_size=10, verbose=0)
</pre>
Here we created a Neural Network with the following structure: 2 inputs (the data is in a 2D space) fully connected to 12 hidden neurons and 1 output.
Let's generate some test data and see if our model is able to classify them:
<pre class="prettyprint">
X_test, y_test = twospirals(1000)
yy = np.round(mymlp.predict(X_test).T[0])
plt.subplot(1,2,1)
plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.')
plt.plot(X[y==1,0], X[y==1,1], '.')
plt.subplot(1,2,2)
plt.title('Neural Network result')
plt.plot(X_test[yy==0,0], X_test[yy==0,1], '.')
plt.plot(X_test[yy==1,0], X_test[yy==1,1], '.')
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-SrywHFtY1ow/WQGrKWRVbYI/AAAAAAAABGM/pfH6XLnJqEUP3s78HjyR007OGhYRDVjbwCLcB/s1600/spirals_nn2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-SrywHFtY1ow/WQGrKWRVbYI/AAAAAAAABGM/pfH6XLnJqEUP3s78HjyR007OGhYRDVjbwCLcB/s1600/spirals_nn2.png" /></a></div>
<br/>
We have the original training set on the left and the results of the Neural Network on the right. It's easy to note that the model misclassified most of the points in the test data.
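We can also quantify this with the evaluate method; this is just a sketch and the exact accuracy will vary from run to run:
<pre class="prettyprint">
loss, accuracy = mymlp.evaluate(X_test, y_test, verbose=0)
print('accuracy on the test set: %0.2f' % accuracy)
</pre>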
Let's add two hidden layers to our model and see what happens:
<pre class="prettyprint">
mymlp = Sequential()
mymlp.add(Dense(12, input_dim=2, activation='tanh'))
mymlp.add(Dense(12, activation='tanh'))
mymlp.add(Dense(12, activation='tanh'))
mymlp.add(Dense(1, activation='sigmoid'))
mymlp.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# Fit the model
mymlp.fit(X, y, epochs=150, batch_size=10, verbose=0)
yy = np.round(mymlp.predict(X_test).T[0])
plt.subplot(1,2,1)
plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.')
plt.plot(X[y==1,0], X[y==1,1], '.')
plt.subplot(1,2,2)
plt.title('Neural Network result')
plt.plot(X_test[yy==0,0], X_test[yy==0,1], '.')
plt.plot(X_test[yy==1,0], X_test[yy==1,1], '.')
plt.show()
</pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-_lsqRJl5eVc/WQGrP2KD8VI/AAAAAAAABGQ/UcgdEydm8psa2LwJpNy1K3W0qfm2VuHGgCLcB/s1600/spirals_nn3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-_lsqRJl5eVc/WQGrP2KD8VI/AAAAAAAABGQ/UcgdEydm8psa2LwJpNy1K3W0qfm2VuHGgCLcB/s1600/spirals_nn3.png" /></a></div>
<br/>
The structure of our Network is now more suited to solve the problem and we see that most of the points used for the test were correctly classified.JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com3tag:blogger.com,1999:blog-1693014329567144872.post-27701661994152699922016-05-21T07:25:00.000+01:002019-06-05T10:37:18.998+01:00An intro to Regression Analysis with Decision TreesIt's been a while since the last post on this blog, but the Glowing Python is still active and strong! I just decided to publish some of my posts on the <a href="https://blog.cambridgecoding.com">Cambridge Coding Academy blog</a>. Here are the links to a series of two posts about Regression Analysis with Decision Trees:
<ul>
<li>1. <a href="
https://cambridgespark.com/getting-started-with-regression-and-decision-trees/">Getting started with Regression Analysis and Decision Trees</a></li>
<li>2. <a href="https://cambridgespark.com/from-simple-regression-to-multiple-regression-with-decision-trees/">From simple Regression to Multiple Regression with Decision Trees</a></li>
</ul>
In this introduction to Regression Analysis we will see how to use scikit-learn to train Decision Trees to solve a specific problem: "<i>How to predict the number of bikes hired in a bike sharing system in a given day?</i>"
<br/><br/>
In the first post, we will see how to train a simple Decision Tree to exploit the relation between temperature and bikes hired; this tree will be analysed to explain the result of the training process and gain insights about the data. In the second, we will see how to learn more complex decision trees and how to assess the accuracy of the prediction using <a href="http://glowingpython.blogspot.co.uk/2014/04/parameters-selection-with-cross.html">cross validation</a>.
<br/><br/>
Here's a sneak peak of the figures that we will generate:
<br/><br/>
<center>
<table border=0>
<tr>
<td><a href="https://cambridgecoding.files.wordpress.com/2016/01/regtree2.png?w=180" imageanchor="1" ><img border="0" src="https://cambridgecoding.files.wordpress.com/2016/01/regtree2.png?w=180" /></a></td>
<td><a href="https://cambridgecoding.files.wordpress.com/2016/01/regspace2.png?w=180" imageanchor="1" ><img border="0" src="https://cambridgecoding.files.wordpress.com/2016/01/regspace2.png?w=180" /></a></td>
</tr>
</table>
</center>JustGlowinghttp://www.blogger.com/profile/17212021288715206641noreply@blogger.com2