Friday, May 17, 2019

Feelings toward immigration of people from other EU Member States in November 2018

In this post we will see how to plot part of the results of the Eurobarometer survey released last March. In particular, we will focus on the responses to the following question:
Please tell me whether the following statement evokes a positive or negative feeling for you: Immigration of people from other EU Member States.
The data from the main spreadsheet, which reports the results country by country, was isolated in a csv file (then uploaded on GitHub) so that it can easily be loaded into Pandas as follows:
import pandas as pd

# github gist
gist = 'https://gist.githubusercontent.com/JustGlowing/'
gist += '2c25b9b153192baf573ce3b744ea6a65/raw/'
gist += '5f3888f7f42caca58b2418ec5822425083b6d559/'
gist += 'immigration_from_EU_eurobarometer_2018.csv'
df = pd.read_csv(gist, index_col=0)
df = df[df.index.map(lambda x: '\n' not in x)]  # drop the rows whose label contains a newline
df.sort_values(by=["Total 'Positive'"], inplace=True)

# from https://ec.europa.eu/eurostat/statistics-explained/index.php
country_names = {'BE' : 'Belgium',
                 'BG' : 'Bulgaria',
                 'CZ' : 'Czechia',
                 'DK' : 'Denmark',
                 'DE' : 'Germany',
                 'EE' : 'Estonia',
                 'IE' : 'Ireland',
                 'EL' : 'Greece',
                 'ES' : 'Spain',
                 'FR' : 'France',
                 'HR' : 'Croatia',
                 'IT' : 'Italy',
                 'CY' : 'Cyprus',
                 'LV' : 'Latvia',
                 'LT' : 'Lithuania',
                 'LU' : 'Luxembourg',
                 'HU' : 'Hungary',
                 'MT' : 'Malta',
                 'NL' : 'Netherlands',
                 'AT' : 'Austria',
                 'PL' : 'Poland',
                 'PT' : 'Portugal',
                 'RO' : 'Romania',
                 'SI' : 'Slovenia',
                 'SK' : 'Slovakia',
                 'FI' : 'Finland',
                 'SE' : 'Sweden',
                 'UK' : 'United Kingdom'}

df.index = df.index.map(country_names.get)
The idea is to create a bar chart with two sides: positive responses on the right and negative ones on the left. To do this, we can use the function barh, whose left attribute lets us stack the two subsets of responses ("Fairly positive/negative" and "Very positive/negative"). The xticks also need to be adapted to reflect that the left side of the axis doesn't report values below zero. Here's the snippet:
import matplotlib.pyplot as plt
import numpy as np

country_idx = range(len(df))

plt.figure(figsize=(11, 14))
plt.barh(country_idx, df['Fairly positive'],
         color='deepskyblue', label='Fairly positive')
plt.barh(country_idx, df['Very positive'], left=df['Fairly positive'],
         color='dodgerblue', label='Very positive')
plt.barh(country_idx, -df['Fairly negative'],
         color='tomato', label='Fairly negative')
plt.barh(country_idx, -df['Very negative'], left=-df['Fairly negative'],
         color='firebrick', label='Very negative')

plt.yticks(country_idx, df.index)
plt.xlim([-100, 100])
plt.xticks(np.arange(-100, 101, 25), np.abs(np.arange(-100, 101, 25)))
plt.ylim([-.5, len(df)-.5])
title = 'Feelings toward immigration of people from\n'
title += 'other EU Member States in November 2018'
plt.title(title)
xlbl = 'negative            <<<       % responses       >>>            positive'
plt.xlabel(xlbl)
plt.legend(loc='lower right')

bbox_props = dict(fc="white", ec="k", lw=2) 
plt.text(-95, 27, 'twitter: @justglowing \nhttps://glowingpython.blogspot.com',
         ha="left", va="center", size=11, bbox=bbox_props)
plt.show()


From the chart we note that the percentage of positive responses per country is mostly above 50%, while the negative ones reach 50% in only two cases. We also see that Ireland and Sweden are the countries with the most positive responses, while Czechia (yes, that's the Czech Republic :) is the country with the most negative responses, though Cypriots also gave a similar number of "Very negative" responses.
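These observations can be double checked directly on the dataframe; here's a quick sketch that reuses the df loaded above:
# countries where the total of negative responses reaches 50%
total_negative = df['Fairly negative'] + df['Very negative']
print(total_negative[total_negative >= 50])

# the two countries with the most positive responses
print(df["Total 'Positive'"].nlargest(2))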

Wednesday, April 17, 2019

Visualizing atmospheric carbon dioxide

Let's have a look at how to create a visualization that shows how CO2 concentrations evolved in the atmosphere. First, we fetch the data from the Earth System Research Laboratory website as follows:
import pandas as pd

data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep=r'\s+', comment='#', na_values=-999.99,
                       names=['year', 'month', 'day', 'decimal', 'ppm', 
                       'days', '1_yr_ago',  '10_yr_ago', 'since_1800'])

co2_data['timestamp'] = co2_data.apply(lambda x: pd.Timestamp(year=int(x.year),
                                                             month=int(x.month),
                                                             day=int(x.day)),
                                       axis=1)
co2_data = co2_data[['timestamp', 'ppm']].set_index('timestamp').ffill()
Then, we group it by year and month at the same time, storing the result in a matrix where each element represents the concentration in a specific year and month:
import numpy as np
import matplotlib.pyplot as plt
from calendar import month_abbr

co2_data = co2_data['1975':'2018']
n_years = co2_data.index.year.max() - co2_data.index.year.min()
z = np.ones((n_years + 1, 12)) * np.min(co2_data.ppm)  # one row per year, one column per month
for d, y in co2_data.groupby([co2_data.index.year, co2_data.index.month]):
  z[co2_data.index.year.max() - d[0], d[1] - 1] = y.mean()[0]  # monthly mean ppm
  
plt.figure(figsize=(10, 14))
plt.pcolor(np.flipud(z), cmap='hot_r')
plt.yticks(np.arange(0, n_years+1)+.5,
           range(co2_data.index.year.min(), co2_data.index.year.max()+1))
plt.xticks(np.arange(13)-.5, month_abbr)
plt.xlim((0, 12))
plt.colorbar().set_label('Atmospheric Carbon Dioxide in ppm')
plt.show()


This visualization allows us to compare the CO2 levels month by month at a single glance. For example, we see that the period from April to June gets dark quicker than the other periods, meaning that it contains the highest levels every year. Conversely, the period that goes from September to October gets darker more slowly, meaning that it's the period with the lowest CO2 levels. Also, looking at the color bar we note that in 43 years there was an 80 ppm increase.
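The increase can also be double checked programmatically; here's a quick sketch that reuses the co2_data series loaded above:
# difference between the mean concentration in the last and first year
yearly_mean = co2_data.ppm.resample('Y').mean()
print(yearly_mean.iloc[-1] - yearly_mean.iloc[0])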

Is this bad for planet Earth? Reading Hansen et al. (2008), we can classify CO2 levels below 300 ppm as safe, levels between 300 and 350 ppm as dangerous, and levels beyond 350 ppm as catastrophic. According to this, the chart is a sad picture of how the levels transitioned from dangerous to catastrophic!
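Just to make these thresholds explicit, here's a minimal sketch of a helper that classifies a given concentration (the function name co2_class is just for illustration):
def co2_class(ppm):
    # thresholds from Hansen et al. (2008)
    if ppm < 300:
        return 'safe'
    if ppm <= 350:
        return 'dangerous'
    return 'catastrophic'

print(co2_class(330), co2_class(412))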

Concerned by this situation, I created the CO2 Forecast twitter account, where I'll publish short and long term forecasts of CO2 levels in the atmosphere.

Thursday, March 28, 2019

Speeding up the Sieve of Eratosthenes with Numba

Lately, on the invitation of my right honourable friend Michal, I've been trying to solve some problems from the Euler project and felt the need for a good way to find prime numbers, so I implemented the Sieve of Eratosthenes. The algorithm is simple and efficient: it creates a list of all integers below a number n, then filters out the multiples of all primes less than or equal to the square root of n; the remaining numbers are the eagerly awaited primes. Here's the first version of the implementation I came up with:
def sieve_python(limit):
    is_prime = [True]*limit
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(limit**0.5) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime
This returns a list is_prime where is_prime[n] is True if n is a prime number. The actual primes can then be extracted with a simple list comprehension, for example:
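primes = [n for n, prime in enumerate(sieve_python(100)) if prime]
The code is straightforward, but it wasn't fast enough for my taste, so I decided to time it: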
from timeit import timeit

def elapse_time(s):
    s = timeit(s, number=100, globals=globals())
    return f'{s:.3f} seconds'

print(elapse_time('sieve_python(100000)'))
1.107 seconds
1.1 seconds to check 100000 values indeed sounded too slow, so I decided to precompile the function with Numba:
from numba import njit

@njit
def sieve_python_jit(limit):
    is_prime = [True]*limit
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(limit**0.5) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime

sieve_python_jit(10) # compilation
print(elapse_time('sieve_python_jit(100000)'))
0.103 seconds
The only addition to the previous version is the decorator @njit, and this simple change resulted in a whopping 10x speed up! However, Michal shared with me some code showing that combining Numba with the appropriate Numpy data structures leads to even more impressive results, so this implementation materialized:
import numpy as np

@njit
def sieve_numpy_jit(limit):
    is_prime = np.full(limit, True)
    is_prime[0] = False
    is_prime[1] = False
    for d in range(2, int(np.sqrt(limit)) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = False  
    return is_prime

sieve_numpy_jit(10) # compilation
print(elapse_time('sieve_numpy_jit(100000)'))
0.018 seconds
The speed up with respect to the first version is 61x!

Lessons learned:
  • Using Numba is very straightforward, and a Python function written in a decent manner can be sped up with little effort.
  • Python lists are too heavy in some cases. Even with pre-allocation of the memory they can't beat Numpy arrays for this specific task.
  • Assigning types correctly is key. Using a Numpy array of integers instead of bools in the function sieve_numpy_jit would result in a slow down (see the sketch after this list).
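For reference, here's a minimal sketch (not timed here) of that slower integer variant, identical to sieve_numpy_jit except for the dtype and the flag values:
@njit
def sieve_numpy_int_jit(limit):
    # same sieve, but with an integer array instead of a boolean one
    is_prime = np.ones(limit, dtype=np.int64)
    is_prime[0] = 0
    is_prime[1] = 0
    for d in range(2, int(np.sqrt(limit)) + 1):
        if is_prime[d]:
            for n in range(d*d, limit, d):
                is_prime[n] = 0
    return is_prime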
Update: Thanks to gwillicoder, who made me realize the code could be sped up by checking if the divisor is a prime, and who provided a very efficient numpy implementation here.