Tuesday, January 27, 2015

Forecasting beer consumption with sklearn

In this post we will see how to implement a straightforward forecasting model based on the linear regression object of sklearn. The model that we are going to build is based on the idea idea that past observations are good predictors of a future value. Using some symbols, given xn−k,...,xn−2,xn−1 we want to estimate xn+h where h is the forecast horizon just using the given values. The estimation that we are going to apply is the following:

where xn−k and xn−1 are respectively the oldest and the newest observation we consider for the forecast. The weights wk,...,w1,w0 are chosen in order to minimize

where m is number of periods available to train our model. This model is often referred as regression model with lagged explanatory variables and k is called lag order.

Before implementing the model let's load a time series to forecast:
import pandas as pd
df = pd.read_csv('NZAlcoholConsumption.csv')
to_forecast = df.TotalBeer.values
dates = df.DATE.values
The time series represent the total of alcohol consumed by quarter millions of litres from the 1st quarter of 2000 to 3rd quarter of 2012. The data is from New Zealand government and can be downloaded in csv from here. We will focus on the forecast of beer consumption.
First, we need to organize our data in forecast in windows that contain the previous observations:
import numpy as np

def organize_data(to_forecast, window, horizon):
      to_forecast, univariate time series organized as numpy array
      window, number of items to use in the forecast window
      horizon, horizon of the forecast
      X, a matrix where each row contains a forecast window
      y, the target values for each row of X
    shape = to_forecast.shape[:-1] + /
            (to_forecast.shape[-1] - window + 1, window)
    strides = to_forecast.strides + (to_forecast.strides[-1],)
    X = np.lib.stride_tricks.as_strided(to_forecast, 
    y = np.array([X[i+horizon][-1] for i in range(len(X)-horizon)])
    return X[:-horizon], y

k = 4   # number of previous observations to use
h = 1   # forecast horizon
X,y = organize_data(to_forecast, k, h)
Now, X is a matrix where the i-th row contains the lagged variables xn−k,...,xn−2,xn−1 and y[i] contains the i-th target value. We are ready to train our forecasting model:
from sklearn.linear_model import LinearRegression
m = 10 # number of samples to take in account
regressor = LinearRegression(normalize=True)
regressor.fit(X[:m], y[:m])
We trained our model using the first 10 observations, which means that we used the data from 1st quarter of 2000 to the 2nd quarter of 2002. We use a lag order of one year and a forecast horizon of 1 quarter. To estimate the error of the model we will use the mean absolute percentage error (MAPE). Computing this metric to compare the forecast of the remaining observation of the time series and the actual observations we have:
def mape(ypred, ytrue):
    """ returns the mean absolute percentage error """
    idx = ytrue != 0.0
    return 100*np.mean(np.abs(ypred[idx]-ytrue[idx])/ytrue[idx])

print 'The error is %0.2f%%' % mape(regressor.predict(X[m:]),y[m:])
The error is 6.15%
Which means that, on average, the forecast provided by our model differs from the target value only of 6.15%. Let's compare the forecast and the observed values visually:
plot(y, label='True demand', color='#377EB8', linewidth=2)
     '--', color='#EB3737', linewidth=3, label='Prediction')
plot(y[:m], label='Train data', color='#3700B8', linewidth=2)
xticks(arange(len(dates))[1::4],dates[1::4], rotation=45)
legend(loc='upper right')
ylabel('beer consumed (millions of litres)')

We note that the forecast is very close to the target values and that the model was able to learn the trends and anticipate them in many cases.


  1. Do you have any example of how antecipate values?

    1. I don't really understand what you mean. If you want an exampe of forecast you can refer to the picture. The red dashed line is the forecast and the light blue line is the actual consumption.

  2. What I meant was: Can we antecipate values, like predict the values of 2012Q4 or more future?

    1. Yes, you have two options. You can train the model with the specific time horizon that you need or train the model with time horizon equal to 1 and predict the values one by one till you reach the point that you need. In the second case you have to reuse the predictions in between as known values to produce the prediction that follow, beware that in this way you will accumulate some error.

    2. In first case, will accumulate some error, but less or equal to second, right?

      I'm trying to do some example, but I'm kind a newbie on python and IA.

      Aah, by the way, it was me the Anonymous up there :).

  3. I am also interested in learning how to predict the value of a future point (say the next step). How would we implement the prediction on the next point in the time series?

    1. If you consider the code in this post, you just need to set h = 1 and use the predict method.

  4. i know this is an old post, but i was trying to replicate what you posted and i receive an error:

    on this line of code, it doesn't like the "+ /"; i get an invalid syntax error.

    shape = to_forecast.shape[:-1] + /

    any thoughts on what could cause that?

    1. hi, try to remove the "/" and merge the line below:

      shape = to_forecast.shape[:-1] + (to_forecast.shape[-1] - window + 1, window)

    2. That worked, thank you!

  5. hello, i am wondering when you predict 2012-q4, do you mean predict('2012-q4')? to predict one by one? could you give an example? thank you.

  6. Thanks for this interesting program. The following line does not work. Can you please help?

    1. Hi! make sure you have imported the required functions from matplotlib.pyplot.

  7. how can one add the analysis future prediction, ex 2013Q1


Note: Only a member of this blog may post a comment.