Saturday, March 23, 2019

Visualizing the trend of a time series with Pandas

The trend of a time series is the general direction in which its values change. In this post we will focus on how to use rolling windows to isolate it. Let's download the search interest for the term Pancakes from Google Trends and see what we can do with it:
import pandas as pd
import matplotlib.pyplot as plt
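url = './data/pancakes.csv' # downloaded from https://trends.google.com
# skiprows=2 skips the lines that precede the column header in the export;
# the Month column is parsed as dates and used as the index
data = pd.read_csv(url, skiprows=2, parse_dates=['Month'], index_col=['Month'])
plt.plot(data)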


Looking at the data we notice that there's some seasonality (Pancakes day! yay!) and an increasing trend. What if we want to visualize just the trend of this curve? We only need to slide a rolling window through the data and compute the average at each step. This can be done in just one line if we use the method rolling:

y_mean = data.rolling('365D').mean()
plt.plot(y_mean)


The argument '365D' passed to rolling means that our rolling window spans 365 days. Check out the documentation of the method to learn more.
We can also highlight the variation within each year by adding to the chart a shaded band whose width is given by the standard deviation:
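To make the difference between an offset window and a fixed number of samples concrete, here is a minimal sketch on a tiny made-up series (the dates and values are illustrative assumptions, not the Google Trends data). It compares a 3-day offset window with a window of 3 observations:

```python
import pandas as pd

# Made-up, irregularly spaced observations (not the Google Trends data)
idx = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-10'])
s = pd.Series([1.0, 2.0, 3.0, 10.0], index=idx)

# Offset window: each mean uses only the observations from the last 3 days
by_time = s.rolling('3D').mean()

# Integer window: each mean uses the last 3 observations, whatever their dates
by_count = s.rolling(3).mean()

print(by_time)
print(by_count)
```

On the last date the offset window contains a single observation, while the integer window still averages the three most recent ones, even though they are a week apart.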

y_std = data.rolling('365D').std()
plt.plot(y_mean)
plt.fill_between(y_mean.index,
                 (y_mean - y_std).values.T[0],
                 (y_mean + y_std).values.T[0], alpha=.5)


Warning: reading the shaded band above as the typical range of the data assumes that the values within each yearly window are approximately normally distributed, which is not entirely true.
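One possible way to avoid leaning on the normality assumption is to shade between rolling quantiles instead of mean plus or minus one standard deviation. The sketch below is only an illustration: it uses a synthetic monthly series in place of the downloaded file, and the column name pancakes and the choice of the 25th and 75th percentiles are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for the Google Trends data (made-up values)
rng = np.random.default_rng(0)
index = pd.date_range('2014-01-01', periods=60, freq='MS')
values = np.linspace(40, 80, 60) + rng.normal(0, 5, 60)
data = pd.DataFrame({'pancakes': values}, index=index)

# Same 365-day window as before, summarized with quantiles
window = data.rolling('365D')
y_median = window.median()
y_low = window.quantile(0.25)   # lower edge of the band
y_high = window.quantile(0.75)  # upper edge of the band
```

The series y_low and y_high can then be passed to plt.fill_between in place of y_mean - y_std and y_mean + y_std, with y_median drawn as the central line.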

2 comments:

  1. One of the trickier aspects I've encountered is where data is not regularly sampled, and messes up the rolling statistics a little. I'd love to see your approach for solving that issue!

    1. Hi! There's one thing to note. In this post I specified the time window with an offset ('365D'), which means that all the data used to compute a given mean falls within a window of 365 days, not 365 samples. On the other hand, when working with data that is not regularly sampled, I have successfully applied the following resampling: data.resample(offset).interpolate(). Of course, it may not work in some cases.
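    A minimal sketch of the resampling idea mentioned in this reply, using made-up values and assuming a daily grid ('D') as the offset:

```python
import pandas as pd

# Irregularly sampled series (made-up values): two days are missing
idx = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-05', '2019-01-06'])
s = pd.Series([1.0, 2.0, 5.0, 6.0], index=idx)

# Put the data on a regular daily grid; the missing days are filled
# by linear interpolation between the neighbouring observations
regular = s.resample('D').interpolate()

# Rolling statistics can now be computed on evenly spaced data
smooth = regular.rolling(3).mean()
print(regular)
```

    Note that this works cleanly here because the original timestamps already fall on the daily grid; as far as I understand, observations that fall between grid points are not used by the interpolation, which is one of the cases where this approach may not work.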
