import pandas as pd
import matplotlib.pyplot as plt

url = './data/pancakes.csv'  # downloaded from https://trends.google.com
data = pd.read_csv(url, skiprows=2, parse_dates=['Month'], index_col=['Month'])
plt.plot(data)
Looking at the data, we notice some seasonality (Pancake Day! yay!) and an increasing trend. What if we want to visualize just the trend of this curve? We only need to slide a rolling window through the data and compute the average at each step. This can be done in a single line with the rolling method:
y_mean = data.rolling('365D').mean()
plt.plot(y_mean)
The argument passed to rolling, '365D', sets the size of the rolling window to 365 days. Check out the documentation of the method to learn more.
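To make the window argument more concrete, here is a minimal sketch on a made-up daily series (the values and dates are hypothetical, not the pancakes data). It compares an offset window, which looks back over a span of time, with an integer window, which looks back over a fixed number of rows.

```python
import pandas as pd

# Hypothetical toy series: one value per day for six days.
idx = pd.date_range("2020-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Offset window: each mean covers the observations in the trailing 3 days.
# With an offset, a window holding a single observation is still valid,
# so the first rows are not NaN.
by_time = s.rolling("3D").mean()

# Integer window: each mean covers the last 3 rows, whatever their dates.
# The first two rows are NaN because fewer than 3 rows are available.
by_count = s.rolling(3).mean()

print(pd.DataFrame({"by_time": by_time, "by_count": by_count}))
```

On regularly sampled data the two agree once the window is full. They differ at the start of the series, and they can differ everywhere if the sampling is irregular.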
We can also highlight the variation within each year by adding to the chart a shaded band whose amplitude is the standard deviation:
y_std = data.rolling('365D').std()
plt.plot(y_mean)
plt.fill_between(y_mean.index,
                 (y_mean - y_std).values.T[0],
                 (y_mean + y_std).values.T[0],
                 alpha=.5)
Warning: the visualization above assumes that the data within each year follow a normal distribution, which is not entirely true.
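If the normality assumption is a concern, one possible alternative (not the approach used in this post) is to shade the region between two rolling quantiles around the rolling median. The sketch below uses synthetic, skewed data as a stand-in for the real series; the column name, the random values and the 10%/90% levels are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the Google Trends series: 48 monthly values
# drawn from a skewed (gamma) distribution.
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
data = pd.DataFrame({"pancakes": rng.gamma(2.0, 10.0, size=48)}, index=idx)

# Rolling median and rolling 10th/90th percentiles over a 365-day window.
roll = data["pancakes"].rolling("365D")
y_median = roll.median()
y_low = roll.quantile(0.10)
y_high = roll.quantile(0.90)

# Shade the band between the two quantiles around the median.
fig, ax = plt.subplots()
ax.plot(y_median)
ax.fill_between(y_median.index, y_low, y_high, alpha=.5)
```

The band is no longer forced to be symmetric around the central line, so it can follow a skewed distribution more faithfully.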
One of the trickier aspects I've encountered is data that is not regularly sampled, which messes up the rolling statistics a little. I'd love to see your approach for solving that issue!
Hi! There's one thing to notice. In this post I specified the time window with an offset ('365D'), which means that all the data used to compute a given mean lie in a window of 365 days, not 365 samples. On the other hand, when working with data that is not regularly sampled, I successfully applied the following resampling: data.resample(offset).interpolate(). Of course, it may not work in some cases.
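A minimal sketch of that resampling idea, on a made-up irregular series (the dates and values are hypothetical). It assumes linear interpolation is acceptable for the gaps, which may not hold for every dataset.

```python
import pandas as pd

# Hypothetical irregular series: no observations on Jan 3 and Jan 4.
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-06"])
s = pd.Series([1.0, 2.0, 5.0, 6.0], index=idx)

# Resample to a regular daily grid and fill the missing days
# by linear interpolation between the neighbouring observations.
regular = s.resample("D").interpolate()

# Rolling mean over a trailing 3-day window, before and after resampling.
irregular_mean = s.rolling("3D").mean()
regular_mean = regular.rolling("3D").mean()

# On Jan 5 the irregular series has a single observation in the window,
# while the resampled series has three (Jan 3, Jan 4 and Jan 5).
print(irregular_mean.loc["2020-01-05"], regular_mean.loc["2020-01-05"])
```

After resampling, every window holds the same number of points, so the rolling statistics are comparable along the whole series. The price is that the filled values are interpolated rather than observed.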