Time series analysis is an essential concept with practical applications in our day-to-day lives. The skills you acquire from mastering time series analysis will put you in a good position to work as a data scientist, a financial analyst, or a data analyst. They will help you handle tasks such as estimating the risk of a stock portfolio, predicting property prices in real estate, and forecasting the performance of assets such as loans.
To illustrate time series concepts, we will use the Python programming language, which has libraries well suited to handling time series: NumPy, StatsModels, pmdarima, ARCH, matplotlib, and pandas.
Time Series Analysis in Python
The essential time series models include:
- autoregressive model (AR)
- moving-average model (MA)
- autoregressive-moving-average model (ARMA)
- autoregressive integrated moving average model (ARIMA)
- autoregressive integrated moving average model with exogenous variables (ARIMAX)
- seasonal autoregressive moving average model (SARMA)
- seasonal autoregressive integrated moving average model (SARIMA)
- seasonal autoregressive integrated moving average model with exogenous variables (SARIMAX)
- autoregressive conditional heteroscedasticity model (ARCH)
- generalized autoregressive conditional heteroscedasticity model (GARCH)
- vector autoregressive moving average model (VARMA)
A time series is a sequence of observations made and recorded at regular time intervals. The frequency of these observations can range from minutes, hours, days, and weeks to months, quarters, or even years. Time series analysis is the preparatory step required before forecasting the series, and establishing a forecast is the main reason we analyze a time series.
As indicated earlier, we will use Python in this article to help you analyze and understand the characteristics of a given time series. The aim is to build a solid understanding of the core concepts of the series so that you can create accurate forecasts.
Importing time series in Python
Data for a time series is mostly presented as a date and a measured value, usually stored in a spreadsheet format such as .csv.
The pandas library has a function called read_csv() that reads a time series in .csv format into a pandas DataFrame.
In the example below, we parse the date column as a date field by adding the argument parse_dates=['date'].
from dateutil.parser import parse
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.rcParams.update({'figure.figsize': (10, 7), 'figure.dpi': 120})
# Import the data as a DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df.head()
The other option is to use the date as the index by explicitly specifying index_col='date' in read_csv(). This produces a pandas Series instead of a DataFrame.
ser = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
ser.head()
Note that in the output, the 'value' column is placed above the date index, which indicates that it is a Series.
Panel data
Panel data is also a time-based dataset, but in addition to time, it contains one or more related variables measured over the same periods.
The columns of panel data contain explanatory variables that can help predict Y, provided those columns are available at the future forecasting period.
Below is an example of panel data:
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/MarketArrivals.csv')
df = df.loc[df.market=='LUCKNOW', :]
df.head()
Time Series Visualization
Visualization of the series is made possible by using matplotlib as follows:
# Time series data source: fpp package in R.
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Draw the plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

plot_df(df, x=df.index, y=df.value, title='Monthly anti-diabetic drug sales in Australia from 1992 to 2008.')
Since all values are positive, displaying them on both sides of the Y-axis helps to emphasize the growth.
# Import the dataset
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)

# Data preparation
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()

# Colors preparation
np.random.seed(100)
mycolors = np.random.choice(list(mpl.colors.XKCD_COLORS.keys()), len(years), replace=False)

# Drawing the plot
plt.figure(figsize=(16,12), dpi=80)
for i, y in enumerate(years):
    if i > 0:
        plt.plot('month', 'value', data=df.loc[df.year==y, :], color=mycolors[i], label=y)
        plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'value'][-1:].values[0], y, fontsize=12, color=mycolors[i])

# Decoration
plt.gca().set(xlim=(-0.3, 11), ylim=(2, 30), ylabel='$Drug Sales$', xlabel='$Month$')
plt.yticks(fontsize=12, alpha=.7)
plt.title("Drug Sales Time Series' Seasonal Plot", fontsize=20)
plt.show()
According to the data, drug sales fall almost every February, rise in March, fall again in April, and the same pattern clearly repeats each year.
Also, note the overall rise in drug sales over the years. Visualizing the trend and each year’s variation can be accompanied by a monthly boxplot visualization of the distributions.
Month-wise (seasonal) and year-wise (trend) distribution boxplot
Data can be grouped at seasonal intervals to visualize how the distributions compare within a given year or month.
# Import the data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)

# Data preparation
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()

# Draw the plots
fig, axes = plt.subplots(1, 2, figsize=(20,7), dpi=80)
sns.boxplot(x='year', y='value', data=df, ax=axes[0])
sns.boxplot(x='month', y='value', data=df.loc[~df.year.isin([1991, 2008]), :], ax=axes[1])

# Set the titles
axes[0].set_title('Year-wise Box Plot\n(Trend)', fontsize=18)
axes[1].set_title('Month-wise Box Plot\n(Seasonality)', fontsize=18)
plt.show()
Based on the plots, both the month-wise and year-wise distributions are clearly evident. In fact, both January and December show higher drug sales in the boxplots, which ties to the discounted holiday season.
Patterns in a time series
The four main components of a time series include:
- Base Level
- Trend
- Seasonality
- Error
Trend
A trend is a persistently increasing or decreasing slope observed in the time series.
Seasonality
Seasonality occurs when distinct patterns repeat regularly at certain intervals, attributed to seasonal factors. These patterns commonly manifest by the time of day, the day of the week, the day of the month, or the time of year.
However, a time series does not have to contain both components: it can have a trend, seasonality, both, or neither.
It is therefore common to think of a time series as a combination of the trend, the seasonality, and the error.
fig, axes = plt.subplots(1, 3, figsize=(20,4), dpi=100)
pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/guinearice.csv', parse_dates=['date'], index_col='date').plot(title='Trend Only', legend=False, ax=axes[0])
pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv', parse_dates=['date'], index_col='date').plot(title='Seasonality Only', legend=False, ax=axes[1])
pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/AirPassengers.csv', parse_dates=['date'], index_col='date').plot(title='Trend and Seasonality', legend=False, ax=axes[2])
plt.show()
Differentiating between a Seasonal and a Cyclic Pattern
Cyclic behavior occurs when the rise-and-fall pattern in the series does not repeat at fixed calendar-based intervals. These influences are typically business-related or otherwise socio-economic. Seasonality, on the other hand, is identified by distinct repetition at certain regular intervals, attributable to seasonal factors.
Multiplicative and Additive time series
The nature of the trend and seasonality determines whether a time series is multiplicative or additive: each observation in the series is expressed as either a product or a sum of the individual components.
Multiplicative Time Series:
value = Base Level * Trend * Seasonality * Error
Additive time series:
value = Base Level + Trend + Seasonality + Error
Decomposition of a time series into its components
Treating a time series as either an additive or a multiplicative combination of the base level, trend, seasonal index, and residual results in a classical decomposition of the series.
There is a convenient implementation of this in statsmodels which is found in seasonal_decompose.
from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse
import pandas as pd
import matplotlib.pyplot as plt

# Import the data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Multiplicative decomposition
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')

# Additive decomposition
result_add = seasonal_decompose(df['value'], model='additive', extrapolate_trend='freq')

# Plotting
plt.rcParams.update({'figure.figsize': (10,10)})
result_mul.plot().suptitle('Multiplicative Decompose', fontsize=22)
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()
Setting extrapolate_trend='freq' takes care of any missing values in the trend at the beginning of the series.
A closer look at the residuals of the additive decomposition shows some leftover pattern, whereas the residuals of the multiplicative decomposition look quite random. In such a scenario, the multiplicative decomposition is the preferred option for this time series.
result_mul stores the seasonal, trend, and residual components as numerical output. In the example below, we show how to extract them and put them in a dataframe.
# Extract the components
# Actual values = Product of (Seasonal * Trend * Resid)
df_reconstructed = pd.concat([result_mul.seasonal, result_mul.trend, result_mul.resid, result_mul.observed], axis=1)
df_reconstructed.columns = ['season', 'trend', 'residual', 'actual_values']
df_reconstructed.head()
If you check, the product of the season, trend, and residual columns should equal the actual_values column.
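As a quick sanity check, the minimal sketch below (assuming the result_mul object from the decomposition above) verifies that the product of the components reconstructs the observed values:

import numpy as np

# The product of the extracted components should reproduce the observed series
print(np.allclose(result_mul.seasonal * result_mul.trend * result_mul.resid,
                  result_mul.observed))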
Stationary and Non-Stationary Time Series
If the values of a series are not a function of time, we are dealing with a stationary series. Stationarity is thus a property of a time series.
In essence, the statistical properties of the series, such as the mean, variance, and autocorrelation, are constant over time.
In this case, autocorrelation is simply the correlation of the series with its previous values.
A stationary time series is also devoid of seasonal effects.
Any time series can be made stationary through relevant transformation. Coincidentally, most of the forecasting methods are innately designed to work on a stationary time series.
In fact, in forecasting, the initial step entails transformation that seeks to convert a non-stationary series to a stationary one.
Making a time series Stationary
There are various ways to make a series stationary. Below are some proven techniques:
- Differencing the series
- Taking the log of the series
- Taking the nth root of the series
- A combination of the above
The most obvious and common way to stationarize a series is to difference it at least once until it becomes stationary.
In differencing, the previous value is subtracted from the current value. If the first difference does not make the series stationary, you can take the second (or a higher-order) difference.
Let's consider the series [1, 5, 2, 12, 20].
The first differencing is as follows [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8]
The second differencing gives: [-3-4, 10-(-3), 8-10] = [-7, 13, -2]
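The same arithmetic can be reproduced with NumPy's diff function. Below is a minimal sketch (not part of the original code) using the toy series above:

import numpy as np

series = np.array([1, 5, 2, 12, 20])
first_diff = np.diff(series)        # [ 4, -3, 10,  8]
second_diff = np.diff(series, n=2)  # [-7, 13, -2]
print(first_diff, second_diff)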
Is there a reason to make a non-stationary series stationary before forecasting?
Forecasts from a stationary time series are usually more reliable and easy as well.
Linear regression works best when the predictors (the X variables) are not correlated with each other. Stationarizing the series removes any persistent autocorrelation, thereby making the predictors, the lags of the series used in the forecasting models, almost independent.
This solves the problem, because autoregressive forecasting models are essentially linear regression models that use the lag(s) of the series as predictors.
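To make this concrete, here is a minimal sketch (an illustration, not part of the original article) that fits a small autoregressive model with statsmodels' AutoReg on the first-differenced drug sales series; the fitted parameters are simply linear regression coefficients on the lags:

import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date'], index_col='date')
diffed = df['value'].diff().dropna()      # first difference to reduce non-stationarity
ar_model = AutoReg(diffed, lags=3).fit()  # regress Y_t on Y_{t-1}, Y_{t-2}, Y_{t-3}
print(ar_model.params)                    # an intercept plus one coefficient per lag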
How do you check if a given series is stationary or not?
Here, it is important to examine the plot of the series. In addition, statistical tests called 'Unit Root Tests' can quantitatively determine whether the given series is stationary. They come in several variations, and they all check whether a time series is non-stationary and possesses a unit root.
The common implementations of a Unit Root test include:
- Phillips-Perron test (PP Test)
- Augmented Dickey-Fuller test (ADF Test)
- Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (for trend stationarity)
Among the three presented here, the ADF test is the most common. Its null hypothesis is that the time series possesses a unit root and is not stationary. The null hypothesis is rejected when the P-Value in the ADF test is smaller than the significance level of 0.05.
Trend stationarity is tested using the KPSS test, where both the null hypothesis and the P-Value interpretation are the opposite of the ADF test. Both tests are implemented in the statsmodels package.
from statsmodels.tsa.stattools import adfuller, kpss
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])

# The ADF test
result = adfuller(df.value.values, autolag='AIC')
print(f'The ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
for key, value in result[4].items():
    print('Critical Values:')
    print(f'   {key}, {value}')

# The KPSS test
result = kpss(df.value.values, regression='c')
print('\nThe KPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])
for key, value in result[3].items():
    print('The Critical Values:')
    print(f'   {key}, {value}')
The output displays the test statistic, p-value, and critical values for each test.
Difference between white noise and a stationary series
White noise is similar to a stationary series in that it is not a function of time; its mean and variance do not change with time. The differentiating factor is that white noise is completely random with a mean of zero.
Also, there is no pattern at all in white noise.
Consider an FM radio signal as a time series: the blank sound you hear between channels is white noise. Mathematically, a sequence of completely random numbers with a mean of zero is white noise.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

randvals = np.random.randn(1000)
pd.Series(randvals).plot(title='Random White Noise', color='k')
plt.show()
Detrending a time series
In a time series, the process of removing the trend component is called detrending. We will examine various approaches below on how to extract the trend from the given time series. These include:
- Subtracting the line of best fit from the series. The best fit line can be obtained from a linear regression model with the time steps as the predictor. For more complex trends, you may want to include quadratic terms (x^2) in the model.
- Subtracting the trend component obtained from a time series decomposition.
- Subtracting the mean.
- Applying a filter such as the Baxter-King filter or the Hodrick-Prescott filter to remove the moving average trend lines or the cyclical components.
Below is an implementation of the first two methods.
# Subtract the line of best fit using scipy
from scipy import signal
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
detrended = signal.detrend(df.value.values)
plt.plot(detrended)
plt.title('Drug sales detrended by subtracting the least squares fit', fontsize=16)
plt.show()
# Subtract the trend component using statsmodels
from statsmodels.tsa.seasonal import seasonal_decompose
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')
detrended = df.value.values - result_mul.trend
plt.plot(detrended)
plt.title('Drug sales detrended by subtracting the trend component', fontsize=16)
plt.show()
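The filter-based approach mentioned in the list above can be sketched with the Hodrick-Prescott filter from statsmodels; note that the lamb value below is only an illustrative choice:

# Subtract the trend obtained from the Hodrick-Prescott filter (a sketch)
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.filters.hp_filter import hpfilter

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date'], index_col='date')
cycle, trend = hpfilter(df['value'], lamb=1600)  # returns the cyclical and trend components
detrended = df['value'] - trend
detrended.plot(title='Drug sales detrended with the Hodrick-Prescott filter')
plt.show()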
Deseasonalizing a time series
Deseasonalizing a time series can be approached in several ways; we will look at some below.
1. Take a moving average with a length equal to the seasonal window. This smoothens the series in the process.
2. Seasonal differencing: subtract the value of the previous season from the current value.
3. Divide the series by the seasonal index obtained from an STL decomposition.
If dividing by the seasonal index does not work well, try taking the log of the series, deseasonalize it, and then restore the original scale by taking the exponential.
# Subtracting the seasonal component
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Time series decomposition
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')

# Deseasonalize: divide by the seasonal index
deseasonalized = df.value.values / result_mul.seasonal

# Plot
plt.plot(deseasonalized)
plt.title('Deseasonalized Drug Sales', fontsize=16)
plt.show()
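The log-based variant described above can be sketched as follows (a sketch, assuming the a10 data and statsmodels): deseasonalize on the log scale, then exponentiate to restore the original scale.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date'], index_col='date')
log_value = np.log(df['value'])
result_log = seasonal_decompose(log_value, model='additive', extrapolate_trend='freq')
deseasonalized = np.exp(log_value - result_log.seasonal)  # remove seasonality, restore the scale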
Testing time series’ seasonality
The most obvious approach to ascertain if a given time series has seasonality is plotting the series and finding patterns of repetition at fixed intervals.
Seasonality is usually determined by the clock or the calendar, for example:
1. Hour of the day
2. Day of the month
3. Weekly
4. Monthly
5. Yearly
If you want a definitive inspection of seasonality, use the autocorrelation function (ACF) plot. It reveals repeated spikes at multiples of the seasonal window.
In most practical datasets, strong patterns of repetition are hard to notice because of noise distortion, so lots of care should be taken to capture such patterns.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')

# Draw the plot
plt.rcParams.update({'figure.figsize':(9,5), 'figure.dpi':120})
autocorrelation_plot(df.value.tolist())
plt.title('Definitive inspection of the seasonality', fontsize=16)
plt.show()
The other alternative is the CHTest, a statistical test that can determine how much seasonal differencing is needed to stationarize the series.
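The sketch below (assuming pmdarima is installed) uses nsdiffs with the CH test to estimate the number of seasonal differences needed; the choice of m=12 reflects the monthly seasonality of the drug sales data:

import pandas as pd
from pmdarima.arima import nsdiffs

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date'], index_col='date')
D = nsdiffs(df['value'], m=12, test='ch')  # CH test for the seasonal differencing term
print('Estimated seasonal differencing term D =', D)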
Treating missing values in a time series
In some cases, the time series will have missing dates and times, which means the data for those periods was either missing or simply not captured. If the missing values are actually zero, they are easy to fill in as zero.
Whatever you do, do not replace the missing values in a time series with the mean of the series, especially if the series is not stationary.
The quickest and dirtiest workaround in such a scenario is to forward-fill the previous value.
This, however, depends on the nature of the series you are dealing with. It is best to try several approaches before reaching a conclusion. Some methods that have proven useful in the past include:
- Mean of the nearest neighbor
- Quadratic interpolation
- Mean of seasonal counterparts
- Linear interpolation
- Backward fill
To measure imputation performance, manually introduce missing values into the time series, impute the series with the methods listed above, and measure the mean squared error of the imputed values against the actual values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from sklearn.metrics import mean_squared_error

# Dataset generation
df_orig = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date').head(100)
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10_missings.csv', parse_dates=['date'], index_col='date')

fig, axes = plt.subplots(7, 1, sharex=True, figsize=(10, 12))
plt.rcParams.update({'xtick.bottom': False})

# Actual
df_orig.plot(title='Actual', ax=axes[0], label='Actual', color='red', style=".-")
df.plot(title='Actual', ax=axes[0], label='Actual', color='green', style=".-")
axes[0].legend(["Missing Data", "Available Data"])

# Forward fill
df_ffill = df.ffill()
error = np.round(mean_squared_error(df_orig['value'], df_ffill['value']), 2)
df_ffill['value'].plot(title='Forward Fill (MSE: ' + str(error) + ")", ax=axes[1], label='Forward Fill', style=".-")

# Backward fill
df_bfill = df.bfill()
error = np.round(mean_squared_error(df_orig['value'], df_bfill['value']), 2)
df_bfill['value'].plot(title="Backward Fill (MSE: " + str(error) + ")", ax=axes[2], label='Back Fill', color='firebrick', style=".-")

# Linear interpolation
df['rownum'] = np.arange(df.shape[0])
df_nona = df.dropna(subset=['value'])
f = interp1d(df_nona['rownum'], df_nona['value'])
df['linear_fill'] = f(df['rownum'])
error = np.round(mean_squared_error(df_orig['value'], df['linear_fill']), 2)
df['linear_fill'].plot(title="Linear Fill (MSE: " + str(error) + ")", ax=axes[3], label='Linear Fill', color='brown', style=".-")

# Cubic interpolation
f2 = interp1d(df_nona['rownum'], df_nona['value'], kind='cubic')
df['cubic_fill'] = f2(df['rownum'])
error = np.round(mean_squared_error(df_orig['value'], df['cubic_fill']), 2)
df['cubic_fill'].plot(title="Cubic Fill (MSE: " + str(error) + ")", ax=axes[4], label='Cubic Fill', color='red', style=".-")

# Mean of 'n' nearest past neighbors
def knn_mean(ts, n):
    out = np.copy(ts)
    for i, val in enumerate(ts):
        if np.isnan(val):
            n_by_2 = np.ceil(n/2)
            lower = np.max([0, int(i-n_by_2)])
            upper = np.min([len(ts)+1, int(i+n_by_2)])
            ts_near = np.concatenate([ts[lower:i], ts[i:upper]])
            out[i] = np.nanmean(ts_near)
    return out

df['knn_mean'] = knn_mean(df.value.values, 8)
error = np.round(mean_squared_error(df_orig['value'], df['knn_mean']), 2)
df['knn_mean'].plot(title="KNN Mean (MSE: " + str(error) + ")", ax=axes[5], label='KNN Mean', color='tomato', alpha=0.5, style=".-")

# Seasonal mean
def seasonal_mean(ts, n, lr=0.7):
    """
    Compute the mean of corresponding seasonal periods
    ts: 1D array-like of the time series
    n: Seasonal window length of the time series
    """
    out = np.copy(ts)
    for i, val in enumerate(ts):
        if np.isnan(val):
            ts_seas = ts[i-1::-n]  # previous seasons only
            if np.isnan(np.nanmean(ts_seas)):
                ts_seas = np.concatenate([ts[i-1::-n], ts[i::n]])  # previous and forward
            out[i] = np.nanmean(ts_seas) * lr
    return out

df['seasonal_mean'] = seasonal_mean(df.value, n=12, lr=1.25)
error = np.round(mean_squared_error(df_orig['value'], df['seasonal_mean']), 2)
df['seasonal_mean'].plot(title="Seasonal Mean (MSE: " + str(error) + ")", ax=axes[6], label='Seasonal Mean', color='blue', alpha=0.5, style=".-")

plt.show()
Depending on the accuracy level you wish to attain, it is worthwhile to consider other approaches. These include:
- Using a prediction model such as random forest or k-nearest neighbors when explanatory variables are available
- Forecasting the missing values if you have enough past observations
- Backcasting the missing values if you have enough future observations
- Using the counterparts from previous cycles as the forecast
Autocorrelation and partial autocorrelation functions
In a highly correlated series, the current values can easily be predicted using the previous values (the lags) of the series. Autocorrelation is the correlation of a series with its own lags.
Partial autocorrelation conveys similar information, but it conveys the pure correlation of a series with a given lag, excluding the correlation contributions from the intermediate lags.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')

# Calculate the ACF and PACF up to 50 lags
# acf_50 = acf(df.value, nlags=50)
# pacf_50 = pacf(df.value, nlags=50)

# Draw the plots
fig, axes = plt.subplots(1, 2, figsize=(16,3), dpi=100)
plot_acf(df.value.tolist(), lags=50, ax=axes[0])
plot_pacf(df.value.tolist(), lags=50, ax=axes[1])
plt.show()
Computing the partial autocorrelation function
The autoregression equation of Y is a linear regression of Y on its own lags as predictors. The partial autocorrelation at a particular lag is the coefficient of that lag in the autoregression equation of Y.
For instance, if X_t is the series, the partial autocorrelation of lag 3 is the coefficient $\alpha_3$ of $X_{t-3}$ in the following autoregression equation:
$X_t = \alpha_0 + \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \alpha_3 X_{t-3} + \epsilon_t$
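The sketch below (an illustration, not from the original article) verifies this definition: the coefficient of X_t-3 in an ordinary least squares regression of X_t on its first three lags should closely match the lag-3 partial autocorrelation reported by statsmodels:

import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import pacf

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')
lagged = pd.concat([df['value'].shift(i) for i in range(4)], axis=1).dropna()
lagged.columns = ['X_t', 'X_t-1', 'X_t-2', 'X_t-3']
ols = sm.OLS(lagged['X_t'], sm.add_constant(lagged[['X_t-1', 'X_t-2', 'X_t-3']])).fit()
print(ols.params['X_t-3'])                          # the coefficient alpha_3
print(pacf(df['value'], nlags=3, method='ols')[3])  # partial autocorrelation at lag 3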
Lag Plots
A lag plot is a scatter plot of a time series against its own lag, typically used to check for autocorrelation. If a pattern exists in the plot, the series is autocorrelated; if no pattern exists, the series is likely random white noise.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot
plt.rcParams.update({'ytick.left': False, 'axes.titlepad': 10})

# Import the data
ss = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv')
a10 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')

# Plot
fig, axes = plt.subplots(1, 4, figsize=(10,3), sharex=True, sharey=True, dpi=100)
for i, ax in enumerate(axes.flatten()[:4]):
    lag_plot(ss.value, lag=i+1, ax=ax, c='firebrick')
    ax.set_title('Lag ' + str(i+1))
fig.suptitle("Sun Spots Area Lag Plots\n(with an increasing lag, points get wide and scattered -> correlation is lesser)\n", y=1.15)

fig, axes = plt.subplots(1, 4, figsize=(10,3), sharex=True, sharey=True, dpi=100)
for i, ax in enumerate(axes.flatten()[:4]):
    lag_plot(a10.value, lag=i+1, ax=ax, c='firebrick')
    ax.set_title('Lag ' + str(i+1))
fig.suptitle("Drug Sales Lag Plots", y=1.05)
plt.show()
In the example above, for the sunspot area time series, the points become more and more scattered as the lag increases.
Estimating the forecastability of a time series
Forecasting a time series is easier when its patterns are regular and repeatable. The approximate entropy quantifies the regularity and unpredictability of fluctuations in a time series.
The higher the approximate entropy, the harder it is to forecast the series.
The other alternative is the sample entropy, which is similar to the approximate entropy but more consistent in estimating complexity, even for smaller time series. For example, a random time series with fewer data points can have a lower approximate entropy than a more regular time series, whereas a longer random time series will have a higher approximate entropy.
Sample entropy handles this problem nicely.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ss = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv')
a10 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')
rand_small = np.random.randint(0, 100, size=36)
rand_big = np.random.randint(0, 100, size=136)

def ApEn(U, m, r):
    """Compute the approximate entropy"""
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m+1) - _phi(m))

print(ApEn(ss.value, m=2, r=0.2*np.std(ss.value)))
print(ApEn(a10.value, m=2, r=0.2*np.std(a10.value)))
print(ApEn(rand_small, m=2, r=0.2*np.std(rand_small)))
print(ApEn(rand_big, m=2, r=0.2*np.std(rand_big)))
import numpy as np
import pandas as pd

ss = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv')
a10 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')
rand_small = np.random.randint(0, 100, size=36)
rand_big = np.random.randint(0, 100, size=136)

def SampEn(U, m, r):
    """Compute the sample entropy"""
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for j in range(len(x)) if i != j and _maxdist(x[i], x[j]) <= r]) for i in range(len(x))]
        return sum(C)

    N = len(U)
    return -np.log(_phi(m+1) / _phi(m))

print(SampEn(ss.value, m=2, r=0.2*np.std(ss.value)))
print(SampEn(a10.value, m=2, r=0.2*np.std(a10.value)))
print(SampEn(rand_small, m=2, r=0.2*np.std(rand_small)))
print(SampEn(rand_big, m=2, r=0.2*np.std(rand_big)))
Smoothening a time series
Smoothening a time series is essential when you want to minimize the effect of noise in a signal and get a fair approximation of the noise-filtered series. The smoothed version can also serve as a feature for explaining the original series, and it gives a better visualization of the underlying trend.
There are various ways to smoothen a series, including:
- Taking a moving average
- Applying LOESS smoothing (localized regression)
- Applying LOWESS smoothing (locally weighted regression)
A moving average is simply the average of a rolling window of defined width. The window width must be chosen carefully, because a large window over-smoothens the series, and a window size equal to the seasonal duration nullifies the seasonal effect.
LOESS, short for 'LOcalized regrESSion', fits multiple regressions in the local neighborhood of each point.
It is implemented in the statsmodels package, where the frac argument controls the degree of smoothing by specifying the percentage of nearby data points considered when fitting each regression model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess
plt.rcParams.update({'xtick.bottom': False, 'axes.titlepad': 5})

# Import the data
df_orig = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/elecequip.csv', parse_dates=['date'], index_col='date')

# 1. The moving average
df_ma = df_orig.value.rolling(3, center=True, closed='both').mean()

# 2. Loess smoothing at 5% and 15%
df_loess_5 = pd.DataFrame(lowess(df_orig.value, np.arange(len(df_orig.value)), frac=0.05)[:, 1], index=df_orig.index, columns=['value'])
df_loess_15 = pd.DataFrame(lowess(df_orig.value, np.arange(len(df_orig.value)), frac=0.15)[:, 1], index=df_orig.index, columns=['value'])

# Plotting
fig, axes = plt.subplots(4, 1, figsize=(7, 7), sharex=True, dpi=120)
df_orig['value'].plot(ax=axes[0], color='k', title='The Original Series')
df_loess_5['value'].plot(ax=axes[1], title='Loess Smoothed at 5%')
df_loess_15['value'].plot(ax=axes[2], title='Loess Smoothed at 15%')
df_ma.plot(ax=axes[3], title='Moving Average (3)')
fig.suptitle('Time Series Smoothening', y=0.95, fontsize=14)
plt.show()
Using Granger Causality to test if one time series can forecast another
Granger causality should never be used to test whether a lag of Y causes Y; it is applied only to exogenous variables, not to the lags of Y itself.
The Granger causality test is based on the idea that if X causes Y, then a forecast of Y based on the previous values of Y together with the previous values of X should outperform a forecast of Y based on the previous values of Y alone.
The Granger causality test is implemented in the statsmodels package.
It accepts a 2D array with two columns as its main argument: the values are in the first column and the predictor (X) is in the second.
The null hypothesis is that the series in the second column does not Granger-cause the series in the first column. If the P-Values are less than the significance level of 0.05, the null hypothesis is rejected, and you conclude that the said lag of X is useful.
The second argument, maxlag, indicates how many lags of Y to include in the test.
from statsmodels.tsa.stattools import grangercausalitytests
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['month'] = df.date.dt.month
grangercausalitytests(df[['value', 'month']], maxlag=2)
In the example above, the P-Values are zero for all tests, so the 'month' can indeed be used to forecast the values of the series.
Summary
This article has covered the basics of time series analysis in Python and laid a solid foundation for understanding time series as a whole. With this foundation, you will be able to forecast with uncertainty bounds, perform time series forecasting with external data sources, and carry out anomaly detection. We have also surveyed the essential time series models, chief among them the autoregressive and moving average models.
Remember that time series data is found in many domains, and data scientists increasingly need to work with it effectively. The most obvious time series applications are in finance, such as forecasting stock prices, along with climate and weather data, retail sales, and economic estimation.