Time Series Analysis in Python

Time series analysis is an essential skill with practical applications in day-to-day life. Mastering it positions you well to work as a data scientist, financial analyst, or data analyst, and equips you to handle problems such as estimating the risk of a stock portfolio, predicting real estate prices, and forecasting the performance of loans.

In illustrating time series concepts, we will use the Python programming language, which has mature libraries well suited to handling time series. These include NumPy, pandas, matplotlib, StatsModels, pmdarima, and ARCH.


The essential time series models include:

  • autoregressive model (AR)
  • moving-average model (MA)
  • autoregressive-moving-average model (ARMA)
  • autoregressive integrated moving average model (ARIMA)
  • autoregressive integrated moving average model with exogenous variables (ARIMAX)
  • seasonal autoregressive moving average model (SARMA)
  • seasonal autoregressive integrated moving average model (SARIMA)
  • seasonal autoregressive integrated moving average model with exogenous variables (SARIMAX)
  • autoregressive conditional heteroscedasticity model (ARCH)
  • generalized autoregressive conditional heteroscedasticity model (GARCH)
  • vector autoregressive moving average model (VARMA)

When observations are made and recorded at regular time intervals, the resulting sequence is a time series. The frequency of these observations can be minutely, hourly, daily, weekly, monthly, quarterly, or yearly. Time series analysis is a preparatory step for forecasting the series, and that is the main reason we analyze a time series.

As indicated earlier, we will use Python in this article to help you analyze and understand the characteristics of a given time series. The aim is to build a solid understanding of the core concepts of the series so you can produce accurate forecasts from it.

Importing time series in Python

Time series data is usually recorded as a date and a measured value, often in spreadsheet formats such as .csv.

The pandas library has a function called read_csv() for reading a time series in .csv format into a pandas DataFrame.

In the example below, we parse the date column as a date field by adding the argument parse_dates=['date'].

from dateutil.parser import parse
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.rcParams.update({'figure.figsize': (10, 7), 'figure.dpi': 120})

# Import the data as a DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df.head()

 

[Figure: the date column parsed as a date field]


The other option is to use the date as the index, by explicitly passing index_col='date' to read_csv(). This imports the data as a pandas Series with the date as its index.

ser = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
ser.head()

Further, note that in the printed output the 'value' label sits above the date index, indicating that this is a Series rather than a DataFrame.

Panel data

Panel data is also a time-based dataset. The difference is that, in addition to time, it contains one or more related variables measured for the same time periods.

The columns of panel data contain explanatory variables that can be useful in predicting Y, provided those columns are also available at the future forecasting period.

Below is an example of panel data:

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/MarketArrivals.csv')
df = df.loc[df.market=='LUCKNOW', :]
df.head()
[Figure: example of panel data]

Time Series Visualization

Visualization of the series is made possible by using matplotlib as follows:

# Time series data source: fpp package in R.
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Draw the Plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

plot_df(df, x=df.index, y=df.value, title='Monthly anti-diabetic drug sales in Australia from 1992 to 2008.')
[Figure: time series visualization using matplotlib]

Since all the values are positive, you can plot the series on both sides of the Y-axis to emphasize the growth.
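A minimal sketch of such a two-sided area chart, assuming the same a10 dataset used above, mirrors the series around zero with matplotlib's fill_between:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
x = df.index
y = df['value'].values

# Fill the area between the series and its mirror image to emphasize growth
fig, ax = plt.subplots(figsize=(16, 5), dpi=100)
ax.fill_between(x, y1=y, y2=-y, alpha=0.5, linewidth=2, color='seagreen')
ax.set(title='Drug Sales (Two-Sided View)', ylim=(-35, 35))
plt.show()

The next code block draws a seasonal plot of the same data, one line per year, to expose the within-year pattern.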

# Import the dataset
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)

#data preparation
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()

#Colors  prep
np.random.seed(100)
mycolors = np.random.choice(list(mpl.colors.XKCD_COLORS.keys()), len(years), replace=False)

# Drawing the Plot
plt.figure(figsize=(16,12), dpi= 80)
for i, y in enumerate(years):
    if i > 0:        
        plt.plot('month', 'value', data=df.loc[df.year==y, :], color=mycolors[i], label=y)
        plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'value'][-1:].values[0], y, fontsize=12, color=mycolors[i])

# Decor
plt.gca().set(xlim=(-0.3, 11), ylim=(2, 30), ylabel='$Drug Sales$', xlabel='$Month$')
plt.yticks(fontsize=12, alpha=.7)
plt.title("Drug Sales Time Series' Seasonal Plot ", fontsize=20)
plt.show()
[Figure: time series seasonal plot]

According to the data, drug sales fall almost every February, rise in March, and fall again in April, with the same pattern clearly repeating each year.

Also, note the overall rise in drug sales over the years. You can visualize this trend and how it varies each year with a boxplot of the month-wise and year-wise distributions.

Month-wise (seasonal) and year-wise (trend) distribution boxplot

Data can be grouped at seasonal intervals to visualize how the values are distributed within a given year or month, and how that distribution compares over time.

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)

# data  preparation
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()

# plot drawing
fig, axes = plt.subplots(1, 2, figsize=(20,7), dpi= 80)
sns.boxplot(x='year', y='value', data=df, ax=axes[0])
sns.boxplot(x='month', y='value', data=df.loc[~df.year.isin([1991, 2008]), :], ax=axes[1])

# Set Titles
axes[0].set_title('Year-wise Box Plot\n(Trend)', fontsize=18)
axes[1].set_title('Month-wise Box Plot\n(Seasonality)', fontsize=18)
plt.show()
[Figure: month-wise (seasonal) and year-wise (trend) distribution boxplots]

Both the year-wise (trend) and month-wise (seasonal) distributions are plainly evident in this representation. Examining the boxplots, December and January have noticeably higher drug sales, which ties in with the discounted holiday season.

Patterns in a time series

The four main components of a time series are:

  • Base Level
  • Trend
  • Seasonality
  • Error

Trend
A trend is present when there is an increasing or decreasing slope in the time series.

Seasonality

Seasonality occurs when distinct patterns repeat at regular, fixed intervals, attributable to seasonal factors. These intervals most often manifest as the hour of the day, the day of the week, the day of the month, or the time of year.

Note, however, that a time series may have a trend, seasonality, both, or neither.

It is therefore common to think of a time series as a combination of the trend, the seasonality, and the error.

fig, axes = plt.subplots(1,3, figsize=(20,4), dpi=100)
pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/guinearice.csv', parse_dates=['date'], index_col='date').plot(title='Trend Only', legend=False, ax=axes[0])

pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv', parse_dates=['date'], index_col='date').plot(title='Seasonality Only', legend=False, ax=axes[1])

pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/AirPassengers.csv', parse_dates=['date'], index_col='date').plot(title='Trend and Seasonality', legend=False, ax=axes[2])
plt.show()
[Figure: patterns in a time series]

Differentiating between a Seasonal and a Cyclic Pattern

Cyclic behavior occurs when the rise-and-fall pattern in the time series does not repeat at fixed calendar-based intervals; such cycles are typically driven by business and other socio-economic factors. A seasonal pattern, by contrast, repeats at distinct, regular intervals attributable to seasonal factors.

Multiplicative and additive time series

The nature of the trend and seasonality determines whether a time series is multiplicative or additive: every observation in the series is expressed as either a product or a sum of the individual components.

Multiplicative Time Series:

value = Base Level * Trend * Seasonality * Error

Additive time series:

value = Base Level + Trend + Seasonality + Error

Decomposition of a time series into its components

Considering a time series as either an additive or a multiplicative combination of the base level, trend, seasonal index, and residual yields the classical decomposition of that series.

statsmodels provides a convenient implementation of this in seasonal_decompose().

from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse
import pandas as pd
import matplotlib.pyplot as plt


# Import the Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Multiplicative Decomposition
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')

# Additive Decomposition
result_add = seasonal_decompose(df['value'], model='additive', extrapolate_trend='freq')

# Plotting
plt.rcParams.update({'figure.figsize': (10,10)})
result_mul.plot().suptitle('Multiplicative Decompose', fontsize=22)
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

 

[Figure: multiplicative decomposition]
[Figure: additive decomposition]

Setting extrapolate_trend='freq' takes care of any missing values in the trend and residuals at the beginning of the series.

A closer look at the residuals of the additive decomposition shows leftover patterns, whereas the residuals of the multiplicative decomposition look quite random. For this time series, the multiplicative decomposition is therefore the preferred option.

result_mul stores the numerical output of the trend, seasonal, and residual components. In the example below, we show how to extract them and put them in a dataframe.

# Extract the Components
# Actual Values = Product of (Seasonal * Trend * Resid)
# Actual Values = Product of (Seasonal * Trend * Resid)
df_reconstructed = pd.concat([result_mul.seasonal, result_mul.trend, result_mul.resid, result_mul.observed], axis=1)
df_reconstructed.columns = ['season', 'trend', 'residual', 'actual_values']
df_reconstructed.head()
[Figure: components extracted into a dataframe]

If you check, the product of the season, trend, and residual columns should exactly equal the actual_values.

Stationary and Non-Stationary Time Series

A time series is stationary when its values are not a function of time; stationarity, in other words, is a property of the series itself.

In essence, statistical properties of the series such as the mean, variance, and autocorrelation are constant over time.

Autocorrelation here simply means the correlation of the series with its own previous values.

A stationary time series is, by and large, also devoid of seasonal effects.

Any time series can be made stationary through a suitable transformation. Conveniently, most forecasting methods are innately designed to work on stationary time series.

In fact, the initial step in forecasting typically entails a transformation that converts a non-stationary series into a stationary one.
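As an informal illustration of this definition, assuming the same a10 dataset, you can plot rolling statistics: a series whose rolling mean or standard deviation drifts over time is unlikely to be stationary. A minimal sketch (a visual heuristic only, not a substitute for the unit root tests discussed below):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# 12-month rolling mean and standard deviation
rolling_mean = df['value'].rolling(window=12).mean()
rolling_std = df['value'].rolling(window=12).std()

plt.plot(df['value'], color='k', label='Original')
plt.plot(rolling_mean, color='tab:red', label='Rolling Mean (12)')
plt.plot(rolling_std, color='tab:blue', label='Rolling Std (12)')
plt.legend()
plt.title('Rolling Statistics as an Informal Stationarity Check', fontsize=16)
plt.show()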

Making a time series Stationary

There are various ways to make a series stationary. Below are some proven techniques:

  • Differencing the series
  • Taking the log of the series
  • Taking the nth root of the series
  • A combination of the above

The most obvious and common way to stationarize a series is to difference it, at least once, until it becomes stationary.

In differencing, the previous value is subtracted from the current value. If the first difference does not make the series stationary, you can go for the second (or higher-order) difference.

Let's consider the series [1, 5, 2, 12, 20].

The first difference is [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8].

The second difference is [-3-4, 10-(-3), 8-10] = [-7, 13, -2].
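A quick way to verify these numbers is NumPy's diff function (pandas' Series.diff works equally well); a minimal sketch with the same toy series:

import numpy as np

series = np.array([1, 5, 2, 12, 20])

# First difference: each value minus its predecessor
print(np.diff(series))        # [ 4 -3 10  8]

# Second difference: the difference of the first difference
print(np.diff(series, n=2))   # [-7 13 -2]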

Is there a reason to make a non-stationary series stationary before forecasting?

Forecasts from a stationary time series are easier to produce and usually more reliable.

Linear regression works best when the predictors (the X variables) are uncorrelated with one another. Stationarizing the series removes any persistent autocorrelation, which makes the predictors (the lags of the series) in forecasting models nearly independent.

This matters because autoregressive forecasting models are essentially linear regression models that use the lag(s) of the given series as predictors.
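To make that concrete, here is a minimal sketch, assuming the a10 dataset, using statsmodels' AutoReg, which fits exactly this kind of lag-based linear regression (the series is differenced once first to roughly stationarize it):

import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Difference once to roughly stationarize, then regress on the first 3 lags
diffed = df['value'].diff().dropna()
model = AutoReg(diffed, lags=3).fit()
print(model.params)  # intercept plus one coefficient per lag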

How do you check if a given series is stationary or not?

The first step is to examine a plot of the series. Beyond that, statistical tests called 'Unit Root Tests' can quantitatively determine whether the given series is stationary. These tests come in several variations, all of which check whether the time series possesses a unit root and is hence non-stationary.

The common implementations of a Unit Root test include:

  • Phillips-Perron test (PP Test)
  • Augmented Dickey-Fuller test (ADF Test)
  • Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test for trend stationarity

Among the three presented here, the ADF test is the most common. Its null hypothesis is that the time series has a unit root and is not stationary. The null hypothesis is rejected when the P-Value in the ADF test is smaller than the significance level of 0.05.

Trend stationarity is tested using the KPSS test. Here, both the null hypothesis and the P-Value interpretation are the opposite of the ADF test. Both tests are implemented in the statsmodels package.

from statsmodels.tsa.stattools import adfuller, kpss
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])

# The ADF Test
result = adfuller(df.value.values, autolag='AIC')
print(f'The ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
print('Critical Values:')
for key, value in result[4].items():
    print(f'   {key}, {value}')

# The KPSS Test
result = kpss(df.value.values, regression='c')
print('\nThe KPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('The Critical Values:')
for key, value in result[3].items():
    print(f'   {key}, {value}')

The output

[Figure: ADF and KPSS test output]

Difference between white noise and a stationary series

White noise is similar to a stationary series in that it is not a function of time: its mean and variance do not change over time. The differentiating factor is that white noise is completely random with a mean of 0.

Also, there is no pattern at all in white noise.

Consider an FM radio signal as a time series: the static you hear between channels is white noise. Mathematically, any sequence of completely random numbers with a mean of zero is white noise.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

randvals = np.random.randn(1000)
pd.Series(randvals).plot(title='Random White Noise', color='k')
plt.show()
[Figure: random white noise]

Detrending a time series

Removing the trend component from a time series is called detrending. There are several approaches to extracting the trend from a given time series. These include:

  • Subtraction of the line of best fit from the series. The best-fit line can be obtained from a linear regression model with the time steps as the predictor. For more complex trends, you may want to add quadratic terms (x^2) to the model.
  • Subtraction of the trend component obtained from a time series decomposition.
  • Subtraction of the mean.
  • Application of a filter: for instance, the Baxter-King filter removes cyclic components, and the Hodrick-Prescott filter removes moving-average trend lines.

Below is an implementation of the first two methods.

# Subtraction of the line of best fit with scipy

from scipy import signal
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
detrended = signal.detrend(df.value.values)
plt.plot(detrended)
plt.title('Drug sales were detrended by removing the least squares fit.', fontsize=16)
plt.show()
[Figure: drug sales detrended by removing the least squares fit]
# Subtraction of the Trend Component Using statsmodels

from statsmodels.tsa.seasonal import seasonal_decompose
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')
detrended = df.value.values - result_mul.trend
plt.plot(detrended)
plt.title('Drug sales detrended by subtracting the trend component', fontsize=16)
plt.show()

[Figure: drug sales detrended by subtracting the trend component]

Deseasonalizing a time series

Deseasonalizing a time series can be approached in several ways; we look at a few below.

1. Take a moving average with window length equal to the seasonal window; this smoothens the series in the process.
2. Seasonal differencing: subtract the value of the previous season from the current value (see the sketch below).
3. Divide the series by the seasonal index obtained from an STL decomposition.
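As referenced in method 2 above, seasonal differencing is a one-liner in pandas; a minimal sketch, assuming monthly data with a 12-period season:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Subtract the value from the same month of the previous year
seasonal_diff = df['value'].diff(12)

plt.plot(seasonal_diff)
plt.title('Seasonally Differenced Drug Sales (lag 12)', fontsize=16)
plt.show()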

If dividing by the seasonal index does not work well, try taking the log of the series, deseasonalizing it, and then restoring the original scale by taking the exponential.

# Subtracting the seasonal component
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# Decomposition of Time Series
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')

# Deseasonalize: divide out the seasonal index
deseasonalized = df.value.values / result_mul.seasonal

# Plot
plt.plot(deseasonalized)
plt.title('Deseasonalized Drug Sales', fontsize=16)
plt.show()

 

[Figure: deseasonalized drug sales]

Testing time series’ seasonality

The most obvious approach to check whether a given time series has seasonality is to plot the series and look for repeating patterns at fixed intervals.

Seasonality is most often determined by the clock or the calendar, for example:

1. Hour of day
2. Day of month
3. Weekly
4. Monthly
5. Yearly

For a more definitive inspection of seasonality, use the autocorrelation function (ACF) plot: a strong seasonal pattern appears as spikes repeated at multiples of the seasonal window.

In most practical datasets, strong repetition patterns are hard to notice because of noise distortion, so careful inspection is needed to capture them.

import pandas as pd
import matplotlib.pyplot as plt

from pandas.plotting import autocorrelation_plot
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')

# Plot  Drawing
plt.rcParams.update({'figure.figsize':(9,5), 'figure.dpi':120})
autocorrelation_plot(df.value.tolist())
plt.title('definitive inspection of the seasonality', fontsize=16)
plt.show()
[Figure: autocorrelation plot for seasonality inspection]

The other alternative is the CHTest (Canova-Hansen test), a statistical test that can determine the number of seasonal differences needed to stationarize the series.
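As a sketch of how this might look with the pmdarima package (assuming pmdarima is installed; its nsdiffs helper with test='ch' runs the Canova-Hansen test):

import pandas as pd
from pmdarima.arima import nsdiffs

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

# m=12 assumes a monthly seasonal period
print(nsdiffs(df['value'], m=12, test='ch'))  # number of seasonal differences suggested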

Treating missing values in a time series

In some cases, a time series will have missing dates and times, meaning the data for those periods was simply not captured or not available. In other cases, the missing values may genuinely be zero, in which case they are easy to fill.

Whatever you do, do not replace the missing values with the series mean, especially if the series is not stationary.

The quickest and dirtiest workaround in such a scenario is to forward-fill the previous value.

The right choice, however, depends on the nature of the series you are dealing with, so it is best to try several approaches before reaching a conclusion. Some methods that have proven useful include:

  • Mean of the nearest neighbor
  • Quadratic interpolation
  • Mean of seasonal counterparts
  • Linear interpolation
  • Backward fill

To measure imputation performance, manually introduce missing values into the time series, impute them with the methods listed above, and then compare the mean squared error of the imputed values against the actual ones.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

'''
Load the datasets
'''

from scipy.interpolate import interp1d
from sklearn.metrics import mean_squared_error

df_orig = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date').head(100)
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10_missings.csv', parse_dates=['date'], index_col='date')

fig, axes = plt.subplots(7, 1, sharex=True, figsize=(10, 12))
plt.rcParams.update({'xtick.bottom' : False})

'''
Actual
'''
df_orig.plot(title='Actual', ax=axes[0], label='Actual', color='red', style=".-")
df.plot(title='Actual', ax=axes[0], label='Actual', color='green', style=".-")
axes[0].legend(["Missing Data", "Available Data"])


'''
Forward Fill
'''
df_ffill = df.ffill()
error = np.round(mean_squared_error(df_orig['value'], df_ffill['value']), 2)
df_ffill['value'].plot(title='Forward Fill (MSE: ' + str(error) +")", ax=axes[1], label='Forward Fill', style=".-")


'''
Backward Fill

'''
df_bfill = df.bfill()
error = np.round(mean_squared_error(df_orig['value'], df_bfill['value']), 2)
df_bfill['value'].plot(title="Backward Fill (MSE: " + str(error) +")", ax=axes[2], label='Back Fill', color='firebrick', style=".-")


'''
Linear Interpolation

'''
df['rownum'] = np.arange(df.shape[0])
df_nona = df.dropna(subset = ['value'])
f = interp1d(df_nona['rownum'], df_nona['value'])
df['linear_fill'] = f(df['rownum'])
error = np.round(mean_squared_error(df_orig['value'], df['linear_fill']), 2)
df['linear_fill'].plot(title="Linear Fill (MSE: " + str(error) +")", ax=axes[3], label='Linear Fill', color='brown', style=".-")


'''
Cubic Interpolation
'''
f2 = interp1d(df_nona['rownum'], df_nona['value'], kind='cubic')
df['cubic_fill'] = f2(df['rownum'])
error = np.round(mean_squared_error(df_orig['value'], df['cubic_fill']), 2)
df['cubic_fill'].plot(title="Cubic Fill (MSE: " + str(error) +")", ax=axes[4], label='Cubic Fill', color='red', style=".-")

'''
Mean of 'n' Nearest Past Neighbors
'''
def knn_mean(ts, n):
    out = np.copy(ts)
    for i, val in enumerate(ts):
        if np.isnan(val):
            n_by_2 = np.ceil(n/2)
            lower = np.max([0, int(i-n_by_2)])
            upper = np.min([len(ts)+1, int(i+n_by_2)])
            ts_near = np.concatenate([ts[lower:i], ts[i:upper]])
            out[i] = np.nanmean(ts_near)
    return out

df['knn_mean'] = knn_mean(df.value.values, 8)
error = np.round(mean_squared_error(df_orig['value'], df['knn_mean']), 2)
df['knn_mean'].plot(title="KNN Mean (MSE: " + str(error) +")", ax=axes[5], label='KNN Mean', color='tomato', alpha=0.5, style=".-")


'''
Seasonal Mean
'''
def seasonal_mean(ts, n, lr=0.7):
    """
    Compute the mean of corresponding seasonal periods
    ts: 1D array-like of the time series
    n: Seasonal window length of the time series
    """
    out = np.copy(ts)
    for i, val in enumerate(ts):
        if np.isnan(val):
            ts_seas = ts[i-1::-n]  # handle previous seasons only
            if np.isnan(np.nanmean(ts_seas)):
                ts_seas = np.concatenate([ts[i-1::-n], ts[i::n]])  # track the previous and forward
            out[i] = np.nanmean(ts_seas) * lr
    return out

df['seasonal_mean'] = seasonal_mean(df.value, n=12, lr=1.25)
error = np.round(mean_squared_error(df_orig['value'], df['seasonal_mean']), 2)
df['seasonal_mean'].plot(title="Seasonal Mean (MSE: " + str(error) +")", ax=axes[6], label='Seasonal Mean', color='blue', alpha=0.5, style=".-")
plt.show()

[Figure: missing value imputation methods compared]

Depending on the accuracy level you wish to attain, it is worthwhile to consider other approaches as well. These comprise:

  • Using a prediction model such as random forest or k-Nearest Neighbors when explanatory variables are available
  • Forecasting the missing values if you have enough past observations
  • Backcasting the missing values if you have enough future observations
  • Forecasting from the counterparts of previous cycles

Autocorrelation and partial autocorrelation functions

In a highly autocorrelated series, the current values can easily be predicted from the previous values (the lags) of the series. Autocorrelation, accordingly, is the correlation of a series with its own lags.

Partial autocorrelation conveys similar information, but it captures the pure correlation between the series and a particular lag, excluding the correlation contributions of the intermediate lags.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')

# ACF and PACF Calculation up to 50 lags

# acf_50 = acf(df.value, nlags=50)
# pacf_50 = pacf(df.value, nlags=50)

# Plot  Drawing
fig, axes = plt.subplots(1,2,figsize=(16,3), dpi= 100)
plot_acf(df.value.tolist(), lags=50, ax=axes[0])
plot_pacf(df.value.tolist(), lags=50, ax=axes[1])
plt.show()
[Figure: ACF and PACF plots]

Computing the partial autocorrelation function

The autoregressive equation of Y is a linear regression of Y on its own lags as predictors. The partial autocorrelation of a given lag is therefore the coefficient of that lag in Y's autoregression equation.

For instance, if Y_t is the current series and Y_{t-1}, Y_{t-2}, Y_{t-3} are its lags, then the partial autocorrelation of lag 3 (Y_{t-3}) is the coefficient $\alpha_3$ of Y_{t-3} in the following equation:

$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \alpha_3 Y_{t-3} + \epsilon_t$
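As a hedged numerical check of this interpretation, assuming the a10 dataset, the lag-3 value from statsmodels' pacf with method='ols' should closely match the lag-3 coefficient of a hand-rolled autoregression:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import pacf

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')
y = df['value'].values

# PACF up to lag 3 using the regression-based (OLS) method
pacf_vals = pacf(y, nlags=3, method='ols')

# Manual check: regress Y_t on Y_{t-1}, Y_{t-2}, Y_{t-3}
Y = y[3:]
X = sm.add_constant(np.column_stack([y[2:-1], y[1:-2], y[:-3]]))
alpha = sm.OLS(Y, X).fit().params

print(pacf_vals[3], alpha[3])  # the two lag-3 estimates should be close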

Lag Plots

A lag plot is a scatter plot of a time series against its own lag, commonly used to check for autocorrelation. If a pattern exists in the plot, the series is autocorrelated; if no pattern exists, the series is most likely random white noise.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas.plotting import lag_plot
plt.rcParams.update({'ytick.left' : False, 'axes.titlepad':10})

# Import
ss = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv')
a10 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')

# Plot
fig, axes = plt.subplots(1, 4, figsize=(10,3), sharex=True, sharey=True, dpi=100)
for i, ax in enumerate(axes.flatten()[:4]):
    lag_plot(ss.value, lag=i+1, ax=ax, c='firebrick')
    ax.set_title('Lag ' + str(i+1))

fig.suptitle("Sun Spots Area's Lag Plots\n(with an increasing lag, points get wide and scattered -> correlation is lesser)\n", y=1.15)

fig, axes = plt.subplots(1, 4, figsize=(10,3), sharex=True, sharey=True, dpi=100)
for i, ax in enumerate(axes.flatten()[:4]):
    lag_plot(a10.value, lag=i+1, ax=ax, c='firebrick')
    ax.set_title('Lag ' + str(i+1))

fig.suptitle("Drug Sales' Lag Plots", y=1.05)
plt.show()
[Figure 1: lag plots of the Sunspots area series]
[Figure 2: lag plots of the drug sales series]

In the Sunspots area example above, the plots become more and more scattered as the lag increases.

Estimating the forecastability of a time series

Forecasting a time series is easier when its patterns are regular and repeatable. Approximate entropy quantifies the regularity and unpredictability of fluctuations in a time series.

The higher the approximate entropy, the harder the series is to forecast.

The other alternative is sample entropy, which is similar to approximate entropy but more consistent in estimating complexity, even for smaller time series. Under approximate entropy, a random time series with fewer data points can score lower than a more regular series, even though a longer random series correctly scores higher.

Sample entropy handles this problem nicely.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ss = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv')
a10 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')
rand_small = np.random.randint(0, 100, size=36)
rand_big = np.random.randint(0, 100, size=136)

def ApEn(U, m, r):
    """Computation of the aproximate entropy"""
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m+1) - _phi(m))

print(ApEn(ss.value, m=2, r=0.2*np.std(ss.value)))    
print(ApEn(a10.value, m=2, r=0.2*np.std(a10.value)))   
print(ApEn(rand_small, m=2, r=0.2*np.std(rand_small)))
print(ApEn(rand_big, m=2, r=0.2*np.std(rand_big)))
[Figure: approximate entropy output]
import numpy as np
import pandas as pd

ss = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/sunspotarea.csv')
a10 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv')
rand_small = np.random.randint(0, 100, size=36)
rand_big = np.random.randint(0, 100, size=136)

def SampEn(U, m, r):
    """Sample entropy computation"""
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for j in range(len(x)) if i != j and _maxdist(x[i], x[j]) <= r]) for i in range(len(x))]
        return sum(C)

    N = len(U)
    return -np.log(_phi(m+1) / _phi(m))

print(SampEn(ss.value, m=2, r=0.2*np.std(ss.value)))      
print(SampEn(a10.value, m=2, r=0.2*np.std(a10.value)))   
print(SampEn(rand_small, m=2, r=0.2*np.std(rand_small)))
print(SampEn(rand_big, m=2, r=0.2*np.std(rand_big)))
[Figure: sample entropy output]

Smoothening a time series

Smoothening a time series can be useful to minimize the effect of noise in a signal and obtain a fair approximation of the noise-filtered series. The smoothed version can also serve as a feature to help explain the original series, and it makes the underlying trend easier to visualize.

There are various ways to smoothen a series, and they comprise:

  • Taking a moving average
  • Doing a LOESS smoothing (localized regression)
  • Doing a LOWESS smoothing (locally weighted regression)

A moving average is just the average of a rolling window of defined width. The window width should be chosen meticulously: a large window over-smoothens the series, and a window equal to the seasonal duration nullifies the seasonal effect.

LOESS is short for 'localized regression'; it fits multiple regression models in the local neighborhood of each point.

It is implemented in the statsmodels package, where the frac argument controls the degree of smoothing by specifying the fraction of nearby data points used when fitting each local regression.

from statsmodels.nonparametric.smoothers_lowess import lowess
plt.rcParams.update({'xtick.bottom' : False, 'axes.titlepad':5})

# Import
df_orig = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/elecequip.csv', parse_dates=['date'], index_col='date')

# 1. The Moving Average
df_ma = df_orig.value.rolling(3, center=True, closed='both').mean()

# 2. Loess Smoothing at 5% and 15%
df_loess_5 = pd.DataFrame(lowess(df_orig.value, np.arange(len(df_orig.value)), frac=0.05)[:, 1], index=df_orig.index, columns=['value'])
df_loess_15 = pd.DataFrame(lowess(df_orig.value, np.arange(len(df_orig.value)), frac=0.15)[:, 1], index=df_orig.index, columns=['value'])

# Plotting
fig, axes = plt.subplots(4,1, figsize=(7, 7), sharex=True, dpi=120)
df_orig['value'].plot(ax=axes[0], color='k', title='The Original Series')
df_loess_5['value'].plot(ax=axes[1], title='Loess Smoothed at 5%')
df_loess_15['value'].plot(ax=axes[2], title='Loess Smoothed at 15%')
df_ma.plot(ax=axes[3], title='Moving Average (3)')
fig.suptitle('Time Series Smoothening ', y=0.95, fontsize=14)
plt.show()
[Figure: time series smoothening]

Using Granger Causality to test if one time series can forecast another

Note that Granger causality should never be used to test whether a lag of Y causes Y; it applies to exogenous variables only, not to the lags of Y itself.

The Granger causality test is based on the idea that if X causes Y, then a forecast of Y built on previous values of Y and previous values of X should outperform a forecast of Y built on previous values of Y alone.

The Granger causality test is implemented in the statsmodels package.

It accepts a 2D array with two columns as its main argument: the values are in the first column, and the predictor (X) is in the second column.

The null hypothesis is that the series in the second column does not Granger-cause the series in the first. If the P-Values are less than the significance level of 0.05, the null hypothesis is rejected, and you conclude that the said lag of X is indeed useful.

Further, the second argument, maxlag, indicates how many lags of Y to include in the test.

from statsmodels.tsa.stattools import grangercausalitytests
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['month'] = df.date.dt.month
grangercausalitytests(df[['value', 'month']], maxlag=2)
[Figure: Granger causality test output]

In the scenario above, the P-Values are zero for all the tests, so the month can indeed be used to forecast drug sales.

Summary

This article has covered the basics of time series analysis in Python and laid a solid foundation for understanding time series as a whole. With that foundation, you can go on to forecast with uncertainty bounds, build time series forecasts that draw on external data sources, and perform anomaly detection. We have also surveyed the essential series models, chief among them the autoregressive and moving-average families.

Remember that time series data arises in many domains, and data scientists who can work with it effectively are in increasing demand. The most visible applications include finance (forecasting stock prices), climate and weather data, retail sales, and economic estimation.
