In this post, I’ll attempt to show how to forecast time series data using ARIMA (autoregressive integrated moving average). As usual, I try to practice with “real-world”, which can be obtained easily by downloading open data from government websites.
I chose the unemployment rate in the European Union’s 27 member countries. The data were obtained from the OECD data portal (http://dataportal.oecd.org/).
First of all, I’m going to try to clean up the data, in this case, it’s rather easy, and then try to modify it so that the data can be used in an ARIMA model. Finally, I will try to figure out how well the model fits our data.
Great explanations on the ARIMA model can be found on many internet sites, but I will present a brief summary here.
AR Term: AR is Auto Regressive. This is the number of previous observations that are correlated with the current one. (partial autocorrelation). This is the p parameter in the model (number of lags with significant correlation with the current observation).
I Term: I am Integrated. ARIMA needs the time series to be stationary in order to work properly, that is the data shouldn’t have trend or seasonality. If the data has a trend we must transform it in order to model this trend. Usually what we do is differencing, that is, modeling the difference between an observation and the previous one (or the nth previous ones) rather than the actual values. This is the parameter d in the model, which means the order of differencing.
MA Term: MA is Moving Average. It accounts for the error in the previous lags, that is, we take into account the errors (residuals) in the previous observations (difference between the expected and the actual data) to predict the future values. This is the parameter q in the model, which is the order of the moving average (the number of previous observations to take into account in the moving average).
Let's import the necessary libraries
import pandas as pd
import glob
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import pacf
Mount google drive.
from google.colab import drive
drive.mount('/content/drive')
Let's import the data, before doing it I've had to remove the double quotes from the csv file.
data = pd.read_csv('/content/drive/MyDrive/DP_LIVE_10082021082010744.csv', parse_dates=['TIME'], index_col='TIME', header=0,sep=",")
data.head()
Let's remove the unnecessary columns, we are only interested in TIME and Value
data.drop(['LOCATION','INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1,inplace=True)
Let's plot the data
data.plot(figsize=(15,5))
The first step is to check for stationarity. ARIMA models require stationary data, which means that the values in the time series do not depend on the time at which the data is observed. The variance in a stationary time series is constant. The Dickey–Fuller test will be used. The null hypothesis here is that the data contains a unit root, implying that the time series is not stationary. Because the p-value in the results is greater than 0.05, we cannot reject the null hypothesis (there is a unit root).
from statsmodels.tsa.stattools import adfuller
X = data.Value
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
print('\t%s: %.3f' % (key, value))
Let's see what happens if we difference the data to make it stationary.
data_diff=data.diff().dropna()
The trend in the data has been removed, as shown in the plot.¶
data_diff.plot(figsize=(15,5))
On the differenced data, run the Dickey-Fuller test. As we can see, the p-value now allows us to reject the null hypothesis, allowing us to conclude that the data is stationary.
from statsmodels.tsa.stattools import adfuller
dataX = data_diff.Value
fuller = adfuller(dataX)
print('ADF Statistic: %f' % fuller[0])
print('p-value: %f' % fuller[1])
print('Critical Values:')
for key, value in fuller[4].items():
print('\t%s: %.3f' % (key, value))
The downloaded data is seasonally adjusted so we don't need to take into account seasonality.
Let us now examine the autocorrelation and partial autocorrelation plots. The autocorrelation plot (which assists us in determining the MA term) shows a gradually decreasing trend, indicating that no MA term is required, whereas the partial autocorrelation plot (which assists us in determining the AR term) drops abruptly after two lags (we do not count the first lag). This implies that we should use an AR(2) model.
plot_acf(np.array(data_diff))
plt.show()
plot_pacf(np.array(data_diff))
plt.show()
#In google colab we should install pmdarima in order to use it.
!pip install pmdarima
Let's try it. First we should split the data into train and test sets.
TEST_SIZE = 36
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)
from pmdarima.arima import ARIMA
from pmdarima.arima import ndiffs
from pmdarima.arima import nsdiffs
Let's check for the need for differencing and seasonally differencing the data using the ndiffs and nsdiffs functions.
First, the ndiffs function attempts to predict the order of differencing required for the data to be stationary; as we saw with the Dickey-Fuller tests, the data requires one order of differencing to be stationary.
ndiffs(data)
The nsdiffs function displays the D parameter, which determines whether the data requires seasonal differencing. We chose 12 as the m parameter because we have monthly data (The seasonal period to check). The result is 0, as expected, because the data is already seasonally adjusted, so we don't need to do anything about seasonality.
nsdiffs(data,12)
Let's fit now the ARIMA model. The p parameter is 2 (AR(2)), the d parameter is 1 (1 order of differencing), the q parameter in this case is 0 (no MA model).
model = ARIMA(order=(2,1,0))
results = model.fit(train)
model.summary()
According to the model summary:
Because the p-value of the Ljung-Box test is greater than 0.05, we cannot reject the null hypothesis that the residuals are independent.
Because the p-value of the heteroskedasticity test is less than 0.05, we can reject the null hypothesis that the error variances are equal. Because this is an assumption for a correct ARIMA model, we cannot completely rely on these results.
Let's make the prediction of the three years of test data and compare it.
# Forecast
prediction, confint = model.predict(n_periods=TEST_SIZE, return_conf_int=True)
prediction
The graph below compares the predictions and the test data. The grey zone represents the prediction's confidence interval, and as we can see, the observed data fits well within it.
cf= pd.DataFrame(confint)
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
cf[0],
cf[1],color='grey',alpha=.3)
Finally, consider SMAPE (Symmetric Mean of Absolute Percentage Errors), which is similar to MAPE but avoids the issue of penalizing negative errors (actual < forcasted).
def calcsmape(actual, forecast):
return 1/len(actual) * np.sum(2 * np.abs(forecast-actual) / (np.abs(actual) + np.abs(forecast)))
smape=calcsmape(test.Value,prediction)
smape
Let's try with 1 year of data instead of three.
TEST_SIZE = 12
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)
model1y = ARIMA(order=(2,1,0))
results1y = model1y.fit(train)
# Forecast
prediction, confint = model1y.predict(n_periods=TEST_SIZE, return_conf_int=True)
prediction
cf= pd.DataFrame(confint)
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
cf[0],
cf[1],color='grey',alpha=.3)
model1y.summary()
The model summary states:
We continue to hold true to the assumption that the residuals are not autocorrelated (Ljung-Box test)
We can also reject the null hypothesis of heteroskedasticity in the residuals this time (Prob(H) 0.05), making these results more reliable than the 3-year prediction results.
smape=calcsmape(test.Value,prediction)
smape
Conclusion
We learned how to obtain data from official sources so that we could test our data science skills with real-world data. The EU-27 unemployment figures were obtained from the OECD’s official website. We practiced with time series data and an ARIMA model this time. We first checked for stationarity in the data, then used differencing to make it stationary, and finally used autocorrelation and partial autocorrelation plots to estimate the ARIMA model parameters. Finally, using these parameters, we ran the model and validated it using the SMAPE metric and the autocorrelation plot of squared residuals (Symmetric Mean Absolute Error).
We also interpreted the model results summary to see how much we can trust these results. The conclusion is that a one-year prediction yields a better model than a three-year prediction, given that the three-year prediction still has unaccounted for variance in the residuals.