Let's import the necessary libraries

import pandas as pd
import glob
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import pacf

Mount google drive.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Let's import the data, before doing it I've had to remove the double quotes from the csv file.

data = pd.read_csv('/content/drive/MyDrive/DP_LIVE_10082021082010744.csv', parse_dates=['TIME'], index_col='TIME', header=0,sep=",")
data.head()

Let's remove the unnecessary columns, we are only interested in TIME and Value

data.drop(['LOCATION','INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1,inplace=True)

Let's plot the data

data.plot(figsize=(15,5))

The first step is to check for stationarity. ARIMA models require stationary data, which means that the values in the time series do not depend on the time at which the data is observed. The variance in a stationary time series is constant. The Dickey–Fuller test will be used. The null hypothesis here is that the data contains a unit root, implying that the time series is not stationary. Because the p-value in the results is greater than 0.05, we cannot reject the null hypothesis (there is a unit root).

from statsmodels.tsa.stattools import adfuller
X = data.Value
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

ADF Statistic: -1.925368
p-value: 0.320223
Critical Values:
	1%: -3.457
	5%: -2.873
	10%: -2.573

Let's see what happens if we difference the data to make it stationary.

data_diff=data.diff().dropna()

The trend in the data has been removed, as shown in the plot.¶

data_diff.plot(figsize=(15,5))

On the differenced data, run the Dickey-Fuller test. As we can see, the p-value now allows us to reject the null hypothesis, allowing us to conclude that the data is stationary.

from statsmodels.tsa.stattools import adfuller
dataX = data_diff.Value
fuller = adfuller(dataX)
print('ADF Statistic: %f' % fuller[0])
print('p-value: %f' % fuller[1])
print('Critical Values:')
for key, value in fuller[4].items():
	print('\t%s: %.3f' % (key, value))

ADF Statistic: -2.987032
p-value: 0.036123
Critical Values:
	1%: -3.457
	5%: -2.873
	10%: -2.573

The downloaded data is seasonally adjusted so we don't need to take into account seasonality.

Let us now examine the autocorrelation and partial autocorrelation plots. The autocorrelation plot (which assists us in determining the MA term) shows a gradually decreasing trend, indicating that no MA term is required, whereas the partial autocorrelation plot (which assists us in determining the AR term) drops abruptly after two lags (we do not count the first lag). This implies that we should use an AR(2) model.

plot_acf(np.array(data_diff))
plt.show()

plot_pacf(np.array(data_diff))
plt.show()

/usr/local/lib/python3.7/dist-packages/statsmodels/graphics/tsaplots.py:353: FutureWarning: The default method 'yw' can produce PACF values outside of the [-1,1] interval. After 0.13, the default will change tounadjusted Yule-Walker ('ywm'). You can use this method now by setting method='ywm'.
  FutureWarning,

#In google colab we should install pmdarima in order to use it.
!pip install pmdarima

Requirement already satisfied: pmdarima in /usr/local/lib/python3.7/dist-packages (1.8.3)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.0.1)
Requirement already satisfied: numpy>=1.19.3 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.19.5)
Requirement already satisfied: Cython!=0.29.18,>=0.29 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (0.29.24)
Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (0.22.2.post1)
Requirement already satisfied: statsmodels!=0.12.0,>=0.11 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (0.13.0)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.4.1)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.24.3)
Requirement already satisfied: setuptools!=50.0.0,>=38.6.0 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (57.4.0)
Requirement already satisfied: pandas>=0.19 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.1.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.19->pmdarima) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.19->pmdarima) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=0.19->pmdarima) (1.15.0)
Requirement already satisfied: patsy>=0.5.2 in /usr/local/lib/python3.7/dist-packages (from statsmodels!=0.12.0,>=0.11->pmdarima) (0.5.2)

Let's try it. First we should split the data into train and test sets.

TEST_SIZE = 36
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape

((222, 1), (222,), (36, 1), (36,))

fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)

[]

from pmdarima.arima import ARIMA
from pmdarima.arima import ndiffs
from pmdarima.arima import nsdiffs

Let's check for the need for differencing and seasonally differencing the data using the ndiffs and nsdiffs functions.

First, the ndiffs function attempts to predict the order of differencing required for the data to be stationary; as we saw with the Dickey-Fuller tests, the data requires one order of differencing to be stationary.

ndiffs(data)

1

The nsdiffs function displays the D parameter, which determines whether the data requires seasonal differencing. We chose 12 as the m parameter because we have monthly data (The seasonal period to check). The result is 0, as expected, because the data is already seasonally adjusted, so we don't need to do anything about seasonality.

nsdiffs(data,12)

0

Let's fit now the ARIMA model. The p parameter is 2 (AR(2)), the d parameter is 1 (1 order of differencing), the q parameter in this case is 0 (no MA model).

model = ARIMA(order=(2,1,0))
results = model.fit(train)

model.summary()

According to the model summary:

Because the p-value of the Ljung-Box test is greater than 0.05, we cannot reject the null hypothesis that the residuals are independent.
Because the p-value of the heteroskedasticity test is less than 0.05, we can reject the null hypothesis that the error variances are equal. Because this is an assumption for a correct ARIMA model, we cannot completely rely on these results.

Let's make the prediction of the three years of test data and compare it.

# Forecast
prediction, confint = model.predict(n_periods=TEST_SIZE, return_conf_int=True)
prediction

array([7.21812198, 7.19328326, 7.15130377, 7.12723166, 7.10114674,
       7.0815733 , 7.06292766, 7.04708799, 7.03235417, 7.01901595,
       7.0064812 , 6.99470751, 6.98345078, 6.97262989, 6.96212741,
       6.95188043, 6.94182615, 6.93192324, 6.92213608, 6.91243903,
       6.90281131, 6.89323732, 6.88370479, 6.87420434, 6.86472867,
       6.85527217, 6.84583046, 6.8364002 , 6.82697879, 6.81756421,
       6.80815492, 6.79874971, 6.78934766, 6.77994805, 6.77055032,
       6.76115405])

The graph below compares the predictions and the test data. The grey zone represents the prediction's confidence interval, and as we can see, the observed data fits well within it.

cf= pd.DataFrame(confint)
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
                cf[0],
                cf[1],color='grey',alpha=.3)

Finally, consider SMAPE (Symmetric Mean of Absolute Percentage Errors), which is similar to MAPE but avoids the issue of penalizing negative errors (actual < forcasted).

def calcsmape(actual, forecast):
    return 1/len(actual) * np.sum(2 * np.abs(forecast-actual) / (np.abs(actual) + np.abs(forecast)))

smape=calcsmape(test.Value,prediction)
smape

0.05145800665592428

Let's try with 1 year of data instead of three.

TEST_SIZE = 12
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)

[]

model1y = ARIMA(order=(2,1,0))
results1y = model1y.fit(train)
# Forecast
prediction, confint = model1y.predict(n_periods=TEST_SIZE, return_conf_int=True)
prediction

array([7.52485398, 7.73496936, 7.87703065, 7.99385755, 8.07854597,
       8.14466595, 8.19366651, 8.23091006, 8.25848968, 8.27896832,
       8.29385614, 8.30452695])

cf= pd.DataFrame(confint)
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
                cf[0],
                cf[1],color='grey',alpha=.3)

model1y.summary()

The model summary states:

We continue to hold true to the assumption that the residuals are not autocorrelated (Ljung-Box test)
We can also reject the null hypothesis of heteroskedasticity in the residuals this time (Prob(H) 0.05), making these results more reliable than the 3-year prediction results.

smape=calcsmape(test.Value,prediction)
smape

0.08096459003790507

	LOCATION	INDICATOR	SUBJECT	MEASURE	FREQUENCY	Value	Flag Codes
TIME
2000-01-01	EU27_2020	HUR	TOT	PC_LF	M	9.2	NaN
2000-02-01	EU27_2020	HUR	TOT	PC_LF	M	9.2	NaN
2000-03-01	EU27_2020	HUR	TOT	PC_LF	M	9.2	NaN
2000-04-01	EU27_2020	HUR	TOT	PC_LF	M	9.1	NaN
2000-05-01	EU27_2020	HUR	TOT	PC_LF	M	9.1	NaN

Dep. Variable:	y	No. Observations:	222
Model:	SARIMAX(2, 1, 0)	Log Likelihood	228.519
Date:	Tue, 05 Oct 2021	AIC	-449.037
Time:	08:25:01	BIC	-435.445
Sample:	0	HQIC	-443.549
	- 222
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	-0.0032	0.007	-0.441	0.659	-0.018	0.011
ar.L1	0.2641	0.044	6.040	0.000	0.178	0.350
ar.L2	0.3933	0.036	10.899	0.000	0.323	0.464
sigma2	0.0074	0.000	28.176	0.000	0.007	0.008

Ljung-Box (L1) (Q):	0.74	Jarque-Bera (JB):	3646.60
Prob(Q):	0.39	Prob(JB):	0.00
Heteroskedasticity (H):	0.40	Skew:	2.46
Prob(H) (two-sided):	0.00	Kurtosis:	22.28

Dep. Variable:	y	No. Observations:	246
Model:	SARIMAX(2, 1, 0)	Log Likelihood	236.568
Date:	Tue, 05 Oct 2021	AIC	-465.135
Time:	08:25:02	BIC	-451.130
Sample:	0	HQIC	-459.496
	- 246
Covariance Type:	opg

Ljung-Box (L1) (Q):	0.29	Jarque-Bera (JB):	2348.02
Prob(Q):	0.59	Prob(JB):	0.00
Heteroskedasticity (H):	0.81	Skew:	2.14
Prob(H) (two-sided):	0.35	Kurtosis:	17.55

	coef	std err	z	P>\|z\|	[0.025	0.975]
intercept	-0.0011	0.007	-0.158	0.874	-0.015	0.013
ar.L1	0.2920	0.035	8.424	0.000	0.224	0.360
ar.L2	0.3640	0.036	10.012	0.000	0.293	0.435
sigma2	0.0085	0.000	26.482	0.000	0.008	0.009