Forecasting Time Series with Auto-ARIMA

In this article, I compare the results of the auto_arima function with the ARIMA model we developed in the article Forecasting Time Series with ARIMA (https://www.alldatascience.com/time-series/forecasting-time-series-with-arima/).

I made this comparison to see how auto_arima works and what the differences are. The parameters selected by auto_arima are slightly different from the ones I selected in the other article.

The auto_arima function has the advantage of trying to find the best ARIMA parameters automatically by comparing the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) of the tested models. As this test shows, however, it does not always succeed: the results, described later, are very similar to those obtained by running the ARIMA algorithm and estimating the parameters "manually", if not slightly worse.
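For reference, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, L the maximized likelihood, and n the number of observations; for both criteria, lower values indicate a better trade-off between fit and model complexity.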

The first steps in the notebook (data acquisition and cleaning) are the same as in the original article.

Let's import the necessary libraries.

In [1]:
import pandas as pd
import glob
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import pacf

Mount Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Let's import the data. Before doing so, I had to remove the double quotes from the CSV file.

In [3]:
data = pd.read_csv('/content/drive/MyDrive/DP_LIVE_10082021082010744.csv', parse_dates=['TIME'], index_col='TIME', header=0,sep=",")
data.head()
Out[3]:
LOCATION INDICATOR SUBJECT MEASURE FREQUENCY Value Flag Codes
TIME
2000-01-01 EU27_2020 HUR TOT PC_LF M 9.2 NaN
2000-02-01 EU27_2020 HUR TOT PC_LF M 9.2 NaN
2000-03-01 EU27_2020 HUR TOT PC_LF M 9.2 NaN
2000-04-01 EU27_2020 HUR TOT PC_LF M 9.1 NaN
2000-05-01 EU27_2020 HUR TOT PC_LF M 9.1 NaN

Let's remove the unnecessary columns; we are only interested in TIME and Value.

In [4]:
data.drop(['LOCATION','INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1,inplace=True)

Let's plot the data.

In [5]:
data.plot(figsize=(15,5))
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe2dd2afc10>
In [6]:
# In Google Colab we need to install pmdarima before we can use it.
!pip install pmdarima
Requirement already satisfied: pmdarima in /usr/local/lib/python3.7/dist-packages (1.8.3)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.4.1)
Requirement already satisfied: statsmodels!=0.12.0,>=0.11 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (0.13.0)
Requirement already satisfied: pandas>=0.19 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.1.5)
Requirement already satisfied: numpy>=1.19.3 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.19.5)
Requirement already satisfied: setuptools!=50.0.0,>=38.6.0 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (57.4.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.0.1)
Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (0.22.2.post1)
Requirement already satisfied: Cython!=0.29.18,>=0.29 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (0.29.24)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from pmdarima) (1.24.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.19->pmdarima) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.19->pmdarima) (2018.9)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=0.19->pmdarima) (1.15.0)
Requirement already satisfied: patsy>=0.5.2 in /usr/local/lib/python3.7/dist-packages (from statsmodels!=0.12.0,>=0.11->pmdarima) (0.5.2)

Let's try it. First, we need to split the data into train and test sets.

In [7]:
TEST_SIZE = 36  # hold out the last 36 monthly observations (3 years) for testing
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
# Integer positions, used only as x-coordinates for plotting
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape
Out[7]:
((222, 1), (222,), (36, 1), (36,))
In [8]:
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)
Out[8]:
[<matplotlib.lines.Line2D at 0x7fe2dc640ed0>]
In [9]:
from pmdarima.arima import auto_arima

Let's fit the auto_arima model.

In [10]:
model = auto_arima(train, start_p=1, start_q=1,
                      test='adf',        # unit-root test for choosing d (not used here, since d is fixed)
                      max_p=5, max_q=5,  # upper bounds of the search space for p and q
                      m=1,               # period of the series (1 = non-seasonal)
                      d=1,               # force first-order differencing
                      seasonal=False,    # do not fit a seasonal component
                      start_P=0,
                      D=None,
                      trace=True,        # print each candidate model as it is tried
                      error_action='ignore',
                      suppress_warnings=True,
                      stepwise=True)     # stepwise search instead of a full grid search
Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=-456.381, Time=0.20 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=-368.929, Time=0.11 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=-414.310, Time=0.08 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=-392.956, Time=0.11 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=-369.426, Time=0.03 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=-458.020, Time=0.41 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=-449.037, Time=0.22 sec
 ARIMA(3,1,1)(0,0,0)[0] intercept   : AIC=-456.422, Time=0.64 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=-457.172, Time=0.58 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=-457.553, Time=0.52 sec
 ARIMA(3,1,0)(0,0,0)[0] intercept   : AIC=-452.103, Time=0.25 sec
 ARIMA(3,1,2)(0,0,0)[0] intercept   : AIC=-454.830, Time=0.70 sec
 ARIMA(2,1,1)(0,0,0)[0]             : AIC=-459.788, Time=0.21 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=-458.155, Time=0.16 sec
 ARIMA(2,1,0)(0,0,0)[0]             : AIC=-450.723, Time=0.10 sec
 ARIMA(3,1,1)(0,0,0)[0]             : AIC=-458.188, Time=0.30 sec
 ARIMA(2,1,2)(0,0,0)[0]             : AIC=-458.941, Time=0.25 sec
 ARIMA(1,1,0)(0,0,0)[0]             : AIC=-415.728, Time=0.05 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=-459.322, Time=0.21 sec
 ARIMA(3,1,0)(0,0,0)[0]             : AIC=-453.835, Time=0.15 sec
 ARIMA(3,1,2)(0,0,0)[0]             : AIC=-456.947, Time=0.50 sec

Best model:  ARIMA(2,1,1)(0,0,0)[0]          
Total fit time: 5.818 seconds

According to the stepwise search, the best model is ARIMA(2,1,1), the candidate with the lowest AIC (-459.788).

In [11]:
model.summary()
Out[11]:
SARIMAX Results
Dep. Variable: y No. Observations: 222
Model: SARIMAX(2, 1, 1) Log Likelihood 233.894
Date: Wed, 06 Oct 2021 AIC -459.788
Time: 07:56:45 BIC -446.196
Sample: 0 HQIC -454.300
- 222
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.7297 0.108 6.739 0.000 0.518 0.942
ar.L2 0.1793 0.071 2.534 0.011 0.041 0.318
ma.L1 -0.5705 0.102 -5.594 0.000 -0.770 -0.371
sigma2 0.0070 0.000 34.608 0.000 0.007 0.007
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 4298.76
Prob(Q): 0.94 Prob(JB): 0.00
Heteroskedasticity (H): 0.37 Skew: 2.83
Prob(H) (two-sided): 0.00 Kurtosis: 23.85


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
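
As a quick sanity check (my own addition, not part of the original notebook), the information criteria in the summary can be reproduced from the reported log likelihood:

# Reproduce AIC/BIC from the summary values (k = 4 estimated parameters:
# ar.L1, ar.L2, ma.L1 and sigma2; log likelihood = 233.894; n = 222)
loglik, k, n = 233.894, 4, 222
print(2 * k - 2 * loglik)          # -459.788, matches the reported AIC
print(k * np.log(n) - 2 * loglik)  # about -446.18, matches the BIC up to rounding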

According to the model summary, the residuals meet the independence condition (no autocorrelation): the p-value of the Ljung-Box test (Prob(Q)) is greater than 0.05, so we cannot reject the null hypothesis of independence. However, we cannot say that the residuals are homoscedastic (constant variance), because the p-value of the heteroskedasticity test (Prob(H)) is smaller than 0.05.
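
If you prefer the test as a number rather than reading it off the summary, here is a minimal sketch, assuming pmdarima exposes the in-sample residuals via model.resid():

from statsmodels.stats.diagnostic import acorr_ljungbox

# p-value > 0.05 at lag 1 -> we cannot reject independence of the residuals
print(acorr_ljungbox(model.resid(), lags=[1], return_df=True))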

Let's make the prediction.

In [12]:
# Forecast TEST_SIZE periods ahead, also returning the confidence intervals

prediction, confint = model.predict(n_periods=TEST_SIZE, return_conf_int=True)

prediction
Out[12]:
array([7.22732391, 7.17428958, 7.12255592, 7.07529366, 7.03152758,
       6.99111462, 6.95377551, 6.91928079, 6.88741293, 6.85797207,
       6.83077333, 6.80564597, 6.78243223, 6.76098639, 6.74117379,
       6.72287006, 6.7059603 , 6.69033832, 6.67590608, 6.66257295,
       6.65025523, 6.63887559, 6.62836259, 6.61865023, 6.60967754,
       6.60138818, 6.59373011, 6.58665526, 6.58011921, 6.57408092,
       6.5685025 , 6.5633489 , 6.5585878 , 6.55418928, 6.55012574,
       6.54637167])
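
Note how the forecast flattens out as the horizon grows: the selected model has no intercept (drift) term, so the predicted first differences decay toward zero and the forecast converges to a constant level.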
In [13]:
cf = pd.DataFrame(confint)
In [14]:
# Plot the forecast for the last 3 years (in orange)
# over the actual series
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
                cf[0],
                cf[1],color='grey',alpha=.3)
Out[14]:
<matplotlib.collections.PolyCollection at 0x7fe2cdb4c950>
In [15]:
model.plot_diagnostics(figsize=(14,10))
plt.show()

We can see from the diagnostic plots that the correlogram does not show any significant autocorrelation in the residuals. The standardized residual plot shows changing variance, while the normal Q-Q plot shows that the residuals do not follow a normal distribution (although this is not a strict requirement for the model to be valid).
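To complement the plots with a number, a short sketch (again assuming model.resid() returns the in-sample residuals) can run the Jarque-Bera normality test directly:

from scipy.stats import jarque_bera

# p-value < 0.05 -> reject normality of the residuals, consistent with the Q-Q plot
stat, pvalue = jarque_bera(model.resid())
print(stat, pvalue)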

Let's calculate the SMAPE (Symmetric Mean Absolute Percentage Error).

In [16]:
def calcsmape(actual, forecast):
    # SMAPE = (1/n) * sum(2|F - A| / (|A| + |F|)), returned as a fraction rather than a percentage
    return 1/len(actual) * np.sum(2 * np.abs(forecast-actual) / (np.abs(actual) + np.abs(forecast)))
In [17]:
smape=calcsmape(test.Value,prediction)
smape
Out[17]:
0.0531805335780013
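
A SMAPE of 0.053 corresponds to a symmetric error of roughly 5.3% over the 3-year forecast horizon.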

Now let's repeat the exercise with a 1-year prediction.

In [18]:
TEST_SIZE = 12
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape
Out[18]:
((246, 1), (246,), (12, 1), (12,))
In [19]:
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)
Out[19]:
[<matplotlib.lines.Line2D at 0x7fe2c05c5d10>]

Let's fit the new model.

In [20]:
model = auto_arima(train, start_p=1, start_q=1,
                      test='adf',
                      max_p=5, max_q=5,
                      m=1,             
                      d=1,          
                      seasonal=False,   
                      start_P=0, 
                      D=None, 
                      trace=True,
                      error_action='ignore',  
                      suppress_warnings=True, 
                      stepwise=True)
Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=-470.207, Time=0.32 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=-388.075, Time=0.17 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=-434.457, Time=0.07 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=-413.125, Time=0.37 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=-388.831, Time=0.04 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=-469.978, Time=0.38 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=-469.613, Time=0.48 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=-440.062, Time=0.23 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=-465.135, Time=0.27 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=-470.017, Time=0.67 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=-472.206, Time=0.16 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=-414.332, Time=0.17 sec
 ARIMA(1,1,0)(0,0,0)[0]             : AIC=-436.122, Time=0.05 sec
 ARIMA(2,1,1)(0,0,0)[0]             : AIC=-471.977, Time=0.19 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=-471.611, Time=0.20 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=-441.574, Time=0.12 sec
 ARIMA(2,1,0)(0,0,0)[0]             : AIC=-467.098, Time=0.10 sec
 ARIMA(2,1,2)(0,0,0)[0]             : AIC=-472.016, Time=0.32 sec

Best model:  ARIMA(1,1,1)(0,0,0)[0]          
Total fit time: 4.331 seconds
In [21]:
model.summary()
Out[21]:
SARIMAX Results
Dep. Variable: y No. Observations: 246
Model: SARIMAX(1, 1, 1) Log Likelihood 239.103
Date: Wed, 06 Oct 2021 AIC -472.206
Time: 07:56:51 BIC -461.703
Sample: 0 HQIC -467.977
- 246
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.9310 0.046 20.226 0.000 0.841 1.021
ma.L1 -0.6716 0.062 -10.771 0.000 -0.794 -0.549
sigma2 0.0083 0.000 31.761 0.000 0.008 0.009
Ljung-Box (L1) (Q): 0.60 Jarque-Bera (JB): 2419.79
Prob(Q): 0.44 Prob(JB): 0.00
Heteroskedasticity (H): 0.84 Skew: 2.31
Prob(H) (two-sided): 0.43 Kurtosis: 17.69


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

This time, the p-value of the heteroskedasticity test is greater than 0.05, so we cannot reject the hypothesis of constant variance in the residuals. This makes the model more reliable than the previous one and suggests that ARIMA is better suited to short-term forecasting.

Let's predict!

In [22]:
prediction, confint = model.predict(n_periods=TEST_SIZE, return_conf_int=True)

prediction
Out[22]:
array([7.46670305, 7.62191067, 7.76641558, 7.90095578, 8.02621844,
       8.1428433 , 8.251426  , 8.35252111, 8.44664496, 8.53427827,
       8.6158686 , 8.69183267])
In [23]:
cf = pd.DataFrame(confint)
In [24]:
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
                cf[0],
                cf[1],color='grey',alpha=.3)
Out[24]:
<matplotlib.collections.PolyCollection at 0x7fe2c32ad890>
In [25]:
model.plot_diagnostics(figsize=(14,10))
plt.show()

Again, the plots show that the residuals are not correlated. The standardized residual plot looks similar to the previous one, but as we saw in the model summary, this model is more reliable because we cannot reject the null hypothesis of homoscedasticity (constant variance).

In [26]:
smape=calcsmape(test.Value,prediction)
smape
Out[26]:
0.09249388439490931

Conclusion

As we can see from the notebook, the results differ slightly from those of the first attempt. The parameters chosen by auto_arima are different, but the results are very similar, though slightly worse with the auto_arima model, as evidenced by the SMAPE metric.

Here, the 3-year prediction actually produces a lower SMAPE value than the 1-year one, but the heteroskedasticity test reveals that its model does not account for all of the variance in the data. The 1-year prediction is more reliable because it meets all of the conditions of a valid ARIMA model (no correlation and constant variance in the residuals).

This has been a really interesting project that has helped me better understand how ARIMA works and how to handle real-world data in order to make a prediction.