Data acquisition and cleaning¶

Let's start importing the required libraries.

import pandas as pd
import glob
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

Mounting the google drive unit where the data was uploaded.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Retrieving traffic data,simultaneously filtering the sensors we are interested in at the same time. Since the files are large I read them piecemeal to avoid memory errors. This has been executed several times in order to load all the files to google drive. The I save and load the dataset to pkl file to make faster subsequent executions.

#Import pollution data from January 2019. All files in the folder


#path = r'/content/drive/MyDrive/Traffic/'
#all_csv = glob.glob(path + "/*.csv")

#traffic_df=pd.DataFrame()
#id_list = [5067,5085,5090,10580,10250] #Selected traffic sensors.
#for filename in all_csv:
#    iter_csv = pd.read_csv(filename, index_col=None, header=0,sep=';', iterator=True, chunksize=10000)
#    aux_df = pd.concat((chunk for chunk in iter_csv), axis=0, ignore_index=True)
#    traffic_df=traffic_df.append(aux_df[aux_df['id'].isin(id_list)],ignore_index=True)

#save dataframe to load it faster in other executions.
#traffic_df.to_pickle(r'/content/drive/MyDrive/Traffic/traffic_df.pkl')

traffic_df=pd.read_pickle(r'/content/drive/MyDrive/Traffic/traffic_df.pkl')

#traffic_df=traffic_df.loc[traffic_df['id'].isin([5067,5085,5090])]

Let's check what we've got.

traffic_df.describe()

As we can see, the column vmed (average speed) is always 0 because the selected sensors do not measure traffic speed, so we will delete this column, also the column periodo_integracion can be deleted because it is always 15 minutes, which is the period of sensor measurements in each row. The tipo_elem column also always has the same value, it contains the sensor type, so it is of no interest to us. I will also translate the rest of the columns into English for better understanding.

traffic_df.drop(['tipo_elem','vmed','periodo_integracion'], inplace=True,axis=1)
traffic_df.rename(columns={'fecha':'DATE','intensidad':'INTENSITY','ocupacion':'OCCUPANCY','carga':'LOAD'},inplace=True)

According to the data definition found on the same website, an E or an S in the error column means that the readings are not valid, which we should check.

traffic_df[(traffic_df['error']=='E') | (traffic_df['error']=='S')]

The data seems to be in order, let's proceed with the preparation of our data set. Since we have 5 traffic sensors, we want to use the hourly average of all the readings from the sensors, so we will get the hour and day from the date and calculate the average from those columns. Then we get rid of the columns DATE and error.

traffic_df["HOUR"] = pd.DatetimeIndex(traffic_df.DATE).hour
traffic_df["DAY"] = pd.DatetimeIndex(traffic_df.DATE).date
traffic_df=traffic_df.drop(['DATE','error'],axis=1)

traffic_df.head()

We proceed to group the data and average all the sensors.

traffic_grouped_df=traffic_df.groupby(['HOUR','DAY']).mean()
traffic_grouped_df=traffic_grouped_df.drop(['id'],axis=1)
traffic_grouped_df.head()

We need to reset the index to ungroup the data once we have the mean.

traffic_grouped_df=traffic_grouped_df.reset_index()

traffic_grouped_df.head()

All right, we have the traffic data set where we want it. Let's move on to the pollution data set. Let's import the data.

#Import pollution data from January 2019. All files in the folder

#path = r'/content/drive/MyDrive/Pollution/'
#all_csv = glob.glob(path + "/*.csv")

#li = []

#for filename in all_csv:
#    df = pd.read_csv(filename, index_col=None, header=0,sep=';')
#    li.append(df)

#pollution_df = pd.concat(li, axis=0, ignore_index=True)

#pollution_df.to_pickle(r'/content/drive/MyDrive/Traffic/pollution_df.pkl')

pollution_df = pd.read_pickle(r'/content/drive/MyDrive/Traffic/pollution_df.pkl')

pollution_df.iloc[:,4:16].head()

This dataset is structured differently, containing the hourly data in columns. Let's start by filtering the data and picking out the sensor we are interested in.(Some columns are hidden for clarity)

pollution_df=pollution_df[pollution_df['PUNTO_MUESTREO'].str.contains('28079056')]

Also, we want to filter the magnitude we are interested in (NO2).

pollution_df=pollution_df.loc[pollution_df['MAGNITUD'] == 8]

#Get the Vxx columns
v_cols = [col for col in pollution_df.columns if col.startswith('V')]
print(v_cols)

['V01', 'V02', 'V03', 'V04', 'V05', 'V06', 'V07', 'V08', 'V09', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24']

According to the data specification, columns Vxx contain the validity of the measured values. If a column contains an N, the reading is invalid. Let's check how many we have.

#Let's count the V and N values
pollution_df[v_cols].stack().value_counts()

V    18050
N      190
dtype: int64

We have some invalid readings. To eliminate them, I will assign the previous day's value at the same hour. To do this, we create a function and apply it.

#Create a function to set the previous day value
def setPrevious(vcol,df):
    #Get the colname with the hour
    hcolname=vcol.name.replace('V','H')
    #Set the previous values
    df[hcolname]=np.where(df[vcol.name] == 'N',df[hcolname].shift(1), df[hcolname])

#Apply function to set the values. Use _ (drop away variable) to avoid output.
_ =pollution_df.apply(lambda x: setPrevious(x,pollution_df) if x.name.startswith('V') else x)

Now we can drop the V columns.

pollution_df=pollution_df.drop(v_cols,axis=1)

We need to have the NO$_{2}$ readings in one column to be able to merge with the traffic dataset. Let's get the Hxx columns, and melt the data.

#Get the Hxx columns
h_cols = [col for col in pollution_df.columns if col.startswith('H')]
print(h_cols)

['H01', 'H02', 'H03', 'H04', 'H05', 'H06', 'H07', 'H08', 'H09', 'H10', 'H11', 'H12', 'H13', 'H14', 'H15', 'H16', 'H17', 'H18', 'H19', 'H20', 'H21', 'H22', 'H23', 'H24']

pollution_df=pd.melt(pollution_df, id_vars=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','MAGNITUD'], value_vars=h_cols,var_name='Hour', value_name='NO2')

Good. We need a column DATE in the same format as in the traffic record. Also the hour (0 to 23)

pollution_df['DATE'] = pd.to_datetime(pollution_df['ANO'].astype(str)+'/'+pollution_df['MES'].astype(str)+'/'+pollution_df['DIA'].astype(str),dayfirst=True)

pollution_df.head()

Now we need to convert the format of the hour column in our traffic data to match the format in the pollution data set.

traffic_grouped_df['HOUR']=traffic_grouped_df['HOUR'].apply(lambda x: 'H'+str(x+1).zfill(2))

traffic_grouped_df.head()

Let's drop some unnecessary columns and change the column names in the pollution dataset.

pollution_df=pollution_df.drop(['ANO','MES','DIA'],axis=1)
pollution_df.rename(columns={'DATE':'DAY','Hour':'HOUR'},inplace=True)
traffic_grouped_df['DAY']=pd.to_datetime(traffic_grouped_df['DAY'])

pollution_df.head()

Now we can merge the data!. Let's also drop the unnecessary columns.

df=pd.merge(traffic_grouped_df, pollution_df, how="left", on=['DAY','HOUR'])
df=df.drop(['PROVINCIA','MUNICIPIO','MAGNITUD'],axis=1)

df.describe()

All right, we have got the traffic and pollution data together, we are almost done. Now let's read in the weather data. This data set is similar to the pollution data, so we need to repeat the same data cleaning and transformation.

#Import weather data


path = r'/content/drive/MyDrive/Meteo/'
all_csv = glob.glob(path + "/*.csv")

li = []

for filename in all_csv:
    auxdf = pd.read_csv(filename, index_col=None, header=0,sep=';')
    li.append(auxdf)

meteo_df = pd.concat(li, axis=0, ignore_index=True)

meteo_df.iloc[:,4:12].head()

Let's get the sensor we are interested in.

meteo_df=meteo_df.loc[meteo_df['PUNTO_MUESTREO'].str.contains('28079056')]

Let's fix the incorrect values and melt the data

#Apply function to set the previous values. Use _ (drop away variable) to avoid output.
_ =meteo_df.apply(lambda x: setPrevious(x,meteo_df) if x.name.startswith('V') else x)
#Get the Vxx columns
v_cols = [col for col in meteo_df.columns if col.startswith('V')]
meteo_df=meteo_df.drop(v_cols,axis=1)
#Get hcols
h_cols = [col for col in meteo_df.columns if col.startswith('H')]
#Melt the data
meteo_df=pd.melt(meteo_df, id_vars=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','MAGNITUD'], value_vars=h_cols,var_name='Hour', value_name='VALUE')

meteo_df.head()

Since we want to use multiple measures in this case, we need an extra step. Let's pivot the data to convert the measures into columns.

meteo_df=meteo_df.pivot(index=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','Hour'],columns='MAGNITUD',values='VALUE')
meteo_df.reset_index(inplace=True)
meteo_df.head()

We rename the columns so that they've proper names instead of codes, as specified in the data specification.

meteo_df.columns=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','Hour','WIND_SPEED','WIND_DIR','TEMPERATURE','REL_HUMIDITY','BAR_PRESSURE','PRECIPITATION']

Now we create a date column in DateTime format.

meteo_df['DATE'] = pd.to_datetime(meteo_df['ANO'].astype(str)+'/'+meteo_df['MES'].astype(str)+'/'+meteo_df['DIA'].astype(str),dayfirst=True)

meteo_df.iloc[:,5:].head()

Drop unnecessary columns and change the key column names to prepare the dataset for merging.

meteo_df=meteo_df.drop(['PROVINCIA','MUNICIPIO','ANO','MES','DIA'],axis=1)
meteo_df.rename(columns={'DATE':'DAY','Hour':'HOUR'},inplace=True)

meteo_df.head()

Great. Let's merge!

df=pd.merge(df, meteo_df, how="left", on=['DAY','HOUR'])

df.describe()

Finally, let's check for NA values.

df.isnull().any()

HOUR             False
DAY              False
INTENSITY        False
OCCUPANCY        False
LOAD             False
NO2               True
WIND_SPEED        True
WIND_DIR          True
TEMPERATURE       True
REL_HUMIDITY      True
BAR_PRESSURE      True
PRECIPITATION     True
dtype: bool

We have some null values in the column Wind Direction. Let us check how many there are.

df[df['WIND_DIR'].isnull()]

We have some rows with null value, since the set is not a large percentage of the data set, we just drop them.

df.dropna(inplace=True)

df.describe()

Right. To conclude this section, I will categorize the NO$_{2}$ data, so that the exercise is clearer and we can build a classification model.

The limit for the hourly level of NO$_{2}$ in the air triggers an alarm in Madrid is 180 µg/m3, so we will use this number to classify the lines into low and high (pollution level)

df['NO2_CAT']=pd.cut(df['NO2'], bins=[0, 181,330], include_lowest=True,right=False,labels=['Low','High'])

df.head()

df[['NO2_CAT']].value_counts()

NO2_CAT
Low        17356
High          95
dtype: int64

A little bit of data exploration.¶

Let's check the data in both categories (High and Low)

df[df['NO2_CAT']=='Low'].describe()

df[df['NO2_CAT']=='High'].describe()

The traffic-related variables (intensity, occupancy, and load) differ in both categories. The average, standard deviation and quantile of the two categories are far apart. Wind speed also has different values in the two categories, suggesting that these variables may be important in determining NO$_{2}$ values.

Let's check now the correlation among the variables.

correlation=df.corr()
fig, ax = plt.subplots(figsize=(10,10))
sn.heatmap(correlation,annot=True,linewidths=.5, ax=ax)
plt.show()

It seems that the variable most correlated with NO$_{2}$ is wind speed, which is really important. Traffic variables also seem to have an important effect on NO$_{2}$.

Let's look at some scatter plots. As we can see below, the low points aren't separated from the high points, only the INTENSITY -OCCUPANCY graphs show some low points, but mixed with the high ones, there's no clear line separating them.

#scatter plots
data=df[['WIND_SPEED','INTENSITY','OCCUPANCY','NO2_CAT']]
sn.pairplot(data,hue='NO2_CAT')

Predictive analysis.¶

In this section, we will use classification methods (SGD Classifier) to try to classify and predict the values of NO$_{2}$ levels using traffic and weather data. As you can see from the number of values, this is an imbalanced data set. Let us take a look at how the algorithm works. The values of some parameters were obtained by parameter optimization of the algorithm, which is not shown here.

df['NO2_CAT'].value_counts()

Low     17356
High       95
Name: NO2_CAT, dtype: int64

First, we split the data into training and testing datasets.Obviously we remove NO$_{2}$ values from the independent variables. We also select the most important variables impacting NO$_{2}$.

# Import train_test_split function
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

x=df[['INTENSITY','OCCUPANCY','WIND_SPEED']]
#x=df[['INTENSITY','OCCUPANCY','WIND_SPEED']]
y=df[['NO2_CAT']]

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30,random_state=12,stratify=y)

Let's fit the model.

from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV


pipe=Pipeline([('ss',StandardScaler()),('sgd',SGDClassifier(random_state=42))])

params={'sgd__loss':['hinge','log','modified_huber','squared_hinge','perceptron','squared_loss','huber','epsilon_insensitive','squared_epsilon_insensitive'],
        'sgd__epsilon':[1e-3,1e-2,0.1,1],
        'sgd__fit_intercept':[True,False],
        'sgd__alpha':[0.0001,0.001,0.1,1],
        'sgd__max_iter':[100,1000,10000],
        'sgd__tol':[1e-3,1e-2,0.1,1],
        'sgd__early_stopping':[True,False],
        'sgd__validation_fraction':[0.1,0.3,0.5,0.7],
        'sgd__n_iter_no_change':[1,3,5,10],
        'sgd__average':[None,1,3,5,7,10]     
}
rscv = RandomizedSearchCV(pipe, param_distributions=params,
                                   n_iter=500,scoring='f1_macro',verbose=0,random_state=42)
rscv.fit(X_train, np.array(y_train).ravel())

rscv.best_score_

0.5138898258912509

from sklearn.metrics import classification_report
y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))

              precision    recall  f1-score   support

        High       0.00      0.00      0.00        29
         Low       0.99      1.00      1.00      5207

    accuracy                           0.99      5236
   macro avg       0.50      0.50      0.50      5236
weighted avg       0.99      0.99      0.99      5236

The results are really disappointing: low precission and recall scores. This is because we have a very imbalanced dataset, so the instances with high pollution are ignored. Let's try to fix that. We can use the parameter class_weight in order to make the algorithm take into account the imbalanced nature of the dataset.

from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV


pipe=Pipeline([('ss',StandardScaler()),('sgd',SGDClassifier(random_state=42,class_weight='balanced'))])

rscv = RandomizedSearchCV(pipe, param_distributions=params,
                                   n_iter=500,scoring='f1_macro',verbose=0,random_state=42)
rscv.fit(X_train, np.array(y_train).ravel())

rscv.best_score_

0.5833165625829128

y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))

              precision    recall  f1-score   support

        High       0.12      0.90      0.22        29
         Low       1.00      0.97      0.98      5207

    accuracy                           0.96      5236
   macro avg       0.56      0.93      0.60      5236
weighted avg       0.99      0.96      0.98      5236

from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rscv).ravel()
(tn, fp, fn, tp)

(26, 3, 182, 5025)

Ok, the prediction has improved, the recall is much better which is what we are searching for, on the other hand there are many false negatives, (low points predicted as high points). In this case, it is not bad because if we use this algorithm, we want to be sure that we can predict all the high values, even if there are some false negatives, but we should try to improve the model.

Let's try other methods to deal with imbalanced datasets. Oversampling methods create synthetic instances of the minority class so that the dataset is more balanced. The imblearn library helps us with some algorithms. The SMOTETomek algorithm combines oversampling and undersampling.We only need to balance our training dataset to avoid overfitting our model, so we include the oversampling in our pipeline.

from imblearn.combine  import SMOTETomek

oversample=SMOTETomek()

from sklearn import svm
from imblearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV


pipe=Pipeline([('ss',StandardScaler()),('ov',oversample),('sgd',SGDClassifier(random_state=42))])

rscv = RandomizedSearchCV(pipe, param_distributions=params,
                                   n_iter=500,scoring='f1_macro',verbose=0)
rscv.fit(X_train, np.array(y_train).ravel())

y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))

              precision    recall  f1-score   support

        High       0.07      0.93      0.13        29
         Low       1.00      0.93      0.96      5207

    accuracy                           0.93      5236
   macro avg       0.53      0.93      0.55      5236
weighted avg       0.99      0.93      0.96      5236

tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rscv).ravel()
(tn, fp, fn, tp)

(27, 2, 364, 4843)

Worst results in my opinion, slightly better recall, much worst precision.

We have seen how to download real (open) data from an open data portal and clean and prepare it for analysis. We have also learned the impact of an imbalanced dataset and how we can try to improve our models. We have seen that we need to balance precision and recall depending on our goals. In this case, we may want to predict all cases of high pollution even if we have a high number of false alarms.

	id	intensidad	ocupacion	carga	vmed	periodo_integracion
count	302929.000000	302929.000000	302929.000000	302929.000000	302929.0	302929.000000
mean	6704.972069	323.311528	4.037372	14.713821	0.0	14.773277
std	2426.240416	241.518512	7.532080	12.639165	0.0	1.324177
min	5067.000000	0.000000	0.000000	0.000000	0.0	1.000000
25%	5085.000000	107.000000	1.000000	5.000000	0.0	15.000000
50%	5090.000000	325.000000	2.000000	12.000000	0.0	15.000000
75%	10250.000000	498.000000	4.000000	21.000000	0.0	15.000000
max	10580.000000	4332.000000	92.000000	99.000000	0.0	30.000000

		INTENSITY	OCCUPANCY	LOAD
HOUR	DAY
0	2019-01-01	207.3125	0.6875	7.0000
	2019-01-02	136.6875	0.4375	4.7500
	2019-01-03	165.4375	0.8750	5.7500
	2019-01-04	187.1875	1.3750	7.2500
	2019-01-05	235.1250	1.1250	8.3125

	DAY	INTENSITY	OCCUPANCY	LOAD
0	2019-01-01	207.3125	0.6875	7.0000
1	2019-01-02	136.6875	0.4375	4.7500
2	2019-01-03	165.4375	0.8750	5.7500
3	2019-01-04	187.1875	1.3750	7.2500
4	2019-01-05	235.1250	1.1250	8.3125

	PUNTO_MUESTREO	ANO	MES	DIA	H01	V01	H02	V02	H03	V03	H04	V04
0	28079004_1_38	2019	1	1	23.0	V	17.0	V	19.0	V	14.0	V
1	28079004_1_38	2019	1	2	15.0	V	14.0	V	12.0	V	11.0	V
2	28079004_1_38	2019	1	3	14.0	V	13.0	V	11.0	V	11.0	V
3	28079004_1_38	2019	1	4	17.0	V	14.0	V	12.0	V	11.0	V
4	28079004_1_38	2019	1	5	15.0	V	14.0	V	13.0	V	12.0	V

	HOUR	DAY	INTENSITY	OCCUPANCY	LOAD
0	H01	2019-01-01	207.3125	0.6875	7.0000
1	H01	2019-01-02	136.6875	0.4375	4.7500
2	H01	2019-01-03	165.4375	0.8750	5.7500
3	H01	2019-01-04	187.1875	1.3750	7.2500
4	H01	2019-01-05	235.1250	1.1250	8.3125

Comparing Data Augmentation Techniques to Deal with an Unbalanced Dataset (Pollution Levels Predictions)

Predicting NO₂ levels in Madrid

Data acquisition and cleaning¶

A little bit of data exploration.¶

Predictive analysis.¶

	id	INTENSITY	OCCUPANCY	LOAD	HOUR	DAY
0	5067	124	0.0	5	0	2019-01-01
1	5067	110	0.0	4	0	2019-01-01
2	5067	336	1.0	11	0	2019-01-01
3	5067	435	1.0	16	0	2019-01-01
4	5067	687	3.0	26	1	2019-01-01

	PROVINCIA	MUNICIPIO	ANO	MES	DIA	MAGNITUD	Hour	NO2	DATE
0	28	79	2019	1	1	8	H01	73.0	2019-01-01
1	28	79	2019	1	2	8	H01	69.0	2019-01-02
2	28	79	2019	1	3	8	H01	91.0	2019-01-03
3	28	79	2019	1	4	8	H01	78.0	2019-01-04
4	28	79	2019	1	5	8	H01	77.0	2019-01-05

	INTENSITY	OCCUPANCY	LOAD	NO2
count	18195.00000	18195.000000	18195.000000	18147.000000
mean	325.84632	4.001556	14.575198	47.140629
std	195.73730	4.709824	9.119553	33.169010
min	0.00000	0.000000	0.000000	2.000000
25%	137.43750	0.750000	5.687500	23.000000
50%	342.43750	2.857143	15.250000	40.000000
75%	493.87500	5.312500	21.250000	62.000000
max	2376.87500	67.687500	73.500000	328.000000

	PUNTO_MUESTREO	ANO	MES	DIA	H01	V01	H02	V02
0	28079102_81_98	2021	1	1	0.98	V	1.40	V
1	28079102_81_98	2021	1	2	1.18	V	1.65	V
2	28079102_81_98	2021	1	3	0.62	V	0.55	V
3	28079102_81_98	2021	1	4	1.93	V	1.82	V
4	28079102_81_98	2021	1	5	1.62	V	0.78	V

	PROVINCIA	MUNICIPIO	ANO	MES	DIA	MAGNITUD	Hour	VALUE
0	28	79	2021	1	1	81	H01	0.97
1	28	79	2021	1	2	81	H01	1.29
2	28	79	2021	1	3	81	H01	0.57
3	28	79	2021	1	4	81	H01	0.77
4	28	79	2021	1	5	81	H01	0.67

MAGNITUD	PROVINCIA	MUNICIPIO	ANO	MES	DIA	Hour	81	82	83	86	87
0	28	79	2019	1	1	H01	0.68	48.0	2.6	71.0	959.0
1	28	79	2019	1	1	H02	0.69	32.0	2.5	71.0	959.0
2	28	79	2019	1	1	H03	0.70	43.0	1.7	74.0	959.0
3	28	79	2019	1	1	H04	0.70	74.0	1.1	75.0	959.0
4	28	79	2019	1	1	H05	0.66	67.0	1.2	74.0	959.0

	HOUR	DAY	INTENSITY	OCCUPANCY	LOAD	NO2	WIND_SPEED	WIND_DIR	TEMPERATURE	REL_HUMIDITY	BAR_PRESSURE	PRECIPITATION
179	H01	2019-06-29	325.9375	2.2500	12.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
180	H01	2019-06-30	354.9375	2.2500	12.7500	NaN	NaN	NaN	NaN	NaN	NaN	NaN
341	H01	2019-12-08	305.1250	2.4375	11.1875	67.0	0.58	NaN	6.5	89.0	954.0	0.0
342	H01	2019-12-09	171.8125	0.6875	6.0625	54.0	0.62	NaN	7.1	97.0	955.0	0.0
343	H01	2019-12-10	148.5625	1.0625	5.5000	45.0	0.71	NaN	6.5	97.0	959.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...
17824	H24	2020-01-23	262.5625	3.3125	11.1875	48.0	1.02	NaN	8.2	77.0	949.0	0.0
17825	H24	2020-01-24	299.2500	2.0000	11.3125	79.0	1.39	NaN	8.0	84.0	948.0	0.0
17826	H24	2020-01-25	339.5625	1.5000	12.3125	67.0	0.67	NaN	7.8	94.0	947.0	0.0
17827	H24	2020-01-26	275.2500	1.6250	10.0000	69.0	0.66	NaN	5.6	86.0	953.0	0.0
17828	H24	2020-01-27	177.1875	0.6250	6.6250	17.0	1.55	NaN	8.4	93.0	951.0	0.0

	INTENSITY	OCCUPANCY	LOAD	NO2	WIND_SPEED	WIND_DIR	TEMPERATURE	REL_HUMIDITY	BAR_PRESSURE	PRECIPITATION
count	17451.000000	17451.000000	17451.000000	17451.000000	17451.000000	17451.000000	17451.000000	17451.000000	17451.000000	17451.000000
mean	323.342970	3.984568	14.545641	46.703799	1.190489	153.019491	16.488872	55.012750	946.223540	0.908945
std	195.728403	4.709614	9.146820	33.212783	0.606131	86.354229	9.127438	22.945937	27.372493	28.671773
min	0.000000	0.000000	0.000000	2.000000	0.000000	0.000000	-55.000000	-25.000000	30.000000	0.000000
25%	134.968750	0.750000	5.625000	23.000000	0.690000	72.000000	9.900000	36.000000	943.000000	0.000000
50%	339.312500	2.812500	15.187500	40.000000	1.060000	160.000000	15.400000	53.000000	947.000000	0.000000
75%	489.604167	5.312500	21.250000	61.000000	1.520000	225.000000	22.900000	75.000000	951.000000	0.000000
max	2376.875000	67.687500	73.500000	328.000000	5.040000	359.000000	246.000000	100.000000	963.000000	958.000000

	INTENSITY	OCCUPANCY	LOAD	NO2	WIND_SPEED	WIND_DIR	TEMPERATURE	REL_HUMIDITY	BAR_PRESSURE	PRECIPITATION
count	17356.000000	17356.000000	17356.000000	17356.000000	17356.000000	17356.000000	17356.000000	17356.000000	17356.000000	17356.00000
mean	322.238115	3.952679	14.493762	45.804448	1.194040	152.678909	16.497620	55.067614	946.191461	0.91392
std	195.589285	4.661698	9.125898	30.924149	0.605866	86.299012	9.134572	22.966065	27.441345	28.75006
min	0.000000	0.000000	0.000000	2.000000	0.000000	0.000000	-55.000000	-25.000000	30.000000	0.00000
25%	133.859375	0.750000	5.562500	23.000000	0.700000	71.000000	9.900000	36.000000	943.000000	0.00000
50%	337.606250	2.800000	15.062500	40.000000	1.060000	158.000000	15.400000	53.000000	947.000000	0.00000
75%	487.515625	5.300000	21.200000	61.000000	1.520000	225.000000	22.900000	75.000000	951.000000	0.00000
max	2376.875000	67.687500	73.500000	180.000000	5.040000	359.000000	246.000000	100.000000	963.000000	958.00000

	INTENSITY	OCCUPANCY	LOAD	NO2	WIND_SPEED	WIND_DIR	TEMPERATURE	REL_HUMIDITY	BAR_PRESSURE	PRECIPITATION
count	95.000000	95.000000	95.000000	95.000000	95.000000	95.000000	95.000000	95.000000	95.000000	95.0
mean	525.194079	9.810526	24.023684	211.010526	0.541789	215.242105	14.890526	44.989474	952.084211	0.0
std	85.822097	8.415996	7.990609	27.970157	0.054869	73.227381	7.586131	16.106358	5.052202	0.0
min	231.500000	1.250000	8.437500	181.000000	0.500000	19.000000	1.300000	18.000000	938.000000	0.0
25%	496.656250	4.000000	18.656250	193.500000	0.510000	223.000000	9.200000	32.000000	948.000000	0.0
50%	538.125000	7.000000	22.312500	202.000000	0.520000	233.000000	13.100000	42.000000	953.000000	0.0
75%	583.250000	12.656250	28.343750	221.500000	0.560000	250.000000	20.000000	57.500000	957.000000	0.0
max	728.375000	46.125000	54.937500	328.000000	0.790000	323.000000	33.700000	86.000000	960.000000	0.0

Predicting NO2 levels in Madrid

Data acquisition and cleaning¶

A little bit of data exploration.¶

Predictive analysis.¶

Predicting NO₂ levels in Madrid