Dealing with a real-world imbalanced Dataset.

Posted by

Predicting NO2 levels in Madrid

While looking for data to develop my data science skills, I came up with the idea of searching open data portals. I wanted to look at actual datasets and find out what they were like. For this purpose, I chose open data from the Madrid Open Data Portal (https://datos.madrid.es/portal/site/egob). I will try to predict NO2 concentration using weather and traffic data.

This is not meant to be a definitive prediction algorithm, as I am only using a small dataset with specific geographic and temporal points, but just a small exercise with real data to learn how certain machine learning algorithms can work. After I prepared the data, I realized that I had encountered an unbalanced data set, so I tried to deal with that.

For this exercise, I obtained, cleaned, and merged three different datasets to create a model. As you will read later, the pollution dataset is really unbalanced because there are few cases with high NO2 concentrations. I will try to fix this using the subsampling technique.

The datasets I have chosen contain traffic, weather, and pollution data. I am trying to predict the pollution data (NO2) from traffic and weather data. The data ranges from January 2019 to January 2021, and since the weather data starts in 2019, earlier data is not available.

To shorten the datasets, I filtered the data and select only data from a specific area in Madrid. I select data from sensors in this area (Plaza Elíptica) and choose one pollution and weather sensor as well as 5 traffic sensors nearby.

Madrid_Pollution_Prediction_traffic

Data acquisition and cleaning

Let's start importing the required libraries.

In [152]:
import pandas as pd
import glob
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

Mounting the google drive unit where the data was uploaded.

In [153]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Retrieving traffic data,simultaneously filtering the sensors we are interested in at the same time. Since the files are large I read them piecemeal to avoid memory errors. This has been executed several times in order to load all the files to google drive. The I save and load the dataset to pkl file to make faster subsequent executions.

In [154]:
#Import pollution data from January 2019. All files in the folder


#path = r'/content/drive/MyDrive/Traffic/'
#all_csv = glob.glob(path + "/*.csv")

#traffic_df=pd.DataFrame()
#id_list = [5067,5085,5090,10580,10250] #Selected traffic sensors.
#for filename in all_csv:
#    iter_csv = pd.read_csv(filename, index_col=None, header=0,sep=';', iterator=True, chunksize=10000)
#    aux_df = pd.concat((chunk for chunk in iter_csv), axis=0, ignore_index=True)
#    traffic_df=traffic_df.append(aux_df[aux_df['id'].isin(id_list)],ignore_index=True)
In [155]:
#save dataframe to load it faster in other executions.
#traffic_df.to_pickle(r'/content/drive/MyDrive/Traffic/traffic_df.pkl')
In [156]:
traffic_df=pd.read_pickle(r'/content/drive/MyDrive/Traffic/traffic_df.pkl')
In [157]:
#traffic_df=traffic_df.loc[traffic_df['id'].isin([5067,5085,5090])]

Let's check what we've got.

In [158]:
traffic_df.describe()
Out[158]:
id intensidad ocupacion carga vmed periodo_integracion
count 302929.000000 302929.000000 302929.000000 302929.000000 302929.0 302929.000000
mean 6704.972069 323.311528 4.037372 14.713821 0.0 14.773277
std 2426.240416 241.518512 7.532080 12.639165 0.0 1.324177
min 5067.000000 0.000000 0.000000 0.000000 0.0 1.000000
25% 5085.000000 107.000000 1.000000 5.000000 0.0 15.000000
50% 5090.000000 325.000000 2.000000 12.000000 0.0 15.000000
75% 10250.000000 498.000000 4.000000 21.000000 0.0 15.000000
max 10580.000000 4332.000000 92.000000 99.000000 0.0 30.000000

As we can see, the column vmed (average speed) is always 0 because the selected sensors do not measure traffic speed, so we will delete this column, also the column periodo_integracion can be deleted because it is always 15 minutes, which is the period of sensor measurements in each row. The tipo_elem column also always has the same value, it contains the sensor type, so it is of no interest to us. I will also translate the rest of the columns into English for better understanding.

In [159]:
traffic_df.drop(['tipo_elem','vmed','periodo_integracion'], inplace=True,axis=1)
traffic_df.rename(columns={'fecha':'DATE','intensidad':'INTENSITY','ocupacion':'OCCUPANCY','carga':'LOAD'},inplace=True)

According to the data definition found on the same website, an E or an S in the error column means that the readings are not valid, which we should check.

In [160]:
traffic_df[(traffic_df['error']=='E') | (traffic_df['error']=='S')]
Out[160]:
id DATE INTENSITY OCCUPANCY LOAD error

The data seems to be in order, let's proceed with the preparation of our data set. Since we have 5 traffic sensors, we want to use the hourly average of all the readings from the sensors, so we will get the hour and day from the date and calculate the average from those columns. Then we get rid of the columns DATE and error.

In [161]:
traffic_df["HOUR"] = pd.DatetimeIndex(traffic_df.DATE).hour
traffic_df["DAY"] = pd.DatetimeIndex(traffic_df.DATE).date
traffic_df=traffic_df.drop(['DATE','error'],axis=1)
In [162]:
traffic_df.head()
Out[162]:
id INTENSITY OCCUPANCY LOAD HOUR DAY
0 5067 124 0.0 5 0 2019-01-01
1 5067 110 0.0 4 0 2019-01-01
2 5067 336 1.0 11 0 2019-01-01
3 5067 435 1.0 16 0 2019-01-01
4 5067 687 3.0 26 1 2019-01-01
In [162]:

We proceed to group the data and average all the sensors.

In [163]:
traffic_grouped_df=traffic_df.groupby(['HOUR','DAY']).mean()
traffic_grouped_df=traffic_grouped_df.drop(['id'],axis=1)
traffic_grouped_df.head()
Out[163]:
INTENSITY OCCUPANCY LOAD
HOUR DAY
0 2019-01-01 207.3125 0.6875 7.0000
2019-01-02 136.6875 0.4375 4.7500
2019-01-03 165.4375 0.8750 5.7500
2019-01-04 187.1875 1.3750 7.2500
2019-01-05 235.1250 1.1250 8.3125

We need to reset the index to ungroup the data once we have the mean.

In [164]:
traffic_grouped_df=traffic_grouped_df.reset_index()
In [165]:
traffic_grouped_df.head()
Out[165]:
HOUR DAY INTENSITY OCCUPANCY LOAD
0 0 2019-01-01 207.3125 0.6875 7.0000
1 0 2019-01-02 136.6875 0.4375 4.7500
2 0 2019-01-03 165.4375 0.8750 5.7500
3 0 2019-01-04 187.1875 1.3750 7.2500
4 0 2019-01-05 235.1250 1.1250 8.3125

All right, we have the traffic data set where we want it. Let's move on to the pollution data set. Let's import the data.

In [166]:
#Import pollution data from January 2019. All files in the folder

#path = r'/content/drive/MyDrive/Pollution/'
#all_csv = glob.glob(path + "/*.csv")

#li = []

#for filename in all_csv:
#    df = pd.read_csv(filename, index_col=None, header=0,sep=';')
#    li.append(df)

#pollution_df = pd.concat(li, axis=0, ignore_index=True)
In [167]:
#pollution_df.to_pickle(r'/content/drive/MyDrive/Traffic/pollution_df.pkl')
In [168]:
pollution_df = pd.read_pickle(r'/content/drive/MyDrive/Traffic/pollution_df.pkl')
In [169]:
pollution_df.iloc[:,4:16].head()
Out[169]:
PUNTO_MUESTREO ANO MES DIA H01 V01 H02 V02 H03 V03 H04 V04
0 28079004_1_38 2019 1 1 23.0 V 17.0 V 19.0 V 14.0 V
1 28079004_1_38 2019 1 2 15.0 V 14.0 V 12.0 V 11.0 V
2 28079004_1_38 2019 1 3 14.0 V 13.0 V 11.0 V 11.0 V
3 28079004_1_38 2019 1 4 17.0 V 14.0 V 12.0 V 11.0 V
4 28079004_1_38 2019 1 5 15.0 V 14.0 V 13.0 V 12.0 V

This dataset is structured differently, containing the hourly data in columns. Let's start by filtering the data and picking out the sensor we are interested in.(Some columns are hidden for clarity)

In [170]:
pollution_df=pollution_df[pollution_df['PUNTO_MUESTREO'].str.contains('28079056')]

Also, we want to filter the magnitude we are interested in (NO2).

In [171]:
pollution_df=pollution_df.loc[pollution_df['MAGNITUD'] == 8]
In [172]:
#Get the Vxx columns
v_cols = [col for col in pollution_df.columns if col.startswith('V')]
print(v_cols)
['V01', 'V02', 'V03', 'V04', 'V05', 'V06', 'V07', 'V08', 'V09', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24']

According to the data specification, columns Vxx contain the validity of the measured values. If a column contains an N, the reading is invalid. Let's check how many we have.

In [173]:
#Let's count the V and N values
pollution_df[v_cols].stack().value_counts()
Out[173]:
V    18050
N      190
dtype: int64

We have some invalid readings. To eliminate them, I will assign the previous day's value at the same hour. To do this, we create a function and apply it.

In [174]:
#Create a function to set the previous day value
def setPrevious(vcol,df):
    #Get the colname with the hour
    hcolname=vcol.name.replace('V','H')
    #Set the previous values
    df[hcolname]=np.where(df[vcol.name] == 'N',df[hcolname].shift(1), df[hcolname])
In [175]:
#Apply function to set the values. Use _ (drop away variable) to avoid output.
_ =pollution_df.apply(lambda x: setPrevious(x,pollution_df) if x.name.startswith('V') else x)

Now we can drop the V columns.

In [176]:
pollution_df=pollution_df.drop(v_cols,axis=1)

We need to have the NO$_{2}$ readings in one column to be able to merge with the traffic dataset. Let's get the Hxx columns, and melt the data.

In [177]:
#Get the Hxx columns
h_cols = [col for col in pollution_df.columns if col.startswith('H')]
print(h_cols)
['H01', 'H02', 'H03', 'H04', 'H05', 'H06', 'H07', 'H08', 'H09', 'H10', 'H11', 'H12', 'H13', 'H14', 'H15', 'H16', 'H17', 'H18', 'H19', 'H20', 'H21', 'H22', 'H23', 'H24']
In [178]:
pollution_df=pd.melt(pollution_df, id_vars=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','MAGNITUD'], value_vars=h_cols,var_name='Hour', value_name='NO2')

Good. We need a column DATE in the same format as in the traffic record. Also the hour (0 to 23)

In [179]:
pollution_df['DATE'] = pd.to_datetime(pollution_df['ANO'].astype(str)+'/'+pollution_df['MES'].astype(str)+'/'+pollution_df['DIA'].astype(str),dayfirst=True)
In [180]:
pollution_df.head()
Out[180]:
PROVINCIA MUNICIPIO ANO MES DIA MAGNITUD Hour NO2 DATE
0 28 79 2019 1 1 8 H01 73.0 2019-01-01
1 28 79 2019 1 2 8 H01 69.0 2019-01-02
2 28 79 2019 1 3 8 H01 91.0 2019-01-03
3 28 79 2019 1 4 8 H01 78.0 2019-01-04
4 28 79 2019 1 5 8 H01 77.0 2019-01-05

Now we need to convert the format of the hour column in our traffic data to match the format in the pollution data set.

In [181]:
traffic_grouped_df['HOUR']=traffic_grouped_df['HOUR'].apply(lambda x: 'H'+str(x+1).zfill(2))
In [182]:
traffic_grouped_df.head()
Out[182]:
HOUR DAY INTENSITY OCCUPANCY LOAD
0 H01 2019-01-01 207.3125 0.6875 7.0000
1 H01 2019-01-02 136.6875 0.4375 4.7500
2 H01 2019-01-03 165.4375 0.8750 5.7500
3 H01 2019-01-04 187.1875 1.3750 7.2500
4 H01 2019-01-05 235.1250 1.1250 8.3125

Let's drop some unnecessary columns and change the column names in the pollution dataset.

In [183]:
pollution_df=pollution_df.drop(['ANO','MES','DIA'],axis=1)
pollution_df.rename(columns={'DATE':'DAY','Hour':'HOUR'},inplace=True)
traffic_grouped_df['DAY']=pd.to_datetime(traffic_grouped_df['DAY'])
In [184]:
pollution_df.head()
Out[184]:
PROVINCIA MUNICIPIO MAGNITUD HOUR NO2 DAY
0 28 79 8 H01 73.0 2019-01-01
1 28 79 8 H01 69.0 2019-01-02
2 28 79 8 H01 91.0 2019-01-03
3 28 79 8 H01 78.0 2019-01-04
4 28 79 8 H01 77.0 2019-01-05

Now we can merge the data!. Let's also drop the unnecessary columns.

In [185]:
df=pd.merge(traffic_grouped_df, pollution_df, how="left", on=['DAY','HOUR'])
df=df.drop(['PROVINCIA','MUNICIPIO','MAGNITUD'],axis=1)
In [186]:
df.describe()
Out[186]:
INTENSITY OCCUPANCY LOAD NO2
count 18195.00000 18195.000000 18195.000000 18147.000000
mean 325.84632 4.001556 14.575198 47.140629
std 195.73730 4.709824 9.119553 33.169010
min 0.00000 0.000000 0.000000 2.000000
25% 137.43750 0.750000 5.687500 23.000000
50% 342.43750 2.857143 15.250000 40.000000
75% 493.87500 5.312500 21.250000 62.000000
max 2376.87500 67.687500 73.500000 328.000000

All right, we have got the traffic and pollution data together, we are almost done. Now let's read in the weather data. This data set is similar to the pollution data, so we need to repeat the same data cleaning and transformation.

In [187]:
#Import weather data


path = r'/content/drive/MyDrive/Meteo/'
all_csv = glob.glob(path + "/*.csv")

li = []

for filename in all_csv:
    auxdf = pd.read_csv(filename, index_col=None, header=0,sep=';')
    li.append(auxdf)

meteo_df = pd.concat(li, axis=0, ignore_index=True)
In [188]:
meteo_df.iloc[:,4:12].head()
Out[188]:
PUNTO_MUESTREO ANO MES DIA H01 V01 H02 V02
0 28079102_81_98 2021 1 1 0.98 V 1.40 V
1 28079102_81_98 2021 1 2 1.18 V 1.65 V
2 28079102_81_98 2021 1 3 0.62 V 0.55 V
3 28079102_81_98 2021 1 4 1.93 V 1.82 V
4 28079102_81_98 2021 1 5 1.62 V 0.78 V

Let's get the sensor we are interested in.

In [107]:
meteo_df=meteo_df.loc[meteo_df['PUNTO_MUESTREO'].str.contains('28079056')]

Let's fix the incorrect values and melt the data

In [108]:
#Apply function to set the previous values. Use _ (drop away variable) to avoid output.
_ =meteo_df.apply(lambda x: setPrevious(x,meteo_df) if x.name.startswith('V') else x)
#Get the Vxx columns
v_cols = [col for col in meteo_df.columns if col.startswith('V')]
meteo_df=meteo_df.drop(v_cols,axis=1)
#Get hcols
h_cols = [col for col in meteo_df.columns if col.startswith('H')]
#Melt the data
meteo_df=pd.melt(meteo_df, id_vars=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','MAGNITUD'], value_vars=h_cols,var_name='Hour', value_name='VALUE')
In [109]:
meteo_df.head()
Out[109]:
PROVINCIA MUNICIPIO ANO MES DIA MAGNITUD Hour VALUE
0 28 79 2021 1 1 81 H01 0.97
1 28 79 2021 1 2 81 H01 1.29
2 28 79 2021 1 3 81 H01 0.57
3 28 79 2021 1 4 81 H01 0.77
4 28 79 2021 1 5 81 H01 0.67

Since we want to use multiple measures in this case, we need an extra step. Let's pivot the data to convert the measures into columns.

In [110]:
meteo_df=meteo_df.pivot(index=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','Hour'],columns='MAGNITUD',values='VALUE')
meteo_df.reset_index(inplace=True)
meteo_df.head()
Out[110]:
MAGNITUD PROVINCIA MUNICIPIO ANO MES DIA Hour 81 82 83 86 87 89
0 28 79 2019 1 1 H01 0.68 48.0 2.6 71.0 959.0 0.0
1 28 79 2019 1 1 H02 0.69 32.0 2.5 71.0 959.0 0.0
2 28 79 2019 1 1 H03 0.70 43.0 1.7 74.0 959.0 0.0
3 28 79 2019 1 1 H04 0.70 74.0 1.1 75.0 959.0 0.0
4 28 79 2019 1 1 H05 0.66 67.0 1.2 74.0 959.0 0.0

We rename the columns so that they've proper names instead of codes, as specified in the data specification.

In [111]:
meteo_df.columns=['PROVINCIA','MUNICIPIO','ANO','MES','DIA','Hour','WIND_SPEED','WIND_DIR','TEMPERATURE','REL_HUMIDITY','BAR_PRESSURE','PRECIPITATION']

Now we create a date column in DateTime format.

In [112]:
meteo_df['DATE'] = pd.to_datetime(meteo_df['ANO'].astype(str)+'/'+meteo_df['MES'].astype(str)+'/'+meteo_df['DIA'].astype(str),dayfirst=True)
In [113]:
meteo_df.iloc[:,5:].head()
Out[113]:
Hour WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION DATE
0 H01 0.68 48.0 2.6 71.0 959.0 0.0 2019-01-01
1 H02 0.69 32.0 2.5 71.0 959.0 0.0 2019-01-01
2 H03 0.70 43.0 1.7 74.0 959.0 0.0 2019-01-01
3 H04 0.70 74.0 1.1 75.0 959.0 0.0 2019-01-01
4 H05 0.66 67.0 1.2 74.0 959.0 0.0 2019-01-01

Drop unnecessary columns and change the key column names to prepare the dataset for merging.

In [114]:
meteo_df=meteo_df.drop(['PROVINCIA','MUNICIPIO','ANO','MES','DIA'],axis=1)
meteo_df.rename(columns={'DATE':'DAY','Hour':'HOUR'},inplace=True)
In [115]:
meteo_df.head()
Out[115]:
HOUR WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION DAY
0 H01 0.68 48.0 2.6 71.0 959.0 0.0 2019-01-01
1 H02 0.69 32.0 2.5 71.0 959.0 0.0 2019-01-01
2 H03 0.70 43.0 1.7 74.0 959.0 0.0 2019-01-01
3 H04 0.70 74.0 1.1 75.0 959.0 0.0 2019-01-01
4 H05 0.66 67.0 1.2 74.0 959.0 0.0 2019-01-01

Great. Let's merge!

In [116]:
df=pd.merge(df, meteo_df, how="left", on=['DAY','HOUR'])
In [117]:
df.describe()
Out[117]:
INTENSITY OCCUPANCY LOAD NO2 WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION
count 18195.00000 18195.000000 18195.000000 18147.000000 18147.000000 17451.000000 18147.000000 18147.000000 18147.000000 18147.000000
mean 325.84632 4.001556 14.575198 47.140629 1.188928 153.019491 16.133868 55.755469 946.457321 0.875048
std 195.73730 4.709824 9.119553 33.169010 0.608182 86.354229 9.149526 23.052179 26.884344 28.117067
min 0.00000 0.000000 0.000000 2.000000 0.000000 0.000000 -55.000000 -25.000000 30.000000 0.000000
25% 137.43750 0.750000 5.687500 23.000000 0.690000 72.000000 9.500000 36.000000 943.000000 0.000000
50% 342.43750 2.857143 15.250000 40.000000 1.050000 160.000000 14.900000 55.000000 947.000000 0.000000
75% 493.87500 5.312500 21.250000 62.000000 1.510000 225.000000 22.500000 76.000000 951.000000 0.000000
max 2376.87500 67.687500 73.500000 328.000000 5.040000 359.000000 246.000000 100.000000 963.000000 958.000000

Finally, let's check for NA values.

In [118]:
df.isnull().any()
Out[118]:
HOUR             False
DAY              False
INTENSITY        False
OCCUPANCY        False
LOAD             False
NO2               True
WIND_SPEED        True
WIND_DIR          True
TEMPERATURE       True
REL_HUMIDITY      True
BAR_PRESSURE      True
PRECIPITATION     True
dtype: bool

We have some null values in the column Wind Direction. Let us check how many there are.

In [119]:
df[df['WIND_DIR'].isnull()]
Out[119]:
HOUR DAY INTENSITY OCCUPANCY LOAD NO2 WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION
179 H01 2019-06-29 325.9375 2.2500 12.0000 NaN NaN NaN NaN NaN NaN NaN
180 H01 2019-06-30 354.9375 2.2500 12.7500 NaN NaN NaN NaN NaN NaN NaN
341 H01 2019-12-08 305.1250 2.4375 11.1875 67.0 0.58 NaN 6.5 89.0 954.0 0.0
342 H01 2019-12-09 171.8125 0.6875 6.0625 54.0 0.62 NaN 7.1 97.0 955.0 0.0
343 H01 2019-12-10 148.5625 1.0625 5.5000 45.0 0.71 NaN 6.5 97.0 959.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
17824 H24 2020-01-23 262.5625 3.3125 11.1875 48.0 1.02 NaN 8.2 77.0 949.0 0.0
17825 H24 2020-01-24 299.2500 2.0000 11.3125 79.0 1.39 NaN 8.0 84.0 948.0 0.0
17826 H24 2020-01-25 339.5625 1.5000 12.3125 67.0 0.67 NaN 7.8 94.0 947.0 0.0
17827 H24 2020-01-26 275.2500 1.6250 10.0000 69.0 0.66 NaN 5.6 86.0 953.0 0.0
17828 H24 2020-01-27 177.1875 0.6250 6.6250 17.0 1.55 NaN 8.4 93.0 951.0 0.0

744 rows × 12 columns

We have some rows with null value, since the set is not a large percentage of the data set, we just drop them.

In [120]:
df.dropna(inplace=True)
In [122]:
df.describe()
Out[122]:
INTENSITY OCCUPANCY LOAD NO2 WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION
count 17451.000000 17451.000000 17451.000000 17451.000000 17451.000000 17451.000000 17451.000000 17451.000000 17451.000000 17451.000000
mean 323.342970 3.984568 14.545641 46.703799 1.190489 153.019491 16.488872 55.012750 946.223540 0.908945
std 195.728403 4.709614 9.146820 33.212783 0.606131 86.354229 9.127438 22.945937 27.372493 28.671773
min 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 -55.000000 -25.000000 30.000000 0.000000
25% 134.968750 0.750000 5.625000 23.000000 0.690000 72.000000 9.900000 36.000000 943.000000 0.000000
50% 339.312500 2.812500 15.187500 40.000000 1.060000 160.000000 15.400000 53.000000 947.000000 0.000000
75% 489.604167 5.312500 21.250000 61.000000 1.520000 225.000000 22.900000 75.000000 951.000000 0.000000
max 2376.875000 67.687500 73.500000 328.000000 5.040000 359.000000 246.000000 100.000000 963.000000 958.000000

Right. To conclude this section, I will categorize the NO$_{2}$ data, so that the exercise is clearer and we can build a classification model.

The limit for the hourly level of NO$_{2}$ in the air triggers an alarm in Madrid is 180 µg/m3, so we will use this number to classify the lines into low and high (pollution level)

In [123]:
df['NO2_CAT']=pd.cut(df['NO2'], bins=[0, 181,330], include_lowest=True,right=False,labels=['Low','High'])
In [124]:
df.head()
Out[124]:
HOUR DAY INTENSITY OCCUPANCY LOAD NO2 WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION NO2_CAT
0 H01 2019-01-01 207.3125 0.6875 7.0000 73.0 0.68 48.0 2.6 71.0 959.0 0.0 Low
1 H01 2019-01-02 136.6875 0.4375 4.7500 69.0 0.83 83.0 1.8 77.0 959.0 0.0 Low
2 H01 2019-01-03 165.4375 0.8750 5.7500 91.0 0.63 74.0 2.7 73.0 956.0 0.0 Low
3 H01 2019-01-04 187.1875 1.3750 7.2500 78.0 0.66 83.0 2.3 71.0 958.0 0.0 Low
4 H01 2019-01-05 235.1250 1.1250 8.3125 77.0 0.71 74.0 1.5 74.0 960.0 0.0 Low
In [125]:
df[['NO2_CAT']].value_counts()
Out[125]:
NO2_CAT
Low        17356
High          95
dtype: int64

A little bit of data exploration.

Let's check the data in both categories (High and Low)

In [149]:
df[df['NO2_CAT']=='Low'].describe()
Out[149]:
INTENSITY OCCUPANCY LOAD NO2 WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION
count 17356.000000 17356.000000 17356.000000 17356.000000 17356.000000 17356.000000 17356.000000 17356.000000 17356.000000 17356.00000
mean 322.238115 3.952679 14.493762 45.804448 1.194040 152.678909 16.497620 55.067614 946.191461 0.91392
std 195.589285 4.661698 9.125898 30.924149 0.605866 86.299012 9.134572 22.966065 27.441345 28.75006
min 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 -55.000000 -25.000000 30.000000 0.00000
25% 133.859375 0.750000 5.562500 23.000000 0.700000 71.000000 9.900000 36.000000 943.000000 0.00000
50% 337.606250 2.800000 15.062500 40.000000 1.060000 158.000000 15.400000 53.000000 947.000000 0.00000
75% 487.515625 5.300000 21.200000 61.000000 1.520000 225.000000 22.900000 75.000000 951.000000 0.00000
max 2376.875000 67.687500 73.500000 180.000000 5.040000 359.000000 246.000000 100.000000 963.000000 958.00000
In [150]:
df[df['NO2_CAT']=='High'].describe()
Out[150]:
INTENSITY OCCUPANCY LOAD NO2 WIND_SPEED WIND_DIR TEMPERATURE REL_HUMIDITY BAR_PRESSURE PRECIPITATION
count 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.0
mean 525.194079 9.810526 24.023684 211.010526 0.541789 215.242105 14.890526 44.989474 952.084211 0.0
std 85.822097 8.415996 7.990609 27.970157 0.054869 73.227381 7.586131 16.106358 5.052202 0.0
min 231.500000 1.250000 8.437500 181.000000 0.500000 19.000000 1.300000 18.000000 938.000000 0.0
25% 496.656250 4.000000 18.656250 193.500000 0.510000 223.000000 9.200000 32.000000 948.000000 0.0
50% 538.125000 7.000000 22.312500 202.000000 0.520000 233.000000 13.100000 42.000000 953.000000 0.0
75% 583.250000 12.656250 28.343750 221.500000 0.560000 250.000000 20.000000 57.500000 957.000000 0.0
max 728.375000 46.125000 54.937500 328.000000 0.790000 323.000000 33.700000 86.000000 960.000000 0.0

The traffic-related variables (intensity, occupancy, and load) differ in both categories. The average, standard deviation and quantile of the two categories are far apart. Wind speed also has different values in the two categories, suggesting that these variables may be important in determining NO$_{2}$ values.

Let's check now the correlation among the variables.

In [128]:
correlation=df.corr()
fig, ax = plt.subplots(figsize=(10,10))
sn.heatmap(correlation,annot=True,linewidths=.5, ax=ax)
plt.show()

It seems that the variable most correlated with NO$_{2}$ is wind speed, which is really important. Traffic variables also seem to have an important effect on NO$_{2}$.

Let's look at some scatter plots. As we can see below, the low points aren't separated from the high points, only the INTENSITY -OCCUPANCY graphs show some low points, but mixed with the high ones, there's no clear line separating them.

In [129]:
#scatter plots
data=df[['WIND_SPEED','INTENSITY','OCCUPANCY','NO2_CAT']]
sn.pairplot(data,hue='NO2_CAT')
Out[129]:

Predictive analysis.

In this section, we will use classification methods (SGD Classifier) to try to classify and predict the values of NO$_{2}$ levels using traffic and weather data. As you can see from the number of values, this is an imbalanced data set. Let us take a look at how the algorithm works. The values of some parameters were obtained by parameter optimization of the algorithm, which is not shown here.

In [130]:
df['NO2_CAT'].value_counts()
Out[130]:
Low     17356
High       95
Name: NO2_CAT, dtype: int64

First, we split the data into training and testing datasets.Obviously we remove NO$_{2}$ values from the independent variables. We also select the most important variables impacting NO$_{2}$.

In [131]:
# Import train_test_split function
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

x=df[['INTENSITY','OCCUPANCY','WIND_SPEED']]
#x=df[['INTENSITY','OCCUPANCY','WIND_SPEED']]
y=df[['NO2_CAT']]
In [132]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30,random_state=12,stratify=y)

Let's fit the model.

In [ ]:
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV


pipe=Pipeline([('ss',StandardScaler()),('sgd',SGDClassifier(random_state=42))])

params={'sgd__loss':['hinge','log','modified_huber','squared_hinge','perceptron','squared_loss','huber','epsilon_insensitive','squared_epsilon_insensitive'],
        'sgd__epsilon':[1e-3,1e-2,0.1,1],
        'sgd__fit_intercept':[True,False],
        'sgd__alpha':[0.0001,0.001,0.1,1],
        'sgd__max_iter':[100,1000,10000],
        'sgd__tol':[1e-3,1e-2,0.1,1],
        'sgd__early_stopping':[True,False],
        'sgd__validation_fraction':[0.1,0.3,0.5,0.7],
        'sgd__n_iter_no_change':[1,3,5,10],
        'sgd__average':[None,1,3,5,7,10]     
}
rscv = RandomizedSearchCV(pipe, param_distributions=params,
                                   n_iter=500,scoring='f1_macro',verbose=0,random_state=42)
rscv.fit(X_train, np.array(y_train).ravel())
In [135]:
rscv.best_score_
Out[135]:
0.5138898258912509
In [136]:
from sklearn.metrics import classification_report
y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))
              precision    recall  f1-score   support

        High       0.00      0.00      0.00        29
         Low       0.99      1.00      1.00      5207

    accuracy                           0.99      5236
   macro avg       0.50      0.50      0.50      5236
weighted avg       0.99      0.99      0.99      5236

The results are really disappointing: low precission and recall scores. This is because we have a very imbalanced dataset, so the instances with high pollution are ignored. Let's try to fix that. We can use the parameter class_weight in order to make the algorithm take into account the imbalanced nature of the dataset.

In [ ]:
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV


pipe=Pipeline([('ss',StandardScaler()),('sgd',SGDClassifier(random_state=42,class_weight='balanced'))])

rscv = RandomizedSearchCV(pipe, param_distributions=params,
                                   n_iter=500,scoring='f1_macro',verbose=0,random_state=42)
rscv.fit(X_train, np.array(y_train).ravel())
In [141]:
rscv.best_score_
Out[141]:
0.5833165625829128
In [142]:
y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))
              precision    recall  f1-score   support

        High       0.12      0.90      0.22        29
         Low       1.00      0.97      0.98      5207

    accuracy                           0.96      5236
   macro avg       0.56      0.93      0.60      5236
weighted avg       0.99      0.96      0.98      5236

In [143]:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rscv).ravel()
(tn, fp, fn, tp)
Out[143]:
(26, 3, 182, 5025)

Ok, the prediction has improved, the recall is much better which is what we are searching for, on the other hand there are many false negatives, (low points predicted as high points). In this case, it is not bad because if we use this algorithm, we want to be sure that we can predict all the high values, even if there are some false negatives, but we should try to improve the model.

Let's try other methods to deal with imbalanced datasets. Oversampling methods create synthetic instances of the minority class so that the dataset is more balanced. The imblearn library helps us with some algorithms. The SMOTETomek algorithm combines oversampling and undersampling.We only need to balance our training dataset to avoid overfitting our model, so we include the oversampling in our pipeline.

In [ ]:
from imblearn.combine  import SMOTETomek

oversample=SMOTETomek()
In [ ]:
from sklearn import svm
from imblearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV


pipe=Pipeline([('ss',StandardScaler()),('ov',oversample),('sgd',SGDClassifier(random_state=42))])

rscv = RandomizedSearchCV(pipe, param_distributions=params,
                                   n_iter=500,scoring='f1_macro',verbose=0)
rscv.fit(X_train, np.array(y_train).ravel())
In [147]:
y_pred_rscv = rscv.predict(X_test)
print(classification_report(y_test, y_pred_rscv))
              precision    recall  f1-score   support

        High       0.07      0.93      0.13        29
         Low       1.00      0.93      0.96      5207

    accuracy                           0.93      5236
   macro avg       0.53      0.93      0.55      5236
weighted avg       0.99      0.93      0.96      5236

In [148]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rscv).ravel()
(tn, fp, fn, tp)
Out[148]:
(27, 2, 364, 4843)

Worst results in my opinion, slightly better recall, much worst precision.

We have seen how to download real (open) data from an open data portal and clean and prepare it for analysis. We have also learned the impact of an imbalanced dataset and how we can try to improve our models. We have seen that we need to balance precision and recall depending on our goals. In this case, we may want to predict all cases of high pollution even if we have a high number of false alarms.