Clustering methods in R


In this post we use three clustering methods (k-means, hierarchical clustering and model-based clustering) and evaluate their accuracy.

We will see how to select the optimal number of clusters for each method and obtain metrics to choose the best one.

Data mining: PEC2 - Unsupervised methods

For this exercise I have selected the "seeds" dataset, which can be found in the repository at https://archive.ics.uci.edu/ml/datasets/seeds. It is a dataset with the characteristics of 3 types of wheat seeds.

The fields are the area of the seed, its perimeter, its compactness, its kernel length and width, the kernel asymmetry coefficient and the kernel groove length.

Reading and transforming data

We start by reading the dataset. In this case the file is a txt separated by whitespace, so we read it with the appropriate function.

data<-read.table("./Dataset/seeds_dataset.txt",sep = "" , header = F)


str(data)
## 'data.frame':    210 obs. of  8 variables:
##  $ V1: num  15.3 14.9 14.3 13.8 16.1 ...
##  $ V2: num  14.8 14.6 14.1 13.9 15 ...
##  $ V3: num  0.871 0.881 0.905 0.895 0.903 ...
##  $ V4: num  5.76 5.55 5.29 5.32 5.66 ...
##  $ V5: num  3.31 3.33 3.34 3.38 3.56 ...
##  $ V6: num  2.22 1.02 2.7 2.26 1.35 ...
##  $ V7: num  5.22 4.96 4.83 4.8 5.17 ...
##  $ V8: int  1 1 1 1 1 1 1 1 1 1 ...

The file seems to have been read correctly; we now add the column names.

names(data)<-c('Area','Perimeter','Compactness','length_of_kernel','width_of_kernel','asymmetry_coefficient','lenght_of_kernel_groove','class')
str(data)
## 'data.frame':    210 obs. of  8 variables:
##  $ Area                   : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ Perimeter              : num  14.8 14.6 14.1 13.9 15 ...
##  $ Compactness            : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ length_of_kernel       : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ width_of_kernel        : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ asymmetry_coefficient  : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ lenght_of_kernel_groove: num  5.22 4.96 4.83 4.8 5.17 ...
##  $ class                  : int  1 1 1 1 1 1 1 1 1 1 ...

We delete the class variable to adapt the dataset to unsupervised methods.

We save the original dataset first.

dataorig<-data
data$class<-NULL
str(data)
## 'data.frame':    210 obs. of  7 variables:
##  $ Area                   : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ Perimeter              : num  14.8 14.6 14.1 13.9 15 ...
##  $ Compactness            : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ length_of_kernel       : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ width_of_kernel        : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ asymmetry_coefficient  : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ lenght_of_kernel_groove: num  5.22 4.96 4.83 4.8 5.17 ...

All data are numeric, which is correct. Let's look for missing values (NA or blanks).

colSums(is.na(data))
##                    Area               Perimeter             Compactness 
##                       0                       0                       0 
##        length_of_kernel         width_of_kernel   asymmetry_coefficient 
##                       0                       0                       0 
## lenght_of_kernel_groove 
##                       0
colSums(data=="")
##                    Area               Perimeter             Compactness 
##                       0                       0                       0 
##        length_of_kernel         width_of_kernel   asymmetry_coefficient 
##                       0                       0                       0 
## lenght_of_kernel_groove 
##                       0

We observe that there are no null or empty values. Next, we obtain a summary of the dataset.

summary(data)
##       Area         Perimeter      Compactness     length_of_kernel
##  Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899   
##  1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262   
##  Median :14.36   Median :14.32   Median :0.8734   Median :5.524   
##  Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629   
##  3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980   
##  Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675   
##  width_of_kernel asymmetry_coefficient lenght_of_kernel_groove
##  Min.   :2.630   Min.   :0.7651        Min.   :4.519          
##  1st Qu.:2.944   1st Qu.:2.5615        1st Qu.:5.045          
##  Median :3.237   Median :3.5990        Median :5.223          
##  Mean   :3.259   Mean   :3.7002        Mean   :5.408          
##  3rd Qu.:3.562   3rd Qu.:4.7687        3rd Qu.:5.877          
##  Max.   :4.033   Max.   :8.4560        Max.   :6.550

Let's standardize the data to improve the results.

data<-as.data.frame(scale(data))
str(data)
## 'data.frame':    210 obs. of  7 variables:
##  $ Area                   : num  0.1418 0.0112 -0.1916 -0.3463 0.4442 ...
##  $ Perimeter              : num  0.2149 0.0082 -0.3593 -0.4742 0.3298 ...
##  $ Compactness            : num  6.05e-05 4.27e-01 1.44 1.04 1.37 ...
##  $ length_of_kernel       : num  0.3035 -0.1682 -0.7618 -0.6873 0.0665 ...
##  $ width_of_kernel        : num  0.141 0.197 0.208 0.319 0.803 ...
##  $ asymmetry_coefficient  : num  -0.984 -1.784 -0.666 -0.959 -1.56 ...
##  $ lenght_of_kernel_groove: num  -0.383 -0.92 -1.186 -1.227 -0.474 ...
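
As a quick check that the scaling worked, every column should now have mean 0 and standard deviation 1; a short snippet to verify:

# Column means should be ~0 (up to floating-point noise) and standard deviations exactly 1
round(colMeans(data), 10)
apply(data, 2, sd)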

We obtain the correlation matrix of the variables. We observe that Area and Perimeter have an almost perfect correlation; moreover, according to the dataset description, Compactness is derived from them (C = 4πA/P²), so we can delete Area and Perimeter since their information is already contained in Compactness.

library(corrplot)
corrplot(cor(data), type = "upper", method = "number", tl.cex = 0.9)
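
Before deleting anything, a quick sanity check (a small sketch using the unscaled copy dataorig we saved earlier) confirms the stated relation numerically:

# Verify C = 4*pi*A/P^2 on the original, unscaled data
head(4 * pi * dataorig$Area / dataorig$Perimeter^2)
# Should closely match the stored Compactness values
head(dataorig$Compactness)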

We delete the two redundant variables.

data$Area<-NULL
data$Perimeter<-NULL

Everything seems to be in order, so we can start the analysis process.

Clustering with K-Means

We compute the average silhouette, which measures the separation between clusters (how close each point of a cluster is to the points of neighboring clusters), for each candidate number of clusters and plot it.

library(cluster)
d<-daisy(data)
resultados <- rep(0, 10)
for (i in 2:10)
{
  fit           <- kmeans(data, i)
  y_cluster     <- fit$cluster
  sk            <- silhouette(y_cluster, d)
  resultados[i] <- mean(sk[, 3])
}
plot(2:10,resultados[2:10],type="o",col="blue",pch=0,xlab="Number of clusters",ylab="Average silhouette")

We see that the best number of clusters is 2.

We also test with the elbow method, obtaining the total within-cluster sum of squares, which measures the intra-cluster variance.

resultados <- rep(0, 10)
for (i in 2:10)
{
  fit           <- kmeans(data, i)
  resultados[i] <- fit$tot.withinss
}
plot(2:10,resultados[2:10],type="o",col="blue",pch=0,xlab="Number of clusters",ylab="tot.withinss")

In this case we can say that the most appropriate value is 3, which is where the bend (elbow) of the curve is located.

We continue testing with the kmeansruns function from the fpc package, using the criteria asw (average silhouette width) and ch (Calinski-Harabasz). The Calinski-Harabasz index estimates the ratio between the between-cluster distance and the intra-cluster distance; the higher the ratio, the better the model.

library(fpc)
fit_ch  <- kmeansruns(data, krange = 1:10, criterion = "ch") 
fit_asw <- kmeansruns(data, krange = 1:10, criterion = "asw") 

We obtain the two values:

fit_ch$bestk
## [1] 3
fit_asw$bestk
## [1] 2

We see that the best values are 3 for Calinski-Harabasz and 2 for the average silhouette, as shown in the following graphs.

plot(1:10,fit_asw$crit,type="o",col="blue",pch=0,xlab="Number of clusters",ylab="Average silhouette criterion")

plot(1:10,fit_ch$crit,type="o",col="blue",pch=0,xlab="Number of clusters",ylab="Calinski-Harabasz criterion")

Let's now see how k-means classifies with 3 clusters, which is the number of classes in the original dataset.

We show the classification by pairs of variables.

kmeansclusters <- kmeans(data, 3)

# Kernel length and width
plot(data[c(2,3)], col=kmeansclusters$cluster)

As we can see, the classification according to the length and width attributes shows the 3 differentiated groups and seems a good classification criterion, although two of the groups appear somewhat mixed. We can compare it with the following graph, which shows the classes of the original dataset.

plot(data[c(2,3)], col=dataorig$class)

Let's see the case of the asymmetry coefficient against the kernel length.

plot(data[c(2,4)], col=kmeansclusters$cluster)

plot(data[c(2,4)], col=dataorig$class)
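
Rather than continuing pair by pair, an optional sketch with the base pairs() function draws the whole scatter-plot matrix at once, colored by cluster:

# Scatter-plot matrix of the remaining variables, colored by k-means cluster
pairs(data, col = kmeansclusters$cluster)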

Let's measure the percentage of correct assignments.

table(kmeansclusters$cluster,dataorig$class)
##    
##      1  2  3
##   1  6  0 62
##   2  3 68  0
##   3 61  2  8

Although the cluster numbers assigned do not match the original class labels, we take the combinations that contain the most data as correct; the remaining counts (6 + 3 + 2 + 8 = 19 observations) are misclassified. We can see that we obtain roughly 91% accuracy.

porc=100-19*100/210
porc
## [1] 90.95238
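
As an alternative sketch that avoids hard-coding the error count, the same accuracy can be derived from the confusion table by taking the majority class of each cluster as its label:

# Majority class per cluster, over the total number of observations
conf <- table(kmeansclusters$cluster, dataorig$class)
sum(apply(conf, 1, max)) / sum(conf) * 100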

To evaluate the model further we will measure its quality with the silhouette coefficient, which determines the cohesion of the clusters by comparing the similarity of each observation to its own cluster with its similarity to the other clusters.

Let's inspect the silhouette plot. We see few values below zero, which indicates good quality in the model. The average silhouette value is also a good indicator, and we will compare it with those of the other models.

library(factoextra)
sil<-silhouette(kmeansclusters$cluster,daisy(data))
fviz_silhouette(sil,label=FALSE,print.summary=FALSE)
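
Finally, to have the number at hand for the later comparison, the average silhouette width can be extracted directly from the silhouette object (its third column, sil_width, holds the per-observation widths):

# Mean silhouette width of the 3-cluster k-means solution
mean(sil[, 3])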