In this post we use three clustering methods (kmeans, hierarchical clustering and model based clustering) to evaluate their accuracy.
We see how to select the optimal number of clusters in each method and obtain metrics to select the best of them.
For this exercise I have selected the dataset "seeds" which can be found in the repository https://archive.ics.uci.edu/ml/datasets/seeds it is a dataset with the characteristics of 3 types of wheat seeds.
The fields are the area of the seed, its perimeter, its compactness, its kernel length and width, the kernel asymmetry coefficient and the kernel groove length.
Reading and transforming data
We start by reading the dataset, in this case the file is a txt which is separated by spaces, we read with the appropriate function.
data<-read.table("./Dataset/seeds_dataset.txt",sep = "" , header = F)
str(data)
## 'data.frame': 210 obs. of 8 variables:
## $ V1: num 15.3 14.9 14.3 13.8 16.1 ...
## $ V2: num 14.8 14.6 14.1 13.9 15 ...
## $ V3: num 0.871 0.881 0.905 0.895 0.903 ...
## $ V4: num 5.76 5.55 5.29 5.32 5.66 ...
## $ V5: num 3.31 3.33 3.34 3.38 3.56 ...
## $ V6: num 2.22 1.02 2.7 2.26 1.35 ...
## $ V7: num 5.22 4.96 4.83 4.8 5.17 ...
## $ V8: int 1 1 1 1 1 1 1 1 1 1 ...
It seems that the file has been read correctly, we include the column names.
names(data)<-c('Area','Perimeter','Compactness','length_of_kernel','width_of_kernel','asymmetry_coefficient','lenght_of_kernel_groove','class')
str(data)
## 'data.frame': 210 obs. of 8 variables:
## $ Area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ Perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ Compactness : num 0.871 0.881 0.905 0.895 0.903 ...
## $ length_of_kernel : num 5.76 5.55 5.29 5.32 5.66 ...
## $ width_of_kernel : num 3.31 3.33 3.34 3.38 3.56 ...
## $ asymmetry_coefficient : num 2.22 1.02 2.7 2.26 1.35 ...
## $ lenght_of_kernel_groove: num 5.22 4.96 4.83 4.8 5.17 ...
## $ class : int 1 1 1 1 1 1 1 1 1 1 ...
We delete the class variable to adapt the dataset to unsupervised methods.
We save the original dataset first.
dataorig<-data
data$class<-NULL
str(data)
## 'data.frame': 210 obs. of 7 variables:
## $ Area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ Perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ Compactness : num 0.871 0.881 0.905 0.895 0.903 ...
## $ length_of_kernel : num 5.76 5.55 5.29 5.32 5.66 ...
## $ width_of_kernel : num 3.31 3.33 3.34 3.38 3.56 ...
## $ asymmetry_coefficient : num 2.22 1.02 2.7 2.26 1.35 ...
## $ lenght_of_kernel_groove: num 5.22 4.96 4.83 4.8 5.17 ...
All data are numeric, which is correct. Let's look for missing values (NA or blanks).
colSums(is.na(data))
## Area Perimeter Compactness
## 0 0 0
## length_of_kernel width_of_kernel asymmetry_coefficient
## 0 0 0
## lenght_of_kernel_groove
## 0
colSums(data=="")
## Area Perimeter Compactness
## 0 0 0
## length_of_kernel width_of_kernel asymmetry_coefficient
## 0 0 0
## lenght_of_kernel_groove
## 0
We observe that there are no null or empty values, we obtain a summary of the dataset.
summary(data)
## Area Perimeter Compactness length_of_kernel
## Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899
## 1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262
## Median :14.36 Median :14.32 Median :0.8734 Median :5.524
## Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629
## 3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675
## width_of_kernel asymmetry_coefficient lenght_of_kernel_groove
## Min. :2.630 Min. :0.7651 Min. :4.519
## 1st Qu.:2.944 1st Qu.:2.5615 1st Qu.:5.045
## Median :3.237 Median :3.5990 Median :5.223
## Mean :3.259 Mean :3.7002 Mean :5.408
## 3rd Qu.:3.562 3rd Qu.:4.7687 3rd Qu.:5.877
## Max. :4.033 Max. :8.4560 Max. :6.550
Let's standardize the data to improve the results
data<-as.data.frame(scale(data))
str(data)
## 'data.frame': 210 obs. of 7 variables:
## $ Area : num 0.1418 0.0112 -0.1916 -0.3463 0.4442 ...
## $ Perimeter : num 0.2149 0.0082 -0.3593 -0.4742 0.3298 ...
## $ Compactness : num 6.05e-05 4.27e-01 1.44 1.04 1.37 ...
## $ length_of_kernel : num 0.3035 -0.1682 -0.7618 -0.6873 0.0665 ...
## $ width_of_kernel : num 0.141 0.197 0.208 0.319 0.803 ...
## $ asymmetry_coefficient : num -0.984 -1.784 -0.666 -0.959 -1.56 ...
## $ lenght_of_kernel_groove: num -0.383 -0.92 -1.186 -1.227 -0.474 ...
We obtain a correlation matrix of the variables, we observe that area and perimeter have an almost perfect correlation, also according to the definition of the dataset, the variable Compactness is found from them (C = 4piA/P^2,), so we can delete the variables area and perimeter since the information is in compactness.
library(corrplot)
corrplot(cor(data), type = "upper", method = "number", tl.cex = 0.9)
Delete the variables
data$Area<-NULL
data$Perimeter<-NULL
It seems that everything ok, we can start with the analysis process.
Clustering with K-Means
We obtain the mean silhouettes plot, which measures the separation distance between clusters (how close each point of a cluster is to points of neighboring clusters).
plot(2:10,resultados[2:10],type="o",col="blue",pch=0,xlab="Numero de clusters",ylab="Silueta")
It is see that the best number of clusters is 2.
We test with the best model, obtaining the sum of squares to define the intra-cluster variance.
d<-daisy(data)
resultados <- rep(0, 10)
for (i in c(2,3,4,5,6,7,8,9,10))
{
fit <- kmeans(data, i)
y_cluster <- fit$cluster
sk <- silhouette(y_cluster, d)
resultados[i] <- fit$tot.withinss
}
plot(2:10,resultados[2:10],type="o",col="blue",pch=0,xlab="Número de clusters",ylab="tot.tot.withinss")
In this case we can say that the most appropriate value is 3 which is where the bend of the curve is located.
We continue testing, with the function kmeansruns with the criteria asw (mean silhouette) and ch (Calinski-Harabasz), which estimates the ratio between the distance between clusters and the intra-cluster distance, the higher the ratio, the better the model.
library(fpc)
fit_ch <- kmeansruns(data, krange = 1:10, criterion = "ch")
fit_asw <- kmeansruns(data, krange = 1:10, criterion = "asw")
We obtain the 2 values
.fit_ch$bestk
## [1] 3
fit_asw$bestk
## [1] 2
We see that the values are 3 Calinski-Harabasz for and 2 for the average silhouette, shown in the graphs.
plot(1:10,fit_asw$crit,type="o",col="blue",pch=0,xlab="Número de clústers",ylab="Criterio silueta media")
plot(1:10,fit_ch$crit,type="o",col="blue",pch=0,xlab="Número de clústers",ylab="Criterio Calinski-Harabasz")
Let's now see how k-means classifies with 3 classes, which are the classes into which it divides the original dataset.
We show the classification by pairs of variables.
kmeansclusters <- kmeans(data, 3)
# Kernel length and width
plot(data[c(2,3)], col=kmeansclusters$cluster)
As we can see the classification according to the length and width attributes shows the 3 differentiated groups, it seems a good classification criterion, although two groups appear somewhat mixed. We can also see it in the following graph where it shows the classes in original dataset.
plot(data[c(2,3)], col=dataorig$class)
Let's see the case of the coefficient of asymmetry with core length.
plot(data[c(2,4)], col=kmeansclusters$cluster)
plot(data[c(2,4)], col=dataorig$class)
We see the percentage of success.
table(kmeansclusters$cluster,dataorig$class)
##
## 1 2 3
## 1 6 0 62
## 2 3 68 0
## 3 61 2 8
Although the assigned group number is not the same, we the combinations that have the most data to be correct. We can see that we have a 90% correctness rate.
porc=100-19*100/188
porc
## [1] 89.89362
To evaluate the model further we will obtain the quality with the Silhouette parameter that determines the cohesion of the clusters, measuring the similarity of each data with those of its cluster compared to the data of other clusters. We observe that there are no negative values, indicating that the clusters are well cohesive.
We will now inspect the clusters.
Let's also inspect the silhouette plots, we see few values below zero, which indicates good quality in the model.The mean silhouette value is also a good indicator, we will compare it with the rest of the models.
library(factoextra)
sil<-silhouette(kmeansclusters$cluster,daisy(data))
fviz_silhouette(sil,label=FALSE,print.summary=FALSE)