UNSUPERVISED LEARNING

K-MEANS CLUSTERING AND DBSCAN

İrem Tanrıverdi
15 min read · Jun 28, 2022

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

K-means clustering, one of the clustering techniques, is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster.

Density-based clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in large amounts of data containing noise and outliers.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score
from kneed import KneeLocator

data = pd.read_csv("data.csv")
data

Description of the data

  • There are 3 variables and 3000 observations in the dataset. All the variables are numeric.
  • To fit machine learning models, the dataset should be complete. As seen from the outputs below, there are no missing observations in the data.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x0      3000 non-null   float64
 1   x1      3000 non-null   float64
 2   y       3000 non-null   int64
dtypes: float64(2), int64(1)
memory usage: 70.4 KB
for columns in data:
    print(columns, "--> # of missing value", data[columns].isna().sum())

x0 --> # of missing value 0
x1 --> # of missing value 0
y --> # of missing value 0
ax = sns.countplot(x="y", data=data)
[Figure: count plot of y]
data['y'].value_counts(normalize=True)

2    0.183333
3    0.183333
4    0.166667
5    0.166667
0    0.150000
1    0.150000
Name: y, dtype: float64

The output above shows the proportions and the frequency plot of the target variable (y). The proportions across the levels of y are fairly close to one another, so the classes are roughly balanced.

K-MEANS CLUSTERING

Let's generate k-means models for k = 2, 3, 4, 5, 6, 7, 8, 9, and 10. After that, we should determine the optimal value of k by looking at the WSS (within-cluster sum of squares) plot and the silhouette scores.

The k-means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. The first step is to randomly select k centroids, where k is the number of clusters you choose; centroids are data points representing the center of a cluster. Each data point is then assigned to its nearest centroid, each centroid is recomputed as the mean of its assigned points, and these two steps repeat until the assignments stop changing.
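As an illustration of that loop, here is a minimal NumPy sketch (the actual analysis below uses sklearn.cluster.KMeans; the helper name kmeans_sketch is ours):

def kmeans_sketch(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # step 1: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned points
        # (this sketch assumes no cluster goes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids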

Data sets usually contain numerical features that have been measured in different units. The values for all features must be transformed to the same scale. Now the data are ready to be clustered.

data_cl = data.drop('y', axis=1)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data_cl)
scaled_features
array([[ 0.86075756, 1.56677603],
[ 1.83659342, 0.98439746],
[ 2.14353576, 1.6081422 ],
...,
[-0.42644192, -0.2772736 ],
[-0.69111598, -0.62017857],
[-0.66321758, -0.64222315]])
  • init controls the initialization technique.
  • n_clusters sets k for the clustering step. This is the most important parameter for k-means.
  • n_init sets the number of initializations to perform. This is important because two runs can converge on different cluster assignments. The default behavior for the scikit-learn algorithm is to perform ten k-means runs and return the results of the one with the lowest SSE.
  • max_iter sets the number of maximum iterations for each initialization of the k-means algorithm.
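
A minimal sketch showing these parameters together (the same settings used in the loops below; inertia_, n_iter_, and cluster_centers_ are standard KMeans attributes):

kmeans = KMeans(init="random", n_clusters=4, n_init=10, max_iter=300, random_state=42)
kmeans.fit(scaled_features)
print(kmeans.inertia_)          # SSE of the best of the 10 runs
print(kmeans.n_iter_)           # iterations the best run needed
print(kmeans.cluster_centers_)  # final centroid coordinates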

Choosing the Appropriate Number of Clusters

The elbow method and silhouette coefficient are often used as complementary evaluation techniques rather than one being preferred over the other.

To perform the elbow method, run k-means several times, incrementing k with each iteration, and record the SSE at each k; the point where the curve bends is the "elbow", a candidate for the optimal k.

The Silhouette Coefficient is a measure of cluster cohesion and separation. It quantifies how well a data point fits into its assigned cluster. Silhouette coefficient values range between -1 and 1. Larger numbers indicate that samples are closer to their clusters than they are to other clusters.
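For reference, the silhouette of a single sample is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster; silhouette_score reports the average of s over all samples.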

Let's conduct a k-means model for each k between 2 and 10 and obtain the silhouette score for each k. Then, choose the optimal value of k as the one with the highest average silhouette score.

for k in range(2, 11):
    kmeans = KMeans(init="random", n_clusters=k, n_init=10, max_iter=300, random_state=42)
    kmeans.fit(scaled_features)
    # Compute the average silhouette score for this k
    kmeans_silhouette = silhouette_score(scaled_features, kmeans.labels_).round(2)
    print("kmeans silhouette score is", kmeans_silhouette, "when k is", k)

kmeans silhouette score is 0.67 when k is 2
kmeans silhouette score is 0.64 when k is 3
kmeans silhouette score is 0.71 when k is 4
kmeans silhouette score is 0.67 when k is 5
kmeans silhouette score is 0.58 when k is 6
kmeans silhouette score is 0.5 when k is 7
kmeans silhouette score is 0.42 when k is 8
kmeans silhouette score is 0.57 when k is 9
kmeans silhouette score is 0.49 when k is 10
# A list holds the silhouette coefficients for each k
silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(init="random", n_clusters=k, n_init=10, max_iter=300, random_state=42)
    kmeans.fit(scaled_features)
    score = silhouette_score(scaled_features, kmeans.labels_)
    silhouette_coefficients.append(score)

plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
[Figure: silhouette coefficient vs. number of clusters]

As seen from the plot, the highest average silhouette score is reached when k = 4.

Next, let's look at the WSS plot for the optimal k.

cluster_wss = []
for k in range(2, 11):
    kmeans = KMeans(init="random", n_clusters=k, n_init=10, max_iter=300, random_state=42)
    kmeans.fit(scaled_features)
    cluster_wss.append(kmeans.inertia_)
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), cluster_wss, marker = 'o')
plt.xticks(range(2, 11))
plt.xlabel('Number of Clusters')
plt.ylabel('WSS')
plt.show()
[Figure: WSS (inertia) vs. number of clusters]

We can see that there is not much decrease in WSS once the number of clusters goes beyond 5, and k = 4 also looks good. Thus, based on the WSS plot and the silhouette scores, the optimal k is chosen as 4.
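
Since kneed was imported above, the elbow can also be located programmatically rather than by eye. A minimal sketch, assuming the cluster_wss list computed above:

# KneeLocator finds the point of maximum curvature in the WSS curve
kl = KneeLocator(list(range(2, 11)), cluster_wss, curve="convex", direction="decreasing")
print("Elbow located at k =", kl.elbow)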

DBSCAN models

Next, let's generate DBSCAN models for epsilon values between 0.1 (included) and 2.5 (included), incrementing in steps of 0.10, and for min_samples = 5, 10, 15, and 20. Then, we should find the optimal parameter values using adjusted Rand scores.

While calculating adjusted rand scores, we will need true labels. Please see

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score

DBSCAN is a density-based clustering algorithm. The key idea is that, for each point of a cluster, the neighbourhood within a given radius (eps) must contain at least a minimum number of points (min_samples). This algorithm has proved extremely efficient in detecting outliers and handling noise.
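
A common heuristic for choosing eps (not used in the analysis below, but worth knowing) is the k-distance plot: sort each point's distance to its k-th nearest neighbor and look for a knee in the curve. A minimal sketch using sklearn.neighbors.NearestNeighbors:

from sklearn.neighbors import NearestNeighbors

# Distance from each point to its 5th neighbor (k is usually set to min_samples;
# note the query point counts as its own first neighbor here)
nn = NearestNeighbors(n_neighbors=5).fit(scaled_features)
distances, _ = nn.kneighbors(scaled_features)
k_dist = np.sort(distances[:, -1])

plt.figure(figsize=(8, 5))
plt.plot(k_dist)
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to 5th nearest neighbor")
plt.show()  # a knee in this curve suggests a reasonable eps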

Our data contains the y variable, which has 6 labels: ["0", "1", "2", "3", "4", "5"]. We will use these labels as our "true label values" in the evaluation (adjusted Rand score).

label_encoder = LabelEncoder()
true_labels = label_encoder.fit_transform(data.y)
true_labels
array([1, 1, 1, ..., 5, 4, 4])

Then, we generate DBSCAN models for each value of epsilon in the range (0.1, 2.5) and each value of min_samples (5, 10, 15, 20). After generating a DBSCAN model for each combination of epsilon and min_samples, we calculate the adjusted Rand score to choose the optimal values of eps and min_samples.

for i in range(5, 25, 5):
    for j in np.arange(0.1, 2.6, 0.10):
        dbscan = DBSCAN(eps=j, min_samples=i)
        dbscan.fit(scaled_features)
        ari_dbscan = adjusted_rand_score(true_labels, dbscan.labels_).round(4)
        print("Adjusted rand score is", ari_dbscan, "when min sample is", i, "and epsilon is", round(j, 1))

Adjusted rand score is 0.9965 when min sample is 5 and epsilon is 0.1
Adjusted rand score is 0.824 when min sample is 5 and epsilon is 0.2
Adjusted rand score is 0.7091 when min sample is 5 and epsilon is 0.3
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 0.4
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 0.5
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 0.6
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 0.7
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 0.8
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 0.9
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 1.0
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 1.1
Adjusted rand score is 0.2545 when min sample is 5 and epsilon is 1.2
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 1.3
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 1.4
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 1.5
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 1.6
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 1.7
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 1.8
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 1.9
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 2.0
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 2.1
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 2.2
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 2.3
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 2.4
Adjusted rand score is 0.0 when min sample is 5 and epsilon is 2.5
Adjusted rand score is 0.9948 when min sample is 10 and epsilon is 0.1
Adjusted rand score is 0.824 when min sample is 10 and epsilon is 0.2
Adjusted rand score is 0.7091 when min sample is 10 and epsilon is 0.3
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 0.4
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 0.5
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 0.6
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 0.7
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 0.8
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 0.9
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 1.0
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 1.1
Adjusted rand score is 0.2545 when min sample is 10 and epsilon is 1.2
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 1.3
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 1.4
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 1.5
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 1.6
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 1.7
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 1.8
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 1.9
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 2.0
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 2.1
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 2.2
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 2.3
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 2.4
Adjusted rand score is 0.0 when min sample is 10 and epsilon is 2.5
Adjusted rand score is 0.9939 when min sample is 15 and epsilon is 0.1
Adjusted rand score is 0.8236 when min sample is 15 and epsilon is 0.2
Adjusted rand score is 0.7091 when min sample is 15 and epsilon is 0.3
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 0.4
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 0.5
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 0.6
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 0.7
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 0.8
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 0.9
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 1.0
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 1.1
Adjusted rand score is 0.2545 when min sample is 15 and epsilon is 1.2
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 1.3
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 1.4
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 1.5
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 1.6
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 1.7
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 1.8
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 1.9
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 2.0
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 2.1
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 2.2
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 2.3
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 2.4
Adjusted rand score is 0.0 when min sample is 15 and epsilon is 2.5
Adjusted rand score is 0.9921 when min sample is 20 and epsilon is 0.1
Adjusted rand score is 0.8236 when min sample is 20 and epsilon is 0.2
Adjusted rand score is 0.7091 when min sample is 20 and epsilon is 0.3
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 0.4
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 0.5
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 0.6
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 0.7
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 0.8
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 0.9
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 1.0
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 1.1
Adjusted rand score is 0.2545 when min sample is 20 and epsilon is 1.2
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 1.3
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 1.4
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 1.5
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 1.6
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 1.7
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 1.8
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 1.9
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 2.0
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 2.1
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 2.2
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 2.3
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 2.4
Adjusted rand score is 0.0 when min sample is 20 and epsilon is 2.5

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

The adjusted Rand values range between -1 and 1. A score close to 0.0 indicates random assignments, and a score close to 1 indicates perfectly labeled clusters.
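
As a quick sanity check on the metric (a toy example, not from our dataset), the ARI is invariant to a relabeling of the clusters:

print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, labels swapped
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5: worse than chance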

When we look at the ARI values, we see that the closest value to 1 is 0.9965. The maximum ARI value is obtained when the epsilon value is 0.1 and the minimum sample count is 5. Thus, the optimal parameters are selected as eps = 0.1 and min_samples = 5.

Finally, let's plot the clusters only for the optimal parameter values, using the predicted clusters from k-means and DBSCAN, and then plot the original data with the true labels. The goal is to see whether the two methods suggest the same number of clusters, whether we can predict the true labels of the data by clustering, and which algorithm gives the better result.

K-means was applied with the optimal k chosen as 4, while DBSCAN was applied with the optimal min_samples chosen as 5 and the optimal eps as 0.1; the two methods suggest different numbers of clusters. Let's fit the models again with their respective optimal values, and then visualize the clusters for each method together with the true labels we used earlier to calculate the ARI.

# k-means with the optimal k = 4
kmeans = KMeans(init="random", n_clusters=4, n_init=10, max_iter=300, random_state=42)
label_kmeans = kmeans.fit_predict(scaled_features)
label_kmeans

array([3, 3, 3, ..., 0, 0, 0], dtype=int32)

# Getting unique labels
u_labels_kmeans = np.unique(label_kmeans)

# plotting k-means:
plt.figure(figsize=(8, 5))
for i in u_labels_kmeans:
    plt.scatter(scaled_features[label_kmeans == i, 0], scaled_features[label_kmeans == i, 1], label=i)
plt.legend()
plt.title("Cluster plot for K-means when optimal k=4")
plt.show()
[Figure: k-means cluster plot, k = 4]
# DBSCAN with the optimal eps = 0.1 and min_samples = 5
dbscan = DBSCAN(eps=0.1, min_samples=5)
label_dbscan = dbscan.fit_predict(scaled_features)
label_dbscan

array([0, 0, 0, ..., 5, 4, 4])

u_labels_dbscan = np.unique(label_dbscan)

# plotting DBSCAN:
plt.figure(figsize=(8, 5))
for i in u_labels_dbscan:
    plt.scatter(scaled_features[label_dbscan == i, 0], scaled_features[label_dbscan == i, 1], label=i)
plt.legend()
plt.title("Cluster plot for DBSCAN when optimal min_samples=5")
plt.show()
[Figure: DBSCAN cluster plot, eps = 0.1, min_samples = 5]
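
In the DBSCAN plot, the points labeled "-1" are noise rather than a real cluster. As a quick check (not part of the original analysis), we can count how many points DBSCAN left unclustered:

n_noise = int(np.sum(label_dbscan == -1))
print("DBSCAN noise points:", n_noise, "out of", len(label_dbscan))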
u_labels_true = np.unique(true_labels)

# plotting true labels:
plt.figure(figsize=(8, 5))
for i in u_labels_true:
    plt.scatter(scaled_features[true_labels == i, 0], scaled_features[true_labels == i, 1], label=i)
plt.legend()
plt.title("Cluster plot with true labels")
plt.show()
[Figure: scatter plot colored by true labels]

There are slight differences among these three cluster plots. The number of labels in the plots differs: 4 for k-means, 7 for DBSCAN, and 6 for the true labels. The k-means model predicts 4 labels: ["0", "1", "2", "3"]. The DBSCAN model predicts 7 labels: ["-1", "0", "1", "2", "3", "4", "5"]; as noted above, "-1" is DBSCAN's marker for noise points rather than a cluster, so DBSCAN effectively found 6 clusters. Our true label values are ["0", "1", "2", "3", "4", "5"] (the original values of y). The k-means model therefore underestimated the number of clusters by 2, while DBSCAN recovered the true number of clusters (plus the noise label). In terms of covering all true label values, the DBSCAN model is better. Let's obtain numerical measures to see which model is better, by looking at the adjusted Rand scores and silhouette coefficients of the two models.

kmeans_silhouette = silhouette_score(scaled_features, kmeans.labels_).round(4)
dbscan_silhouette = silhouette_score(scaled_features, dbscan.labels_).round(4)
ari_dbscan = adjusted_rand_score(true_labels, dbscan.labels_).round(4)
ari_kmeans = adjusted_rand_score(true_labels, kmeans.labels_).round(4)
print("Adjusted rand score for kmeans is", ari_kmeans, "and silhouette coefficient for kmeans is",kmeans_silhouette)
print("Adjusted rand score for dbscan is", ari_dbscan, "and silhouette coefficient for dbscan is",dbscan_silhouette)
Adjusted rand score for kmeans is 0.7091 and silhouette coefficient for kmeans is 0.7129
Adjusted rand score for dbscan is 0.9965 and silhouette coefficient for dbscan is 0.4357

The silhouette coefficient is higher for the k-means algorithm, yet the DBSCAN algorithm appears to find the more natural clusters given the shape of the data. Based on the output above, we can see that the silhouette coefficient was misleading here. ARI shows that DBSCAN is the better choice for this synthetic dataset compared to k-means: the adjusted Rand score for DBSCAN is very close to 1, which indicates almost perfectly recovered clusters. Hence, we can predict the true labels of the data using DBSCAN clustering. As stated earlier, the DBSCAN algorithm is extremely efficient in detecting outliers and handling noise; if our dataset contains many outliers, that may be why DBSCAN outperforms k-means here.
