Silhouette Coefficient Approach in Python For K-Means Clustering

Deciding the optimal number of clusters is essential when clustering datasets with partition-based clustering algorithms. In this article, we will discuss how to use the silhouette coefficient in Python to decide the optimal number of clusters in k-means clustering.

What is Silhouette Coefficient?

The silhouette coefficient is a measure of how cohesive and well separated the clusters in a dataset are. Mathematically, we calculate the silhouette coefficient for each sample using its mean intra-cluster distance (a) and its mean nearest-cluster distance (b).

The silhouette coefficient for a sample is calculated as (b – a) / max(a, b). Here, b is the mean distance between the sample and the points of the nearest cluster that the sample is not a part of. Note that the silhouette coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples – 1. In other words, you can compute the silhouette coefficient only if the number of cluster labels in the dataset is at least two and less than the number of data points.
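
As a quick numeric illustration, suppose a data point has a mean intra-cluster distance a = 2.0 and a mean nearest-cluster distance b = 5.0 (made-up values used only for this example).

a = 2.0  # mean distance to the other points in its own cluster
b = 5.0  # mean distance to the points in the nearest other cluster
s = (b - a) / max(a, b)
print(s)  # 0.6, i.e. the point sits comfortably inside its own cluster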

We use the average silhouette score of a dataset while choosing the optimal number of clusters. The average silhouette score is the mean of the silhouette coefficients of all the samples in the dataset. It is a good metric for judging the quality of clusters produced by a partitioning clustering algorithm, and its value lies between -1 and 1.

  • If the average silhouette score for a dataset is close to 1, the clusters are in good shape: data points are close to the other points in their own cluster and far from the points of other clusters.
  • If the average silhouette score for a dataset is close to 0, the clusters are not in good shape: they overlap, and many points lie near the boundary between neighboring clusters.
  • If the average silhouette score for a dataset is close to -1, the clusters are in bad shape: many data points have most likely been assigned to the wrong clusters.

From the above points, you can conclude that the number of clusters for which the average silhouette score is maximum for a given dataset is considered optimal. 

Steps to Calculate the Silhouette Coefficient for a Data Point

We use the following steps to calculate the silhouette coefficient for a data point.

  • First, we calculate the mean distance between the current data point and all the other points in the same cluster (the intra-cluster distance). Let us denote this value by a.
  • Then, we calculate the mean distance between the current data point and the data points in the nearest cluster other than its own. Let us denote this value by b.
  • Next, we take the larger of the two values, i.e. max(a, b).
  • The silhouette coefficient for the current data point is then given by (b – a) divided by max(a, b).

After finding the silhouette coefficient for each data point in the dataset, we calculate the average silhouette coefficient by taking the mean of the silhouette coefficients of all the data points. The number of clusters for which the average silhouette coefficient is maximum is considered the optimal number of clusters.
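
To make these steps concrete, here is a minimal sketch that computes the silhouette coefficient of every data point by hand with NumPy and compares the result with sklearn's silhouette_samples() function. The tiny two-dimensional dataset and the hard-coded labels are only illustrative.

import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data: two obvious groups, with labels assigned by hand purely for illustration
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_by_hand(X, labels):
    n = len(X)
    # Pairwise Euclidean distances between all data points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = dist[i, same].mean()  # mean distance to the other points in the same cluster
        # smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

print(silhouette_by_hand(X, labels))
print(silhouette_samples(X, labels))  # should match the manual values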

Need for the Silhouette Coefficient Approach in Clustering

We need the silhouette coefficient approach in clustering for the following reasons.

  • In partitioning clustering, we don’t know the optimal number of clusters upfront. We need to give the number of clusters as input to the clustering algorithm. Here, clustering the dataset into a random number of clusters won’t serve our purpose. So, we use the average silhouette score metric to calculate the optimal number of clusters for a given dataset.
  • Apart from the average silhouette coefficient approach, we can also use the elbow method to find the optimal number of clusters in a dataset. It uses the within-cluster variance as a metric to select the optimal number of clusters. However, the elbow method has a major drawback: while plotting an elbow graph, we often fail to obtain a clear elbow point. If the graph of the cluster variance vs the number of clusters decreases smoothly, the elbow method cannot give us the optimal number of clusters for the given dataset. The average silhouette coefficient approach doesn’t rely on reading a plot; it produces a concrete numeric score for each candidate number of clusters, so we can simply pick the value of k with the highest score. A short sketch of the elbow computation is shown after this list.
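
For comparison, here is a rough sketch of the elbow method. We compute the within-cluster sum of squared distances (the inertia_ attribute of a fitted KMeans model) for each candidate k, using the same toy dataset that appears later in this article.

from sklearn.cluster import KMeans

training_data = [(2,10),(2,6),(11,11),(6,9),(6,4),(1,2),(5,10),(4,9),(10,12),(7,5),(9,11),(4,6),(3,10),(3,8),(6,11)]
inertias = {}
for k in range(2, 10):
    # n_init is set explicitly so the results are stable across scikit-learn versions
    model = KMeans(n_clusters=k, n_init=10).fit(training_data)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances
print(inertias)
# If the inertia decreases smoothly with k and shows no sharp bend,
# the elbow method gives no clear answer, which is exactly where the
# silhouette score approach is more convenient.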

The silhouette_score() Function in Python

In this article, we will use the silhouette coefficient approach in Python to find the optimal number of clusters for the k-means clustering algorithm. The sklearn module in Python provides us with many tools for machine learning. We can use the silhouette_score() function to calculate the average silhouette score for a given dataset and a specified clustering. The silhouette_score() function has the following syntax.

sklearn.metrics.silhouette_score(X, labels, *, metric='euclidean', sample_size=None, random_state=None, **kwds)

Here, 

  • The parameter X takes an array containing the training data as its input argument. You can also pass an array containing the pairwise distances between the data points if the parameter metric is set to “precomputed” (see the usage example after this list).
  • The labels parameter takes the predicted cluster label of each data point in the input array. 
  • The metric parameter specifies the distance metric used to compute the distance between data points. If the input data contains only numerical values, you can keep the default value “euclidean”. You can also use one of the following metrics.
    • From the scikit-learn module, you can use the ‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, or ‘manhattan’ metric. These metrics support sparse matrix inputs. You can also use the ‘nan_euclidean’ metric, but it does not yet support sparse matrices.
    • From scipy.spatial.distance module you can use ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, and ‘yule’ metrics.
  • If the input data contains categorical attributes or has mixed data types, you can calculate the distance matrix and pass the value “precomputed” to the metric parameter and the distance matrix to the parameter X as input arguments.  You can observe this in this article on silhouette score analysis for k-modes clustering.
  • The parameter sample_size takes the size of the sample to use when computing the Silhouette Coefficient on a random subset of the data. If sample_size is None, no sampling is used.
  • The parameter random_state is used when we use sampling while calculating the silhouette coefficient. It determines random number generation for selecting a subset of samples. You can use the random_state parameter when sample_size is not None. To produce reproducible results across multiple function calls, you can pass an integer to the random_state parameter. 
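
As a quick illustration of these parameters, here is a minimal, self-contained call to silhouette_score(), first with the default Euclidean metric and then with a precomputed distance matrix. The toy data and labels are only placeholders.

import numpy as np
from sklearn.metrics import silhouette_score, pairwise_distances

# Toy data and cluster labels, used only to illustrate the parameters
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = [0, 0, 0, 1, 1, 1]

# Default usage: raw data plus labels, Euclidean distance
print(silhouette_score(X, labels))

# Equivalent call with a precomputed pairwise distance matrix
D = pairwise_distances(X, metric="euclidean")
print(silhouette_score(D, labels, metric="precomputed"))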

Now that we know what the silhouette coefficient is, let us implement a Python program to determine the optimal number of clusters for k-means clustering using the silhouette coefficient approach.

Average Silhouette Coefficient Approach For K-Means Clustering in Python

To implement the Python program that finds the optimal number of clusters in k-means clustering, we will use the steps discussed below. Here, we will directly implement the silhouette score approach to find the optimal number of clusters.

Following are the steps to implement the average silhouette score approach to find the optimal number of clusters in k-means clustering.

  • First, we will create a Python dictionary, say silhouette_scores, to store the average silhouette coefficient for each value of k in the k-means clustering.
  • Now, we will use a for loop to find the silhouette coefficient for each k. In the for loop, we can vary the value of k from 2 up to one less than the total number of points in the dataset. In practice, you can vary it from 2 to the largest number of clusters you consider plausible for the dataset.
  • Inside the for loop, we will perform the following operations.
    • We will create an untrained machine-learning model for k-means clustering using the KMeans() constructor and the current value of k. The KMeans() constructor is defined in the sklearn.cluster module. 
    • Then, we will train the machine learning model using the given dataset and the fit() method. The fit() method, when invoked on the model returned by the KMeans() function, takes the dataset as its input and returns a trained machine-learning model. 
    • After training the model, we will obtain the cluster labels for the data points in the dataset using the labels_ attribute of the machine learning model. 
    • Now we will use the silhouette_score() function defined in the sklearn.metrics module to calculate the average silhouette score for each k. The silhouette_score() function takes the dataset as its first input argument and cluster labels as its second input argument. After execution, it returns the silhouette score for the given k.
    • After obtaining the silhouette score, we will store the current value of k as the key and the silhouette score as the associated value in the silhouette_scores dictionary.

After execution of the for loop, we will get the silhouette score for each k in the silhouette_scores dictionary.

The value of k for which the silhouette score is maximum will be considered as the suitable number of clusters in the k-means clustering algorithm. 

You can implement the entire procedure for silhouette score analysis to find the optimal number of clusters for k-means clustering as shown below.

from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Two-dimensional toy dataset
training_data = [(2,10),(2,6),(11,11),(6,9),(6,4),(1,2),(5,10),(4,9),(10,12),(7,5),(9,11),(4,6),(3,10),(3,8),(6,11)]

silhouette_scores = dict()
range_of_k = range(2, 10)
for k in range_of_k:
    # Create and train a k-means model with k clusters
    # (n_init is set explicitly so results are stable across scikit-learn versions)
    untrained_model = KMeans(n_clusters=k, n_init=10)
    trained_model = untrained_model.fit(training_data)
    # Cluster label assigned to each data point
    cluster_labels = trained_model.labels_
    # Average silhouette score for this value of k
    score = silhouette_score(training_data, cluster_labels)
    silhouette_scores[k] = score
print("The k and associated Silhouette scores are: ", silhouette_scores)

Output:

The k and associated Silhouette scores are:  {2: 0.40955609233587087, 3: 0.4656478498683258, 4: 0.4716346171963804, 5: 0.43177618347526187, 6: 0.4428140295927011, 7: 0.4302882476308248, 8: 0.3790387001105711, 9: 0.2706160005549127}

In the above output, you can observe that we get the highest average silhouette score for k=4. Hence, you can cluster the data points in the dataset given in the above example into four clusters for the best results.
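
If you want the program to pick the best k automatically instead of reading it off the printed dictionary, you can append a line like the following after the loop above.

# Pick the k with the highest average silhouette score
best_k = max(silhouette_scores, key=silhouette_scores.get)
print("Optimal number of clusters:", best_k)  # prints 4 for this dataset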

Conclusion

In this article, we have discussed the average silhouette coefficient approach for K-Means clustering and its implementation in Python. To learn more about clustering, you can read this article on clustering mixed data types in Python. You might also like this article on the elbow method for k-prototypes clustering in Python.

To explore data visualization, you can have a look at this article on how to avoid chart junk.

I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!
