# Silhouette Coefficient For K-Modes and K-Prototypes Clustering

The K-Modes and K-Prototypes clustering algorithms are partitioning clustering algorithms. We need to specify the required number of clusters while clustering datasets using these algorithms. However, we don’t know the optimal number of clusters beforehand. For this, we can use the silhouette coefficient approach. In this article, we will discuss the Silhouette Coefficient approach for K-Modes and K-prototypes clustering for finding the optimal number of clusters.

## What Is the Silhouette Coefficient?

The silhouette coefficient is a measure of cohesion among the data points in a cluster. It is a great metric to calculate the quality of clusters formed using partitioning-based clustering algorithms like k-means and k-medoids clustering algorithms.  The silhouette coefficient is calculated using the following steps.

• First, we calculate the mean intra-cluster distance for a given cluster. Let the value of mean intra-cluster distance be a.
• Next, we calculate the mean nearest-cluster distance for a given cluster. Here, the nearest cluster distance is calculated as the distance between a data point of the current cluster and the cluster that the data point is not a part of.  Let the value of the mean nearest cluster distance be b.
• The Silhouette Coefficient for a cluster is calculated as is (b – a) / max(a, b).

After calculating the silhouette coefficient for a given cluster, we can calculate the average silhouette score for the entire dataset by taking the average silhouette coefficient of each cluster.

For a given dataset, the Silhouette Coefficient is defined only if the number of cluster labels “n_labels” is 2 <= n_labels <= n_samples – 1. Here, n_samples is the total number of data points in the dataset.

The average silhouette score for a given dataset lies between -1 to 1.

• The clusters in the dataset are said to be in good shape and have the greatest possible cohesion between them if the average silhouette score for the dataset is closer to 1.
• We can say that the clusters in the dataset are not in good shape and there is very little dissimilarity between the clusters if the average silhouette score for the dataset is close to 0.
• If the average silhouette score is closer to -1, we say that the clusters are in bad shape and the data points within a cluster have no similarity to each other.

In python, we can calculate the average silhouette coefficient for a dataset using the `silhouette_score()` function defined in the `sklearn.metrics` module. I have already discussed the syntax of the function in this article on the silhouette coefficient in python for k-means clustering. You must read this article before proceeding ahead.

We will now focus on implementing the Silhouette Coefficient approach for K-Modes and K-prototypes clustering for finding the optimal number of clusters. To calculate the average silhouette coefficient for k-modes clustering, we will use the `silhouette_score()` function in `"precomputed"` mode. For this, we will set the `“metric”` parameter in the `silhouette_score()` function to `“precomputed”`. In this mode, the `silhouette_score()` function takes the distance matrix of the data points as its input argument. So, we need to calculate the distance matrix separately for k-modes and k-prototypes clustering algorithms.

## Silhouette Coefficient Approach for K-Modes Clustering in Python

To implement the Silhouette Coefficient approach for K-Modes Clustering in Python, we first need to calculate the distance matrix.

### Calculation of Distance Matrix for K-Modes Clustering

• To calculate the matrix, we will use the dissimilarity score of two data points. I have discussed the calculation of dissimilarity scores for categorical data in the article on k-modes clustering with a numerical example.
• For the python implementation of the calculation of dissimilarity score, we will use the `kprototypes.matching_dissim()` function. The working of this function is discussed in this article on clustering mixed data types in Python.

To calculate the distance matrix, we will define a function `“create_dm()”`. It takes a pandas dataframe or a 2-D numpy array as its input argument. Then, it executes the following steps.

1. First, the `create_dm()` function checks if the input argument is a dataframe. If yes, it extracts a numpy array of the values in the dataframe using the values attribute.
2. Then, it creates an empty 2-D numpy array of shape `lenDataset*lenDataset`. Here, `lenDataset` is the total number of data points in the dataset. The array is used as the distance matrix.
3. After creating the empty distance matrix, the distance between each data point is calculated using the `kprototypes.matching_dissim()` function and the distance matrix is populated.
4. Finally, the `create_dm()` function returns the distance matrix.

### Python Implementation of the Average Silhouette Coefficient Approach for K-Modes Clustering

After implementing the `create_dm() `function to create the distance matrix, we will implement the program to find the average silhouette coefficient for k-modes clustering in Python. For this, we will use the following steps.

• First, we will import the required python libraries and modules using the import statement and load the dataset using the `read_csv()` function.
• Then, we will create an empty dictionary named `silhouette_scores` to store the number of clusters and respective silhouette scores.
• Next, we will calculate the distance matrix for the dataset using the `create_dm() `function.
• Next, we will use a for loop to find the Silhouette coefficient for each k.  In the for loop, we will vary the value of k from 2 to 10. You can also choose to vary it from 2 to the possible maximum number of clusters in the dataset.
• Inside the for loop, we will perform the following operations.
• We will create an untrained machine-learning model for k-modes clustering using the `KModes()` function and the current value of k. The `KModes()` function is defined in the kmodes module of the kmodes library.
• Then, we will train the machine learning model using the given dataset and the `fit()` method. The `fit()` method, when invoked on the model returned by the `KModes() `function, takes the dataset as its input and returns a trained machine-learning model.
• After training the model, we will obtain the cluster labels for the data points in the dataset using the `labels_` attribute of the machine learning model.
• Now we will use the `silhouette_score()` function defined in the `sklearn.metrics` module to calculate the average silhouette score for each k. The `silhouette_score()` function takes the distance matrix as its first input argument and cluster labels as its second input argument. We will also set the `“metric”` parameter to `“precomputed”` to specify that we are passing the distance matrix as input and not the entire dataset.  After execution, the `silhouette_score()` function returns the silhouette score for the given k.
• After obtaining the silhouette score, we will store the current value of k as the key and the silhouette score as the associated value in the `silhouette_scores` dictionary.

After execution of the for loop, we will get the silhouette score for each k in the `silhouette_scores` dictionary.

``````from kmodes.kmodes import KModes
from kmodes import kprototypes
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
def create_dm(dataset):
#if the input dataset is a dataframe, we take out the values as a numpy.
#If the input dataset is a numpy array, we use it as is.
if type(dataset).__name__=='DataFrame':
dataset=dataset.values
lenDataset=len(dataset)
distance_matrix=np.zeros(lenDataset*lenDataset).reshape(lenDataset,lenDataset)
for i in range(lenDataset):
for j in range(lenDataset):
x1= dataset[i].reshape(1,-1)
x2= dataset[j].reshape(1,-1)
distance=kprototypes.matching_dissim(x1, x2)
distance_matrix[i][j]=distance
distance_matrix[j][i]=distance
return distance_matrix
silhouette_scores = dict()
K = range(2,10)
distance_matrix=create_dm(data)
for k in K:
untrained_model=KModes(n_clusters=k, n_init=4)
trained_model=untrained_model.fit(data)
cluster_labels = trained_model.labels_
score=silhouette_score(distance_matrix, cluster_labels,metric="precomputed")
silhouette_scores[k]=score
print("The k and associated Silhouette scores are: ",silhouette_scores)``````

Output:

``The k and associated Silhouette scores are:  {2: 0.08989411125120053, 3: -0.002660312326978979, 4: 0.09695211275522664, 5: 0.1604582084582084, 6: 0.13385309679427324, 7: 0.044937752878929337, 8: 0.025396825396825383, 9: 0.12079365079365077}``

In the output, you can observe that the average silhouette score is highest for k= 5. Hence, for the best results, we need to make 5 clusters from the input dataset. We can also use the elbow method to find the optimal number of clusters in k-modes clustering. However, the elbow method often fails to give a specific value for the optimal number of clusters if the input dataset has abnormal distribution. You can observe that in this article on elbow method for k-modes clustering in Python.

## Silhouette Coefficient Approach for K-Prototypes Clustering in Python

The k-prototypes clustering algorithm is used for mixed data types. Although we have functions to calculate the distance between data points having categorical and numerical attributes, we don’t have functions to calculate the distance between data points having attributes of mixed data types.

### Calculation of Distance Between Two Data Points Having Mixed Attributes

We will first write a function `mixed_distance()` to calculate the distance between the data points. The `mixed_distance() `function has the following syntax.

``mixed_distance(a,b,categorical=None, alpha=0.01)``
• Here, a and b are data points.
• The `categorical` parameter takes a list containing indices of categorical attributes in the input data points.
• The `alpha` parameter is used to scale the distance between numerical attributes. The need for scaling has been explained in the article on k-prototypes clustering with a numerical example.

While execution, the` mixed_distance()` function first checks if the categorical parameter is None. If yes, it means that the data points have no categorical attributes. Hence, we calculate the squared euclidean distance between the data points using the `kprototypes.euclidean_dissim()` function and returns the distance.

If the categorical parameter is not `None`, we will separate the categorical and numerical attributes. Then, we will calculate the distance between categorical and numerical attributes separately using the `kprototypes.matching_dissim() `and` kprototypes.euclidean_dissim()` function respectively.

Finally, we will scale the value returned by the `kprototypes.euclidean_dissim()` function using `alpha`, add it to the value returned by the `kprototypes.matching_dissim() `function and return the value.

### Calculation of Distance Matrix for K-Prototypes Clustering

To calculate the distance matrix for k-prototypes clustering, we will define a function `“dm_prototypes()”`. It takes a pandas dataframe or a 2-D numpy array, a list containing indices of categorical attributes, and the scaling factor `alpha` as its input argument. Then, it executes the following steps.

1. First, the `dm_prototypes()` function checks if the input argument is a dataframe. If yes, it extracts a numpy array of the values in the dataframe using the values attribute.
2. Then, it creates an empty 2-D numpy array of shape `lenDataset*lenDataset`. Here, `lenDataset` is the total number of data points in the dataset. The array is used as the distance matrix.
3. After creating the empty distance matrix, the distance between each data point is calculated using the `mixed_distance()` function and the distance matrix is populated.
4. Finally, the `dm_prototypes()` function returns the distance matrix.

### Python Implementation of the Average Silhouette Coefficient Approach for K-Prototypes Clustering

After implementing the function to calculate the distance matrix, we can implement the python program for the silhouette coefficient approach for k-prototypes clustering. For this, we will use the following steps.

• First, we will import the required python libraries and modules using the import statement and load the dataset using the `read_csv()` function.
• After this, we will perform data preprocessing to normalize the data values and specify the data type of the categorical and numeric attributes.
• Then, we will create an empty dictionary named `silhouette_scores` to store the number of clusters and respective silhouette scores.
• Next, we will calculate the distance matrix for the dataset using the `dm_prototypes() `function.
• Now, we will use a for loop to find the Silhouette coefficient for each k.  In the for loop, we will vary the value of k from 2 to 10. You can also choose to vary it from 2 to the possible maximum number of clusters in the dataset.
• Inside the for loop, we will perform the following operations.
• We will create an untrained machine learning model for k-modes clustering using the `KPrototypes()` function and the current value of k. The `KPrototypes() `function is defined in the `kprototypes` module of the `kmodes` library.
• Then, we will train the machine learning model using the given dataset and the `fit()` method. The `fit()` method, when invoked on the model returned by the `KPrototypes()` function, takes the dataset as its input and returns a trained machine-learning model.
• After training the model, we will obtain the cluster labels for the data points in the dataset using the `labels_` attribute of the machine learning model.
• Now we will use the `silhouette_score()` function defined in the `sklearn.metrics` module to calculate the average silhouette score for each k. The `silhouette_score()` function takes the distance matrix as its first input argument and cluster labels as its second input argument. We will also set the `“metric”` parameter to `“precomputed”` to specify that we are passing the distance matrix as input and not the entire dataset.  After execution, the `silhouette_score()` function returns the silhouette score for the given k.
• After obtaining the silhouette score, we will store the current value of k as the key and the silhouette score as the associated value in the `silhouette_scores` dictionary.

After execution of the for loop, we will get the silhouette score for each k in the `silhouette_scores` dictionary.

``````from kmodes.kmodes import KModes
from kmodes import kprototypes
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
def mixed_distance(a,b,categorical=None, alpha=0.01):
if categorical is None:
num_score=kprototypes.euclidean_dissim(a,b)
return num_score
else:
cat_index=categorical
a_cat=[]
b_cat=[]
for index in cat_index:
a_cat.append(a[index])
b_cat.append(b[index])
a_num=[]
b_num=[]
l=len(a)
for index in range(l):
if index not in cat_index:
a_num.append(a[index])
b_num.append(b[index])

a_cat=np.array(a_cat).reshape(1,-1)
a_num=np.array(a_num).reshape(1,-1)
b_cat=np.array(b_cat).reshape(1,-1)
b_num=np.array(b_num).reshape(1,-1)
cat_score=kprototypes.matching_dissim(a_cat,b_cat)
num_score=kprototypes.euclidean_dissim(a_num,b_num)
return cat_score+num_score*alpha
def dm_prototypes(dataset,categorical=None,alpha=0.1):
#if the input dataset is a dataframe, we take out the values as a numpy.
#If the input dataset is a numpy array, we use it as is.
if type(dataset).__name__=='DataFrame':
dataset=dataset.values
lenDataset=len(dataset)
distance_matrix=np.zeros(lenDataset*lenDataset).reshape(lenDataset,lenDataset)
for i in range(lenDataset):
for j in range(lenDataset):
x1= dataset[i]
x2= dataset[j]
distance=mixed_distance(x1, x2,categorical=categorical,alpha=alpha)
distance_matrix[i][j]=distance
distance_matrix[j][i]=distance
return distance_matrix
import pandas as pd
import numpy as np
from sklearn.metrics import silhouette_score
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3::].astype(float)
silhouette_scores = dict()
K = range(2,10)
distance_matrix=dm_prototypes(data_array,categorical=[0, 1, 2],alpha=0.1)
for k in K:
untrained_model = kprototypes.KPrototypes(n_clusters=k,max_iter=20)
trained_model = untrained_model.fit(data_array, categorical=[0, 1, 2])
cluster_labels = trained_model.labels_
score=silhouette_score(distance_matrix, cluster_labels,metric="precomputed")
silhouette_scores[k]=score
print("The k and associated Silhouette scores are: ",silhouette_scores)``````

Output:

``The k and associated Silhouette scores are:  {2: -0.04908330150579255, 3: 0.12082918653033445, 4: -0.04908330150579255, 5: -0.04908330150579255, 6: 0.036441846876001924, 7: -0.04908330150579255, 8: 0.040496970169131355, 9: 0.12082918653033445}``

In the output, you can observe that the average silhouette score is highest for k= 3. Hence, for the best results, we need to make 3 clusters from the input dataset. We can also use the elbow method to find the optimal number of clusters in k-prototypes clustering. However, the elbow method often fails to give a specific value for the optimal number of clusters if the input dataset has abnormal distribution. You can observe that in this article on the elbow method for k-prototypes clustering in Python.

## Conclusion

In this article, we have discussed the implementation of the average silhouette coefficient approach for k-modes and k-prototypes clustering to find the optimal number of clusters in a dataset. To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in python.

To explore data visualization, you can have a look at this article on how to avoid chart junk.