Silhouette Coefficient For K-Modes and K-Prototypes Clustering

The K-Modes and K-Prototypes clustering algorithms are partitioning clustering algorithms, so we need to specify the required number of clusters while clustering datasets using them. However, we usually don’t know the optimal number of clusters beforehand. To find it, we can use the silhouette coefficient approach. In this article, we will discuss the silhouette coefficient approach for K-Modes and K-Prototypes clustering to find the optimal number of clusters.

What Is the Silhouette Coefficient?

The silhouette coefficient is a measure of how cohesive the data points within a cluster are and how well separated the clusters are from each other. It is a great metric for assessing the quality of clusters formed by partitioning-based clustering algorithms such as the k-means and k-medoids clustering algorithms. The silhouette coefficient is calculated using the following steps.

  • First, we calculate the mean intra-cluster distance for a given data point, i.e., the average distance between the data point and all the other points in its own cluster. Let this value be a.
  • Next, we calculate the mean nearest-cluster distance for the same data point, i.e., the average distance between the data point and all the points in the nearest cluster that the data point is not a part of. Let this value be b.
  • The silhouette coefficient for the data point is then calculated as (b – a) / max(a, b).

After calculating the silhouette coefficient for every data point, we can calculate the average silhouette score for the entire dataset by taking the mean of the silhouette coefficients of all the data points.
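As a quick illustration, here is a minimal sketch of the calculation for a single data point, using made-up values for the mean distances a and b.

# Made-up mean distances for one data point
a = 1.8   # mean distance to the other points in its own cluster
b = 4.2   # mean distance to the points in the nearest other cluster

s = (b - a) / max(a, b)
print(s)  # 0.5714..., a reasonably well-placed data point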

For a given dataset, the silhouette coefficient is defined only if the number of cluster labels “n_labels” satisfies 2 <= n_labels <= n_samples – 1. Here, n_samples is the total number of data points in the dataset.
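For example, if every data point is assigned to a single cluster, the score is undefined and the silhouette_score() function raises a ValueError. The following minimal sketch, using a tiny made-up dataset, demonstrates this.

import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [2, 3], [3, 4]])

# All the data points belong to a single cluster, so n_labels is 1 and the score is undefined
try:
    silhouette_score(X, [0, 0, 0])
except ValueError as error:
    print(error)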

The average silhouette score for a given dataset lies between -1 and 1.

  • If the average silhouette score for the dataset is close to 1, the clusters are said to be in good shape: the data points within each cluster are highly cohesive, and the clusters are well separated from each other.
  • If the average silhouette score for the dataset is close to 0, the clusters are not in good shape: they overlap, and there is very little dissimilarity between neighboring clusters.
  • If the average silhouette score is close to -1, the clusters are in bad shape: many data points are more similar to the points of other clusters than to the points of their own cluster.

In Python, we can calculate the average silhouette coefficient for a dataset using the silhouette_score() function defined in the sklearn.metrics module. I have already discussed the syntax of the function in this article on the silhouette coefficient in Python for k-means clustering. You should read that article before proceeding.
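As a quick recap of the basic syntax, the function takes the data and the cluster labels and returns the mean silhouette coefficient. Here is a minimal sketch using k-means clustering on a small made-up numerical dataset.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Small made-up dataset with two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# With the default metric ("euclidean"), we pass the raw data directly
print(silhouette_score(X, labels))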

We will now focus on implementing the silhouette coefficient approach for k-modes and k-prototypes clustering to find the optimal number of clusters. To calculate the average silhouette coefficient, we will use the silhouette_score() function in “precomputed” mode. For this, we will set the “metric” parameter of the silhouette_score() function to “precomputed”. In this mode, the silhouette_score() function takes the distance matrix of the data points as its input argument. So, we need to calculate the distance matrix separately for the k-modes and k-prototypes clustering algorithms.
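To see what the “precomputed” mode expects, here is a minimal sketch with a hand-built distance matrix for four data points and made-up cluster labels.

import numpy as np
from sklearn.metrics import silhouette_score

# Hand-built symmetric distance matrix with zeros on the diagonal
distance_matrix = np.array([[0, 1, 5, 6],
                            [1, 0, 6, 5],
                            [5, 6, 0, 1],
                            [6, 5, 1, 0]])
cluster_labels = [0, 0, 1, 1]

# With metric="precomputed", the first argument is the distance matrix, not the dataset
print(silhouette_score(distance_matrix, cluster_labels, metric="precomputed"))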

Silhouette Coefficient Approach for K-Modes Clustering in Python

To implement the Silhouette Coefficient approach for K-Modes Clustering in Python, we first need to calculate the distance matrix. 

Calculation of Distance Matrix for K-Modes Clustering

  • To calculate the matrix, we will use the dissimilarity score between two data points. I have discussed the calculation of dissimilarity scores for categorical data in the article on k-modes clustering with a numerical example.
  • For the Python implementation of the dissimilarity score, we will use the kprototypes.matching_dissim() function, as shown in the short sketch after this list. The working of this function is discussed in this article on clustering mixed data types in Python.
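For instance, the following minimal sketch with made-up categorical values shows how the kprototypes.matching_dissim() function counts the number of attributes in which two data points differ.

import numpy as np
from kmodes import kprototypes

# Two data points with three categorical attributes each, reshaped to 2-D arrays
x1 = np.array([["red", "small", "yes"]])
x2 = np.array([["red", "large", "no"]])

# The data points differ in two of the three attributes, so the dissimilarity score is 2
print(kprototypes.matching_dissim(x1, x2))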

To calculate the distance matrix, we will define a function “create_dm()”. It takes a pandas dataframe or a 2-D numpy array as its input argument. Then, it executes the following steps.

  1. First, the create_dm() function checks if the input argument is a dataframe. If yes, it extracts a numpy array of the values in the dataframe using the values attribute. 
  2. Then, it creates a 2-D numpy array of zeros with shape lenDataset x lenDataset. Here, lenDataset is the total number of data points in the dataset. The array is used as the distance matrix.
  3. After creating the distance matrix filled with zeros, the distance between each pair of data points is calculated using the kprototypes.matching_dissim() function and the distance matrix is populated.
  4. Finally, the create_dm() function returns the distance matrix. 

Python Implementation of the Average Silhouette Coefficient Approach for K-Modes Clustering

After implementing the create_dm() function to create the distance matrix, we will implement the program to find the average silhouette coefficient for k-modes clustering in Python. For this, we will use the following steps.

  • First, we will import the required python libraries and modules using the import statement and load the dataset using the read_csv() function.
  • Then, we will create an empty dictionary named silhouette_scores to store the number of clusters and respective silhouette scores. 
  • Next, we will calculate the distance matrix for the dataset using the create_dm() function. 
  • Next, we will use a for loop to find the silhouette coefficient for each k. In the for loop, we will vary the value of k from 2 to 9. You can also choose to vary it from 2 to the maximum possible number of clusters in the dataset.
  • Inside the for loop, we will perform the following operations.
    • We will create an untrained machine-learning model for k-modes clustering using the KModes() function and the current value of k. The KModes() function is defined in the kmodes module of the kmodes library. 
    • Then, we will train the machine learning model using the given dataset and the fit() method. The fit() method, when invoked on the model returned by the KModes() function, takes the dataset as its input and returns a trained machine-learning model. 
    • After training the model, we will obtain the cluster labels for the data points in the dataset using the labels_ attribute of the machine learning model. 
    • Now we will use the silhouette_score() function defined in the sklearn.metrics module to calculate the average silhouette score for each k. The silhouette_score() function takes the distance matrix as its first input argument and cluster labels as its second input argument. We will also set the “metric” parameter to “precomputed” to specify that we are passing the distance matrix as input and not the entire dataset.  After execution, the silhouette_score() function returns the silhouette score for the given k.
    • After obtaining the silhouette score, we will store the current value of k as the key and the silhouette score as the associated value in the silhouette_scores dictionary.

After execution of the for loop, we will get the silhouette score for each k in the silhouette_scores dictionary.

For implementing the program, you can download the dataset from the following link.

from kmodes.kmodes import KModes
from kmodes import kprototypes
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
def create_dm(dataset):
    # If the input dataset is a dataframe, extract its values as a numpy array.
    # If the input dataset is already a numpy array, we use it as is.
    if type(dataset).__name__ == 'DataFrame':
        dataset = dataset.values
    lenDataset = len(dataset)
    distance_matrix = np.zeros((lenDataset, lenDataset))
    for i in range(lenDataset):
        for j in range(i, lenDataset):
            # matching_dissim() expects 2-D arrays, so reshape each row to (1, n)
            x1 = dataset[i].reshape(1, -1)
            x2 = dataset[j].reshape(1, -1)
            distance = kprototypes.matching_dissim(x1, x2)
            # The distance is symmetric, so fill both entries of the matrix
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
# Load the dataset and use the "Student" column as the index
data = pd.read_csv("KModes-dataset.csv", index_col=["Student"])
silhouette_scores = dict()
K = range(2, 10)
# Precompute the distance matrix once; it does not depend on k
distance_matrix = create_dm(data)
for k in K:
    # Train a k-modes model for the current number of clusters
    untrained_model = KModes(n_clusters=k, n_init=4)
    trained_model = untrained_model.fit(data)
    cluster_labels = trained_model.labels_
    # Average silhouette score computed from the precomputed distance matrix
    score = silhouette_score(distance_matrix, cluster_labels, metric="precomputed")
    silhouette_scores[k] = score
print("The k and associated Silhouette scores are: ",silhouette_scores)

Output:

The k and associated Silhouette scores are:  {2: 0.08989411125120053, 3: -0.002660312326978979, 4: 0.09695211275522664, 5: 0.1604582084582084, 6: 0.13385309679427324, 7: 0.044937752878929337, 8: 0.025396825396825383, 9: 0.12079365079365077}

In the output, you can observe that the average silhouette score is highest for k = 5. Hence, for the best results, we should make 5 clusters from the input dataset. We can also use the elbow method to find the optimal number of clusters in k-modes clustering. However, the elbow method often fails to give a specific value for the optimal number of clusters if the input dataset has an irregular distribution. You can observe that in this article on the elbow method for k-modes clustering in Python.
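If you prefer to pick the optimal k programmatically instead of reading it off the dictionary, a small addition like the following (not part of the program above) does the job.

# Pick the k with the highest average silhouette score
best_k = max(silhouette_scores, key=silhouette_scores.get)
print("Optimal number of clusters:", best_k)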

Silhouette Coefficient Approach for K-Prototypes Clustering in Python

The k-prototypes clustering algorithm is used for mixed data types. Although we have functions to calculate the distance between purely categorical data points and between purely numerical data points, we don’t have a ready-made function to calculate the distance between data points whose attributes have mixed data types.

Calculation of Distance Between Two Data Points Having Mixed Attributes

We will first write a function mixed_distance() to calculate the distance between two data points having mixed attributes. The mixed_distance() function has the following syntax.

mixed_distance(a,b,categorical=None, alpha=0.01)
  • Here, a and b are data points.
  • The categorical parameter takes a list containing indices of categorical attributes in the input data points.
  • The alpha parameter is used to scale the distance between numerical attributes. The need for scaling has been explained in the article on k-prototypes clustering with a numerical example. 

During execution, the mixed_distance() function first checks if the categorical parameter is None. If it is, the data points have no categorical attributes. Hence, we calculate the squared Euclidean distance between the data points using the kprototypes.euclidean_dissim() function and return it.

If the categorical parameter is not None, we will separate the categorical and numerical attributes. Then, we will calculate the distances for the categorical and numerical attributes separately using the kprototypes.matching_dissim() and kprototypes.euclidean_dissim() functions, respectively.

Finally, we will scale the value returned by the kprototypes.euclidean_dissim() function using alpha, add it to the value returned by the kprototypes.matching_dissim() function, and return the result.
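For example, with alpha set to 0.01, if two data points differ in one of their categorical attributes and the squared Euclidean distance between their numerical attributes is 25, the mixed distance is 1 + 0.01 × 25 = 1.25.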

Calculation of Distance Matrix for K-Prototypes Clustering

To calculate the distance matrix for k-prototypes clustering, we will define a function “dm_prototypes()”. It takes a pandas dataframe or a 2-D numpy array, a list containing the indices of the categorical attributes, and the scaling factor alpha as its input arguments. Then, it executes the following steps.

  1. First, the dm_prototypes() function checks if the input argument is a dataframe. If yes, it extracts a numpy array of the values in the dataframe using the values attribute. 
  2. Then, it creates a 2-D numpy array of zeros with shape lenDataset x lenDataset. Here, lenDataset is the total number of data points in the dataset. The array is used as the distance matrix.
  3. After creating the distance matrix filled with zeros, the distance between each pair of data points is calculated using the mixed_distance() function and the distance matrix is populated.
  4. Finally, the dm_prototypes() function returns the distance matrix. 

Python Implementation of the Average Silhouette Coefficient Approach for K-Prototypes Clustering

After implementing the function to calculate the distance matrix, we can implement the python program for the silhouette coefficient approach for k-prototypes clustering. For this, we will use the following steps.

  • First, we will import the required python libraries and modules using the import statement and load the dataset using the read_csv() function.
  • After this, we will perform data preprocessing to normalize the data values and specify the data type of the categorical and numeric attributes.
  • Then, we will create an empty dictionary named silhouette_scores to store the number of clusters and respective silhouette scores. 
  • Next, we will calculate the distance matrix for the dataset using the dm_prototypes() function. 
  • Now, we will use a for loop to find the silhouette coefficient for each k. In the for loop, we will vary the value of k from 2 to 9. You can also choose to vary it from 2 to the maximum possible number of clusters in the dataset.
  • Inside the for loop, we will perform the following operations.
    • We will create an untrained machine learning model for k-prototypes clustering using the KPrototypes() function and the current value of k. The KPrototypes() function is defined in the kprototypes module of the kmodes library. 
    • Then, we will train the machine learning model using the given dataset and the fit() method. The fit() method, when invoked on the model returned by the KPrototypes() function, takes the dataset as its input and returns a trained machine-learning model. 
    • After training the model, we will obtain the cluster labels for the data points in the dataset using the labels_ attribute of the machine learning model. 
    • Now we will use the silhouette_score() function defined in the sklearn.metrics module to calculate the average silhouette score for each k. The silhouette_score() function takes the distance matrix as its first input argument and cluster labels as its second input argument. We will also set the “metric” parameter to “precomputed” to specify that we are passing the distance matrix as input and not the entire dataset.  After execution, the silhouette_score() function returns the silhouette score for the given k.
    • After obtaining the silhouette score, we will store the current value of k as the key and the silhouette score as the associated value in the silhouette_scores dictionary.

After execution of the for loop, we will get the silhouette score for each k in the silhouette_scores dictionary.

For the implementation of the python program, you can download the dataset from the following link.

from kmodes import kprototypes
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
def mixed_distance(a, b, categorical=None, alpha=0.01):
    if categorical is None:
        # No categorical attributes: return the squared Euclidean distance.
        # euclidean_dissim() expects 2-D arrays, so reshape the rows first.
        a_num = np.array(a, dtype=float).reshape(1, -1)
        b_num = np.array(b, dtype=float).reshape(1, -1)
        return kprototypes.euclidean_dissim(a_num, b_num)
    else:
        # Split each data point into its categorical and numerical attributes
        cat_index = categorical
        a_cat = []
        b_cat = []
        for index in cat_index:
            a_cat.append(a[index])
            b_cat.append(b[index])
        a_num = []
        b_num = []
        for index in range(len(a)):
            if index not in cat_index:
                a_num.append(a[index])
                b_num.append(b[index])
        a_cat = np.array(a_cat).reshape(1, -1)
        a_num = np.array(a_num).reshape(1, -1)
        b_cat = np.array(b_cat).reshape(1, -1)
        b_num = np.array(b_num).reshape(1, -1)
        # Matching dissimilarity for the categorical attributes,
        # squared Euclidean distance for the numerical attributes
        cat_score = kprototypes.matching_dissim(a_cat, b_cat)
        num_score = kprototypes.euclidean_dissim(a_num, b_num)
        # Scale the numerical distance by alpha and combine the two scores
        return cat_score + num_score * alpha
def dm_prototypes(dataset,categorical=None,alpha=0.1):
    # If the input dataset is a dataframe, extract its values as a numpy array.
    # If the input dataset is already a numpy array, we use it as is.
    if type(dataset).__name__ == 'DataFrame':
        dataset = dataset.values
    lenDataset = len(dataset)
    distance_matrix = np.zeros((lenDataset, lenDataset))
    for i in range(lenDataset):
        for j in range(i, lenDataset):
            x1 = dataset[i]
            x2 = dataset[j]
            # Mixed categorical/numerical distance between the two data points
            distance = mixed_distance(x1, x2, categorical=categorical, alpha=alpha)
            # The distance is symmetric, so fill both entries of the matrix
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
df=pd.read_csv("Kprototypes_dataset.csv", index_col=["Student"])
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
silhouette_scores = dict()
K = range(2, 10)
# Precompute the distance matrix once, passing the indices of the categorical columns
distance_matrix = dm_prototypes(data_array, categorical=[0, 1, 2], alpha=0.1)
for k in K:
    # Train a k-prototypes model for the current number of clusters
    untrained_model = kprototypes.KPrototypes(n_clusters=k, max_iter=20)
    trained_model = untrained_model.fit(data_array, categorical=[0, 1, 2])
    cluster_labels = trained_model.labels_
    # Average silhouette score computed from the precomputed distance matrix
    score = silhouette_score(distance_matrix, cluster_labels, metric="precomputed")
    silhouette_scores[k] = score
print("The k and associated Silhouette scores are: ",silhouette_scores)

Output:

The k and associated Silhouette scores are:  {2: -0.04908330150579255, 3: 0.12082918653033445, 4: -0.04908330150579255, 5: -0.04908330150579255, 6: 0.036441846876001924, 7: -0.04908330150579255, 8: 0.040496970169131355, 9: 0.12082918653033445}

In the output, you can observe that the average silhouette score is highest for k = 3. Hence, for the best results, we should make 3 clusters from the input dataset. We can also use the elbow method to find the optimal number of clusters in k-prototypes clustering. However, the elbow method often fails to give a specific value for the optimal number of clusters if the input dataset has an irregular distribution. You can observe that in this article on the elbow method for k-prototypes clustering in Python.

Conclusion

In this article, we have discussed the implementation of the average silhouette coefficient approach for k-modes and k-prototypes clustering to find the optimal number of clusters in a dataset. To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in Python.

To explore data visualization, you can have a look at this article on how to avoid chart junk.

I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!
