In machine learning, we often need to analyze datasets having categorical variables. Generally, K-Means clustering is used as the partitioning clustering technique for numerical data. However, we cannot apply k-means clustering to categorical data. In this article, we will discuss another partitioning technique called K-Modes clustering. We will also discuss a numerical example to understand how the K-Modes clustering technique actually works.

What is K-Modes Clustering?

K-Modes clustering is an unsupervised machine learning technique. It is a partition clustering algorithm used to group a dataset into K clusters.  K-Modes clustering can be used in machine learning applications that need to partition data having categorical variables. 

K-Modes clustering is an iterative algorithm that starts by selecting k initial data points as centroids of the cluster. After that, each data point in the dataset is assigned to a cluster based on its similarity with the centroids. After creating clusters for the first time, we select a new centroid in each cluster using the mod of each feature in the data. After selecting new clusters, we calculate their dissimilarity from each data point and regroup the clusters. This process continues until the process converges and there is no change to the clusters in two consecutive iterations. 

The K-Modes clustering partitions the data into two mutually exclusive groups. Hence, it is termed a partitioning clustering algorithm. 

Why Cannot We Use K-means Clustering on Categorical Data?

To calculate the distance or dissimilarity between two data points in the dataset, the K-means clustering algorithm uses euclidean distance or Manhattan distance. Similarly, the k-means algorithm uses the mean of the features of the data points in a cluster to calculate centroids for new clusters. I have discussed the entire process of k-means clustering with a numerical example. If required, you can go through this article.

We are able to calculate the Euclidean distance, Manhattan distance, or mean of the features in a cluster because K-means clustering is primarily used for numeric data. These operations cannot be defined for a dataset having categorical features. 

Even if you annotate the data while data preprocessing, there is not much meaning to the assigned numerical labels in the dataset if there is no order. Hence, we cannot use k-means clustering on a dataset with categorical features. Instead, we use the K-Modes clustering algorithm.

Calculation of Dissimilarity Score Between Two Data Points in K-Modes Clustering

In K-Modes clustering, we cannot use Euclidean or Manhattan distance as a dissimilarity measure for the data points. Hence, we will use a different measure as discussed here. For calculating the dissimilarity measure, consider the following dataset.

The following table contains the grades of 15 students.

StudentSubject 1Subject 2Subject 3Subject 4Subject 5
Student 1ABABA
Student 2AABBA
Student 3CCBAC
Student 4CABBA
Student 5BAABC
Student 6ABBAC
Student 7BACCC
Student 8AAAAA
Student 9ACBBB
Student 10ABBAA
Student 11CCDBA
Student 12ACBBC
Student 13ABACB
Student 14BCCDB
Student 15ABBBB
Dataset for K-Modes Clustering

Suppose that we have to partition the above table into 3 clusters. For this, we need to define a dissimilarity score. For this, we will use the following procedure.

  • First, we will compare each feature of the data points. If the values in a feature are equal, we will assign a dissimilarity value of 0 to the feature. 
  • If the values in the corresponding features are different, we will assign a dissimilarity value of 0 to the feature.
  • The final dissimilarity score between two data points will be the sum of the dissimilarity value of each feature. 

For example, look at the following data points.

StudentSubject 1Subject 2Subject 3Subject 4Subject 5
Student 1ABABA
Student 12ACBBC
Dissimilarity scores01101
Calculation of Dissimilarity scores

Here, Subject 1 and Subject 4 have the same values for student 1 as well as student 12. Hence, these features have been assigned a dissimilarity value of 0.

Subject2, Subject 3, and Subject 5 have different values for both students. Hence, we have assigned the value 1 as the dissimilarity value for each of the features.

The final dissimilarity score between both the students is the sum of dissimilarity values of all the features i.e. 0+1+1+0+1 =3.

Now that we have discussed the calculation of dissimilarity score for the K-Modes clustering algorithm, let us now discuss how to calculate a new centroid for any given cluster.

Calculation of New Centroid in a Cluster

Suppose that the following data points belong to a single cluster.

StudentSubject 1Subject 2Subject 3Subject 4Subject 5
Student 5BAABC
Student 6ABBAC
Student 7BACCC
Student 8AAAAA
Student 9ACBBB
Student 10ABBAA
Mode of FeaturesAABAC
Calculation of Centroid

To calculate the new centroid for the cluster, we will calculate the mod of the values in each feature of the data points. Here, the mod is the value with the most frequency.

In the above cluster, A is the mode for feature Subject 1. Similarly, A, B, A, and C are the mod of features of Subject 2, Subject 3, and Subject 4 respectively.

Hence, the new centroid for the cluster will be (A, A, B, A, C).

K-Modes Clustering Algorithm

Till now, we have discussed how to calculate the dissimilarity score between two data points and how to calculate a new centroid for a cluster. Let us now discuss the K-Modes clustering algorithm.

  1. First, we randomly select k data points from the entire dataset and initialize them as centroids for new clusters. 
  2. Now, we calculate the dissimilarity score of each data point from the centroids. The algorithm for calculating the dissimilarity score has already been discussed.
  3. After calculating the dissimilarity score of each data point from the centroids, we will assign the data points to the clusters. The data points are assigned to the cluster with the centroid having the least dissimilarity to the data point.
  4. After assigning points to the clusters, we will calculate new centroids for each cluster using the method discussed above.
  5. After calculating the new centroid for each cluster, we will calculate the dissimilarity score of each data point in the dataset to the new centroids. Then, we will reassign the data points to the new clusters. 
  6. After reassigning the data points to the new clusters, we will calculate the new centroid for each cluster. If the new centroids are not the same as the existing centroids, we will repeat steps 4 to 6. Otherwise, we will stop the execution of the algorithm.

K-Modes Clustering Numerical Example

Having discussed the K-Modes clustering algorithm, let us now use a numerical example to understand the algorithm in a better way. We will use the same dataset given in the Student table for this. Assume that we have to partition the entire dataset into three clusters. For this, we will use the following steps.

Iteration 1

First, we will select three student records as centroids for the clusters. Let it be Student 1, Student 5, and Student 12. 

  • Centroid 1 (Student 1) = (A, B, A, B, A)
  • Centroid 2 (Student 5) = (B, A, A, B, C)
  • Centroid 3 (Student 12) = (A, C, B, B, C)

Now let us calculate the dissimilarity score of each data point from the centroids. The results have been tabulated below.

StudentSubject 1Subject 2Subject 3Subject 4Subject 5Dissimilarity with centroid 1Dissimilarity with centroid 2Dissimilarity with centroid 3Assigned Cluster
Student 1ABABA033Cluster 1
Student 2AABBA232Cluster 1
Student 3CCBAC542Cluster 3
Student 4CABBA333Cluster 1
Student 5BAABC303Cluster 2
Student 6ABBAC342Cluster 3
Student 7BACCC524Cluster 2
Student 8AAAAA234Cluster 1
Student 9ACBBB341Cluster 3
Student 10ABBAA253Cluster 1
Student 11CCDBA343Cluster 1
Student 12ACBBC330Cluster 3
Student 13ABACB244Cluster 1
Student 14BCCDB544Cluster 2
Student 15ABBBB242Cluster 1
Output Table for Iteration 1

In the above table, we have assigned each data point to a cluster based on the dissimilarity score. If a data point has the same dissimilarity value compared to the two centroids, we have assigned the data points to the cluster that comes first. 

Now, we will calculate new centroids for each cluster.

In cluster 1, we have the following rows.

Student 1ABABA
Student 2AABBA
Student 4CABBA
Student 8AAAAA
Student 10ABBAA
Student 11CCDBA
Student 13ABACB
Student 15ABBBB
Mode of AttributesABBBA
Cluster 1 in Iteration 1

After calculating the mode of attributes in the cluster, the new centroid of cluster 1 is (A, B, B, B, A).

Now let us calculate the centroid of cluster 2.

Student 5BAABC
Student 7BACCC
Student 14BCCDB
Mode of AttributesBACBC
Cluster 2 Iteration 1

The new centroid for cluster 2 is  (B, A, C, B, C).

Now, let us calculate the centroid for cluster 3.

Student 3CCBAC
Student 6ABBAC
Student 9ACBBB
Student 12ACBBC
Mode of attributesACBAC
Cluster 3 Iteration 1

New Centroid for Cluster 3  is (A, C, B, A, C ).

Iteration 2

So, the new centroids calculated for cluster 1, cluster 2, and cluster 3 are (A, B, B, B, A), (B, A, C, B, C), (A, C, B, A, C ) respectively.

Now, we will calculate the dissimilarity of each data point from the new centroids and reassign the data points to the clusters. The results have been tabulated below.

StudentSubject 1Subject 2Subject 3Subject 4Subject 5Dissimilarity with centroid 1Dissimilarity with centroid 2Dissimilarity with centroid 3Assigned Cluster
Student 1ABABA144Cluster 1
Student 2AABBA133Cluster 1
Student 3CCBAC441Cluster 3
Student 4CABBA234Cluster 1
Student 5BAABC414Cluster 2
Student 6ABBAC241Cluster 3
Student 7BACCC514Cluster 2
Student 8AAAAA343Cluster 1
Student 9ACBBB242Cluster 3
Student 10ABBAA152Cluster 1
Student 11CCDBA344Cluster 1
Student 12ACBBC231Cluster 3
Student 13ABACB354Cluster 1
Student 14BCCDB534Cluster 2
Student 15ABBBB143Cluster 1
Output Table for Iteration 2

You can observe that despite changing the centroids, no changes have happened in the clusters. Hence, the following are the final clusters calculated using K-Modes clustering. 

Cluster 1:

Student 1ABABA
Student 2AABBA
Student 4CABBA
Student 8AAAAA
Student 10ABBAA
Student 11CCDBA
Student 13ABACB
Student 15ABBBB
Final Cluster 1

Cluster 2

Student 5BAABC
Student 7BACCC
Student 14BCCDB
Final Cluster 2

Cluster 3

Student 3CCBAC
Student 6ABBAC
Student 9ACBBB
Student 12ACBBC
Final Cluster 3

Advantages of K-Modes Clustering

  • K-Modes clustering is an iterable algorithm. It doesn’t use any advanced mathematical constructs. Hence, it is easy to understand and implement the K-modes clustering algorithm.
  • K-Modes clustering is a scalable algorithm. You can use it for a small dataset as well as a large dataset having millions of entries. 
  • The k-Modes clustering algorithm is guaranteed to give us results. It guarantees convergence. Thus, we will get the result of the execution of the algorithm for sure.
  • K-modes clustering doesn’t apply to a specific domain. As long as you have a dataset having categorical attributes, you can use the K-Modes clustering algorithm to partition the dataset into different clusters.
  • The K-Modes clustering algorithm allows you to choose the initial centroids. You can warm-start the choice of centroids in a manner that suits the distribution of values in the dataset to minimize computation.

Disadvantages of K-Modes Clustering

  • In K-modes clustering, we don’t know the number of clusters that are required to partition the data. We need to try different values of K and decide the best k for clustering.
  • The computation required while clustering is hugely dependent on the choice of initial centroids. Hence, we need to be extra careful while choosing the initial centroids.
  • The maximum dissimilarity score between two data points is equal to the number of features in the dataset. Hence, it is very likely that a data point has the same dissimilarity score for two or more centroids. In such a case, we need to define a mechanism to select the centroid to which the data point will be assigned. 
  • The final clusters in K-Modes clustering are heavily dependent on the initial centroids.

Conclusion

In this article, we have discussed K-Modes clustering with a numerical example. We have also discussed the advantages and disadvantages of K-Modes clustering.

To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in python.

To read about other computer science topics, you can read this article on  dynamic role-based authorization using ASP.net. You can also read this article on user activity logging using Asp.net.

Author

Comments are closed.