K-Modes Clustering Algorithm With Numerical Example
In machine learning, we often need to analyze datasets having categorical variables. Generally, K-Means clustering is used as the partitioning clustering technique for numerical data. However, we cannot apply k-means clustering to categorical data. In this article, we will discuss another partitioning technique called K-Modes clustering. We will also discuss a numerical example to understand how the K-Modes clustering technique actually works.
What is K-Modes Clustering?
K-Modes clustering is an unsupervised machine learning technique. It is a partition clustering algorithm used to group a dataset into K clusters. K-Modes clustering can be used in machine learning applications that need to partition data having categorical variables.
K-Modes clustering is an iterative algorithm that starts by selecting k initial data points as centroids of the cluster. After that, each data point in the dataset is assigned to a cluster based on its similarity with the centroids. After creating clusters for the first time, we select a new centroid in each cluster using the mod of each feature in the data. After selecting new clusters, we calculate their dissimilarity from each data point and regroup the clusters. This process continues until the process converges and there is no change to the clusters in two consecutive iterations.
The K-Modes clustering partitions the data into two mutually exclusive groups. Hence, it is termed a partitioning clustering algorithm.
Why Cannot We Use K-means Clustering on Categorical Data?
To calculate the distance or dissimilarity between two data points in the dataset, the K-means clustering algorithm uses euclidean distance or Manhattan distance. Similarly, the k-means algorithm uses the mean of the features of the data points in a cluster to calculate centroids for new clusters. I have discussed the entire process of k-means clustering with a numerical example. If required, you can go through this article.
We are able to calculate the Euclidean distance, Manhattan distance, or mean of the features in a cluster because K-means clustering is primarily used for numeric data. These operations cannot be defined for a dataset having categorical features.
Even if you annotate the data while data preprocessing, there is not much meaning to the assigned numerical labels in the dataset if there is no order. Hence, we cannot use k-means clustering on a dataset with categorical features. Instead, we use the K-Modes clustering algorithm.
Calculation of Dissimilarity Score Between Two Data Points in K-Modes Clustering
In K-Modes clustering, we cannot use Euclidean or Manhattan distance as a dissimilarity measure for the data points. Hence, we will use a different measure as discussed here. For calculating the dissimilarity measure, consider the following dataset.
The following table contains the grades of 15 students.
Student | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 |
Student 1 | A | B | A | B | A |
Student 2 | A | A | B | B | A |
Student 3 | C | C | B | A | C |
Student 4 | C | A | B | B | A |
Student 5 | B | A | A | B | C |
Student 6 | A | B | B | A | C |
Student 7 | B | A | C | C | C |
Student 8 | A | A | A | A | A |
Student 9 | A | C | B | B | B |
Student 10 | A | B | B | A | A |
Student 11 | C | C | D | B | A |
Student 12 | A | C | B | B | C |
Student 13 | A | B | A | C | B |
Student 14 | B | C | C | D | B |
Student 15 | A | B | B | B | B |
Suppose that we have to partition the above table into 3 clusters. For this, we need to define a dissimilarity score. For this, we will use the following procedure.
- First, we will compare each feature of the data points. If the values in a feature are equal, we will assign a dissimilarity value of 0 to the feature.
- If the values in the corresponding features are different, we will assign a dissimilarity value of 0 to the feature.
- The final dissimilarity score between two data points will be the sum of the dissimilarity value of each feature.
For example, look at the following data points.
Student | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 |
Student 1 | A | B | A | B | A |
Student 12 | A | C | B | B | C |
Dissimilarity scores | 0 | 1 | 1 | 0 | 1 |
Here, Subject 1 and Subject 4 have the same values for student 1 as well as student 12. Hence, these features have been assigned a dissimilarity value of 0.
Subject2, Subject 3, and Subject 5 have different values for both students. Hence, we have assigned the value 1 as the dissimilarity value for each of the features.
The final dissimilarity score between both the students is the sum of dissimilarity values of all the features i.e. 0+1+1+0+1 =3.
Now that we have discussed the calculation of dissimilarity score for the K-Modes clustering algorithm, let us now discuss how to calculate a new centroid for any given cluster.
Calculation of New Centroid in a Cluster
Suppose that the following data points belong to a single cluster.
Student | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 |
Student 5 | B | A | A | B | C |
Student 6 | A | B | B | A | C |
Student 7 | B | A | C | C | C |
Student 8 | A | A | A | A | A |
Student 9 | A | C | B | B | B |
Student 10 | A | B | B | A | A |
Mode of Features | A | A | B | A | C |
To calculate the new centroid for the cluster, we will calculate the mod of the values in each feature of the data points. Here, the mod is the value with the most frequency.
In the above cluster, A is the mode for feature Subject 1. Similarly, A, B, A, and C are the mod of features of Subject 2, Subject 3, and Subject 4 respectively.
Hence, the new centroid for the cluster will be (A, A, B, A, C).
K-Modes Clustering Algorithm
Till now, we have discussed how to calculate the dissimilarity score between two data points and how to calculate a new centroid for a cluster. Let us now discuss the K-Modes clustering algorithm.
- First, we randomly select k data points from the entire dataset and initialize them as centroids for new clusters.
- Now, we calculate the dissimilarity score of each data point from the centroids. The algorithm for calculating the dissimilarity score has already been discussed.
- After calculating the dissimilarity score of each data point from the centroids, we will assign the data points to the clusters. The data points are assigned to the cluster with the centroid having the least dissimilarity to the data point.
- After assigning points to the clusters, we will calculate new centroids for each cluster using the method discussed above.
- After calculating the new centroid for each cluster, we will calculate the dissimilarity score of each data point in the dataset to the new centroids. Then, we will reassign the data points to the new clusters.
- After reassigning the data points to the new clusters, we will calculate the new centroid for each cluster. If the new centroids are not the same as the existing centroids, we will repeat steps 4 to 6. Otherwise, we will stop the execution of the algorithm.
K-Modes Clustering Numerical Example
Having discussed the K-Modes clustering algorithm, let us now use a numerical example to understand the algorithm in a better way. We will use the same dataset given in the Student table for this. Assume that we have to partition the entire dataset into three clusters. For this, we will use the following steps.
Iteration 1
First, we will select three student records as centroids for the clusters. Let it be Student 1, Student 5, and Student 12.
- Centroid 1 (Student 1) = (A, B, A, B, A)
- Centroid 2 (Student 5) = (B, A, A, B, C)
- Centroid 3 (Student 12) = (A, C, B, B, C)
Now let us calculate the dissimilarity score of each data point from the centroids. The results have been tabulated below.
Student | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 | Dissimilarity with centroid 1 | Dissimilarity with centroid 2 | Dissimilarity with centroid 3 | Assigned Cluster |
Student 1 | A | B | A | B | A | 0 | 3 | 3 | Cluster 1 |
Student 2 | A | A | B | B | A | 2 | 3 | 2 | Cluster 1 |
Student 3 | C | C | B | A | C | 5 | 4 | 2 | Cluster 3 |
Student 4 | C | A | B | B | A | 3 | 3 | 3 | Cluster 1 |
Student 5 | B | A | A | B | C | 3 | 0 | 3 | Cluster 2 |
Student 6 | A | B | B | A | C | 3 | 4 | 2 | Cluster 3 |
Student 7 | B | A | C | C | C | 5 | 2 | 4 | Cluster 2 |
Student 8 | A | A | A | A | A | 2 | 3 | 4 | Cluster 1 |
Student 9 | A | C | B | B | B | 3 | 4 | 1 | Cluster 3 |
Student 10 | A | B | B | A | A | 2 | 5 | 3 | Cluster 1 |
Student 11 | C | C | D | B | A | 3 | 4 | 3 | Cluster 1 |
Student 12 | A | C | B | B | C | 3 | 3 | 0 | Cluster 3 |
Student 13 | A | B | A | C | B | 2 | 4 | 4 | Cluster 1 |
Student 14 | B | C | C | D | B | 5 | 4 | 4 | Cluster 2 |
Student 15 | A | B | B | B | B | 2 | 4 | 2 | Cluster 1 |
In the above table, we have assigned each data point to a cluster based on the dissimilarity score. If a data point has the same dissimilarity value compared to the two centroids, we have assigned the data points to the cluster that comes first.
Now, we will calculate new centroids for each cluster.
In cluster 1, we have the following rows.
Student 1 | A | B | A | B | A |
Student 2 | A | A | B | B | A |
Student 4 | C | A | B | B | A |
Student 8 | A | A | A | A | A |
Student 10 | A | B | B | A | A |
Student 11 | C | C | D | B | A |
Student 13 | A | B | A | C | B |
Student 15 | A | B | B | B | B |
Mode of Attributes | A | B | B | B | A |
After calculating the mode of attributes in the cluster, the new centroid of cluster 1 is (A, B, B, B, A).
Now let us calculate the centroid of cluster 2.
Student 5 | B | A | A | B | C |
Student 7 | B | A | C | C | C |
Student 14 | B | C | C | D | B |
Mode of Attributes | B | A | C | B | C |
The new centroid for cluster 2 is (B, A, C, B, C).
Now, let us calculate the centroid for cluster 3.
Student 3 | C | C | B | A | C |
Student 6 | A | B | B | A | C |
Student 9 | A | C | B | B | B |
Student 12 | A | C | B | B | C |
Mode of attributes | A | C | B | A | C |
New Centroid for Cluster 3 is (A, C, B, A, C ).
Iteration 2
So, the new centroids calculated for cluster 1, cluster 2, and cluster 3 are (A, B, B, B, A), (B, A, C, B, C), (A, C, B, A, C ) respectively.
Now, we will calculate the dissimilarity of each data point from the new centroids and reassign the data points to the clusters. The results have been tabulated below.
Student | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 | Dissimilarity with centroid 1 | Dissimilarity with centroid 2 | Dissimilarity with centroid 3 | Assigned Cluster |
Student 1 | A | B | A | B | A | 1 | 4 | 4 | Cluster 1 |
Student 2 | A | A | B | B | A | 1 | 3 | 3 | Cluster 1 |
Student 3 | C | C | B | A | C | 4 | 4 | 1 | Cluster 3 |
Student 4 | C | A | B | B | A | 2 | 3 | 4 | Cluster 1 |
Student 5 | B | A | A | B | C | 4 | 1 | 4 | Cluster 2 |
Student 6 | A | B | B | A | C | 2 | 4 | 1 | Cluster 3 |
Student 7 | B | A | C | C | C | 5 | 1 | 4 | Cluster 2 |
Student 8 | A | A | A | A | A | 3 | 4 | 3 | Cluster 1 |
Student 9 | A | C | B | B | B | 2 | 4 | 2 | Cluster 3 |
Student 10 | A | B | B | A | A | 1 | 5 | 2 | Cluster 1 |
Student 11 | C | C | D | B | A | 3 | 4 | 4 | Cluster 1 |
Student 12 | A | C | B | B | C | 2 | 3 | 1 | Cluster 3 |
Student 13 | A | B | A | C | B | 3 | 5 | 4 | Cluster 1 |
Student 14 | B | C | C | D | B | 5 | 3 | 4 | Cluster 2 |
Student 15 | A | B | B | B | B | 1 | 4 | 3 | Cluster 1 |
You can observe that despite changing the centroids, no changes have happened in the clusters. Hence, the following are the final clusters calculated using K-Modes clustering.
Cluster 1:
Student 1 | A | B | A | B | A |
Student 2 | A | A | B | B | A |
Student 4 | C | A | B | B | A |
Student 8 | A | A | A | A | A |
Student 10 | A | B | B | A | A |
Student 11 | C | C | D | B | A |
Student 13 | A | B | A | C | B |
Student 15 | A | B | B | B | B |
Cluster 2
Student 5 | B | A | A | B | C |
Student 7 | B | A | C | C | C |
Student 14 | B | C | C | D | B |
Cluster 3
Student 3 | C | C | B | A | C |
Student 6 | A | B | B | A | C |
Student 9 | A | C | B | B | B |
Student 12 | A | C | B | B | C |
Advantages of K-Modes Clustering
- K-Modes clustering is an iterable algorithm. It doesn’t use any advanced mathematical constructs. Hence, it is easy to understand and implement the K-modes clustering algorithm.
- K-Modes clustering is a scalable algorithm. You can use it for a small dataset as well as a large dataset having millions of entries.
- The k-Modes clustering algorithm is guaranteed to give us results. It guarantees convergence. Thus, we will get the result of the execution of the algorithm for sure.
- K-modes clustering doesn’t apply to a specific domain. As long as you have a dataset having categorical attributes, you can use the K-Modes clustering algorithm to partition the dataset into different clusters.
- The K-Modes clustering algorithm allows you to choose the initial centroids. You can warm-start the choice of centroids in a manner that suits the distribution of values in the dataset to minimize computation.
Disadvantages of K-Modes Clustering
- In K-modes clustering, we don’t know the number of clusters that are required to partition the data. We need to try different values of K and decide the best k for clustering.
- The computation required while clustering is hugely dependent on the choice of initial centroids. Hence, we need to be extra careful while choosing the initial centroids.
- The maximum dissimilarity score between two data points is equal to the number of features in the dataset. Hence, it is very likely that a data point has the same dissimilarity score for two or more centroids. In such a case, we need to define a mechanism to select the centroid to which the data point will be assigned.
- The final clusters in K-Modes clustering are heavily dependent on the initial centroids.
Conclusion
In this article, we have discussed K-Modes clustering with a numerical example. We have also discussed the advantages and disadvantages of K-Modes clustering.
To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in python.
To read about other computer science topics, you can read this article on dynamic role-based authorization using ASP.net. You can also read this article on user activity logging using Asp.net.