K-Prototypes Clustering With Numerical Example

Clustering algorithms are unsupervised machine learning algorithms used to segment data into various clusters. In this article, we will discuss the K-Prototypes clustering algorithm with a numerical example. We will also discuss the advantages and disadvantages of the k-prototypes clustering algorithm.

What is K-Prototypes clustering?

K-Prototypes clustering is a partitioning clustering algorithm. We use k-prototypes clustering to cluster datasets that have categorical as well as numerical attributes. The k-prototypes algorithm combines the k-means and k-modes clustering algorithms, which is why it can handle both numerical and categorical data.

To understand the k-prototypes clustering in a better way, I would first suggest you read k-means clustering with a numerical example and k-modes clustering with a numerical example. 

In k-prototypes clustering, we select k prototypes randomly at the start. After that, we calculate the distance between each data point and the prototypes. Accordingly, all the data points are assigned to the clusters associated with the different prototypes.

After assigning data points to the clusters, we calculate a new prototype for each cluster using the method discussed in the next sections. After that, we recalculate the distances of the data points from the prototypes and reassign the clusters. This process continues until the clusters converge.

Before getting into the numerical example for k-prototypes clustering, let us first discuss the distance measures and the calculation of prototypes in the k-prototypes clustering algorithm.

Distance Measures in K-Prototypes Clustering

As the k-prototypes clustering algorithm deals with data having numerical as well as categorical attributes, it uses a different measure for each data type.

For numerical data, the k-prototypes clustering algorithm uses the squared Euclidean distance as the distance measure. For instance, if we are given two data points (1, 2, 3) and (4, 3, 3), the distance between them is calculated as (1-4)^2 + (2-3)^2 + (3-3)^2 = 9 + 1 + 0, which is equal to 10.
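To make the calculation concrete, here is a minimal Python helper for the squared Euclidean distance (the function name is our own, not a library function):

```python
# Sum of squared differences over the numeric attributes.
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(squared_euclidean((1, 2, 3), (4, 3, 3)))  # 9 + 1 + 0 = 10
```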

For categorical attributes, the k-prototypes clustering algorithm uses matching dissimilarity. If you have two records (A, B, C, D) and (A, D, C, C) with categorical attributes, the matching dissimilarity is the number of positions at which the values differ. In the given records, the values differ at two positions only (the second and the fourth). Hence, the matching dissimilarity between the records is 2.
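The matching dissimilarity helper is equally short (again, a sketch with our own naming):

```python
# Count the positions where two categorical records disagree
# (True counts as 1 when summed).
def matching_dissimilarity(a, b):
    return sum(x != y for x, y in zip(a, b))

print(matching_dissimilarity(("A", "B", "C", "D"), ("A", "D", "C", "C")))  # 2
```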

For records having mixed attributes, we calculate the dissimilarity of the categorical attributes and the distance of the numerical attributes separately. After that, we use the sum of the two scores as the distance between the records. For instance, consider that we have two records [‘A’, ‘B’, ‘F’, 155, 53] and [‘A’, ‘A’, ‘M’, 174, 70].

To find the distance between these two records, we will first find the dissimilarity score between [‘A’, ‘B’, ‘F’] and [‘A’, ‘A’, ‘M’]. The score is 2, as two attributes out of three have different values.

Next, we will calculate the squared Euclidean distance between [155, 53] and [174, 70]. Here, (155-174)^2 + (53-70)^2 = 361 + 289, which is equal to 650.

Now, we can directly calculate the total dissimilarity score as the sum of the dissimilarity score of the categorical attributes and the squared Euclidean distance of the numerical attributes. Here, the sum is equal to 2 + 650 = 652.
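Combining the two helpers defined above reproduces this calculation:

```python
record1 = ["A", "B", "F", 155, 53]
record2 = ["A", "A", "M", 174, 70]

# The first three attributes are categorical, the last two are numerical.
cat_score = matching_dissimilarity(record1[:3], record2[:3])  # 2
num_score = squared_euclidean(record1[3:], record2[3:])       # 650
print(cat_score + num_score)                                  # 652
```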

Observe that the matching dissimilarity score of the categorical attributes is almost negligible compared to the squared Euclidean distance between the numerical attributes. Hence, the categorical attributes would have little or no effect on the clustering.

To solve this problem, we can scale the values of the numeric attributes into a range of, say, 0 to 5. Alternatively, we can take a weighted sum of the matching dissimilarity score and the squared Euclidean distance. In this article, we will work through a numerical example of the k-prototypes clustering algorithm using both ideas: we scale the numeric attributes into the range 0 to 5, and, in the distance tables, we additionally weight the squared Euclidean term by a factor of 0.1.

Choice of New Prototypes in K-Prototypes Clustering

Once a cluster is formed, we need to calculate a new prototype for the cluster using the data points in the current cluster.

To calculate the new prototype for any given cluster, we take the mode of the categorical attributes of the data points in the cluster. For the numerical attributes, we use the mean of the values. For example, suppose that we have the following data points in a cluster.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| C | A | F | 5.000000 | 4.705882 |
| B | A | M | 4.175824 | 4.470588 |
| B | A | F | 4.807692 | 3.235294 |
| A | A | F | 4.972527 | 4.117647 |
| B | C | F | 4.642857 | 3.470588 |

Points in a Cluster

In the above cluster, we will take the mode of the values in the EQ Rating, IQ Rating, and Gender attributes. For the Height and Weight attributes, we will take the mean of the values. Hence, the prototype for the above cluster will be [B, A, F, 4.71978, 3.999999].

Here, B is the mode of the values in the EQ Rating column, A is the mode of the values in the IQ Rating column, and F is the mode of the values in the Gender column. Similarly, 4.71978 and 3.999999 are the means of the Height and Weight attributes respectively.
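The following sketch reproduces this prototype calculation with Python's statistics module; the split into three categorical and two numeric columns follows the table above:

```python
from statistics import mean, mode

# The cluster from the table above: three categorical columns
# followed by two numeric columns.
cluster = [
    ["C", "A", "F", 5.000000, 4.705882],
    ["B", "A", "M", 4.175824, 4.470588],
    ["B", "A", "F", 4.807692, 3.235294],
    ["A", "A", "F", 4.972527, 4.117647],
    ["B", "C", "F", 4.642857, 3.470588],
]

# Mode for the categorical columns, mean for the numeric columns.
prototype = [mode(row[i] for row in cluster) for i in range(3)] \
          + [round(mean(row[i] for row in cluster), 6) for i in range(3, 5)]
print(prototype)  # ['B', 'A', 'F', 4.71978, 4.0] (weight ≈ 3.999999)
```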

The K-Prototypes Clustering Algorithm

The K-prototypes clustering algorithm is similar to the k-means clustering algorithm. The difference lies in how we calculate the distance between two data points and how we select new prototypes for clusters. The steps in the k-prototypes clustering algorithm are discussed below.

  1. First, we select K data points from the input dataset as initial prototypes. 
  2. We then find the distance of each data point from the current prototypes. The distances are calculated as discussed in the previous sections.
  3. After finding the distance of each data point from the prototypes, we assign data points to clusters. Here, each data point is assigned to the cluster with the prototype nearest to the data point. 
  4. After assigning data points to the clusters, we calculate new prototypes for each cluster. To calculate the prototypes, we take the mean of numeric attributes and the mode of categorical attributes as discussed previously.
  5. If the new prototypes are the same as the previous prototypes, we say that the algorithm has converged, and the current clusters are final. Otherwise, we go back to step 2.
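Putting these five steps together, here is a minimal, self-contained Python sketch of the loop. The function and parameter names are our own, and gamma is an optional weight on the numeric term (the weighted-sum option mentioned earlier); this is an illustrative sketch, not an official implementation:

```python
import random
from statistics import mean, mode

def mixed_distance(a, b, cat_idx, num_idx, gamma=1.0):
    """Matching dissimilarity on the categorical columns plus
    gamma times the squared Euclidean distance on the numeric columns."""
    cat = sum(a[i] != b[i] for i in cat_idx)
    num = sum((a[i] - b[i]) ** 2 for i in num_idx)
    return cat + gamma * num

def k_prototypes(data, k, cat_idx, num_idx, gamma=1.0, max_iter=100, seed=0):
    random.seed(seed)
    prototypes = random.sample(data, k)          # step 1: initial prototypes
    clusters = []
    for _ in range(max_iter):
        # Steps 2 and 3: assign every point to its nearest prototype.
        clusters = [[] for _ in range(k)]
        for row in data:
            nearest = min(range(k), key=lambda c: mixed_distance(
                row, prototypes[c], cat_idx, num_idx, gamma))
            clusters[nearest].append(row)
        # Step 4: recompute each prototype (mode / mean per column type).
        # mode() breaks ties by first occurrence (Python 3.8+).
        new_prototypes = []
        for cluster, old in zip(clusters, prototypes):
            if not cluster:                      # keep the old prototype
                new_prototypes.append(old)       # if a cluster went empty
                continue
            new_prototypes.append([
                mode(r[i] for r in cluster) if i in cat_idx
                else mean(r[i] for r in cluster)
                for i in range(len(cluster[0]))])
        # Step 5: stop once the prototypes no longer change.
        if new_prototypes == prototypes:
            break
        prototypes = new_prototypes
    return prototypes, clusters
```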

Now that we have introduced you to the k-prototypes clustering algorithm, let us discuss a numerical example of k-prototype clustering with a small sample dataset.

K-Prototypes Clustering Numerical Example

Following is the dataset we will use to understand the k-prototypes clustering algorithm through a numerical example. It contains the EQ Rating, IQ Rating, Gender, Height, and Weight of the students in a given class.

| Student | EQ Rating | IQ Rating | Gender | Height (in cm) | Weight (in kg) |
|---|---|---|---|---|---|
| Student 1 | A | B | F | 155 | 53 |
| Student 2 | A | A | M | 174 | 70 |
| Student 3 | C | C | M | 177 | 75 |
| Student 4 | C | A | F | 182 | 80 |
| Student 5 | B | A | M | 152 | 76 |
| Student 6 | A | B | M | 160 | 69 |
| Student 7 | B | A | F | 175 | 55 |
| Student 8 | A | A | F | 181 | 70 |
| Student 9 | A | C | M | 180 | 85 |
| Student 10 | A | B | F | 166 | 54 |
| Student 11 | C | C | M | 162 | 66 |
| Student 12 | A | C | M | 153 | 74 |
| Student 13 | A | B | M | 160 | 62 |
| Student 14 | B | C | F | 169 | 59 |
| Student 15 | A | B | F | 171 | 71 |

Dataset For K-Prototypes Clustering

The above table isn’t normalized, so the dissimilarity scores of the categorical attributes are negligible compared to the differences between the numerical attributes, and the clustering would be biased towards height and weight. To avoid this bias, we will normalize the dataset as shown in the following table.

| Student | EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|---|
| Student 1 | A | B | F | 4.258242 | 3.117647 |
| Student 2 | A | A | M | 4.780220 | 4.117647 |
| Student 3 | C | C | M | 4.862637 | 4.411765 |
| Student 4 | C | A | F | 5.000000 | 4.705882 |
| Student 5 | B | A | M | 4.175824 | 4.470588 |
| Student 6 | A | B | M | 4.395604 | 4.058824 |
| Student 7 | B | A | F | 4.807692 | 3.235294 |
| Student 8 | A | A | F | 4.972527 | 4.117647 |
| Student 9 | A | C | M | 4.945055 | 5.000000 |
| Student 10 | A | B | F | 4.560440 | 3.176471 |
| Student 11 | C | C | M | 4.450549 | 3.882353 |
| Student 12 | A | C | M | 4.203297 | 4.352941 |
| Student 13 | A | B | M | 4.395604 | 3.647059 |
| Student 14 | B | C | F | 4.642857 | 3.470588 |
| Student 15 | A | B | F | 4.697802 | 4.176471 |

Normalized dataset

In the above table, we have normalized the Height and Weight attributes into the range 0 to 5. Now, the distances between the numeric attributes are no longer large compared to the dissimilarities between the categorical attributes.
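For reference, the normalized values above appear to have been produced by dividing each value by its column maximum and multiplying by 5 (so the largest value maps exactly to 5), rather than by standard min-max scaling. A minimal sketch that reproduces the first few rows:

```python
heights = [155, 174, 177, 182, 152, 160, 175, 181, 180, 166, 162, 153, 160, 169, 171]
weights = [53, 70, 75, 80, 76, 69, 55, 70, 85, 54, 66, 74, 62, 59, 71]

def scale_to_0_5(values):
    # Divide by the column maximum and multiply by 5.
    top = max(values)
    return [round(5 * v / top, 6) for v in values]

print(scale_to_0_5(heights)[:3])  # [4.258242, 4.78022, 4.862637]
print(scale_to_0_5(weights)[:3])  # [3.117647, 4.117647, 4.411765]
```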

Let us now work through the k-prototypes numerical example using this normalized dataset with mixed data types.

K-Prototypes Clustering With Numerical Example: Iteration 1

Suppose that we want to partition the input dataset into 3 clusters. Hence, we will select three data points as the initial prototypes. Here, we select Student 5, Student 9, and Student 13:

  • prototype1 = [‘B’, ‘A’, ‘M’, 4.175824, 4.470588]
  • prototype2 = [‘A’, ‘C’, ‘M’, 4.945055, 5.0]
  • prototype3 = [‘A’, ‘B’, ‘M’, 4.395604, 3.647059]

Now, let us calculate the distance of each data point from the prototypes, using the matching dissimilarity and squared Euclidean distance measures discussed in the previous sections. One detail is worth spelling out: in the tables below, the squared Euclidean part is multiplied by the weight 0.1 before being added to the categorical mismatch count, so the integer part of every tabulated distance is simply the number of mismatched categorical attributes. The distances are tabulated below.

| EQ Rating | IQ Rating | Gender | Height | Weight | Distance From Prototype 1 | Distance From Prototype 2 | Distance From Prototype 3 | Assigned Cluster |
|---|---|---|---|---|---|---|---|---|
| A | B | F | 4.258242 | 3.117647 | 3.183724 | 2.401496 | 1.029915 | Cluster 3 |
| A | A | M | 4.780220 | 4.117647 | 1.048986 | 1.080572 | 1.036938 | Cluster 3 |
| C | C | M | 4.862637 | 4.411765 | 2.047517 | 1.035281 | 2.080289 | Cluster 2 |
| C | A | F | 5.000000 | 4.705882 | 2.073463 | 3.008952 | 3.148640 | Cluster 1 |
| B | A | M | 4.175824 | 4.470588 | 0.000000 | 2.087199 | 2.072650 | Cluster 1 |
| A | B | M | 4.395604 | 4.058824 | 2.021785 | 1.118771 | 0.016955 | Cluster 3 |
| B | A | F | 4.807692 | 3.235294 | 1.192521 | 3.313306 | 3.033937 | Cluster 1 |
| A | A | F | 4.972527 | 4.117647 | 2.075930 | 2.077930 | 2.055429 | Cluster 3 |
| A | C | M | 4.945055 | 5.000000 | 2.087199 | 0.000000 | 1.213235 | Cluster 2 |
| A | B | F | 4.560440 | 3.176471 | 3.182267 | 2.347319 | 1.024862 | Cluster 3 |
| C | C | M | 4.450549 | 3.882353 | 2.042149 | 1.149367 | 2.005838 | Cluster 2 |
| A | C | M | 4.203297 | 4.352941 | 2.001460 | 0.096889 | 1.053525 | Cluster 2 |
| A | B | M | 4.395604 | 3.647059 | 2.072650 | 1.213235 | 0.000000 | Cluster 3 |
| B | C | F | 4.642857 | 3.470588 | 2.121812 | 2.243042 | 3.009228 | Cluster 1 |
| A | B | F | 4.697802 | 4.176471 | 3.035897 | 2.073933 | 1.037160 | Cluster 3 |

Iteration 1

In the above table, we have first calculated the distance of each data point from the initial prototypes. Then, we have assigned each data point to the cluster of its nearest prototype.
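As a sanity check, we can reproduce the first row of the table with the mixed_distance helper from the algorithm sketch above, using the weight 0.1 on the numeric term as noted earlier:

```python
# Student 1 against the three initial prototypes (gamma = 0.1).
student1   = ["A", "B", "F", 4.258242, 3.117647]
prototypes = [["B", "A", "M", 4.175824, 4.470588],
              ["A", "C", "M", 4.945055, 5.0],
              ["A", "B", "M", 4.395604, 3.647059]]

dists = [mixed_distance(student1, p, [0, 1, 2], [3, 4], gamma=0.1)
         for p in prototypes]
print([round(d, 6) for d in dists])
# ≈ [3.183724, 2.401496, 1.029915] -> prototype 3 is nearest, so cluster 3
```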

After creating the clusters, we will calculate the new prototypes for each cluster using the mode of categorical attributes and mean of the numerical attributes. 

In cluster 1, we have the following points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| C | A | F | 5.000000 | 4.705882 |
| B | A | M | 4.175824 | 4.470588 |
| B | A | F | 4.807692 | 3.235294 |
| B | C | F | 4.642857 | 3.470588 |

Cluster 1

The prototype for this cluster will be [B, A, F, 4.656593, 3.970588].

We have the following points in the second cluster.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| C | C | M | 4.862637 | 4.411765 |
| A | C | M | 4.945055 | 5.000000 |
| C | C | M | 4.450549 | 3.882353 |
| A | C | M | 4.203297 | 4.352941 |

Cluster 2

Hence, the prototype for the second cluster will be [A, C, M, 4.615384, 4.411764]. Note that the EQ Rating column is bimodal here (A and C appear twice each); such ties are broken arbitrarily, and we pick A.

Cluster 3 has the following points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| A | B | F | 4.258242 | 3.117647 |
| A | A | M | 4.780220 | 4.117647 |
| A | B | M | 4.395604 | 4.058824 |
| A | A | F | 4.972527 | 4.117647 |
| A | B | F | 4.560440 | 3.176471 |
| A | B | M | 4.395604 | 3.647059 |
| A | B | F | 4.697802 | 4.176471 |

Cluster 3

Hence, the prototype for cluster 3 is [A, B, F, 4.580062, 3.773109].

After iteration 1, we have the following prototypes.

  • prototype1 = [B, A, F, 4.656593, 3.970588]
  • prototype2 = [A, C, M, 4.615384, 4.411764]
  • prototype3 = [A, B, F, 4.580062, 3.773109]

You can observe that the current prototypes are not the same as the initial prototypes. Hence, we will calculate the distance of the data points in the dataset to these prototypes and reassign the points to the clusters.

K-Prototypes Clustering With Numerical Example: Iteration 2

The distances of each point from the prototypes calculated in the first iteration are tabulated below.

| EQ Rating | IQ Rating | Gender | Height | Weight | Distance From Prototype 1 | Distance From Prototype 2 | Distance From Prototype 3 | Assigned Cluster |
|---|---|---|---|---|---|---|---|---|
| A | B | F | 4.258242 | 3.117647 | 2.088619 | 2.180229 | 0.053320 | Cluster 3 |
| A | A | M | 4.780220 | 4.117647 | 2.003691 | 1.011368 | 2.015877 | Cluster 2 |
| C | C | M | 4.862637 | 4.411765 | 3.023709 | 1.006113 | 3.048773 | Cluster 2 |
| C | A | F | 5.000000 | 4.705882 | 1.065859 | 3.023444 | 2.104641 | Cluster 1 |
| B | A | M | 4.175824 | 4.470588 | 1.048114 | 2.019667 | 3.064989 | Cluster 1 |
| A | B | M | 4.395604 | 4.058824 | 3.007590 | 1.017287 | 1.011566 | Cluster 3 |
| B | A | F | 4.807692 | 3.235294 | 0.056349 | 3.142106 | 2.034106 | Cluster 1 |
| A | A | F | 4.972527 | 4.117647 | 1.012144 | 2.021406 | 1.027274 | Cluster 1 |
| A | C | M | 4.945055 | 5.000000 | 3.114290 | 0.045470 | 2.163848 | Cluster 2 |
| A | B | F | 4.560440 | 3.176471 | 2.063987 | 2.152897 | 0.035636 | Cluster 3 |
| C | C | M | 4.450549 | 3.882353 | 3.005024 | 1.030745 | 3.002871 | Cluster 2 |
| A | C | M | 4.203297 | 4.352941 | 3.035167 | 0.017328 | 2.047816 | Cluster 2 |
| A | B | M | 4.395604 | 3.647059 | 3.017279 | 1.063308 | 1.004991 | Cluster 3 |
| B | C | F | 4.642857 | 3.470588 | 1.025019 | 2.088657 | 2.009546 | Cluster 1 |
| A | B | F | 4.697802 | 4.176471 | 2.004409 | 2.006216 | 0.017656 | Cluster 3 |

Iteration 2

In the above table, we have reassigned the data points to the clusters based on their distances from the prototypes calculated in iteration 1; Student 2 has moved to cluster 2, and Student 8 has moved to cluster 1. Now, we will again calculate the prototype for each cluster.

In cluster 1, we have the following points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| C | A | F | 5.000000 | 4.705882 |
| B | A | M | 4.175824 | 4.470588 |
| B | A | F | 4.807692 | 3.235294 |
| A | A | F | 4.972527 | 4.117647 |
| B | C | F | 4.642857 | 3.470588 |

Cluster 1

The prototype for the above cluster is [B, A, F, 4.71978, 3.999999].

In cluster 2, we have the following points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| A | A | M | 4.780220 | 4.117647 |
| C | C | M | 4.862637 | 4.411765 |
| A | C | M | 4.945055 | 5.000000 |
| C | C | M | 4.450549 | 3.882353 |
| A | C | M | 4.203297 | 4.352941 |

Cluster 2

The prototype for cluster 2 is [A, C, M, 4.648351, 4.3529412].

In cluster 3, we have the following data points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| A | B | F | 4.258242 | 3.117647 |
| A | B | M | 4.395604 | 4.058824 |
| A | B | F | 4.560440 | 3.176471 |
| A | B | M | 4.395604 | 3.647059 |
| A | B | F | 4.697802 | 4.176471 |

Cluster 3

The prototype for cluster 3 is [A, B, F, 4.461538, 3.635294].

After iteration 2, we have the following prototypes.

  • prototype1 = [B, A, F, 4.71978, 3.999999]
  • prototype2 = [A, C, M, 4.648351, 4.3529412]
  • prototype3 = [A, B, F, 4.461538, 3.635294]

You can observe that the current prototypes are not the same as the prototypes calculated after iteration 1. Hence, we will calculate the distance of the data points in the dataset to these prototypes and reassign the points to the clusters.

K-Prototypes Clustering With Numerical Example: Iteration 3

The following table contains the distance of each data point from the prototypes calculated in iteration 2.

| EQ Rating | IQ Rating | Gender | Height | Weight | Distance From Prototype 1 | Distance From Prototype 2 | Distance From Prototype 3 | Assigned Cluster |
|---|---|---|---|---|---|---|---|---|
| A | B | F | 4.258242 | 3.117647 | 2.099156 | 2.167814 | 0.030929 | Cluster 3 |
| A | A | M | 4.780220 | 4.117647 | 2.001749 | 1.007275 | 2.033422 | Cluster 2 |
| C | C | M | 4.862637 | 4.411765 | 3.018996 | 1.004938 | 3.076379 | Cluster 2 |
| C | A | F | 5.000000 | 4.705882 | 1.057679 | 3.024822 | 2.143610 | Cluster 1 |
| B | A | M | 4.175824 | 4.470588 | 1.051734 | 2.023712 | 3.077935 | Cluster 1 |
| A | B | M | 4.395604 | 4.058824 | 3.010855 | 1.015039 | 1.018372 | Cluster 2 |
| B | A | F | 4.807692 | 3.235294 | 0.059250 | 3.127452 | 2.027982 | Cluster 1 |
| A | A | F | 4.972527 | 4.117647 | 1.007772 | 2.016045 | 1.049377 | Cluster 1 |
| A | C | M | 4.945055 | 5.000000 | 3.105075 | 0.050672 | 2.209621 | Cluster 2 |
| A | B | F | 4.560440 | 3.176471 | 2.070359 | 2.139181 | 0.022030 | Cluster 3 |
| C | C | M | 4.450549 | 3.882353 | 3.008633 | 1.026058 | 3.006116 | Cluster 2 |
| A | C | M | 4.203297 | 4.352941 | 3.039132 | 0.019807 | 2.058171 | Cluster 2 |
| A | B | M | 4.395604 | 3.647059 | 3.022966 | 1.056215 | 1.000449 | Cluster 3 |
| B | C | F | 4.642857 | 3.470588 | 1.028619 | 2.077858 | 2.006000 | Cluster 1 |
| A | B | F | 4.697802 | 4.176471 | 2.003163 | 2.003359 | 0.034869 | Cluster 3 |

Iteration 3

In the above table, only one point has changed its cluster: Student 6 has moved from cluster 3 to cluster 2. Now, let us recalculate the prototype of each cluster.

In cluster 1, we have the following data points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| C | A | F | 5.000000 | 4.705882 |
| B | A | M | 4.175824 | 4.470588 |
| B | A | F | 4.807692 | 3.235294 |
| A | A | F | 4.972527 | 4.117647 |
| B | C | F | 4.642857 | 3.470588 |

Cluster 1

The prototype for cluster 1 will be [B, A, F, 4.71978, 3.999999].

In cluster 2, we have the following data points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| A | A | M | 4.780220 | 4.117647 |
| C | C | M | 4.862637 | 4.411765 |
| A | B | M | 4.395604 | 4.058824 |
| A | C | M | 4.945055 | 5.000000 |
| C | C | M | 4.450549 | 3.882353 |
| A | C | M | 4.203297 | 4.352941 |

Cluster 2

The prototype for cluster 2 will be [A, C, M, 4.606227, 4.303922].

In cluster 3, we have the following data points.

| EQ Rating | IQ Rating | Gender | Height | Weight |
|---|---|---|---|---|
| A | B | F | 4.258242 | 3.117647 |
| A | B | F | 4.560440 | 3.176471 |
| A | B | M | 4.395604 | 3.647059 |
| A | B | F | 4.697802 | 4.176471 |

Cluster 3

The prototype for the above cluster will be [A, B, F, 4.478022, 3.529412].

After iteration 3, we have the following prototypes.

  • prototype1 = [B, A, F, 4.71978, 3.999999]
  • prototype2 = [A, C, M, 4.606227, 4.303922]
  • prototype3 = [A, B, F, 4.478022, 3.529412]

You can observe that the current prototypes are not the same as the prototypes calculated after iteration 2. Hence, we will calculate the distance of the data points in the dataset to these prototypes and reassign the points to the clusters.

K-Prototypes Clustering With Numerical Example: Iteration 4

The following table contains the distance of points from the prototypes calculated in iteration 3 along with the reassigned clusters.

| EQ Rating | IQ Rating | Gender | Height | Weight | Distance From Prototype 1 | Distance From Prototype 2 | Distance From Prototype 3 | Assigned Cluster |
|---|---|---|---|---|---|---|---|---|
| A | B | F | 4.258242 | 3.117647 | 2.099156 | 2.152834 | 0.021785 | Cluster 3 |
| A | A | M | 4.780220 | 4.117647 | 2.001749 | 1.006497 | 2.043734 | Cluster 2 |
| C | C | M | 4.862637 | 4.411765 | 3.018996 | 1.007738 | 3.092648 | Cluster 2 |
| C | A | F | 5.000000 | 4.705882 | 1.057679 | 3.031663 | 2.165654 | Cluster 1 |
| B | A | M | 4.175824 | 4.470588 | 1.051734 | 2.021302 | 3.097714 | Cluster 1 |
| A | B | M | 4.395604 | 4.058824 | 3.010855 | 1.010444 | 1.028707 | Cluster 2 |
| B | A | F | 4.807692 | 3.235294 | 0.059250 | 3.118255 | 2.019519 | Cluster 1 |
| A | A | F | 4.972527 | 4.117647 | 1.007772 | 2.016887 | 1.059056 | Cluster 1 |
| A | C | M | 4.945055 | 5.000000 | 3.105075 | 0.059933 | 2.238075 | Cluster 2 |
| A | B | F | 4.560440 | 3.176471 | 2.070359 | 2.127324 | 0.013136 | Cluster 3 |
| C | C | M | 4.450549 | 3.882353 | 3.008633 | 1.020196 | 3.012532 | Cluster 2 |
| A | C | M | 4.203297 | 4.352941 | 3.039132 | 0.016476 | 2.075367 | Cluster 2 |
| A | B | M | 4.395604 | 3.647059 | 3.022966 | 1.047583 | 1.002063 | Cluster 3 |
| B | C | F | 4.642857 | 3.470588 | 1.028619 | 2.069579 | 2.003063 | Cluster 1 |
| A | B | F | 4.697802 | 4.176471 | 2.003163 | 2.002463 | 0.046699 | Cluster 3 |

Iteration 4

You can observe that the points in the clusters haven’t changed. Hence, the prototype of each cluster will also not change. Thus, our algorithm has converged, and the clusters calculated in the previous iteration are the final output clusters.
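In practice, you would rarely run these iterations by hand. As a sketch, here is how the same dataset could be clustered with the third-party kmodes package (pip install kmodes); this tooling is our assumption, not part of the original walkthrough, and the resulting labels may differ from the hand computation because the library picks its own initial prototypes and numeric weighting:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# The normalized dataset: EQ Rating, IQ Rating, Gender, Height, Weight.
X = np.array([
    ["A", "B", "F", 4.258242, 3.117647], ["A", "A", "M", 4.780220, 4.117647],
    ["C", "C", "M", 4.862637, 4.411765], ["C", "A", "F", 5.000000, 4.705882],
    ["B", "A", "M", 4.175824, 4.470588], ["A", "B", "M", 4.395604, 4.058824],
    ["B", "A", "F", 4.807692, 3.235294], ["A", "A", "F", 4.972527, 4.117647],
    ["A", "C", "M", 4.945055, 5.000000], ["A", "B", "F", 4.560440, 3.176471],
    ["C", "C", "M", 4.450549, 3.882353], ["A", "C", "M", 4.203297, 4.352941],
    ["A", "B", "M", 4.395604, 3.647059], ["B", "C", "F", 4.642857, 3.470588],
    ["A", "B", "F", 4.697802, 4.176471],
], dtype=object)

kproto = KPrototypes(n_clusters=3, init="Huang", random_state=42)
labels = kproto.fit_predict(X, categorical=[0, 1, 2])
print(labels)                     # cluster index assigned to each student
print(kproto.cluster_centroids_)  # the final prototypes
```

The categorical argument tells the library which column indices hold categorical attributes; the remaining columns are treated as numeric.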


Advantages of K-prototype Clustering

  • The first advantage of k-prototypes clustering is that it can be used for datasets with mixed data types. Since we can cluster datasets with both numerical and categorical attributes, the algorithm is highly usable in real-life situations.
  • The k-prototypes clustering algorithm is easy to implement. It uses simple mathematical constructs and is an iterative algorithm, which makes it easy to understand and implement.
  • K-prototypes clustering is a scalable algorithm. You can use it for small as well as large datasets.
  • The k-prototypes clustering algorithm is guaranteed to converge, so we are certain to obtain output clusters when we execute it.
  • The k-prototypes clustering algorithm isn’t restricted to a particular domain. As long as we have structured datasets with numeric and categorical attributes, we can use it in any domain.
  • The number of iterations and the shape of the clusters depend heavily on the initial prototypes, and the algorithm allows us to warm-start this choice. Hence, after data preprocessing and exploratory analysis, you can choose suitable initial prototypes so that the number of iterations is minimized.

Disadvantages of K-Prototypes Clustering

  • The k-prototypes clustering algorithm, just like the k-means and k-modes clustering algorithms, suffers from the curse of dimensionality. As the number of attributes in the input dataset grows, the dissimilarities between data points become almost identical. In that case, we need a tie-breaking rule to decide which cluster a data point is assigned to when it is equidistant from two prototypes.
  • In k-prototypes clustering, we don’t know the optimal number of clusters in advance. Due to this, we need to try different values of k to find the one that clusters the data best (see the elbow-method sketch after this list).
  • The number of iterations needed for convergence depends on the choice of initial prototypes. If the prototypes aren’t selected well, the runtime of the algorithm will be longer.
  • The shape of the clusters also depends on the initial prototypes. Hence, the k-prototypes algorithm may give different results on each run.
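One common way to deal with the unknown number of clusters is the elbow method: run the algorithm for several values of k and pick the value where the clustering cost stops improving sharply. A sketch, reusing X and KPrototypes from the example above:

```python
# Elbow heuristic: fit k-prototypes for several values of k and watch
# where the clustering cost stops dropping sharply.
costs = {}
for k in range(2, 7):
    model = KPrototypes(n_clusters=k, init="Huang", random_state=42)
    model.fit_predict(X, categorical=[0, 1, 2])
    costs[k] = model.cost_   # total distance of points to their prototypes

for k, cost in costs.items():
    print(k, round(cost, 3))  # pick k at the "elbow" of this curve
```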

Conclusion

The k-prototypes clustering algorithm finds applications in various real-life situations due to its ability to handle mixed data types. You can use k-prototypes clustering for loan classification, customer segmentation, cyber profiling, and other tasks where we need to group data into clusters.

In this article, we have discussed the k-prototypes clustering algorithm with a numerical example. We have also discussed the advantages and disadvantages of the algorithm.

To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in python.

To read about other computer science topics, you can read this article on dynamic role-based authorization using ASP.net. You can also read this article on user activity logging using Asp.net.
