K-Prototypes Clustering With Numerical Example

Clustering algorithms are unsupervised machine learning algorithms used to segment data into various clusters. In this article, we will discuss the K-Prototypes clustering algorithm with a numerical example. We will also discuss the advantages and disadvantages of the k-prototypes clustering algorithm.

What is K-Prototypes clustering?

K-Prototypes clustering is a partitioning clustering algorithm. We use k-prototypes clustering to cluster datasets that have categorical as well as numerical attributes. The K-Prototypes clustering algorithm is an ensemble of k-means clustering and k-modes clustering algorithm. Hence, it can handle both numerical and categorical data.

To understand the k-prototypes clustering in a better way, I would first suggest you read k-means clustering with a numerical example and k-modes clustering with a numerical example.

In k-prototypes clustering, we select k-prototypes randomly at the start. After that, we calculate the distance between each data point and the prototypes. Accordingly, all the data points are assigned to clustering associated with different prototypes.

After assigning data points to the clusters, we calculate the new prototype for the current cluster using the method discussed in the next sections. After that, we recalculate the distance of prototypes from the data points and reassign the clusters. This process is continued until the clusters converge.

Before getting into the numerical example for k-prototypes clustering, Let us first discuss the distance measures and calculation of prototypes in the k-prototypes clustering algorithm.

Distance Measures in K-Prototypes Clustering

As the k-prototypes clustering algorithm deals with data having numerical as well as categorical attributes, it uses different measures for both data types.

For numerical data, the k-prototypes clustering uses squared euclidean distance as the distance measure. For instance, if we are given two data points (1,2,3) and (4, 3, 3), the distance between these two data points will be calculated as (1-4)^2+(2-3)^2+(3-3)^2 which is equal to 13.

For categorical attributes, the k-prototypes clustering algorithm follows matching dissimilarity. If you have two records (A, B, C, D) and (A, D, C, C) with categorical attributes, the matching dissimilarity is the number of different values at each position in the records. In the given records, values are different at the two positions only. Hence, the matching dissimilarity between the records will be 2.

For records having mixed attributes, we calculate the distance between categorical and numerical attributes separately. After that, we use the sum of the dissimilarity scores as the distance between two records. For instance, consider that we have two records [‘A’, ‘B’, ‘F’, 155, 53] and [‘A’, ‘A’, ‘M’, 174, 70].

To find the distance between these two records, we will first find the dissimilarity score between [‘A’, ‘B’, ‘F’] and [‘A’, ‘A’, ‘M’]. The score is 2 as two attributes out of three have different values.

Next, we will calculate the square euclidean distance between [155, 53] and [174, 70]. Here, (155-174)^2 + (53-70)^2 which is equal to 650.

Now, we can directly calculate the total dissimilarity score as the sum of the dissimilarity score between categorical attributes and the square euclidean distance of numerical attributes. Here, the sum will be equal to 650+2=652.

Observe that the matching dissimilarity score of categorical attributes is almost negligible compared to the square euclidean distance between numerical attributes. Hence, the categorical attributes will have little or no effect on clustering.

To solve this problem, we can scale the values in numeric attributes within a range of say 0 to 5. Alternatively, we can take a weighted sum of the matching dissimilarity scores and the square euclidean distance. In this article, we will discuss a numerical example of the k-prototypes clustering algorithm by scaling the values in numeric attributes within a range of 0 to 5.

Choice of New Prototypes in K-Prototypes Clustering

Once a cluster is formed, we need to calculate a new prototype for the cluster using the data points in the current cluster.

To calculate the new prototype for any given cluster, we will take mode of categorical attributes of the data points in the cluster. For numerical attributes, we will use the mean of the values to calculate new prototype for the cluster. For example, suppose that we have the following data points in a cluster.

EQ Rating	IQ Rating	Gender	Height	Weight
C	A	F	5.000000	4.705882
B	A	M	4.175824	4.470588
B	A	F	4.807692	3.235294
A	A	F	4.972527	4.117647
B	C	F	4.642857	3.470588

Points in a Cluster

In the above cluster, we will take the mode of values in the EQ Rating, IQ Rating, and Gender attributes. For the attributes Height and Weight, we will take the mean of the values to calculate the new prototype. Hence, the prototype for the above cluster will be [B, A , F, 4.71978,3.999999 ].

Here, B is the mode of values in the EQ Rating column. A is the mode of the values in the IQ Rating column, and F is the mode of the values in the Gender column. Similarly, 4.71978 and 3.999999 are mean of the Height and Weight attributes respectively.

The K-Prototypes Clustering Algorithm

The K-prototypes clustering algorithm is similar to the k-means clustering algorithm. The difference lies in how we calculate the distance between two data points and how we select new prototypes for clusters. The steps in the k-prototypes clustering algorithm are discussed below.

First, we select K data points from the input dataset as initial prototypes.
We then find the distance of each data point from the current prototypes. The distances are calculated as discussed in the previous sections.
After finding the distance of each data point from the prototypes, we assign data points to clusters. Here, each data point is assigned to the cluster with the prototype nearest to the data point.
After assigning data points to the clusters, we calculate new prototypes for each cluster. To calculate the prototypes, we take the mean of numeric attributes and the mode of categorical attributes as discussed previously.
If the new prototypes are the same as the previous prototypes, we say that the algorithm has converged. Hence, the current clusters are finalized. Otherwise, we go to 2.

Now that we have introduced you to the k-prototypes clustering algorithm, let us discuss a numerical example of k-prototype clustering with a small sample dataset.

K-Prototypes Clustering Numerical Example

Following is the dataset we will use to understand the k-prototypes clustering algorithm using the numerical example. It contains EQ Rating, IQ rating, Gender, Height and Weight of students in a given class.

Student	EQ Rating	IQ Rating	Gender	Height(in cms)	Weight(in Kgs)
Student 1	A	B	F	155	53
Student 2	A	A	M	174	70
Student 3	C	C	M	177	75
Student 4	C	A	F	182	80
Student 5	B	A	M	152	76
Student 6	A	B	M	160	69
Student 7	B	A	F	175	55
Student 8	A	A	F	181	70
Student 9	A	C	M	180	85
Student 10	A	B	F	166	54
Student 11	C	C	M	162	66
Student 12	A	C	M	153	74
Student 13	A	B	M	160	62
Student 14	B	C	F	169	59
Student 15	A	B	F	171	71

Dataset For K-Prototypes Clustering

The above table isn’t normalized. Due to this, the dissimilarity score of categorical attributes is negligible compared to the difference between numerical attributes. Due to this, the clustering will be biased on the basis of height and weight. To avoid this bias, we will normalize the dataset as shown in the following table.

Student	EQ Rating	IQ Rating	Gender	Height	Weight
Student 1	A	B	F	4.258242	3.117647
Student 2	A	A	M	4.780220	4.117647
Student 3	C	C	M	4.862637	4.411765
Student 4	C	A	F	5.000000	4.705882
Student 5	B	A	M	4.175824	4.470588
Student 6	A	B	M	4.395604	4.058824
Student 7	B	A	F	4.807692	3.235294
Student 8	A	A	F	4.972527	4.117647
Student 9	A	C	M	4.945055	5.000000
Student 10	A	B	F	4.560440	3.176471
Student 11	C	C	M	4.450549	3.882353
Student 12	A	C	M	4.203297	4.352941
Student 13	A	B	M	4.395604	3.647059
Student 14	B	C	F	4.642857	3.470588
Student 15	A	B	F	4.697802	4.176471

Normalized dataset

In the above table, we have normalized the Height and Weight attributes in the range 0 to 5. Now, the distance between the numeric attributes will not be very large compared to the dissimilarity between the categorical attributes.

Let us now discuss the numerical for k-prototypes clustering using the normalized dataset having mixed data types.

K-Prototypes Clustering With Numerical Example Iteration 1

Suppose that we want to classify the input dataset into 3 clusters. Hence, we will select three data points as initial prototypes. Here, we will select Student 5, Student 9, and Student 13 as initial prototypes. Hence,

prototype1= [‘B’, ‘A’, ‘M’, 4.175824, 4.470588]
prototype2= [‘A’, ‘C’, ‘M’, 4.945055, 5.0]
prototype3=[‘A’, ‘B’, ‘M’, 4.395604, 3.6470589]

Now, let us calculate the distance of each data point with the prototypes. The distance is calculated using the matching dissimilarity and squared distance measure discussed in the previous sections. The distance has been calculated and tabulated below.

EQ Rating	IQ Rating	Gender	Height	Weight	Distance From Prototype 1	Distance From Prototype 2	Distance From Prototype 3	Assigned Cluster
A	B	F	4.258242	3.117647	3.183724	2.401496	1.029915	Cluster 3
A	A	M	4.78022	4.117647	1.048986	1.080572	1.036938	Cluster 3
C	C	M	4.862637	4.411765	2.047517	1.035281	2.080289	Cluster 2
C	A	F	5	4.705882	2.073463	3.008952	3.14864	Cluster 1
B	A	M	4.175824	4.470588	0	2.087199	2.07265	Cluster 1
A	B	M	4.395604	4.058824	2.021785	1.118771	0.016955	Cluster 3
B	A	F	4.807692	3.235294	1.192521	3.313306	3.033937	Cluster 1
A	A	F	4.972527	4.117647	2.07593	2.07793	2.055429	Cluster 3
A	C	M	4.945055	5	2.087199	0	1.213235	Cluster 2
A	B	F	4.56044	3.176471	3.182267	2.347319	1.024862	Cluster 3
C	C	M	4.450549	3.882353	2.042149	1.149367	2.005838	Cluster 2
A	C	M	4.203297	4.352941	2.00146	0.096889	1.053525	Cluster 2
A	B	M	4.395604	3.647059	2.07265	1.213235	0	Cluster 3
B	C	F	4.642857	3.470588	2.121812	2.243042	3.009228	Cluster 1
A	B	F	4.697802	4.176471	3.035897	2.073933	1.03716	Cluster 3

Iteration 1

In the above table, we have first calculated the distance of each data point from the initial prototypes. Then, we have assigned the data points to the nearest prototype.

After creating the clusters, we will calculate the new prototypes for each cluster using the mode of categorical attributes and mean of the numerical attributes.

In cluster 1, we have the following points.

C	A	F	5.000000	4.705882
B	A	M	4.175824	4.470588
B	A	F	4.807692	3.235294
B	C	F	4.642857	3.470588

Cluster 1

The prototype for this cluster will be [B, A, F, 4.656593, 3.970588].

We have the following points in the second cluster.

C	C	M	4.862637	4.411765
A	C	M	4.945055	5.000000
C	C	M	4.450549	3.882353
A	C	M	4.203297	4.352941

Cluster 2

Hence, the prototype for the second cluster will be [A, C, M, 4.615384, 4.411764].

Cluster 3 has the following points.

A	B	F	4.258242	3.117647
A	A	M	4.780220	4.117647
A	B	M	4.395604	4.058824
A	A	F	4.972527	4.117647
A	B	F	4.560440	3.176471
A	B	M	4.395604	3.647059
A	B	F	4.697802	4.176471

Cluster 3

Hence, the prototype for cluster 3 is [A, B, F, 4.580062, 3.773109].

After iteration 1, we have the following prototypes.

prototype1= [B, A, F, 4.656593, 3.970588]
prototype2= [A, C, M, 4.615384, 4.411764]
prototype3=[A, B, F, 4.580062, 3.773109]

You can observe that the current prototypes are not the same as the initial prototypes. Hence, we will calculate the distance of the data points in the dataset to these prototypes and reassign the points to the clusters.

K-Prototypes Clustering With Numerical Example: Iteration 2

The distance of each point from the prototypes calculate in the first iteration are tabulated below.

EQ Rating	IQ Rating	Gender	Height	Weight	Distance From Prototype 1	Distance From Prototype 2	Distance From Prototype 3	Assigned Cluster
A	B	F	4.258242	3.117647	2.088619	2.180229	0.05332	Cluster 3
A	A	M	4.78022	4.117647	2.003691	1.011368	2.015877	Cluster 2
C	C	M	4.862637	4.411765	3.023709	1.006113	3.048773	Cluster 2
C	A	F	5	4.705882	1.065859	3.023444	2.104641	Cluster 1
B	A	M	4.175824	4.470588	1.048114	2.019667	3.064989	Cluster 1
A	B	M	4.395604	4.058824	3.00759	1.017287	1.011566	Cluster 3
B	A	F	4.807692	3.235294	0.056349	3.142106	2.034106	Cluster 1
A	A	F	4.972527	4.117647	1.012144	2.021406	1.027274	Cluster 1
A	C	M	4.945055	5	3.11429	0.04547	2.163848	Cluster 2
A	B	F	4.56044	3.176471	2.063987	2.152897	0.035636	Cluster 3
C	C	M	4.450549	3.882353	3.005024	1.030745	3.002871	Cluster 2
A	C	M	4.203297	4.352941	3.035167	0.017328	2.047816	Cluster 2
A	B	M	4.395604	3.647059	3.017279	1.063308	1.004991	Cluster 3
B	C	F	4.642857	3.470588	1.025019	2.088657	2.009546	Cluster 1
A	B	F	4.697802	4.176471	2.004409	2.006216	0.017656	Cluster 3

Iteration 2

In the above table, we have reassigned the data points to the clusters based on their distance from the prototypes that were calculated in iteration 1. Now, we will again calculate the prototype for each cluster.

In cluster 1, we have the following points.

EQ Rating	IQ Rating	Gender	Height	Weight
C	A	F	5.000000	4.705882
B	A	M	4.175824	4.470588
B	A	F	4.807692	3.235294
A	A	F	4.972527	4.117647
B	C	F	4.642857	3.470588

Cluster 1

The prototype for the above cluster is [B, A , F, 4.71978,3.999999].

In cluster 2, we have the following points.

EQ Rating	IQ Rating	Gender	Height	Weight
A	A	M	4.780220	4.117647
C	C	M	4.862637	4.411765
A	C	M	4.945055	5.000000
C	C	M	4.450549	3.882353
A	C	M	4.203297	4.352941

Cluster 2

The prototype for cluster 2 is [A, C, M, 4.648351, 4.3529412].

In cluster 3, we have the following data points.

EQ Rating	IQ Rating	Gender	Height	Weight
A	B	F	4.258242	3.117647
A	B	M	4.395604	4.058824
A	B	F	4.560440	3.176471
A	B	M	4.395604	3.647059
A	B	F	4.697802	4.176471

Cluster 3

The prototype for cluster 3 is [A, B, F, 4.461538, 3.635294].

After iteration 2, we have the following prototypes.

prototype1= [B, A , F, 4.71978,3.999999].
prototype2= [A, C, M, 4.648351, 4.3529412].
prototype3=[A, B, F, 4.461538, 3.635294].

You can observe that the current prototypes are not the same as the prototypes calculated after iteration 1. Hence, we will calculate the distance of the data points in the dataset to these prototypes and reassign the points to the clusters.

K-Prototypes Clustering With Numerical Example: Iteration 3

The following table contains the distance of each data point from the prototypes calculated in iteration 2.

EQ Rating	IQ Rating	Gender	Height	Weight	Distance From Prototype 1	Distance From Prototype 2	Distance From Prototype 3	Assigned Cluster
A	B	F	4.258242	3.117647	2.099156	2.167814	0.030929	Cluster 3
A	A	M	4.78022	4.117647	2.001749	1.007275	2.033422	Cluster 2
C	C	M	4.862637	4.411765	3.018996	1.004938	3.076379	Cluster 2
C	A	F	5	4.705882	1.057679	3.024822	2.14361	Cluster 1
B	A	M	4.175824	4.470588	1.051734	2.023712	3.077935	Cluster 1
A	B	M	4.395604	4.058824	3.010855	1.015039	1.018372	Cluster 2
B	A	F	4.807692	3.235294	0.05925	3.127452	2.027982	Cluster 1
A	A	F	4.972527	4.117647	1.007772	2.016045	1.049377	Cluster 1
A	C	M	4.945055	5	3.105075	0.050672	2.209621	Cluster 2
A	B	F	4.56044	3.176471	2.070359	2.139181	0.02203	Cluster 3
C	C	M	4.450549	3.882353	3.008633	1.026058	3.006116	Cluster 2
A	C	M	4.203297	4.352941	3.039132	0.019807	2.058171	Cluster 2
A	B	M	4.395604	3.647059	3.022966	1.056215	1.000449	Cluster 3
B	C	F	4.642857	3.470588	1.028619	2.077858	2.006	Cluster 1
A	B	F	4.697802	4.176471	2.003163	2.003359	0.034869	Cluster 3

Iteration 3

In the above table, we have reassigned the points to different clusters. Now, let us recalculate the prototype of each cluster.

In cluster 1, we have the following data points.

EQ Rating	IQ Rating	Gender	Height	Weight
C	A	F	5.000000	4.705882
B	A	M	4.175824	4.470588
B	A	F	4.807692	3.235294
A	A	F	4.972527	4.117647
B	C	F	4.642857	3.470588

Cluster 1

The prototype for cluster 1 will be [B, A , F, 4.71978,3.999999].

In cluster 2, we have the following data points.

EQ Rating	IQ Rating	Gender	Height	Weight
A	A	M	4.780220	4.117647
C	C	M	4.862637	4.411765
A	C	M	4.945055	5.000000
C	C	M	4.450549	3.882353
A	B	M	4.395604	3.647059
A	C	M	4.203297	4.352941

Cluster 2

The prototype for cluster 2 will be [A, C, M, 4.606227, 4.235294].

In cluster 3, we have the following data points.

EQ Rating	IQ Rating	Gender	Height	Weight
A	B	F	4.258242	3.117647
A	B	M	4.395604	4.058824
A	B	F	4.560440	3.176471
A	B	F	4.697802	4.176471

Cluster 3

The prototype for the above cluster will be [A, B, F, 4.478022, 3.632353].

After iteration 2, we have the following prototypes.

prototype1= [B, A , F, 4.71978,3.999999].
prototype2= [A, C, M, 4.606227, 4.235294].
prototype3=[A, B, F, 4.478022, 3.632353].

You can observe that the current prototypes are not the same as the prototypes calculated after iteration 2. Hence, we will calculate the distance of the data points in the dataset to these prototypes and reassign the points to the clusters.

K-Prototypes Clustering With Numerical Example: Iteration 4

The following table contains the distance of points from the prototypes calculated in iteration 3 along with the reassigned clusters.

EQ Rating	IQ Rating	Gender	Height	Weight	Distance From Prototype 1	Distance From Prototype 2	Distance From Prototype 3	Assigned Cluster
A	B	F	4.258242	3.117647	2.099156	2.137023	0.031323	Cluster 3
A	A	M	4.78022	4.117647	2.001749	1.004411	2.032683	Cluster 2
C	C	M	4.862637	4.411765	3.018996	1.009689	3.075541	Cluster 2
C	A	F	5	4.705882	1.057679	3.037651	2.142493	Cluster 1
B	A	M	4.175824	4.470588	1.051734	2.024061	3.079396	Cluster 1
A	B	M	4.395604	4.058824	3.010855	1.00755	1.018867	Cluster 2
B	A	F	4.807692	3.235294	0.05925	3.104059	2.026634	Cluster 1
A	A	F	4.972527	4.117647	1.007772	2.014802	1.048005	Cluster 1
A	C	M	4.945055	5	3.105075	0.069958	2.208858	Cluster 2
A	B	F	4.56044	3.176471	2.070359	2.11232	0.021462	Cluster 3
C	C	M	4.450549	3.882353	3.008633	1.01488	3.006325	Cluster 2
A	C	M	4.203297	4.352941	3.039132	0.017619	2.059472	Cluster 2
A	B	M	4.395604	3.647059	3.022966	1.039038	1.000701	Cluster 3
B	C	F	4.642857	3.470588	1.028619	2.058612	2.005334	Cluster 1
A	B	F	4.697802	4.176471	2.003163	2.001185	0.034437	Cluster 3

Iteration 4

You can observe that the the points in the clusters haven’t changed. Hence, the prototype in each cluster will also not change. Thus, our algorithm has converged and the clusters calculated in the previous iteration are the final output clusters.

Suggested Reading: Advanced Coding Concepts

Advantages of K-prototype Clustering

The first advantage of k-prototypes clustering is that it can be used for datasets having mixed data types. As we are able to cluster datasets with numerical and categorical attributes, the usability of the k-prototypes clustering algorithm becomes high in real-life situations.
K-Prototypes clustering algorithm is easy to implement. It uses simple mathematical constructs and is an iterable algorithm. Hence, it becomes easy to understand and implement this algorithm.
K-prototypes clustering is a scalable clustering algorithm. You can use it for smaller as well as large datasets.
The K-Prototypes clustering algorithm is guaranteed to converge. Hence, it is sure that we will get the output clusters if we execute the algorithm.
The K-Prototypes clustering algorithm isn’t restricted to a particular domain. As long as we have structured datasets with numeric and categorical attributes, we can use the k-prototypes clustering algorithm in any domain.
The number of iterations and the shape of clusters in K-Prototypes clustering are highly dependent on the initial prototypes. The algorithm allows us to warm-start the choice of clusters. Hence after data preprocessing and exploratory analysis, you can choose suitable prototypes initially so that the number of iterations can be minimized while executing the algorithm.

Disadvantages of K-Prototypes Clustering

The K-prototypes clustering algorithm, just like the k-means and k-modes clustering algorithm, faces the curse of dimensionality. With an increasing number of attributes in the input dataset, the dissimilarity between the data points starts to become almost the same. In such a case, we need to define a mechanism to specify which cluster a data point is assigned to if the data points have the same dissimilarity with two prototypes.
In K-prototypes clustering, we don’t know the optimal number of clusters. Due to this, we need to try different numbers of clusters to find the optimal k for clustering data into k clusters.
The number of iterations in the k-prototypes clustering for convergence depends on the choice of initial prototypes. Due to this, if the prototypes aren’t selected in an efficient way, the runtime for the algorithm will be longer.
The shape of the clusters in k-prototypes clustering depends on the initial prototypes. Hence, the k-prototypes algorithm might give different results for each run.

Conclusion

The k-prototypes cluster algorithm finds its applications in various real life situations due to its ability to handle mixed data types. You can use k-prototypes clustering in loan classification, customer segmentation, cyber profiling, and other situations where we need to group data into various clusters.

In this article, we have discussed the k-prototypes clustering algorithm with a numerical example. We have also discussed the advantages and disadvantages of the algorithm.

To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in python.

To read about other computer science topics, you can read this article on dynamic role-based authorization using ASP.net. You can also read this article on user activity logging using Asp.net.

K-Prototypes Clustering With Numerical Example

What is K-Prototypes clustering?

Distance Measures in K-Prototypes Clustering

Choice of New Prototypes in K-Prototypes Clustering

The K-Prototypes Clustering Algorithm

K-Prototypes Clustering Numerical Example

K-Prototypes Clustering With Numerical Example Iteration 1

K-Prototypes Clustering With Numerical Example: Iteration 2

K-Prototypes Clustering With Numerical Example: Iteration 3

K-Prototypes Clustering With Numerical Example: Iteration 4

Advantages of K-prototype Clustering

Disadvantages of K-Prototypes Clustering

Conclusion

Linear Regression vs Logistic Regression in Machine Learning

Regression in Machine Learning With Examples

Machine Learning: An Introduction

Mlflow Tutorial With Code Example

Data Cleaning Steps Explained

KNN Regression Using sklearn Module in Python

What is K-Prototypes clustering?

Distance Measures in K-Prototypes Clustering

Choice of New Prototypes in K-Prototypes Clustering

The K-Prototypes Clustering Algorithm

K-Prototypes Clustering Numerical Example

K-Prototypes Clustering With Numerical Example Iteration 1

K-Prototypes Clustering With Numerical Example: Iteration 2

K-Prototypes Clustering With Numerical Example: Iteration 3

K-Prototypes Clustering With Numerical Example: Iteration 4

Advantages of K-prototype Clustering

Disadvantages of K-Prototypes Clustering

Conclusion

Similar Posts