Elbow Method to Find Best K in K-Prototypes Clustering

In partition-based clustering algorithms, we face a major challenge in deciding the optimal number of clusters. In this article, we will discuss how we can find the best k in k-prototypes clustering using the elbow method while clustering mixed data types in Python.

What Is the K-Prototypes Clustering Algorithm?

The k-prototypes clustering algorithm is a partition-based clustering algorithm that we can use to cluster datasets having both numeric and categorical attributes. It combines the k-means clustering algorithm, which handles the numeric attributes, with the k-modes clustering algorithm, which handles the categorical attributes. To understand the k-prototypes clustering algorithm in a better manner, you can first read up on the k-means and k-modes clustering algorithms.
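
For intuition, k-prototypes measures the dissimilarity between a data point and a cluster prototype as the squared Euclidean distance over the numeric attributes plus a weight, usually called gamma, times the number of categorical mismatches. Following is a minimal sketch of that idea in Python; the function name and the fixed gamma value are illustrative only and are not part of the kmodes library.

#illustrative k-prototypes dissimilarity (not part of the kmodes library)
import numpy as np
def mixed_dissimilarity(point_num, proto_num, point_cat, proto_cat, gamma=0.5):
    #squared Euclidean distance over the numeric attributes
    numeric_part = np.sum((np.asarray(point_num) - np.asarray(proto_num)) ** 2)
    #number of mismatched categorical attributes
    categorical_part = sum(a != b for a, b in zip(point_cat, proto_cat))
    return numeric_part + gamma * categorical_part
#example: one student's attributes vs. a hypothetical cluster prototype
print(mixed_dissimilarity([155, 53], [160, 60], ["A", "B", "F"], ["A", "A", "F"]))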

K-Prototypes Clustering Implementation

To implement k-prototypes clustering in Python, we use the kprototypes module from the kmodes library. The kmodes.kprototypes module provides the KPrototypes() class, which we can use to create an untrained machine learning model for k-prototypes clustering. After creating the model, we can use the fit(), fit_predict(), and predict() methods to perform clustering and prediction; a short predict() example is given after the code walkthrough below.

Following is the implementation of k-prototypes clustering in Python. I have used the following dataset for clustering. It contains five attributes, namely EQ Rating, IQ Rating, Gender, Height, and Weight.

Student      EQ Rating   IQ Rating   Gender   Height(in cms)   Weight(in Kgs)
Student 1    A           B           F        155              53
Student 2    A           A           M        174              70
Student 3    C           C           M        177              75
Student 4    C           A           F        182              80
Student 5    B           A           M        152              76
Student 6    A           B           M        160              69
Student 7    B           A           F        175              55
Student 8    A           A           F        181              70
Student 9    A           C           M        180              85
Student 10   A           B           F        166              54
Student 11   C           C           M        162              66
Student 12   A           C           M        153              74
Student 13   A           B           M        160              62
Student 14   B           C           F        169              59
Student 15   A           B           F        171              71
Dataset for clustering

Following is the code to implement k-prototypes clustering in Python.

#import modules
import pandas as pd
import numpy as np
from kmodes import kprototypes
#read data input
input_data=pd.read_csv("Kprototypes_dataset.csv")
print("The input data is:")
print(input_data)
df=input_data.copy()
#drop unnecessary columns
df.drop(columns=["Student"],inplace=True)
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
#create untrained model
untrained_model = kprototypes.KPrototypes(n_clusters=3,max_iter=20)
#predict clusters
clusters = untrained_model.fit_predict(data_array, categorical=[0, 1, 2])
input_data["Cluster labels"]=clusters
print("The clustered data is:")
print(input_data)

Output:

The input data is:
Student      EQ Rating   IQ Rating   Gender   Height(in cms)   Weight(in Kgs)
Student 1    A           B           F        155              53
Student 2    A           A           M        174              70
Student 3    C           C           M        177              75
Student 4    C           A           F        182              80
Student 5    B           A           M        152              76
Student 6    A           B           M        160              69
Student 7    B           A           F        175              55
Student 8    A           A           F        181              70
Student 9    A           C           M        180              85
Student 10   A           B           F        166              54
Student 11   C           C           M        162              66
Student 12   A           C           M        153              74
Student 13   A           B           M        160              62
Student 14   B           C           F        169              59
Student 15   A           B           F        171              71

The clustered data is:
Student      EQ Rating   IQ Rating   Gender   Height(in cms)   Weight(in Kgs)   Cluster labels
Student 1    A           B           F        155              53               2
Student 2    A           A           M        174              70               1
Student 3    C           C           M        177              75               0
Student 4    C           A           F        182              80               1
Student 5    B           A           M        152              76               0
Student 6    A           B           M        160              69               0
Student 7    B           A           F        175              55               2
Student 8    A           A           F        181              70               1
Student 9    A           C           M        180              85               0
Student 10   A           B           F        166              54               2
Student 11   C           C           M        162              66               0
Student 12   A           C           M        153              74               0
Student 13   A           B           M        160              62               2
Student 14   B           C           F        169              59               2
Student 15   A           B           F        171              71               1

In the above code,

  • First, I read the dataset from a CSV file using the read_csv() method. The read_csv() method takes the filename of the CSV file as its input argument and returns a pandas dataframe containing the dataset.
  • After that, I dropped the unwanted column (Student) from the input dataset using the drop() method. The drop() method takes a list of the column names that need to be dropped in its columns parameter. After execution, we get the modified dataframe.
  • In the third step, I normalized the values in the numeric attributes. This is important to make sure that the differences in the numeric values do not outweigh the differences in the categorical values. You can read about this in more detail in the article on clustering mixed data types in Python.
  • After normalization, I extracted a numpy array of the values in the dataframe using the values attribute of the dataframe.
  • Once we had the values in the array, I specified the data type of each column using the astype() method. The astype() method, when invoked on a numpy array, takes the name of the required data type as input and converts all the elements of the array to that data type. This step is important to ensure that the categorical attributes are represented as strings and the numeric values are represented as floating-point numbers or integers.
  • Next, I used the KPrototypes() class to create an untrained machine learning model for k-prototypes clustering. For this example, I have kept the number of clusters at 3 and the maximum number of iterations at 20.
  • After creating the untrained model, I invoked the fit_predict() method on it. The fit_predict() method takes the input data array as its first input argument. Additionally, it takes the indices of the columns having categorical attributes in the categorical parameter. After execution, the fit_predict() method returns an array containing the cluster label for each data point in the input dataset.
  • Finally, I have added the output array to the dataframe containing the input dataset to show the cluster label for each data point.
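
As a quick illustration of the predict() method mentioned earlier, the following sketch assigns a new, previously unseen student to one of the clusters learned above. The new student's attribute values are made up for illustration; note that the numeric attributes must be scaled with the same factors that were used to normalize the training data.

#assign a new student to one of the learned clusters
#scale numeric values with the same maxima used during normalization
new_height = (170 / input_data["Height(in cms)"].abs().max()) * 5
new_weight = (60 / input_data["Weight(in Kgs)"].abs().max()) * 5
new_point = np.array([["A", "B", "F", new_height, new_weight]], dtype=object)
print(untrained_model.predict(new_point, categorical=[0, 1, 2]))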

Now that we have discussed how to perform k-prototypes clustering, let us discuss how to find the optimal number of clusters, i.e., the best k, in k-prototypes clustering.

Finding Best K in K-Prototypes Clustering Using the Elbow Method

With an increase in the number of clusters, the total cost of the clustering decreases rapidly. Here, the cost is the sum of the distances of all the data points to their respective cluster prototypes; it plays the same role as the total cluster variance in k-means clustering. You can get the cost of a trained k-prototypes clustering model using its cost_ attribute.
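
For instance, reusing the data array prepared in the earlier snippet, the cost is available on the model immediately after fitting:

#the clustering cost is stored on the model after fitting
model = kprototypes.KPrototypes(n_clusters=3, max_iter=20)
model.fit(data_array, categorical=[0, 1, 2])
print(model.cost_)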

After decreasing rapidly, the cost becomes almost constant. Due to this, when we plot the number of clusters against the corresponding cost for each k in k-prototypes clustering, we get an elbow-shaped line. In this chart, the k at which the cost becomes almost constant is selected as the best k for clustering.

To find the best k in k-prototypes clustering using the elbow method, we will use the following steps.

  • First, we will create a dictionary, say elbow_scores, to store the total cluster variance for each value of k.
  • Now, we will use a for loop to find the total cluster variance for each k. In the for loop, we will vary the value of k over a range of candidate values; in the code below, k runs from 2 to 9. You can also vary it from 2 up to the total number of points in the dataset, or up to half of that.
  • Inside the for loop, we will perform the following operations.
    • We will create an untrained machine learning model for k-prototypes clustering using the KPrototypes() class and the current value of k.
    • Then, we will train the machine learning model using the given dataset and the fit() method. 
    • After training the model, we will find the total cluster variance for the current k. For this, we can use the cost_ attribute of the model.
    • After obtaining the total cluster variance, we will store the current value of k as the key and the total cluster variance as the associated value in the elbow_scores dictionary.
  • After execution of the for loop, we will get the total cluster variance for each k in the elbow_scores dictionary.
  • We will plot the total cluster variance vs k. Then, you can identify the k in the plot after which the total cluster variance becomes almost constant. This k will be the best k, i.e., the optimal number of clusters, for our dataset.

Following is the implementation of the program to find the best k using the elbow method for k-prototypes clustering in Python.

#import modules
import pandas as pd
import numpy as np
from kmodes import kprototypes
import matplotlib.pyplot as plt
#read data input
input_data=pd.read_csv("Kprototypes_dataset.csv")
df=input_data.copy()
#drop unnecessary columns
df.drop(columns=["Student"],inplace=True)
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
elbow_scores = dict()
range_of_k = range(2, 10)
for k in range_of_k:
    #use the current value of k for the number of clusters
    untrained_model = kprototypes.KPrototypes(n_clusters=k, max_iter=20)
    trained_model = untrained_model.fit(data_array, categorical=[0, 1, 2])
    elbow_scores[k] = trained_model.cost_
 
plt.plot(list(elbow_scores.keys()), list(elbow_scores.values()))
plt.scatter(list(elbow_scores.keys()), list(elbow_scores.values()))
plt.xlabel("Values of K") 
plt.ylabel("Cost") 
plt.show()

Output:

[Output chart: an elbow plot with the values of k on the x-axis and the clustering cost on the y-axis]

In the output chart, you can observe that there is a sharp decrease in cost from k=2 to k=3. After that, the cost remains almost constant. We can consider k=3 as the elbow point. Hence, we will select 3 as the best k for k-prototypes clustering on the given dataset.
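
If you prefer to locate the elbow programmatically rather than by eye, one informal heuristic is to pick the smallest k after which the relative drop in cost becomes small. The following sketch applies that idea to the elbow_scores dictionary built above; the 10% threshold is an arbitrary choice for illustration, not a standard rule.

#informal elbow heuristic: pick the smallest k whose next step
#reduces the cost by less than 10% (threshold chosen arbitrarily)
ks = sorted(elbow_scores)
costs = [elbow_scores[k] for k in ks]
best_k = ks[-1]
for i in range(len(ks) - 1):
    relative_drop = (costs[i] - costs[i + 1]) / costs[i]
    if relative_drop < 0.10:
        best_k = ks[i]
        break
print("Suggested k:", best_k)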

Conclusion

In this article, we have discussed how to find the best k in k-prototypes clustering using the elbow method. The k-prototypes clustering algorithm finds applications in various real-life situations due to its ability to handle mixed data types. You can use k-prototypes clustering in loan classification, customer segmentation, cyber profiling, and other situations where we need to group data into clusters.

To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in Python.

To read about other computer science topics, you can read this article on dynamic role-based authorization using ASP.NET. You can also read this article on user activity logging using ASP.NET.

Stay tuned for more informative articles.

Happy Learning!
