Clustering For Mixed Data Types in Python

Clustering datasets into different groups finds applications in many industries. K-means clustering is one of the most popular clustering algorithms. However, it only works on numerical data. In the real world, many datasets contain numeric as well as categorical attributes. In such cases, we can use k-prototypes clustering. In this article, we will discuss the k-prototypes clustering algorithm for mixed data types and its implementation in detail.

What Is K-Prototypes Clustering?

K-Prototypes clustering is an unsupervised machine learning algorithm. It combines the k-means and k-modes clustering algorithms.

The k-means clustering algorithm is used to cluster numeric data. You can read about k-means clustering with a numerical example first if you don’t know about this algorithm. 

The k-modes clustering algorithm is used to perform clustering on categorical data. I also advise you to read the k-modes clustering algorithm with numerical examples to understand k-prototypes clustering better.

Being a combination of k-means and k-modes clustering, the k-prototypes clustering algorithm can cluster a dataset with mixed data types. In other words, we can perform clustering on a dataset having both numerical and categorical attributes using k-prototypes clustering.

Distance Measures in K-Prototypes Clustering

The K-Prototypes clustering algorithm uses different distance measures for the numerical and categorical attributes of any given data.

Distance Measure for Numerical Attributes

For numerical attributes, the k-prototypes clustering algorithm uses the squared Euclidean distance as the distance measure. If you have two records (1, 2, 3) and (4, 1, 5), the squared Euclidean distance is calculated as (1-4)^2+(2-1)^2+(3-5)^2, which is equal to 14. The kprototypes module in the kmodes package implements the euclidean_dissim() function to calculate the squared Euclidean distance.

The euclidean_dissim() function takes two numpy arrays of equal shape and returns the squared Euclidean distance between them, as shown below.

#import modules
import numpy as np
from kmodes import kprototypes

x1=np.array([1, 2, 3]).reshape(1,-1)
print("The first data point is:")
print(x1)
x2=np.array([4, 1, 5]).reshape(1,-1)
print("The second data point is:")
print(x2)
#calculate the squared euclidean distance between the two points
distance=kprototypes.euclidean_dissim(x1, x2)
print("The squared euclidean distance is:")
print(distance)

Output:

The first data point is:
[[1 2 3]]
The second data point is:
[[4 1 5]]
The squared euclidean distance is:
[14]

The input arrays passed to the euclidean_dissim() function should be of equal length. Otherwise, the function raises a ValueError exception with the message “operands could not be broadcast together with shapes”. You can observe this in the following example.

x1=np.array([1, 2, 3, 4]).reshape(1,-1)
print("The first data point is:")
print(x1)
x2=np.array([4, 1, 5]).reshape(1,-1)
print("The second data point is:")
print(x2)
distance=kprototypes.euclidean_dissim(x1, x2)
print("The squared euclidean distance is:")
print(distance)

Output:

The first data point is:
[[1 2 3 4]]
The second data point is:
[[4 1 5]]
ValueError: operands could not be broadcast together with shapes (1,4) (1,3) 

Also, the inputs given to the euclidean_dissim() function must be numeric. Otherwise, the function raises a TypeError exception, as shown below.
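
For instance, the following snippet passes string arrays to euclidean_dissim(). Since numpy cannot subtract strings, the call raises a TypeError (the exact exception message depends on your numpy version).

import numpy as np
from kmodes import kprototypes

x1=np.array(["A", "B", "C"]).reshape(1,-1)
x2=np.array(["D", "E", "F"]).reshape(1,-1)
#subtracting string arrays is undefined, so this raises a TypeError
try:
    kprototypes.euclidean_dissim(x1, x2)
except TypeError as e:
    print("TypeError:", e)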

Distance Measure for Categorical Attributes

For categorical attributes, the k-prototypes clustering algorithm uses matching dissimilarity. If you have two records (A, B, C, D) and (A, D, C, C) with categorical attributes, the matching dissimilarity is the number of positions at which the values differ. In the given records, the values differ at only two positions. Hence, the matching dissimilarity between the records is 2.

The kprototypes module provides the matching_dissim() function to calculate the dissimilarity score between two records. The matching_dissim() function takes two numpy arrays containing categorical data and returns their matching dissimilarity score, as shown in the following example.

x1=np.array(["A", "B", "C", "D"]).reshape(1,-1)
print("The first data point is:")
print(x1)
x2=np.array(["A", "D", "C", "C"]).reshape(1,-1)
print("The second data point is:")
print(x2)
distance=kprototypes.matching_dissim(x1, x2)
print("The matching dissimilarity is:")
print(distance)

Output:

The first data point is:
[['A' 'B' 'C' 'D']]
The second data point is:
[['A' 'D' 'C' 'C']]
The matching dissimilarity is:
[2]

Distance Measure for Records With Mixed Data Types

For records with mixed attributes, the k-prototypes clustering algorithm uses both distance measures. It calculates the squared Euclidean distance for the numerical attributes and the matching dissimilarity for the categorical attributes, and combines the two to obtain the distance between records with mixed attributes.

For instance, suppose we have two records ['A', 'B', 'F', 155, 53] and ['A', 'A', 'M', 174, 70].

  • To find the distance between these two records, we first find the matching dissimilarity between ['A', 'B', 'F'] and ['A', 'A', 'M']. The score is 2, as two attributes out of three have different values.
  • Next, we calculate the squared Euclidean distance between [155, 53] and [174, 70]. Here, (155-174)^2 + (53-70)^2 is equal to 650.
  • Now, we can calculate the total dissimilarity score as the sum of the matching dissimilarity of the categorical attributes and the squared Euclidean distance of the numerical attributes. Here, the sum is 650+2=652, as verified in the snippet below.
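
As a quick check, we can reproduce this calculation with the helper functions from the kmodes package. This is only an illustration of the arithmetic; internally, the library also applies a gamma weight to the categorical part, which we discuss later.

import numpy as np
from kmodes import kprototypes

#categorical parts of the two records
cat1=np.array(["A", "B", "F"]).reshape(1,-1)
cat2=np.array(["A", "A", "M"]).reshape(1,-1)
#numerical parts of the two records
num1=np.array([155, 53]).reshape(1,-1)
num2=np.array([174, 70]).reshape(1,-1)
cat_distance=kprototypes.matching_dissim(cat1, cat2)
num_distance=kprototypes.euclidean_dissim(num1, num2)
print("Matching dissimilarity:", cat_distance)            #[2]
print("Squared euclidean distance:", num_distance)        #[650]
print("Total dissimilarity:", num_distance+cat_distance)  #[652]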

Now, observe that the matching dissimilarity score of the categorical attributes is almost negligible compared to the squared Euclidean distance between the numerical attributes. Hence, the categorical attributes will have little or no effect on clustering.

To solve this problem, we can scale the values in the numeric attributes to a small range, say 0 to 5. Alternatively, we can take a weighted sum of the matching dissimilarity score and the squared Euclidean distance. In this article, we will work through a numerical example of the k-prototypes clustering algorithm by scaling the numeric attributes to a range of 0 to 5.
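
As a minimal sketch, min-max scaling maps a numeric column to the range 0 to 5. Note that the implementation later in this article divides by the column maximum instead, which also caps the values at 5.

import numpy as np

heights=np.array([155, 174, 177, 182, 152], dtype=float)
#min-max scaling to the range 0 to 5
scaled=5*(heights-heights.min())/(heights.max()-heights.min())
print(scaled)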

Selection of New Prototypes in the Clusters

Once a cluster is formed, we need to calculate a new prototype for the cluster using the data points in the current cluster.

To calculate the new prototype for any given cluster, we will take the mode of categorical attributes of the data points in the cluster. For numerical attributes, we will use the mean of the values to calculate a new prototype for the cluster. For example, suppose that we have the following data points in a cluster.

EQ Rating  IQ Rating  Gender  Height    Weight
C          A          F       5.000000  4.705882
B          A          M       4.175824  4.470588
B          A          F       4.807692  3.235294
A          A          F       4.972527  4.117647
B          C          F       4.642857  3.470588
A sample cluster

The first three attributes in the cluster are categorical. Hence, we calculate the mode of those attributes to obtain the prototype. For the numeric attributes, we use the mean of the values. The prototype for the above cluster will be [B, A, F, 4.71978, 4.0].
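
The following sketch computes this prototype with pandas. The column names are illustrative labels for this example, not part of the kmodes API.

import pandas as pd

cluster=pd.DataFrame(
    [["C", "A", "F", 5.000000, 4.705882],
     ["B", "A", "M", 4.175824, 4.470588],
     ["B", "A", "F", 4.807692, 3.235294],
     ["A", "A", "F", 4.972527, 4.117647],
     ["B", "C", "F", 4.642857, 3.470588]],
    columns=["EQ Rating", "IQ Rating", "Gender", "Height", "Weight"])
#mode for the categorical attributes, mean for the numeric attributes
prototype=([cluster[c].mode()[0] for c in ["EQ Rating", "IQ Rating", "Gender"]]
           +[cluster[c].mean() for c in ["Height", "Weight"]])
print(prototype)  #['B', 'A', 'F', 4.71978, 4.0 (approximately)]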

The K-Prototypes Algorithm

The K-Prototypes algorithm is quite similar to the k-means and k-modes clustering algorithms. The steps in the k-prototypes algorithm are as follows.

  1. First, we take K data points from the input dataset and use them as prototypes. The clusters are built around each prototype.
  2. Find the distance of each data point from the current prototypes. Here, the distance is calculated according to the method discussed in the previous section.
  3. After finding the distance of each data point from the prototypes, we assign data points to clusters. Here, each data point is assigned to the cluster with the prototype nearest to the data point. 
  4. After assigning data points to the clusters, we calculate new prototypes for each cluster. To calculate the prototypes, we take the mean of numeric attributes and the mode of categorical attributes.
  5. If the new prototypes are the same as the previous prototypes, we say that the algorithm has converged, and the current clusters are finalized. Otherwise, we go back to step 2. A minimal sketch of this loop is shown below.
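
The following is a minimal NumPy sketch of these five steps. It is an illustration of the algorithm, not the kmodes library's implementation: it initializes the prototypes from the first K records and uses a fixed gamma weight for the categorical part.

import numpy as np

def k_prototypes_sketch(X_num, X_cat, k, gamma=1.0, max_iter=20):
    #step 1: use the first k records as the initial prototypes
    proto_num=X_num[:k].astype(float)
    proto_cat=X_cat[:k].copy()
    labels=np.zeros(X_num.shape[0], dtype=int)
    for _ in range(max_iter):
        #step 2: distance of every point from every prototype
        num_d=((X_num[:, None, :]-proto_num[None, :, :])**2).sum(axis=2)
        cat_d=(X_cat[:, None, :]!=proto_cat[None, :, :]).sum(axis=2)
        #step 3: assign each point to its nearest prototype
        new_labels=np.argmin(num_d+gamma*cat_d, axis=1)
        #step 4: recompute prototypes (mean of numeric, mode of categorical)
        for j in range(k):
            members=new_labels==j
            if not members.any():
                continue  #keep the old prototype for an empty cluster
            proto_num[j]=X_num[members].mean(axis=0)
            for col in range(X_cat.shape[1]):
                values, counts=np.unique(X_cat[members, col], return_counts=True)
                proto_cat[j, col]=values[np.argmax(counts)]
        #step 5: stop when the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels=new_labels
    return labels

#toy usage with three records and two clusters
X_num=np.array([[155., 53.], [174., 70.], [177., 75.]])
X_cat=np.array([["A", "B", "F"], ["A", "A", "M"], ["C", "C", "M"]])
print(k_prototypes_sketch(X_num, X_cat, k=2))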

Before moving to the implementation of k-prototypes clustering for mixed data types in Python, I would suggest reading this article on k-prototypes clustering with a numerical example to understand the algorithm better.

The KPrototypes() Function in Python

To implement the k-prototypes clustering algorithm in Python, we will use the KPrototypes() function defined in the kmodes.kprototypes module. The syntax of the KPrototypes() function is as follows.

KPrototypes(n_clusters=8, max_iter=100, num_dissim=euclidean_dissim, cat_dissim=matching_dissim, init='Cao', n_init=10, gamma=None, verbose=0, random_state=None, n_jobs=1)

Here, 

  • The n_clusters parameter takes the number of clusters to form as well as the number of prototypes to generate. By default, it has the value 8. 
  • While executing k-prototypes clustering, the algorithm iterates several times before the clusters converge. To limit the number of iterations, we use the max_iter parameter. The max_iter parameter takes the maximum number of iterations in a single run. It has a default value of 100.
  • The num_dissim parameter takes the function used to calculate the dissimilarity between numerical attributes. By default, the euclidean_dissim() function is used.
  • The cat_dissim parameter takes the function used to calculate the dissimilarity between categorical attributes. By default, the matching_dissim() function is used to calculate the dissimilarity.
  • The n_init parameter is used to decide the number of times the k-prototypes algorithm will be run with different prototype seeds. The final result will be the best output of n_init consecutive runs in terms of cost.
  • The init parameter takes the method for initializing the prototypes as its input argument. The default method is “Cao”, a density-based initialization proposed by Cao et al. You can also pass “Huang”, the method introduced by Huang [1997, 1998], or “random” to select the initial prototypes randomly.
  • The parameter gamma takes the weighing factor that determines the relative importance of numerical vs. categorical attributes. By default, it has the value None. In this case, gamma is automatically calculated from the data.
  • The verbose parameter is used to execute the function in verbose mode. By default, it is set to 0. You can specify any non-zero value in the verbose parameter to execute the function in verbose mode.
  • The n_jobs parameter is used to execute different runs of the algorithm in parallel. By default, the n_jobs parameter has the value 1, so all runs of the algorithm execute one by one. For parallel execution, you can specify the number of runs to execute simultaneously. If you are not running any other program, you can set n_jobs equal to the number of cores in your CPU for maximum efficiency.

After execution, the KPrototypes() function returns an untrained model for k-prototypes clustering that can cluster a dataset with mixed data types.
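
For example, you might configure the model explicitly as shown below. The parameter values here are illustrative; fixing random_state is an optional step that makes the randomized initialization reproducible.

from kmodes import kprototypes

#an untrained model with 3 clusters and reproducible initialization
untrained_model=kprototypes.KPrototypes(n_clusters=3, max_iter=100, init="Cao",
                                        n_init=10, gamma=None, verbose=0,
                                        random_state=42, n_jobs=1)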

Clustering For Mixed Data Types Implementation in Python

Now that we have discussed the basics of the k-prototypes clustering algorithm along with the KPrototypes() function, let us implement k-prototypes clustering for mixed data types in Python.

Data Preprocessing

We are given the following table as the input dataset for k-prototypes clustering. It contains five attributes namely EQ Rating, IQ Rating, Gender, Height, and Weight. 

Student     EQ Rating  IQ Rating  Gender  Height(in cms)  Weight(in Kgs)
Student 1   A          B          F       155             53
Student 2   A          A          M       174             70
Student 3   C          C          M       177             75
Student 4   C          A          F       182             80
Student 5   B          A          M       152             76
Student 6   A          B          M       160             69
Student 7   B          A          F       175             55
Student 8   A          A          F       181             70
Student 9   A          C          M       180             85
Student 10  A          B          F       166             54
Student 11  C          C          M       162             66
Student 12  A          C          M       153             74
Student 13  A          B          M       160             62
Student 14  B          C          F       169             59
Student 15  A          B          F       171             71
Dataset For K-Prototypes Clustering

The numeric attributes in the above table aren't normalized, so the distances between the numeric attributes will be very large compared to the categorical dissimilarities. As a result, the clustering will be biased towards height and weight. To avoid this, we will normalize the numeric attributes as shown in the following table.

Student     EQ Rating  IQ Rating  Gender  Height    Weight
Student 1   A          B          F       4.258242  3.117647
Student 2   A          A          M       4.780220  4.117647
Student 3   C          C          M       4.862637  4.411765
Student 4   C          A          F       5.000000  4.705882
Student 5   B          A          M       4.175824  4.470588
Student 6   A          B          M       4.395604  4.058824
Student 7   B          A          F       4.807692  3.235294
Student 8   A          A          F       4.972527  4.117647
Student 9   A          C          M       4.945055  5.000000
Student 10  A          B          F       4.560440  3.176471
Student 11  C          C          M       4.450549  3.882353
Student 12  A          C          M       4.203297  4.352941
Student 13  A          B          M       4.395604  3.647059
Student 14  B          C          F       4.642857  3.470588
Student 15  A          B          F       4.697802  4.176471
Normalized Dataset

Here, each numeric value has been divided by the maximum of its column and multiplied by 5, so all the values now lie between 0 and 5.

Clustering for Mixed Data Types Using the KPrototypes() Function and the fit_predict() Method

After data preprocessing, we will use the following steps to implement k-prototypes clustering for mixed data types in Python.

  • First, we will read the dataset from a CSV file using the read_csv() function. The read_csv() function takes the filename of the CSV file as its input argument and returns a pandas dataframe containing the dataset.
  • After that, we will drop the unwanted columns from the input dataset using the drop() method. The drop() method takes a list of column names that need to be dropped. After execution, we get the modified dataframe.
  • In the third step, we will normalize the values in the numeric attributes. In this example, we scale the values to a range of 0 to 5.
  • After normalization, we will extract a numpy array of the values in the dataframe using the values attribute of the dataframe.
  • Once we have obtained the values in the array, we will specify the data type of each column using the astype() method. The astype() method, when invoked on a list or array of elements, takes the name of the required data type and converts all the elements of the array to the given data type. This step is important to ensure that the categorical variables are represented as strings and the numerical values are represented as floating-point numbers or integers.
  • Now, we will use the KPrototypes() function to create an untrained machine-learning model for k-prototypes clustering. For this example, we will keep the number of clusters as 3 and the maximum number of iterations as 20.
  • After creating the untrained model, we will invoke the fit_predict() method on the untrained model. The fit_predict() method takes the input data array as its first input argument. Additionally, it takes the index of the columns having categorical attributes in the “categorical” parameter. After execution, the fit_predict() method returns an array containing the cluster label for each data point in the input dataset. 
  • Finally, we can add the output array to the dataframe containing the input dataset to show the cluster label for each data point.

All these steps have been implemented in the following example.

#import modules
import pandas as pd
import numpy as np
from kmodes import kprototypes
#read data input
input_data=pd.read_csv("Kprototypes_dataset.csv")
print("The input data is:")
print(input_data)
df=input_data.copy()
#drop unnecessary columns
df.drop(columns=["Student"],inplace=True)
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
#create untrained model
untrained_model = kprototypes.KPrototypes(n_clusters=3,max_iter=20)
#predict clusters
clusters = untrained_model.fit_predict(data_array, categorical=[0, 1, 2])
input_data["Cluster labels"]=clusters
print("The clustered data is:")
print(input_data)

Output:

The input data is:
Student    EQ Rating IQ Rating Gender  Height(in cms)  Weight(in Kgs)
Student 1         A         B      F             155              53
Student 2         A         A      M             174              70
Student 3         C         C      M             177              75
Student 4         C         A      F             182              80
Student 5         B         A      M             152              76
Student 6         A         B      M             160              69
Student 7         B         A      F             175              55
Student 8         A         A      F             181              70
Student 9         A         C      M             180              85
Student 10        A         B      F             166              54
Student 11        C         C      M             162              66
Student 12        A         C      M             153              74
Student 13        A         B      M             160              62
Student 14        B         C      F             169              59
Student 15        A         B      F             171              71

The clustered data is:
Student     EQ Rating  IQ Rating  Gender  Height(in cms)  Weight(in Kgs)  Cluster labels
Student 1   A          B          F       155             53              2
Student 2   A          A          M       174             70              1
Student 3   C          C          M       177             75              0
Student 4   C          A          F       182             80              1
Student 5   B          A          M       152             76              0
Student 6   A          B          M       160             69              0
Student 7   B          A          F       175             55              2
Student 8   A          A          F       181             70              1
Student 9   A          C          M       180             85              0
Student 10  A          B          F       166             54              2
Student 11  C          C          M       162             66              0
Student 12  A          C          M       153             74              0
Student 13  A          B          M       160             62              2
Student 14  B          C          F       169             59              2
Student 15  A          B          F       171             71              1

In the above output, you can observe that we have successfully obtained three clusters.

The fit_predict() method assigns clusters to each data point in the input dataset. If you want to create a trained model for K-Prototypes clustering to cluster mixed data types, you can use the fit() method and the predict() method separately.

Clustering for Mixed Data Types Using the KPrototypes() Function With the fit() and predict() Methods

The fit() method takes the input data array as its first input argument. Additionally, it takes the index of the columns having categorical attributes in the “categorical” parameter. After execution, it returns a trained machine learning model. 

On the trained machine learning model, you can invoke the predict() method to predict the cluster for a given data point. The predict() method takes an array of data points for which we have to determine the cluster. Additionally, it takes the index of the columns having categorical attributes in the “categorical” parameter. After execution, it returns the cluster label for the input data points. You can observe this in the following example.

#import modules
import pandas as pd
import numpy as np
from kmodes import kprototypes
#read data input
input_data=pd.read_csv("Kprototypes_dataset.csv")
df=input_data.copy()
#drop unnecessary columns
df.drop(columns=["Student"],inplace=True)
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
#create untrained model
untrained_model = kprototypes.KPrototypes(n_clusters=3,max_iter=20)
#train model clusters
trained_model = untrained_model.fit(data_array, categorical=[0, 1, 2])
data_point=np.array(['A', 'B', 'F', 4.258241758241758, 3.1176470588235294]).reshape(1,-1)
#predict clusters for data point
predicted_cluster=trained_model.predict(data_point,categorical=[0, 1, 2])
print("The data point is:")
print(data_point)
print("The predicted cluster is:")
print(predicted_cluster)

Output:

The data point is:
[['A' 'B' 'F' '4.258241758241758' '3.1176470588235294']]
The predicted cluster is:
[0]

In the above example, we first created a trained machine learning model for k-prototypes clustering using the fit() method. Then, we passed a data point to the predict() method to predict its cluster.

Precautions While Using K-Prototypes Clustering for Mixed Data Types

  • With increasing dimensionality and an increasing number of categorical variables, the distances between data points become almost constant. In such cases, the algorithm may fail to form meaningful clusters because all the data points are nearly equidistant from the prototypes. You should perform data analysis and choose appropriate attributes for the input dataset so that you don't run into the curse of dimensionality.
  • Depending on the initialization method, the k-prototypes clustering algorithm may choose the initial prototypes randomly. Due to this, the cluster formation is also randomized, and the algorithm can produce different clusters for the same dataset on different runs unless you fix the random_state parameter. Also, the shape of the clusters in k-prototypes clustering depends on the initial prototypes.
  • In K-Prototypes clustering, you should always specify the indices of the categorical attributes using the categorical parameter.
  • The number of iterations in the k-prototypes clustering for convergence depends on the choice of initial prototypes. Due to this, if the prototypes aren’t selected in an efficient way, the runtime for the algorithm will be longer.
  • In K-prototypes clustering, we don't know the optimal number of clusters in advance. Due to this, we need to try different values of k to find the best clustering. You can select the optimal number of clusters using the elbow method for k-prototypes clustering in Python, as sketched below.
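
As a rough sketch of the elbow method, you can fit the model for several values of k and plot the cost_ attribute that the trained model exposes. This snippet assumes the data_array and categorical column indices from the implementation above.

import matplotlib.pyplot as plt
from kmodes import kprototypes

costs=[]
k_values=range(2, 8)
for k in k_values:
    model=kprototypes.KPrototypes(n_clusters=k, max_iter=20, random_state=42)
    model.fit_predict(data_array, categorical=[0, 1, 2])
    costs.append(model.cost_)  #total dissimilarity after convergence
#plot the cost against k and look for the "elbow" point
plt.plot(list(k_values), costs, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Cost")
plt.show()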

Conclusion

In this article, we have discussed the K-Prototypes clustering algorithm for mixed data types and its implementation in Python. The k-prototypes cluster algorithm finds its applications in various real-life situations due to its ability to handle mixed data types. You can use k-prototypes clustering in loan classification, customer segmentation, cyber profiling, and other situations where we need to group data into various clusters.

To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in Python.

To read about other computer science topics, you can read this article on dynamic role-based authorization using ASP.net. You can also read this article on user activity logging using Asp.net.
