Clustering For Mixed Data Types in Python
Clustering datasets into different groups finds applications in many industries. K-means clustering is one of the most popular clustering algorithms. However, it works only on numerical data. In the real world, many datasets contain numeric as well as categorical attributes. In such cases, we can use k-prototypes clustering. In this article, we will discuss the k-prototypes clustering algorithm for mixed data types and its implementation in detail.
What Is K-Prototypes Clustering?
K-Prototypes clustering is an unsupervised machine learning algorithm. It combines the k-means and k-modes clustering algorithms.
The k-means clustering algorithm is used to cluster numeric data. You can read about k-means clustering with a numerical example first if you don’t know about this algorithm.
The k-modes clustering algorithm is used to perform clustering on categorical data. I also advise you to read about the k-modes clustering algorithm with a numerical example to understand k-prototypes clustering better.
Being a combination of k-means and k-modes clustering, the k-prototypes clustering algorithm can perform clustering on a dataset with mixed data types. In other words, we can cluster a dataset having both numerical and categorical data using k-prototypes clustering.
Distance Measures in K-Prototypes Clustering
The K-Prototypes clustering algorithm uses different distance measures for numerical and categorical attributes for any given data.
Distance Measure for Numerical Attributes
For numerical attributes, the k-prototypes clustering algorithm uses the squared euclidean distance as the distance measure. If you have two records (1, 2, 3) and (4, 1, 5), the squared euclidean distance is calculated as (1-4)^2 + (2-1)^2 + (3-5)^2, which is equal to 14. The kprototypes module in the kmodes package implements the euclidean_dissim() function to calculate the squared euclidean distance.
The euclidean_dissim() function takes two numeric numpy arrays of the same shape and returns their squared euclidean distance, as shown below.
import numpy as np
from kmodes import kprototypes

#first data point
x1=np.array([1, 2, 3]).reshape(1,-1)
print("The first data point is:")
print(x1)
#second data point
x2=np.array([4, 1, 5]).reshape(1,-1)
print("The second data point is:")
print(x2)
#squared euclidean distance between the two points
distance=kprototypes.euclidean_dissim(x1, x2)
print("The squared euclidean distance is:")
print(distance)
Output:
The first data point is:
[[1 2 3]]
The second data point is:
[[4 1 5]]
The squared euclidean distance is:
[14]
The input arrays passed to the euclidean_dissim() function should have the same shape. Otherwise, the function raises a ValueError exception with the message “operands could not be broadcast together with shapes”. You can observe this in the following example.
#x1 has four elements while x2 has three
x1=np.array([1, 2, 3, 4]).reshape(1,-1)
print("The first data point is:")
print(x1)
x2=np.array([4, 1, 5]).reshape(1,-1)
print("The second data point is:")
print(x2)
#raises ValueError because the shapes do not match
distance=kprototypes.euclidean_dissim(x1, x2)
print("The squared euclidean distance is:")
print(distance)
Output:
The first data point is:
[[1 2 3 4]]
The second data point is:
[[4 1 5]]
ValueError: operands could not be broadcast together with shapes (1,4) (1,3)
Also, the inputs given to the euclidean_dissim() function must be numeric. Otherwise, the program runs into a TypeError exception.
Distance Measure for Categorical Attributes
For categorical attributes, the k-prototypes clustering algorithm uses matching dissimilarity. If you have two records (A, B, C, D) and (A, D, C, C) with categorical attributes, the matching dissimilarity is the number of positions at which the records have different values. The given records differ at two positions only. Hence, the matching dissimilarity between the records is 2.
The kprototypes module provides the matching_dissim() function to calculate the dissimilarity score between two records. The matching_dissim() function takes two numpy arrays having categorical data and returns their matching dissimilarity score, as shown in the following example.
x1=np.array(["A", "B", "C", "D"]).reshape(1,-1)
print("The first data point is:")
print(x1)
x2=np.array(["A", "D", "C", "C"]).reshape(1,-1)
print("The second data point is:")
print(x2)
distance=kprototypes.matching_dissim(x1, x2)
print("The matching dissimilarity is:")
print(distance)
Output:
The first data point is:
[['A' 'B' 'C' 'D']]
The second data point is:
[['A' 'D' 'C' 'C']]
The matching dissimilarity is:
[2]
Distance Measure for Records With Mixed Data Types
For records with mixed attributes, the k-prototypes clustering algorithm uses both the squared euclidean distance and matching dissimilarity. The algorithm calculates the squared euclidean distance for the numerical attributes and the matching dissimilarity for the categorical attributes, and combines the two to calculate the distance between records with mixed attributes.
For instance, suppose we have two records ['A', 'B', 'F', 155, 53] and ['A', 'A', 'M', 174, 70].
- To find the distance between these two records, we first find the matching dissimilarity between ['A', 'B', 'F'] and ['A', 'A', 'M']. The score is 2, as two attributes out of three have different values.
- Next, we calculate the squared euclidean distance between [155, 53] and [174, 70]. Here, (155-174)^2 + (53-70)^2 is equal to 650.
- Now, we can calculate the total dissimilarity score as the sum of the matching dissimilarity of the categorical attributes and the squared euclidean distance of the numerical attributes. Here, the sum is 650+2=652 (see the sketch after this list).
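The following snippet reproduces this computation using the euclidean_dissim() and matching_dissim() functions discussed above. Note that it takes a plain, unweighted sum of the two scores; the weighted alternative is discussed next.
import numpy as np
from kmodes import kprototypes

record_1 = ['A', 'B', 'F', 155, 53]
record_2 = ['A', 'A', 'M', 174, 70]

#split each record into its categorical and numeric parts
cat_1 = np.array(record_1[:3]).reshape(1, -1)
num_1 = np.array(record_1[3:], dtype=float).reshape(1, -1)
cat_2 = np.array(record_2[:3]).reshape(1, -1)
num_2 = np.array(record_2[3:], dtype=float).reshape(1, -1)

cat_distance = kprototypes.matching_dissim(cat_1, cat_2)   #[2]
num_distance = kprototypes.euclidean_dissim(num_1, num_2)  #[650.]
print(cat_distance + num_distance)                         #[652.]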
Now, observe that the matching dissimilarity score of the categorical attributes is almost negligible compared to the squared euclidean distance between the numerical attributes. Hence, the categorical attributes will have little or no effect on the clustering.
To solve this problem, we can scale the values in the numeric attributes to a small range, say 0 to 5. Alternatively, we can take a weighted sum of the matching dissimilarity score and the squared euclidean distance, which is what the gamma parameter discussed later does. In this article, we will work through a numerical example of the k-prototypes clustering algorithm by scaling the values in the numeric attributes to a range of 0 to 5.
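As a quick illustration of the scaling idea, the snippet below scales a few height values from the dataset used later in this article by dividing by the maximum and multiplying by 5, which is the same max-based scaling used in the implementation section.
import numpy as np

heights = np.array([155, 174, 177, 182, 152])
#scale so that the maximum value maps to 5
scaled = heights / heights.max() * 5
print(scaled)  #[4.258... 4.780... 4.862... 5.0 4.175...]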
Selection of New Prototypes in the Clusters
Once a cluster is formed, we need to calculate a new prototype for the cluster using the data points in the current cluster.
To calculate the new prototype for any given cluster, we will take the mode of categorical attributes of the data points in the cluster. For numerical attributes, we will use the mean of the values to calculate a new prototype for the cluster. For example, suppose that we have the following data points in a cluster.
EQ Rating | IQ Rating | Gender | Height | Weight |
C | A | F | 5.000000 | 4.705882 |
B | A | M | 4.175824 | 4.470588 |
B | A | F | 4.807692 | 3.235294 |
A | A | F | 4.972527 | 4.117647 |
B | C | F | 4.642857 | 3.470588 |
The first three attributes of the cluster are categorical. Hence, we calculate the mode of those attributes to obtain the prototype. For the numeric attributes, we use the mean of the values. The prototype for the above cluster is [B, A, F, 4.719780, 3.999999].
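The following snippet is a minimal sketch of this prototype computation for the cluster above, taking the column-wise mode of the categorical attributes and the column-wise mean of the numeric attributes.
import numpy as np
from collections import Counter

cluster = [
    ['C', 'A', 'F', 5.000000, 4.705882],
    ['B', 'A', 'M', 4.175824, 4.470588],
    ['B', 'A', 'F', 4.807692, 3.235294],
    ['A', 'A', 'F', 4.972527, 4.117647],
    ['B', 'C', 'F', 4.642857, 3.470588],
]

#mode of each of the first three (categorical) columns
prototype = [Counter(column).most_common(1)[0][0]
             for column in zip(*[row[:3] for row in cluster])]
#mean of each of the last two (numeric) columns
prototype += [float(np.mean(column))
              for column in zip(*[row[3:] for row in cluster])]
print(prototype)  #['B', 'A', 'F', 4.71978, 3.9999998]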
The K-Prototypes Algorithm
The K-Prototypes algorithm is similar to the k-means and k-modes clustering algorithms. The steps in the k-prototypes algorithm are as follows.
- First, we take K data points from the input dataset and use them as prototypes. The clusters are built around each prototype.
- Find the distance of each data point from the current prototypes. Here, the distance is calculated according to the method discussed in the previous section.
- After finding the distance of each data point from the prototypes, we assign data points to clusters. Here, each data point is assigned to the cluster with the prototype nearest to the data point.
- After assigning data points to the clusters, we calculate new prototypes for each cluster. To calculate the prototypes, we take the mean of numeric attributes and the mode of categorical attributes.
- If the new prototypes are the same as the previous prototypes, the algorithm has converged and the current clusters are finalized. Otherwise, go to step 2. A minimal sketch of this loop is shown after the list.
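To make these steps concrete, here is a minimal, illustrative sketch of the loop in Python. This is not the kmodes library implementation: the initialization is naive, the categorical term is unweighted, and the helper names mixed_distance and compute_prototype are hypothetical.
import numpy as np
from collections import Counter

def mixed_distance(a, b, categorical):
    #squared euclidean distance on numeric columns plus
    #matching dissimilarity on categorical columns
    d = 0.0
    for i in range(len(a)):
        if i in categorical:
            d += a[i] != b[i]
        else:
            d += (float(a[i]) - float(b[i])) ** 2
    return d

def compute_prototype(points, categorical, fallback):
    #mode of each categorical column, mean of each numeric column
    if not points:  #keep the old prototype if a cluster becomes empty
        return fallback
    prototype = []
    for i in range(len(points[0])):
        column = [p[i] for p in points]
        if i in categorical:
            prototype.append(Counter(column).most_common(1)[0][0])
        else:
            prototype.append(float(np.mean([float(v) for v in column])))
    return prototype

def k_prototypes(data, k, categorical, max_iter=20):
    #step 1: naively take the first k points as the initial prototypes
    prototypes = [list(data[i]) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        #steps 2-3: assign every point to its nearest prototype
        labels = [min(range(k),
                      key=lambda c: mixed_distance(point, prototypes[c], categorical))
                  for point in data]
        #step 4: recompute the prototype of each cluster
        new_prototypes = [compute_prototype([p for p, l in zip(data, labels) if l == c],
                                            categorical, prototypes[c])
                          for c in range(k)]
        #step 5: stop when the prototypes no longer change
        if new_prototypes == prototypes:
            break
        prototypes = new_prototypes
    return labels, prototypes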
Before moving to the implementation of k-prototypes clustering for mixed data types in Python, I would suggest reading this article on k-prototypes clustering with a numerical example to understand the algorithm better.
The KPrototypes() Function in Python
To implement the k-prototypes clustering algorithm in Python, we will use the KPrototypes() function defined in the kmodes.kprototypes module. The syntax of the KPrototypes() function is as follows.
KPrototypes(n_clusters=8, max_iter=100, num_dissim=euclidean_dissim, cat_dissim=matching_dissim, init='Cao', n_init=10, gamma=None, verbose=0, random_state=None, n_jobs=1)
Here,
- The n_clusters parameter takes the number of clusters to form as well as the number of prototypes to generate. By default, it has the value 8.
- While executing k-prototypes clustering, the algorithm iterates several times before producing the output clusters. To limit the number of iterations, we use the max_iter parameter. The max_iter parameter takes the maximum number of iterations in a single run. It has a default value of 100.
- The num_dissim parameter takes the function used to calculate the dissimilarity between numerical attributes. By default, the euclidean_dissim() function is used.
- The cat_dissim parameter takes the function used to calculate the dissimilarity between categorical attributes. By default, the matching_dissim() function is used.
- The n_init parameter decides the number of times the k-prototypes algorithm will be run with different prototype seeds. The final result is the best output of the n_init consecutive runs in terms of cost.
- The init parameter takes the method for initializing the prototypes as its input argument. The default method is 'Cao', introduced by Cao et al. [2009]. You can also pass 'Huang', based on Huang [1997, 1998], or 'random' to select prototypes in a random manner.
- The gamma parameter takes the weighing factor that determines the relative importance of numerical vs. categorical attributes. By default, it has the value None. In this case, gamma is automatically calculated from the data.
- The verbose parameter is used to execute the function in verbose mode. By default, it is set to 0. You can specify any non-zero value in the verbose parameter to execute the function in verbose mode.
- The n_jobs parameter is used to execute different runs of the algorithm in parallel. By default, the n_jobs parameter has the value 1, so all the runs of the algorithm execute one by one. For parallel execution, you can specify the number of runs to execute in parallel. If you are not executing any other program, you can set n_jobs equal to the number of cores in your CPU for maximum efficiency.
After execution, the KPrototypes() function returns an untrained model for k-prototypes clustering that can cluster a dataset with mixed data types.
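For instance, an untrained model with several of these parameters set explicitly might look like the snippet below. The parameter values here are arbitrary and chosen only for illustration.
from kmodes import kprototypes

model = kprototypes.KPrototypes(
    n_clusters=5,      #number of clusters and prototypes to form
    max_iter=50,       #cap on the iterations within a single run
    n_init=10,         #number of runs with different prototype seeds
    init='random',     #random prototype selection instead of the default 'Cao'
    gamma=0.5,         #weight of categorical vs. numerical dissimilarity
    verbose=1,         #print progress while fitting
    random_state=42,   #make the randomized initialization reproducible
    n_jobs=1,          #execute the n_init runs one by one
)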
Clustering For Mixed Data Types Implementation in Python
Now that we have discussed the basics of the k-prototypes clustering algorithm along with the KPrototypes() function, let us implement k-prototypes clustering to cluster mixed data types in Python.
Data Preprocessing
We are given the following table as the input dataset for k-prototypes clustering. It contains five attributes, namely EQ Rating, IQ Rating, Gender, Height, and Weight.
Student | EQ Rating | IQ Rating | Gender | Height(in cms) | Weight(in Kgs) |
Student 1 | A | B | F | 155 | 53 |
Student 2 | A | A | M | 174 | 70 |
Student 3 | C | C | M | 177 | 75 |
Student 4 | C | A | F | 182 | 80 |
Student 5 | B | A | M | 152 | 76 |
Student 6 | A | B | M | 160 | 69 |
Student 7 | B | A | F | 175 | 55 |
Student 8 | A | A | F | 181 | 70 |
Student 9 | A | C | M | 180 | 85 |
Student 10 | A | B | F | 166 | 54 |
Student 11 | C | C | M | 162 | 66 |
Student 12 | A | C | M | 153 | 74 |
Student 13 | A | B | M | 160 | 62 |
Student 14 | B | C | F | 169 | 59 |
Student 15 | A | B | F | 171 | 71 |
The numeric attributes in the above table aren't normalized. In this case, the distances between the numeric attributes will be very large compared to the categorical dissimilarities. So, the clustering will be biased towards height and weight. To avoid this, we will normalize the numeric attributes as shown in the following table.
Student | EQ Rating | IQ Rating | Gender | Height | Weight |
Student 1 | A | B | F | 4.258242 | 3.117647 |
Student 2 | A | A | M | 4.780220 | 4.117647 |
Student 3 | C | C | M | 4.862637 | 4.411765 |
Student 4 | C | A | F | 5.000000 | 4.705882 |
Student 5 | B | A | M | 4.175824 | 4.470588 |
Student 6 | A | B | M | 4.395604 | 4.058824 |
Student 7 | B | A | F | 4.807692 | 3.235294 |
Student 8 | A | A | F | 4.972527 | 4.117647 |
Student 9 | A | C | M | 4.945055 | 5.000000 |
Student 10 | A | B | F | 4.560440 | 3.176471 |
Student 11 | C | C | M | 4.450549 | 3.882353 |
Student 12 | A | C | M | 4.203297 | 4.352941 |
Student 13 | A | B | M | 4.395604 | 3.647059 |
Student 14 | B | C | F | 4.642857 | 3.470588 |
Student 15 | A | B | F | 4.697802 | 4.176471 |
Here, each numeric attribute has been scaled by dividing by its maximum value and multiplying by 5, so that the values lie between 0 and 5. For example, Student 1's height becomes 155/182*5 ≈ 4.258242.
Clustering for Mixed Data Types Using the fit_predict() and KPrototypes() Methods
After data preprocessing, we will use the following steps to implement k-prototypes clustering for mixed data types in Python.
- First, we will read the dataset from a csv file using the read_csv() method. The read_csv() method takes the filename of the csv file as its input argument and returns a pandas dataframe containing the dataset.
- After that, we will drop the unwanted columns from the input dataset using the drop() method. The drop() method takes a list of column names that need to be dropped. After execution, we get the modified dataframe.
- In the third step, we will normalize the values in the numeric attributes. In this example, we have scaled the values to lie between 0 and 5.
- After normalization, we will extract a numpy array of the values in the dataframe using the values attribute of the dataframe.
- Once we have obtained the values in the array, we will specify the data type of each column using the astype() method. The astype() method, when invoked on an array of elements, takes the name of the required data type and converts all the elements of the array to the given data type. This step is important to ensure that the categorical variables are represented as strings and the numerical values are represented as floating-point numbers or integers.
- Now, we will use the KPrototypes() function to create an untrained machine-learning model for k-prototypes clustering. For this example, we will keep the number of clusters at 3 and the maximum number of iterations at 20.
- After creating the untrained model, we will invoke the fit_predict() method on the untrained model. The fit_predict() method takes the input data array as its first input argument. Additionally, it takes the indices of the columns having categorical attributes in the categorical parameter. After execution, the fit_predict() method returns an array containing the cluster label for each data point in the input dataset.
- Finally, we can add the output array to the dataframe containing the input dataset to show the cluster label for each data point.
All these steps have been implemented in the following example.
#import modules
import pandas as pd
import numpy as np
from kmodes import kprototypes
#read data input
input_data=pd.read_csv("Kprototypes_dataset.csv")
print("The input data is:")
print(input_data)
df=input_data.copy()
#drop unnecessary columns
df.drop(columns=["Student"],inplace=True)
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
#create untrained model
untrained_model = kprototypes.KPrototypes(n_clusters=3, max_iter=20)
#predict clusters
clusters = untrained_model.fit_predict(data_array, categorical=[0, 1, 2])
input_data["Cluster labels"]=clusters
print("The clustered data is:")
input_data
Output:
The input data is:
Student EQ Rating IQ Rating Gender Height(in cms) Weight(in Kgs)
Student 1 A B F 155 53
Student 2 A A M 174 70
Student 3 C C M 177 75
Student 4 C A F 182 80
Student 5 B A M 152 76
Student 6 A B M 160 69
Student 7 B A F 175 55
Student 8 A A F 181 70
Student 9 A C M 180 85
Student 10 A B F 166 54
Student 11 C C M 162 66
Student 12 A C M 153 74
Student 13 A B M 160 62
Student 14 B C F 169 59
Student 15 A B F 171 71
The clustered data is:
Student EQ Rating IQ Rating Gender Height(in cms) Weight(in Kgs) Cluster labels
Student 1 A B F 155 53 2
Student 2 A A M 174 70 1
Student 3 C C M 177 75 0
Student 4 C A F 182 80 1
Student 5 B A M 152 76 0
Student 6 A B M 160 69 0
Student 7 B A F 175 55 2
Student 8 A A F 181 70 1
Student 9 A C M 180 85 0
Student 10 A B F 166 54 2
Student 11 C C M 162 66 0
Student 12 A C M 153 74 0
Student 13 A B M 160 62 2
Student 14 B C F 169 59 2
Student 15 A B F 171 71 1
In the above output, you can observe that we have successfully obtained three clusters.
The fit_predict() method assigns a cluster to each data point in the input dataset. If you want to create a trained model for k-prototypes clustering to cluster mixed data types, you can use the fit() method and the predict() method separately.
Clustering for Mixed Data Types Using the fit(), predict() and KPrototypes() Methods
The fit() method takes the input data array as its first input argument. Additionally, it takes the indices of the columns having categorical attributes in the categorical parameter. After execution, it returns a trained machine learning model.
On the trained machine learning model, you can invoke the predict() method to predict the cluster for a given data point. The predict() method takes an array of data points for which we have to determine the clusters. Additionally, it takes the indices of the columns having categorical attributes in the categorical parameter. After execution, it returns the cluster labels for the input data points. You can observe this in the following example.
#import modules
import pandas as pd
import numpy as np
from kmodes import kprototypes
#read data input
input_data=pd.read_csv("Kprototypes_dataset.csv")
df=input_data.copy()
#drop unnecessary columns
df.drop(columns=["Student"],inplace=True)
#Normalize dataset
df["Height(in cms)"]=(df["Height(in cms)"]/df["Height(in cms)"].abs().max())*5
df["Weight(in Kgs)"]=(df["Weight(in Kgs)"]/df["Weight(in Kgs)"].abs().max())*5
#obtain array of values
data_array=df.values
#specify data types
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
#create untrained model
untrained_model = kprototypes.KPrototypes(n_clusters=3, max_iter=20)
#train the model
trained_model = untrained_model.fit(data_array, categorical=[0, 1, 2])
#a new data point, already scaled like the training data
data_point=np.array(['A', 'B', 'F', 4.258241758241758, 3.1176470588235294]).reshape(1,-1)
#predict clusters for data point
predicted_cluster=trained_model.predict(data_point,categorical=[0, 1, 2])
print("The data point is:")
print(data_point)
print("The predicted cluster is:")
print(predicted_cluster)
Output:
The data point is:
[['A' 'B' 'F' '4.258241758241758' '3.1176470588235294']]
The predicted cluster is:
[0]
In the above example, we first created a trained machine learning model for k-prototypes clustering using the fit() method. Then we passed a new data point to the predict() method to predict its cluster.
Precautions While Using K-Prototypes Clustering for Mixed Data Types
- With increasing dimensionality and an increase in the number of categorical variables, the distance between the data points becomes almost constant. In this case, it is possible that the algorithm will not be able to initialize the clusters as all the data points will have equal distances from prototypes. You should perform data analysis and choose appropriate attributes for the input dataset so that you don’t face the curse of dimensionality.
- The k-prototypes clustering algorithm chooses the initial prototypes in a random manner. Due to this, the cluster formation is also randomized, and the algorithm may give different clusters for the same dataset across different runs. Also, the shape of the clusters in k-prototypes clustering depends on the initial prototypes.
- In K-Prototypes clustering, you should always specify the categorical attributes.
- The number of iterations in the k-prototypes clustering for convergence depends on the choice of initial prototypes. Due to this, if the prototypes aren’t selected in an efficient way, the runtime for the algorithm will be longer.
- In k-prototypes clustering, we don't know the optimal number of clusters in advance. Due to this, we need to try different numbers of clusters to find the optimal k. You can select the optimal number of clusters using the elbow method for k-prototypes clustering in Python, as sketched below.
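As a sketch of the elbow method, you can fit a model for each candidate number of clusters and plot the cost_ attribute of the trained model, which holds the clustering cost of the best run. The snippet below reuses data_array and the categorical column indices from the implementation section above; the range of k values is arbitrary.
import matplotlib.pyplot as plt
from kmodes import kprototypes

costs = []
k_values = range(2, 8)
for k in k_values:
    model = kprototypes.KPrototypes(n_clusters=k, max_iter=20, random_state=42)
    model.fit_predict(data_array, categorical=[0, 1, 2])
    costs.append(model.cost_)  #total dissimilarity of the best run

plt.plot(k_values, costs, marker='o')
plt.xlabel("Number of clusters (k)")
plt.ylabel("Clustering cost")
plt.title("Elbow method for k-prototypes clustering")
plt.show()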
Conclusion
In this article, we have discussed the k-prototypes clustering algorithm for mixed data types and its implementation in Python. The k-prototypes clustering algorithm finds applications in various real-life situations due to its ability to handle mixed data types. You can use k-prototypes clustering in loan classification, customer segmentation, cyber profiling, and other situations where we need to group data into clusters.
To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in python.
To read about other computer science topics, you can read this article on dynamic role-based authorization using ASP.net. You can also read this article on user activity logging using Asp.net.