Hierarchical Clustering for Categorical and Mixed Data Types in Python

Hierarchical clustering is one of the most popular families of clustering algorithms after partitioning algorithms such as k-means. In this article, we will discuss hierarchical clustering for categorical and mixed data types in Python. For this, we will implement agglomerative clustering on datasets that contain categorical data and mixed data types.

How to Perform Hierarchical Clustering for Categorical and Mixed Data Types?

Most of the clustering algorithms are primarily defined for numeric data types. However, we can perform clustering on categorical data or a dataset having mixed data types if we define the function to calculate the distance between the data points in such a dataset.
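
To make the idea of a distance between categorical data points concrete, here is a minimal sketch of the simple matching dissimilarity, which counts the attributes on which two records differ. The function name matching_dissimilarity() is hypothetical and written only for this illustration; the two records are the grades of Student 1 and Student 2 from the dataset used later in this article.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    a = np.asarray(a, dtype=object)
    b = np.asarray(b, dtype=object)
    if a.shape != b.shape:
        raise ValueError("records must have the same number of attributes")
    # Element-wise comparison gives True wherever the categories differ.
    return int(np.sum(a != b))

student_1 = ["A", "B", "A", "B", "A"]
student_2 = ["A", "A", "B", "B", "A"]
# The records differ in the second and third attribute.
print(matching_dissimilarity(student_1, student_2))  # 2
```

Two identical records get a score of 0, and the score grows by one for every attribute that does not match.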

In hierarchical clustering, we need to create a linkage matrix from the distance matrix of the dataset. So, if we define a function to calculate the distance between categorical or mixed data types, we can implement agglomerative clustering on the dataset.
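
The step from a distance matrix to a linkage matrix can be sketched on a toy example. One detail worth knowing is that SciPy's linkage() function expects the distances in condensed (one-dimensional) form; a two-dimensional array is interpreted as a set of observation vectors. The squareform() function converts between the two forms. The distance values below are made up for illustration, and "average" linkage is just one possible choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix for four points (made-up values).
square = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 4.0, 2.0, 0.0],
])

# Condensed form: the 6 pairwise distances above the diagonal.
condensed = squareform(square)
linkage_matrix = linkage(condensed, method="average")

print(condensed.shape)       # (6,)
print(linkage_matrix.shape)  # (3, 4): one row per merge
```

Each row of the linkage matrix describes one merge: the two clusters that were joined, the distance between them, and the size of the new cluster.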

To implement hierarchical clustering on categorical or mixed data types, you can use the functions defined in the article on the silhouette coefficient for k-modes and k-prototypes clustering.

Hierarchical Clustering for Categorical Data in Python

To implement agglomerative hierarchical clustering on categorical data, we will use the create_dm() function defined in the above-mentioned article to calculate the distance matrix for the given dataset. 

  • We will determine the distance matrix by using the dissimilarity score between each pair of data points. In a previous article on k-modes clustering with numerical examples, I explained how to calculate dissimilarity scores for categorical data.
  • To calculate the dissimilarity score in Python, we will use the kprototypes.matching_dissim() function. The working of this function is covered in the article on clustering mixed data types in Python.
  • Using the matching_dissim() function, we implemented the create_dm() function, which is discussed in the article on the silhouette coefficient.
  • Using the create_dm() function, we will first calculate the distance matrix for a given dataset having categorical data.
  • Next, we will pass the distance matrix to the linkage() function defined in the scipy.cluster.hierarchy module. The linkage() function will return a linkage matrix after execution. 
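
As an aside, the same kind of mismatch-count distance matrix can be sketched without an explicit double loop. This is an alternative to create_dm(), not the method used in this article: it assumes that each column is first encoded as integer codes and then relies on the "hamming" metric of scipy.spatial.distance.pdist(), which returns the fraction of mismatching attributes. The three records below are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Three made-up categorical records, used only for illustration.
toy = pd.DataFrame(
    {"Subject 1": ["A", "A", "C"],
     "Subject 2": ["B", "A", "C"],
     "Subject 3": ["A", "B", "B"]},
    index=["Student 1", "Student 2", "Student 3"],
)

# pdist() needs numbers, so encode each column's categories as integer codes.
codes = np.column_stack([pd.factorize(toy[col])[0] for col in toy.columns])

# "hamming" gives the fraction of mismatching attributes; multiplying by
# the number of columns turns it back into a mismatch count.
condensed = pdist(codes, metric="hamming") * codes.shape[1]
square = squareform(condensed)
print(square)
```

The condensed array can be passed to linkage() directly, while the square form is convenient for inspection.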

Once we get the linkage matrix, we can plot the dendrogram for the given categorical data. I have already discussed how to plot a dendrogram in Python in a separate article, which you can read for the details.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from kmodes import kprototypes  # provides matching_dissim(); requires the kmodes package
def create_dm(dataset):
    # If the input dataset is a DataFrame, we take out the values as a NumPy array.
    # If the input dataset is a NumPy array, we use it as is.
    if isinstance(dataset, pd.DataFrame):
        dataset = dataset.values
    n = len(dataset)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        # The matrix is symmetric with a zero diagonal, so each pair is computed once.
        for j in range(i + 1, n):
            x1 = dataset[i].reshape(1, -1)
            x2 = dataset[j].reshape(1, -1)
            # matching_dissim() returns a one-element array holding the mismatch count.
            distance = kprototypes.matching_dissim(x1, x2)[0]
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
data = pd.read_csv("KModes-dataset.csv", index_col=["Student"])
distance_matrix = create_dm(data)
# linkage() expects a condensed distance matrix, so we convert the square matrix first.
linkage_matrix = linkage(squareform(distance_matrix), "ward")
dendrogram(linkage_matrix, labels=data.index.to_list())
plt.title("Dendrogram Using ward Linkage")
plt.xticks(rotation='vertical')
plt.show()

Dataset:

[Image: input dataset]

Output:

[Figure: Dendrogram for Categorical Data in Python]

In the above code, we have plotted a dendrogram for categorical data using the scipy module.
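
If you only want to inspect the structure of the dendrogram without drawing it, the dendrogram() function accepts no_plot=True and returns the computed layout as a dictionary. The following is a minimal sketch on a made-up four-point distance matrix; the point names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

labels = ["P1", "P2", "P3", "P4"]  # hypothetical point names
# Toy symmetric distance matrix (made-up values).
square = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 4.0, 2.0, 0.0],
])
linkage_matrix = linkage(squareform(square), method="average")

# no_plot=True skips the drawing and only returns the layout information.
tree = dendrogram(linkage_matrix, labels=labels, no_plot=True)
print(tree["ivl"])     # leaf labels in left-to-right order
print(tree["leaves"])  # the same order expressed as row indices
```

The "ivl" entry is useful when you want to reorder the rows of the dataset in the same order as the leaves of the dendrogram.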

Instead of plotting the dendrogram, you can also obtain the cluster label of each data point by performing hierarchical agglomerative clustering on the categorical data as shown below.

import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from kmodes import kprototypes  # provides matching_dissim(); requires the kmodes package
def create_dm(dataset):
    # If the input dataset is a DataFrame, we take out the values as a NumPy array.
    # If the input dataset is a NumPy array, we use it as is.
    if isinstance(dataset, pd.DataFrame):
        dataset = dataset.values
    n = len(dataset)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        # The matrix is symmetric with a zero diagonal, so each pair is computed once.
        for j in range(i + 1, n):
            x1 = dataset[i].reshape(1, -1)
            x2 = dataset[j].reshape(1, -1)
            # matching_dissim() returns a one-element array holding the mismatch count.
            distance = kprototypes.matching_dissim(x1, x2)[0]
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
data = pd.read_csv("KModes-dataset.csv", index_col=["Student"])
distance_matrix = create_dm(data)
# linkage() expects a condensed distance matrix, so we convert the square matrix first.
linkage_matrix = linkage(squareform(distance_matrix), "ward")
# Cut the tree so that at most 3 flat clusters are formed.
cluster_labels = fcluster(linkage_matrix, 3, criterion='maxclust')
data["Cluster"] = cluster_labels
print("The clustered data is:")
print(data)

Output:

The clustered data is:
           Subject 1 Subject 2 Subject 3 Subject 4 Subject 5  Cluster
Student                                                              
Student 1          A         B         A         B         A        2
Student 2          A         A         B         B         A        3
Student 3          C         C         B         A         C        3
Student 4          C         A         B         B         A        3
Student 5          B         A         A         B         C        1
Student 6          A         B         B         A         C        2
Student 7          B         A         C         C         C        1
Student 8          A         A         A         A         A        2
Student 9          A         C         B         B         B        3
Student 10         A         B         B         A         A        2
Student 11         C         C         D         B         A        3
Student 12         A         C         B         B         C        3
Student 13         A         B         A         C         B        2
Student 14         B         C         C         D         B        1
Student 15         A         B         B         B         B        3
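
To see what the fcluster() function does with a linkage matrix, here is a minimal sketch on a made-up four-point distance matrix. It shows two of the criteria that fcluster() supports: "maxclust", which asks for at most a given number of clusters, and "distance", which cuts the tree at a given height. The values 2 and 2.5 are arbitrary choices for this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix (made-up values).
square = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 4.0, 2.0, 0.0],
])
linkage_matrix = linkage(squareform(square), method="average")

# Ask for at most two flat clusters.
by_count = fcluster(linkage_matrix, t=2, criterion="maxclust")
# Alternatively, cut the tree at a height of 2.5.
by_height = fcluster(linkage_matrix, t=2.5, criterion="distance")

print(by_count)
print(by_height)
```

On this toy data both calls group the first two points together and the last two points together, because the merges happen at heights 1, 2, and 4.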

Instead of using the scipy module and calculating the linkage matrix, you can directly implement hierarchical clustering on categorical data using the sklearn module in Python as shown below.

from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np
from kmodes import kprototypes  # provides matching_dissim(); requires the kmodes package
def create_dm(dataset):
    # If the input dataset is a DataFrame, we take out the values as a NumPy array.
    # If the input dataset is a NumPy array, we use it as is.
    if isinstance(dataset, pd.DataFrame):
        dataset = dataset.values
    n = len(dataset)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        # The matrix is symmetric with a zero diagonal, so each pair is computed once.
        for j in range(i + 1, n):
            x1 = dataset[i].reshape(1, -1)
            x2 = dataset[j].reshape(1, -1)
            # matching_dissim() returns a one-element array holding the mismatch count.
            distance = kprototypes.matching_dissim(x1, x2)[0]
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
data = pd.read_csv("KModes-dataset.csv", index_col=["Student"])
distance_matrix = create_dm(data)
# On scikit-learn versions older than 1.2, pass affinity="precomputed" instead of metric.
model = AgglomerativeClustering(n_clusters=3, metric="precomputed", linkage='complete')
model.fit(distance_matrix)
labels = model.labels_
data["Cluster"] = labels
print("The clustered data is:")
print(data)

Output:

The clustered data is:
           Subject 1 Subject 2 Subject 3 Subject 4 Subject 5  Cluster
Student                                                              
Student 1          A         B         A         B         A        0
Student 2          A         A         B         B         A        0
Student 3          C         C         B         A         C        0
Student 4          C         A         B         B         A        0
Student 5          B         A         A         B         C        2
Student 6          A         B         B         A         C        1
Student 7          B         A         C         C         C        2
Student 8          A         A         A         A         A        0
Student 9          A         C         B         B         B        0
Student 10         A         B         B         A         A        1
Student 11         C         C         D         B         A        0
Student 12         A         C         B         B         C        0
Student 13         A         B         A         C         B        1
Student 14         B         C         C         D         B        2
Student 15         A         B         B         B         B        1

In the above example, we have used the sklearn module for hierarchical clustering on categorical data. Here, we didn’t need to calculate the linkage matrix, because the AgglomerativeClustering() class takes the square distance matrix directly as its input. Note that the name of the parameter depends on your scikit-learn version. Before version 1.2, you should set the affinity parameter to "precomputed". From version 1.2 onward, you should set the metric parameter to "precomputed" instead, as the affinity parameter was deprecated in version 1.2 and removed in version 1.4.
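
If the same script has to run on several scikit-learn versions, one option is to pick the parameter name at runtime by inspecting the constructor. The helper precomputed_agglomerative() below is hypothetical and only a sketch of this idea; the distance matrix is made up for illustration.

```python
import inspect
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def precomputed_agglomerative(n_clusters, linkage="complete"):
    """Build a model that accepts a precomputed square distance matrix."""
    params = inspect.signature(AgglomerativeClustering.__init__).parameters
    # Releases that know `metric` should use it; older ones only have `affinity`.
    key = "metric" if "metric" in params else "affinity"
    return AgglomerativeClustering(
        n_clusters=n_clusters, linkage=linkage, **{key: "precomputed"}
    )

# Toy symmetric distance matrix (made-up values).
square = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 4.0, 2.0, 0.0],
])
labels = precomputed_agglomerative(2).fit_predict(square)
print(labels)
```

Keep in mind that "ward" linkage cannot be combined with a precomputed distance matrix in sklearn, which is why the sketch defaults to "complete" linkage.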

Hierarchical Clustering for Mixed Data Types in Python

By calculating the distance matrix, you can also implement agglomerative hierarchical clustering for mixed data types in Python. For this, we will use the following steps.

  • First, we will define a function to calculate the distance between two data points having mixed attributes. For this, we will calculate the dissimilarity between the categorical attributes and the distance between the numeric attributes separately, and then combine the two scores. I have already implemented this in the mixed_distance() function in the article on the silhouette coefficient.
  • Next, we will define a function that calculates the distance matrix for a dataset with attributes having mixed data types. The dm_prototypes() function from the same article does this.
  • Finally, we will pass the distance matrix to the linkage() function defined in the scipy.cluster.hierarchy module, which returns a linkage matrix.
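
The first step can be sketched without any clustering library. The function mixed_dissimilarity() below is a hypothetical, simplified stand-in for mixed_distance(): it counts mismatches on the categorical attributes and adds the numeric part weighted by alpha. Using the squared Euclidean distance for the numeric part is an assumption of this sketch, and the two records are made up.

```python
import numpy as np

def mixed_dissimilarity(a, b, categorical, alpha=0.1):
    """Mismatch count on categorical attributes plus alpha times the
    squared Euclidean distance on the remaining (numeric) attributes."""
    cat_idx = list(categorical)
    num_idx = [i for i in range(len(a)) if i not in cat_idx]
    mismatches = sum(a[i] != b[i] for i in cat_idx)
    a_num = np.array([a[i] for i in num_idx], dtype=float)
    b_num = np.array([b[i] for i in num_idx], dtype=float)
    squared_euclidean = float(np.sum((a_num - b_num) ** 2))
    return mismatches + alpha * squared_euclidean

# Hypothetical records: two grades, gender, scaled height, scaled weight.
p = ["A", "B", "F", 4.0, 3.0]
q = ["A", "A", "M", 4.5, 3.0]
# Two categorical mismatches plus 0.1 * 0.25 from the height difference.
print(mixed_dissimilarity(p, q, categorical=[0, 1, 2]))
```

The alpha parameter controls how much the numeric attributes contribute relative to the categorical ones, which is also why the numeric columns are rescaled before the distance matrix is calculated.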

Once we get the linkage matrix, we can plot the dendrogram for the given dataset having mixed data types using the dendrogram() function as shown below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from kmodes import kprototypes  # provides matching_dissim() and euclidean_dissim(); requires the kmodes package
def mixed_distance(a, b, categorical=None, alpha=0.01):
    # Without categorical columns, only the numeric dissimilarity is used.
    if categorical is None:
        a_num = np.asarray(a, dtype=float).reshape(1, -1)
        b_num = np.asarray(b, dtype=float).reshape(1, -1)
        return kprototypes.euclidean_dissim(a_num, b_num)[0]
    # Split each record into its categorical and numeric attributes.
    num_index = [i for i in range(len(a)) if i not in categorical]
    a_cat = np.array([a[i] for i in categorical]).reshape(1, -1)
    b_cat = np.array([b[i] for i in categorical]).reshape(1, -1)
    a_num = np.array([a[i] for i in num_index], dtype=float).reshape(1, -1)
    b_num = np.array([b[i] for i in num_index], dtype=float).reshape(1, -1)
    cat_score = kprototypes.matching_dissim(a_cat, b_cat)[0]
    num_score = kprototypes.euclidean_dissim(a_num, b_num)[0]
    # alpha weights the numeric part relative to the categorical part.
    return cat_score + num_score * alpha
def dm_prototypes(dataset, categorical=None, alpha=0.1):
    # If the input dataset is a DataFrame, we take out the values as a NumPy array.
    # If the input dataset is a NumPy array, we use it as is.
    if isinstance(dataset, pd.DataFrame):
        dataset = dataset.values
    n = len(dataset)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        # The matrix is symmetric with a zero diagonal, so each pair is computed once.
        for j in range(i + 1, n):
            distance = mixed_distance(dataset[i], dataset[j], categorical=categorical, alpha=alpha)
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
df = pd.read_csv("Kprototypes_dataset.csv", index_col=["Student"])
# Rescale the numeric columns so that their largest absolute value is 5.
df["Height(in cms)"] = (df["Height(in cms)"] / df["Height(in cms)"].abs().max()) * 5
df["Weight(in Kgs)"] = (df["Weight(in Kgs)"] / df["Weight(in Kgs)"].abs().max()) * 5
# Obtain the array of values.
data_array = df.values
# Specify the data types: the first three columns are categorical, the rest numeric.
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
distance_matrix = dm_prototypes(data_array, categorical=[0, 1, 2], alpha=0.1)
# linkage() expects a condensed distance matrix, so we convert the square matrix first.
linkage_matrix = linkage(squareform(distance_matrix), "ward")
dendrogram(linkage_matrix, labels=df.index.to_list())
plt.title("Dendrogram Using ward Linkage")
plt.xticks(rotation='vertical')
plt.show()

Dataset:

[Image: input dataset]

Output:

[Figure: Dendrogram for Mixed Data Types in Python]

Instead of plotting the dendrogram, you can also obtain the cluster label of each data point by performing hierarchical agglomerative clustering on mixed data types as shown below.

import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from kmodes import kprototypes  # provides matching_dissim() and euclidean_dissim(); requires the kmodes package
def mixed_distance(a, b, categorical=None, alpha=0.01):
    # Without categorical columns, only the numeric dissimilarity is used.
    if categorical is None:
        a_num = np.asarray(a, dtype=float).reshape(1, -1)
        b_num = np.asarray(b, dtype=float).reshape(1, -1)
        return kprototypes.euclidean_dissim(a_num, b_num)[0]
    # Split each record into its categorical and numeric attributes.
    num_index = [i for i in range(len(a)) if i not in categorical]
    a_cat = np.array([a[i] for i in categorical]).reshape(1, -1)
    b_cat = np.array([b[i] for i in categorical]).reshape(1, -1)
    a_num = np.array([a[i] for i in num_index], dtype=float).reshape(1, -1)
    b_num = np.array([b[i] for i in num_index], dtype=float).reshape(1, -1)
    cat_score = kprototypes.matching_dissim(a_cat, b_cat)[0]
    num_score = kprototypes.euclidean_dissim(a_num, b_num)[0]
    # alpha weights the numeric part relative to the categorical part.
    return cat_score + num_score * alpha
def dm_prototypes(dataset, categorical=None, alpha=0.1):
    # If the input dataset is a DataFrame, we take out the values as a NumPy array.
    # If the input dataset is a NumPy array, we use it as is.
    if isinstance(dataset, pd.DataFrame):
        dataset = dataset.values
    n = len(dataset)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        # The matrix is symmetric with a zero diagonal, so each pair is computed once.
        for j in range(i + 1, n):
            distance = mixed_distance(dataset[i], dataset[j], categorical=categorical, alpha=alpha)
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
df = pd.read_csv("Kprototypes_dataset.csv", index_col=["Student"])
# Rescale the numeric columns so that their largest absolute value is 5.
df["Height(in cms)"] = (df["Height(in cms)"] / df["Height(in cms)"].abs().max()) * 5
df["Weight(in Kgs)"] = (df["Weight(in Kgs)"] / df["Weight(in Kgs)"].abs().max()) * 5
# Obtain the array of values.
data_array = df.values
# Specify the data types: the first three columns are categorical, the rest numeric.
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
distance_matrix = dm_prototypes(data_array, categorical=[0, 1, 2], alpha=0.1)
# linkage() expects a condensed distance matrix, so we convert the square matrix first.
linkage_matrix = linkage(squareform(distance_matrix), "ward")
# Cut the tree so that at most 3 flat clusters are formed.
cluster_labels = fcluster(linkage_matrix, 3, criterion='maxclust')
# Read the DataFrame again, as the features were normalized earlier.
df = pd.read_csv("Kprototypes_dataset.csv", index_col=["Student"])
df.columns = ["IQ", "EQ", "Gender", "Height", "Weight"]
df["Cluster"] = cluster_labels
print("The clustered data is:")
print(df)

Output:

The clustered data is:
           IQ EQ Gender  Height  Weight  Cluster
Student                                         
Student 1   A  B      F     155      53        1
Student 2   A  A      M     174      70        3
Student 3   C  C      M     177      75        3
Student 4   C  A      F     182      80        2
Student 5   B  A      M     152      76        3
Student 6   A  B      M     160      69        1
Student 7   B  A      F     175      55        2
Student 8   A  A      F     181      70        1
Student 9   A  C      M     180      85        3
Student 10  A  B      F     166      54        1
Student 11  C  C      M     162      66        3
Student 12  A  C      M     153      74        3
Student 13  A  B      M     160      62        1
Student 14  B  C      F     169      59        2
Student 15  A  B      F     171      71        1

Instead of using the scipy module and calculating the linkage matrix, you can directly implement hierarchical clustering on mixed data types using the sklearn module in Python as shown below.

from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np
from kmodes import kprototypes  # provides matching_dissim() and euclidean_dissim(); requires the kmodes package
def mixed_distance(a, b, categorical=None, alpha=0.01):
    # Without categorical columns, only the numeric dissimilarity is used.
    if categorical is None:
        a_num = np.asarray(a, dtype=float).reshape(1, -1)
        b_num = np.asarray(b, dtype=float).reshape(1, -1)
        return kprototypes.euclidean_dissim(a_num, b_num)[0]
    # Split each record into its categorical and numeric attributes.
    num_index = [i for i in range(len(a)) if i not in categorical]
    a_cat = np.array([a[i] for i in categorical]).reshape(1, -1)
    b_cat = np.array([b[i] for i in categorical]).reshape(1, -1)
    a_num = np.array([a[i] for i in num_index], dtype=float).reshape(1, -1)
    b_num = np.array([b[i] for i in num_index], dtype=float).reshape(1, -1)
    cat_score = kprototypes.matching_dissim(a_cat, b_cat)[0]
    num_score = kprototypes.euclidean_dissim(a_num, b_num)[0]
    # alpha weights the numeric part relative to the categorical part.
    return cat_score + num_score * alpha
def dm_prototypes(dataset, categorical=None, alpha=0.1):
    # If the input dataset is a DataFrame, we take out the values as a NumPy array.
    # If the input dataset is a NumPy array, we use it as is.
    if isinstance(dataset, pd.DataFrame):
        dataset = dataset.values
    n = len(dataset)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        # The matrix is symmetric with a zero diagonal, so each pair is computed once.
        for j in range(i + 1, n):
            distance = mixed_distance(dataset[i], dataset[j], categorical=categorical, alpha=alpha)
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
    return distance_matrix
df = pd.read_csv("Kprototypes_dataset.csv", index_col=["Student"])
# Rescale the numeric columns so that their largest absolute value is 5.
df["Height(in cms)"] = (df["Height(in cms)"] / df["Height(in cms)"].abs().max()) * 5
df["Weight(in Kgs)"] = (df["Weight(in Kgs)"] / df["Weight(in Kgs)"].abs().max()) * 5
# Obtain the array of values.
data_array = df.values
# Specify the data types: the first three columns are categorical, the rest numeric.
data_array[:, 0:3] = data_array[:, 0:3].astype(str)
data_array[:, 3:] = data_array[:, 3:].astype(float)
distance_matrix = dm_prototypes(data_array, categorical=[0, 1, 2], alpha=0.1)
# On scikit-learn versions older than 1.2, pass affinity="precomputed" instead of metric.
model = AgglomerativeClustering(n_clusters=3, metric="precomputed", linkage='complete')
model.fit(distance_matrix)
labels = model.labels_
# Read the DataFrame again, as the features were normalized earlier.
df = pd.read_csv("Kprototypes_dataset.csv", index_col=["Student"])
df.columns = ["IQ", "EQ", "Gender", "Height", "Weight"]
df["Cluster"] = labels
print("The clustered data is:")
print(df)

Output:

The clustered data is:
           IQ EQ Gender  Height  Weight  Cluster
Student                                         
Student 1   A  B      F     155      53        0
Student 2   A  A      M     174      70        0
Student 3   C  C      M     177      75        0
Student 4   C  A      F     182      80        0
Student 5   B  A      M     152      76        2
Student 6   A  B      M     160      69        1
Student 7   B  A      F     175      55        2
Student 8   A  A      F     181      70        0
Student 9   A  C      M     180      85        0
Student 10  A  B      F     166      54        1
Student 11  C  C      M     162      66        0
Student 12  A  C      M     153      74        0
Student 13  A  B      M     160      62        1
Student 14  B  C      F     169      59        2
Student 15  A  B      F     171      71        1

Conclusion

In this article, we discussed how to perform hierarchical agglomerative clustering on categorical and mixed data types. We also discussed how to plot dendrograms for categorical and mixed data types. For this, we used the scipy and sklearn modules in Python.

To learn more about machine learning, you can read the article on regression in machine learning. You might also like the article on polynomial regression using sklearn in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
