Plot Dendrogram in Python

Dendrograms are a useful tool for visualizing hierarchy in a dataset. In this article, we will discuss what a dendrogram is and how to plot a dendrogram in Python. This article also discusses the advantages, disadvantages, and applications of dendrograms.

Dendrogram Definition

A dendrogram is a graphical representation of a hierarchical structure, such as a taxonomy or the result of a cluster analysis. Dendrograms are commonly used in biology, sociology, and other fields to visualize and analyze the relationships between different entities. They are also used in hierarchical clustering techniques, such as agglomerative clustering and divisive clustering, to show how data points are grouped into clusters.

Distance Measures for Creating Dendrograms

Dendrograms are created using the distance matrix obtained from the data points. You can create the distance matrix using any metric, such as Manhattan distance, Euclidean distance, or squared Euclidean distance. For datasets having categorical or mixed data types, you can define and use a suitable custom distance measure to obtain the distance matrix from the data points.

After obtaining the distance matrix for the data points, we need a linkage method to decide how the distance between two clusters is calculated. Let us discuss some of these methods.
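
As a rough illustration of the metrics named above, the following sketch builds distance matrices with the pdist() and squareform() functions from scipy.spatial.distance. The three points are made up for this example, and 'cityblock', 'euclidean', and 'sqeuclidean' are SciPy's names for Manhattan, Euclidean, and squared Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical 2-D points, chosen only for illustration.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Map a readable name to the metric identifier that SciPy expects.
metrics = {
    "Manhattan": "cityblock",
    "Euclidean": "euclidean",
    "squared Euclidean": "sqeuclidean",
}

matrices = {}
for name, metric in metrics.items():
    # pdist() returns the pairwise distances as a flat array;
    # squareform() expands it into the familiar square matrix.
    matrices[name] = squareform(pdist(points, metric=metric))
    print(name)
    print(matrices[name])
```

The same points produce different matrices under each metric, so the choice of metric can change the shape of the resulting dendrogram.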

Single Linkage Method

The single linkage method uses the most similar (nearest) data points from two given clusters to calculate the distance between the clusters. You can observe this in the following image.

Figure: Single Linkage Method

In the above image, you can observe that the nearest points between cluster X and cluster Y have been used to calculate the distance between the clusters.

Complete Linkage Method

The complete linkage method uses the least similar (farthest) data points from two clusters to calculate the distance between the clusters. You can observe this in the following image.

Figure: Complete Linkage Method

In the above image, you can observe that the distance between the farthest points in cluster X and cluster Y is taken as the distance between cluster X and cluster Y.

Centroid Linkage Method

In centroid linkage, we use the centroids of the clusters to calculate the distance between two clusters, as shown in the following image.

Figure: Centroid Linkage Method

In the above image, you can observe that the distance between the centroids of cluster X and cluster Y is taken as the distance between cluster X and cluster Y.
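
To make the three linkage definitions concrete, here is a minimal sketch that computes each cluster-to-cluster distance by hand. The two clusters cluster_x and cluster_y are hypothetical points invented for this example, and Euclidean distance is assumed throughout.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters lying on a horizontal line.
cluster_x = np.array([[1.0, 1.0], [2.0, 1.0]])
cluster_y = np.array([[5.0, 1.0], [7.0, 1.0]])

# All Euclidean distances between a point in X and a point in Y.
pairwise = cdist(cluster_x, cluster_y)

# Single linkage: distance between the nearest pair of points.
single = pairwise.min()

# Complete linkage: distance between the farthest pair of points.
complete = pairwise.max()

# Centroid linkage: distance between the mean points of the clusters.
centroid = np.linalg.norm(cluster_x.mean(axis=0) - cluster_y.mean(axis=0))

print("Single linkage distance:", single)
print("Complete linkage distance:", complete)
print("Centroid linkage distance:", centroid)
```

For these points, the single, complete, and centroid distances are 3.0, 6.0, and 4.5 respectively, which shows how much the chosen method can change the distance between the same two clusters.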

How to Plot a Dendrogram in Python?

Now that we have discussed the distance measures and linkage methods, let us write a Python script to create a dendrogram from a given dataset.

To create the dendrogram in Python, let us take the points A(1, 1), B(2, 3), C(3, 5), D(4, 5), E(6, 6), and F(7, 5). Before creating the dendrogram, we will calculate the distance matrix for the given data points. After that, we can use the distance matrix to create dendrograms using different linkage methods.

Create a Distance Matrix From Data Points

To create a distance matrix from the data points, we will use the following steps.

  1. First, we will import the necessary Python modules, namely pandas and scipy.
  2. Then, we will define a Python list of lists to store all the data points.
  3. Next, we will create a pandas dataframe using the data points. The dataframe contains the coordinates of the points as its columns and the names of the points as the row index.
  4. After creating the dataframe, we will create a distance matrix of the data points using the distance_matrix() function defined in the scipy.spatial module. The distance_matrix() function takes the coordinate values of the dataframe (df.values) as its first and second input arguments and returns a two-dimensional array of Euclidean distances.
  5. Now, we will convert the two-dimensional array to a dataframe to annotate the rows and columns of the matrix with the point names. The output dataframe in this step is our final distance matrix.

You can observe this in the following example.

import pandas as pd
from scipy.spatial import distance_matrix

# Coordinates of the points and their names
data = [[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]]
points = ["A", "B", "C", "D", "E", "F"]
print("The data points are:")
print(data)
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=points)
# Pairwise Euclidean distances, labelled with the point names
ytdist = pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
print("The distance matrix is:")
print(ytdist)

Output:

The data points are:
[[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]]
The distance matrix is:
          A         B         C         D         E         F
A  0.000000  2.236068  4.472136  5.000000  7.071068  7.211103
B  2.236068  0.000000  2.236068  2.828427  5.000000  5.385165
C  4.472136  2.236068  0.000000  1.000000  3.162278  4.000000
D  5.000000  2.828427  1.000000  0.000000  2.236068  3.000000
E  7.071068  5.000000  3.162278  2.236068  0.000000  1.414214
F  7.211103  5.385165  4.000000  3.000000  1.414214  0.000000
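
SciPy's clustering functions work with a condensed distance matrix, which stores only the upper triangle of the square matrix as a one-dimensional array. The following sketch, which reuses the six points from this article, shows one way to convert between the two forms with squareform() and checks the result against pdist(). The variable names are chosen only for this illustration.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.spatial.distance import pdist, squareform

data = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)

# Square 6 x 6 matrix, as printed in the output above.
square = distance_matrix(data, data)

# Condensed form: the 15 distances above the diagonal, row by row.
condensed = squareform(square)

# pdist() computes the condensed form directly from the points.
direct = pdist(data)

print(square.shape)
print(condensed.shape)
print(np.allclose(condensed, direct))
```

For six points, the condensed array has 6 * 5 / 2 = 15 entries, one for each unordered pair of points.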

Plot Dendrogram in Python

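After creating the distance matrix, we can use different linkage methods to create dendrograms in Python. To plot the dendrogram, we will first create a linkage matrix using the linkage() function defined in the scipy.cluster.hierarchy module.

The linkage() function takes a condensed distance matrix (the upper triangle of the square distance matrix, flattened into a one-dimensional array) as its first input argument and the type of linkage as its second input argument. We can obtain the condensed form by passing the square distance matrix to the squareform() function defined in the scipy.spatial.distance module. If we pass the square matrix to linkage() directly, each row is treated as an observation vector rather than as a set of distances, which produces a different tree. After execution, linkage() returns a linkage matrix.

After obtaining the linkage matrix, we will use the dendrogram() function to plot the dendrogram. The dendrogram() function takes the linkage matrix as its first input argument and the data labels through its labels parameter. After execution, it plots the dendrogram.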
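
Before plotting, it can help to look at the linkage matrix itself. In the sketch below, which uses the article's six points with single linkage, each row of the matrix describes one merge: the indices of the two clusters joined, the distance at which they were joined, and the number of original points in the new cluster. Indices 0 to 5 refer to the points A to F, and larger indices refer to clusters formed in earlier rows.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

data = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)

# pdist() gives the condensed Euclidean distances that linkage() expects.
Z = linkage(pdist(data), method="single")

# n points always produce n - 1 merges, each described by 4 values.
print(Z.shape)
for left, right, height, size in Z:
    print(f"merge {int(left)} + {int(right)} at distance {height:.3f} "
          f"-> cluster of {int(size)} points")
```

The first row merges points 2 and 3 (C and D) at distance 1.0, which is the smallest entry in the distance matrix. The heights in the third column are the values drawn on the vertical axis of the dendrogram.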

Plot Dendrogram Using Single Linkage Method in Python

To plot a dendrogram using the single linkage method in Python, we will first convert the square distance matrix to its condensed form using the squareform() function. Then, we will create a linkage matrix by passing the condensed distance matrix as the first input argument and the literal "single" as the second input argument to the linkage() function. Finally, we will pass the linkage matrix to the dendrogram() function to create a dendrogram using single linkage, as shown below.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import dendrogram, linkage

data = [[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]]
points = ["A", "B", "C", "D", "E", "F"]
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=points)
ytdist = pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
# linkage() expects a condensed distance matrix, so flatten the square matrix first.
# Passing the square matrix directly would treat each row as an observation.
condensed_dist = squareform(ytdist.values)
linkage_matrix = linkage(condensed_dist, "single")
dendrogram(linkage_matrix, labels=points)
plt.title("Dendrogram Using Single Linkage")
plt.show()

Output:

Figure: Dendrogram in Python Using Single Linkage

Plot Dendrogram Using Complete Linkage Method in Python

To plot a dendrogram using the complete linkage method, we will pass the condensed distance matrix to the linkage() function as its first input argument and the literal "complete" as its second input argument. You can observe this in the following example.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import dendrogram, linkage

data = [[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]]
points = ["A", "B", "C", "D", "E", "F"]
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=points)
ytdist = pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
# linkage() expects a condensed distance matrix, so flatten the square matrix first.
condensed_dist = squareform(ytdist.values)
linkage_matrix = linkage(condensed_dist, "complete")
dendrogram(linkage_matrix, labels=points)
plt.title("Dendrogram Using Complete Linkage")
plt.show()

Output:

Figure: Dendrogram in Python Using Complete Linkage

Plot Dendrogram Using Centroid Linkage Method in Python

To plot a dendrogram using the centroid linkage method, we will pass the condensed distance matrix to the linkage() function as its first input argument and the literal "centroid" as its second input argument. Centroid linkage is only well defined for Euclidean distances, which is the case for our distance matrix. You can observe this in the following example.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import dendrogram, linkage

data = [[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]]
points = ["A", "B", "C", "D", "E", "F"]
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=points)
ytdist = pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
# linkage() expects a condensed distance matrix, so flatten the square matrix first.
condensed_dist = squareform(ytdist.values)
linkage_matrix = linkage(condensed_dist, "centroid")
dendrogram(linkage_matrix, labels=points)
plt.title("Dendrogram Using Centroid Linkage")
plt.show()

Output:

Figure: Dendrogram in Python Using Centroid Linkage
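
The three dendrograms differ mainly in the heights at which clusters are joined. As a rough comparison, the following sketch computes the merge heights for each linkage method on the article's six points without drawing anything. It passes the condensed distances from pdist() to linkage(), and the dictionary final_heights is a name introduced only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

data = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)
condensed = pdist(data)  # condensed Euclidean distance matrix

final_heights = {}
for method in ("single", "complete", "centroid"):
    Z = linkage(condensed, method=method)
    # Column 2 of the linkage matrix holds the height of each merge.
    final_heights[method] = Z[-1, 2]
    print(f"{method:>8}: merge heights = {np.round(Z[:, 2], 3)}")
```

With these points, single linkage joins everything by a height of about 2.24, centroid linkage by about 4.78, and complete linkage by about 7.21. The first two merges (C with D, then E with F) happen at the same heights in all three methods because they involve only single points.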

Advantages of Dendrograms

Dendrograms have many advantages. Some of them are listed below.

  1. Visual clarity: Dendrograms provide a clear and intuitive way to visualize hierarchical relationships. They can be used to understand the relationships between different groups or categories and can help to identify patterns and trends within data.
  2. Flexibility: Dendrograms can be customized to suit a wide range of data and analysis goals. For example, they can be used to display both quantitative and qualitative data and can be modified to highlight specific relationships or patterns within the data.
  3. Ease of interpretation: Dendrograms are relatively easy to understand and interpret, even for those who are not experts in the field. This makes them a useful tool for presenting and communicating data and analysis to a wide audience.
  4. Versatility: Dendrograms can be used to analyze a wide range of data types, including categorical, ordinal, and continuous data. They can also be used in combination with other analytical techniques, such as multivariate analysis, to provide a more complete picture of the data.
  5. Efficiency: Dendrograms can be used to quickly and effectively visualize complex data sets, making them a useful tool for data exploration and analysis. They can help to identify trends and patterns within the data more efficiently than other methods, such as manual inspection or statistical analysis.
  6. Customization: Dendrograms can be customized in a variety of ways, including the choice of layout, the use of different colors or symbols to represent different data points, and the inclusion of additional information, such as labels or annotations.
  7. Comparison: Dendrograms can be used to compare different data sets or groups, making it easier to identify differences and similarities between them.
  8. Data aggregation: Dendrograms can be used to aggregate data from multiple sources or to display data at different levels of granularity. This can be useful for understanding the relationships between different data points or groups, and can help to identify patterns or trends that may not be immediately apparent.
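
As one possible illustration of the customization mentioned above, the sketch below draws the article's six points as a horizontal dendrogram and colors the branches that merge below a chosen height. The threshold of 3.0, the grey color, and the output file name custom_dendrogram.png are arbitrary choices made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

data = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)
labels = ["A", "B", "C", "D", "E", "F"]
Z = linkage(pdist(data), method="complete")

fig, ax = plt.subplots(figsize=(6, 4))
result = dendrogram(
    Z,
    labels=labels,
    orientation="right",            # root on the right, leaves on the left
    color_threshold=3.0,            # color subtrees that merge below this height
    above_threshold_color="grey",   # color for links above the threshold
    leaf_font_size=10,
    ax=ax,
)
ax.set_xlabel("Distance")
ax.set_title("Customized Dendrogram (Complete Linkage)")
fig.tight_layout()
fig.savefig("custom_dendrogram.png")  # hypothetical output file name
```

The dendrogram() function also returns a dictionary describing the plot. Its "ivl" entry lists the leaf labels in the order in which they are drawn.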

Disadvantages of Dendrograms

Apart from their advantages, dendrograms have certain disadvantages too. Let us discuss a few of them.

  1. Limited complexity: Dendrograms can only represent hierarchical relationships, so they may not be suitable for data sets with more complex relationships.
  2. Subjective interpretation: The interpretation of a dendrogram can be somewhat subjective, as the way the data is grouped and the relationships between groups are not always clear-cut. This can lead to disagreement or confusion about the meaning of the dendrogram.
  3. Limited ability to compare groups: A single dendrogram only shows relationships within one data set. Comparing two data sets requires drawing separate dendrograms and matching their branches by eye, which becomes difficult as the trees grow.
  4. Limited ability to represent continuous data: A dendrogram does not display the original feature values. Continuous measurements are reduced to pairwise distances before plotting, so the plot shows how similar the items are but not the values that make them similar.
  5. Limited ability to show multiple relationships: Dendrograms can only show one set of hierarchical relationships at a time, so they may not be suitable for data sets with multiple, complex relationships. In these cases, other visualization methods, such as scatter plots or multivariate analysis, may be more appropriate.
  6. Limited data types: Dendrograms are most effective for visualizing and analyzing hierarchical relationships, such as taxonomies or cluster analysis. They are not as effective for other types of data or analysis, such as continuous data or regression analysis.
  7. Limited flexibility: While dendrograms can be customized to some extent, they are limited in their ability to display more complex data or relationships. For example, they cannot show overlapping groups in which one item belongs to several clusters at the same level.
  8. Complexity: Dendrograms can become complex and cluttered when displaying large or multi-dimensional data sets. This can make them difficult to interpret and may require the use of additional tools, such as filtering or data reduction, to make the data more manageable.
  9. Limited interactivity: Dendrograms are typically static representations of data, meaning that they do not offer the same level of interactivity as some other visualization tools. This can make it more difficult to explore and analyze data in real-time, or to make changes to the visualization on the fly.
  10. Limited scalability: Dendrograms may not be effective for visualizing very large data sets, as the complexity of the visualization can become overwhelming for the viewer. This may require the use of additional tools, such as sampling or data reduction, to make the data more manageable.
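
One way to reduce the clutter described above is to truncate the dendrogram so that only the last few merges are drawn. The following sketch assumes 200 randomly generated points, which are hypothetical data used only for this example, and uses the truncate_mode and p parameters of the dendrogram() function. The no_plot=True argument returns the plot description without drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 200 random 2-D points.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))

# When raw observations are passed, linkage() computes Euclidean
# distances between them internally.
Z = linkage(data, method="complete")

# Full dendrogram: one leaf per data point.
full = dendrogram(Z, no_plot=True)

# Truncated dendrogram: only the last p = 10 merged clusters are kept,
# and everything below them is collapsed into a single leaf.
short = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

print(len(full["ivl"]), "leaves in the full dendrogram")
print(len(short["ivl"]), "leaves in the truncated dendrogram")
```

Removing no_plot=True draws the truncated tree with matplotlib, where collapsed leaves are labeled with the number of points they contain.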

Applications of Dendrograms

Dendrograms are commonly used in a variety of fields to visualize and analyze hierarchical relationships within data. Some common applications of dendrograms include the following.

  1. Taxonomy: Dendrograms are often used to visualize the relationships between different species or groups within a taxonomy. They can help to understand the evolutionary relationships between different species and can be used to classify new species or to identify new taxonomic groups.
  2. Cluster analysis: Dendrograms are commonly used to visualize the results of hierarchical clustering. Dendrograms can be used to identify the relationships between different clusters and to understand the characteristics that define each cluster.
  3. Social network analysis: Dendrograms can be used to visualize the relationships between different individuals or groups within a social network. They can help to identify patterns and trends within the data and can be used to understand the influence or importance of different individuals or groups within the network.
  4. Data classification: Dendrograms can be used to classify data into different groups or categories. They can be used to identify patterns or trends within the data and can help to understand the relationships between different data points or groups.
  5. Data exploration: Dendrograms can be used to explore and understand complex data sets, particularly those that involve hierarchical relationships. They can help to identify patterns and trends within the data and can be used to guide further analysis.
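
For the cluster analysis and data classification uses listed above, a dendrogram is usually cut at some height to obtain flat clusters. The following sketch shows one way to do this with the fcluster() function from scipy.cluster.hierarchy, using the article's six points and complete linkage. The cut height of 3.0 is an arbitrary value chosen for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

data = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)
names = ["A", "B", "C", "D", "E", "F"]

Z = linkage(pdist(data), method="complete")

# Cut the tree at height 3.0: points whose merge height is at most 3.0
# end up in the same flat cluster.
cluster_ids = fcluster(Z, t=3.0, criterion="distance")

for name, cluster_id in zip(names, cluster_ids):
    print(name, "-> cluster", cluster_id)
```

Cutting at this height yields three clusters, namely A with B, C with D, and E with F. A higher cut merges more points together, and a lower cut splits them apart.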

Conclusion

In this article, we have discussed how to plot dendrograms in Python using different linkage methods. We also discussed the advantages, disadvantages, and applications of dendrograms. Dendrograms are used in hierarchical clustering methods such as agglomerative clustering to group data into clusters. Thus, knowing how to use dendrograms in Python can help you in clustering too.

To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
