Hierarchical clustering

Hierarchical clustering is an unsupervised machine learning technique for grouping similar data points into clusters. Unlike other clustering techniques such as k-means, hierarchical clustering builds a tree-like structure of clusters, where each node of the tree represents a cluster and its children are subclusters. The most widely used approach is agglomerative clustering, which starts with each point as its own single-point cluster and then repeatedly merges the closest pair of clusters until all points belong to a single cluster.

There are two main types of hierarchical clustering:

Agglomerative Clustering: This approach starts with each point as its own cluster; at each step, the two closest clusters are merged until a single cluster remains. The result is a tree-like structure called a dendrogram, which shows the hierarchical relationships between the clusters. The user can then decide where to cut the dendrogram to determine the final number of clusters (see the sketch after this list).

Divisive Clustering: This approach starts with all points in a single cluster and, at each step, splits a cluster into two until every cluster contains only one point. Divisive clustering is less common than agglomerative clustering.
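
scikit-learn does not provide a dendrogram plot out of the box, but SciPy's scipy.cluster.hierarchy module does. The following is a minimal sketch of building a dendrogram and cutting it to obtain flat cluster labels; the sample data and the cut distance of 15 are arbitrary choices for illustration:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Small random dataset (illustrative only)
points = np.random.uniform(low=-10, high=10, size=(30, 2))

# Build the linkage matrix: each row records one merge step of
# agglomerative clustering (here using Ward's criterion)
Z = linkage(points, method='ward')

# Plot the dendrogram to inspect the merge hierarchy
dendrogram(Z)
plt.show()

# "Cut" the tree at a chosen distance to obtain flat cluster labels;
# the threshold value is arbitrary and depends on the data
labels = fcluster(Z, t=15, criterion='distance')
print(labels)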

One advantage of hierarchical clustering is that it does not require the number of clusters to be specified in advance, and with an appropriate linkage criterion it can discover clusters of arbitrary shapes. The dendrogram is also useful for visualizing the hierarchical relationships among the data points.
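
In scikit-learn, one way to take advantage of this is the distance_threshold parameter of AgglomerativeClustering: rather than fixing the number of clusters, you can stop merging once clusters are farther apart than a chosen distance. A minimal sketch, where the threshold value is an arbitrary assumption:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.random.uniform(low=-10, high=10, size=(100, 2))

# Instead of fixing n_clusters, stop merging once clusters are farther
# apart than the threshold; the value 10.0 is an illustrative assumption
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=10.0
).fit(points)

print("Number of clusters found:", clustering.n_clusters_)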

However, hierarchical clustering has some disadvantages. It is computationally expensive, since standard agglomerative implementations scale poorly with the number of points, and it is sensitive to the scale of the features. Also, the output of the algorithm is a tree-like structure, which may not be as easy to interpret as a flat clustering solution such as the one produced by k-means.
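
The sensitivity to scale, in particular, is usually addressed by standardizing the features before clustering. A minimal sketch using scikit-learn's StandardScaler, with made-up feature scales for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Two features on very different scales; without scaling, the second
# feature would dominate every distance computation
data = np.column_stack([
    np.random.uniform(0, 1, size=100),
    np.random.uniform(0, 1000, size=100),
])

# Standardize each feature to zero mean and unit variance, then cluster
scaled = StandardScaler().fit_transform(data)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(scaled)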

Here's an example of how to use the AgglomerativeClustering class from the scikit-learn library to perform hierarchical clustering on a dataset of points in 2D space:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Generate a dataset of points
num_points = 100
points = np.random.uniform(low=-10, high=10, size=(num_points, 2))

# Perform hierarchical clustering
clustering = AgglomerativeClustering(n_clusters=3).fit(points)

# Plot the points colored by cluster
plt.scatter(points[:, 0], points[:, 1], c=clustering.labels_)
plt.show()

This example generates a dataset of 100 points in 2D space and applies AgglomerativeClustering to group them into 3 clusters. The fitted estimator stores the cluster label of each point in its labels_ attribute, which is then used to color the points in the scatter plot.

The main parameters of AgglomerativeClustering are the number of clusters (n_clusters) and the linkage criterion, which determines how the distance between two clusters is measured when deciding which pair to merge next. The available linkage criteria are 'ward', 'complete', 'average', and 'single'.

The 'ward' criterion merges the pair of clusters whose merger causes the smallest increase in total within-cluster variance. The 'complete' criterion measures the distance between two clusters as the maximum distance between their points, 'average' uses the average distance between their points, and 'single' uses the minimum distance.

In this example, the linkage criterion is not specified, so it defaults to 'ward'. The resulting plot shows the points colored by cluster, with each color representing a different cluster.
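
To get a feel for how the linkage choice affects the result, you can fit the same data with each criterion and compare the outcomes. A short sketch (the data are random, so the cluster sizes will vary from run to run):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.random.uniform(low=-10, high=10, size=(100, 2))

# Fit the same data with each linkage criterion and compare cluster sizes
for linkage in ('ward', 'complete', 'average', 'single'):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(points)
    print(linkage, np.bincount(labels))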