Clustering Methods in Bioinformatics


K-means clustering


Prerequisites: None.
Level: Beginner.
Learning objectives:

  1. Explain what k-means clustering is and how the algorithm works.
  2. Choose an appropriate number of clusters (k).
  3. Evaluate the performance of a k-means clustering.

What is k-means clustering?

K-means clustering is a method of clustering data points into a predefined number of clusters (k) based on the distance between points and the mean of each cluster.

The goal is to group similar data points together in the same cluster and dissimilar data points in separate clusters. The k-means algorithm iteratively assigns each data point to the cluster with the nearest mean and then updates each cluster's mean to be the mean of all data points assigned to that cluster. This process is repeated until the cluster assignments no longer change or a maximum number of iterations is reached.
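To make this concrete, here is a minimal sketch using scikit-learn's KMeans class on a small synthetic dataset (the dataset and parameter values below are illustrative, not taken from a real analysis):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a toy dataset with three well-separated groups of points
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means with k=3; n_init controls how many random restarts are tried
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.labels_[:10])      # cluster assignments of the first 10 points
print(kmeans.cluster_centers_)  # the three cluster means (centroids)
```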

K-means clustering is widely used because it is simple and efficient, but it has some limitations.

One limitation is that the user must specify the number of clusters (k) in advance, which can be difficult if the data is not clearly separated into distinct groups.

Another limitation is that k-means clustering is sensitive to the initial placement of the cluster means, and the final clusters may differ depending on the initial configuration.

Despite these limitations, k-means clustering is often used in various applications, such as image compression, market segmentation, and anomaly detection.

The k-means algorithm

The k-means algorithm is an iterative method for clustering a dataset into k groups (clusters) based on the distance between data points and the mean of each cluster. The algorithm works as follows:

  1. Initialize the cluster means (centroids) randomly or with k-means++.
  2. Assign each data point to the closest cluster's mean.
  3. Update the cluster mean to be the mean of all data points assigned to the cluster.
  4. Repeat steps 2 and 3 until the composition of clusters does not change, or a maximum number of iterations is reached.

The algorithm begins by initializing the cluster means, either at random or with the k-means++ method, which selects initial centroids that are far apart from one another to improve the algorithm's convergence.

The k-means++ method is often preferred because it typically produces better clusters than purely random initialization.
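To make the seeding idea concrete, here is a minimal NumPy sketch of k-means++ initialization (a simplified illustration; the helper name kmeans_pp_init is ours, and library implementations add further refinements):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids: the first uniformly at random, each later
    one with probability proportional to its squared distance to the
    nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centroids)
        # Squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Sample the next centroid: far-away points are more likely picks
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```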

Once the cluster means are initialized, the algorithm assigns each data point to a cluster by calculating the distance between the point and each cluster mean using some distance measure (e.g., Euclidean distance) and assigning the point to the cluster with the closest mean.

After all data points have been assigned, the algorithm updates each cluster mean to be the mean of the data points assigned to that cluster.

The algorithm then repeats steps 2 and 3 until the cluster assignments no longer change or a maximum number of iterations is reached. When the assignments stop changing, the algorithm has converged, producing the final clusters.
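Putting the four steps together, here is a minimal NumPy sketch of the full loop, assuming Euclidean distance and random initialization (a simplified illustration; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence assignments) stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```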

The k-means algorithm is an effective method for clustering data, but, as noted above, it has two main limitations: the number of clusters (k) must be specified in advance, and the final clusters are sensitive to the initial placement of the centroids.

Choosing the number of clusters

Choosing the number of clusters is an essential step in any clustering analysis. The number of clusters you choose will depend on the nature of your data and the goals of your analysis. Let's look at a few things to consider when selecting the number of clusters:

The "elbow method"

One way to determine the number of clusters is the "elbow method": fit the model for a range of values of k (the number of clusters), compute for each fit the sum of squared distances from the data points to their closest cluster center (known as the "inertia"), and plot the inertia against k.

As you increase the number of clusters, the sum of squared distances will generally decrease.

The "elbow" in the plot is the point of inflection, where the improvement in the sum of squared distances begins to level off. The "elbow" is an excellent place to choose the number of clusters.

Domain knowledge

If you have domain knowledge about your data, you may know how many clusters are likely to be present. For example, if you are analyzing customer data and you know that your company has three distinct customer segments, you may want to choose three clusters.

Data characteristics

The characteristics of your data can also influence the number of clusters you choose. For example, if your data is well-separated and distinct, you can use fewer clusters.

On the other hand, if your data is more overlapping or muddled, you may need to use more clusters to capture the structure in the data.

The number of data points

The number of data points you have can also influence the number of clusters you choose.

If you have many data points, you can use a larger number of clusters without overfitting the model. If you have only a few data points, you may need fewer clusters to avoid overfitting.

Computing resources

The number of clusters you choose can also be influenced by your computing resources. Fitting a model with many clusters may require more time and memory than you have available.

There are several factors to consider when choosing the number of clusters, including the elbow method, domain knowledge, data characteristics, number of data points, and computing resources. The best choice for your analysis will depend on the specific context of your data and your goals.

Evaluating the performance of k-means clustering

In k-means clustering, we aim to partition a set of data points into a predefined number of clusters, such that the data points within each cluster are more similar to one another than to the points in other clusters.

There are several ways to assess the performance of a k-means clustering algorithm. One way is to use the within-cluster sum of squares (WCSS) metric, also known as inertia.

WCSS measures the sum of the squared distances between each data point in a cluster and the centroid (mean) of that cluster. Because WCSS decreases as k increases, it is not minimized directly; instead, WCSS is plotted for a range of values of k and the elbow point is chosen, as described above.
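To ground the definition, here is a small sketch that computes the WCSS by hand and checks it against scikit-learn's inertia_ attribute (synthetic data for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# WCSS: squared distance of each point to its own cluster's centroid, summed
wcss = sum(((X[km.labels_ == j] - c) ** 2).sum()
           for j, c in enumerate(km.cluster_centers_))
print(wcss, km.inertia_)  # the two values agree up to floating-point rounding
```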

Another way to evaluate the performance of k-means clustering is to use external evaluation metrics, such as the adjusted Rand index or the adjusted mutual information. These metrics compare the clusters obtained from the k-means algorithm to a ground truth set of labels, if available, and provide a measure of how well the clusters match the true labels.
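Both metrics are available in scikit-learn; here is a minimal sketch, using the generator's ground-truth labels as a stand-in for real annotations:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# make_blobs returns ground-truth group labels we can compare against
X, y_true = make_blobs(n_samples=300, centers=3, random_state=2)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X)

# Both scores are close to 1 for good agreement and near 0 for random labels
print("Adjusted Rand index:", adjusted_rand_score(y_true, y_pred))
print("Adjusted mutual information:", adjusted_mutual_info_score(y_true, y_pred))
```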

It is also common to visualize the clusters with a scatter plot, coloring the data points according to their assigned cluster; this shows how well the data has been partitioned into clusters and whether there is any overlap between them.
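Here is a minimal sketch of such a plot for two-dimensional data (higher-dimensional data would typically be projected first, e.g., with PCA, which is not shown here):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)
km = KMeans(n_clusters=3, n_init=10, random_state=3).fit(X)

# Color each point by its assigned cluster; mark the centroids with an "x"
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=15)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.show()
```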

Overall, the performance of k-means clustering can be evaluated by considering a combination of these metrics and visualization techniques.

We must keep in mind that the choice of evaluation metric may depend on the specific goals of the clustering analysis and the characteristics of the data being analyzed.

