Machine Learning (ML) in Bioinformatics


Support Vector Machines (SVMs)


Prerequisites: None.
Level: Beginner.
Learning objectives:

Understand what support vector machines are and how the maximum-margin hyperplane is defined.
Distinguish between linear and nonlinear SVMs and know when each is appropriate.
Choose an appropriate kernel function for a nonlinear SVM.
Evaluate the performance of an SVM classifier with standard metrics.

What Are Support Vector Machines (SVMs)?

Support Vector Machines (SVMs) are a type of supervised machine learning algorithm that can be used for both classification and regression tasks. They are based on finding the hyperplane in an N-dimensional space (a line in two dimensions, a plane in three) that maximally separates the points of one class from those of the others.

SVMs remain effective when the number of features (dimensions) exceeds the number of samples, and they can tolerate a moderate amount of noise in the data.

To find the hyperplane that maximally separates the two classes, the SVM optimization algorithm searches for the hyperplane with the greatest distance from the nearest points of each class.

These nearest points are called the "support vectors." We call the distance from the hyperplane to the closest support vector the "margin." Our goal is to choose the hyperplane with the largest possible margin.

SVMs can be kernelized to handle nonlinear classification tasks: a kernel function implicitly maps the data into a higher-dimensional space in which linear separation becomes possible, without explicitly computing the coordinates in that space.

Some commonly used kernel functions include the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel.

SVMs have several advantages. They are effective in high-dimensional spaces, memory efficient, and versatile because different kernel functions can be used.

However, they are prone to overfitting when the feature count is much greater than the number of samples (making regularization important), and they can be sensitive to the choice of kernel and hyperparameters.
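
As a concrete illustration, here is a minimal sketch of training an SVM classifier with scikit-learn (assuming scikit-learn is available); the dataset and parameter values are illustrative only, not recommendations.

```python
# A minimal SVM classification sketch with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A small, high-dimensional biomedical dataset (30 features).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale features before fitting; SVMs are sensitive to feature scale.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```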

Linear SVMs

Linear Support Vector Machines (SVMs) are a type of SVM used for classification tasks. They are called "linear" because they separate the classes with a flat hyperplane in the original N-dimensional feature space, chosen so that it maximally separates the points of one class from those of another. The hyperplane itself has one dimension less than the original space.

The SVM finds the hyperplane that maximally separates the two classes using an optimization algorithm that searches for the hyperplane with the greatest distance from the nearest points of each class; these nearest points are called the "support vectors."

The distance from the hyperplane to the nearest support vector is known as the "margin." The aim is to choose the hyperplane with the largest possible margin.

Linear SVMs can be used when the data is linearly separable, meaning it is possible to draw a straight line (or a hyperplane in higher dimensions) that separates the two classes. When the data is not linearly separable, it may be necessary to use a nonlinear SVM, which transforms the data with a kernel function.

Linear SVMs have several advantages. They are efficient in training and predicting, especially for high-dimensional data, and they do not require tuning many hyperparameters. However, they may not perform as well as nonlinear SVMs in cases where the data is not linearly separable or when there is high noise.
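
Below is a minimal sketch of a linear SVM on a small, made-up two-dimensional dataset (assuming scikit-learn and NumPy are available). It shows that the fitted hyperplane is described by a weight vector w and intercept b, and that only a few support vectors determine it.

```python
# A minimal linear SVM sketch on two linearly separable clusters.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
               rng.normal(loc=+2.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The separating hyperplane is w . x + b = 0; only the support vectors
# (the training points closest to the hyperplane) determine it.
print("w:", clf.coef_[0], "b:", clf.intercept_[0])
print("support vectors per class:", clf.n_support_)
```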

Nonlinear SVMs

The standard SVM algorithm is a linear classifier, meaning it can only classify data that is linearly separable, i.e., data that can be separated by a single straight line (or, in higher dimensions, by a flat hyperplane).

However, using a kernel trick, we can transform the data into a higher-dimensional space in which it becomes linearly separable, allowing us to use the SVM algorithm to classify nonlinearly separable data. These modified SVMs are called "nonlinear SVMs."

To understand how nonlinear SVMs work, it is helpful to first know how the standard SVM algorithm works.

Given a set of labeled training data, the SVM algorithm finds the hyperplane in the feature space that maximally separates the classes. In other words, it finds the hyperplane that is as far as possible from the nearest data points of each class. This hyperplane is called the "maximum margin hyperplane."

Now, suppose we have a dataset that is not linearly separable. One way to handle this is to transform the data into a higher-dimensional space using a function called a "kernel." A higher-dimensional space often allows us to find a hyperplane that can separate the classes.
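
The following sketch illustrates this idea with an explicit feature map, written out by hand purely for illustration (in practice the kernel trick avoids computing it): data arranged in concentric circles is not linearly separable in two dimensions, but becomes separable after mapping to three. The mapping phi below corresponds to a degree-2 polynomial kernel.

```python
# Explicitly mapping 2D circular data into 3D, where it becomes separable.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

def phi(X):
    """Map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# A linear SVM fails in the original space but succeeds in the mapped space.
print("linear, original space:", LinearSVC(max_iter=10000).fit(X, y).score(X, y))
print("linear, mapped space:  ", LinearSVC(max_iter=10000).fit(phi(X), y).score(X, y))
```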

The most commonly used kernel functions are the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel.

Once the kernel function has been chosen and the data has been transformed into the higher-dimensional space, the SVM algorithm can be applied as usual to find the maximum margin hyperplane. The classifier can then predict the class of a new data point by checking which side of the hyperplane it falls on.

Nonlinear SVMs have several advantages. First, they can classify data that is not linearly separable, which is impossible with the standard SVM algorithm. Second, they often achieve higher accuracy than linear SVMs on complex tasks.

However, nonlinear SVMs can be more computationally expensive to train and may be more prone to overfitting if the kernel function is not carefully chosen.
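
A minimal sketch of the difference in practice, assuming scikit-learn is available: on a "two moons" dataset that no straight line can separate, a linear kernel struggles while an RBF kernel fits the curved boundary. Parameter values are illustrative only.

```python
# Comparing a linear and an RBF kernel on nonlinearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))
```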

Choosing the right kernel for nonlinear SVMs

When using a nonlinear SVM, choosing the correct kernel function is essential. The kernel function is used to transform the data into a higher-dimensional space in which it becomes linearly separable, so the choice of the kernel can have a significant impact on the performance of the classifier.

Three standard kernel functions are used with nonlinear SVMs: the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel.

1. Polynomial kernel: The polynomial kernel is defined as:

\[K(x, y) = (x \cdot y + c)^d\]

where x and y are the input vectors, c is a constant, and d is the polynomial degree.

The polynomial kernel can be used for classification tasks where the data is not linearly separable. It works well when there is some nonlinear structure in the data that low-degree polynomials can capture.

However, it can be computationally expensive, especially for high-degree polynomials, and it may be prone to overfitting if the degree is too high.
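
As a hedged illustration, a low-degree polynomial kernel is often enough when the nonlinear structure is roughly quadratic, as with concentric circles. In scikit-learn the constant c in the formula above corresponds to the coef0 parameter; the values below are illustrative only.

```python
# A degree-2 polynomial-kernel SVM on concentric circles.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# degree=2 captures the quadratic structure; coef0 plays the role of c.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```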

2. Radial basis function (RBF) kernel: The RBF kernel is defined as:

\[K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)\]

where x and y are the input vectors, gamma is a parameter, and ||x-y|| is the Euclidean distance between x and y.

The RBF kernel is a popular choice for nonlinear SVMs because it is relatively simple to compute and works well for various classification tasks. It is particularly effective when the data is highly nonlinear, and there is a clear boundary between the classes.

However, the RBF kernel can be sensitive to the choice of gamma, and if gamma is set too high, it may lead to overfitting.
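
The sketch below illustrates this sensitivity, assuming scikit-learn is available: a very large gamma fits the training set almost perfectly but generalizes worse than a moderate gamma. The gamma values are illustrative only.

```python
# How gamma affects an RBF SVM: large gamma tends to memorize the training set.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma}: train={clf.score(X_train, y_train):.2f}, "
          f"test={clf.score(X_test, y_test):.2f}")
```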

3. Sigmoid kernel: The sigmoid kernel is defined as:

\[K(x, y) = \tanh(\gamma \, x^\top y + c)\]

where x and y are the input vectors, and gamma and c are parameters.

The sigmoid kernel is similar to the RBF kernel but is less commonly used because it is less effective for most classification tasks. It can be sensitive to the choice of parameters and may be prone to overfitting if the parameters are not chosen carefully.

In summary, choosing the correct kernel for nonlinear SVMs is vital because it can significantly impact the classifier's performance. The polynomial kernel is suitable for tasks with low-degree nonlinear structure, the RBF kernel is a good general-purpose choice for highly nonlinear tasks, and the sigmoid kernel is less commonly used due to its lower effectiveness.
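
One practical way to choose among kernels is to compare them with cross-validation, as in the sketch below (assuming scikit-learn is available); in a real analysis the kernel and its parameters would usually be tuned together, for example with a grid search.

```python
# Comparing kernels with 5-fold cross-validation on a biomedical dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:7s} mean CV accuracy: {scores.mean():.3f}")
```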

Evaluating the performance of SVMs

It is vital to assess the performance of an SVM to determine how well it classifies the training data and how well it generalizes to new, unseen data.

We can evaluate the performance of an SVM classifier in many ways, including:

Accuracy:
This is the most common evaluation metric and refers to the fraction of correctly classified instances out of the total number of instances. It can be calculated as (number of correctly classified instances) / (total number of instances).
Precision:
This metric measures the fraction of instances correctly classified as a positive class (e.g., the class of interest) over the total number of instances predicted as belonging to the positive class. It can be calculated as (number of true positive instances) / (number of true positive instances + number of false positive instances).
Recall:
This metric measures the fraction of instances correctly classified as belonging to the positive class over the total number of instances that belong to the positive class. It can be calculated as (number of true positive instances) / (number of true positive instances + number of false negative instances).
F1 score:
This is the harmonic mean of precision and recall, calculated as (2 * precision * recall) / (precision + recall).
Confusion matrix:
This table shows the number of correctly and incorrectly classified instances for each class. It can be used to calculate the above evaluation metrics and to understand the types of errors that the SVM classifier makes (e.g., false negatives, false positives).
ROC curve:
This plot shows the relationship between the true positive rate (recall) and the false positive rate (1 - specificity) at different classification thresholds. The area under the ROC curve (AUC) measures the classifier's overall performance, with a value of 1.0 indicating a perfect classifier and a value of 0.5 indicating a classifier that is no better than random guessing.
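
The sketch below shows how these metrics can be computed with scikit-learn for a fitted SVM classifier; the dataset and model settings are illustrative only.

```python
# Computing accuracy, precision, recall, F1, confusion matrix, and ROC AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True enables predict_proba, which the ROC AUC needs.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```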

In addition to these evaluation metrics, it is also essential to consider other factors, such as the computational efficiency of the SVM classifier and its ability to handle large datasets.


References and further reading