Multiple Sequence Alignment (MSA) Methods
What Is Multiple Sequence Alignment?
In bioinformatics, multiple sequence alignment means arranging more than two biological sequences (DNA, RNA, or protein) on top of each other. The goal is to place the sequences so that as many identical or related symbols as possible line up in the same columns, either with the minimum possible number of gaps or with gaps placed according to a specific algorithm.
You can learn the basics of sequence alignment in our tutorial Introduction to sequence comparison. Find out why biological sequences are similar in the tutorial Pair-wise sequence alignment, and see the in-depth tutorial Pair-wise sequence alignment methods, which covers the global Needleman-Wunsch and the local Smith-Waterman dynamic programming algorithms.
The general definition above has its roots in computer science and is accurate only in specific cases, such as whole-genome shotgun (WGS) sequence assembly, where we want all identical sequences stacked together regardless of sequencing errors.
However, from a biological point of view, the definition of an optimal multiple sequence alignment is not that simple. For example, we could aim to align amino acids or nucleotides that originate from a common ancestor. But this is not a good foundation for an operational definition, since homology refers to unknown historical events; thus, it is not possible to construct a mathematical function for an algorithm to optimize.
Contrary to common belief, the multiple sequence alignment problem is not yet solved in all respects. For example, given the same set of sequences, different programs produce different alignments. This tutorial covers the main algorithmic methods, and their variations, that have been developed to solve the multiple sequence alignment problem.
Multiple sequence alignment methods vary according to the purpose
Multiple sequence alignment (MSA) is an essential and well-studied problem in bioinformatics. MSA is also often a bottleneck in analysis pipelines. Hence, developing fast and efficient algorithms that produce the correct output for each alignment purpose is of utmost concern. Importantly, no general alignment algorithm exists that suits every purpose. Consequently, to get correct results, it is vital that you choose an alignment algorithm that is well suited to your specific goal.
We can divide the purposes into four categories: 1. General sequence comparison, including assessment of sequence quality; 2. Structure prediction; 3. Phylogenetic analysis; 4. Database searching.
1. General sequence comparison
In this purpose category, we include all multiple sequence alignments in which homology is not relevant, i.e., alignments of sequences that have no evolutionary pressure on them. Examples are WGS sequence assembly and sequence quality assessment.
For this purpose category, we should not use scoring matrices that are based on some evolutionary model or computed on homologous sequences.
In WGS sequence assembly, we want to align reads that originate from the same genomic location and are therefore identical except for sequencing errors.
The main problem is to distinguish sequencing errors from genuine discrepancies among almost identical repeated sequences, so that the different repeat copies are kept separate. Therefore, we can construct an objective function that, for example, minimizes the number of mismatches, insertions, and deletions.
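As an illustration of such an objective function, here is a minimal sketch that scores an already-built alignment by counting differences over all pairs of rows. The function names (`pair_differences`, `alignment_cost`), the unit cost per difference, and the example reads are assumptions made for this sketch; they do not come from any particular assembler.

```python
from itertools import combinations

GAP = "-"


def pair_differences(a: str, b: str) -> int:
    """Count mismatches and indels between two aligned rows of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    diffs = 0
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # column is gapped in both rows: no difference
        if x != y:
            diffs += 1  # mismatch, or insertion/deletion (one side is a gap)
    return diffs


def alignment_cost(rows: list[str]) -> int:
    """Objective to minimize: sum of differences over all pairs of rows."""
    return sum(pair_differences(a, b) for a, b in combinations(rows, 2))


# Hypothetical reads from the same genomic location, already aligned.
reads = ["ACGTACGT", "ACGTTCGT", "ACG-ACGT"]
print(alignment_cost(reads))
```

A lower cost means the rows agree in more columns. Unit costs are the simplest choice here because no evolutionary model applies to sequencing errors.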
One algorithmic choice to construct a multiple sequence alignment could be to add sequences into the growing set in a progressive manner and then optimize using an iterative method.
We can expect the iterative method to converge without exception when the sequences are almost identical.
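The progressive-then-iterative idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any specific tool: each new sequence is globally aligned (unit costs) to the consensus string of the growing alignment, and the refinement step removes one row at a time, re-aligns it, and keeps the result only if the number of pairwise differences drops. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

GAP = "-"


def sp_cost(rows):
    """Sum over all row pairs of columns that differ (double gaps are free)."""
    return sum(x != y for a, b in combinations(rows, 2) for x, y in zip(a, b))


def consensus(rows):
    """Most common symbol per column (gaps take part in the vote)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def align_to_profile(rows, seq):
    """Globally align seq to the consensus of rows; return rows plus seq."""
    ref = consensus(rows)
    n, m = len(ref), len(seq)
    # cost[i][j]: minimal number of edits aligning ref[:i] with seq[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (ref[i - 1] != seq[j - 1]),
                cost[i - 1][j] + 1,
                cost[i][j - 1] + 1,
            )
    # Trace back, building the new alignment columns from right to left.
    old_cols = list(zip(*rows))
    new_cols = []
    i, j = n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0 and (
            cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != seq[j - 1])
        )
        if diag:
            new_cols.append(old_cols[i - 1] + (seq[j - 1],))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            new_cols.append(old_cols[i - 1] + (GAP,))  # gap in the new sequence
            i -= 1
        else:
            new_cols.append((GAP,) * len(rows) + (seq[j - 1],))  # new gap column
            j -= 1
    new_cols.reverse()
    return ["".join(r) for r in zip(*new_cols)]


def progressive_align(seqs):
    """Add sequences one by one to the growing alignment."""
    rows = [seqs[0]]
    for s in seqs[1:]:
        rows = align_to_profile(rows, s)
    return rows


def drop_empty_columns(rows):
    """Remove columns that contain only gaps."""
    cols = [c for c in zip(*rows) if any(x != GAP for x in c)]
    return ["".join(r) for r in zip(*cols)]


def refine(rows, max_rounds=10):
    """Re-align one row at a time; accept only strict cost improvements."""
    best = sp_cost(rows)
    for _ in range(max_rounds):
        improved = False
        for k in range(len(rows)):
            rest = drop_empty_columns(rows[:k] + rows[k + 1:])
            candidate = align_to_profile(rest, rows[k].replace(GAP, ""))
            # The re-aligned row comes back last; move it to position k.
            candidate = candidate[:k] + [candidate[-1]] + candidate[k:-1]
            c = sp_cost(candidate)
            if c < best:
                rows, best, improved = candidate, c, True
        if not improved:
            break  # converged: no single-row re-alignment helps
    return rows


reads = ["ACGTACGT", "ACGTTCGT", "ACGACGT"]  # hypothetical example reads
for row in refine(progressive_align(reads)):
    print(row)
```

Because a candidate is accepted only when the cost strictly decreases, the loop cannot cycle. For almost identical sequences it typically stops after one or two rounds.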
Before the sequence assembly step itself, we usually want to correct sequencing errors. Several error correction methods exist that use a statistic computed on a multiple sequence alignment to distinguish sequencing errors from differences due to repeated genomic stretches. As in the WGS sequence assembly step, we can use an objective function that minimizes the number of discrepancies in the alignments.
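To make the idea concrete, here is a minimal sketch of one very simple statistic: column-wise support counting. A symbol seen in fewer than `min_support` reads in a column is treated as a sequencing error and replaced by the column majority, while a minority symbol with enough support is left alone because it may come from another repeat copy. The function name, the threshold, and the example reads are assumptions for illustration; real error correctors use more elaborate statistics. Gaps are treated as ordinary symbols here.

```python
from collections import Counter


def correct_reads(rows, min_support=2):
    """Replace poorly supported symbols by the majority symbol of their column."""
    corrected = [list(r) for r in rows]
    for c, col in enumerate(zip(*rows)):
        counts = Counter(col)
        majority = counts.most_common(1)[0][0]
        for r, symbol in enumerate(col):
            # Low support and not the majority: assume a sequencing error.
            if symbol != majority and counts[symbol] < min_support:
                corrected[r][c] = majority
    return ["".join(r) for r in corrected]


# Hypothetical aligned reads: row 3 has a lone 'A' in column 4 (likely an
# error); rows 4 and 5 share a 'C' in column 7 (possibly another repeat copy).
aligned = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "ACGTACCT", "ACGTACCT"]
for row in correct_reads(aligned):
    print(row)
```

With the default threshold, the lone `A` is corrected to `T`, while the `C` shared by two reads is kept.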
In summary, when sequences have no relevance to homology, we can construct multiple sequence alignments using a simple objective function that minimizes the number of differences.
We explore the algorithmic methods in detail below.
2. Structure prediction
3. Phylogenetic analysis
Homology, analogy, similarity
4. Database searching