# DNA Sequence Alignment How to

Prerequisites: Knowledge equivalent to the tutorial Pairwise sequence alignment. In addition the tutorial Introduction to information theory preferred.
Level: Intermediate.
Objectives:
1. To provide a basic understanding of various scoring schemes' effect on sequence alignments.
2. To be able to construct custom scoring schemes for DNA sequence alignments.

## DNA Sequence Alignment vs Protein Sequence Alignment

Before we dig into DNA scoring matrices, let's compare the amount of information that protein and DNA sequences can carry, because, interestingly protein and DNA sequences contain differing amounts of information. DNA sequence consists of four different bases, A, T, G, C and we can calculate the maximum amount of information each base can carry by taking the logarithm of base two of four which gives two bits.

Protein sequences consist of 20 different amino acids, and we can calculate the maximum amount of information each amino acid can carry the same way as for DNA by taking the logarithm of base two of 20 which gives ~4.3 bits.

So, it looks like protein sequences could have more information than DNA sequences, but wait, DNA codes for protein using codons consisting of three bases each; Thus, DNA uses six bits of information (3x2=6) to code one amino acid. However, each amino acid can only carry about 4.3 bits. Consequently, there is a difference of about 1.7 bits less information in protein sequences.

Where did the information get lost? Let's see, one codon of three bases could potentially make 64 different amino acids (4^3=64), but only 20 of them are in use, resulting in most of the codons being redundant, in other words, coding for same amino acids. For example, all the four codons GCT, GCC, GCA, GCG code for a single amino acid Alanine. Therefore, the codon redundancy can explain the information loss.

However, when we compare sequences, the above is true only for 100% identical sequences. Things get a little bit different when we compare less conserved sequences, and tables can even turn upside down by protein sequences ending up containing more information than DNA sequences below 50% identity (Figure 1).

Consequently, one could argue that we should compare DNA when sequences are more than 50% identical. However, the answer is not that simple, it depends on what our goals with the comparisons are, and the differences in information content from 40% to 70% identity is relatively small. Furthermore, only a small portion of many genomes code for protein; Thus, we have no choice but to align DNA sequences when protein sequences do not exist. More about this topic in the tutorial "How to select the right substitution matrix?" See the subsections "related" and "What next?"

## Alignment Scoring

Most of the time we are interested in detecting sequence homology, and therefore it is essential to choose an appropriate scoring system. Just by randomly making up match and mismatch scores as a scoring approach for DNA alignments results in alignments and scoring that is not possible to correctly evaluate.

Blastn pair-wise nucleotide sequence alignment tool at The National Center for Biotechnology (NCBI) uses the match/mismatch scores 2/-3 as default, but you can also choose from a range of 1/-2, 1/-3, 1/-4, 4/-5, and 1/-1. European Bioinformatics Institute (EBI) offers the following choices for Blastn and Megablast scoring: 1/-4, 2/-3, 2/-7, 2/-5, 1/-2, 1/-1, 5/-4, and 4/-5.

## Optimal Sequence Alignment

Why are there so many scoring schemes? How to choose the right scheme? Each scoring scheme is to optimize alignment scoring to a specific sequence similarity that we are searching. For example, the match/mismatch score 1/-4 optimizes the scoring for 100% identical sequences and 1/-1 for 75% identical sequences. The default for NCBI Blastn is 2/-3, which is optimal for 89% identical sequences, somewhat in the middle of the range of choices. So, you should choose the scoring scheme that is close to your desired sequence identity. If in doubt, only by trying several different ones may help. If you wish to find out for what percentage of sequence similarity each scoring system is optimal, take a look at this DNA alignment evaluation tool.

All the above assumes that DNA base substitutions are independent and equally probable, which is not the case most of the time in reality but seems to work well in general database searches. To improve the performance, we need to construct sequence-specific matrices. Consequently, there are many published variations of DNA scoring matrices. Specific matrices either account for particular nucleotide frequencies along sequences or define position-specific scores from iterative sequence alignments, such as PSI-Blast.

#### Customization of DNA Sequence Alignment Scoring

By customizing our scoring system, we get the best, sharp results that are relevant to our aims. There are numerous possibilities to customize scoring depending on what our goals are. In sequence fragment assembly, we would like to find 100% identical sequences and adjust scoring to reflect sequencing errors specific to different sequencing technologies and accounting for quality values for each base if available.

If we are interested in biological relatedness, whether it is homology, a specific function or any other kind of relatedness, the best way to find these frequencies is to consult the Nature itself by analyzing multiple sequences instead of making assumptions. Base frequencies vary from species to species and from functional site to another and even within stretches of functional sites. Further, the match frequency is crucially dependent on the divergence, evolutionary distance of the sequences we compare. By optimizing the scoring, we can successfully compare genes, non-coding sequences, such as micro RNAs and any particular part of interest.

##### Make a Hypothesis

An excellent way to assess sequence alignment is to make a hypothesis and then test it. The hypothesis has two parts, a probability that DNA bases match and a probability that they don't match, given a specific evolutionary distance and base frequencies of sequences we are comparing.

Now that we have the hypothesis, we need to test it against something. Commonly this something is the second hypothesis consisting of a probability that bases match by random change. However, we could also attempt to make a distinction between two sequence groups, in which case we would test against the base frequency of the first group against the base frequency of the second group instead of random match frequencies.

##### Calculate Log Odds

A good way to compare these two hypotheses is to calculate odds, just by dividing a match probability or mismatch probability by a probability of the second hypothesis, commonly a probability that bases match by random chance. If we do this calculation for each base pair in a sequence alignment and sum the odds, we get the total alignment score, reflecting the odds that the two sequences are either evolutionarily related or some other way connected depending on how we set our hypothesis (Equations 1.a and 1.b). Note that the odds score here is a ratio between two probabilities, and therefore it can be higher than one as opposed to a probability, which must be less than or equal to one. Also, both odds and probability must be greater or equal to zero.

$Odds \: for \: match = \frac{P_{Match}}{P_{Random}} \: \: (Eq. 1.a)$ $Odds \: for \: mismatch = \frac{P_{Mismatch}}{P_{Random}} \: \: (Eq. 1.b)$

Consequently, the odds gives us some number that is zero or any number that is greater than zero. Odds less than one means that the alignment does not fulfill our hypothesis, odds larger than one meaning the alignment fulfills our hypothesis, e.g., sequences are related, and lastly odds precisely one means that there's no difference between the two hypothesis; Thus, we cannot make a judgment because both hypotheses are equally probable.

To make the odds scores easily tractable and quick to add, we can take a logarithm of the odds ratio. This way, negative scores mean non-relatedness, positive scores relatedness, and zero equal probability (Equations 2.a and 2.b).

$Log \: odds \: for$ $Match = \log_{2}\left( \frac{P_{Match}}{P_{Random}} \right) \: \: (Eq. 2.a)$ $Mismatch = \log_{2} \left( \frac{P_{Mismatch}}{P_{Random}} \right) \: \: (Eq. 2.b)$
##### Optimizing Scoring for a Specific Similarity Percentage

To optimize scoring for a specific sequence similarity percentage, in other words for a specific target frequency, we must first calculate the match probabilities and mismatch probabilities. We do this purely by statistical means with no biological implications. Let's say we want to optimize for 75% sequence similarity then the total match probability is 0.75 and for each of the four bases, it is 0.75/4 = 0.1875. Consequently, leaving us with the total mismatch probability of 1 - 0.75 = 0.25. So, what is the probability of each mismatch? It boils down to knowing how many different ways a mismatch can occur. There are four different bases in DNA, and therefore we can mismatch each one of the bases in three different ways. From this follows that there are in total 4 x 3 = 12 ways bases can mismatch. The total mismatch probability space is 0.25, so the probability of each mismatch is 0.25 / 12 = ~0.0208.

We still need the null hypothesis, i.e., a hypothesis to which to compare our match and mismatch hypotheses. In this example we assume an equal random probability for each base, then the probability of a random base is 0.25. Although, it would be better to use a specific distribution relevant to the sequences we are comparing.

All the above results in our hypothesis of a probability of a base pair matching to be 0.1875, a pair mismatching 0.0208, and the random pair match 0.25 x 0.25 = 0.0625. Inserting these into the equation, we get a match score 0.1875 / 0.0625 = ~3.0242 and a mismatch score of 0.0208 / 0.0625 = 0.3328 (Equation 2.a and 2.b).

By taking a logarithm base two of the scores, we get the match score ~1.6 and the mismatch score ~-1.6. Since the base of the logarithm is two, the scores are in bits. The expression of alignment scores in bits becomes handy when we need to know how much information is required to distinguish an alignment from a random one uniquely when searching a sequence database or just a set of sequences. If we were searching Genbank, which in August 2018 contains 260,806,936,411 bases, we would need $$log_{2}(260806936411) \approx 38$$ bits of information at least. Meaning that if we use our example scoring yielding a match score of 1.6 bits, an identical sequence alignment must be at least 38/1.6 = ~24 bases long or the total alignment score must be 38 bits or more. By lowering the target frequency further down from 75%, we need increasingly long alignments to reach a minimum required amount of information.

Often people want to scale the scores to some number they like, since scaling the scores do not affect the rank of alignments. By multiplying both match and mismatch scores by the same number, the alignment ranking remains the same, e.g., the best alignment is still the best, and the worst is still the worst. Also, if we first take, e.g., the natural logarithm of the ratio and then divide by the natural logarithm of two, we change the base to be logarithm base two. The equation below combines these effects, and variations of it are commonly in use (Equation 3).

$S_{ij} = \frac{1}{\lambda} log \left( \frac{P_{ij}}{f_{i}f_{j}} \right) / log(2) \: (Eq. 3)$

Where $$S_{ij}$$ is the score in bits, $$\lambda$$ is the scaling factor, $$P_{ij}$$ is the probability of a match or a mismatch, and $$f_{i}f_{j}$$ is the probability of a random match.

The easy way to design a simple scoring is to use our online tool to create a DNA matrix. .

#### What next?

How to select the right substitution matrix?

Tutorials Main Page

## References

Burkhard Rost, "Twilight zone of protein sequence alignments." Protein Engineering vol.12 no.2 pp.85–94, 1999.

Ying Chen, Weicai Ye, Yongdong Zhang1, and Yuesheng Xu, "High speed BLASTN: an accelerated MegaBLAST search tool." 7762–7768 Nucleic Acids Research, 2015, Vol. 43, No. 16 Published online 6 August 2015 doi: 10.1093/nar/gkv784.

Zheng Z, Schwartz S, Wagner L et al., "A greedy algorithm for aligning DNA sequence." Journal of Computational Biology, 2000, 7(12, 1/2): 203–214.

Guang-Ming Tan, Lin Xu, Dong-Bo Bu, Sheng-Zhong Feng, Ning-Hui Sun, "Improvement of Performance of MegaBlast Algorithm for DNA Sequence Alignment." J Comput Sci Technol (2006) 21: 973. https://doi.org/10.1007/s11390-006-0973-0

Altschul, Stephen; Gish, Warren; Miller, Webb; Myers, Eugene; Lipman, David (1990). "Basic local alignment search tool." Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712.

Karlin S, Altschul SF. "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc Natl Acad Sci U S A. 1990 Mar;87(6):2264-8. PubMed

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. PubMed