# How to select the right substitution matrix?

**Prerequisites:**Knowledge equivalent to the tutorial Construction of substitution matrices, and DNA scoring matrices preferred.

**Level:**Intermediate.

**Objectives:**

Basic understanding of pros and cons of scoring matrix selection, how all scoring schemes have implicit or implied optimal target similarity, and understanding effects of choosing a 'wrong' scoring matrix.

#### Introduction

A comparison of protein and DNA sequences from living organisms reveals a varied number of differences between sequences and they can exist because, despite differences, they still provide a viable function or in some cases in DNA do not result in a coding of a different amino acid. The selection of match, substitution, insertion and deletion scores affect the resulting alignments and the sensitivity of database searches of both DNA and protein sequences.

To design a correct scoring scheme requires prior knowledge of the probabilities of each type of match, changes, and the underlying frequency of occurrence of each residue. We can quickly approximate the frequencies of occurrence by counting the rate of each residue in large databases, but to estimate the other probabilities is a more complicated process and no perfect scoring scheme exists. Each one of them has their specific limitations.

A single pair of sequences does not contain enough information to allow us to determine a scoring scheme; Therefore, we need to compare multiple sequences, but unfortunately, construction of multiple sequence alignments is computationally a hard problem which is not possible to solve optimally. In an attempt to reduce errors the estimation of the frequencies are based on closely related sequences and alignments with no insertions or deletions; Thus, resulting in scoring systems without gap scores.

The typical general purpose scoring matrices for protein sequence alignment are BLOSUM, PAM, variable time (VT), variable time maximum likelihood (VTML), and more recently published PFASUM matrices. See available matrices and search tools at Genbank and European Molecular Biology Laboratory (EMBL) or DNA Databank of Japan (DDBJ). The selection of a scoring matrix depends on our goal whether we are using them to search a database or to align known sequences and wish to maximize the alignment accuracy. In database searches, the primary concern is to find matches that are statistically significant and thus discriminating matches from chance. Once, we have identified a correct sequence family we should make a custom scoring matrix using the information available in multiple sequences in that family instead of a general one to fine tune alignments or search increasingly distant homologs.

PSI-BLAST does just that by iteratively aligning and adjusting scores, which are position specific. However, the algorithm is sensitive for each sequence inclusion and thus one must be careful of not including incorrect sequences that may result in a blend of families. Besides, the resulting scoring is dependent on the order of inclusion of sequences to the alignments. More about this in the separate PSI-BLAST tutorial. Please, see the subsection related below.

In summary, the resulting alignments are dependent on the algorithm global or local alignment, the scoring scheme, the evolutionary distance of the aligned sequences, and the gap penalty scheme.

#### The effects of evolutionary distance of sequences

With increasing evolutionary distance the available information decreases and consequently increasingly long sequence alignments are required to collect enough information so that the alignments are distinguishable from random alignments. For example, PAM2 matrix models alignments with 2% differences and has the entropy of 3.97 bits and PAM250 modeling alignments with 250% change (80% observed difference) has the entropy of only 0.354 bits. Let's say that the minimum required amount of information is 50 bits to distinguish an alignment from chance, then the closely related minimum sequence alignment length is 50/3.97 = about 13 residues, whereas the more distantly related sequences require 50/0.354 = about 142 residues to be significant. Note that the calculation assumes alignments with no gaps and that the length of similar blocks between sequences tends to decrease with increasing divergence. With gaps in the alignments, the minimum lengths would increase. Consequently, to find significant distant homologs is a more challenging task than finding closely related sequences. See the software tools "Make PAM matrices" and "Create DNA matrices" to explore more entropies.

#### Scoring scheme v.s. evolutionary distance

Every scoring scheme is either based on some overall percentage of similarity of sequences or implies a similarity percentage. With increasing evolutionary distance or divergence the frequency of matching residues decreases and vice versa. In general, aligning 100% identical sequences the mismatch penalty would be very high since we don't expect any mismatches and in contrast alignment of distantly related sequences the mismatch scores would be lower and match scores higher.

For example, in DNA alignment +1/-1 match/mismatch scores imply 75% identical sequences and +1/-3 entails 99% identical sequences. BLOSUM62 for sequences with up to 62% identity gives Tryptophan-Tryptophan (W-W) match a score of 11 and BLOSUM45 for sequences with up to 45% identity gives a score of 15 for the same W-W match. A mismatch score in BLOSUM62 for Isoleucine-Tryptophan (I-W) is -3, and in BLOSUM45 it is -2.

For Leucine-Leucine (L-L) matches, BLOSUM62 gives the score +4 and BLOSUM45 the score +5. The reason why L-L matches score lower than W-W matches is that Leucine is more abundant than Tryptophan; Consequently, the chance of randomly getting a W-W pairing is lower than getting an L-L pairing. Remember that the scores are log-odds: \[ Score = log_{2}\frac{Probability\: in\: homologs}{Probability\: by\: chance} \] By decreasing the random chance probability, the score increases.

Furthermore, when the base of the logarithm is two, the scores are in bits. In general, as also is the case in BLOSUM matrices, the scores are further scaled to represent multiples of half bits, i.e., divided by ln(2)/2 ~ 0.347. We can calculate that in BLOSUM45 the W-W pairing in homologous sequences is about 56 times more likely than by chance \( (15/2)^{2} \). In comparison, the L-L score is +5 and thus only \( (5/2)^{2} \) about six times more likely in homologs than by chance.

The examples are just a generalization of the concept. However, any scoring scheme whether based on sets of real sequences or theoretical reasoning imply a specific target frequency, i.e., the expected frequency of matches and mismatches in an alignment.

So what is the effect of using a scoring matrix optimized for, e.g., 75% identical sequences, but the sequences we align are more or less divergent than that? Theoretically, the efficiency of scoring matrices decreases with increasing distance from the optimal similarity, i.e., their target frequency. Figure 2 shows DNA matrices optimized for 95%, 75%, 60%, and 45% similarities and their efficiency with real sequence similarity deviating from optimal ones. The greyed area shows the range of 90% efficiency and the histogram the minimum required length of the alignment to achieve statistical significance compared to the optimal one, which is 1. We can observe that by deviating from the target frequency, the minimum alignment length to attain statistical significance increases, which is an important observation. Regardless of which substitution matrix we use, modern search tools efficiently prevent us getting false matches by employing efficient statistics, on the other hand, we may miss relevant matches to short well-conserved domains.

Scoring matrices for protein sequences behave the same way as DNA scoring matrices. Their efficiency also decreases with increasing divergence from the optimal target frequency. Figure 3 shows the efficacy of a selection of five PAM matrices with an underlying frequency deviating from their target frequency. The efficiency is one when the target frequency of a PAM matrix matches the actual distance given in PAMs.

Figure 4 shows a selection of BLOSUM matrix efficiencies with various real divergences. As with DNA and PAM matrices, the maximum efficiency is one when the target frequency matches the actual underlying frequency.

#### Information content in matrices and alignment length

Information content among different scoring matrices varies. This information content, usually given as bits per position, decreases with decreasing target similarity and with increasing deviation from the optimal target frequency (Figures 2, 3, and 4). In other words, scoring matrices that target low similarities, such as BLOSUM45, PAM250, and VTML160 have lower information content than matrices that

target high similarities, such as BLOSUM90, PAM100, and VTML10 (Tables 1 and 2). Scoring with low information content matrices requires longer alignment lengths than scoring with high information content matrices for an alignment to be statistically significant. For example, if an alignment needs, say, 50 bits to be significant, then scoring with BLOSUM62 gives 0.4 bits per position, and the alignment needs to be at least 50/0.4 = 125 residues long. If we were to use VTML10 matrix instead, which gives 3.87 bits per position, the minimum alignment length needs to be only 50/3.87 = 13 residues long to be statistically significant (scores are with gap penalties, Table 2).All this is essential knowledge since although statistics most of the time prevent us from getting false hits, we may miss short conserved domains when using matrices that target low similarities. As in the example above, we would have lost a highly conserved domain, which length is less than 125 residues if we used BLOSUM62, but would have found it by using VTML10 instead. Note that these values apply to scores using a matrix whose target frequency matches a real underlying frequency. By deviating from the optimal target frequency, the minimum statistically significant alignment lengths increase accordingly.

For example, if we are looking for human-mouse orthologs which share about 83% sequence identities, we should use matrices that have the target frequency optimized close to that rate.

Similarity (%) |
PAM# (E bits) |
BLOSUM# (E bits) |
---|---|---|

43 | 100 (1.18) |
90 (1.18) |

38 | 120 (0.98) |
80 (0.99) |

30 | 160 (0.70) |
60 (0.66) |

25 | 200 (0.51) |
50 (0.52) |

20 | 250 (0.36) |
45 (0.38) |

DNA sequences are often more divergent than protein at an equal evolutionary distance because codons are redundant, primarily the third codon which is the most wobbly one. A codon consisting of three bases can code for 64 different amino acids, but only 20 is in use, so the remaining (64-20) 44 codons code for the same amino acid and three stop codons. Methionine is the only exception which codes for a start; Therefore, many mutations in DNA keep the codons coding for the corresponding amino acid, such mutations are *synonymous* or silent mutations. Mutations that result in codons to code a different amino acid are *non-synonymous*.

In a residue per residue, comparison DNA contains less information than protein. One DNA base can contain up to two bits and an amino acid up to about 4.3 bits. However, if we compare DNA codons which consist of base triplets, then DNA can contain up to 3 x 2 = 6 bits of information surpassing the maximum information content of an amino acid.

Should we then compare DNA instead of protein sequences? Well, it depends on what our goals with comparisons are. Firstly, if we translate DNA into protein, we again end up having only up to 4.3 bits of information per residue instead of the 6 bits that DNA originally contained. Secondly, to capture the whole DNA information content a sequence alignment algorithm needs to account for the triplets and not only align base by base manner. However, it is possible to question to what end this is useful since most of the changes in DNA do not result in a different amino acid.

On the other hand, just by counting differences between DNA sequences, we can estimate evolutionary distances. Thirdly, the information content of sequence alignments of both DNA and protein decreases with increasing evolutionary distance but not at the same rate and become equal at around 50% identity, and whereafter DNA sequences end up containing less information than protein sequences with growing divergence. For more information on DNA matrices, see the tutorial "DNA scoring matrices."

#### The effect of gap penalties

The scoring and information content in matrices are from sequence alignments with no gaps, and thus the inclusion of gaps in alignments decreases the information content. There is no theoretical mathematical model for gapped alignments for various evolutionary distances because to estimate background frequencies of gaps is challenging and furthermore these background frequencies are likely to change with different gap penalty values; Consequently, gap penalties need to be determined by simulations and empirical studies.

The setting of gap penalties is probably the most crucial in short evolutionary distances, and low gap penalties are inefficient in identifying highly similar sequences just as is the case with scoring matrices themselves. However on the other hand, while we want alignments to contain as few gaps as possible, we want them to be long enough to achieve statistical significance, presenting a sort of a dilemma.

The rule of thumb for gap penalties follows the same logic as in using scoring matrices to score highly similar sequences giving high penalties for substitutions and low penalties for highly divergent sequences. So, searching for highly similar sequences the gap penalties should be high and for divergent sequences low. By decreasing gap penalties the alignment lengths tend to increase.

Reese and Pearson performed simulations to maximize the performance of short alignments and suggest that for a half-bit scoring matrices, gap open penalties of 15 and gap extension penalty of 2 for VTML20 are the most efficient.

The allowance of gaps in alignments reduces the information content considerably. The effect is the most profound within short evolutionary distances. For instance, PAM30 with the gap open penalty of nine and the extension penalty of one yields 0.90 bits per position, and without gaps, PAM30 produces 2.6 bits per position almost a triple of information. With BLOSUM62 the information content increases from 0.40 to 0.74 without gaps and the minimum alignment length generating a statistically significant score, requiring 50 bits, reduce from 125 to 68 residues (Table 2).

Matrix | Gap penalty^{1} |
Similarity (%) | Bits/pos. | 50 bit length |
---|---|---|---|---|

BLOSUM80 | 10/1 | 32.0 | 0.48 | 104 |

BLOSUM62 | 11/1 | 28.9 | 0.40 | 125 |

VTML140 | 10/1 | 28.4 | 0.44 | 114 |

VTML120 | 11/1 | 32.1 | 0.54 | 93 |

VTML80 | 10/1 | 40.5 | 0.74 | 68 |

VTML40 | 13/1 | 64.7 | 1.92 | 26 |

VTML20 | 15/2 | 86.1 | 3.30 | 15 |

VTML10 | 16/2 | 90.9 | 3.87 | 13 |

PAM70 | 10/1 | 33.9 | 0.58 | 86 |

PAM30 | 9/1 | 45.9 | 0.90 | 56 |

#### Alignment overextension

Local alignment algorithms find local high scoring segments between sequences and programs such as SSEARCH and BLAST will always only report alignments that between sequences that share a statistically significant domain. However, the exact length of the alignment depends on the location of the domain and the scoring matrix used. The alignments may overextend beyond true homologous domain boundaries when we use matrices whose target frequency is low and align closely related sequences. Overextension is most likely to occur when sequences contain short repeated domains.

When in doubt we can use the annotation feature in FASTA/SSEARCH programs to evaluate the alignments. The annotation feature uses available domain annotations to subdivide alignments to give a clear view of partial scores and information content (Figure 5.)

```
TR:A0A2I3RKW2_PANTR A0A2I3RKW2 VAV1 isoform 4 OS=Pan troglodytes OX=9598 GN=VAV1
PE=4 SV=1 (790 aa)
Region: 1-12:594-605 : score=104; bits=28.1; Id=1.000; Q=46.2 : UnDef :
Region: 13-22:606-615 : score=69; bits=18.6; Id=1.000; Q=18.1 : NODOM :0
Region: 23-117:616-710 : score=622; bits=168.1; Id=1.000; Q=462.0 : UnDef :1
Region: 118-133:711-726 : score=96; bits=25.9; Id=1.000; Q=39.8 : NODOM :0
Region: 134-193:727-786 : score=437; bits=118.1; Id=1.000; Q=313.5 : UnDef :1
s-w opt: 1328 Z-score: 1892.2 bits: 358.9 E(120841748): 1.7e-95
Smith-Waterman score: 1328; 100.0% identity (100.0% similar) in 193 aa overlap (1-193:594-786)
10 20 30 40
P15498 WFPCNRVKPYVHGPPQDLSVHLWYAGPMERAGAESILANR
::::::::::::::::::::::::::::::::::::::::
TR:A0A PPGAIGPFLRLNPGDIVELTKAEAEQNWWEGRNTSTNEIGWFPCNRVKPYVHGPPQDLSVHLWYAGPMERAGAESILANR
560 570 580 590 600 ][ 610 ][ 620 630
50 60 70 80 90 100 110 120
P15498 SDGTFLVRQRVKDAAEFAISIKYNVEVKHIKIMTAEGLYRITEKKAFRGLTELVEFYQQNSLKDCFKSLDTTLQFPFKEP
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
TR:A0A SDGTFLVRQRVKDAAEFAISIKYNVEVKHIKIMTAEGLYRITEKKAFRGLTELVEFYQQNSLKDCFKSLDTTLQFPFKEP
640 650 660 670 680 690 700 71][
130 140 150 160 170 180 190
P15498 EKRTISRPAVGSTKYFGTAKARYDFCARDRSELSLKEGDIIKILNKKGQQGWWRGEIYGRVGWFPANYVEEDY
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
TR:A0A EKRTISRPAVGSTKYFGTAKARYDFCARDRSELSLKEGDIIKILNKKGQQGWWRGEIYGRVGWFPANYVEEDYSEYC
720 ][730 740 750 760 770 780 790
```

*Figure 5.*The annotation feature of FASTA/SSEARCH. The subdivision of the alignment to five separate regions and respective scores. We colored the domains for the illustration purpose.

Scott Barlowe and colleagues recently developed an R-package called "SubVis" to rapidly explore the effects of multiple scoring matrices in pair-wise sequence alignments. The source code is freely available at The Comprehensive R Archive Network, and they also have made demo videos. Link to their paper is in the subsection References below.

#### Other scoring matrices

Besides the general and popular matrices, many other matrices exist a few are general, such as JTT, 1992 (PubMed:1633570) Gonnet, 1992 (PubMed:1604319) OPTIMA, 2000 (PubMed:11056037), and many specialized ones, for example, Jean-Michel Claverie's scoring matrix specializing in detecting frameshifts, "Detecting Frame Shifts by Amino Acid Sequence Comparison," or to detect transmembrane proteins and many more.

#### Summary

The choice of scoring matrices and gap penalties greatly affect the alignment results. However, modern software tools employ sophisticated statistics that prevents us from getting false hits in database searches, and we are likely always to get matches to well-conserved domains given the large and rapidly growing size of sequence databases today. A wrong choice of a matrix may also result in overextension of alignments.

While we may not get false hits, we may miss evolutionarily distant homologs due to a choice of matrix and associated gap penalties. The efficiency of scoring matrices decreases both with their decreasing target similarity and a deviation from a real similarity from this matrices' optimal similarity. This decrease of efficiency results in the requirement of increasingly longer sequence alignments to be statistically significant.

The rule of thumb is to use matrices with high optimal similarity to align short highly similar homologous domains (< 50 aa / < 150 nt) using accordingly high gap penalties and lower gap penalties and matrices with low target similarity to align distantly related sequences.

No matter what scoring scheme we devise, it will always imply a corresponding optimal target similarity. See "Evaluate DNA alignment scoring," for example.

#### What next?

Pair-wise sequence alignment

Tutorials Main Page

#### Related tutorials

Pair-wise sequence alignment

Pair-wise sequence alignment methods

Construction of substitution matrices

DNA scoring matrices

#### References

Tobias Müller and Martin Vingron, *"Modeling Amino Acid Replacement."* JOURNAL OF COMPUTATIONAL BIOLOGY Volume 7, Number 6, 2000 Mary Ann Liebert, Inc. Pp. 761–776. https://doi.org/10.1089/10665270050514918

Tobias Müller, Rainer Spang, and Martin Vingron, *"Estimating Amino Acid Substitution Models: A Comparison of Dayhoff’s Estimator, the Resolvent Approach and a Maximum Likelihood Method."* Mol Biol Evol. 2002 Jan;19(1):8-13. PubMed

Gavin A. Price, Gavin E. Crooks, Richard E. Green and Steven E. Brenner, *"Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap."* Bioinformatics, Volume 21, Issue 20, 15 October 2005, Pages 3824–3831. https://doi.org/10.1093/bioinformatics/bti627

Robert C Edgar, *"Optimizing substitution matrix choice and gap parameters for sequence alignment."* BMC Bioinformatics 2009, 10:396. doi:10.1186/1471-2105-10-396

Steven Henikoff and Jorja G. Henikoff, *"Performance Evaluation of Amino Acid Substitution Matrices."* PROTEINS:Structure, Function, and Genetics 17:49-61 (1993). PubMed

David T.Jones, William R.Taylor and Janet M.Thornton, *"The rapid generation of mutation data matrices from protein sequence."* Mol Biol Evol. 2002 Jan;19(1):8-13. PubMed

Müller T, Spang R, Vingron M., *"Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method."* Mol Biol Evol. 2002 Jan;19(1):8-13. PubMed

S Henikoff and J G Henikoff, *"Empirical determination of effective gap penalties for sequence comparison."* Bioinformatics. 2002 Nov;18(11):1500-7. PubMed

S Henikoff and J G Henikoff, *"Amino acid substitution matrices from protein blocks."* PNAS November 15, 1992 89 (22) 10915-10919 https://doi.org/10.1073/pnas.89.22.10915

Burkhard Rost, *"Twilight zone of protein sequence alignments."* Protein Engineering vol.12 no.2 pp.85–94, 1999.

Ying Chen, Weicai Ye, Yongdong Zhang1, and Yuesheng Xu, *"High speed BLASTN: an accelerated MegaBLAST search tool."* 7762–7768 Nucleic Acids Research, 2015, Vol. 43, No. 16 Published online 6 August 2015 doi: 10.1093/nar/gkv784.

Zheng Z, Schwartz S, Wagner L et al., *"A greedy algorithm for aligning DNA sequence."* Journal of Computational Biology, 2000, 7(12, 1/2): 203–214.

Guang-Ming Tan, Lin Xu, Dong-Bo Bu, Sheng-Zhong Feng, Ning-Hui Sun, *"Improvement of Performance of MegaBlast Algorithm for DNA Sequence Alignment."* J Comput Sci Technol (2006) 21: 973. https://doi.org/10.1007/s11390-006-0973-0

Altschul, Stephen; Gish, Warren; Miller, Webb; Myers, Eugene; Lipman, David (1990). "*Basic local alignment search tool.*" Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. ⚝^{➚}

Karlin S, Altschul SF. "*Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.*" Proc Natl Acad Sci U S A. 1990 Mar;87(6):2264-8. PubMed

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "*Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.*" Nucleic Acids Res. 25:3389-3402. PubMed

J. T. Reese and W. R. Pearson "*Empirical determination of effective gap penalties for sequence comparison.*" Bioinformatics, Volume 18, Issue 11, 1 November 2002, Pages 1500–1507, https://doi.org/10.1093/bioinformatics/18.11.1500

William R. Pearson, Weizhong Li, and Rodrigo Lopez "*Query-seeded iterative sequence similarity searching improves selectivity 5–20-fold.*" Nucleic Acids Res. 2017 Apr 20; 45(7): e46. Published online 2016 Dec 6. doi: 10.1093/nar/gkw1207.

Mileidy W. Gonzalez and William R. Pearson "*Homologous over-extension: a challenge for iterative similarity searches.*" Nucleic Acids Res. 2010 Apr; 38(7): 2177–2189. Published online 2010 Jan 11. doi: 10.1093/nar/gkp1219.

Frank Keul, Martin Hess "*PFASUM: a substitution matrix from Pfam structural alignments.*" BMC BioinformaticsBMC series – open, inclusive and trusted 201718:293. https://doi.org/10.1186/s12859-017-1703-z.

Scott Barlowe, Heather B. Coan and Robert T. Youker "*SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment.*" PeerJ 5:e3492. DOI 10.7717/peerj.3492.

Scott Barlowe, Heather B. Coan and Robert T. Youker "*Non-symmetric score matrices and the detection of homologous transmembrane proteins*" Bioinformatics, Vol. 17 Suppl. 1 2001 Pages S182–S189. PubMed.