Part 3: Introduction to Information Theory and Its Applications to DNA and Protein Sequence Alignments
Information in DNA, Protein, and Sequence Alignments
By now we know that bases in DNA sequences can carry up to two bits of information and amino acids in protein sequences about 4.3 bits.
Let's say that we have an unknown DNA sequence and want to search a database for similar sequences, hoping to infer the function of our sequence. We can infer the function if we find sequences that share a common ancestor with our sequence and whose function is known. The primary requirement is that our sequence contains enough information to let us confidently distinguish random matches from matches that share a common ancestry with our sequence.
Sequence databases are enormous and growing rapidly. For example, GenBank has doubled in size every 18 months, and in August 2018 it contained 260,806,936,411 bases plus 3,204,855,013,281 bases in the whole genome sequencing (WGS) section. So how much information do we need for a sequence search that can produce statistically significant results?
The total number of bases in GenBank is 3,465,661,949,692, and we now know that the entropy is \( H = \log_{2}(3{,}465{,}661{,}949{,}692) \approx 42 \) bits. Thus, our sequence needs to contain at least that amount of information. Because a DNA base can carry up to two bits, the sequence needs to be at least 42/2 = 21 bases long.
Amino acids can carry more information per residue than DNA bases, up to 4.3 bits/residue. Consequently, a protein sequence that can produce statistically significant matches is much shorter than a DNA sequence, only about 42/4.3 ≈ 10 residues.
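The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function and variable names are illustrative, the base counts are the August 2018 figures quoted above, and the per-symbol capacities are the maximum values of 2 bits per base and \( \log_{2}(20) \approx 4.3 \) bits per residue.

```python
import math

# Base counts quoted in the text for August 2018.
GENBANK_BASES = 260_806_936_411
WGS_BASES = 3_204_855_013_281

def bits_to_address(database_size):
    """Bits needed to single out one position among database_size positions."""
    return math.log2(database_size)

def min_length(required_bits, bits_per_symbol):
    """Shortest sequence, in whole symbols, carrying at least required_bits."""
    return math.ceil(required_bits / bits_per_symbol)

total = GENBANK_BASES + WGS_BASES
required = round(bits_to_address(total))          # rounded as in the text
dna_length = min_length(required, 2.0)            # up to 2 bits per base
protein_length = min_length(required, math.log2(20))  # up to ~4.3 bits per residue

print(f"total bases: {total:,}")
print(f"required information: {required} bits")
print(f"minimum DNA length: {dna_length} bases")
print(f"minimum protein length: {protein_length} residues")
```

These lengths are lower bounds that hold only for exact matches in which every symbol carries its maximum information.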
Information Content Is Dependent On a Scoring Scheme In Sequence Alignments
The above calculations apply only to 100% identical matches, but when we are interested in finding more distant relationships, we need to use a scoring scheme. This scoring scheme is preferably optimized to target a specific evolutionary distance or a specific sequence similarity range.
Using a scoring scheme or a scoring matrix reduces the information content below the maximum possible, unless the scoring system targets 100% identity. Consequently, to produce statistically significant alignments, the sequence length must increase accordingly. Increasingly distant homologs require increasingly long sequences to be statistically significant.
Identical matches and matches between similar amino acids carry different amounts of information under different scoring systems. In addition, alignments inevitably need to include gaps, which also increase the minimum alignment length.
For example, if 50 bits of information is the minimum required to produce a significant match and we use the BLOSUM62 scoring matrix, the minimum alignment length that can produce a statistically significant match is 50/0.40 = 125 residues. If we use the VTML10 matrix, the minimum length is 50/3.87, which is only about 13 residues (Table 1). See also the tutorial "How to select the right substitution matrix?" for more on how to select a scoring matrix.
Matrix | Gap penalty (open/extend) | Similarity (%) | Bits/pos. | Length for 50 bits (residues) |
---|---|---|---|---|
BLOSUM80 | 10/1 | 32.0 | 0.48 | 104 |
BLOSUM62 | 11/1 | 28.9 | 0.40 | 125 |
VTML140 | 10/1 | 28.4 | 0.44 | 114 |
VTML120 | 11/1 | 32.1 | 0.54 | 93 |
VTML80 | 10/1 | 40.5 | 0.74 | 68 |
VTML40 | 13/1 | 64.7 | 1.92 | 26 |
VTML20 | 15/2 | 86.1 | 3.30 | 15 |
VTML10 | 16/2 | 90.9 | 3.87 | 13 |
PAM70 | 10/1 | 33.9 | 0.58 | 86 |
PAM30 | 9/1 | 45.9 | 0.90 | 56 |
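The last column of Table 1 follows directly from the bits-per-position column. The sketch below recomputes it. The dictionary simply copies the table values, the names are illustrative, and rounding to the nearest integer is an assumption chosen because it reproduces the tabulated lengths.

```python
# Bits per aligned position for each matrix, copied from Table 1.
BITS_PER_POSITION = {
    "BLOSUM80": 0.48,
    "BLOSUM62": 0.40,
    "VTML140": 0.44,
    "VTML120": 0.54,
    "VTML80": 0.74,
    "VTML40": 1.92,
    "VTML20": 3.30,
    "VTML10": 3.87,
    "PAM70": 0.58,
    "PAM30": 0.90,
}

def alignment_length(required_bits, bits_per_position):
    """Alignment length needed to accumulate required_bits (nearest integer)."""
    return round(required_bits / bits_per_position)

lengths = {
    matrix: alignment_length(50, bits)
    for matrix, bits in BITS_PER_POSITION.items()
}

for matrix, length in lengths.items():
    print(f"{matrix:9s} {length:4d} residues")
```

Changing the first argument of `alignment_length` shows how the required length scales with a different significance threshold.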
Modern database search tools commonly employ efficient statistics that prevent us from getting matches that are not statistically significant and should be adequate for most common usage scenarios.
In the following subsection we explore the information contained in multiple sequence alignments. Since most of the common scoring matrices are based on multiple sequence alignments, this gives some insight into the origin of the entropy in these matrices. PAM matrices, however, are based on an evolutionary model and not on multiple alignments. You can find more about PAM matrices in the tutorial "Construction of substitution matrices."
For those who want in-depth knowledge and intend to design scoring systems and matrices for special purposes, we aim to write a separate tutorial on this topic, which will include an in-depth treatment of mutual information in sequence alignments. Keep an eye on the main tutorials page and our home page.
Information in Multiple Sequence Alignments
The calculation of the entropy in multiple sequence alignments is similar to the calculation of entropies in single sequences. The difference is that in multiple alignments we determine the entropy column-wise, i.e., across the aligned sequences. An example best explains this.
Let's look at the alignment of sequences in Figure 2. In column 12 there are two As, three Ts, two Gs, and two Cs. We can use Equation 2 to calculate the entropy \(H\) of the column.
The frequencies of the letters are A: \( \frac{2}{9}\), T: \( \frac{3}{9}\), G: \( \frac{2}{9}\), and C: \( \frac{2}{9}\). Thus, the entropy is \(H = - \left( \frac{2}{9} \times \log_{2}(\frac{2}{9}) + \frac{3}{9} \times \log_{2}(\frac{3}{9}) + \frac{2}{9} \times \log_{2}(\frac{2}{9}) + \frac{2}{9} \times \log_{2}(\frac{2}{9}) \right) \approx 1.97 \approx 2.0 \) bits, and using Equation 1 the information content is \( H_{Before} - H_{After} = 2.0 - 2.0 = 0.0 \) bits. Here \( H_{Before} = 2 \) bits, because before we made the alignment we knew nothing about the column, and the maximum entropy per letter in a DNA sequence is two bits.
For the sake of an exercise, I also show the entropy and information calculations for column 19. There are only two letters, T and G, with corresponding frequencies of \( \frac{8}{9} \) and \( \frac{1}{9} \). Thus, the entropy is \(H = - \left( \frac{8}{9} \times \log_{2}(\frac{8}{9}) + \frac{1}{9} \times \log_{2}(\frac{1}{9}) \right) \approx 0.5\) bits, and the corresponding information \(I\) per letter is \( 2 - 0.5 = 1.5 \) bits.
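The two column calculations can be checked with a short script. This is a minimal sketch with illustrative function names. Only the letter counts come from the text: the order of the letters within each string is arbitrary and does not affect the result.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the symbol frequencies in one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def column_information(column, max_entropy=2.0):
    """Information (bits): entropy before the alignment minus entropy after.

    max_entropy defaults to 2 bits, the maximum for a four-letter DNA alphabet.
    """
    return max_entropy - column_entropy(column)

# Letter counts from the text; the ordering within each string is arbitrary.
column12 = "AATTTGGCC"   # 2 A, 3 T, 2 G, 2 C
column19 = "TTTTTTTTG"   # 8 T, 1 G

for name, column in (("column 12", column12), ("column 19", column19)):
    h = column_entropy(column)
    i = column_information(column)
    print(f"{name}: H = {h:.2f} bits, I = {i:.2f} bits")
```

For a protein alignment the same functions apply with `max_entropy=math.log2(20)`.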
The total information content of the motif is 28.1 bits, which is the sum of the information of each of the columns. At the time of writing, the size of GenBank requires a minimum of 42 bits of information. Consequently, the information content of this motif is not enough to get statistically significant hits in GenBank, even if they were 100% identical to any of the sequences in the multiple alignment in Figure 2. Using the sequences for searching databases is not the purpose of this exercise, however. Read on.
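Summing the per-column information over a whole alignment can be sketched as follows. The alignment of Figure 2 is not reproduced here, so the four short sequences below are a hypothetical toy example and the resulting total is not the 28.1 bits of the motif. The function names and the 42-bit threshold constant are likewise illustrative.

```python
import math
from collections import Counter

def column_information(column, max_entropy=2.0):
    """Information (bits) of one column: max_entropy minus its Shannon entropy."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return max_entropy - entropy

def per_column_information(alignment):
    """Information of every column of an alignment given as equal-length strings."""
    return [column_information(column) for column in zip(*alignment)]

# Hypothetical toy alignment, NOT the alignment shown in Figure 2.
toy_alignment = [
    "ACGT",
    "ACGA",
    "ACTT",
    "ACGT",
]

REQUIRED_BITS = 42  # database-size threshold derived earlier in the text

profile = per_column_information(toy_alignment)
total_information = sum(profile)

print("per-column information:", [round(i, 2) for i in profile])
print(f"total: {total_information:.2f} bits")
print("enough for a significant hit:", total_information >= REQUIRED_BITS)
```

The per-column values in `profile` are the kind of numbers a position-specific histogram displays, and their sum is the total that is compared against the threshold.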
Figure 3 shows a histogram of the position-specific information content of the multiple alignment in Figure 2. However, a large number of software tools exist for creating so-called sequence logos, which are more informative than the simple histogram (see the list of software tools for creating sequence logos in the subsection "Sequence Logo Software" below).
Sequence logos are an excellent way to visualize shared conserved features of DNA or protein sequences, such as binding sites of transcription factors or active sites in proteins. The function of DNA depends on its shape, while in proteins the specific sequence determines how the protein folds, although chaperones aid the folding.
Sometimes even a single residue alteration results in loss of function. With sequence logos, we can visually discriminate essential residues from those that are less important for a function.
Sequence logos show symbols stacked on top of each other. The height of each symbol is dependent on its contribution to the total information content in a position, and the height of a stack gives the total information content per symbol (Figure 4).
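The stacking rule can be illustrated numerically. The sketch below assumes the widely used convention in which each symbol's height is its frequency multiplied by the information content of the column. Small-sample corrections applied by some logo tools are left out, and the function name is illustrative. The input uses the letter counts of column 19 from the text.

```python
import math
from collections import Counter

def logo_heights(column, max_entropy=2.0):
    """Height (bits) of each symbol in one logo stack.

    Assumed convention: height = symbol frequency * column information,
    with no small-sample correction.
    """
    n = len(column)
    freqs = {symbol: count / n for symbol, count in Counter(column).items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = max_entropy - entropy
    return {symbol: f * information for symbol, f in freqs.items()}

column19 = "TTTTTTTTG"   # 8 T and 1 G, as in the text
heights = logo_heights(column19)
stack_height = sum(heights.values())

# Logos usually draw the tallest symbol on top, so list from smallest to largest.
for symbol, height in sorted(heights.items(), key=lambda item: item[1]):
    print(f"{symbol}: {height:.2f} bits")
print(f"stack height: {stack_height:.2f} bits")
```

Under this convention the symbol heights always add up to the information content of the column, so a fully conserved DNA position yields a single letter 2 bits tall.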
There are many other ways to construct sequence logos apart from using Shannon's information. A separate tutorial on sequence logos will appear soon. In the meantime, see the list on the next page of online software tools and downloadable packages, with links, that can create sequence logos.