For better experience, turn on JavaScript!

Introduction to information theory and its applications to DNA and protein sequence alignments, part 2: How to Measure Information?
... Continued from previous page.

Part 2: Introduction to Information Theory and Its Applications to DNA and Protein Sequence Alignments

How to Measure Information?

Claude Elwood Shannon (1916-2001).
Claude Elwood Shannon (1916-2001).

Claude Shannon laid out the foundations for information theory that can be applied to practice in the 1940s while he was working at the Bell Labs. At his time the telegraph systems were analogous and had to have an amplifier after every few miles of cable to amplify signals. The problem was that these amplifiers also amplified the noise.

Shannon solved this problem by digitizing the messages by only sending zeros and ones, a little 'bit' like in Morse code. It is much easier to separate just two states than continuous analog variations of a signal. Related work also resulted in a definition of a maximum rate of information transfer through a communication channel, among many other important ones. As we know today, this worked quite well, to put it mildly.

The more surprised we are, the more information we gain

Shannon's revolutionary realization was to separate the meaning of information from the quantity of information and further relate a quantity of information to how surprised one is when observing a message.

For example, if we read yesterday's news today, we would hardly be surprised and using Shannon's definition, we would not have gained any new information. Similarly, if we receive a message solely consisting of 'A's, as in AAAAAAA..., and nothing more, that would not be extremely surprising either. Alternatively, we could make it a long chain consisting of only ones or only zeros. Either way, there's no surprise reading along either of the messages; Thus, the amount of information in these messages is zero.

Probabilities give the amount of surprisal

The surprisal connects to the probability. In the above cases, each string only consists of a single symbol; Therefore, the probability of seeing any of these in a single message is one. If the probability of seeing a symbol is 0.5, then a message would consist of an equal number of two different symbols, for example, ten ones and ten zeros or five ones and five zeros. Having two symbols, also means that we have two alternatives, e.g., either one or zero.


So, if we only use ones and zeros to store and transmit information, how much information can this symbol set maximally contain? The maximum information amount of binary digits is the same as flipping a fair coin. In both cases, we have two alternatives, zero or one and head or tail. Consequently, if we choose once we have gained all the information they can contain.

The amount of information is also related to how many questions we minimally have to ask to arrive at an answer. In the case of a fair coin, we need to ask once "is it tail?" or "is it head?" It does not matter what the answer is as long as it is "yes" or "no." In both cases, we would then know which one it is. Because we only have to ask a single question to arrive at a correct answer, we say that the maximum amount of information in systems like this is one bit and in the cases above where there is only one symbol to choose from, we would not need to make any guesses to have the answer. Zero guesses required equals zero amount of information and likewise zero surprisal.

Conveniently, by taking the logarithm of base two of two equals one, \(log_{2} (2) = 1 \), giving us the relationship between the number of choices and the minimum number of guesses we need to make to arrive at a correct answer when both choices are equally probable. In other words, the amount of information required to arrive at a correct answer.


Let's take a convenient number of 1,048,576 locker cabinets with one of the locker cabinets containing a million dollar price, but we don't know in which of the cabinets it is but we are allowed to ask questions. What would be the best strategy if the only answer to our questions is yes or no? To minimize the number of questions needed to ask is to each time halve the possibilities; Therefore, we imagine the locker cabinets lined up in a row and ask is the locker with the price in the left half. If the answer is yes, we would again halve the possibilities by asking is the cabinet on the left side of the half of the half and so on until we only have a single cabinet left and thus found the price.

By using this halving strategy (a.k.a binary search) how many questions we maximally have to ask to pinpoint in which of the locker cabinets the million dollar price is? Given that each of the lockers is equally likely to contain the price, the number of questions we need to ask is 20. The easiest way to get this number is to take the logarithm base two of the number of lockers, \( log_{2}(1,048,576) = 20 \); Thus, we need 20 bits of information to find the price. The same amount as the number of questions we need to ask.

DNA consists of four bases A, T, G, and C, so we have four possibilities. By taking the logarithm base two of four, we get 2 which is again the same number of question we need to ask to pinpoint a letter. However, at first glance, it looks like we need to ask three questions, but we are smart and divide the bases into two sets: 1: (A, T) and B: (G, C). We can then first ask if the right base is in the set 1. If it is not, we can then proceed by asking is it G for example and the yes or no answer to this question lets us know which base is correct, C in this example.

In the case of proteins the integer relationship of the number of bits and the number of questions we are required to ask breaks down. There are 20 different amino acids; Thus, the maximum amount of information one amino acid can carry is \(log_{2}(20) \approx 4.3\) bits. It is difficult to imagine how to ask 4.3 questions. Perhaps adding some hint, I don't know. The answer may become philosophical, and like to skip it, but the previous examples show the relationship of information to practical actions that we need to take in the form of asking questions to extract information.

However, we can calculate an upper bound by adding one to the count, \(log_{2}(21) = 4.39\), so we know that the exact entropy is somewhere in between 4.32 and 4.39 bits.

Information vs Entropy

In all the examples the entropy is at its maximum before we started to ask questions and the maximum entropy in each case is the same as the maximum number of questions N we need to ask to get the final right answer. We can write it like this \( H = log_{2}(N) \), where N is the element count in a set and H is the entropy in bits.

We also note the handy relationship \( 2^{H} = N \); Thus, \( 2^{N} = 4 \) in case of DNA and in protein sequences \( 2^{4.3} \approx 20 \). The interesting thing about reducing the exponent by one is that each time the count is half of the previous count: \( 2^{2} = 4 \), \( 2^{1} = 2 \), and for protein \( 2^{4.3} \approx 20 \), \( 2^{3.3} \approx 10 \), \( 2^{2.3} \approx 5 \), \( 2^{1.3} \approx 2.5 \), and so on. Each time we ask a question, i.e., reduce the exponent of two by one we gain information \(I\) the same amount as the decrease in entropy \(H\) is. Consequently, this gives us the definition of information \(I\). Information is the amount of entropy before we gain information minus the entropy after we gained information: \[ I = H_{Before} - H_{After} \: \]

(Eq. 1).

For example, before flipping a coin, the \(H_{Before}\) is 1 and after flipping the coin, \(H_{After}\) is 0, and thus the information gained is then 1 - 0 = 1.

However, we are not limited to reducing the count by whole numbers it can be any real number. I just used whole numbers to exemplify the relationship of questions to information and entropy.

Entropy and unequal probabilities

So far we have only considered entropy with all the symbols in a message having equal probabilities. If all the symbols have the same probability in a message, then the entropy of the message is as high as it can be. In other words, when symbols in a message have varying probabilities, then the total entropy of a message is lower compared to a case where all the symbols have equal probabilities (Figure 1).

Figure 1. Entropy in a coin flip. A fair coin where the probability of getting heads or tails is equal, i.e., 0.5 yields the highest entropy, 1. Gradually making a coin more biased reduces entropy. For example, in the extreme case where a coin always lands on tails, the entropy is zero.
Figure 1. Entropy in a coin flip. A fair coin where the probability of getting heads or tails is equal, i.e., 0.5 yields the highest entropy, 1. Gradually making a coin more biased reduces entropy. For example, in the extreme case where a coin always lands on tails, the entropy is zero.

When all the symbols have equal probabilities and in cases where we don't know what the probabilities are, we can assume the worst case entropy and calculate it by using the total count \(N\) of the symbols in a message: \(H=log_{2}(N)\). However, if we know all the probabilities and they are not equal, we can reduce the entropy by calculating the entropy of each symbol separately and then summing the individual entropies to get the total entropy of a message.

How to sum the probabilities? To illustrate how to sum probabilities to calculate a total entropy of a sequence, we first look at a DNA sequence ATGC. The total length of the sequence is four, and the probabilities are thus 0.25 for each base. Since all the probabilities are equal, we could calculate the total entropy as before \(log_{2}(4)=2\), but to demonstrate how we get the same result by adding the individual probabilities, we calculate as follows: \( 4 \cdot 0.25 \cdot log_{2}(0.25) = -2\). Note that the answer now is negative. Consequently, the definition of Shannon's entropy is \[ H = - \sum_{i}^{N} p_{i} log_{2}(p_{i}) \]

(Eq. 2),

and includes the minus sign; Therefore, our calculation is correct if we add the minus sign, i.e., - - 2 = 2 bits. N is the number of symbols and \(p_{i}\) are the probabilities.

In a more extended sequence, we can calculate the entropy the same way. For example, ATGGCCGTTT, the length of the sequence is ten, and the probabilities for each letter are as follows. A: 1/10, T: 4/10, G: 3/10, and C: 2/10 that give the total entropy \( H =\) \( - \left( \frac{1}{10} \times log_{2}(\frac{1}{10}) + \frac{4}{10} \times log_{2}(\frac{4}{10})\\ + \frac{3}{10} \times log_{2}(\frac{3}{10}) + \frac{2}{10} \times log_{2}(\frac{2}{10}) \right)\\ \approx 1.846 \: bits.\)

If we didn't know the individual entropies, we could have calculated entropy as \(log_{2}(10) \approx 3.322\), and thus get a higher entropy value because of the lack of knowledge of the individual entropies, which we get by assuming equal probabilities for each letter. We already know that the entropy is highest possible when all the probabilities are equal.

The origin of the bit

Shannon coined the name "bit" as a unit measure of information in his 1948 paper "The Mathematical Theory of Communication," as a short form of a "binary digit." Although, he gave the credit of the name to one of his colleagues at the Bells Labs, that particular "bit" is used in computers as a measure of storage and is not the same as the unit of information. A bit in a computer can represent less information than the information unit "bit."

It is about the amount of information, whatever the meaning

The meaning of information has nothing to do with Shannon's amount of information. For example, the word "AND" contains the same amount of information even we spell it backward to "DNA." Similarly, 010101 carries the same amount of information 101010 as well as a DNA codon ATG and GTA or GAT all carry the same amount of information only the meaning is different. How to calculate the meaning of information is a separate tutorial series that we intend to write. Keep an eye on announcements of new tutorials on the home page.

Go to the next page for "Information in DNA, Protein, and Sequence Alignments."