Introduction to Information Theory and Its Applications to DNA and Protein Sequence Alignments
- To have a fundamental understanding of what information and entropy are.
- To be able to calculate the information content of DNA and protein sequences using Shannon's entropy equation.
- To have basic knowledge of how to create sequence logos.
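As a preview of the second objective, here is a minimal sketch of Shannon's entropy calculation applied to a string of DNA bases. The function name `shannon_entropy` is an illustrative choice, not something defined in this tutorial.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(seq)          # frequency of each base
    n = len(seq)
    # H = -sum over symbols of p * log2(p), with p = count / length
    return sum(-(c / n) * log2(c / n) for c in counts.values())

# An even mix of the four DNA bases carries the maximum of 2 bits per base:
print(shannon_entropy("ACGT"))  # → 2.0
# A perfectly conserved position carries no uncertainty at all:
print(shannon_entropy("AAAA"))  # → 0.0
```

The details behind this formula are covered in part 2, "How to Measure Information?"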
Unless you have a good knowledge of how to align DNA and protein sequences, I recommend that you at least read the tutorial "Pair-wise sequence alignment." The link is at the top of this page.
What is Information?
The word information is familiar to all of us, and we may think that we know what it means. We may, in general, think about the information age, information technology, information on the internet, news (true or fake), governments' intelligence operations, instructions on how to perform a task, and so on.
Humans have transferred and stored information since the invention of writing, and already a few decades ago warnings began to appear that a data tsunami was about to hit us. We certainly produce more data than ever before, at an ever-increasing rate and in varied forms and content: about 2.5 quintillion bytes each day, where a quintillion is a one followed by 18 zeros (U.S. definition). That is a vast amount, but how much information is in the data? We measure data in bytes, but is the amount of data the same as the amount of information? Is information an abstract concept encapsulating all the 'stuff' floating out there that we could potentially learn about, given access to it? To answer these questions, we need to explore the nature of information more closely.
At first glance, information appears to be a straightforward concept, but throughout human history it has been extremely elusive, and scientists are still studying it today. We learned to store and transform information long before we had any idea what it is, but today we know that information is an inherent, inseparable part of mass and energy, and hence likely also of spacetime. Information cannot exist as an entity of its own, and not merely in some philosophical sense: we can measure amounts of information and, surprisingly, even its weight. We make such calculations, along with other practically applicable ones, in the next subsection, "How to Measure Information?"
Another exciting aspect of information is that scientists believe the amount of information in the universe to be constant and indestructible. The only thing that varies is what we call entropy, or disorder: where entropy is low, information is more concentrated than where entropy is high. Consequently, if we erased a hard disk full of data, or even chopped it into small pieces with a thermonuclear bomb, all the information would still exist. It would just become considerably more diluted and troublesome to recover; in principle, although not in practice, all that data could be reconstructed. The information would be transferred into radiation and particles and scattered widely among the surrounding matter, but never destroyed; only the entropy would increase, a lot.
Information Can Create Order
Before we apply information in practical calculations on biological sequences, we explore its nature a bit more by looking at the relationship between information and thermodynamics. The second law of thermodynamics states that the entropy, or disorder, within a closed system either stays constant or increases. To see where information comes into the picture, we first need to examine the behavior of matter in the context of thermodynamics.
A typical example is a hot cup of coffee left on a table, which after a while cools down to room temperature. In other words, by dissipating its stored heat, the coffee's temperature drops until it equals the temperature of the room, and the coffee's entropy decreases.
Heat is the movement of molecules, and temperature is a measure of their average speed. Rapidly moving molecules are hotter than slow-moving ones, and when molecules moving at different speeds are allowed to collide freely, they transfer and mix their energies.
In this case, the water molecules collide and blend with the air molecules of the room, and vice versa, until both contain on average the same proportions of slow and rapid molecules. In the process the entropy of the coffee decreases while the entropy of the surrounding room increases, until the two are in equilibrium.
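The entropy bookkeeping for the cooling coffee can be sketched with the classic relation dS = Q/T for heat Q flowing out of a reservoir at temperature T. The numbers below are illustrative assumptions, not values from the text:

```python
# Heat Q flows from hot coffee into a cooler room (illustrative numbers).
Q = 1000.0        # joules of heat leaving the coffee
T_coffee = 350.0  # kelvin: hot coffee
T_room = 293.0    # kelvin: room temperature

dS_coffee = -Q / T_coffee  # coffee loses heat, so its entropy falls
dS_room = Q / T_room       # the cooler room gains more entropy than the coffee lost
dS_total = dS_coffee + dS_room

print(f"coffee: {dS_coffee:+.3f} J/K")
print(f"room:   {dS_room:+.3f} J/K")
print(f"total:  {dS_total:+.3f} J/K (positive, as the second law demands)")
```

Because the same heat enters a colder reservoir than it left, the total entropy change is always positive, exactly as the second law requires.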
Entropy here measures how the molecular speeds are distributed. Entropy is at its maximum when every molecule moves at a different speed, and at its minimum when all molecules move at the same speed. Consequently, when the molecules are not moving at all, the entropy is zero; this point is called absolute zero (0 kelvin), and the energy is also at its minimum.
Within a closed system, we cannot decrease the total entropy, but we can adjust its distribution. Doing so requires energy, however, and at absolute zero there is no energy left to adjust anything: the molecules are not moving at all, and the entropy of the system stays at its minimum.
At the opposite extreme, in a system with the maximum possible temperature and energy, where the primary building blocks of nature are exposed naked, the entropy should also be zero. The difference from the absolute-zero system is that now a humongous amount of energy is available for distributing the entropy in various ways. If this was the origin of the Universe, we can see the result of that entropy distribution by looking at our planet, other planets, stars, galaxies, all the elements of the periodic table, and life itself, all formed while the Universe cools down due to its expansion.
Let's go back to a more ambient temperature and our coffee. A practical way to make the coffee hot again is to input some energy that kicks its molecules into moving faster. Given that our room is a closed system, we can only use the energy available inside the room.
Let's say that we had some foresight before we ended up in this closed environment and brought a fresh battery with us, which we can use to heat the coffee. By doing so, we decrease the entropy in the coffee, but at the same time we increase the entropy of the room. The net effect is that the total entropy of the room plus the coffee increases, because the battery releases relatively high-entropy heat and the entropy inside the battery also increases.
So what does all of this have to do with information? The Scottish scientist James Clerk Maxwell (1831-1879) devised a thought experiment to challenge the second law of thermodynamics. His idea was to have two isolated chambers next to each other, separated by a wall with small doors that an imaginary demon could open and close. Both chambers contained a gas, for example air. Every time the demon detected a rapid molecule, i.e., a molecule with sufficient energy, he opened a door and let the molecule pass into the left chamber. Similarly, he let the slow molecules move into the right chamber, but let no rapid molecules into the right chamber and no slow molecules into the left chamber (Animation 1).
In this way, the left chamber could become hot and the right one cold without any work being spent; the demon used only information. Maxwell stated it as: "He will thus, without the expenditure of work, raise the temperature of B and lower that of A, in contradiction to the second law of thermodynamics."
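The demon's sorting procedure can be imitated in a toy simulation. This is only a sketch under assumed values: molecular "speeds" are drawn uniformly at random, and the threshold between "slow" and "rapid" is an arbitrary choice.

```python
import random

random.seed(1)  # reproducible run
speeds = [random.uniform(0.0, 1.0) for _ in range(10_000)]  # toy molecular speeds
threshold = 0.5  # the demon's cutoff between "slow" and "rapid" (assumed)

# The demon routes rapid molecules left and slow molecules right.
left = [v for v in speeds if v >= threshold]
right = [v for v in speeds if v < threshold]

mean = lambda xs: sum(xs) / len(xs)
print(f"left chamber mean speed (hot):   {mean(left):.2f}")
print(f"right chamber mean speed (cold): {mean(right):.2f}")
```

The left chamber ends up with a higher average speed, i.e., hotter, even though no work was done on the gas; the demon only sorted by information about each molecule's speed.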
Scientists intuitively knew that this could not be correct. How could you create order just by using information alone? However, in Maxwell's time information was not yet known as a scientific concept, and his experiment was followed by more than 100 years of debate and development on how to resolve the paradox.
The demon works by observing all the molecules' movements, so he has to store this information in his head. However, the available storage space inevitably becomes full at some point, and the demon must start deleting information. It is this erasure of information that costs energy: the Landauer limit states the minimum amount of energy required to delete one bit of information. This amount of energy is tiny, which perhaps explains why human brains can be so much more efficient than modern computers, which use millions of times more energy than Landauer's minimum.
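Landauer's limit is given by E = k_B · T · ln(2) per erased bit. The temperature below is an assumed room-temperature value, not one given in the text:

```python
from math import log

k_B = 1.380649e-23  # Boltzmann constant in J/K (exact since the 2019 SI redefinition)
T = 300.0           # kelvin: an assumed room temperature

# Minimum energy to erase one bit of information at temperature T.
E = k_B * T * log(2)
print(f"minimum energy per erased bit at {T:.0f} K: {E:.2e} J")  # ≈ 2.87e-21 J
```

For comparison, this is roughly twenty orders of magnitude below a single joule, which shows just how tiny the thermodynamic cost of erasing one bit is.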
Scientists are still researching the relationships of information with energy and with theories of physics, and with modern technology they have recreated Maxwell's thought experiment in the real world. Research on information has also led, among other things, to theories employing holography, since the amount of information an object can hold seems to be limited by its surface area and not by its volume, as one would intuitively expect.
Some scientists surmise information to be the most fundamental building block of the universe. Now, let's start looking at the practical aspects of information.
For part 2, go to the next page to find out how to measure information.