Searching for DNA sequencing methods

In his groundbreaking 1957 presentation, Francis Crick proposed the concept of information flow from DNA to RNA to protein, which forever changed the way of reasoning in biology. In this concept, which he called the 'Central Dogma', he explained that the information flow is a one-way street: a string of nucleotides in DNA is copied to a string of nucleotides in RNA, which in turn acts as a template for a sequence of amino acids in a protein.

With this concept and Watson and Crick's structure of DNA at hand, and with Robert W. Holley, Har Gobind Khorana, and Marshall W. Nirenberg solving the genetic code in the early 1960s, the importance of knowing the DNA sequence became clear. However, scientists at that time lacked adequate tools to read the order of base pairs in DNA, so the base sequence of every DNA molecule remained unknown. It took about ten years of intensive research before the first DNA was sequenced. In 1971, Ray Wu determined the sequence of the cohesive ends of bacteriophage λ DNA(*), and about a decade later, in 1982, Frederick Sanger completed the whole genome of 48,502 base pairs using the dideoxy chain termination method(*).

In the 1960s and 1970s, scientists put a lot of effort into trying to sequence DNA. Unfortunately, the protein sequencing methods did not quite work for DNA, because amino acid sequences are built from 20 different building blocks, whereas DNA is built from only four nucleotides, which are more similar to each other and therefore difficult to distinguish. Besides, the laborious cleaving technique used to sequence proteins was feasible only for relatively short sequences. Frederick Sanger developed this Nobel Prize-winning technique in the early 1950s and used it to sequence the two chains of the insulin molecule.

At that time, transfer RNAs were the shortest known biologically active nucleic acids, around 73 to 93 nucleotides in length. Their short size made it possible to use a label-and-cleave method analogous to Sanger's 1949 method. Robert W. Holley and coworkers used such a technique to sequence yeast alanine tRNA, the first sequenced nucleic acid molecule, published in 1965(*).

The discovery of type II restriction enzymes in the 1970s(*) was key to advancing sequencing technologies. They cleave DNA at or near specific recognition sequences about four to eight nucleotides long; therefore, they could be used to cut DNA into small fragments that could be separated by electrophoresis. Many of these enzymes cleave the double-stranded DNA so that the opposing strands have an overhang. Since the sequence at the cleavage site was known, this was an advantage in some of the early sequencing efforts, but these methods never advanced to the point where whole genes could be sequenced.
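The cutting step described above can be illustrated with a minimal sketch. It assumes a single-stranded, top-strand view of the DNA and uses the well-known enzyme EcoRI (recognition site GAATTC, cut between G and A) as an example; the function name `digest` and the input sequence are hypothetical and chosen only for illustration.

```python
def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Cut `sequence` at every occurrence of the recognition `site`.

    `cut_offset` is the number of bases of the site that stay on the
    left-hand fragment (top strand only; the bottom strand is ignored).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        # Continue searching one base further along the sequence.
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # remainder after the last cut
    return fragments


# Hypothetical 21-base sequence containing two EcoRI sites (G^AATTC).
dna = "TTGAATTCAGGCTGAATTCCA"
fragments = digest(dna, "GAATTC", 1)

# Electrophoresis separates fragments by length, so sort by size here.
for fragment in sorted(fragments, key=len):
    print(len(fragment), fragment)
```

Each fragment to the right of a cut starts with AATTC, which reflects the point made in the text: because the recognition sequence is known, the bases at the ends of every fragment are known as well.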

A decisive shift started when Frederick Sanger introduced his 'plus and minus system' in 1975(*). With this approach, it was possible to determine a sequence of up to about 50 base pairs in a single analysis. However, one deficiency of this method was that it was difficult to determine the correct length of homopolymer stretches, such as AAA or GGG. Nevertheless, Sanger used this method in 1977 to sequence the genome of bacteriophage phi X174, determining 5,375 nucleotides of an estimated total length of about 5,400 nucleotides(*).

In the same year, 1977, Maxam and Gilbert published their sequencing method(*), which was in many respects similar to Sanger's method but had the advantage of resolving homopolymer stretches.

At the end of 1977, Sanger published a novel sequencing concept(*) based on the incorporation of chain-terminating dideoxynucleotides (ddNTPs). This method could initially produce read lengths of about 100 nucleotides.
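The idea behind chain termination can be sketched in a few lines. This is an idealized model, not a simulation of the chemistry: it assumes four separate reactions, one per ddNTP, and assumes that every possible termination position is represented among the fragments, whereas real termination is a random event. The function names `termination_lengths` and `read_gel` and the example strand are hypothetical.

```python
def termination_lengths(new_strand: str) -> dict[str, list[int]]:
    """For each base, list the fragment lengths produced when synthesis
    stops at that base (one 'lane' per ddNTP reaction)."""
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the sequence by ordering all fragments from shortest to
    longest and noting which lane each fragment belongs to."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[n] for n in sorted(base_at_length))


strand = "GATTACA"  # hypothetical newly synthesized strand
lanes = termination_lengths(strand)
print(lanes)
print(read_gel(lanes))
```

Because each fragment length occurs in exactly one lane, sorting the fragments by size, as gel electrophoresis does, recovers the order of the bases.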

Walter Gilbert and Frederick Sanger shared the 1980 Nobel Prize in Chemistry(*) "for their contributions concerning the determination of base sequences in nucleic acids" with Paul Berg, who was honored "for his fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant-DNA".