Lloyd Lowa and Martti T. Tammib
aPerdana University Centre for Bioinformatics (PU-CBi), Block B and D1, MAEPS Building, MARDI Complex, Jalan MAEPS Perdana, 43400 Serdang, Selangor, Malaysia.
bBiotechnology & Breeding Department, Sime Darby Plantation R&D Centre, Selangor, 43400, Malaysia.
In 1962 James Watson, Francis Crick and Maurice Wilkins jointly received the Nobel Prize in Physiology or Medicine for their discoveries concerning the structure of deoxyribonucleic acid (DNA) and its significance for information transfer in living material.1 The secret of DNA in orchestrating living activities lies in the arrangement of the four bases (i.e. adenine, thymine, guanine and cytosine). The linear sequence of the four bases can be considered the language of life, with each word specified by a codon made up of three bases. It was an interesting puzzle to figure out how codons specify amino acids. In 1968, Robert W. Holley, Har Gobind Khorana and Marshall W. Nirenberg were awarded the Nobel Prize in Physiology or Medicine for solving the genetic code puzzle. It is now known that the collection of codons directs what, where, when and how much protein should be made. Since the discovery of the structure of DNA and the genetic code, deciphering the meaning of DNA sequences has been an ongoing quest by many scientists to understand the intricacies of life.
The ability to read a DNA sequence is a prerequisite to deciphering its meaning. Not surprisingly, then, there has been intense competition to develop better tools to sequence DNA. In the 1970s, the first revolution in DNA sequencing technology began, with two major competitors in the area. One was the commonly known Sanger sequencing method2,3 and the other was the Maxam-Gilbert sequencing method.4 Over time, the popularity of the Sanger sequencing method and its modifications grew so much that it overshadowed other methods until perhaps 2005, when Next Generation Sequencing (NGS) began to take off.
In 1977, Sanger and colleagues successfully used their sequencing method to sequence the first DNA-based genome, that of bacteriophage ΦX174, which is approximately 5375 bp.5 This achievement heralded the start of the genomic era. Initially, the Sanger sequencing method of 1975 used a two-phase DNA synthesis reaction.2 In the first phase, a DNA polymerase was used to partially extend a primer bound to a single-stranded DNA template to generate DNA fragments of random lengths. In phase two, the partially extended templates from the earlier reaction were split into four parallel DNA synthesis reactions, where each reaction had only three of the four deoxyribonucleotide triphosphates (dNTPs, i.e. dATP, dCTP, dGTP and dTTP). Because of the missing deoxyribonucleotide triphosphate (e.g. dATP), the DNA synthesis reaction would stop at the 3' position just before where the missing base was supposed to be incorporated. All of these synthesized DNA fragments could then be separated by size using electrophoresis on an acrylamide gel. The DNA sequence could be read off an autoradiograph, since DNA synthesis took place with the incorporation of radiolabeled nucleotides (e.g. ³⁵S-labeled dATP).
There were many problems with the initial version of the Sanger sequencing method that required further innovation before its widespread use, a scenario akin to what is happening in recent NGS technological developments. Problems of the early Sanger sequencing method included the cumbersome two-phase procedure, the short length of DNA sequence that could be determined, the requirement of a primer (which meant that part of the template sequence had to be known), the use of hazardous radiolabeled nucleotides, and the lack of an automated way to read off a DNA sequence. Sanger and colleagues rapidly improved on the method described in 1975 by eliminating the two-phase procedure through the use of dideoxynucleotides as chain terminators.3 Briefly, the improved method started with four reaction mixtures in which the single-stranded DNA template was already hybridized to a primer. In each reaction, DNA synthesis proceeded with four deoxyribonucleotide triphosphates (one of them radiolabeled) and one dideoxynucleotide triphosphate (ddNTP). Whenever a dideoxyribonucleotide was incorporated, the reaction terminated, thereby producing a mixture of truncated fragments of varying lengths. These DNA fragments were then separated by electrophoresis and read off from an autoradiograph. By adjusting the concentration of ddNTPs, chain termination can be manipulated to produce longer sequence reads.
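To illustrate the relationship between ddNTP concentration and read length, the following Python sketch simulates chain termination as a simple per-base termination probability. This is only a toy model, not the actual chemistry; the ddntp_fraction parameter is a stand-in for the ddNTP:dNTP ratio, which in practice also depends on polymerase incorporation kinetics.

```python
import random

def simulate_fragment_lengths(ddntp_fraction, n_fragments=10000, max_len=5000):
    """Simulate chain-termination fragment lengths for a given ddNTP fraction.

    At each incorporation step the polymerase is assumed to add a ddNTP
    (and terminate the chain) with probability ddntp_fraction; otherwise
    it adds a normal dNTP and extension continues.
    """
    lengths = []
    for _ in range(n_fragments):
        length = 0
        while length < max_len:
            length += 1
            if random.random() < ddntp_fraction:
                break  # chain terminated by a dideoxynucleotide
        lengths.append(length)
    return lengths

# Lowering the ddNTP fraction shifts the fragment length distribution upward,
# i.e. longer reads on average (expected length is roughly 1/ddntp_fraction).
for fraction in (0.02, 0.01, 0.005):
    lengths = simulate_fragment_lengths(fraction)
    print(f"ddNTP fraction {fraction}: mean fragment length ~{sum(lengths) / len(lengths):.0f} bases")
```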
To remove the requirement of knowing part of the template sequence for primer design, cloning was introduced. For example, the M13 sequencing vector is commonly used as a holder for a DNA insert, and known primers that bind to the vector sequence are available for sequencing the unknown insert. One major innovation to the Sanger sequencing method was the replacement of radioactive labels with fluorescent dyes.6 Four different dye colour labels are available for the four dideoxynucleotide chain terminators, so DNA fragments terminating at all four bases can be generated in a single reaction and analyzed on a single lane of an acrylamide gel. The electrophoresis is coupled to a fluorescence detector connected to a computer, so sequence data can be collected automatically. In 1986, Applied Biosystems commercialized the first automated DNA sequencer (i.e. Model 370A) based on the Sanger sequencing method. For an animation of the Sanger sequencing method, the reader should refer to the animation provided by the Wellcome Trust (http://www.wellcome.ac.uk/Education-resources/Education-and-learning/Resources/Animation/WTDV026689.htm).
Due to limitations of the chain terminator chemistry and the resolution of the electrophoresis method, the Sanger sequencing method is only capable of producing reads of about 500 to 800 bases. Most genes and other interesting DNA sequences are longer than that. Therefore, a method is required to first break a longer DNA molecule into fragments, sequence the individual fragments and then piece them together to create a contiguous sequence (i.e. a contig). In one approach, known as whole genome shotgun (WGS) sequencing, the long DNA molecule is randomly sheared and the fragments are cloned for sequencing.7 A computer program is then used to assemble the sequences by finding overlaps. Finding sequence overlaps is challenging when thousands to millions of DNA fragments are generated. The problem requires alignment algorithms, and notable examples of early work in this area include the Needleman-Wunsch algorithm8 and the Smith-Waterman algorithm.9 Details on the bioinformatics involved in NGS alignment tools and sequence assembly are given in Chapters 4 and 6, respectively.
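As a flavour of the dynamic programming idea behind such alignment algorithms, the following Python sketch scores a local alignment between two short reads in the style of Smith-Waterman. It is a toy illustration with made-up reads and arbitrary scoring parameters; production aligners and assemblers use heavily optimized variants of this idea.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Minimal Smith-Waterman local alignment returning the best score.

    Fills a dynamic programming matrix where each cell holds the best
    local alignment score ending at that pair of positions; negative
    running scores are reset to zero, which is what makes it 'local'.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best

# Two reads that share an overlapping region produce a high local score.
print(smith_waterman("ACGTACGTTTGACC", "TTTGACCAGGAT"))
```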
One of the goals of the Human Genome Project (HGP) was to support advancements in DNA sequencing technology.10 Although the HGP was completed with the Sanger sequencing method, many groups of researchers were already tinkering with new ideas to increase the throughput and decrease the cost of sequencing prior to the announcement of the first human genome draft in 2001. For example, developments in nanopore sequencing can be traced back to 1996, when researchers experimented with alpha-hemolysin.11 After years of experimentation, the second DNA sequencing technology revolution finally took off in 2005 and ended the dominance of Sanger sequencing in the marketplace. The revolution is still ongoing at the time of writing, as can be seen from the rapid decline in the cost of sequencing since the introduction of NGS technologies (Figure 1).
The sequencing technologies associated with the second revolution are referred to by various names, including second generation sequencing, NGS and high throughput sequencing. High throughput sequencing would perhaps be the most appropriate term, but NGS seems to be more commonly used to categorize such technologies and hence this term is used in the book. For the purpose of this book, NGS technology refers to platforms that are able to sequence massive amounts of DNA in parallel with a simultaneous sequence detection method and achieve a much lower cost per base than Sanger sequencing. These platforms include 454, ABI SOLiD, Illumina and Ion Torrent. Due to the popularity of the Illumina platform at the time of writing, the practical chapters (i.e. Chapters 3-10) of the book emphasize the use of Illumina data as sample datasets.
A third revolution in sequencing technology is underway with the commercialization of third generation sequencing technologies such as those from Pacific Biosciences and Oxford Nanopore Technologies. Third generation sequencing is defined as the sequencing of single DNA molecules without the need to halt between read steps, whether enzymatic or otherwise.13 There are three categories of single molecule sequencing: (i) sequencing-by-synthesis methods in which base detection occurs in real time (e.g. PacBio), (ii) nanopore technologies in which DNA molecules thread through a nanopore and bases are detected as they pass through it (e.g. Oxford Nanopore), and (iii) direct imaging of DNA molecules using advanced microscopy (e.g. Halcyon Molecular).
The DNA sequence data generation processes of different sequencing platforms may share similarities, such as the general ‘wash and scan’ approach, but they may differ in terms of cost, runtime and detection methods. The sequence data from different platforms have different characteristics, such as error patterns, and different tools are used to process the raw data into FASTQ format. Much of the internal workings of NGS sequencers are proprietary, and users generally rely on the providers to supply their own tools for base calling as well as error calling. After that, a sequence is assumed to be ‘correct’ and researchers proceed to analyze it. The subsequent sections aim to introduce the background and some details of commercially available platforms, which include 454, ABI SOLiD, Illumina, Ion Torrent, PacBio and Oxford Nanopore. Besides these six platforms, there are other companies that also innovate in this space, such as SeqLL, GnuBio and Complete Genomics, but they will not be covered here. For a list of available sequencing companies, readers are encouraged to read a news article by Michael Eisenstein published by Nature Biotechnology in 2012, which detailed 14 NGS companies.14
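Whatever the platform, the FASTQ format in which raw reads are delivered follows a simple four-line-per-record layout, which the following Python sketch parses. It is a minimal reader, assuming uncompressed files with strict four-line records and Phred+33 quality encoding; the file name in the comment is purely hypothetical.

```python
def read_fastq(path):
    """Yield (identifier, sequence, quality) tuples from a FASTQ file.

    Each FASTQ record spans four lines: an '@identifier' header, the base
    sequence, a '+' separator and a quality string in which each character
    encodes a Phred score (commonly Phred+33 for recent platforms).
    """
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            seq = handle.readline().rstrip()
            handle.readline()  # the '+' separator line
            qual = handle.readline().rstrip()
            yield header[1:], seq, qual

def phred33_scores(quality_string):
    """Convert a Phred+33 encoded quality string to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]

# Example: 'I' encodes Phred 40, '#' encodes Phred 2.
print(phred33_scores("II#"))
# for name, seq, qual in read_fastq("sample_reads.fastq"):  # hypothetical file
#     print(name, len(seq), min(phred33_scores(qual)))
```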
A company named 454 Life Sciences Corporation made the first move in the NGS revolution. The company was initially majority owned by CuraGen, and it was from CuraGen that the name ‘454’, originally just a project code name, came; 454 was later acquired by Roche in 2007. In 2003, the company publicly announced that it had managed to sequence the entire genome of a virus in a single day.15 Then in 2005, scientists using 454 technology published an article in Nature on the complete sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage and 99.96% accuracy in one run of the machine.16 In the same year, the company made a system named the Genome Sequencer 20 (GS20) commercially available. This breakthrough in sequencing throughput and speed was an incredible feat compared to Sanger technology, and it created a lot of excitement.
The principle behind 454 relies on pyrosequencing, a technology licensed from Pyrosequencing AB. This method depends on the generation of inorganic pyrophosphate (PPi) when a complementary base is incorporated during DNA synthesis17 (Figure 2). PPi is converted to ATP by sulfurylase, and luciferase uses the ATP to convert luciferin to oxyluciferin and light. The reaction occurs very fast, in the range of milliseconds, and the light produced can be detected by a charge-coupled device (CCD) camera. One of the key innovations of the 454 technology is miniaturization, so that the reactions occur in a small space using smaller volumes of reagents. Another innovation is the simultaneous detection of light signals from many individual reactions.
One of the key drawbacks of the 454 pyrosequencing chemistry is the difficulty in detecting the actual number of bases in a homopolymer tract (e.g. AAAAA). There is no blocking mechanism to prevent the incorporation of multiple identical bases during DNA elongation, so the light signals are stronger for longer homopolymer tracts. The light signal is in fact a light intensity that is converted to a flow value in the 454 system. It is difficult to distinguish how many bases there are once the homopolymer is more than 8 bases long.16 The presence of homopolymers is the reason why 454 sequence reads do not have fixed lengths, unlike the Illumina platform, which includes a blocking mechanism that allows the reading of only a single base at a time. Another shortcoming of the 454 system is the artificial amplification of duplicate sequences during the PCR step; in a metagenomics study, this type of error was estimated at between 11% and 35%.18
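To make the flow value idea concrete, the following Python sketch converts a series of flow intensities into base calls by simple rounding. It is a simplified illustration using an assumed cyclic flow order and made-up flow values, not the vendor's actual base-calling algorithm; it also hints at why long homopolymers are problematic, since distinguishing, say, a signal for 8 bases from one for 9 becomes increasingly difficult as the signal grows.

```python
def call_homopolymers(flow_values, flow_order="TACG"):
    """Convert a list of flow (signal intensity) values into base calls.

    In flow-based systems the signal is roughly proportional to the number
    of identical bases incorporated in one flow, so the simplest caller
    rounds each flow value to the nearest integer and emits that many
    copies of the base that was flowed.
    """
    bases = []
    for i, value in enumerate(flow_values):
        count = int(round(value))
        bases.append(flow_order[i % len(flow_order)] * count)
    return "".join(bases)

# Flows of ~1.1, 0.1, 2.9, ... are called as 1, 0 and 3 incorporated bases, etc.
print(call_homopolymers([1.1, 0.1, 2.9, 0.8, 0.2, 1.05, 0.9, 4.2]))  # -> TCCCGACGGGG
```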
Although a pioneer in NGS, 454 has officially lost the sequencing race. As seen in Figure 3, which compares NGS platforms, the trend for 454 sequencing in articles tracked by Google Scholar has reached a plateau. The technology once held a lot of promise for revolutionizing sequencing and was even regarded by some as having won the sequencing race. Roche announced the shutdown of 454 in 2013,19 and 454 sequencers started being phased out in the middle of 2016.
The initial success of the 454 sequencers challenged the dominance of Applied Biosystems (AB), which was the main supplier of Sanger-based sequencing machines for the HGP. The ABI PRISM 3700 was a very popular system, and many researchers who needed to perform sequencing prior to 2005 were familiar with it. In 2006, AB completed the acquisition of Agencourt Personal Genomics, which allowed it to market a novel NGS technology known as Supported Oligo Ligation Detection (SOLiD). Currently, Thermo Fisher Scientific owns the SOLiD sequencing technology after acquiring Life Technologies, a company formed from the merger of Invitrogen and AB. From Figure 3, it appears that SOLiD is not as popular an NGS platform as the others, even though it has been available since 2006. To our knowledge, SOLiD is the only NGS platform that employs ligation-based chemistry with a unique di-base fluorescent probe system.
Understanding the SOLiD sequencing system is akin to solving a jigsaw puzzle because of its di-base encoding system. The sample preparation steps prior to probe ligation are very similar in concept to those of the 454 system. Briefly, a genomic DNA library is sheared into smaller fragments and both ends of each fragment are tagged with different adaptors (e.g. Adaptor P1---Fragment 1---Adaptor P2). Emulsion PCR then takes place to create beads, each enriched with copies of the same DNA fragment. The beads are then attached to a glass slide through covalent bonds. From here, ligation and detection of bases take place (Figure 4a). First, a universal sequencing primer (n) binds to the known adaptor sequence. Then a specific 8-mer probe, with the sequence structure depicted in Figure 4b, outcompetes the other probes for binding immediately after the primer-binding site. Ligation then occurs, and the identity of the bound probe is detected by distinguishing which fluorescent dye is tagged to the probe’s 5’ end. Cleavage then occurs between the 5th and 6th nucleotides of the probe. After cleavage is complete, subsequent ligation is possible because a free phosphate group is now available at the fifth base of the probe. The reason why only one particular 8-mer probe wins the binding site is the specific di-base sequence at its 3’ end, which distinguishes the probes in the collection. Only four fluorescent dyes are used, and each 8-mer probe with a specific di-base sequence is tagged by a dye at the 5’ end. The system is unique in the sense that a di-base sequence is detected in each ligation cycle.
The ligation and cleavage process can be repeated many times to achieve the desired sequence length. However, it gives sequence information only two bases at a time, with a gap of three bases in between. Next, the ligate-cleave-detect process is repeated with a new universal primer (n-1), a primer that binds exactly one base further upstream on the 5’ end of the adaptor sequence. Each set of ligate-cleave-detect cycles with a new primer is also known as a reset. The entire process is repeated for another three rounds with universal primers (n-2), (n-3) and (n-4), so altogether five different universal primers are used. Figure 4c shows an example of sequence determination after five rounds of reset. Note that each base is called twice, in independent primer rounds, which increases the accuracy of the base calls. A check for concordance of the two calls for the same base represents an in-built error checking property of this system and allows it to achieve an overall accuracy greater than 99.94%. Although the SOLiD system is unique in that it can store the sequence of oligo colour calls (i.e. colour space) to be used for mutation calling, this does introduce challenges for bioinformatics analysis, as most tools are based on DNA base calls rather than the colour space model.
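To illustrate how colour space relates to base space, the following Python sketch decodes a colour-space read into bases using the commonly described encoding in which each of the four colours represents a set of di-base transitions; the 2-bit XOR mapping below reproduces that scheme (colour 0 for AA/CC/GG/TT, colour 1 for AC/CA/GT/TG, and so on). The example starting base and colour calls are made up for illustration.

```python
# 2-bit encoding of the bases; the colour for a di-base step is then the XOR
# of the two base codes, matching the commonly described colour-space scheme.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {v: k for k, v in BASE_CODE.items()}

def decode_colour_space(first_base, colours):
    """Decode a colour-space read given a known first (adaptor-derived) base.

    Each colour call encodes the transition between two adjacent bases, so
    the sequence can only be reconstructed by chaining from a known base;
    a single miscalled colour shifts every downstream base, which is one
    reason many analysis tools prefer to work directly in colour space.
    """
    bases = [first_base]
    for colour in colours:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ colour])
    return "".join(bases)

# Starting from a known 'T', the colour calls 3, 0, 1, 2 decode to 'TAACT'.
print(decode_colour_space("T", [3, 0, 1, 2]))
```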
In the mid-1990s, Shankar Balasubramanian and David Klenerman, both from the University of Cambridge, conceived the idea of massively parallel sequencing of short reads on a solid phase using reversible terminators. They formed Solexa in 1998 after successfully receiving funding from a venture capital firm. The sequencing approach by Solexa is also known as sequencing-by-synthesis. The company launched its first sequencer, the Genome Analyzer, in 2006; the machine was capable of producing 1 Gb of data in a single run. Figure 5 shows an overview of the Illumina sequencing-by-synthesis method.
Illumina acquired Solexa in 2007. Soon after the acquisition, there were at least three high profile research publications in Nature 2008 volume 456 highlighting the capabilities of the Genome Analyzer in sequencing human genomes (e.g. an African genome20, a Chinese genome21 and a cancer patient genome22). In the subsequent years, the popularity of this system grew so much that by 2015 the cumulative number of articles citing Illumina far exceeded those citing other platforms (Figure 3); according to a brochure by Illumina in 2015, “More than 90% of the world’s sequencing data is generated using Illumina sequencing-by-synthesis method.” The company is also very creative at developing and marketing its products, with sequencing systems (e.g. MiniSeq, MiSeq, MiSeqDx, NextSeq 500, HiSeq 2500, HiSeq 3000, HiSeq 4000, HiSeq X Ten, HiSeq X Five) that suit researchers who operate on different budgets and require different levels of sequencing throughput. The Illumina system can be used for a wide range of applications that include resequencing, whole genome sequencing, exome sequencing, metagenomics, epigenetic studies and the sequencing of panels of genes, such as genes linked to cancer (e.g. TruSight Cancer).
One of the key strengths of the Illumina platform is its ability to produce high throughput DNA sequence data at a lower cost, despite only producing short sequences (e.g. paired-end reads of 35 bp in the African genome sequencing20). Improvements in bioinformatics methods allow researchers to do much more than was once thought possible when only short, accurate reads are available. Nowadays, the Illumina system can produce paired-end sequences of 200 bp at each end, which further enhances the power of this technology. Besides the advantage of high throughput, low cost sequencing, it also performs better than the 454 system with respect to homopolymer sequencing errors because it uses reversible terminator chemistry: only a single base is incorporated each time prior to detection in the Illumina system, whereas 454 allows the incorporation of multiple bases across a homopolymer tract.
However, the Illumina system also comes with drawbacks. The 3’ end of a sequence tends to be of lower quality than the 5’ end, which means bases at the 3’ end should be filtered out if their quality falls below a set threshold (see Chapter 3). There can also be tile-associated errors when the flow cell is affected by bubbles in reagents or other, unknown causes.23 In addition, sequence-specific errors have been found for inverted repeats and GGC sequences24. Furthermore, in a study of 16S rRNA amplicon sequencing on the MiSeq, the library preparation method and choice of primers significantly influenced the error patterns25.
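As a simple illustration of 3’ quality filtering, the following Python sketch trims the low quality tail of a read based on its Phred+33 quality string. It is a deliberately naive cut-off approach with an example threshold of Phred 20; the dedicated trimming tools discussed in Chapter 3 use more sophisticated strategies such as sliding windows.

```python
def trim_low_quality_tail(seq, qual, threshold=20):
    """Trim the 3' end of a read once quality drops below a threshold.

    Scan from the 3' end and cut after the last position whose Phred+33
    score is at or above the threshold, discarding the low-quality tail.
    """
    cut = 0
    for i in range(len(qual) - 1, -1, -1):
        if ord(qual[i]) - 33 >= threshold:
            cut = i + 1
            break
    return seq[:cut], qual[:cut]

# Low-quality characters ('#' encodes Phred 2) at the 3' end are removed.
print(trim_low_quality_tail("ACGTACGTAC", "IIIIIIII##"))  # -> ('ACGTACGT', 'IIIIIIII')
```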
Besides SOLiD sequencing, Thermo Fisher Scientific has another NGS platform in its portfolio known as Ion Torrent, which it obtained through the acquisition of Life Technologies. Life Technologies initially developed the platform and released the Ion Personal Genome Machine (PGM) in 2010. The launch of this machine created much excitement among researchers who wanted affordable sequencers, as it utilized a cheap disposable chip of about $250.26 In addition, it runs faster than competing machines such as the Illumina HiSeq. However, in terms of DNA data throughput, it loses out to the Illumina HiSeq.
As in the 454 and SOLiD systems, library preparation and emulsion PCR on beads are part of the Ion Torrent workflow. The main difference lies in the detection of nucleotide incorporation, which is not based on fluorescence or chemiluminescence; instead, the system measures the H+ ions released during incorporation. In other words, detection of nucleotide incorporation is done by a miniature semiconductor pH sensor. Since each of the four DNA bases is supplied sequentially for DNA elongation, the signal is amplified if the base matches the template, but accurate detection of the actual number of bases incorporated is challenging.27 Only natural nucleotides are needed, and no high-resolution camera or complicated image processing is required, which taken together are some of the reasons for the faster runtime and lower machine cost. For a video on the Ion Torrent method, the reader should refer to the Thermo Fisher Scientific sequencing…