Biological Sequence Databases


Nucleotide Sequence Databases



Introduction


The history of sequence databases begins in the early 1960s with the work of Margaret Dayhoff and her colleagues at the National Biomedical Research Foundation (NBRF). In 1965, they published the "Atlas of Protein Sequence and Structure," a compendium of 65 protein sequences. This work marked the inception of sequence databases and reflected the era's focus on protein sequencing by methods such as Edman degradation.

In the late 1970s, the number of available nucleotide sequences began to surge. To meet the growing demand for robust public databases, the Los Alamos National Laboratory (LANL) established the Los Alamos DNA Sequence Database in 1979, which became GenBank in 1982. Around the same time, in 1980, the European Molecular Biology Laboratory (EMBL) introduced the EMBL Nucleotide Sequence Data Library. Throughout the 1980s, EMBL, LANL, and later the National Center for Biotechnology Information (NCBI) collaborated to collect DNA sequence data and distribute it electronically.

By the late 1980s, advances in computer technology had enabled a transition away from manually curating nucleotide sequences published in print journals and toward electronic submission. During this period, the DNA Data Bank of Japan (DDBJ) joined forces with EMBL and GenBank in what became the International Nucleotide Sequence Database Collaboration (INSDC). Following an agreement in 1988, the INSDC established a standard data exchange format, which changed how data were submitted and distributed. DDBJ, EMBL, and GenBank became the primary distribution centers, exchanging updated records every 24 hours to ensure global accessibility.

Today, the European Bioinformatics Institute (EMBL-EBI), which operates as part of the European Molecular Biology Laboratory (EMBL), is a major hub for bioinformatics research and data resources. EMBL-EBI manages and provides access to diverse biological data sets, including genomic information. One of its core resources is the European Nucleotide Archive (ENA), a comprehensive repository for nucleotide sequences that covers both DNA and RNA data.

The completion of the human genome sequence and advances in high-throughput technologies marked a paradigm shift in sequencing. The history of sequence databases mirrors the progress of the biological sciences, and the ever-expanding "sequence information space" brings both challenges and opportunities. As new high-throughput technologies emerge, the sequencing landscape continues to evolve, and the volume of sequence data available to biological scientists keeps growing.


Lesson Plan (tentative) 


The Data Format In Nucleotide Sequence Databases

For bioinformaticians, understanding the management and exchange of nucleotide sequence data is crucial. Understand the flatfile formats used in biological sequence databases....

Understanding RefSeq

The Reference Sequence (RefSeq) database is a pivotal open-access resource introduced in 2000 by the National Center for Biotechnology Information (NCBI). It stands as a meticulously annotated and curated collection encompassing...
