Biological Sequence Databases


Protein Sequence Databases



Introduction


The UniProt Knowledgebase (UniProtKB) traces its origins to the early 1980s and the vision of Amos Bairoch. Initially named PIR+, his database evolved into Swiss-Prot, distributed via the precursor to the internet. In 1996, a collaboration with EMBL produced TrEMBL, a computer-annotated supplement to Swiss-Prot. This collaborative spirit persisted: the UniProt consortium formed in 2002, and in 2003 it launched the UniProt Knowledgebase, consolidating Swiss-Prot, TrEMBL, and PIR.

The completion of the human genome sequence and advances in high-throughput technologies marked a paradigm shift in the sequencing landscape. The UniProt consortium and the NCBI Protein Database emerged as pivotal players, consolidating diverse data sources. The history of sequence databases mirrors the progress of the biological sciences, along with the challenges and opportunities of the ever-expanding "sequence information space." As new high-throughput technologies arrive, that landscape continues to evolve, offering biological scientists unprecedented opportunities within a vast and dynamic body of sequence data.

Protein databases are crucial for organizing and disseminating protein information, and they support researchers, bioinformaticians, and other stakeholders in biological and biomedical studies. We can divide protein databases into two broad categories: sequence repositories and curated databases.

Sequence Repositories

Sequence repositories are specialized databases designed to house protein sequences with minimal manual intervention. These repositories serve as comprehensive reservoirs, aggregating vast amounts of data from diverse sources such as experimental studies and high-throughput technologies.

The data in sequence repositories is largely raw and often lacks detailed annotation. Collection, storage, and retrieval are automated to keep the volume of data manageable. Notable examples of sequence repositories include UniProtKB/TrEMBL and NCBI Protein.

Curated Databases

Curated databases represent a meticulous approach to managing protein information, involving manual curation by domain experts. This process aims to enhance and annotate the original data, ensuring heightened accuracy, reliability, and relevance. Curators enrich the database by adding information about protein functions, structures, interactions, and disease associations.

The defining characteristic of curated databases is the involvement of expert curators who review, validate, and annotate the data to raise its overall quality. Because of this curation, these databases are especially valuable for in-depth analysis and interpretation of protein-related information. Examples of curated databases include Swiss-Prot, a section of UniProtKB, and the Protein Data Bank (PDB).
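The split between the two categories is visible in UniProtKB's own FASTA files, where a header line begins with `>sp|` for a reviewed Swiss-Prot entry and `>tr|` for an unreviewed TrEMBL entry. The following is a minimal sketch of how a script might use that prefix to tell them apart. The function name `uniprot_section` is hypothetical, and the sample headers are illustrative (the TrEMBL one is made up).

```python
def uniprot_section(header: str) -> str:
    """Classify a FASTA header by its UniProtKB section prefix.

    Returns "Swiss-Prot" for ">sp|...", "TrEMBL" for ">tr|...",
    and "unknown" for anything else (for example an NCBI header).
    """
    # Drop the leading ">" and take the text before the first "|".
    prefix = header.lstrip(">").split("|", 1)[0]
    return {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(prefix, "unknown")


# Illustrative headers; the second accession is invented for this example.
headers = [
    ">sp|P68871|HBB_HUMAN Hemoglobin subunit beta OS=Homo sapiens OX=9606 GN=HBB",
    ">tr|X0X0X0|X0X0X0_EXAMPLE Uncharacterized protein OS=Example organism OX=0",
    ">NP_000509.1 hemoglobin subunit beta [Homo sapiens]",
]

sections = [uniprot_section(h) for h in headers]
for header, section in zip(headers, sections):
    print(f"{section:10s} {header[:40]}")
```

A filter like this is one simple way to restrict an analysis to manually curated entries before drawing conclusions from annotations.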

Interoperability, Standards, and Best Practices

Interoperability, standards, and best practices are pivotal for protein databases. Seamless data exchange and integration depend on interoperability among these databases, and shared standards and best practices are needed to keep the representation of biological knowledge consistent, accurate, and reliable.

The challenges in achieving interoperability stem from the varied formats, ontologies, and data structures that different databases use. Addressing these challenges involves establishing standards and best practices that ease data exchange and integration.
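To make the format problem concrete, here is a hedged sketch that maps FASTA header lines in the UniProtKB style and the NCBI style onto one common record. The `ProteinRecord` class, the `normalize_header` function, and both regular expressions are assumptions written for this illustration, not part of either database's tooling, and they cover only the common header shapes shown here.

```python
import re
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """Hypothetical common record for entries from different sources."""
    source: str
    accession: str
    description: str
    organism: str


# UniProtKB style: >sp|ACCESSION|ENTRY_NAME Description OS=Organism OX=...
UNIPROT_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|\S+\s+(.*?)\s+OS=(.*?)\s+OX=")
# NCBI style: >ACCESSION.VERSION description [Organism]
NCBI_RE = re.compile(r"^>(\S+)\s+(.*?)\s+\[([^\]]+)\]\s*$")


def normalize_header(header: str) -> ProteinRecord:
    """Parse a FASTA header from either source into a ProteinRecord."""
    m = UNIPROT_RE.match(header)
    if m:
        return ProteinRecord("UniProtKB", m.group(2), m.group(3), m.group(4))
    m = NCBI_RE.match(header)
    if m:
        return ProteinRecord("NCBI Protein", m.group(1), m.group(2), m.group(3))
    raise ValueError(f"Unrecognized header format: {header!r}")


uniprot_rec = normalize_header(
    ">sp|P68871|HBB_HUMAN Hemoglobin subunit beta OS=Homo sapiens OX=9606 GN=HBB"
)
ncbi_rec = normalize_header(">NP_000509.1 hemoglobin subunit beta [Homo sapiens]")
print(uniprot_rec)
print(ncbi_rec)
```

Even in this small case the two sources differ in identifier scheme, field order, and capitalization, which is the kind of mismatch that shared standards aim to remove.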

Within the landscape of biocuration, the International Society for Biocuration (ISB) assumes a crucial role. The ISB actively advances the fundamental principles of biocuration by nurturing collaboration, setting guidelines, and advocating best practices among curators. This concerted effort enhances the overall quality and utility of biological databases.


Lesson Plan (tentative) 


NCBI Protein Database

The NCBI Protein Database is a cornerstone of biological research and an invaluable resource for scientists, researchers, and clinicians who study proteins. With its extensive coverage and robust features, it plays a pivotal role in advancing our understanding of molecular biology and in enabling progress in medicine and biotechnology.

UniProt

UniProt is a freely accessible, comprehensive hub for protein information. Each database housed within UniProt has a distinct purpose and offers a different view of protein sequences and functions.

NCBI Protein Clusters database

The NCBI Protein Clusters database is a free and publicly available resource that groups proteins based on their sequence similarity. These clusters show us how different proteins are evolutionarily related and functionally similar.

NCBI Conserved Domain Database (CDD)
NCBI Protein Family Models Database
NCBI Identical Protein Groups (IPG) Database

The NCBI Identical Protein Groups (IPG) Database was created to simplify searching for protein information across vast and diverse datasets. This tutorial explains the concepts, tools, and methods needed to understand and use the resource.
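The idea behind grouping identical proteins can be sketched in a few lines: records whose amino acid sequences match exactly are collected under one key. This is only an illustration of the concept, not NCBI's implementation. The function `group_identical` and the accessions and sequences below are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_identical(records):
    """Group accessions whose sequences are exactly identical.

    `records` is an iterable of (accession, sequence) pairs. Sequences
    are upper-cased before hashing so that case differences are ignored.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        # A digest gives a compact, fixed-size key for long sequences.
        key = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        groups[key].append(accession)
    return [sorted(accessions) for accessions in groups.values()]


# Made-up accessions and short sequences for illustration only.
records = [
    ("ACC_A", "MVHLTPEEK"),
    ("ACC_B", "mvhltpeek"),  # same sequence, different case
    ("ACC_C", "MVHLTPEEQ"),  # differs in the last residue
]

identical_groups = group_identical(records)
print(identical_groups)
```

Collapsing redundant records this way means a search returns one group per distinct sequence rather than one hit per source record.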
