Protein Sequence Databases

NCBI Protein Clusters Database

Prerequisites: Basic knowledge of biology and proteins.

Level: Beginner.
Learning objectives:
- Gain a basic understanding of the NCBI Protein Clusters database.

Introduction

The NCBI Protein Clusters database is a free, publicly available resource that groups proteins by sequence similarity. These clusters indicate which proteins are likely to be evolutionarily related and functionally similar. The database is part of NCBI's collection and focuses mainly on proteins from complete genomes of prokaryotes, plasmids, and organelles. It also includes curated clusters from plants, chloroplasts, and mitochondria.

Background - Homology, Convergence, and Protein Clustering

Homologous sequences share a common ancestor. They separate either through a speciation event (orthologs) or through a duplication event (paralogs), and after the split they diverge gradually as mutations accumulate. For this reason, we often look for sequence similarity to detect homology in proteins or DNA.

However, high sequence similarity does not always indicate common ancestry. Sequences can also resemble each other through convergent evolution or simply by chance, and chance similarity is more likely in short sequences. Many protein sequences contain conserved domains (CDs), specific segments that show high similarity across proteins, while the rest of the sequence remains variable. Conserved domains are usually preserved because of their functional roles in biological processes.

Proteins that have similar structures tend to perform similar functions. Homologous proteins that arose through gene duplication, often found within the same species, are called paralogs; homologous proteins that diverged through speciation and are found in different species are called orthologs. Orthologs are similar but not necessarily identical. Whether they are paralogs or orthologs, homologous proteins usually retain enough similarity to be aligned with one another, and such alignments score higher than alignments of unrelated sequences.
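To make the idea of an alignment score concrete, here is a minimal sketch of global pairwise alignment scoring (the Needleman-Wunsch dynamic-programming approach). The scoring values (+1 match, -1 mismatch, -2 gap) and the example peptides are illustrative assumptions; real protein alignments normally use a substitution matrix such as BLOSUM62 and tuned gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of a and b (Needleman-Wunsch).

    The scoring scheme is a toy assumption chosen for readability.
    """
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


# Made-up peptides: the first pair differs at one position, the second pair
# shares no residues at all.
related = global_alignment_score("MKTAYIAKQR", "MKTAYLAKQR")
unrelated = global_alignment_score("MKTAYIAKQR", "GGSPWWHHCD")
print(related, unrelated)
```

The near-identical pair scores far higher than the unrelated pair, which is the signal that similarity-based homology detection relies on.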

We can use clustering, an unsupervised machine-learning technique, to group similar proteins for further analysis. Clustering methods do not need prior knowledge of gene or protein classes; they group sequences based on pairwise similarity scores or distances.

Protein clustering groups proteins according to their sequence similarity. This process helps researchers identify proteins that are related to each other and are likely to have similar or identical functions. We can use these clusters of proteins to construct phylogenetic trees that show the evolutionary relationships between different species.
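As a rough illustration of similarity-based clustering, the sketch below groups sequences greedily: each sequence joins the first cluster whose representative it resembles closely enough, otherwise it starts a new cluster. This is a simplified, CD-HIT-like idea and is not the procedure NCBI uses to build Protein Clusters. The similarity measure (Jaccard overlap of 3-mers), the 0.5 threshold, and the sequence names are all assumptions made for the example.

```python
from __future__ import annotations


def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two k-mer sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs: dict[str, str], threshold: float = 0.5,
                   k: int = 3) -> list[list[str]]:
    """Greedy incremental clustering on k-mer similarity (toy sketch)."""
    # Longest sequences first, so each cluster's representative is its
    # longest member.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    representatives: list[set[str]] = []
    clusters: list[list[str]] = []
    for name in order:
        profile = kmers(seqs[name], k)
        for idx, rep in enumerate(representatives):
            if jaccard(profile, rep) >= threshold:
                clusters[idx].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            representatives.append(profile)
            clusters.append([name])
    return clusters


# Hypothetical sequences: two near-identical proteins and one unrelated one.
example = {
    "protA_sp1": "MKTAYIAKQRQISFVKSHFSRQ",
    "protA_sp2": "MKTAYIAKQRQISFVKSHFARQ",
    "protB_sp1": "GSSHHHHHHSSGLVPRGSHMLE",
}
clusters = greedy_cluster(example)
print(clusters)
```

With these inputs the two protA sequences fall into one cluster and protB forms its own. A k-mer measure is fast but crude; alignment-based identity would be a more faithful, slower alternative.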

Phylogenetic trees are essential in various research fields, including evolutionary biology, drug discovery, and protein engineering. By analyzing the similarities and differences between proteins in different species, scientists can better understand how life evolved and how they can use this knowledge to improve human health.

Protein Clusters Database
Navigating Curated and Non-Curated Sets for Evolutionary Insights

The Protein Clusters database contains two main types of protein clusters: curated and non-curated. Curated clusters have a consistent naming system and detailed descriptions of protein functions, reflecting manual curation efforts.

Non-curated clusters, on the other hand, are generated automatically and lack manual annotations. These clusters may mix orthologs and paralogs, so researchers should review them manually and remove redundant sequences to obtain more accurate analyses. Nevertheless, non-curated clusters are valuable because they provide a broader, though less refined, view than the curated clusters.

To make curated and non-curated data sets easy to distinguish in the Protein Clusters database, each cluster accession begins with a prefix that identifies its data set, followed by a numerical code, as outlined in Table 1.

**Table 1.** Accession prefixes.
Cluster ID Prefix	Cluster Description
PRK	Prokaryotes (Curated Protein Clusters)
CLS	Prokaryotes (Uncurated Protein Clusters)
CHL	Chloroplasts (Curated Chloroplast Clusters)
CLSC	Chloroplasts (Uncurated Chloroplast Clusters)
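The prefixes in Table 1 can be used to classify an accession in a script. The sketch below assumes an accession consists of a letter prefix followed by digits; the example accessions are made up for illustration and the function name is hypothetical. Because the full letter prefix is extracted before the lookup, CLSC is not confused with CLS.

```python
import re

# Prefix descriptions taken from Table 1.
CLUSTER_PREFIXES = {
    "PRK": "Prokaryotes (curated)",
    "CLS": "Prokaryotes (uncurated)",
    "CHL": "Chloroplasts",
    "CLSC": "Chloroplasts (uncurated)",
}


def describe_cluster(accession: str) -> str:
    """Return the Table 1 description for a cluster accession.

    Assumes the accession is a run of letters followed by a run of digits.
    """
    match = re.fullmatch(r"([A-Z]+)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized accession format: {accession!r}")
    prefix = match.group(1)
    if prefix not in CLUSTER_PREFIXES:
        raise ValueError(f"Unknown cluster prefix: {prefix!r}")
    return CLUSTER_PREFIXES[prefix]


# Made-up accessions, used only to exercise the function.
for acc in ("PRK00001", "CLS0000001", "CLSC0000001"):
    print(acc, "->", describe_cluster(acc))
```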

Navigating the NCBI Protein Clusters Database
File Formats for Efficient Data Retrieval and Analysis

You can explore the NCBI Protein Clusters database through several access points, depending on your needs. The Entrez web interface is the primary way to access the database. It lets you search, browse clusters, and view detailed information. Results can be downloaded in multiple formats, so you can pick the one that fits your requirements.

The E-utilities, which include esearch, efetch, and elink, let you retrieve cluster data programmatically in XML format. XML is easy to parse and integrates well into other software tools.
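A programmatic search could look roughly like the sketch below, which builds an esearch request URL and parses the list of record IDs from an XML response. It makes no network request. The database name `proteinclusters` is an assumption that should be checked against the list returned by the einfo utility before use, and the sample XML is a hand-written stand-in for a real response with made-up IDs.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, db: str = "proteinclusters",
                      retmax: int = 20) -> str:
    """Build an esearch URL. The default db name is an assumption."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "xml"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch_ids(xml_text: str) -> list:
    """Extract the record IDs from an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id") if node.text]


# Hand-written sample in the shape of an esearch result (IDs are made up).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>"""

url = build_esearch_url("ribosomal protein")
ids = parse_esearch_ids(SAMPLE_RESPONSE)
print(url)
print(ids)
```

In a real script, the URL would be fetched with an HTTP client and the returned IDs passed on to efetch or elink.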

When dealing with large datasets, Batch Entrez is a practical tool that allows users to download extensive cluster data in flat-file formats such as TSV or CSV. These flat files include cluster accessions, representative protein sequences, and other relevant information, which makes offline analysis or integration into external tools convenient.
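Once a flat file has been downloaded, it can be read with standard tools. The sketch below groups protein accessions by cluster from a TSV file using Python's `csv` module. The column names (`cluster_accession`, `protein_accession`, `organism`) and all of the sample rows are hypothetical; the actual layout of a downloaded file should be checked first.

```python
import csv
import io
from collections import defaultdict

# Hypothetical TSV content; real files may use different columns.
SAMPLE_TSV = (
    "cluster_accession\tprotein_accession\torganism\n"
    "PRK00001\tPROT_A1\tSpecies one\n"
    "PRK00001\tPROT_A2\tSpecies two\n"
    "CLS0000001\tPROT_B1\tSpecies one\n"
)


def load_cluster_members(handle) -> dict:
    """Map each cluster accession to the list of its protein accessions."""
    reader = csv.DictReader(handle, delimiter="\t")
    members = defaultdict(list)
    for row in reader:
        members[row["cluster_accession"]].append(row["protein_accession"])
    return dict(members)


# io.StringIO stands in for an opened file, e.g. open(path, newline="").
members = load_cluster_members(io.StringIO(SAMPLE_TSV))
for cluster, proteins in members.items():
    print(cluster, len(proteins), proteins)
```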

If you need to focus on specific protein sequences, you can download them in the FASTA file format. The scientific community widely accepts the FASTA format as it ensures compatibility and ease of use for representing biological sequences.
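For reference, a FASTA record is a header line starting with `>` followed by one or more lines of sequence. A minimal parser, assuming the identifier is the first whitespace-delimited token of the header, might look like this (the sample records are made up):

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a dict of {identifier: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up records for illustration.
SAMPLE_FASTA = """>PROT_A1 hypothetical protein [Species one]
MKTAYIAKQR
QISFVKSHFS
>PROT_B1 hypothetical protein [Species one]
GSSHHHHHHS
"""

sequences = parse_fasta(SAMPLE_FASTA)
for name, seq in sequences.items():
    print(name, len(seq))
```

For production work, an established library parser is usually a safer choice than a hand-written one.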

The NCBI Datasets feature provides access to comprehensive cluster datasets in various formats, including TSV, FASTA, and ASN.1 (NCBI's internal data exchange format). For a more in-depth explanation of file formats, see our tutorial on The Data Format In Nucleotide Sequence Databases.

When choosing a format, consider the specific use case and the level of detail you need. No single format works for all cases. Flat-file formats like TSV and CSV suit bulk data and are convenient for offline analysis or integration into other tools. For working with protein sequences, FASTA remains the standard and facilitates sequence analysis and alignment.

