UniProt Reference Clusters (UniRef)

UniProt Reference Clusters (UniRef) cluster protein sequences from the UniProt Knowledgebase (UniProtKB) and selected UniParc records at 100%, 90%, and 50% sequence identity. Clustering reduces redundancy while keeping a representative sequence for each cluster, which supports faster, more sensitive sequence-similarity searches and functional annotation.


Key Features:

  • Clustering Mechanism: Clusters sequences by sequence identity at 100%, 90%, and 50% to produce UniRef100, UniRef90, and UniRef50, reducing database size by approximately 10%, 40%, and 70%, respectively.
  • Non-redundancy and Intra-cluster Homogeneity: Applies a sequence length overlap threshold to improve non-redundancy and intra-cluster homogeneity, enhancing the speed, sensitivity, and consistency of similarity searches.
  • Functional Annotation Consistency: Maintains high molecular function consistency, with over 97% of clusters grouping proteins with identical functions.
  • Efficiency in Similarity Searches: Makes similarity searches faster and their results more concise. For example, BLASTP against UniRef50 yields hit lists approximately seven times shorter (before expansion with cluster members) and runs about six times faster than a search against UniProtKB, with over 96% recall at an e-value threshold of <0.0001.
  • Comprehensive Coverage: Clusters sequences from diverse organisms into single entries to provide broad sequence-space coverage, facilitate detection of distant relationships, and reduce sampling bias.
  • Rich Functional Annotation Links: Each UniRef entry includes a representative protein sequence, member counts, common taxonomy, accession numbers, and links to UniProtKB functional annotations.
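
The search-then-expand workflow mentioned under Efficiency in Similarity Searches can be sketched roughly as below. This is a minimal illustration rather than UniProt's implementation: the cluster table, the accessions, and the function name `expand_hits` are all hypothetical, and the hard-coded hit list stands in for the output of a real BLASTP run against UniRef50.

```python
from typing import Dict, List

# Hypothetical UniRef50-style clusters: cluster ID -> member accessions.
# All identifiers here are invented for illustration.
CLUSTERS: Dict[str, List[str]] = {
    "UniRef50_A0001": ["A0001", "A0002", "A0003"],
    "UniRef50_B0001": ["B0001"],
    "UniRef50_C0001": ["C0001", "C0002"],
}


def expand_hits(cluster_hits: List[str], clusters: Dict[str, List[str]]) -> List[str]:
    """Replace each cluster-level hit with all of its member accessions.

    Hit order is preserved, and members keep their order within a cluster.
    Unknown cluster IDs are skipped rather than guessed at.
    """
    expanded: List[str] = []
    for cluster_id in cluster_hits:
        expanded.extend(clusters.get(cluster_id, []))
    return expanded


# Pretend a search against the clustered database returned two cluster hits.
hits = ["UniRef50_C0001", "UniRef50_A0001"]
members = expand_hits(hits, CLUSTERS)
print(len(hits), "cluster hits expand to", len(members), "member sequences")
```

The short hit list is what the search itself has to produce; the expansion step is a cheap lookup that restores the full set of member sequences afterwards.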

Scientific Applications:

  • Genome annotation: Support functional assignment and redundancy reduction in genome annotation workflows.
  • Proteomics data analysis: Manage large proteomic datasets by collapsing similar sequences and summarizing protein families for downstream analysis.
  • Similarity searching: Speed up sequence-similarity searches (e.g., BLASTP) and shorten their hit lists while retaining high sensitivity.
  • Functional annotation transfer and validation: Facilitate propagation and validation of UniProtKB annotations and help identify annotation inconsistencies.
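
As one way to picture the "collapsing similar sequences" step in proteomics data analysis, the sketch below sums per-protein counts into per-cluster counts. The accession-to-cluster mapping, the counts, and the function name `collapse_counts` are hypothetical; in practice the mapping would be derived from a UniRef release.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical mapping from member accession to its UniRef90-style cluster.
MEMBER_TO_CLUSTER: Dict[str, str] = {
    "A0001": "UniRef90_A0001",
    "A0002": "UniRef90_A0001",
    "B0001": "UniRef90_B0001",
}


def collapse_counts(counts: Dict[str, int], member_to_cluster: Dict[str, str]) -> Dict[str, int]:
    """Sum per-protein counts into per-cluster counts."""
    collapsed: Dict[str, int] = defaultdict(int)
    for accession, count in counts.items():
        # Accessions without a known cluster are kept under their own name.
        key = member_to_cluster.get(accession, accession)
        collapsed[key] += count
    return dict(collapsed)


# Invented spectral counts for four identified proteins.
spectral_counts = {"A0001": 12, "A0002": 5, "B0001": 7, "Z9999": 1}
per_cluster = collapse_counts(spectral_counts, MEMBER_TO_CLUSTER)
print(per_cluster)
```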

Methodology:

Identical sequences and subfragments are first combined into UniRef100 entries. These entries are then clustered at 90% and 50% sequence identity to form UniRef90 and UniRef50. A sequence length overlap threshold is applied during clustering to improve non-redundancy and intra-cluster homogeneity.
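
To make the interplay of an identity threshold and a length overlap threshold concrete, here is a toy greedy clustering sketch. It is not the algorithm UniRef uses: the identity measure is a crude proxy built on Python's `difflib`, overlap is simplified to a length ratio against the cluster seed, the 0.8 overlap value is an assumed setting for illustration, and the sequences and names (`approx_identity`, `greedy_cluster`) are invented.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def approx_identity(seed: str, seq: str) -> float:
    """Crude identity proxy: matched residues divided by the shorter length."""
    matcher = SequenceMatcher(None, seed, seq, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(seed), len(seq))


def greedy_cluster(seqs: Dict[str, str], min_identity: float, min_overlap: float) -> Dict[str, List[str]]:
    """Assign each sequence to the first seed it matches, longest sequences first.

    A sequence joins a cluster only if it is similar enough to the seed AND
    long enough relative to the seed; otherwise it starts a new cluster.
    """
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        seq = seqs[name]
        for seed in clusters:
            seed_seq = seqs[seed]
            overlap = len(seq) / len(seed_seq)  # simplified length overlap
            if overlap >= min_overlap and approx_identity(seed_seq, seq) >= min_identity:
                clusters[seed].append(name)
                break
        else:
            clusters[name] = [name]  # no seed matched: this sequence becomes a seed
    return clusters


toy = {
    "p1": "MKTAYIAKQRQISFVKSHFS",
    "p2": "MKTAYIAKQRQISFVKSHFA",  # one substitution relative to p1
    "p3": "MKTAYIAKQR",            # short fragment of p1
    "p4": "GGGGSSSSGGGGSSSSGGGG",  # unrelated sequence
}
result = greedy_cluster(toy, min_identity=0.9, min_overlap=0.8)
print(result)
```

In this toy run the fragment `p3` matches its seed perfectly over its own length but is kept out of the `p1` cluster by the overlap threshold, which is the kind of length heterogeneity the threshold is meant to prevent.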

Details

Tool Type: web application
Operating Systems: Linux, Windows, Mac
Added: 10/7/2015
Last Updated: 6/30/2022

Operations

Data Inputs & Outputs

Query and retrieval
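
For retrieval, UniRef cluster identifiers combine the identity level with the accession of the representative sequence (for example `UniRef90_P99999`). The sketch below parses such an identifier and builds an entry URL. The helper names are hypothetical, and the base URL is an assumption about the UniProt REST service that should be checked against the current UniProt documentation before use; no network request is made here.

```python
import re
from typing import NamedTuple

# Assumed base URL for UniRef entries; verify against the UniProt REST docs.
ASSUMED_BASE_URL = "https://rest.uniprot.org/uniref"

_ID_PATTERN = re.compile(r"^UniRef(100|90|50)_([A-Za-z0-9-]+)$")


class UniRefId(NamedTuple):
    identity_percent: int  # 100, 90, or 50
    representative: str    # accession of the representative sequence


def parse_uniref_id(cluster_id: str) -> UniRefId:
    """Split a UniRef cluster ID into identity level and representative accession."""
    match = _ID_PATTERN.match(cluster_id)
    if match is None:
        raise ValueError(f"not a UniRef cluster ID: {cluster_id!r}")
    return UniRefId(int(match.group(1)), match.group(2))


def entry_url(cluster_id: str) -> str:
    """Build the (assumed) entry URL after validating the identifier."""
    parse_uniref_id(cluster_id)  # raises ValueError on malformed IDs
    return f"{ASSUMED_BASE_URL}/{cluster_id}"


parsed = parse_uniref_id("UniRef90_P99999")
print(parsed.identity_percent, parsed.representative)
print(entry_url("UniRef90_P99999"))
```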


Publications

Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926-932. doi:10.1093/bioinformatics/btu739. PMID:25398609. PMCID:PMC4375400.

Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282-1288. doi:10.1093/bioinformatics/btm098. PMID:17379688.
