UniProt Reference Clusters (UniRef)

UniProt Reference Clusters (UniRef) group protein sequences from the UniProt Knowledgebase (UniProtKB) and selected UniParc records into clusters at 100%, 90%, and 50% sequence identity. Clustering reduces redundancy while keeping a representative sequence for each cluster, which supports faster yet still sensitive sequence-similarity searches and functional annotation.

Key Features:

Clustering Mechanism: Clusters sequences by sequence identity at 100%, 90%, and 50% to produce UniRef100, UniRef90, and UniRef50, reducing database size by approximately 10%, 40%, and 70%, respectively.
Non-redundancy and Intra-cluster Homogeneity: Applies a sequence length overlap threshold to improve non-redundancy and intra-cluster homogeneity, enhancing the speed, sensitivity, and consistency of similarity searches.
Functional Annotation Consistency: Maintains high molecular function consistency, with over 97% of clusters grouping proteins with identical functions.
Efficiency in Similarity Searches: Speeds up similarity searches and shortens their results. For example, BLASTP against UniRef50 yields hit lists approximately seven times shorter (before expansion with cluster members) and runs about six times faster, with over 96% recall at an e-value threshold of <0.0001.
Comprehensive Coverage: Clusters sequences from diverse organisms into single entries to provide broad sequence-space coverage, facilitate detection of distant relationships, and reduce sampling bias.
Rich Functional Annotation Links: Each UniRef entry includes a representative protein sequence, member counts, common taxonomy, accession numbers, and links to UniProtKB functional annotations.
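
The "expansion" step mentioned under Efficiency in Similarity Searches can be illustrated with a minimal sketch: a search against cluster representatives returns a short hit list, which is then expanded to every member of the hit clusters. The function name, the cluster identifiers, and the member accessions below are hypothetical toy values, not real UniRef data or part of any UniProt tool.

```python
from typing import Dict, List


def expand_hits(cluster_hits: List[str], members: Dict[str, List[str]]) -> List[str]:
    """Replace each cluster-level hit with the members of that cluster.

    Hit order is preserved and each accession is reported once.
    """
    expanded: List[str] = []
    seen = set()
    for cluster_id in cluster_hits:
        # A cluster missing from the mapping is kept as its own single entry.
        for accession in members.get(cluster_id, [cluster_id]):
            if accession not in seen:
                seen.add(accession)
                expanded.append(accession)
    return expanded


# Hypothetical toy mapping from cluster ID to member accessions.
members = {
    "cluster_A": ["A", "B", "C"],
    "cluster_D": ["D", "E"],
}
hits = ["cluster_D", "cluster_A"]  # short hit list from the representative search
expanded = expand_hits(hits, members)
print(expanded)
```

Here two cluster-level hits expand to five member accessions, which is the sense in which the pre-expansion hit list is shorter than a search against all sequences.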

Scientific Applications:

Genome annotation: Support functional assignment and redundancy reduction in genome annotation workflows.
Proteomics data analysis: Manage large proteomic datasets by collapsing similar sequences and summarizing protein families for downstream analysis.
Similarity searching: Accelerate and shorten sequence-similarity searches (e.g., BLASTP) while retaining high sensitivity.
Functional annotation transfer and validation: Facilitate propagation and validation of UniProtKB annotations and help identify annotation inconsistencies.
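
As a rough illustration of how clusters can help identify annotation inconsistencies, the sketch below flags clusters whose annotated members do not all share the same set of function terms. This is a simplified criterion assumed for illustration, not the consistency measure used in the UniRef publications; the function name, cluster names, accessions, and term labels are all hypothetical.

```python
from typing import Dict, FrozenSet, List


def inconsistent_clusters(
    clusters: Dict[str, List[str]],
    annotations: Dict[str, FrozenSet[str]],
) -> List[str]:
    """Return IDs of clusters whose annotated members disagree on function terms."""
    flagged: List[str] = []
    for cluster_id, member_ids in clusters.items():
        # Only members that carry at least one term can be compared.
        term_sets = {
            annotations[m] for m in member_ids if annotations.get(m)
        }
        if len(term_sets) > 1:
            flagged.append(cluster_id)
    return flagged


# Hypothetical toy data: cluster membership and per-protein function terms.
clusters = {
    "cluster_1": ["P1", "P2", "P3"],
    "cluster_2": ["P4", "P5"],
}
annotations = {
    "P1": frozenset({"kinase activity"}),
    "P2": frozenset({"kinase activity"}),
    # P3 has no annotation and is ignored in the comparison.
    "P4": frozenset({"kinase activity"}),
    "P5": frozenset({"hydrolase activity"}),
}
flagged = inconsistent_clusters(clusters, annotations)
print(flagged)
```

A flagged cluster is only a candidate for review: members may differ legitimately, or one annotation may simply be less complete than another.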

Methodology:

Identical sequences and subfragments are first combined into UniRef100 entries. These entries are then clustered at 90% and 50% sequence identity to form UniRef90 and UniRef50. A sequence length overlap threshold is applied during clustering to improve non-redundancy and intra-cluster homogeneity.
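
The interplay of an identity threshold and a length overlap threshold can be sketched with a toy greedy clustering. This is not the UniRef production pipeline: the identity function is a crude ungapped comparison standing in for a real alignment, the threshold values are illustrative parameters, and all names and sequences are hypothetical.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (ungapped, left-aligned)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / shorter


def greedy_cluster(
    seqs: Dict[str, str], min_identity: float, min_overlap: float
) -> Dict[str, List[str]]:
    """Assign each sequence to the first seed it matches, else make it a new seed."""
    clusters: Dict[str, List[str]] = {}
    # Process longest first so every seed is the longest member of its cluster.
    ordered = sorted((n for n in seqs if seqs[n]), key=lambda n: (-len(seqs[n]), n))
    for name in ordered:
        seq = seqs[name]
        for seed in clusters:
            seed_seq = seqs[seed]
            overlap = len(seq) / len(seed_seq)  # length relative to the seed
            if overlap >= min_overlap and toy_identity(seq, seed_seq) >= min_identity:
                clusters[seed].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences.
seqs = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",  # differs from p1 at one of ten positions
    "p3": "MKTAY",       # identical to the start of p1 but half its length
    "p4": "GGGGGGGGGG",
}
with_overlap = greedy_cluster(seqs, min_identity=0.9, min_overlap=0.8)
without_overlap = greedy_cluster(seqs, min_identity=0.9, min_overlap=0.0)
print(with_overlap)
print(without_overlap)
```

With the overlap threshold in place, the short fragment p3 forms its own cluster instead of joining p1; without it, p3 is absorbed into the p1 cluster despite covering only half of the seed. This is the kind of heterogeneity the overlap threshold is meant to limit.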

Topics

Sequence analysis, Gene structure

Collections

DRCAT, UniProt

Details

Tool Type: web application
Operating Systems: Linux, Windows, Mac
Added: 10/7/2015
Last Updated: 6/30/2022

Operations

Data Inputs & Outputs

Query and retrieval
Inputs: Species name
Outputs: none listed

Other operations do not define inputs or outputs.

Publications

Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926-932. doi:10.1093/bioinformatics/btu739. PMID:25398609. PMCID:PMC4375400.

Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282-1288. doi:10.1093/bioinformatics/btm098. PMID:17379688.

Documentation

General

http://www.uniprot.org/help/uniref
