cd-hit

cd-hit clusters nucleotide and protein sequences to reduce redundancy and produce representative sequence sets for downstream genomic and proteomic analyses.


Key Features:

  • Redundancy reduction: Clusters sequences into representative groups based on sequence identity thresholds to reduce redundancy in large datasets.
  • Nucleotide and protein support: Handles both nucleotide (DNA/RNA) and protein sequence databases.
  • Specialized programs: Includes cd-hit-2d for comparing two protein datasets, cd-hit-est for clustering DNA/RNA databases, and cd-hit-est-2d for comparing two nucleotide datasets.
  • Parallelization: Implements parallelization strategies to accelerate processing of large sequence collections.
  • Short-word filters: Uses short-word filters, together with techniques that make those filters more effective, to avoid unnecessary sequence comparisons, giving substantial speed increases (reported up to ~100-fold).
  • High performance: Optimized for ultrafast clustering, with reported speedups of orders of magnitude relative to BLAST. As a reported example, about 560,000 protein sequences were clustered in roughly two hours on a high-end PC.
  • Sequence identity thresholds: Supports clustering at different sequence identity levels to control cluster granularity.

Scientific Applications:

  • Dereplication for downstream analysis: Reduces dataset size and computational load for downstream genomic and proteomic analyses by producing representative sequence sets.
  • Public database clustering: Applied to cluster and compress public protein databases such as NCBI NR, SwissProt, and PDB.
  • Comparative dataset analysis: Compares two protein datasets to identify shared and unique sequences (cd-hit-2d).
  • DNA/RNA clustering: Clusters nucleotide (DNA/RNA) sequence databases with cd-hit-est for applications in transcriptomics and metagenomics, and compares two nucleotide datasets with cd-hit-est-2d.
  • NGS-scale processing: Scales to very large sequence collections produced by next-generation sequencing (NGS) technologies.
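
For dereplication in a downstream pipeline, the cluster membership usually has to be read back in. The following is a minimal Python sketch, assuming the plain-text cluster listing that cd-hit writes alongside the representative sequences uses a ">Cluster N" header per cluster, one member per line with the name between ">" and "...", and a trailing "*" on the representative. The function name parse_clstr and the sample text are hypothetical; check the layout against real output before relying on it, and note that sequence names in this listing may be truncated.

```python
import re

# Assumed member line layout: "0\t120aa, >seqA... *"
# The trailing "*" is taken to mark the cluster representative.
MEMBER = re.compile(r">(?P<name>.+?)\.\.\.\s*(?P<rest>.*)$")


def parse_clstr(text):
    """Map each representative name to the list of all members of its cluster."""
    groups = []  # one dict per cluster: {"rep": name or None, "members": [...]}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            groups.append({"rep": None, "members": []})
            continue
        match = MEMBER.search(line)
        if match is None or not groups:
            continue  # blank line or unexpected content
        groups[-1]["members"].append(match.group("name"))
        if match.group("rest").startswith("*"):
            groups[-1]["rep"] = match.group("name")
    # Clusters without a marked representative are skipped in this sketch.
    return {g["rep"]: g["members"] for g in groups if g["rep"] is not None}


if __name__ == "__main__":
    # Hypothetical listing with two clusters.
    sample = (
        ">Cluster 0\n"
        "0\t120aa, >seqA... *\n"
        "1\t118aa, >seqB... at 95.00%\n"
        ">Cluster 1\n"
        "0\t90aa, >seqC... *\n"
    )
    for rep, members in parse_clstr(sample).items():
        print(rep, len(members), members)
```

The returned mapping gives the cluster size per representative, which is often enough to carry abundance information through to later analysis steps.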

Methodology:

Sequences are clustered into representative groups according to a sequence identity threshold. Short-word filters, along with techniques that make those filters more effective, avoid unnecessary sequence comparisons, and parallelization strategies accelerate the processing of large collections.
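
The idea behind a short-word filter can be illustrated with a toy Python sketch. It is not the cd-hit source code: the function names, the word length k=3, and the simple ungapped identity measure are all assumptions chosen for illustration. The filter relies on a counting argument: if a sequence of length L matches a representative with at most m mismatches and no gaps, each mismatch can destroy at most k overlapping words, so the two must share at least L - k + 1 - k*m words of length k. Pairs that share fewer words cannot reach the threshold and are skipped without computing identity. Sequences are processed longest first, and each one either joins the first representative it matches or becomes a new representative.

```python
import math
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a, b):
    """Number of k-words two sequences have in common (multiset intersection)."""
    return sum((a & b).values())


def min_shared_words(length, k, identity):
    """Lower bound on shared k-words for an ungapped match at this identity."""
    # Small epsilon guards against float error, e.g. 10 * (1 - 0.9) < 1.
    mismatches = math.floor(length * (1.0 - identity) + 1e-9)
    return max(0, length - k + 1 - k * mismatches)


def ungapped_identity(short, long):
    """Toy identity: best ungapped placement of short inside long."""
    best = 0
    for off in range(len(long) - len(short) + 1):
        matches = sum(1 for x, y in zip(short, long[off:]) if x == y)
        best = max(best, matches)
    return best / len(short)


def greedy_cluster(seqs, identity=0.9, k=3):
    """Cluster non-empty sequences; returns {representative: [members]}."""
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    reps = []       # (name, word counts) of each representative so far
    clusters = {}
    for name in order:
        seq = seqs[name]
        words = word_counts(seq, k)
        needed = min_shared_words(len(seq), k, identity)
        for rep_name, rep_words in reps:
            if shared_words(words, rep_words) < needed:
                continue  # filter: threshold unreachable, skip comparison
            if ungapped_identity(seq, seqs[rep_name]) >= identity:
                clusters[rep_name].append(name)
                break
        else:
            reps.append((name, words))   # no match: new representative
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    # Hypothetical input sequences.
    demo = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFS",
        "c": "GGGGGGGGGGWWWWWWWWWW",
    }
    print(greedy_cluster(demo))
```

In this sketch the word count comparison is far cheaper than the identity computation, which is why discarding most pairs at the filter stage pays off as the number of representatives grows.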

Details

Maturity: Mature
Tool Type: web application
Operating Systems: Linux, Windows, Mac
Programming Languages: C++
Added: 12/19/2016
Last Updated: 11/24/2024

Publications

Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282-283. doi:10.1093/bioinformatics/17.3.282. PMID:11294794.

Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18(1):77-82. doi:10.1093/bioinformatics/18.1.77. PMID:11836214.

Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680-682. doi:10.1093/bioinformatics/btq003. PMID:20053844. PMCID:PMC2828112.

Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150-3152. doi:10.1093/bioinformatics/bts565. PMID:23060610. PMCID:PMC3516142.

Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658-1659. doi:10.1093/bioinformatics/btl158. PMID:16731699.
