cd-hit
cd-hit clusters nucleotide and protein sequences to reduce redundancy and produce representative sequence sets for downstream genomic and proteomic analyses.
Key Features:
- Redundancy reduction: Clusters sequences into representative groups based on sequence identity thresholds to reduce redundancy in large datasets.
- Nucleotide and protein support: Handles both nucleotide (DNA/RNA) and protein sequence databases.
- Specialized programs: Includes cd-hit-2d for comparing two protein datasets, cd-hit-est for clustering DNA/RNA databases, and cd-hit-est-2d for comparing two nucleotide datasets.
- Parallelization: Implements parallelization strategies to accelerate processing of large sequence collections.
- Short-word filters: Uses short-word filters, with refinements to how the filters are applied, to achieve substantial speed increases (reported up to ~100-fold).
- High performance: Optimized for very fast clustering, with reported orders-of-magnitude speedups relative to BLAST; in one reported example, ~560,000 protein sequences were clustered in about two hours on a high-end PC.
- Sequence identity thresholds: Supports clustering at different sequence identity levels to control cluster granularity.
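As an illustration of how the identity threshold is typically set, the sketch below assembles a cd-hit command line in Python. It assumes the `cd-hit` executable is on the PATH and uses its `-i`, `-o`, `-c`, `-n`, and `-T` options; the helper names and the file names `proteins.fasta` and `nr90` are hypothetical. The word-size ranges follow the protein guidance in the cd-hit user guide as recalled here and should be checked against the installed version.

```python
import shutil
import subprocess


def suggest_word_size(identity):
    """Suggest a protein word size (-n) for an identity threshold (-c).

    Assumption: ranges follow the cd-hit user guide for protein input.
    """
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit protein clustering expects identity >= 0.4")


def build_cd_hit_command(fasta_in, out_prefix, identity=0.9, threads=1):
    """Return the argument list for a protein clustering run."""
    return [
        "cd-hit",
        "-i", fasta_in,       # input FASTA file
        "-o", out_prefix,     # output prefix (representatives plus a .clstr file)
        "-c", str(identity),  # sequence identity threshold
        "-n", str(suggest_word_size(identity)),  # word length of the short-word filter
        "-T", str(threads),   # number of threads
    ]


def run_cd_hit(fasta_in, out_prefix, identity=0.9, threads=1):
    """Run cd-hit if it is installed; raise a clear error otherwise."""
    if shutil.which("cd-hit") is None:
        raise FileNotFoundError("cd-hit executable not found on PATH")
    cmd = build_cd_hit_command(fasta_in, out_prefix, identity, threads)
    return subprocess.run(cmd, check=True)
```

Calling `run_cd_hit("proteins.fasta", "nr90", identity=0.9, threads=4)` would cluster the hypothetical input at 90% identity. Lower thresholds yield fewer, broader clusters.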
Scientific Applications:
- Dereplication for downstream analysis: Reduces dataset size and computational load for downstream genomic and proteomic analyses by producing representative sequence sets.
- Public database clustering: Applied to cluster and compress public protein databases such as NCBI NR, SwissProt, and PDB.
- Comparative dataset analysis: Compares two protein datasets to identify shared and unique sequences (cd-hit-2d).
- DNA/RNA clustering: Clusters nucleotide (DNA/RNA) sequence databases for applications in transcriptomics and metagenomics (cd-hit-est), and compares two nucleotide datasets (cd-hit-est-2d).
- NGS-scale processing: Scales to very large sequence collections produced by next-generation sequencing (NGS) technologies.
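For downstream use of the representative sets, cluster membership usually has to be read back from the run's `.clstr` report. The following is a minimal parsing sketch. It assumes the usual layout of that file (a `>Cluster N` header, then one line per member, with `*` marking the representative); the sample text and sequence names are made up for illustration and are not output from a real run.

```python
import re

# One member line, e.g. "0<TAB>250aa, >seqA... *" or "1<TAB>240aa, >seqB... at 95.00%"
_MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(text):
    """Map cluster number -> {'representative': name, 'members': [names]}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = {"representative": None, "members": []}
            clusters[int(line.split()[1])] = current
            continue
        match = _MEMBER.match(line)
        if match is None or current is None:
            continue  # skip lines that do not fit the assumed layout
        _length, name, status = match.groups()
        current["members"].append(name)
        if status == "*":
            current["representative"] = name
    return clusters


# Illustrative sample only (hypothetical sequence names).
SAMPLE = """>Cluster 0
0\t250aa, >seqA... *
1\t240aa, >seqB... at 95.00%
>Cluster 1
0\t180aa, >seqC... *
"""

clusters = parse_clstr(SAMPLE)
```

Each entry then gives the representative and the sequences it stands in for, which is enough to count cluster sizes or map results back to the original dataset.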
Methodology:
Sequences are clustered into representative groups at a chosen sequence identity threshold. Short-word filters, together with techniques that make those filters more effective, avoid unnecessary comparisons, and parallelization strategies accelerate the processing of large collections.
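The approach can be sketched in a few lines of Python. This is a simplified illustration, not the cd-hit implementation: sequences are processed longest first, each either joins the first sufficiently similar representative or becomes a new one, and a short-word (k-mer) count decides whether a full comparison is attempted. The word-count bound assumes an ungapped alignment, and the identity measure (longest common subsequence divided by the shorter length) is a stand-in for a real alignment; all function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_shared_words(length, identity, k):
    """Lower bound on shared k-words for an ungapped match at this identity.

    Each mismatch can destroy at most k of the (length - k + 1) words.
    """
    required_matches = math.ceil(identity * length - 1e-9)
    mismatches = length - required_matches
    return max(0, (length - k + 1) - k * mismatches)


def lcs_identity(a, b):
    """Stand-in identity: longest common subsequence / shorter length."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / min(len(a), len(b))


def greedy_cluster(seqs, identity=0.9, k=3):
    """Greedy incremental clustering; returns representative -> member names."""
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    representatives = []  # (name, k-mer counts) of each cluster representative
    clusters = {}
    for name in order:
        seq = seqs[name]
        counts = kmer_counts(seq, k)
        needed = min_shared_words(len(seq), identity, k)
        for rep_name, rep_counts in representatives:
            shared = sum((counts & rep_counts).values())
            if shared < needed:
                continue  # short-word filter: skip the expensive comparison
            if lcs_identity(seq, seqs[rep_name]) >= identity:
                clusters[rep_name].append(name)
                break
        else:
            representatives.append((name, counts))
            clusters[name] = [name]
    return clusters


# Made-up toy sequences: "b" is a prefix of "a"; "c" is unrelated.
toy = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSR",
    "c": "GSHMLEDPVWWQTLNCEGPAR",
}
result = greedy_cluster(toy, identity=0.9, k=3)
```

In the toy run, "b" passes the filter and joins "a", while "c" shares no 3-letter words with "a" and is rejected by the filter without a full comparison, so it becomes its own representative.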
Details
- Maturity: Mature
- Tool Type: web application
- Operating Systems: Linux, Windows, Mac
- Programming Languages: C++
- Added: 12/19/2016
- Last Updated: 11/24/2024
Publications
Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282-283. doi:10.1093/bioinformatics/17.3.282. PMID:11294794.
Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18(1):77-82. doi:10.1093/bioinformatics/18.1.77. PMID:11836214.
Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680-682. doi:10.1093/bioinformatics/btq003. PMID:20053844. PMCID:PMC2828112.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150-3152. doi:10.1093/bioinformatics/bts565. PMID:23060610. PMCID:PMC3516142.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658-1659. doi:10.1093/bioinformatics/btl158. PMID:16731699.