cd-hit

cd-hit clusters nucleotide and protein sequences to reduce redundancy and produce representative sequence sets for downstream genomic and proteomic analyses.

Key Features:

Redundancy reduction: Groups sequences that meet a sequence identity threshold into clusters, each summarized by one representative sequence, to shrink large datasets.
Nucleotide and protein support: Handles both nucleotide (DNA/RNA) and protein sequence databases.
Specialized programs: Includes cd-hit-est for clustering DNA/RNA databases, cd-hit-2d for comparing two protein datasets, and cd-hit-est-2d for comparing two nucleotide datasets.
Parallelization: Implements parallelization strategies to accelerate processing of large sequence collections.
Short-word filters: Uses short-word (k-mer) filters to screen out dissimilar sequence pairs, together with techniques that make the filters more effective, for reported speed increases of up to ~100-fold.
High performance: Designed for very fast clustering, with reported orders-of-magnitude speedups relative to BLAST; in one reported example, ~560,000 protein sequences were clustered in about two hours on a high-end PC.
Sequence identity thresholds: Supports clustering at different sequence identity levels to control cluster granularity.
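
To illustrate how the programs, identity threshold, and parallelization listed above map onto a command line, here is a minimal Python sketch that assembles an argument list. The helper name build_cdhit_command and the file names are hypothetical. The flags (-i, -i2, -o, -c, -n, -T, -M) are the ones described in the CD-HIT user guide and should be checked against the installed version.

```python
from typing import List, Optional


def build_cdhit_command(
    program: str,
    fasta_in: str,
    out_prefix: str,
    identity: float = 0.9,
    word_size: int = 5,
    threads: int = 1,
    memory_mb: int = 800,
    fasta_in2: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for cd-hit, cd-hit-est, cd-hit-2d or cd-hit-est-2d.

    Hypothetical helper for illustration; it only builds the list and
    does not run anything.
    """
    two_set = program.endswith("-2d")
    if two_set != (fasta_in2 is not None):
        raise ValueError("a second input (-i2) is required for, and only for, the -2d programs")

    cmd = [program, "-i", fasta_in]
    if fasta_in2 is not None:
        cmd += ["-i2", fasta_in2]
    cmd += [
        "-o", out_prefix,
        "-c", f"{identity:.2f}",   # sequence identity threshold
        "-n", str(word_size),      # word length; the user guide suggests 5 for protein thresholds >= 0.7
        "-T", str(threads),        # number of threads
        "-M", str(memory_mb),      # memory limit in MB
    ]
    return cmd


# Example: cluster a protein FASTA file at 90% identity using 4 threads.
cmd = build_cdhit_command("cd-hit", "proteins.fasta", "nr90", identity=0.9, threads=4)
# To execute it (assuming cd-hit is on PATH): subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a shell string avoids quoting problems when file names contain spaces.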

Scientific Applications:

Dereplication for downstream analysis: Reduces dataset size and computational load for downstream genomic and proteomic analyses by producing representative sequence sets.
Public database clustering: Applied to cluster and compress public protein databases such as NCBI NR, SwissProt, and PDB.
Comparative dataset analysis: Compares two protein datasets to identify shared and unique sequences (cd-hit-2d).
DNA/RNA clustering: Clusters nucleotide (DNA/RNA) sequence databases for applications in transcriptomics and metagenomics (cd-hit-est), and compares two nucleotide datasets (cd-hit-est-2d).
NGS-scale processing: Scales to very large sequence collections produced by next-generation sequencing (NGS) technologies.
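
For downstream use of the representative sets mentioned above, it often helps to know which input sequences each representative stands for. The sketch below parses the text of a .clstr cluster file into a mapping from representative ID to member IDs. The function name parse_clstr and the sample records are hypothetical, and the line layout is assumed to follow the usual CD-HIT output (a ">Cluster N" header, then one line per member, with the representative marked by a trailing "*").

```python
import re
from typing import Dict, Iterable, List

# Matches the ">id..." part of a member line and its trailing marker:
# "*" for the representative, "at NN.NN%" for other members.
MEMBER = re.compile(r">(\S+?)\.\.\.\s+(\*|at\s+\S+)")


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each representative ID to the IDs of all members of its cluster."""
    clusters: Dict[str, List[str]] = {}
    current: List[str] = []
    rep = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster starts: store the previous one, if any.
            if rep is not None:
                clusters[rep] = current
            current, rep = [], None
            continue
        match = MEMBER.search(line)
        if match is None:
            continue  # skip lines that do not look like member records
        current.append(match.group(1))
        if match.group(2) == "*":
            rep = match.group(1)
    if rep is not None:
        clusters[rep] = current
    return clusters


# Made-up sample in the assumed layout.
SAMPLE = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 95.76%\n"
    ">Cluster 1\n"
    "0\t90aa, >seqC... *\n"
)
clusters = parse_clstr(SAMPLE.splitlines())
```

Note that CD-HIT truncates long sequence names in the cluster file, so IDs recovered this way may need to be matched against the original FASTA headers by prefix.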

Methodology:

Clusters sequences into groups, each with a representative sequence, according to a sequence identity threshold. Short-word filters, together with techniques that make the filters more effective, screen out dissimilar pairs, and parallelization strategies accelerate processing of large collections.
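
The approach described above can be sketched in a few lines of Python. This is an illustrative toy, not the CD-HIT implementation. It assumes the greedy, longest-first ordering and the word-count bound described in the CD-HIT publications, and it uses difflib.SequenceMatcher as a stand-in for the real alignment step. All function names are hypothetical, and input sequences are assumed to be non-empty.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_shared_words(length: int, k: int, identity: float) -> int:
    """Lower bound on shared k-words for a sequence of this length.

    A sequence has length - k + 1 words and each mismatch can destroy at
    most k of them, so a pair below this bound cannot reach the threshold
    (under a substitution-only model) and can skip alignment.
    """
    mismatches = math.floor((1.0 - identity) * length + 1e-9)
    return max(0, length - k + 1 - k * mismatches)


def approx_identity(query: str, rep: str) -> float:
    """Stand-in for alignment: matched characters / query length."""
    blocks = SequenceMatcher(None, query, rep, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / len(query)


def greedy_cluster(seqs: Dict[str, str], identity: float = 0.9, k: int = 5) -> Dict[str, List[str]]:
    """Greedy incremental clustering: longest sequences become representatives first."""
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    reps: List[Tuple[str, Counter]] = []
    clusters: Dict[str, List[str]] = {}
    for name in order:
        seq = seqs[name]
        words = kmer_counts(seq, k)
        needed = min_shared_words(len(seq), k, identity)
        for rep_name, rep_words in reps:
            shared = sum((words & rep_words).values())
            if shared < needed:
                continue  # short-word filter: too few common words, skip alignment
            if approx_identity(seq, seqs[rep_name]) >= identity:
                clusters[rep_name].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append((name, words))
            clusters[name] = [name]
    return clusters


# Made-up example: s2 differs from s1 by one residue, s3 is unrelated.
example = {
    "s1": "MKTAYIAKQRQISFVKSHFSRQ",
    "s2": "MKTAYIAKQRAISFVKSHFSRQ",
    "s3": "GGSGGSGGSGGSGGSGGSGG",
}
result = greedy_cluster(example)
```

In this sketch the filter is the inexpensive step: counting common words is far cheaper than aligning, so most dissimilar pairs are rejected before the costly comparison runs. The parallelization mentioned above is not modeled here.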

Topics

Sequencing

Collections

galaxyPasteur

Details

Maturity: Mature
Tool Type: web application
Operating Systems: Linux, Windows, Mac
Programming Languages: C++
Added: 12/19/2016
Last Updated: 11/24/2024

Operations

Sequence clustering

Publications

Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282-283. doi:10.1093/bioinformatics/17.3.282. PMID:11294794.

Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18(1):77-82. doi:10.1093/bioinformatics/18.1.77. PMID:11836214.

Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680-682. doi:10.1093/bioinformatics/btq003. PMID:20053844. PMCID:PMC2828112.

Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150-3152. doi:10.1093/bioinformatics/bts565. PMID:23060610. PMCID:PMC3516142.

Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658-1659. doi:10.1093/bioinformatics/btl158. PMID:16731699.

Documentation

General

https://github.com/weizhongli/cdhit/wiki

Links

Galaxy service

https://galaxy.pasteur.fr/tool_runner?tool_id=toolshed.pasteur.fr/repos/afelten/microbiome_analyses/CD-HIT/4.6.1
