SEED
SEED clusters next-generation sequencing (NGS) short-read sequences to reduce redundancy and facilitate genome and transcriptome assembly and small RNA cluster discovery.
Key Features:
- Efficient Clustering Algorithm: Employs a modified spaced seed method called block spaced seeds to form sequence clusters.
- Error and Overhang Tolerance: Forms clusters where sequences can differ by up to three mismatches and three overhanging residues from their virtual center.
- Scalability and Speed: Runs in linear time and memory and can cluster 100 million short-read sequences in less than four hours.
- Preprocessing for Assembly: When used before Velvet/Oases assembly, reduces assembler time by 60–85% and memory by 21–41% while producing contigs with N50 values 12–27% larger.
- Performance Comparison: Generates clusters closely resembling true clusters and achieves a 2- to 10-fold improvement in time efficiency over other clustering tools.
- Versatility: Functions as a standalone method for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
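The mismatch and overhang tolerance listed above can be illustrated with a small check. This is a minimal sketch and not SEED's implementation: the function name within_tolerance is hypothetical, and it assumes that "overhanging residues" means read residues that extend past either end of the center in an ungapped alignment.

```python
def within_tolerance(read, center, max_mismatches=3, max_overhang=3):
    """Return True if some ungapped placement of `read` on `center` stays
    within the mismatch and overhang limits.

    Assumption (not taken from SEED's source): overhang is the number of
    read residues that extend past either end of the center.
    """
    for shift in range(-max_overhang, max_overhang + 1):
        # shift is the position of the read's first residue relative to the
        # center's first residue; negative values overhang on the left.
        left = max(0, -shift)
        right = max(0, shift + len(read) - len(center))
        if left + right > max_overhang:
            continue
        overlap = len(read) - left - right
        if overlap <= 0:
            continue
        center_start = max(0, shift)
        mismatches = sum(
            read[left + i] != center[center_start + i] for i in range(overlap)
        )
        if mismatches <= max_mismatches:
            return True
    return False
```

With the default limits, a read that is offset from the center by two residues, or that differs from it at three positions, is accepted, while a read with four substitutions and no allowed overhang is rejected.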
Scientific Applications:
- Genome assembly preprocessing: Reduces redundancy and computational resources for genome assembly workflows using Velvet and similar assemblers.
- Transcriptome assembly preprocessing: Optimizes transcriptome assembly by decreasing time and memory requirements and improving contig length metrics.
- Small RNA cluster discovery: Identifies clusters of small RNA sequences in NGS data from unsequenced organisms.
- Population and diversity analysis: Facilitates estimation of DNA/RNA molecule population sizes and exploration of genomic diversity.
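The contig length metric cited above, N50, is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name n50 and the contig lengths in the usage note are illustrative and are not data from the SEED study.

```python
def n50(lengths):
    """Return the N50 of a non-empty collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For hypothetical contig lengths 2, 3, 4, 5 and 6 (total 20), the two longest contigs cover 11 bases, which is at least half the total, so n50([2, 3, 4, 5, 6]) returns 5.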
Methodology:
SEED clusters reads using block spaced seeds, a modified spaced seed method. Each sequence in a cluster may differ from the cluster's virtual center by up to three mismatches and three overhanging residues. The underlying algorithms run in linear time and memory.
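The general idea behind seeding with ignored blocks can be sketched as follows. This is an illustration of the technique only, under stated assumptions, and not SEED's actual seed design or code: the functions block_masks, seed_key and candidate_pairs, the choice of four blocks, and the example reads are all hypothetical. Each read is split into equal blocks, each seed ignores a fixed subset of blocks, and reads that agree on the remaining blocks fall into the same hash bucket. If all mismatches between two reads fall inside the ignored blocks of at least one seed, the two reads become a candidate pair.

```python
from collections import defaultdict
from itertools import combinations


def block_masks(num_blocks, ignored):
    """Yield every mask that ignores `ignored` of `num_blocks` blocks."""
    for skip in combinations(range(num_blocks), ignored):
        yield tuple(b not in skip for b in range(num_blocks))


def seed_key(read, mask, block_len):
    """Build a hash key from the kept blocks; ignored blocks become dashes."""
    return "".join(
        read[i * block_len:(i + 1) * block_len] if keep else "-" * block_len
        for i, keep in enumerate(mask)
    )


def candidate_pairs(reads, num_blocks=4, ignored=1):
    """Return index pairs of reads that share a key under at least one mask.

    Only the first num_blocks * block_len residues of each read are used.
    """
    block_len = min(len(r) for r in reads) // num_blocks
    pairs = set()
    for mask in block_masks(num_blocks, ignored):
        buckets = defaultdict(list)
        for idx, read in enumerate(reads):
            buckets[seed_key(read, mask, block_len)].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs
```

Candidate pairs found this way would still need a direct comparison to confirm the mismatch count, and this sketch does not handle shifted (overhanging) reads.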
Details
- Maturity: Mature
- Tool Type: command-line tool
- Operating Systems: Linux, Windows, Mac
- Programming Languages: C++
- Added: 1/13/2017
- Last Updated: 11/25/2024
Publications
Bao E, Jiang T, Kaloshian I, Girke T. SEED: efficient clustering of next-generation sequences. Bioinformatics. 2011;27(18):2502-2509. doi:10.1093/bioinformatics/btr447. PMID:21810899. PMCID:PMC3167058.
Documentation
- General: https://github.com/baoe/SEED