seqkit
seqkit is a cross-platform toolkit for efficiently processing and manipulating FASTA and FASTQ files containing nucleotide or protein sequences.
Key Features:
- Format and sequence support: Operates on FASTA and FASTQ formats and supports nucleotide and protein sequences.
- Conversion: Converts FASTQ records to FASTA format.
- Search and filtering: Searches sequences and filters entries based on specified criteria.
- Deduplication: Identifies and removes redundant sequence entries.
- File partitioning: Splits large sequence files into smaller segments.
- Randomization and sampling: Shuffles sequences for randomization and samples subsets of sequences.
- Performance optimization: Employs optimized algorithms to reduce execution time and memory usage for large datasets.
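The filtering and deduplication features listed above can be illustrated with a minimal pure-Python sketch. This is not seqkit's implementation, only an outline of the underlying idea under simple assumptions (input small enough to hold in memory, duplicates defined as case-insensitive identical sequences). The names `parse_fasta`, `filter_by_length`, and `dedup_by_sequence` are hypothetical and exist only for this example.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int) -> Iterator[Record]:
    """Keep records whose sequence has at least min_len residues."""
    return (rec for rec in records if len(rec[1]) >= min_len)


def dedup_by_sequence(records: Iterable[Record]) -> Iterator[Record]:
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            yield header, seq


# Hypothetical example data, not taken from any real dataset.
EXAMPLE = """>seq1 first
ACGTACGT
ACGT
>seq2 duplicate of seq1
acgtacgtacgt
>seq3 short
ACG
"""

records = list(parse_fasta(EXAMPLE.splitlines()))
kept = list(dedup_by_sequence(filter_by_length(records, 4)))
```

In this sketch the filter runs before deduplication so that short records never enter the set of seen sequences. A production tool would typically store a hash of each sequence instead of the sequence itself to limit memory use.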
Scientific Applications:
- Dataset preparation: Prepares and formats FASTA/FASTQ datasets for downstream analyses.
- Large-scale sequencing management: Processes and partitions large sequencing outputs for scalable analysis workflows.
- Variant calling workflows: Performs preliminary processing steps required before variant calling.
- Metagenomics studies: Filters, samples, and partitions metagenomic sequence datasets for taxonomic or functional analysis.
- Comparative genomics: Prepares sequence collections for alignment, clustering, or comparative analyses.
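As one illustration of dataset preparation, the sketch below converts FASTQ records to FASTA text in plain Python. It is an assumption-laden outline, not seqkit's code: it assumes the simple four-line FASTQ layout (no wrapped sequence or quality lines), and the function names `parse_fastq` and `fastq_to_fasta` and the `width` parameter are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def fastq_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Return FASTA text for the given FASTQ lines, dropping qualities."""
    out = []
    for header, seq, _qual in parse_fastq(lines):
        out.append(">" + header)
        # Wrap the sequence at a fixed line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Hypothetical example reads.
FASTQ_EXAMPLE = """@read1
ACGTAC
+
IIIIII
@read2 sample
GGCC
+
!!!!
"""

fasta_text = fastq_to_fasta(FASTQ_EXAMPLE.splitlines(), width=4)
```

The conversion is one-way: quality scores are discarded, so the original FASTQ cannot be rebuilt from the FASTA output alone.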
Methodology:
Uses optimized algorithms to perform sequence file manipulations such as searching, filtering, deduplication, splitting, shuffling, and sampling on FASTA and FASTQ files.
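The splitting, shuffling, and sampling operations named above can be sketched in a few lines of Python. This is only a conceptual outline that holds all records in memory; seqkit's own strategies may differ. The function names and the `seed` and `parts` parameters are hypothetical, chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple

Record = Tuple[str, str]  # (header, sequence)


def shuffle_records(records: Sequence[Record], seed: int) -> List[Record]:
    """Return a shuffled copy; a fixed seed makes the order reproducible."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled


def sample_records(records: Sequence[Record], n: int, seed: int) -> List[Record]:
    """Return n records drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(records), n)


def split_into_parts(records: Sequence[Record], parts: int) -> List[List[Record]]:
    """Distribute records round-robin into the requested number of parts."""
    buckets: List[List[Record]] = [[] for _ in range(parts)]
    for i, rec in enumerate(records):
        buckets[i % parts].append(rec)
    return buckets


# Hypothetical toy dataset of ten records.
dataset = [("seq%d" % i, "ACGT" * (i + 1)) for i in range(10)]

shuffled = shuffle_records(dataset, seed=11)
subset = sample_records(dataset, n=3, seed=11)
chunks = split_into_parts(dataset, parts=3)
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps each operation reproducible without affecting other code that draws random numbers.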
Details
- Added: 2/15/2021
- Last Updated: 11/24/2024
Operations
- DNA transcription
Publications
Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE. 2016;11(10):e0163962. doi:10.1371/journal.pone.0163962. PMID:27706213. PMCID:PMC5051824.
Funding:
- National Natural Science Foundation of China: 31570173, 81373133