seqkit

seqkit is a cross-platform toolkit for fast manipulation of FASTA and FASTQ files containing nucleotide or protein sequences.


Key Features:

  • Format and sequence support: Operates on FASTA and FASTQ formats and supports nucleotide and protein sequences.
  • Conversion: Converts FASTQ records to FASTA, discarding quality scores.
  • Search and filtering: Searches sequences and filters entries based on specified criteria.
  • Deduplication: Identifies and removes redundant sequence entries.
  • File partitioning: Splits large sequence files into smaller segments.
  • Randomization and sampling: Shuffles sequences for randomization and samples subsets of sequences.
  • Performance: Designed to keep execution time and memory usage low when processing large datasets.
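
To make the filtering and deduplication features above concrete, here is a minimal Python sketch of what those operations do to a set of FASTA records. It is an illustration only and not seqkit's implementation: the helper names (`parse_fasta`, `filter_min_length`, `remove_duplicates`, `demo`), the minimum-length criterion, and the case-insensitive comparison are all assumptions chosen for the example.

```python
from typing import Iterable, Iterator, List, Tuple

# A record is (header without the leading '>', sequence). Hypothetical type for this sketch.
Record = Tuple[str, str]


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # multi-line sequences are joined
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(records: Iterable[Record], min_len: int) -> Iterator[Record]:
    """Keep only records whose sequence has at least min_len residues."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


def remove_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Keep the first record for each distinct sequence (case-insensitive, an assumption)."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            yield header, seq


EXAMPLE = """>r1
ACGT
ACGT
>r2
acgtacgt
>r3
AC
>r4
GGGGCCCC
"""


def demo() -> List[Record]:
    """Filter the example to sequences of length >= 4, then deduplicate."""
    records = parse_fasta(EXAMPLE.splitlines())
    return list(remove_duplicates(filter_min_length(records, 4)))
```

In this example, `demo()` keeps `r1` and `r4`: `r3` is shorter than four residues, and `r2` has the same sequence as `r1` once case is ignored.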

Scientific Applications:

  • Dataset preparation: Prepares and formats FASTA/FASTQ datasets for downstream analyses.
  • Large-scale sequencing management: Processes and partitions large sequencing outputs for scalable analysis workflows.
  • Variant calling workflows: Performs preliminary processing steps required before variant calling.
  • Metagenomics studies: Filters, samples, and partitions metagenomic sequence datasets for taxonomic or functional analysis.
  • Comparative genomics: Prepares sequence collections for alignment, clustering, or comparative analyses.

Methodology:

Uses optimized algorithms to perform sequence file manipulations such as searching, filtering, deduplication, splitting, shuffling, and sampling on FASTA and FASTQ files.
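
As a rough illustration of the splitting, shuffling, and sampling operations named above, the following Python sketch works on in-memory `(header, sequence)` tuples. It does not reproduce seqkit's algorithms; the round-robin split, the reservoir-sampling approach, and every function name here are assumptions made for the example.

```python
import random
from typing import Iterable, List, Tuple

# Hypothetical record type for this sketch: (header, sequence).
Record = Tuple[str, str]


def split_into_parts(records: Iterable[Record], n_parts: int) -> List[List[Record]]:
    """Distribute records round-robin into n_parts lists of near-equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts: List[List[Record]] = [[] for _ in range(n_parts)]
    for i, rec in enumerate(records):
        parts[i % n_parts].append(rec)
    return parts


def sample_records(records: Iterable[Record], k: int, seed: int) -> List[Record]:
    """Draw up to k records uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir


def shuffle_records(records: Iterable[Record], seed: int) -> List[Record]:
    """Return the records in a random order, reproducible for a given seed."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Reservoir sampling is used here because it needs only one pass and memory proportional to the sample size, which suits inputs too large to load whole. Shuffling, by contrast, holds every record in memory in this sketch.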

Details

Added: 2/15/2021
Last Updated: 11/24/2024

Operations

DNA transcription

Publications

Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE. 2016;11(10):e0163962. doi:10.1371/journal.pone.0163962. PMID:27706213. PMCID:PMC5051824.

Funding: National Natural Science Foundation of China (grants 31570173, 81373133)