BigSeqKit

The Next Generation Sequencing (NGS) raw data are stored in FASTA and FASTQ text-based file formats. Common operations on FASTA/Q files include searching, filtering, sampling, deduplication and sorting, among others. We can find several tools in the literature for FASTA/Q file manipulation but none of them are well fitted for large files of tens of GB (likely TBs in the near future) since mostly they are based on sequential processing. The exception is seqkit that allows some routines to use a few threads but, in any case, the scalability is very limited. To deal with this issue, we introduce BigSeqKit, a parallel toolkit to manipulate FASTA/Q files at scale with speed and scalability at its core. BigSeqKit takes advantage of an HPC-Big Data framework (IgnisHPC) to parallelize and optimize the commands included in seqkit. In this way, in most cases it is from tens to hundreds of times faster than other state-of-the-art tools such as seqkit, samtools and pyfastx.

Topics

Details

License:
GPL-3.0
Tool Type:
command-line tool, library
Operating Systems:
Linux
Added:
5/22/2023
Last Updated:
5/22/2023

Operations