FASTX-Toolkit

FASTX-Toolkit preprocesses short reads in FASTA/FASTQ format and performs translation-based sequence comparisons, aligning translated DNA sequences to protein databases to detect protein-coding regions and frameshifts.


Key Features:

  • Sequence comparison programs: The suite includes FASTX, FASTY, TFASTX, and TFASTY for comparative analysis between DNA and protein sequences.
  • Translation and alignment capabilities: FASTX and FASTY translate a DNA sequence into three reading frames and align the translations against a protein database with allowance for gaps and frameshifts, while TFASTX and TFASTY translate sequences in a DNA database across six frames and align them to a protein sequence with gaps and frameshifts.
  • Frameshift and substitution handling: FASTX and TFASTX permit frameshifts only between codons, whereas FASTY and TFASTY permit substitutions and frameshifts within codons.
  • Performance evaluation: FASTX and FASTY have been evaluated across a range of penalties for gap opening, gap extension, frameshifts, and nucleotide substitutions; the two programs perform equivalently when query sequences contain up to 10% errors.
  • Statistical accuracy: FASTX and FASTY provide statistical estimates that are generally accurate but can be less reliable when out-of-frame translation yields a low-complexity protein sequence.
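
The three-frame and six-frame translation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the toolkit: it uses the standard genetic code, and the helper names (translate, reverse_complement, three_frames, six_frames) are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def three_frames(dna):
    """Forward-strand translations only (hypothetical helper)."""
    return [translate(dna, frame) for frame in range(3)]


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    return three_frames(dna) + three_frames(reverse_complement(dna))


if __name__ == "__main__":
    for index, protein in enumerate(six_frames("ATGGCCAAATAA")):
        print(index, protein)
```

Each translated frame would then be compared with the protein sequences; the sketch stops at the translation step.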

Scientific Applications:

  • Protein-coding gene identification: The toolkit is used to detect and characterize protein-coding regions and to identify putative coding sequences in genomic data.
  • Genome-wide scanning and boundary correction: It has been applied to Mycoplasma genitalium, Haemophilus influenzae, and Methanococcus jannaschii, identifying at least nine new protein-coding genes and discovering at least 35 genes with potentially incorrect boundaries.

Methodology:

DNA sequences are translated into multiple reading frames (three for FASTX/FASTY, six for TFASTX/TFASTY), and the translations are aligned against protein sequences. The alignments allow gaps and frameshifts and are scored with penalties for gap opening, gap extension, frameshifts, and nucleotide substitutions.
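
To make the role of the frameshift penalty concrete, here is a toy local-alignment sketch in which a frameshift may occur between codons. It is not the published algorithm: it assumes a simple match/mismatch score instead of a substitution matrix, a linear rather than affine gap penalty, and arbitrary penalty values; the function name frameshift_score and all parameter names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def frameshift_score(dna, protein, match=5, mismatch=-2, gap=4, frameshift=6):
    """Best local score of a DNA sequence against a protein (toy model).

    S[i][j] is the best score of an alignment ending after i nucleotides
    and j amino acids. Penalty values are illustrative assumptions.
    """
    dna, protein = dna.upper(), protein.upper()
    n, m = len(dna), len(protein)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = [0]  # local alignment: may start anywhere
            if i >= 3 and j >= 1:
                # Align the codon ending at i with amino acid j.
                aa = CODON.get(dna[i - 3:i], "X")
                step = match if aa == protein[j - 1] else mismatch
                candidates.append(S[i - 3][j - 1] + step)
            if i >= 3:
                candidates.append(S[i - 3][j] - gap)  # codon against a gap
            if j >= 1:
                candidates.append(S[i][j - 1] - gap)  # amino acid against a gap
            if i >= 1:
                candidates.append(S[i - 1][j] - frameshift)  # skip 1 nucleotide
            if i >= 2:
                candidates.append(S[i - 2][j] - frameshift)  # skip 2 nucleotides
            S[i][j] = max(candidates)
            best = max(best, S[i][j])
    return best


if __name__ == "__main__":
    clean = "ATGGCCAAATGGCATTTT"      # encodes MAKWHF in frame 0
    shifted = "ATGGCCAAACTGGCATTTT"   # same sequence with one inserted base
    print(frameshift_score(clean, "MAKWHF"))
    print(frameshift_score(shifted, "MAKWHF"))
    print(frameshift_score(shifted, "MAKWHF", frameshift=1000))
```

With these assumed penalties, the sequence carrying a single inserted base still aligns across the insertion at the cost of one frameshift penalty; setting the frameshift penalty very high effectively disables that step, and only a shorter or mismatched single-frame alignment remains.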

Details

License: AGPL-3.0
Tool Type: web application, workflow
Operating Systems: Linux, Windows, Mac
Programming Languages: Shell, C++, C
Added: 1/17/2017
Last Updated: 11/24/2024

Publications

Pearson WR, Wood T, Zhang Z, Miller W. Comparison of DNA Sequences with Protein Sequences. Genomics. 1997;46(1):24-36. doi:10.1006/geno.1997.4995. PMID:9403055.

Related Tools

  • cshl_fastx_artifacts_filter (relation: includes)
  • cshl_fastx_clipper (relation: includes)
  • cshl_fastx_collapser (relation: includes)
  • cshl_fastx_renamer (relation: includes)
  • cshl_fastx_trimmer (relation: includes)