SOPRA

SOPRA is a scaffolding tool that uses mate-pair and paired-end short reads from high-throughput sequencing platforms (Illumina, SOLiD) to join contigs into scaffolds, selecting a consistent set of mate-pair constraints to improve de novo genome assembly.


Key Features:

  • Input data: Accepts mate-pair and paired-end short-read libraries from high-throughput sequencing platforms such as Illumina and SOLiD.
  • Mate-pair constraint selection: Selects an optimal subset of mate-pair constraints that are simultaneously satisfiable to balance scaffold size and quality.
  • Contig connectivity graph optimization: Formulates scaffold assembly as an optimization problem with variables associated with vertices (contigs) and edges (mate-pair relationships) in a contig connectivity graph.
  • Constraint filtering: Treats all constraints on an equal footing when solving the optimization, then uses the solution to identify problematic constraints, such as those arising from chimeric or repetitive contigs.
  • Iterative refinement: Iteratively solves the optimization and removes inconsistent constraints until a core set of consistent constraints remains.
  • SOLiD color-space translation: Uses a dynamic programming approach to translate color-space assemblies from SOLiD data into base-space.
  • Assembly quality metrics: Assesses assemblies using the no-match/mismatch error rate and various rearrangement error rates.
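
To make the color-space feature concrete, here is a minimal sketch of the standard SOLiD di-base encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases. This is not SOPRA's dynamic programming translation; the function names are hypothetical and the code only illustrates why a robust translation is needed: one wrong color changes every base decoded after it.

```python
# Standard SOLiD di-base encoding (A=0, C=1, G=2, T=3; color = XOR of
# adjacent base codes). Function names are illustrative, not SOPRA's.
BASES = "ACGT"

def encode_colors(seq):
    """Convert a base-space sequence into its color string."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))

def decode_colors(first_base, colors):
    """Naively decode colors into bases, starting from a known first base."""
    code = BASES.index(first_base)
    out = [first_base]
    for c in colors:
        code ^= int(c)  # each color gives the next base relative to the last
        out.append(BASES[code])
    return "".join(out)

colors = encode_colors("ACGGTCA")            # "130121"
clean = decode_colors("A", colors)           # "ACGGTCA"
corrupted = decode_colors("A", "100121")     # second color misread
```

With the second color misread, every base from the third position onward differs from the original, which is the error propagation that a naive decoder cannot recover from.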

Scientific Applications:

  • De novo genome scaffolding: Improves scaffold assembly for moderate-sized genomes using mate-pair spatial information to connect contigs.
  • Bacterial genome assembly: Demonstrated assembly of bacterial genomes into scaffolds with high continuity (reported N50 up to 200 Kb) with few introduced errors.
  • Color-space sequence analysis: Processes SOLiD color-space data and converts results into base-space for downstream assembly evaluation.
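
For readers unfamiliar with the N50 figure quoted above, the following sketch computes it under the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one non-zero length")

example_n50 = n50([80, 70, 50, 40, 30, 20])  # total 290; 80 + 70 = 150 >= 145
```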

Methodology:

Scaffold assembly is formulated as an optimization problem on a contig connectivity graph, with variables on vertices (contigs) and edges (mate-pair relationships). SOPRA treats all constraints equally when solving the optimization, uses the solution to identify problematic (chimeric or repetitive) constraints, and removes them. Solving and removal are repeated until a core subset of simultaneously satisfiable mate-pair constraints remains. For SOLiD data, dynamic programming translates the color-space assembly into base-space. Assemblies are evaluated using no-match/mismatch and rearrangement error rates.
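
The solve-and-remove loop can be illustrated with a toy orientation problem. The sketch below is not SOPRA's optimizer: it assumes each contig has an orientation of +1 or -1, each hypothetical constraint `(i, j, rel)` says contigs `i` and `j` should have the same (`rel = 1`) or opposite (`rel = -1`) orientation, and the optimization is solved by exhaustive search, which is only feasible for a handful of contigs.

```python
from itertools import product

def best_orientation(n_contigs, constraints):
    """Exhaustively find orientations (+1/-1) satisfying the most constraints."""
    best, best_score = None, -1
    for rest in product((1, -1), repeat=n_contigs - 1):
        orient = (1,) + rest  # fix contig 0 to remove the global flip symmetry
        score = sum(orient[i] * orient[j] == rel for i, j, rel in constraints)
        if score > best_score:
            best, best_score = orient, score
    return best

def consistent_core(n_contigs, constraints):
    """Solve, drop one violated constraint, and repeat until none are violated."""
    kept = list(constraints)
    while True:
        orient = best_orientation(n_contigs, kept)
        violated = [c for c in kept if orient[c[0]] * orient[c[1]] != c[2]]
        if not violated:
            return orient, kept
        kept.remove(violated[0])

# Contigs 0, 1, 2 form an inconsistent triangle (same, same, opposite).
toy = [(0, 1, 1), (1, 2, 1), (0, 2, -1), (2, 3, -1)]
orient, core = consistent_core(4, toy)
```

In this toy case one constraint from the inconsistent triangle is dropped and the remaining three are all satisfied by the returned orientations. Which constraint to drop is an arbitrary choice here; a real scaffolder needs a principled criterion.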

Details

Tool Type:
command-line tool
Operating Systems:
Linux, Mac
Programming Languages:
Perl
Added:
12/18/2017
Last Updated:
1/17/2019

Publications

Dayarian A, Michael TP, Sengupta AM. SOPRA: Scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics. 2010;11(1). doi:10.1186/1471-2105-11-345. PMID:20576136. PMCID:PMC2909219.