pyScaf

pyScaf orders contigs from genome assemblies utilising several types of information:

  • paired-end (PE) and/or mate-pair libraries ([NGS-based mode](#ngs-based-scaffolding))

  • long reads ([long read-based mode](#scaffolding-based-on-long-reads))

  • synteny to the genome of some related species ([reference-based mode](#reference-based-scaffolding))

Scaffolding modes

NGS-based scaffolding

This is under development… Stay tuned.

Scaffolding based on long reads

Experimental version available.

Reference-based scaffolding

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and to estimate distances between adjacent contigs.
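As a rough illustration of the idea (not pyScaf's actual implementation), contigs placed on one reference chromosome can be sorted by their reference coordinates, and the distance between neighbours read off the reference. The `Placement` record, the `order_and_gaps` helper and the `min_gap` clamp for overlapping placements are all hypothetical names and choices made for this sketch.

```python
from collections import namedtuple

# Hypothetical record: one contig placed on a single reference chromosome.
Placement = namedtuple("Placement", "contig ref_start ref_end")


def order_and_gaps(placements, min_gap=1):
    """Sort contigs by reference start and estimate gaps between neighbours."""
    ordered = sorted(placements, key=lambda p: p.ref_start)
    gaps = []
    for left, right in zip(ordered, ordered[1:]):
        # Distance on the reference between the end of one contig and the
        # start of the next; overlapping placements are clamped to min_gap
        # (an assumption of this sketch).
        gaps.append(max(min_gap, right.ref_start - left.ref_end))
    return [p.contig for p in ordered], gaps


placements = [
    Placement("ctg2", 1500, 2400),
    Placement("ctg1", 0, 1000),
    Placement("ctg3", 2350, 3000),
]
order, gaps = order_and_gaps(placements)
print(order)  # ['ctg1', 'ctg2', 'ctg3']
print(gaps)   # [500, 1]
```

In a scaffold, each estimated gap would typically be written as a run of `N` characters between the two contigs.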

Contigs are aligned globally (end-to-end) onto reference chromosomes. The following matches are ignored:

  • matches not satisfying the cut-offs (--identity and --overlap)

  • suboptimal matches (only the best match of each query to the reference is kept)

  • matches overlapping other matches on the reference.
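The three filtering steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not pyScaf's code: the `Hit` record and `filter_hits` function are hypothetical, `overlap` is assumed to be the aligned fraction of the query, and overlaps on the reference are resolved greedily in favour of the higher-scoring match. The default cut-offs are the ones given in the parameter list (0.33 and 0.66).

```python
from collections import namedtuple

# Hypothetical alignment record; identity and overlap are fractions (0-1).
Hit = namedtuple("Hit", "query ref start end identity overlap score")


def filter_hits(hits, min_identity=0.33, min_overlap=0.66):
    # 1. Drop matches below the identity / overlap cut-offs.
    passing = [h for h in hits
               if h.identity >= min_identity and h.overlap >= min_overlap]
    # 2. Keep only the best-scoring match of each query.
    best = {}
    for h in passing:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    # 3. Remove matches overlapping on the reference, best score first
    #    (greedy choice; an assumption of this sketch).
    kept = []
    for h in sorted(best.values(), key=lambda h: h.score, reverse=True):
        if all(h.ref != k.ref or h.end <= k.start or h.start >= k.end
               for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: (h.ref, h.start))


hits = [
    Hit("ctg1", "chr1", 0, 1000, 0.95, 0.90, 900),
    Hit("ctg1", "chr2", 0, 500, 0.80, 0.70, 300),      # suboptimal for ctg1
    Hit("ctg2", "chr1", 800, 1500, 0.90, 0.80, 500),   # overlaps ctg1 on chr1
    Hit("ctg3", "chr1", 2000, 2600, 0.25, 0.90, 400),  # identity below cut-off
    Hit("ctg4", "chr1", 1600, 2100, 0.85, 0.75, 450),
]
kept = filter_hits(hits)
print([h.query for h in kept])  # ['ctg1', 'ctg4']
```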

In preliminary tests on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, pyScaf correctly reconstructed all chromosomes in every run for CANPA and in nearly every run for ARATH (Figures in dropbox, CANPA table, ARATH table). Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

  • Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.

  • pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect…), which break synteny.

  • pyScaf works very well if divergence between the reference genome and the assembled contigs is below 20% at the nucleotide level.

  • pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!

  • Consider closing gaps after scaffolding.

Usage

Dependencies

Parameters

Given a reference genome, the program generates pairwise genome alignments (dotplots) by default.

  • General options:

    -h, --help

    show this help message and exit

    -f FASTA, --fasta FASTA

    assembly FASTA file

    -o OUTPUT, --output OUTPUT

    output stream [scaffolds.fa]

    -t THREADS, --threads THREADS

    max no. of threads to run [4]

    --log LOG

    output log to [stderr]

    --dotplot

    generate dotplot as [png]

    --version

    show program’s version number and exit

  • Reference-based scaffolding options:

    -r REF, --ref REF, --reference REF

    reference FastA file

    --identity IDENTITY

    min. identity [0.33]

    --overlap OVERLAP

    min. overlap [0.66]

    -g MAXGAP, --maxgap MAXGAP

    max. distance between adjacent contigs [0.01 * assembly_size]

    --norearrangements

    high identity mode (rearrangements not allowed)

  • Long read-based scaffolding options (EXPERIMENTAL!):

    -n LONGREADS, --longreads LONGREADS

    FastQ/FastA file(s) with PacBio/ONT reads

  • NGS-based scaffolding options (!NOT IMPLEMENTED!):

    -i FASTQ, --fastq FASTQ

    FASTQ PE/MP files

    -j JOINS, --joins JOINS

    min pairs to join contigs [5]

    -a LINKRATIO, --linkratio LINKRATIO

    max link ratio between two best contig pairs [0.7]

    -l LOAD, --load LOAD

    align subset of reads [0.2]

    -q MAPQ, --mapq MAPQ

    min mapping quality [10]
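The default for --maxgap above is expressed relative to the assembly size (0.01 * assembly_size). As a hedged sketch of what that amounts to, the total assembly length can be summed over the sequence lines of the FASTA file and multiplied by 0.01. The helper names `assembly_size` and `default_maxgap` are hypothetical, and truncating to an integer is an assumption; pyScaf may compute this differently.

```python
def assembly_size(fasta_lines):
    """Total number of bases: sum of all non-header line lengths."""
    return sum(len(line.strip()) for line in fasta_lines
               if not line.startswith(">"))


def default_maxgap(fasta_lines, fraction=0.01):
    # 0.01 * assembly_size, truncated to a whole number of bases
    # (truncation is an assumption of this sketch).
    return int(fraction * assembly_size(fasta_lines))


# Tiny in-memory FASTA: two contigs of 1000 bp each.
fasta = [">ctg1\n", "ACGT" * 250 + "\n", ">ctg2\n", "GG" * 500 + "\n"]
print(assembly_size(fasta))   # 2000
print(default_maxgap(fasta))  # 20
```

For a real file, the same functions can be fed an open file handle, e.g. `default_maxgap(open("test/contigs.fa"))`.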

Test run

To perform reference-based assembly, provide assembled contigs and a reference genome in FastA format. Dotplots of the runs below can be found in [docs](/docs). If you wish to skip dotplot generation (e.g. if there is no X11 on your system), provide the --dotplot '' parameter.

# scaffold homogenised assembly (reduced contigs)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa

# scaffold reduced contigs using global mode (no rearrangements allowed)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.global.fa --norearrangements

# scaffold heterozygous assembly (de novo assembled contigs)
./pyScaf.py -f test/contigs.fa -r test/ref.fa -o test/contigs.ref.fa

# scaffold reduced contigs using long reads
## pacbio
./pyScaf.py -f test/contigs.reduced.fa -n test/pacbio.fq.gz -o test/contigs.reduced.pacbio.fa
## nanopore
./pyScaf.py -f test/contigs.reduced.fa -n test/nanopore.fa.gz -o test/contigs.reduced.nanopore.fa

# generate dotplot
lastdb test/ref.fa
lastal -f TAB test/ref.fa test/contigs.reduced.pacbio.fa | last-dotplot - test/contigs.reduced.pacbio.fa.ref.png
lastal -f TAB test/ref.fa test/contigs.reduced.nanopore.fa | last-dotplot - test/contigs.reduced.nanopore.fa.ref.png

# clean-up
#rm test/contigs.{,reduced.}fa.* test/ref.fa.* test/*.{nanopore,pacbio,ref}* test/*.log
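When the runs above need to be scripted, the command line can be assembled from Python. The sketch below only builds the argument list, using flags from the parameter list (-f, -o, -t, -r, -n, --norearrangements); the `pyscaf_command` helper is hypothetical and is not part of pyScaf.

```python
import shlex


def pyscaf_command(fasta, output, ref=None, longreads=None, threads=4,
                   norearrangements=False, script="./pyScaf.py"):
    """Build an argument list for pyScaf.py (hypothetical helper)."""
    cmd = [script, "-f", fasta, "-o", output, "-t", str(threads)]
    if ref:
        cmd += ["-r", ref]            # reference-based scaffolding
    if longreads:
        cmd += ["-n", longreads]      # long read-based scaffolding
    if norearrangements:
        cmd.append("--norearrangements")
    return cmd


cmd = pyscaf_command("test/contigs.reduced.fa",
                     "test/contigs.reduced.ref.fa",
                     ref="test/ref.fa")
print(" ".join(shlex.quote(c) for c in cmd))
# ./pyScaf.py -f test/contigs.reduced.fa -o test/contigs.reduced.ref.fa -t 4 -r test/ref.fa
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` to launch the run.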

Proof of concept

pyScaf is under heavy development right now. Nevertheless, the reference-based mode is functional and produces meaningful assemblies. Moreover, it has been implemented in Redundans.

For more info, have a look at the workbook.


