pyScaf
pyScaf orders contigs from genome assemblies utilising several types of information:
paired-end (PE) and/or mate-pair libraries ([NGS-based mode](#ngs-based-scaffolding))
long reads ([long read-based mode](#scaffolding-based-on-long-reads))
synteny to the genome of some related species ([reference-based mode](#reference-based-scaffolding))
Scaffolding modes
NGS-based scaffolding
This is under development… Stay tuned.
Scaffolding based on long reads
Experimental version available.
Reference-based scaffolding
In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and to estimate distances between adjacent contigs.
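As a rough illustration of this idea (not pyScaf's actual code), the sketch below orders contigs by where they align on one reference chromosome and estimates the gap to the next contig from the reference coordinates. The `Hit` tuple and the `order_and_gaps` function are hypothetical names introduced for this example, and clipping overlaps to a gap of 0 is an assumption.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One contig placed on a reference chromosome (hypothetical structure)."""
    contig: str
    ref_start: int  # 0-based start of the alignment on the reference
    ref_end: int    # end of the alignment on the reference (exclusive)


def order_and_gaps(hits: List[Hit]) -> List[Tuple[str, int]]:
    """Return contigs in reference order, each with the estimated gap to the next one.

    The gap is the distance between the end of one hit and the start of the
    following hit. Negative distances (overlaps) are clipped to 0 (assumption).
    The last contig has no successor, so its gap is reported as 0.
    """
    ordered = sorted(hits, key=lambda h: h.ref_start)
    result = []
    for current, following in zip(ordered, ordered[1:]):
        gap = max(0, following.ref_start - current.ref_end)
        result.append((current.contig, gap))
    if ordered:
        result.append((ordered[-1].contig, 0))
    return result


hits = [Hit("c2", 1500, 2400), Hit("c1", 0, 1000), Hit("c3", 2300, 3000)]
print(order_and_gaps(hits))  # [('c1', 500), ('c2', 0), ('c3', 0)]
```

The sketch assumes each contig aligns end-to-end, so the alignment coordinates can stand in for the contig's position on the reference.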
Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:
matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.
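The three filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not pyScaf's implementation: the `Match` tuple, the `filter_matches` function and the `score` field are hypothetical, identity and overlap are taken to be fractions (in line with the 0.33 and 0.66 defaults), and overlaps on the reference are resolved greedily in favour of the higher-scoring match, which is only one possible strategy.

```python
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    """One contig-to-reference alignment (hypothetical structure)."""
    query: str
    ref: str
    ref_start: int
    ref_end: int
    identity: float  # fraction of identical bases in the alignment
    overlap: float   # fraction of the query covered by the alignment
    score: int


def filter_matches(matches: List[Match],
                   min_identity: float = 0.33,
                   min_overlap: float = 0.66) -> List[Match]:
    # 1. Drop matches that do not satisfy the identity and overlap cut-offs.
    passed = [m for m in matches
              if m.identity >= min_identity and m.overlap >= min_overlap]

    # 2. Keep only the best-scoring match of each query.
    best: Dict[str, Match] = {}
    for m in passed:
        if m.query not in best or m.score > best[m.query].score:
            best[m.query] = m

    # 3. Remove matches overlapping on the reference (assumption: greedy,
    #    higher-scoring matches are placed first and win).
    kept: List[Match] = []
    for m in sorted(best.values(), key=lambda x: x.score, reverse=True):
        if all(m.ref != k.ref or m.ref_end <= k.ref_start or m.ref_start >= k.ref_end
               for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda x: (x.ref, x.ref_start))


matches = [
    Match("c1", "chr1", 0, 1000, 0.95, 0.99, 900),
    Match("c1", "chr2", 0, 500, 0.80, 0.70, 300),     # suboptimal match of c1
    Match("c2", "chr1", 800, 1800, 0.90, 0.95, 700),  # overlaps c1 on chr1
    Match("c3", "chr1", 2000, 3000, 0.20, 0.90, 100), # fails identity cut-off
    Match("c4", "chr1", 1000, 1500, 0.85, 0.80, 400),
]
print([m.query for m in filter_matches(matches)])  # ['c1', 'c4']
```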
In preliminary tests on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, pyScaf reconstructed all chromosomes correctly in every CANPA run and in nearly every ARATH run (Figures in dropbox, CANPA table, ARATH table). Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.
Important remarks:
Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect…), which breaks synteny.
pyScaf works very well if divergence between the reference genome and the assembled contigs is below 20% at the nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.
Usage
Dependencies
Parameters
Given a reference genome, the program generates pairwise genome alignment plots (dotplots) by default.
General options:
- -h, --help
show this help message and exit
- -f FASTA, --fasta FASTA
assembly FASTA file
- -o OUTPUT, --output OUTPUT
output stream [scaffolds.fa]
- -t THREADS, --threads THREADS
max no. of threads to run [4]
- --log LOG
output log to [stderr]
- --dotplot
generate dotplot as [png]
- --version
show program’s version number and exit
Reference-based scaffolding options:
- -r REF, --ref REF, --reference REF
reference FastA file
- --identity IDENTITY
min. identity [0.33]
- --overlap OVERLAP
min. overlap [0.66]
- -g MAXGAP, --maxgap MAXGAP
max. distance between adjacent contigs [0.01 * assembly_size]
- --norearrangements
high identity mode (rearrangements not allowed)
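To make the default of --maxgap concrete, the sketch below derives it from the total assembly size as described above (0.01 * assembly_size). The helper names `assembly_size` and `default_maxgap` are hypothetical, and truncating the result to an integer is an assumption; pyScaf may compute this value differently.

```python
import io
from typing import TextIO


def assembly_size(handle: TextIO) -> int:
    """Sum the lengths of all sequence lines in a FASTA stream (headers skipped)."""
    return sum(len(line.strip()) for line in handle if not line.startswith(">"))


def default_maxgap(size: int, fraction: float = 0.01) -> int:
    """Default max. distance between adjacent contigs as a fraction of assembly size.

    Truncation to an integer is an assumption made for this sketch.
    """
    return int(fraction * size)


# Toy assembly: two contigs of 14 and 186 bases, 200 bases in total.
fasta = io.StringIO(">c1\nACGTACGTAC\nACGT\n>c2\n" + "A" * 186 + "\n")
size = assembly_size(fasta)
print(size, default_maxgap(size))  # 200 2
```

For the 13 Mb CANPA genome mentioned earlier, the same rule would give a default of roughly 130 kb.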
Long read-based scaffolding options (EXPERIMENTAL!):
- -n LONGREADS, --longreads LONGREADS
FastQ/FastA file(s) with PacBio/ONT reads
NGS-based scaffolding options (!NOT IMPLEMENTED!):
- -i FASTQ, --fastq FASTQ
FASTQ PE/MP files
- -j JOINS, --joins JOINS
min pairs to join contigs [5]
- -a LINKRATIO, --linkratio LINKRATIO
max link ratio between two best contig pairs [0.7]
- -l LOAD, --load LOAD
align subset of reads [0.2]
- -q MAPQ, --mapq MAPQ
min mapping quality [10]
Test run
To perform reference-based assembly, provide assembled contigs and a reference genome in FastA format. Dotplots of the runs below can be found in [docs](/docs). If you wish to skip dotplot generation (e.g. no X11 on your system), pass an empty string to the dotplot option: --dotplot ''.
# scaffold homogenised assembly (reduced contigs)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa
# scaffold reduced contigs using global mode (no rearrangements allowed)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.global.fa --norearrangements
# scaffold heterozygous assembly (de novo assembled contigs)
./pyScaf.py -f test/contigs.fa -r test/ref.fa -o test/contigs.ref.fa
# scaffold reduced contigs using long reads
## pacbio
./pyScaf.py -f test/contigs.reduced.fa -n test/pacbio.fq.gz -o test/contigs.reduced.pacbio.fa
## nanopore
./pyScaf.py -f test/contigs.reduced.fa -n test/nanopore.fa.gz -o test/contigs.reduced.nanopore.fa
# generate dotplot
lastdb test/ref.fa
lastal -f TAB test/ref.fa test/contigs.reduced.pacbio.fa | last-dotplot - test/contigs.reduced.pacbio.fa.ref.png
lastal -f TAB test/ref.fa test/contigs.reduced.nanopore.fa | last-dotplot - test/contigs.reduced.nanopore.fa.ref.png
# clean-up
#rm test/contigs.{,reduced.}fa.* test/ref.fa.* test/*.{nanopore,pacbio,ref}* test/*.log
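The runs above can also be driven from a Python script. The sketch below only assembles the argument list, using the flags documented in the Parameters section (-f, -o, -t, -r, -n, --norearrangements and --dotplot); the `build_command` and `run` helpers are hypothetical names introduced here, and the script path ./pyScaf.py is assumed to match the examples above.

```python
import subprocess
from typing import List, Optional


def build_command(fasta: str,
                  output: str,
                  ref: Optional[str] = None,
                  longreads: Optional[str] = None,
                  threads: int = 4,
                  norearrangements: bool = False,
                  dotplot: bool = True,
                  script: str = "./pyScaf.py") -> List[str]:
    """Assemble a pyScaf command line from documented options."""
    cmd = [script, "-f", fasta, "-o", output, "-t", str(threads)]
    if ref:
        cmd += ["-r", ref]
    if longreads:
        cmd += ["-n", longreads]
    if norearrangements:
        cmd.append("--norearrangements")
    if not dotplot:
        # An empty string disables dotplot generation (e.g. no X11 available).
        cmd += ["--dotplot", ""]
    return cmd


def run(cmd: List[str]) -> None:
    """Execute the command; requires pyScaf and its dependencies to be installed."""
    subprocess.run(cmd, check=True)


cmd = build_command("test/contigs.reduced.fa", "test/contigs.reduced.ref.fa",
                    ref="test/ref.fa", dotplot=False)
print(cmd)
```

Passing the arguments as a list avoids shell quoting issues, which matters for the empty --dotplot value.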
Proof of concept
pyScaf is under heavy development right now. Nevertheless, the reference-based mode is functional and produces meaningful assemblies. Moreover, it has been implemented in Redundans.
For more info, have a look at the workbook.