Assembly Chloroplast Genome
Quick start
Download the package, unzip, and then double-click novowrap.exe or novowrap.
OR
For command-line users, run
pip install novowrap --user
# Windows
# Initialize (requires an Internet connection)
python -m novowrap init
python -m novowrap -input input_file_1 input_file_2 -taxon taxonomy name
# Linux and MacOS
# Initialize (requires an Internet connection)
python3 -m novowrap init
python3 -m novowrap -input input_file_1 input_file_2 -taxon taxonomy name
Feature
:heavy_check_mark: Assemble chloroplast genomes from given NGS data with minimal parameters to set. Batch mode is also supported.
:heavy_check_mark: Automatically generate a uniform conformation matching the reference (typically starting from trnH-psbA, with the SSC/LSC regions in the same direction as the reference).
:heavy_check_mark: Merge contigs according to their overlapping regions. May handle inverted-repeat fragments.
:heavy_check_mark: Validate assembly results by comparing synteny and sequence homology with a given reference (or taxonomy name).
Prerequisite
Hardware
The assembly function calls NOVOPlasty, which requires 2 GB of memory per 1 GB of uncompressed data.
The other functions run on ordinary computers and have no extra requirements for memory, CPU, etc.
The software requires an Internet connection on the first run to install missing dependencies. After that it can work offline, but works better with a connection.
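The memory rule of thumb above can be turned into a quick check before starting an assembly. The following is a minimal sketch, not part of novowrap: the function names are hypothetical, 1 GB is taken as 1024^3 bytes, and gzip files are streamed once to measure their uncompressed size.

```python
import gzip
import os


def uncompressed_size(path, chunk_size=1 << 20):
    """Return the uncompressed size in bytes of a plain or gzip-compressed file."""
    if not str(path).endswith(".gz"):
        return os.path.getsize(path)
    total = 0
    # Stream through the archive so the whole file is never held in memory.
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            total += len(chunk)
    return total


def estimated_memory_gb(paths, gb_per_gb=2.0):
    """Estimate memory need in GB with the 2 GB per 1 GB rule of thumb.

    Assumption: 1 GB is counted as 1024 ** 3 bytes.
    """
    total_bytes = sum(uncompressed_size(p) for p in paths)
    return gb_per_gb * total_bytes / 1024 ** 3
```

If the estimate exceeds the available memory, using only part of the reads (see the -split option) is one way to reduce it.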
Software
For the portable version, nothing needs to be installed manually.
For installation with pip, Python is required. Note that the Python version should be 3.6 or higher.
:white_check_mark: All third-party dependencies are installed automatically when an Internet connection is available, including biopython, matplotlib, coloredlogs, graphviz (Python packages), and perl (for Windows only), NOVOPlasty, and BLAST.
Installation
Portable
Download from the link, unpack, and run with an Internet connection the first time.
Install with pip
- Install Python. Version 3.6 or newer is required.
- Open a command line and run
pip install novowrap --user
# Windows
python -m novowrap init
# Linux and MacOS
python3 -m novowrap init
Usage
Command line
:exclamation: On Linux and MacOS, Python 2 is python2 and Python 3 is python3. However, on Windows, Python 3 is called python. Please note the difference.
- Show help information of each module
# Windows
python -m novowrap.assembly -h
python -m novowrap.validate -h
python -m novowrap.merge -h
# Linux and MacOS
python3 -m novowrap.assembly -h
python3 -m novowrap.validate -h
python3 -m novowrap.merge -h
- Assemble and validate
# Windows, single sample
python -m novowrap -input [input1] [input2] -taxon [taxonomy]
# Windows, batch mode for numerous samples
python -m novowrap -list [list file]
# Linux and MacOS, single sample
python3 -m novowrap -input [input1] [input2] -taxon [taxonomy]
# Linux and MacOS, batch mode for numerous samples
python3 -m novowrap -list [list file]
- Only validate
# Windows
python -m novowrap.validate -input [input file] -taxon [taxonomy]
# Linux and MacOS
python3 -m novowrap.validate -input [input file] -taxon [taxonomy]
- Only merge
# Windows
python -m novowrap.merge -input [input file]
# Linux and MacOS
python3 -m novowrap.merge -input [input file]
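When the commands above are launched from a Python script, the python/python3 naming difference can be avoided by using the interpreter that runs the script. The following is a minimal sketch under that assumption. The helper build_command is hypothetical and not part of novowrap, and it uses only the -input, -taxon and -out options described in this document.

```python
import sys


def build_command(inputs, taxon=None, out=None, module="novowrap"):
    """Build an argument list for `python -m novowrap` without running it."""
    if not 1 <= len(inputs) <= 2:
        raise ValueError("expected one (single-end) or two (paired-end) input files")
    # sys.executable is the interpreter running this script, so the same code
    # works whether that interpreter is called python or python3.
    command = [sys.executable, "-m", module, "-input", *inputs]
    if taxon:
        # A taxonomy name may contain spaces; each word becomes one argument,
        # which matches typing `-taxon Oryza sativa` in a shell.
        command += ["-taxon", *taxon.split()]
    if out:
        command += ["-out", out]
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Passing a list rather than a single string also avoids shell quoting problems with unusual file names.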
Graphical user interface
If installed with pip, run
# Windows
python -m novowrap
# Linux and MacOS
python3 -m novowrap
If using the portable version, just double-click novowrap.exe or novowrap in the folder.
Then click a button to choose which module to use. Note that if any option is set to an invalid value, the program will refuse to run and prompt the user to correct it.
Input
The assembly module accepts gz or fastq format as input. If an input list is used, the list file should be in csv format. If a reference file is used instead of one retrieved automatically from NCBI, the file should be in genbank format.
The merge module accepts fasta format as input.
The validate module accepts fasta format as input. If a reference file is used instead of one retrieved automatically from NCBI, the file can be in genbank or fasta format, as long as it is a complete chloroplast genome.
Output
.gb files: sequences in genbank format, with the boundaries of the LSC/SSC/IR regions annotated.
.rotate files: rotated sequences in fasta format, starting from trnH-psbA and in the same direction as the reference.
.pdf files: figures of the assembly validation.
_RC_ files: if a filename contains _RC_, one of the regions of the sequence was adjusted according to the reference. The unadjusted sequence can be found in the Temp folder.
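The naming conventions above can be used to sort through an output folder in a script. The following is a minimal sketch that relies only on the suffixes and the _RC_ marker listed here. The function describe_output and the example filename are hypothetical.

```python
from pathlib import Path


def describe_output(filename):
    """Return short descriptions of an output file, based only on its name."""
    path = Path(filename)
    notes = []
    if path.suffix == ".gb":
        notes.append("genbank sequence with LSC/SSC/IR boundary annotation")
    elif path.suffix == ".rotate":
        notes.append("rotated fasta sequence")
    elif path.suffix == ".pdf":
        notes.append("validation figure")
    if "_RC_" in path.name:
        # The unadjusted sequence is kept in the Temp folder.
        notes.append("a region was adjusted according to the reference")
    return notes
```

For example, describe_output("sample_RC_1.gb") reports both the file type and the adjustment note, while an unrecognized name yields an empty list.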
Options
Assembly
General
These options are for general usage.
-h or -help: print help message
-input [filenames]: input filenames, either single-end or paired-end; gz and fastq formats are supported
-list [filenames]: input list for batch mode. The list should be a csv file with three columns,
Input 1,Input 2,Taxonomy
If there is only one input file, leave the Input 2 column empty.
Please use full paths for file names, for instance, d:\data\sample-1\forward.fastq instead of forward.fastq or sample-1\forward.fastq.
-ref [filename]: reference file for assembly and validation. It should be a genbank format file that contains only one chloroplast genome sequence; extra sequences will be ignored. For automatic running, -taxon is recommended
-taxon [taxonomy name]: taxonomy name of the sample; spaces are allowed. For instance, -taxon Oryza sativa will find the reference chloroplast genome of _Oryza sativa_ in the NCBI RefSeq database. If none is found, the program looks for the reference of the most closely related taxon (Oryza, Poaceae, Poales, etc.).
If neither -ref nor -taxon is set, the program uses _Nicotiana tabacum_ to get the reference (it is one of the earliest sequenced chloroplast genomes)
-out [folder name]: output folder. If not set, the program creates it automatically according to the input file's name
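For batch mode, the list file described under -list can be generated with a short script instead of by hand. The following is a minimal sketch. The function write_list_file is hypothetical, and it assumes that the first row of the csv file is the header line Input 1,Input 2,Taxonomy, which this document does not state explicitly. Paths are converted to full paths, as recommended above.

```python
import csv
from pathlib import Path


def write_list_file(samples, list_path):
    """Write a batch list with the columns Input 1, Input 2, Taxonomy.

    `samples` is an iterable of (input_1, input_2, taxonomy) tuples;
    pass "" or None as input_2 for single-end data.
    """
    with open(list_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Assumption: the first row is a header naming the three columns.
        writer.writerow(["Input 1", "Input 2", "Taxonomy"])
        for input_1, input_2, taxonomy in samples:
            # Use full paths; an empty second input stays empty.
            files = [str(Path(f).resolve()) if f else "" for f in (input_1, input_2)]
            writer.writerow([*files, taxonomy])
```

The resulting file can then be passed to the program with -list. Since spaces in folder names or filenames may cause invalid-input errors, it is worth checking the written paths for them.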
Advanced
These options are for advanced usage. If not sure, just keep the default value.
-platform [illumina/ion]: sequencing platform, the default is illumina. For Ion Torrent data, set -platform ion
-insert_size [number]: the insert size of the sequencing library, should be an integer
-seed [names]: gene names used as seeds for assembly, separated by commas. The default seeds are rbcL,psaB,psaC,rrn23
-seed_file [filename]: seed file, overrides the -seed option
-split [number]: split the input file and use only [number] reads, useful for large data when computer memory is limited. For instance, -split 10000000 will only use 10 million reads
-kmer [number]: kmer size for assembly, should be an odd number. Most of the time it is unnecessary to change it
-min [number]: minimum genome size, default is 100 kB
-max [number]: maximum genome size, default is 200 kB. Only change -min and -max if the target genome size is outside the default range. The program does not need to know the precise size of the genome
-mem [number]: memory limit in GB, should be an integer. For instance, -mem 8 will limit the memory usage to 8 GB
-perc_identity [number]: minimum percent identity threshold, used for validation with BLAST. The default value is 0.7. Should be a float between 0 and 1.
-len_diff [number]: maximum allowed length difference between query and reference, as a fraction. Used for eliminating invalid assembly results: if the length difference between the assembly and the reference genome is larger than this value, the assembly result is discarded. The default value is 0.2. Should be a float between 0 and 1.
-debug: print debug information if set
-mt: for mitochondria genomes (experimental function)
-simple_validate: for chloroplast genomes without quadripartite structure
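To illustrate what limiting the number of reads with -split means, the following sketch copies a fixed number of reads from a FASTQ file. It is not novowrap's implementation. It assumes that the first reads of the file are the ones kept and that every record occupies exactly four lines, and the function name take_reads is hypothetical.

```python
import gzip
from itertools import islice


def take_reads(source, target, n_reads):
    """Copy the first n_reads records of a FASTQ file; return the number copied."""
    opener = gzip.open if str(source).endswith(".gz") else open
    line_count = 0
    with opener(source, "rt") as reader, open(target, "w") as writer:
        # Assumption: each record is exactly four lines (no wrapped sequences).
        for line in islice(reader, 4 * n_reads):
            writer.write(line)
            line_count += 1
    return line_count // 4
```

For paired-end data, the same number of reads would have to be taken from both files so that the pairs stay in sync.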
Validate
General
-h or -help: print help message
-input [filename]: input filename. Only fasta format is supported
-ref [filename]: reference file for assembly and validation. It should be a genbank or fasta format file that contains only one chloroplast genome sequence; extra sequences will be ignored.
-taxon [taxonomy name]: taxonomy name of the reference's species; spaces are allowed. Using the same genus or family is recommended, or a higher rank if the chloroplast genome of the target taxon is well known to be conserved.
-out [folder name]: output folder. If not set, the program creates it automatically according to the input file's name
Advanced
-perc_identity [number]: minimum percent identity threshold, used for validation with BLAST. The default value is 0.7. Should be a float between 0 and 1.
-len_diff [number]: maximum allowed length difference between query and reference, as a fraction. Used for eliminating invalid assembly results: if the length difference between the assembly and the reference genome is larger than this value, the assembly result is discarded. The default value is 0.2. Should be a float between 0 and 1.
-debug: print debug information if set
-mt: for mitochondria genomes (experimental function)
-simple_validate: for chloroplast genomes without quadripartite structure
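The -len_diff filter described above can be pictured as a simple relative-length check. The following is a minimal sketch. It assumes that the difference is measured relative to the reference length, which this document does not specify, and the function name length_ok is hypothetical.

```python
def length_ok(assembly_length, reference_length, len_diff=0.2):
    """Return True if the assembly length is acceptably close to the reference."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    if not 0 <= len_diff <= 1:
        raise ValueError("len_diff must be between 0 and 1")
    # Assumption: the difference is taken relative to the reference length.
    relative_difference = abs(assembly_length - reference_length) / reference_length
    return relative_difference <= len_diff
```

With the default of 0.2 and a 155,000 bp reference, an assembly of 156,000 bp passes, while one of 100,000 bp (about 35% shorter) would be discarded.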
Merge
-h or -help: print help message
-input [filename]: input filename. Only fasta format is supported
-out [folder name]: output folder. If not set, the program creates it automatically according to the input file's name
Performance
The most time-consuming step is assembly. If the sequencing data contain enough chloroplast reads and the computer has enough memory for the data size, the assembly finishes in minutes.
The validation step usually finishes in less than one minute. If it is slower, please check the Internet connection, since the program may query the NCBI database.
The merge module may take seconds or minutes, depending on the input data. Complex relationships among contigs require much more time.
Citation
As yet unpublished.
License
The software itself is licensed under AGPL-3.0 (third-party software not included).
Q&A
Please submit your questions on the Issue page :smiley:
-
Q: I can't see the full UI, some parts are missing.
A: Please try dragging the corner of the window to enlarge it. We have received reports that some MacOS users have this issue.
-
Q: I got an error message that the program failed to install perl/BLAST/NOVOPlasty.
A: Occasionally, users in certain regions have connection issues with those websites. In that case, download and install the packages manually (see Software for the download links).
For Windows users, please download and unpack the files into %HOMEDRIVE%%HOMEPATH%/.novowrap.
For Linux and MacOS users, please download and unpack the files into ~/.novowrap.
-
Q: I got an error message that I don't have the tkinter module installed.
A: This error may happen if you run the GUI on a Linux computer, because the Python you used did not include tkinter as a default package. Running
# Debian and Ubuntu
sudo apt install python3-tk
# CentOS
sudo yum install python3-tkinter
may help.
-
Q: It says my input is invalid, but I'm sure it's OK!
A: Please check the paths of your files. A space character in the folder name or filename may cause this error.
-
Q: It says ImportError: Bio.Alphabet has been removed from Biopython and the program failed to start.
A: In September 2020, Biopython removed the Bio.Alphabet module in v1.78, which may cause this problem in old versions of novowrap. Please upgrade novowrap to v0.97 or higher. If you find it difficult to upgrade novowrap, please try the portable packages.
-
Q: I want to assemble mitochondria genomes.
A: Add the -mt option or click the related checkbutton in the GUI. Since mitochondria genomes do not have a stable and uniform structure like chloroplast genomes, wet lab experiments may be necessary for verification.
-
Q: I want to assemble chloroplast genomes without quadripartite structure.
A: Add the -simple_validate option on the command line or click the related checkbutton in the GUI. Note that without quadripartite structure, the Validate module will skip the adjustment of the structure of the sequences.