
Assembly Chloroplast Genome



Quick start

Download the package, unzip it, and then double-click novowrap.exe or novowrap.

OR

Open a terminal and run

# Install, using pip (recommended)
pip install novowrap --user
# Or, use conda
conda install -c wpwupingwp novowrap

# Initialize (requires an Internet connection)
# Windows
python -m novowrap init
# Linux and MacOS
python3 -m novowrap init

# Run
# Windows
python -m novowrap
# Linux and MacOS
python3 -m novowrap


Feature

:heavy_check_mark: Assemble chloroplast genomes from given NGS data, with minimal parameters to set. Batch mode is also supported.

Automatically generate a uniform conformation consistent with the reference (typically, starting from trnH-psbA, with the SSC/LSC regions in the same direction as the reference).

:heavy_check_mark: Merge contigs according to their overlapping regions. May handle inverted-repeat fragments.

:heavy_check_mark: Validate assembly results by comparing synteny and sequence homology with a given reference (or taxonomy name).

Prerequisite

Hardware

The assembly function will call NOVOPlasty, which requires 2 GB memory for 1 GB uncompressed data.
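The rule of thumb above can be turned into a quick estimate before starting a run. The sketch below is only an illustration: the function name `estimate_memory_gb` is hypothetical and not part of novowrap, and it assumes you already know the uncompressed size of your reads (the on-disk size of a .gz file will underestimate it).

```python
def estimate_memory_gb(uncompressed_bytes: int, gb_per_gb: float = 2.0) -> float:
    """Rough memory estimate in GB for the assembly step.

    Uses the 2 GB of memory per 1 GB of uncompressed data figure
    quoted in this README; the helper itself is hypothetical.
    """
    data_gb = uncompressed_bytes / 1024 ** 3
    return data_gb * gb_per_gb


# Example: 3 GB of uncompressed reads
print(estimate_memory_gb(3 * 1024 ** 3))
```

If the estimate exceeds the memory available, the -split and -mem options described under Options may help.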

The other functions can run on ordinary computers and have no extra requirements for memory, CPU, etc.

The software requires an Internet connection on the first run to install missing dependencies. After that it can work offline, but it works better with a connection.

Software

For the portable version, nothing needs to be installed manually.

Installing with pip requires Python 3.6 or higher.

:white_check_mark: All third-party dependencies will be installed automatically when an Internet connection is available, including biopython, matplotlib, coloredlogs, graphviz (Python packages), perl (for Windows only), NOVOPlasty, and BLAST.

Installation

Portable

Download the package from the link, unpack it, and run it with an Internet connection the first time.

Install with pip

  1. Install Python. 3.6 or newer is required.

  2. Open command line, run

pip install novowrap --user

Install with conda

After installing Anaconda or Miniconda, run

conda install -c wpwupingwp novowrap

Initialization

During the first run, novowrap will check and initialize the running environment. Missing dependencies will be installed automatically. This step requires an Internet connection.

# Windows
python -m novowrap init
# Linux and MacOS
python3 -m novowrap init

Usage

Command line

:exclamation: On Linux and MacOS, Python 2 is invoked as python2 and Python 3 as python3. On Windows, however, Python 3 is invoked as python. Please note the difference.

  • Show help information of each module
# Windows
python -m novowrap.assembly -h
python -m novowrap.validate -h
python -m novowrap.merge -h
# Linux and MacOS
python3 -m novowrap.assembly -h
python3 -m novowrap.validate -h
python3 -m novowrap.merge -h
  • Assemble and validate
# Windows, single sample
python -m novowrap -input [input1] [input2] -taxon [taxonomy]
# Windows, batch mode for numerous samples
python -m novowrap -list [list file]
# Linux and MacOS, single sample
python3 -m novowrap -input [input1] [input2] -taxon [taxonomy]
# Linux and MacOS, batch mode for numerous samples
python3 -m novowrap -list [list file]
  • Only validate
# Windows
python -m novowrap.validate -input [input file] -taxon [taxonomy]
# Linux and MacOS
python3 -m novowrap.validate -input [input file] -taxon [taxonomy]
  • Only merge
# Windows
python -m novowrap.merge -input [input file]
# Linux and MacOS
python3 -m novowrap.merge -input [input file]
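When novowrap is driven from another Python script, the python/python3 naming difference can be sidestepped by reusing the interpreter that is already running. The following is a minimal sketch under that assumption: `build_command` and `run_novowrap` are hypothetical helper names, and only the -input and -taxon options shown above are used. The taxonomy name is split into words to mirror the command-line form `-taxon Oryza sativa`.

```python
import subprocess
import sys


def build_command(inputs, taxon=None, module="novowrap"):
    """Build an argument list equivalent to the commands shown above.

    sys.executable is the current interpreter, so the same code works
    whether Python 3 is called python or python3 on this system.
    """
    cmd = [sys.executable, "-m", module, "-input", *inputs]
    if taxon:
        # "Oryza sativa" becomes two arguments, as when typed in a shell
        cmd += ["-taxon", *taxon.split()]
    return cmd


def run_novowrap(inputs, taxon=None, module="novowrap"):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(inputs, taxon, module), check=True)


# Example: only build the command; the file names are placeholders
print(build_command(["forward.fastq", "reverse.fastq"], "Oryza sativa"))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems, which matters when paths contain unusual characters.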

Graphical user interface

If installed with pip,

# Windows
python -m novowrap
# Linux and MacOS
python3 -m novowrap

If you use the portable version, just double-click novowrap.exe or novowrap in the folder.

Then click the button to choose which module to use. Note that if any option is set to an invalid value, the program will refuse to run and prompt the user to correct it.

Input

The assembly module accepts gz or fastq files as input. If an input list is used, the list file should be in csv format. If a reference file is supplied instead of being fetched automatically from NCBI, it should be in genbank format.

The merge module accepts fasta format as input.

The validate module accepts fasta format as input. If a reference file is supplied instead of being fetched automatically from NCBI, it should be in genbank or fasta format and contain a complete chloroplast genome.

Output

.gb files: sequences in genbank format, with annotations of the boundaries of the LSC/SSC/IR regions.

.rotate files: rotated sequences in fasta format, starting from trnH-psbA and in the same direction as the reference.

.pdf files: figures showing the validation of the assembly.

_RC_ files: if a filename contains _RC_, one of the regions of the sequence was adjusted according to the reference. The unadjusted sequence can be found in the Temp folder.
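To make "rotated" and "adjusted" concrete: a chloroplast genome is circular, so the same molecule can be written starting at any position, and a region can be written in either orientation. The sketch below shows the two underlying string operations only. It is not novowrap's implementation, the function names are hypothetical, and how novowrap locates trnH-psbA or decides which region to flip is not covered here.

```python
# Complement table for unambiguous bases; other characters pass through unchanged
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def rotate(seq: str, start: int) -> str:
    """Rewrite a circular sequence so that it begins at index `start`."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def reverse_complement(seq: str) -> str:
    """Return the sequence as read from the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]


print(rotate("AACCGGTT", 2))       # CCGGTTAA
print(reverse_complement("AACG"))  # CGTT
```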

Options

Assembly

General

These options are for general usage.

-h or -help: print help message

-input [filenames]: input filenames, either single-end or paired-end; gz and fastq formats are supported

-list [filenames]: input list for batch mode. The list should be a csv file with three columns:

Input 1,Input 2,Taxonomy

If there is only one input file, leave the Input 2 column empty.

Please use the full path of each file, for instance, d:\data\sample-1\forward.fastq instead of forward.fastq or sample-1\forward.fastq.

-ref [filename]: reference file for assembly and validation. It should be in genbank format and contain only one chloroplast genome sequence; extra sequences will be ignored. For automatic runs, -taxon is recommended

-taxon [taxonomy name]: taxonomy name of the sample; spaces are allowed. For instance, -taxon Oryza sativa will look for the reference chloroplast genome of _Oryza sativa_ in the NCBI RefSeq database. If none is found, the program falls back to the reference of the most closely related taxon: Oryza, Poaceae, Poales, and so on.

If neither -ref nor -taxon is set, _Nicotiana tabacum_ is used to get the reference (it is one of the earliest sequenced chloroplast genomes).

-out [folder name]: output folder; if not set, the program creates one automatically based on the input file's name
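For batch mode, the list file described under -list can be generated with Python's csv module rather than by hand. This is a sketch under two assumptions that the text above does not settle: that the first row is the header `Input 1,Input 2,Taxonomy`, and that an empty string is an acceptable way to leave Input 2 blank. The helper name `format_list` and the sample paths are hypothetical.

```python
import csv
import io


def format_list(rows):
    """Return csv text with three columns: Input 1, Input 2, Taxonomy.

    `rows` is an iterable of (input1, input2, taxonomy) tuples; pass an
    empty string or None as input2 for single-end data.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Assumption: the list file starts with this header row
    writer.writerow(["Input 1", "Input 2", "Taxonomy"])
    for input1, input2, taxonomy in rows:
        writer.writerow([input1, input2 or "", taxonomy])
    return buffer.getvalue()


samples = [
    ("/data/sample-1/forward.fastq", "/data/sample-1/reverse.fastq", "Oryza sativa"),
    ("/data/sample-2/single.fastq", None, "Poaceae"),
]
print(format_list(samples))
```

To save the result, write the returned text with `open(path, "w", newline="")` and pass that file to -list. Full paths can be obtained with `pathlib.Path(name).resolve()`.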

Advanced

These options are for advanced usage. If not sure, just keep the default value.

-platform [illumina/ion]: sequencing platform; the default is illumina. For Ion Torrent data, set -platform ion

-insert_size [number]: the insert size of the sequencing library; should be an integer

-seed [names]: gene names to use as seeds for assembly, separated by commas; the default seeds are rbcL,psaB,psaC,rrn23

-seed_file [filename]: seed file; overrides the -seed option

-split [number]: split the input file and use only [number] reads; useful for large data when computer memory is limited. For instance, -split 10000000 will use only 10 million reads

-kmer [number]: kmer size for assembly; should be an odd number. Most of the time it is unnecessary to change it

-min [number]: minimum genome size, default is 100 kB

-max [number]: maximum genome size, default is 200 kB. Only change -min and -max if the target genome size is outside the default range. The program does not need to know the precise size of the genome

-mem [number]: memory limit in GB. For instance, -mem 8 will limit memory usage to 8 GB. Should be an integer

-perc_identity [number]: minimum identity threshold, used for validation with BLAST. The default value is 0.7. Should be a float between 0 and 1.

-len_diff [number]: maximum allowed length difference between query and reference, expressed as a fraction. Used to eliminate invalid assembly results: if the difference in sequence length between the assembly and the reference genome is larger than this value, the assembly result is discarded. The default value is 0.2. Should be a float between 0 and 1.

-debug: print debug information if set

-mt: for mitochondrial genomes (experimental function)

-simple_validate: for chloroplast genomes without quadripartite structure
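The idea behind -split, using only part of a large read file, can be illustrated with a few lines of Python. This is a sketch of the concept, not of how novowrap performs the split. It assumes FASTQ records of exactly four lines each (no wrapped sequences), and the helper names are hypothetical.

```python
import gzip
from itertools import islice


def open_reads(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def head_reads(lines, n_reads):
    """Yield the lines of the first `n_reads` records (4 lines per record)."""
    return islice(lines, n_reads * 4)


# Example with in-memory data instead of a real file
records = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "TTGA\n", "+\n", "IIII\n",
]
print("".join(head_reads(records, 1)), end="")
```

Because `islice` consumes the input lazily, the same function works on an open file handle without loading the whole file into memory.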

Validate

General

-h or -help: print help message

-input [filename]: input filename. Only fasta format is supported

-ref [filename]: reference file for assembly and validation. It should be in genbank or fasta format and contain only one chloroplast genome sequence; extra sequences will be ignored.

-taxon [taxonomy name]: taxonomy name of the reference species; spaces are allowed. Using the same genus or family is recommended, or a higher rank if the chloroplast genome of the target taxon is well known to be conserved.

-out [folder name]: output folder; if not set, the program creates one automatically based on the input file's name

Advanced

-perc_identity [number]: minimum identity threshold, used for validation with BLAST. The default value is 0.7. Should be a float between 0 and 1.

-len_diff [number]: maximum allowed length difference between query and reference, expressed as a fraction. Used to eliminate invalid assembly results: if the difference in sequence length between the assembly and the reference genome is larger than this value, the assembly result is discarded. The default value is 0.2. Should be a float between 0 and 1.

-debug: print debug information if set

-mt: for mitochondrial genomes (experimental function)

-simple_validate: for chloroplast genomes without quadripartite structure
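The -len_diff filter can be pictured as a simple ratio test. The sketch below is one plausible reading of the description above, not the program's actual code: it assumes the difference is taken relative to the reference length, and that a difference exactly equal to the threshold is still accepted (the text only says "larger than"). Both function names are hypothetical.

```python
def length_difference(assembly_len: int, reference_len: int) -> float:
    """Relative length difference; dividing by the reference length is an assumption."""
    return abs(assembly_len - reference_len) / reference_len


def passes_length_check(assembly_len: int, reference_len: int,
                        len_diff: float = 0.2) -> bool:
    """Keep the assembly unless the difference is larger than `len_diff`."""
    return length_difference(assembly_len, reference_len) <= len_diff


# A 150 kb assembly against a 155 kb reference differs by about 3%
print(passes_length_check(150_000, 155_000))
# A 100 kb assembly against the same reference differs by about 35%
print(passes_length_check(100_000, 155_000))
```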

Merge

-h or -help: print help message

-input [filename]: input filename. Only fasta format is supported

-out [folder name]: output folder; if not set, the program creates one automatically based on the input file's name
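Merging contigs by their overlapping regions can be illustrated with a toy example. The sketch below joins two sequences when the end of one exactly matches the start of the other. It is a deliberately simplified illustration with hypothetical function names: it uses exact string matching and a single orientation, and it does not attempt the handling of inverted-repeat fragments mentioned under Feature. The method used by the merge module itself is not described in this document.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left: str, right: str, min_overlap: int = 3):
    """Join two contigs on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


print(merge_pair("ACGTTGCA", "TGCAGGTC", min_overlap=4))  # ACGTTGCAGGTC
```

Trying every pair of contigs in both orders grows quickly with the number of contigs, which gives some intuition for why complex contig sets take longer to merge.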

Performance

The most time-consuming step is assembly. If the sequencing data contains enough chloroplast reads and the computer has enough memory for the data size, the assembly will finish in minutes.

The validation step usually finishes in less than one minute. If it is slower, please check the Internet connection, since the program may query the NCBI database.

The merge module may take seconds or minutes, depending on the input data. Complex relationships between contigs require much more time.

Citation

As yet unpublished.

License

The software itself is licensed under AGPL-3.0 (third-party software not included).

Q&A

Please submit your questions on the Issue page :smiley:

  • Q: I can't see the full UI; part of it is missing.

    A: Please try dragging the corner of the window to enlarge it. Some MacOS users have reported this issue.

  • Q: I got an error message that the program failed to install perl/BLAST/NOVOPlasty.

    A: Occasionally, users in certain regions have trouble connecting to those websites. In that case, download and install the packages manually (see Software for the download links).

    For Windows users, please download and unpack the files into %HOMEDRIVE%%HOMEPATH%/.novowrap.

    For Linux and MacOS users, please download and unpack the files into ~/.novowrap.

  • Q: I got an error message saying that the tkinter module is not installed.

    A: This error may occur when running the GUI on Linux, because some Python installations do not include tkinter by default. Running one of the following commands may help:

    # Debian and Ubuntu
    sudo apt install python3-tk
    # CentOS
    sudo yum install python3-tkinter

  • Q: It says my input is invalid, but I'm sure it's OK!

    A: Please check your file paths. A space character in a folder name or filename may cause this error.

  • Q: It says ImportError: Bio.Alphabet has been removed from Biopython and the program failed to start.

    A: In September 2020, Biopython removed the Bio.Alphabet module in v1.78, which may cause this problem in old versions of novowrap. Please upgrade novowrap to v0.97 or higher. If you find it difficult to upgrade novowrap, please try the portable packages.

  • Q: I want to assemble mitochondrial genomes.

    A: Add the -mt option or tick the corresponding checkbox in the GUI. Since mitochondrial genomes do not have a stable and uniform structure like chloroplast genomes, wet lab experiments may be necessary for verification.

  • Q: I want to assemble chloroplast genomes without quadripartite structure.

    A: Add the -simple_validate option on the command line or tick the corresponding checkbox in the GUI. Note that without quadripartite structure, the Validate module will skip adjusting the structure of the sequences.

  • Q: I am a conda user...

    A: Install novowrap with conda install -c wpwupingwp novowrap; the usage is the same. To avoid potential conflicts with other packages, it is highly recommended to create a new conda environment before installation. For example,

    conda create -n test
    conda activate test
    conda install -c wpwupingwp novowrap
    

