mpwt · PyPI

Multiprocessing for Pathway Tools

doi: 10.1101/803056

Pathway Tools: 24.0

mpwt: Multiprocessing Pathway Tools

mpwt is a Python package for running Pathway Tools on multiple genomes using multiprocessing.

There is no guarantee that this script will work; it is an early-stage work in progress.

mpwt: Pipeline summary

[Figure: main arguments of mpwt]

mpwt needs at least Python 3.6. mpwt requires three Python dependencies (biopython, docopt and gffutils) and Pathway Tools. For the multiprocessing, mpwt uses the multiprocessing library of Python 3.

You must have an environment where Pathway Tools is installed. Pathway Tools can be obtained here. The latest version supported by mpwt is shown in the Pathway Tools badge.

Pathway Tools needs Blast, so it must be installed on your system. Depending on your system, Pathway Tools needs a file named .ncbirc to locate Blast; for more information, look at this page.

/!\ For all OS, Pathway Tools must be in $PATH.

On Linux and macOS:

export PATH=$PATH:your/install/directory/pathway-tools

Consider adding Pathway Tools to $PATH permanently by running:

echo 'export PATH="$PATH:your/install/directory/pathway-tools"' >> ~/.bashrc
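Before launching mpwt, it can help to confirm from Python that the export above took effect. The following is a minimal sketch, assuming the Pathway Tools launcher is an executable named pathway-tools (an assumption based on the install directory name used above); the helper name find_pathway_tools is hypothetical and not part of mpwt.

```python
import shutil


def find_pathway_tools(executable="pathway-tools"):
    """Return the full path of the executable if it is reachable via PATH, else None.

    The default executable name is an assumption; adjust it to your installation.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_pathway_tools()
    if location is None:
        print("pathway-tools was not found in PATH; check your export line.")
    else:
        print(f"pathway-tools found at {location}")
```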

If your OS doesn’t support Pathway Tools, you can use a Docker container. In that case, look at Pathway Tools Multiprocessing Docker. It is a Dockerfile that creates a container with Pathway Tools, its dependencies and this package. You just need to give a Pathway Tools installer as input.

You can also look at Pathway Tools Multiprocessing Singularity. More manipulations are required compared to Docker but with this you can create a Singularity image.

Using pip

pip install mpwt

Use

Input data

The script takes a folder containing sub-folders as input. Each sub-folder contains a Genbank/GFF file or multiple PathoLogic Format (PF) files.

Folder_input
├── species_1
│   └── species_1.gbk
├── species_2
│   └── species_2.gff
│   └── species_2.fasta
├── species_3
│   └── species_3.gbk
├── species_4
│   └── scaffold_1.pf
│   └── scaffold_1.fasta
│   └── scaffold_2.pf
│   └── scaffold_2.fsa
taxon_id.tsv
..

Input files must have the same name as the folder in which they are located and must end with .gbk/.gbff or .gff.

For PF files, there is one file for each scaffold/contig and one corresponding fasta file.

Pathway Tools will run on each Genbank/GFF/PF file. It will create the results in the ptools-local folder, but you can also choose an output folder.
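The naming rules above can be checked before a run. The sketch below is not part of mpwt: the function names are hypothetical, and it only encodes the rules stated in this section (a Genbank or GFF file named after its folder, a .fasta/.fsa file next to a GFF file, and one sequence file per PF file).

```python
from pathlib import Path


def has_sequence(folder, stem):
    """True if the folder holds stem.fasta or stem.fsa."""
    return any((folder / f"{stem}{ext}").is_file() for ext in (".fasta", ".fsa"))


def classify_input_folder(input_folder):
    """Map each sub-folder name to 'gbk', 'gff', 'pf', or None when no rule matches."""
    result = {}
    for sub in sorted(Path(input_folder).iterdir()):
        if not sub.is_dir():
            continue  # plain files such as taxon_id.tsv are ignored
        name = sub.name
        pf_files = sorted(sub.glob("*.pf"))
        if any((sub / f"{name}{ext}").is_file() for ext in (".gbk", ".gbff")):
            result[name] = "gbk"
        elif (sub / f"{name}.gff").is_file() and has_sequence(sub, name):
            result[name] = "gff"
        elif pf_files and all(has_sequence(sub, pf.stem) for pf in pf_files):
            result[name] = "pf"
        else:
            result[name] = None  # nothing usable was recognised
    return result
```

A sub-folder mapped to None is a good candidate to inspect by hand before starting a long run.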

Genbank

Folder_input
├── species_1
│   └── species_1.gbk
..

Genbank file example:

LOCUS       scaffold1         XXXXXX bp    DNA     linear   INV DD-MMM-YYYY
DEFINITION  My species genbank.
ACCESSION   scaffold1
VERSION     scaffold1
KEYWORDS    Key words.
SOURCE      Source
ORGANISM  Species name
            Taxonomy; Of; My; Species; With;
            The; Genus.
FEATURES             Location/Qualifiers
    source          1..XXXXXX
                    /scaffold="scaffold1"
                    /db_xref="taxon:taxonid"
    gene            START..STOP
                    /locus_tag="gene1"
    mRNA            START..STOP
                    /locus_tag="gene1"
    CDS             START..STOP
                    /locus_tag="gene1"
                    /db_xref="InterPro:IPRXXXXXX"
                    /go_component="GO:XXXXXXX"
                    /EC_number="X.X.X.X"
                    /translation="AMINOACIDSEQUENCE"

Look at the NCBI GBK format for more information. You can also look at the example provided on the Pathway Tools site.

GFF

Folder_input
├── species_2
│   └── species_2.gff
│   └── species_2.fasta
..

GFF file example:

##gff-version 3
##sequence-region scaffold_1 1 XXXXXX
scaffold_1  RefSeq  region  1       XXXXXX  .       +       .       ID=region_id;Dbxref=taxon:XXXXXX
scaffold_1  RefSeq  gene    START   STOP    .       -       .       ID=gene_id
scaffold_1  RefSeq  CDS     START   STOP    .       -       0       ID=cds_id;Parent=gene_id;ec_number=X.X.X.X

Warning: metabolic networks built from GFF files seem to have fewer reactions/pathways/compounds than metabolic networks built from Genbank or PathoLogic files. Missing annotations (EC, GO) may explain these differences.

Look at the NCBI GFF format for more information.

You have to provide a nucleotide sequence file (with either the ‘.fasta’ or the ‘.fsa’ extension) associated with the GFF file and containing the chromosome/scaffold/contig sequence.

>scaffold_1
ATGATGCTGATACTGACTTAGCAT
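Since the GFF file and its sequence file are linked by sequence name (scaffold_1 in the examples above), a quick consistency check can catch mismatches early. This is a minimal sketch with hypothetical function names, assuming that the first column of each GFF feature line should appear as a FASTA header; it is not how mpwt itself validates input.

```python
def gff_sequence_ids(gff_text):
    """Collect the seqid (first column) of every feature line in a GFF3 text."""
    ids = set()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        ids.add(line.split()[0])
    return ids


def fasta_sequence_ids(fasta_text):
    """Collect the identifier that follows '>' on every FASTA header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def missing_sequences(gff_text, fasta_text):
    """Return the sorted seqids used in the GFF that have no FASTA record."""
    return sorted(gff_sequence_ids(gff_text) - fasta_sequence_ids(fasta_text))
```

An empty list means every sequence named in the GFF file has a matching FASTA record.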

PathoLogic Format

Folder_input
├── species_4
│   └── scaffold_1.pf
│   └── scaffold_1.fasta
│   └── scaffold_2.pf
│   └── scaffold_2.fsa
taxon_id.tsv
..

PF file example:

;;;;;;;;;;;;;;;;;;;;;;;;;
;; scaffold_1
;;;;;;;;;;;;;;;;;;;;;;;;;
ID  gene_id
NAME        gene_id
STARTBASE   START
ENDBASE     STOP
FUNCTION    ORF
PRODUCT-TYPE        P
PRODUCT-ID  prot gene_id
EC  X.X.X.X
DBLINK      GO:XXXXXXX
INTRON      START1-STOP1
//

Look at the PathoLogic format for more information.

You have to provide one nucleotide sequence file (with either the ‘.fasta’ or the ‘.fsa’ extension) for each PathoLogic file, each containing one scaffold/contig.

>scaffold_1
ATGATGCTGATACTGACTTAGCAT

To add the taxon ID, mpwt also needs the taxon_id.tsv file (a tsv file with two columns: the name of the folder containing the PF files and the corresponding taxon ID).

species	taxon_id
species_4	4

If you don’t have taxon ID in your Genbank or GFF file, you can add one in this file for the corresponding species.

You can also add more information about the genetic elements, such as the circularity of the genome (Y or N), the type of genetic element (:CHRSM, :PLASMID, :MT (mitochondrial chromosome), :PT (chloroplast chromosome), or :CONTIG) or the codon table (see the corresponding codes below).

Example:

species	taxon_id	circular	element_type	codon_table	corresponding_file
species_1	10	Y	:CHRSM	1
species_4	4	N	:CHRSM	1	scaffold_1
species_4	4	N	:MT	1	scaffold_2

As you can see, for PF files (species_4) you can use the corresponding_file column to add information for each PF file.
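The taxon_id.tsv file can also be generated from a script. Here is a minimal sketch using the standard csv module and the column names of the example above; the function name write_taxon_file is hypothetical, and columns left out of a row are written as empty fields.

```python
import csv

# Column names taken from the example table above.
COLUMNS = ["species", "taxon_id", "circular", "element_type",
           "codon_table", "corresponding_file"]


def write_taxon_file(path, rows):
    """Write rows (a list of dicts keyed by COLUMNS) as a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                                restval="", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    write_taxon_file("taxon_id.tsv", [
        {"species": "species_1", "taxon_id": 10, "circular": "Y",
         "element_type": ":CHRSM", "codon_table": 1},
        {"species": "species_4", "taxon_id": 4, "circular": "N",
         "element_type": ":CHRSM", "codon_table": 1,
         "corresponding_file": "scaffold_1"},
    ])
```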

Genetic code for Pathway Tools:

Corresponding number	Genetic code
0	Unspecified
1	The Standard Code
2	The Vertebrate Mitochondrial Code
3	The Yeast Mitochondrial Code
4	The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
5	The Invertebrate Mitochondrial Code
6	The Ciliate, Dasycladacean and Hexamita Nuclear Code
9	The Echinoderm and Flatworm Mitochondrial Code
10	The Euplotid Nuclear Code
11	The Bacterial, Archaeal and Plant Plastid Code
12	The Alternative Yeast Nuclear Code
13	The Ascidian Mitochondrial Code
14	The Alternative Flatworm Mitochondrial Code
15	Blepharisma Nuclear Code
16	Chlorophycean Mitochondrial Code
21	Trematode Mitochondrial Code
22	Scenedesmus obliquus Mitochondrial Code
23	Thraustochytrium Mitochondrial Code

Input files created by mpwt

Three input files are created by mpwt. Information is extracted from the Genbank/GFF/PF file. myDBName corresponds to the name of the folder and of the Genbank/GFF/PF file. taxonid corresponds to the taxon ID in the db_xref of the source feature in the Genbank/GFF/PF file. The species_name is extracted from the Genbank/GFF/PF files.

**organism-params.dat**
ID  myDBName
STORAGE FILE
NCBI-TAXON-ID   taxonid
NAME    species_name

**genetic-elements.dat**
NAME
ANNOT-FILE  gbk_pathname
//

**dat_creation.lisp**
(in-package :ecocyc)
(select-organism :org-id 'myDBName)
(let ((*progress-noter-enabled?* NIL))
        (create-flat-files-for-current-kb))
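To make the structure of organism-params.dat concrete, the sketch below renders the four lines shown above from their values. It is an illustration only, not mpwt's own code: the function name is hypothetical and the tab separator between key and value is an assumption.

```python
def organism_params(db_name, taxon_id, species_name, separator="\t"):
    """Render the content of an organism-params.dat file as a string.

    The separator between key and value is assumed to be a tab.
    """
    lines = [
        ("ID", db_name),
        ("STORAGE", "FILE"),
        ("NCBI-TAXON-ID", str(taxon_id)),
        ("NAME", species_name),
    ]
    return "".join(f"{key}{separator}{value}\n" for key, value in lines)


if __name__ == "__main__":
    print(organism_params("species_1", 10, "Species name"), end="")
```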

Command Line and Python arguments

By using the Python multiprocessing library, mpwt launches parallel PathoLogic processes on physical cores. Memory requirements depend on the genome, but we advise using at least 2 GB per core.
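The 2 GB per core guideline can be turned into a rough value for the CPU argument. The sketch below is an assumption-laden helper, not part of mpwt: the function name is hypothetical, and os.cpu_count() reports logical cores, which may exceed the number of physical cores.

```python
import os


def suggest_number_cpu(available_memory_gb, n_genomes, gb_per_core=2.0):
    """Suggest a CPU count bounded by memory, available cores and genome count."""
    by_memory = int(available_memory_gb // gb_per_core)
    by_cores = os.cpu_count() or 1  # logical cores; may overestimate physical ones
    return max(1, min(by_memory, by_cores, n_genomes))


if __name__ == "__main__":
    # Example: 16 GB of free memory and 30 genomes to process.
    print(suggest_number_cpu(available_memory_gb=16, n_genomes=30))
```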

mpwt can be used with the command line:

mpwt -f path/to/folder/input [-o path/to/folder/output] [--patho] [--hf] [--op] [--tp] [--nc] [-p FLOAT] [--dat] [--md] [--cpu INT] [-r] [--clean] [--log path/to/folder/log] [--ignore-error] [-v]

Optional arguments are identified by [].

mpwt can be used in a Python script with an import:

import mpwt

folder_input = "path/to/folder/input"
folder_output = "path/to/folder/output"

mpwt.multiprocess_pwt(input_folder=folder_input,
                      output_folder=folder_output,
                      patho_inference=optional_boolean,
                      patho_hole_filler=optional_boolean,
                      patho_operon_predictor=optional_boolean,
                      patho_transporter_inference=optional_boolean,
                      no_download_articles=optional_boolean,
                      dat_creation=optional_boolean,
                      dat_extraction=optional_boolean,
                      size_reduction=optional_boolean,
                      number_cpu=optional_int,
                      patho_log=optional_folder_pathname,
                      ignore_error=optional_boolean,
                      pathway_score=optional_float,
                      taxon_file=optional_boolean,
                      verbose=optional_boolean)

Command line argument	Python argument	Description
-f	input_folder(string: folder pathname)	Input folder as described in Input data
-o	output_folder(string: folder pathname)	Output folder containing PGDB data or dat files (see the --dat argument)
--patho	patho_inference(boolean)	Launch PathoLogic inference on input folder
--hf	patho_hole_filler(boolean)	Launch PathoLogic Hole Filler with Blast
--op	patho_operon_predictor(boolean)	Launch PathoLogic Operon Predictor
--tp	patho_transporter_inference(boolean)	Launch PathoLogic Transport Inference Parser
--nc	no_download_articles(boolean)	Launch PathoLogic without loading PubMed citations (not working)
-p	pathway_score(float)	Launch PathoLogic using a specified pathway prediction score
--dat	dat_creation(boolean)	Create BioPAX/attribute-value dat files
--md	dat_extraction(boolean)	Move only the dat files inside the output folder
--cpu	number_cpu(int)	Number of CPUs used for the multiprocessing
-r	size_reduction(boolean)	Delete PGDB in ptools-local to reduce size and return compressed files
--log	patho_log(string: folder pathname)	Folder where log files for PathoLogic inference will be stored
--delete	mpwt.remove_pgdbs(string: pgdb name)	Delete a specific PGDB
--clean	mpwt.cleaning()	Delete all PGDBs in ptools-local folder or only PGDB from input folder
--ignore-error	ignore_error(boolean)	Ignore errors and continue the workflow for successful builds
--taxon-file	taxon_file(boolean)	Force mpwt to use the taxon ID in the taxon_id.tsv file
-v	verbose(boolean)	Print some information about the processing of mpwt
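The correspondence between Python arguments and command line flags in the table can be expressed as a small lookup. The following sketch builds a command line (for example to pass to subprocess.run) from keyword arguments; the helper build_mpwt_command and the dictionary are hypothetical and only mirror the table, they are not provided by mpwt.

```python
# Flag for each Python argument, copied from the table above.
FLAG_FOR_ARGUMENT = {
    "input_folder": "-f",
    "output_folder": "-o",
    "patho_inference": "--patho",
    "patho_hole_filler": "--hf",
    "patho_operon_predictor": "--op",
    "patho_transporter_inference": "--tp",
    "no_download_articles": "--nc",
    "pathway_score": "-p",
    "dat_creation": "--dat",
    "dat_extraction": "--md",
    "number_cpu": "--cpu",
    "size_reduction": "-r",
    "patho_log": "--log",
    "ignore_error": "--ignore-error",
    "taxon_file": "--taxon-file",
    "verbose": "-v",
}


def build_mpwt_command(**arguments):
    """Translate keyword arguments into an argument list starting with 'mpwt'."""
    command = ["mpwt"]
    for name, value in arguments.items():
        flag = FLAG_FOR_ARGUMENT[name]  # raises KeyError for an unknown argument
        if isinstance(value, bool):
            if value:
                command.append(flag)  # boolean options are bare switches
        elif value is not None:
            command.extend([flag, str(value)])
    return command
```

For instance, build_mpwt_command(input_folder="path/to/folder/input", patho_inference=True) gives the argument list for the first example of the Examples section.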

There is also another argument:

mpwt topf -f input_folder -o output_folder --cpu cpu_number

import mpwt
mpwt.to_pathologic.create_pathologic_file(input_folder, output_folder, cpu_number)

This argument reads the input data inside the input folder, then converts Genbank and GFF files into PathoLogic Format files. If PathoLogic files are already present, it copies them.

It can be used to avoid issues with parsing Genbank and GFF files, but it is an early work in progress.

Examples

Possible uses of mpwt:

Each example below gives the command line first, followed by the equivalent Python script.

Create PGDBs of studied organisms inside ptools-local:

mpwt -f path/to/folder/input --patho

import mpwt
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
        patho_inference=True)

Convert Genbank and GFF files into PathoLogic files then create PGDBs of studied organisms inside ptools-local:

mpwt topf -f path/to/folder/input -o path/to/folder/pf
mpwt -f path/to/folder/pf --patho

import mpwt
mpwt.create_pathologic_file(input_folder='path/to/folder/input', output_folder='path/to/folder/pf')
mpwt.multiprocess_pwt(input_folder='path/to/folder/pf', patho_inference=True)

Create PGDBs of studied organisms inside ptools-local with Hole Filler, Operon Predictor, Transport Inference Parser and create logs:

mpwt -f path/to/folder/input --patho --hf --op --tp --log path/to/folder/log

import mpwt
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
        patho_inference=True,
        patho_hole_filler=True,
        patho_operon_predictor=True,
        patho_transporter_inference=True,
        patho_log='path/to/folder/log')

Create PGDBs of studied organisms inside ptools-local with pathway prediction score of 1:

mpwt -f path/to/folder/input --patho -p 1.0

import mpwt
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
                    patho_inference=True,
                    pathway_score=1.0)

Create PGDBs of studied organisms inside ptools-local and create dat files:

mpwt -f path/to/folder/input --patho --dat

import mpwt
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
                    patho_inference=True,
                    dat_creation=True)

Create PGDBs of studied organisms inside ptools-local. Then move all the PGDB files to the output folder.

mpwt -f path/to/folder/input --patho -o path/to/folder/output

import mpwt
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
                    output_folder='path/to/folder/output',
                    patho_inference=True)

Create PGDBs of studied organisms inside ptools-local and create dat files. Then move the dat files to the output folder.

mpwt -f path/to/folder/input --patho --dat -o path/to/folder/output --md

import mpwt
mpwt.multiprocess_pwt(input_folder='path/to/folder/input',
                    output_folder='path/to/folder/output',
                    patho_inference=True,
                    dat_creation=True,
                    dat_extraction=True)

Create dat files for the PGDBs inside ptools-local and move them to the output folder:

mpwt --dat -o path/to/folder/output --md

import mpwt
mpwt.multiprocess_pwt(output_folder='path/to/folder/output',
                    dat_creation=True,
                    dat_extraction=True)

Move PGDBs from ptools-local to the output folder:

mpwt -o path/to/folder/output

import mpwt
mpwt.multiprocess_pwt(output_folder='path/to/folder/output')

Move dat files from ptools-local to the output folder:

mpwt -o path/to/folder/output --md

import mpwt
mpwt.multiprocess_pwt(output_folder='path/to/folder/output',
        dat_extraction=True)

Useful functions

Run the multiprocess Pathway Tools on input folder

import mpwt
mpwt.multiprocess_pwt(input_folder=folder_input,
        output_folder=folder_output,
        patho_inference=optional_boolean,
        patho_hole_filler=optional_boolean,
        patho_operon_predictor=optional_boolean,
        patho_transporter_inference=optional_boolean,
        no_download_articles=optional_boolean,
        dat_creation=optional_boolean,
        dat_extraction=optional_boolean,
        size_reduction=optional_boolean,
        number_cpu=optional_int,
        patho_log=optional_folder_pathname,
        ignore_error=optional_boolean,
        pathway_score=optional_float,
        taxon_file=optional_boolean,
        verbose=optional_boolean)

Delete all the previous PGDB and the metadata files

import mpwt
mpwt.cleaning(number_cpu=optional_int, verbose=optional_boolean)

This can also be used with a command line argument:

mpwt --clean

If you use clean and the argument -f input_folder, it will delete input files (‘dat_creation.lisp’, ‘dat_creation.log’, ‘pathologic.log’, ‘pwt_terminal.log’, ‘genetic-elements.dat’ and ‘organism-params.dat’) and the PGDB corresponding to the input folder.

mpwt -f input_folder --clean

For example if you have:

Folder_input
├── species_1
│   └── species_1.gbk
├── species_2
│   └── species_2.gff
│   └── species_2.fasta
├── species_3
│   └── species_3.gbk

And you have in your ptools-local:

ptools-local
├── pgdbs
    ├── user
        ├── species_1cyc
        │   └── ..
        ├── species_2cyc
        │   └── ..
        ├── species_3cyc
        │   └── ..
        ├── species_4cyc
        │   └── ..

The command:

mpwt -f input_folder --clean

will delete species_1cyc, species_2cyc and species_3cyc but not species_4cyc.
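The selection rule in this example can be sketched in a few lines: a PGDB is targeted when its name is the name of an input sub-folder followed by cyc. The helper below is hypothetical and only reproduces the naming seen in the example; it does not delete anything and is not mpwt's own logic.

```python
def pgdbs_matching_input(input_subfolders, existing_pgdbs):
    """Return the existing PGDB names that correspond to the input sub-folders.

    Assumes the '<folder name>cyc' naming shown in the example above.
    """
    expected = {f"{name}cyc" for name in input_subfolders}
    return sorted(pgdb for pgdb in existing_pgdbs if pgdb in expected)


if __name__ == "__main__":
    targeted = pgdbs_matching_input(
        ["species_1", "species_2", "species_3"],
        ["species_1cyc", "species_2cyc", "species_3cyc", "species_4cyc"],
    )
    print(targeted)
```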

Delete a specific PGDB

With this command, it is possible to delete a specific PGDB, where pgdb_name is the name of the PGDB (ending with ‘cyc’). To delete several PGDBs, put all the PGDB IDs in a single string separated by ‘,’.

import mpwt
mpwt.remove_pgdbs(pgdb_name)

And as a command line:

mpwt --delete mydbcyc1,mydbcyc2

Return the path of ptools-local

import mpwt
ptools_local_path = mpwt.find_ptools_path()

Return a list containing all the PGDBs inside ptools-local folder

import mpwt
list_of_pgdbs = mpwt.list_pgdb()

Can be used as a command with:

mpwt --list

Errors

If you encounter errors (which is quite possible), the following information can help you resolve them.

For errors during PathoLogic inference, you can use the --log/patho_log argument. The log contains the summary of the build and the error for each species. There is also a pathologic.log (created by Pathway Tools), a pwt_terminal.log (log of the terminal during the PathoLogic process) and a dat_creation.log (log of the terminal during attribute-value file creation) in each sub-folder.

If the build passed, you can also see the result of the inference in the file resume_inference.tsv. For each species, it contains the number of genes/proteins/reactions/pathways/compounds in the metabolic network.

If Pathway Tools crashed, mpwt can print some useful information in verbose mode. It will show the terminal in which Pathway Tools has crashed. Also, if there is an error in pathologic.log, it will be shown after === Error in Pathologic.log ===.

There is a Pathway Tools forum where you can find information on Pathway Tools errors.

You can also ignore PathoLogic errors by using the argument --ignore-error/ignore_error. This option will ignore errors and continue the mpwt workflow on the successful PathoLogic builds.

Output

If you did not use the output argument, results (PGDB with/without BioPAX/dat files) will be inside your ptools-local folder, ready to be used with Pathway Tools. Keep in mind that mpwt does not create the cellular overview and, by default, does not use the Hole Filler. If you want these results, you should run them afterwards.

If you used the output argument, there are two possible outputs depending on the use of the option --md/dat_extraction:

without --md/dat_extraction, you will have a complete PGDB folder inside your results, for example:

Folder_output
├── species_1
│   └── default-version
│   └── 1.0
│       └── data
│           └── contains BioPAX/dat files if you used the --dat/dat_creation option.
│       └── input
│           └── species_1.gbk
│           └── genetic-elements.dat
│           └── organism-init.dat
│           └── organism.dat
│       └── kb
│           └── species_1.ocelot
│       └── reports
│           └── contains Pathway Tools reports.
├── species_2
..
├── species_3
..

with --md/dat_extraction, you will only have the dat files, for example:

Folder_output
├── species_1
│   └── classes.dat
│   └── compounds.dat
│   └── dnabindsites.dat
│   └── enzrxns.dat
│   └── genes.dat
│   └── pathways.dat
│   └── promoters.dat
│   └── protein-features.dat
│   └── proteins.dat
│   └── protligandcplxes.dat
│   └── pubs.dat
│   └── reactions.dat
│   └── regulation.dat
│   └── regulons.dat
│   └── rnas.dat
│   └── species.dat
│   └── terminators.dat
│   └── transunits.dat
│   └── ..
├── species_2
..
├── species_3
..

with the -r/size_reduction argument, you will have compressed zip files (and PGDBs inside ptools-local will be deleted):

Folder_output
├── species_1.zip
├── species_2.zip
├── species_3.zip
..
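The dat files listed above are attribute-value flat files, which are easy to inspect from Python. The sketch below is a simplified reader, assuming the usual layout of these files: one "ATTRIBUTE - value" pair per line, records ended by a line containing only //, comment lines starting with #, and continuation lines starting with / (skipped here). The sample content is invented for illustration, and the function name is hypothetical.

```python
SAMPLE = """\
# made-up sample, not real mpwt output
UNIQUE-ID - RXN-1
TYPES - Small-Molecule-Reactions
//
UNIQUE-ID - RXN-2
LEFT - WATER
LEFT - ATP
//
"""


def parse_dat(text):
    """Parse attribute-value records into a list of {attribute: [values]} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.strip() == "//":
            if current:
                records.append(current)  # end of the current record
            current = {}
            continue
        if line.startswith("/"):
            continue  # continuation of a multi-line value, ignored in this sketch
        attribute, separator, value = line.partition(" - ")
        if separator:
            current.setdefault(attribute, []).append(value)
    if current:
        records.append(current)  # last record without a closing '//'
    return records


if __name__ == "__main__":
    reactions = parse_dat(SAMPLE)
    print(len(reactions), "records")
```

Counting the records of reactions.dat or pathways.dat this way gives a quick size estimate of each metabolic network.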

For developer

mpwt uses logging so you need to create the handler configuration if you want mpwt’s log in your application:

import logging

from mpwt import multiprocess_pwt

logging.basicConfig()

multiprocess_pwt(...)
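If you prefer a dedicated handler over logging.basicConfig(), a configuration along the following lines may work. It is a sketch under one assumption that should be verified: that mpwt's loggers live under the "mpwt" logger namespace (the common logging.getLogger(__name__) convention). The function name configure_mpwt_logging is hypothetical.

```python
import logging


def configure_mpwt_logging(level=logging.INFO):
    """Attach a stream handler to the 'mpwt' logger (assumed logger name)."""
    logger = logging.getLogger("mpwt")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    configure_mpwt_logging()
    # Messages sent to child loggers such as 'mpwt.example' reach the handler.
    logging.getLogger("mpwt.example").info("logging is configured")
```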

Release Notes

Changes between versions are listed on the release page.

Citation

Arnaud Belcour, Clémence Frioux, Meziane Aite, Anthony Bretaudeau, Anne Siegel (2019) Metage2Metabo: metabolic complementarity applied to genomes of large-scale microbiotas for the identification of keystone species. bioRxiv 803056; doi: https://doi.org/10.1101/803056.

Acknowledgements

Meziane Aite for his work on the first draft of this package.

Clémence Frioux for her work and feedback.

Peter Karp, Suzanne Paley, Markus Krummenacker, Richard Billington and Anamika Kothari from the Bioinformatics Research Group of SRI International for their help on Pathway Tools and on Genbank format.

GenOuest bioinformatics (https://www.genouest.org/) core facility for providing the computing infrastructure to test this tool.

All the users that have tested this tool.

This version: 0.5.8 (released Oct 12, 2020)
