A metagenomics pipeline to estimate relative cell periods.
Menace
======
This software bundle is a basic implementation of the algorithm for
extracting peak-to-trough ratios (PTRs) from metagenomic data, first
described in `(Korem et al., Science,
2015) <http://science.sciencemag.org/content/349/6252/1101>`__.
Installation:
-------------
Pip
~~~
Make sure that ``pip`` is the PyPI installer of your *python2* installation,
then run:

.. code:: bash

   pip install menace

Git
~~~
.. code:: bash

   git clone git@github.com:zertan/Menace.git
   cd Menace
   python setup.py install

This should install the *python* dependencies listed below. The other
dependencies have to be installed manually (if you have questions about
this, consider consulting your cluster IT help desk).

The software has been tested on the "hebbe" cluster at
`C3SE <c3se.chalmers.se>`__, which uses the "slurm" system for resource
management; slurm is therefore the only queueing system currently supported.
Dependencies:
~~~~~~~~~~~~~
::

   Python2:
   numpy
   scipy
   pandas
   biopython
   matplotlib
   xmltodict
   configparser
   lmfit
   newick
   Jinja2
   doric
   -e git+https://github.com/PathoScope/PathoScope.git#egg=pathoscope

- `samtools <http://www.htslib.org/download/>`__
- `bamtools <https://github.com/pezmaster31/bamtools/wiki/Building-and-installing>`__
- `bowtie2 <https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/>`__
- `Pathoscope
  2.0 <https://sourceforge.net/projects/pathoscope/files/?source=navbar>`__
  (should be installed by the above pip command, but make sure that
  'pathoscope ID' is accessible in the shell, i.e. is on the system path)
- `parallel <http://www.gnu.org/software/parallel/>`__

`DoriC <http://tubic.tju.edu.cn/doric/download.php>`__ is a database of
chromosome origin locations (OriCs). It is an optional but recommended
dependency for the pipeline. Please visit the link and enter your e-mail
address to download it.
Usage
-----
You can get an overview of the menace functionality by running
``menace -h``.
1. Initialize a project in the current directory by running ``menace init``.
   Identify a set of NCBI genome reference accession numbers and put
   them in "./searchStrings" (or use the default file, which includes a
   *minimal* set of references to bacteria common in the human gut).
2. Identify a metagenomic cohort of interest (download manually or add
   URLs as described below) and add it to the Data folder. Supported input:
   raw/gzipped/bzipped ".fastq" files.
3. Add information to the ``project.conf`` file.
4. Edit ``loadmodules.sh`` to include the **python2** module of the
   cluster (or comment out the lines if python2 is accessible by
   default).
5. Run ``menace full`` (use "nohup {cmd} &" to keep it alive after logout
   if on a cluster login node).
6. Wait for the job to complete, then run ``menace collect`` in the project
   directory.

Notes
^^^^^
The menace script is a common utility for all parts of the pipeline,
including downloading references and metagenomic data, building a
reference index, setting up the necessary file structure and submitting
jobs to slurm. Hence, all configuration is intended to be set up in
project.conf (please see ``bin/project.conf.example`` for an example).

The default 'searchStrings' is only an example and will most probably not
fit your purposes. A more comprehensive reference library will yield
higher coverage and more accurate values. A more comprehensive list of
human gut bacteria is available at 'extra/referenceACClong.txt'.
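
As a rough illustration of keeping all paths in a single configuration file,
the sketch below parses an INI-style file with Python's ``configparser``
module. The section name ``Project`` and the key names are assumptions
borrowed from the path variables used in the directory structure section of
this README; ``bin/project.conf.example`` remains the authoritative source
for the real names.

.. code:: python

   import configparser

   # Hypothetical configuration text; the real section and key names are
   # defined in bin/project.conf.example and may differ.
   EXAMPLE_CONF = """
   [Project]
   data_path = /scratch/menace/Data
   ref_path = /scratch/menace/References
   doric_path = /scratch/menace/DoriC
   output_path = /scratch/menace/Out
   """.replace("\n   ", "\n")  # drop the indentation used for display here


   def load_paths(text):
       """Return the path settings of the [Project] section as a dict."""
       parser = configparser.ConfigParser()
       parser.read_string(text)
       return dict(parser["Project"])


   paths = load_paths(EXAMPLE_CONF)
   for name in sorted(paths):
       print(name, "=", paths[name])

Note that ``configparser`` lower-cases option names by default, so a key
written as ``DATA_PATH`` in the file would be read back as ``data_path``.
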
Directory structure (*example*)
-------------------------------
Following the usage example above, the directory structure will look
something like this:
::

   $DATA_PATH
   ├ "Sample01"   (eg. ERR525688)
   .  ├ {sample01_1.fastq.gz}
   .  └ {sample01_2.fastq.gz}   paired metagenomic reads
   .
   $REF_PATH
   ├ Index
   |  └ {REF_NAME.*.bt2l}   bowtie2 index files
   ├ Fasta
   |  └ {accession.fasta}
   ├ Headers
   |  └ {accession.xml}   xml files containing extra genome references info
   └ taxIDs.txt
   $DORIC_PATH
   ├ bacteria_record.dat
   └ bacteria_seq.fas
   $OUTPUT_PATH
   ├ "Sample01"
   .  ├ depth
   .  |  └ {accession.depth}   coverage files for each reference
   .  ├ log
      |  └ {accession.log}   output logs from piecewiseFit
      ├ npy
      |  └ {accession_OriC_TerC.npy}   numpy files with origin/terminus locations and relative C periods
      ├ png
      |  └ {accession_fit.png}   images of piecewise fit of the smoothed coverage
      └ accession-sam-report.tsv   Pathoscope2 reassignment report

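
To make the layout concrete, the following sketch walks an output directory
shaped like the ``$OUTPUT_PATH`` tree above and lists the ``.depth`` files
found for each sample. The function name ``list_depth_files`` is
hypothetical and not part of the package; the demo builds a throwaway
directory, and the accession used is only an illustrative file name.

.. code:: python

   import tempfile
   from pathlib import Path


   def list_depth_files(output_path):
       """Map each sample directory name to its sorted *.depth file names."""
       samples = {}
       for sample_dir in sorted(Path(output_path).iterdir()):
           depth_dir = sample_dir / "depth"
           if depth_dir.is_dir():
               samples[sample_dir.name] = sorted(
                   p.name for p in depth_dir.glob("*.depth"))
       return samples


   # Build a tiny stand-in for $OUTPUT_PATH and inspect it.
   with tempfile.TemporaryDirectory() as tmp:
       depth = Path(tmp) / "Sample01" / "depth"
       depth.mkdir(parents=True)
       (depth / "NC_000913.3.depth").touch()
       print(list_depth_files(tmp))
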
Contents
--------
Below follows a description of the main scripts in the package.
jobscript
^^^^^^^^^
A submit script for sending a batch job to slurm for parallel processing
on a computing cluster.
**input:** none
**output:** directory structure as specified in "project.conf"
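
For readers unfamiliar with slurm, the sketch below renders a minimal batch
script from a template. It is not the package's jobscript: the template
text, the function name ``render_jobscript`` and the call to
``mainBuild.sh`` are assumptions for illustration, and it uses
``string.Template`` from the standard library rather than Jinja2. Only the
``#SBATCH`` options shown (``--job-name``, ``--nodes``, ``--time``) are
standard slurm options.

.. code:: python

   from string import Template

   # Hypothetical template; the jobscript shipped with the package may differ.
   SBATCH_TEMPLATE = Template(
       "#!/bin/bash\n"
       "#SBATCH --job-name=$job_name\n"
       "#SBATCH --nodes=$nodes\n"
       "#SBATCH --time=$walltime\n"
       "\n"
       "# Assumed entry point for the work done on the compute nodes.\n"
       "bash mainBuild.sh\n"
   )


   def render_jobscript(job_name, nodes, walltime):
       """Fill in the template and return the batch script as a string."""
       return SBATCH_TEMPLATE.substitute(
           job_name=job_name, nodes=nodes, walltime=walltime)


   print(render_jobscript("menace", 2, "12:00:00"))

The rendered text would be saved to a file and handed to ``sbatch``.
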
mainBuild.sh
^^^^^^^^^^^^
The main build script with commands intended to be executed on the
cluster.
**input:** none
**output:** temporary paths and files on compute nodes
PTRMatrix.py
^^^^^^^^^^^^
Traverses the output directory generated by mainBuild.sh and
assembles information from each sample into tabular form (e.g. averages
origin locations across many samples for a better estimate).
**input:** $OUTPUT\_PATH, $DORIC\_PATH, $REF\_PATH, bin/accLoc.csv
**output:** Abundance.csv, PTR.csv, DoublingTime.csv, Header.csv
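
The two ideas in this description, tabulating per-sample values and
averaging origin locations, can be sketched as follows. This is not the code
in PTRMatrix.py: the function names and the record format are hypothetical,
and the use of a circular mean (so that origins near both ends of a circular
chromosome average to a position near the boundary, not to the middle) is
an assumption about a sensible way to average.

.. code:: python

   import csv
   import io
   import math


   def circular_mean(fractions):
       """Mean of positions on a circle, each a fraction in [0, 1)."""
       angles = [2 * math.pi * f for f in fractions]
       s = sum(math.sin(a) for a in angles)
       c = sum(math.cos(a) for a in angles)
       return (math.atan2(s, c) / (2 * math.pi)) % 1.0


   def ptr_table(records):
       """Turn (sample, accession, ptr) records into CSV text.

       One row per accession, one column per sample; missing
       combinations are left empty.
       """
       samples = sorted({s for s, _, _ in records})
       accessions = sorted({a for _, a, _ in records})
       values = {(s, a): p for s, a, p in records}
       out = io.StringIO()
       writer = csv.writer(out, lineterminator="\n")
       writer.writerow(["accession"] + samples)
       for acc in accessions:
           writer.writerow([acc] + [values.get((s, acc), "") for s in samples])
       return out.getvalue()


   # Made-up values, for illustration only.
   records = [("Sample01", "NC_000913.3", 1.8),
              ("Sample02", "NC_000913.3", 1.4)]
   print(ptr_table(records))
   # Origins just before and just after position 0 average to the boundary
   # (a value next to 0, or equivalently next to 1), not to 0.5.
   print(circular_mean([0.98, 0.02]))
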
piecewiseFit.py
^^^^^^^^^^^^^^^
Implements the piecewise linear fit, preceded by checks on the generated
depth files that select those instances in which enough data was
generated to produce a reliable coverage signal for estimating
replication origins. Once the origins have been estimated using the
full cohort, this data can be used to produce PTR values for each
sample.
**input:** {reference.depth}
**output:** {reference\_OriC.npy}, {reference\_TerC.npy},
{reference\_coverage.png}, {reference\_fit.log}
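
The idea behind a piecewise linear fit of coverage can be sketched in a few
lines. The example below is a simplified stand-in, not the implementation in
piecewiseFit.py: it assumes binned, smoothed log2 coverage along a circular
chromosome, assumes the terminus lies directly opposite the origin, and
replaces any model-fitting library with a plain grid search plus ordinary
least squares. The function name ``fit_origin`` is hypothetical. Under these
assumptions log2 coverage falls linearly from the origin (peak) to the
terminus (trough), and the PTR is 2 raised to the size of that drop.

.. code:: python

   def fit_origin(log2_cov):
       """Grid-search the origin bin; return (origin_fraction, ptr)."""
       n = len(log2_cov)
       mean_y = sum(log2_cov) / n
       best = None
       for k in range(n):
           # Scaled circular distance from each bin to candidate origin k:
           # 0 at the origin, 1 at the opposite side (assumed terminus).
           d = [2 * min(abs(i - k), n - abs(i - k)) / n for i in range(n)]
           mean_d = sum(d) / n
           sxx = sum((di - mean_d) ** 2 for di in d)
           sxy = sum((di - mean_d) * (yi - mean_y)
                     for di, yi in zip(d, log2_cov))
           slope = sxy / sxx
           intercept = mean_y - slope * mean_d
           sse = sum((yi - (intercept + slope * di)) ** 2
                     for di, yi in zip(d, log2_cov))
           # Coverage must decrease away from a valid origin.
           if slope < 0 and (best is None or sse < best[0]):
               best = (sse, k / n, 2 ** (-slope))
       if best is None:
           raise ValueError("coverage does not decrease from any candidate")
       return best[1], best[2]


   # Noise-free synthetic coverage: 100 bins, origin at bin 30, PTR of 2.
   n_bins, true_k = 100, 30
   synthetic = [5.0 - 2 * min(abs(i - true_k), n_bins - abs(i - true_k)) / n_bins
                for i in range(n_bins)]
   origin, ptr = fit_origin(synthetic)
   print(origin, ptr)

The grid search costs on the order of the square of the number of bins,
which is acceptable for a few hundred bins but would need a smarter
optimizer for finer resolution.
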
fetchSeq.py
^^^^^^^^^^^
This utility can be used to download '.fasta' reference files from the
NCBI servers.
**input:** searchStrings.txt
**output:** {reference.fasta}, {reference.xml}, taxIDs.txt
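
As an illustration of what such a download involves, the sketch below reads
accession numbers from text and builds request URLs for the NCBI
E-utilities ``efetch`` endpoint. It is not the code in fetchSeq.py: the
function names are hypothetical, and the file format (one accession per
line, with ``#`` starting a comment) is an assumption. No network request is
made here; the URL could be passed to ``urllib.request.urlopen`` to retrieve
the FASTA text.

.. code:: python

   from urllib.parse import urlencode

   EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


   def read_accessions(text):
       """Return accession strings, skipping blank lines and # comments."""
       lines = (ln.strip() for ln in text.splitlines())
       return [ln for ln in lines if ln and not ln.startswith("#")]


   def efetch_fasta_url(accession):
       """Build an efetch URL that returns one nucleotide record as FASTA."""
       query = urlencode({"db": "nuccore", "id": accession,
                          "rettype": "fasta", "retmode": "text"})
       return EFETCH + "?" + query


   # Illustrative input; the accession is only an example.
   example = "# reference genomes\nNC_000913.3\n\n"
   for acc in read_accessions(example):
       print(efetch_fasta_url(acc))
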