pacmill
version 0.6.1
The `pacmill` Python package is a bioinformatics pipeline developed to process microbial 16S amplicon sequencing data. It specializes in the analysis of long reads such as those provided by PacBio sequencers.
Prerequisites
Since `pacmill` is written in Python, it is compatible with all operating systems: Linux, macOS and Windows. The only prerequisite is `python3` (which is often installed by default) along with the `pip3` package manager.
To check if you have `python3` installed, type the following on your terminal:
$ python3 -V
If you do not have `python3` installed, please refer to the section on obtaining `python3`.
To check if you have `pip3` installed, type the following on your terminal:
$ pip3 -V
If you do not have `pip3` installed, please refer to the section on obtaining `pip3`.
Installing
To install the `pacmill` package, simply type the following command on your terminal:
$ pip3 install --user pacmill
Alternatively, if you want to install it for all users of the system:
$ sudo pip3 install pacmill
These commands will also automatically install all the other Python modules on which `pacmill` depends.
External programs
The `pacmill` pipeline also depends on several shell commands being available. The following executables should be present in your `$PATH` environment variable:
- `fastQValidator`
- `fastqc`
- `barrnap`
- `vsearch`
- `mothur`
- `xelatex`
- `fastq-dump`
If any of these required external programs are missing, you will be prompted to install them and given easy instructions to do so.
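As a quick pre-flight check, the list above can also be tested against `$PATH` from Python. The following is a minimal sketch using only the standard library; the `find_missing` helper is hypothetical and is not part of `pacmill`, which performs its own check when it runs.

```python
import shutil

# Executable names copied from the list above.
REQUIRED = ["fastQValidator", "fastqc", "barrnap", "vsearch",
            "mothur", "xelatex", "fastq-dump"]

def find_missing(names=REQUIRED):
    """Return the names that cannot be located on the current $PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All external programs were found.")
```

`shutil.which` performs the same lookup the shell does, so a name reported as missing here would also fail when invoked from the terminal.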
Usage
Metadata
The first thing to do when starting a new analysis is to fill in a metadata file that details all there is to know about the biological samples being processed.
An empty template for such a file is found in this repository at `pacmill/metadata/metadata_blank.xlsx`. You can make a copy of this file for every new project.
In addition, another file named `pacmill/metadata/metadata_example.xlsx` shows typical values that the fields are supposed to take, along with short documentation for each entry. An excerpt of this file is shown below:
Loading the project
Below are some examples that illustrate the various ways to use this package.
# This example is not completed yet. TODO.
Customizing report headers
To change the text that appears inside the header of the PDF reports generated, you can adjust these three environment variables to your liking. Credit is appreciated where credit is due, but the software has a very permissive license that lets you decide what is best.
$ export PACMILL_HEADER="From the \textbf{pacmill} project"
$ export PACMILL_SUBHEADER="Written by consultants at \url{www.sinclair.bio}"
$ export PACMILL_LINK="Hosted at \url{www.github.com/xapple/pacmill}"
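If you launch the pipeline from a Python script rather than from an interactive shell, the same three variables can be set through `os.environ`. This is a minimal sketch, assuming the variables are read from the process environment when the reports are generated; the values are the defaults shown above.

```python
import os

# Raw strings keep the LaTeX backslashes intact.
headers = {
    "PACMILL_HEADER": r"From the \textbf{pacmill} project",
    "PACMILL_SUBHEADER": r"Written by consultants at \url{www.sinclair.bio}",
    "PACMILL_LINK": r"Hosted at \url{www.github.com/xapple/pacmill}",
}

# This must run before the pipeline starts in this process
# (child processes inherit the updated environment).
os.environ.update(headers)
```

Because the header text is passed to LaTeX, any replacement value should itself be valid LaTeX.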
Demo project
In order to test and evaluate the pipeline, we have provided a demonstration project ready to be processed. This enables the user to see what type of outputs are generated by `pacmill` without having to bring their own DNA sequence data. Five samples are included and are taken from the following publication:
- "Confident phylogenetic identification of uncultured prokaryotes through long read amplicon sequencing of the 16S-ITS-23S rRNA operon."
- Joran Martijn, (many others), Thijs Ettema.
- Science for Life Laboratory, Uppsala University
- https://doi.org/10.1111/1462-2920.14636
The samples are publicly accessible on the Sequence Read Archive and are described as follows:
- `mock`: Genomic DNA from 38 phylogenetically distinct and diverse bacteria and archaea.
- `p19`: Sediment sample obtained from hot spring Radiata Pool, Ngatamariki, New Zealand.
- `pm3`: Sediment sample taken from 1.25m below the sea floor using a gravity core at Aarhus Bay, Denmark.
- `sala`: Black biofilm that was taken at 60m depth in an old silver mine near Sala, Sweden.
- `tns08`: Sediment sample taken from a shallow submarine hydrothermal vent field near Taketomi Island, Japan.
To run the demo project, start by executing the `download.py` script which is placed at:
$ python3 ~/repos/pacmill/pacmill/demo/download.py
Once that completes, you are ready to launch the `pacmill` pipeline:
$ python3 ~/repos/pacmill/pacmill/run_pacmill.py demo_project ~/repos/pacmill/demo_project/demo_metadata/metadata_demo.xlsx
Note: if you are on macOS you should set the following environment variable for parallelization to work without crashing:
$ export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
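The two demo commands and the macOS workaround can be combined in a small launcher script. This is a sketch only: the `demo_commands`, `demo_environment` and `run_demo` helpers are hypothetical, and the paths assume the repository was cloned to `~/repos/pacmill` as in the commands above.

```python
import os
import platform
import subprocess
import sys

REPO = os.path.expanduser("~/repos/pacmill")  # assumed clone location

def demo_commands(repo=REPO):
    """Return the download and run commands shown above as argument lists."""
    download = [sys.executable,
                os.path.join(repo, "pacmill", "demo", "download.py")]
    run = [sys.executable,
           os.path.join(repo, "pacmill", "run_pacmill.py"),
           "demo_project",
           os.path.join(repo, "demo_project", "demo_metadata",
                        "metadata_demo.xlsx")]
    return [download, run]

def demo_environment(system=None):
    """Copy the current environment, adding the macOS flag when needed."""
    env = dict(os.environ)
    if (system or platform.system()) == "Darwin":
        env["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"
    return env

def run_demo():
    """Run the download step, then the pipeline, stopping on the first failure."""
    env = demo_environment()
    for command in demo_commands():
        subprocess.run(command, env=env, check=True)
```

Calling `run_demo()` executes both steps in order. The sketch uses `sys.executable` instead of a literal `python3` so that the pipeline runs under the same interpreter in which `pacmill` was installed.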
Example graphs
The `pacmill` pipeline produces a multitude of graphs and visualizations after having processed the sequence data. Below are two examples. The first is a sequence length distribution of cleaned reads. The second is a stacked bar chart of taxonomic assignments for five different samples at the phylum level.
Example reports
After running the pipeline on a set of FASTQ files, several PDF reports are auto-generated. Examples of three reports are given below. The sample report concerns an individual sample, the project report details the results of a project containing several samples, and the taxonomy report focuses on taxonomic assignment results and visualizations.
Project report | Sample report | Taxonomy report
Flowchart
The flowchart below details the processing steps that occur in the `pacmill` pipeline, in chronological order.
Extra documentation
More documentation is available at:
http://xapple.github.io/pacmill/pacmill
This documentation is simply generated from the source code with:
$ pdoc --html --output-dir docs --force pacmill