A Python package for super-fast and accurate annotation of molecular functionality using read data without prior assembly or gene finding
mi-faser
microbiome - functional annotation of sequencing reads
A super-fast ( < 20min/10GB of reads ) and accurate ( > 90% precision ) method for annotation of molecular functionality encoded in sequencing read data without the need for assembly or gene finding.
Web Service: https://bromberglab.org/services/mifaser/
Docker: A pre-built Docker image is available at https://hub.docker.com/r/bromberglab/mifaser
Pre-Requirements
mi-faser runs on LINUX, MacOSX and WINDOWS systems.
Dependencies
- Python >= 3.6
- DIAMOND >= 0.8.8 (included; sources: https://github.com/bbuchfink/diamond)
- WINDOWS: Visual C++ Redistributable *
Note: mi-faser was developed and optimized using DIAMOND v0.8.8, which is included in all releases up to v1.11.4. This is also the version used in the accompanying publication [1]. All newer releases of mi-faser use the latest stable release of DIAMOND. mi-faser results for the first release (v1.2) with an updated version of DIAMOND (v0.9.13) were not affected by this change (<0.1% difference, based on results for the artificial metagenome supplied as example dataset). According to its authors, more recent versions of DIAMOND offer substantial improvements in speed and memory usage as well as bug fixes. We therefore strongly recommend always using the latest version of DIAMOND (see Section: DIAMOND upgrade). This might alter mi-faser results slightly. However, the changes are expected to add new correct annotations rather than introduce mis-annotations.
Note that it is recommended to download and compile DIAMOND locally (https://github.com/bbuchfink/diamond) as this might have a significant impact on performance (due to special CPU instructions). However, this repository includes a pre-compiled version of DIAMOND to use.
Note that different split sizes can, on very rare occasions, result in minor deviations in mi-faser annotations. This is due to certain heuristics applied by DIAMOND when generating sequence alignments. We suggest keeping the same split size across analyses that are meant to be compared.
Optional extensions
- SRA Toolkit >= 2.9.1 (NCBI)
If installed, the SRA Toolkit enables mi-faser to automatically retrieve and process read files deposited in the NCBI Sequence Read Archive (SRA). Currently, SRR, ERR and DRR identifiers are supported.
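As an illustration of the identifier restriction above, the following minimal sketch checks an accession before it is handed to mi-faser. The function name to_sra_argument is hypothetical, and the rule "prefix followed by digits only" is an assumption made for this example, not something taken from the mi-faser code. The sra:<accession_id> form is the one described in the Usage section.

```python
import re

# Run accession prefixes named in the text: SRR, ERR and DRR.
# Assumption for this sketch: the prefix is followed by digits only.
_RUN_ACCESSION = re.compile(r"(SRR|ERR|DRR)\d+")


def to_sra_argument(accession: str) -> str:
    """Return the 'sra:<accession_id>' form, or raise ValueError."""
    accession = accession.strip()
    if not _RUN_ACCESSION.fullmatch(accession):
        raise ValueError(f"unsupported accession: {accession!r}")
    return f"sra:{accession}"
```

Rejecting other identifier types early avoids starting a download that mi-faser would not be able to process.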
Reference Database
mi-faser was developed using a manually curated reference database of protein functions (GS database; DOI 10.5281/zenodo.1048269).
Since version 1.5, mi-faser also contains a new GS+ database, which extends the default GS database. The GS+ database includes 55 additional manually curated protein sequences, introducing 28 new E.C. numbers that represent important microbial functions in the environment.
To create a new reference database, refer to the paragraph Creating a reference database.
Installation
Standalone VS Web Service
The Standalone version of mi-faser partitions the user input into subsets in the same way as the Web Service (http://services.bromberglab.org/mifaser/). However, those partitions are processed sequentially rather than in parallel as in the Web Service. The Standalone version is therefore only recommended for smaller jobs and is mainly intended to provide the mi-faser code base.
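The partition-then-process idea can be sketched as follows. This is not the mi-faser implementation: read_fasta and partition are hypothetical helpers written for this example, and treating a size of 0 as "no split" only mirrors the behaviour that the -s/--split option describes in the help text.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield consecutive batches of at most `size` records (0 = one batch)."""
    iterator = iter(records)
    if size <= 0:
        batch = list(iterator)
        if batch:
            yield batch
        return
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

A sequential run then amounts to a plain loop over partition(...), handling one batch after another, whereas a parallel setup would distribute the batches to separate workers.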
Python package
mi-faser is available as a Python package. To install mi-faser using pip, run:
pip install mifaser
mi-faser can then be used directly from the command line:
mifaser
The mi-faser module can be imported in a Python project with import mifaser.
Docker
The pre-built mi-faser Docker image is probably the most convenient way to run mi-faser locally or in any cloud infrastructure. The Docker image can be used in the same way as the standalone version, but a common working directory must be mounted into the container.
To create and execute a single instance of mi-faser using a locally mounted working directory run:
docker run --rm \
-v <LOCAL_INPUT_DIRECTORY>:/input \
-v <LOCAL_OUTPUT_DIRECTORY>:/output \
bromberglab/mifaser -f <INPUT_FILE>
<INPUT_FILE> is a valid mi-faser input file located in <LOCAL_INPUT_DIRECTORY> on your host environment. By default, mi-faser reads input files relative to /input and writes any output to /output. Thus, by bind mounting your local <LOCAL_INPUT_DIRECTORY> to /input inside the Docker container, input files can be passed simply as paths relative to your <LOCAL_INPUT_DIRECTORY>. Similarly, by mounting a <LOCAL_OUTPUT_DIRECTORY> to /output inside the Docker container, all mi-faser outputs can be accessed at the <LOCAL_OUTPUT_DIRECTORY>.
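For scripted runs, the docker invocation above can be assembled programmatically. The sketch below only builds the argument list. The helper name docker_command is hypothetical, and resolving the host directories to absolute paths is a choice made here because Docker treats a non-absolute -v source as a named volume rather than a bind mount.

```python
from pathlib import Path
from typing import List


def docker_command(input_dir: str, output_dir: str, input_file: str) -> List[str]:
    """Build the 'docker run' argument list shown in the Docker section."""
    # Bind mounts need absolute host paths.
    host_in = Path(input_dir).resolve()
    host_out = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_in}:/input",
        "-v", f"{host_out}:/output",
        "bromberglab/mifaser",
        "-f", input_file,  # path relative to the mounted input directory
    ]
```

The resulting list can be passed to subprocess.run(..., check=True) on a host where Docker is installed.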
Python source (git repository)
Open a terminal and clone the mi-faser repository:
git clone https://git@bitbucket.org/bromberglab/mifaser.git
or download the zipped version:
curl --remote-name https://bitbucket.org/bromberglab/mifaser/get/master.zip
unzip master.zip
Usage
In case mi-faser was downloaded using the git repository:
- navigate to the mi-faser repository base directory
- run all examples in the following documentation using python -m mifaser instead of mifaser.
Run mi-faser (Single or 2-Lane mode)
Single: input-file containing DNA reads, single http[s]/ftp[s] url or SRA accession ID (sra:<accession_id>):
mifaser -f/--inputfile <INPUT_FILE>
2-Lane: two files (R1/R2), http[s]/ftp[s] urls or SRA accession IDs (sra:<accession_id1> sra:<accession_id2>):
mifaser -l/--lanes <R1_FILE> <R2_FILE>
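When mi-faser is driven from a script, the choice between the two modes can be made from the number of inputs. The following is a minimal sketch: mifaser_command is a hypothetical helper, not part of the mifaser package, and it only uses the -f, -l and -o options listed in the help output.

```python
from typing import List, Optional, Sequence


def mifaser_command(inputs: Sequence[str],
                    output_folder: Optional[str] = None) -> List[str]:
    """Build a mifaser call: one input -> Single mode, two -> 2-Lane mode."""
    if len(inputs) == 1:
        cmd = ["mifaser", "-f", inputs[0]]
    elif len(inputs) == 2:
        cmd = ["mifaser", "-l", inputs[0], inputs[1]]
    else:
        raise ValueError("expected one input (Single) or two inputs (2-Lane)")
    if output_folder is not None:
        cmd += ["-o", output_folder]
    return cmd
```

The list can be executed with subprocess.run(mifaser_command([...]), check=True). Each input may be a file path, a URL or an sra:<accession_id> string, as described above.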
CLI
mi-faser help:
usage: mifaser [-h] [-f INPUTFILE] [-l R1 R2] [-o OUTPUTFOLDER]
[-d DATABASEFOLDER] [-i DIAMONDFOLDER] [-m] [-s SPLIT]
[-S [SPLITMB]] [-t THREADS] [-c CPU] [-p] [-n] [-u UPDATE]
[-D [arg [arg ...]]] [-v] [-q] [--version]
mi-faser, microbiome - functional annotation of sequencing reads
a super-fast ( < 10min/10GB of reads ) and accurate ( > 90% precision ) method
for annotation of molecular functionality encoded in sequencing read data
without the need for assembly or gene finding.
Public web service: https://services.bromberglab.org/mifaser
Version: 1.60 [03/23/20]
optional arguments:
-h, --help show this help message and exit
-f INPUTFILE, --inputfile INPUTFILE
input DNA reads file, http[s]/ftp[s] url or SRA
accession id (sra:<id>)
-l R1 R2, --lanes R1 R2
2-Lane format (R1/R2) files, http[s]/ftp[s] url or SRA
accession ids (sra:<id_1> sra:<id_2>)
-o OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
path to base output folder; default: INPUTFILE_out
-d DATABASEFOLDER, --databasefolder DATABASEFOLDER
name of database located in database/ directory OR
absolute path to folder containing database files
-i DIAMONDFOLDER, --diamondfolder DIAMONDFOLDER
path to folder containing diamond binary
-m, --mapping if flag is set all reads mappings will be generated
(reads{n=*} -> EC{n=1}, fasta)
-s SPLIT, --split SPLIT
split by X sequences; default: 100k; 0 forces no split
-S [SPLITMB], --splitmb [SPLITMB]
split by X MB; default: 25; (requires split from GNU
Coreutils)
-t THREADS, --threads THREADS
number of threads; default: 1
-c CPU, --cpu CPU max cpus per thread; default: all available
-p, --preserve if flag is set intermediate results are kept
-n, --no-check if flag is set check for compatibility between diamond
database and binary is omitted
-u UPDATE, --update UPDATE
valid update commands: { diamond[:version] }
-D [arg [arg ...]], --createdb [arg [arg ...]]
create new reference database: <db_name>
<db_sequences.fasta> [merge_db=<name of db to merge
with>] [update_ec_annotations=<1|0>; default: 0]
-v, --verbose set verbosity level; default: log level INFO
-q, --quiet if flag is set console output is logged to file
--version show program's version number and exit
If you use *mi-faser* in published research, please cite:
Zhu, C., Miller, M., ... Bromberg, Y. (2017).
Functional sequencing read annotation for high precision microbiome analysis.
Nucleic Acids Res. [doi:10.1093/nar/gkx1209]
(https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1209/4670955)
mi-faser is developed by Chengsheng Zhu and Maximilian Miller.
Feel free to contact us for support at services@bromberglab.org.
This project is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0)
Test: mifaser -f mifaser/files/test/artificial_mg.fasta -o mifaser/files/test/out
Example
A demo dataset containing 10k reads is provided to verify a local mi-faser installation. Navigate to the mifaser repository base directory and run mi-faser with the following arguments:
mifaser -f mifaser/files/test/artificial_mg.fasta -o mifaser/files/test/out
The resulting analysis will be located relative to the mifaser base directory at: mifaser/files/test/out/.
DIAMOND upgrade
As DIAMOND (https://github.com/bbuchfink/diamond) is actively developed, we provide an easy way to upgrade (or downgrade) to another version. If a specific version of DIAMOND is given as a parameter, that version is automatically downloaded and installed (default: latest release).
mifaser --update diamond[:<DIAMOND_VERSION>]
Creating a reference database
mi-faser uses a manually curated reference database of protein functions. To create an alternative reference database, first store the desired set of protein sequences in a multi-FASTA file using the following format for the sequence headers:
>id|annotation|e.c.-number|additional_details
sequences.fasta:
>id|annotation|e.c.-number|additional_details
MKPNTDFMLIADGAKVLTQGNLTEHCAIEVSDGIICGLKSTISAEWTADKPHYRLTSGTL
VAGFIDTQVNGGGGLMFNHVPTLETLRLMMQAHRQFGTTAMLPTVITDDIEVMQAAADAV
AEAIDCQVPGIIGIHFEG
>id|annotation|e.c.-number|additional_details
MYYGLDIGGTKIELAIFDTQLALQDKWRLSTPGQDYSAFMATLAEQIEKADQQCGERGTV
GIALPGVVKADGTVISSNVPCLNQRRVAHDLAQLLNRTVAIGNDCRCFALSEAVLGVGRG
YSRVLGMI
Then run mi-faser using the -D/--createdb argument to create a new reference database my_database:
mifaser -D my_database path/to/sequences.fasta
To use the new database run:
mifaser -d my_database -f mifaser/files/test/artificial_mg.fasta -o mifaser/files/test/out
See the help menu (--help) for more details.
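The header format above can also be produced from structured data. The sketch below formats one entry of a multi-FASTA file. Entry and format_entry are hypothetical names, rejecting "|" inside a field is an assumption made here to keep the four-field header unambiguous, and the 60-character line width simply matches the example sequences shown above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    identifier: str
    annotation: str
    ec_number: str
    details: str
    sequence: str


def format_entry(entry: Entry, width: int = 60) -> str:
    """Format one record as '>id|annotation|e.c.-number|additional_details'."""
    fields = (entry.identifier, entry.annotation, entry.ec_number, entry.details)
    # Assumption: '|' is reserved as the field separator.
    if any("|" in field for field in fields):
        raise ValueError("header fields must not contain '|'")
    header = ">" + "|".join(fields)
    # Wrap the sequence into fixed-width lines.
    lines = [entry.sequence[i:i + width]
             for i in range(0, len(entry.sequence), width)]
    return "\n".join([header, *lines]) + "\n"
```

Writing the concatenated output of format_entry for all entries to sequences.fasta gives a file in the layout that the -D/--createdb example expects.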
License
This project is licensed under NPOSL-3.0.
Citation
If you use mi-faser in published research, please cite:
Zhu, C., Miller, M., Marpaka, S., Vaysberg, P., Rühlemann, M. C., Wu, G. H. F.-A., . . . Bromberg, Y. (2017). Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Res. doi:10.1093/nar/gkx1209
About
mi-faser is developed by Chengsheng Zhu and Maximilian Miller. Feel free to contact us for support: services@bromberglab.org.