
# mutacc

## The mutation accumulation database

mutacc is a tool for creating synthetic datasets to be used for quality control
and benchmarking of bioinformatic tools and pipelines intended for calling
clinical variants. Using the raw reads that support a known variant in real NGS
data, mutacc stores the relevant reads from each case in a database. This
database can then be queried to create validation sets of true positives with
the same properties as real NGS data.

## Installation
### Conda
Installation of mutacc and its external prerequisites is made easy by creating
a conda environment:

```
conda create -n <env_name> python=3.6 pip numpy cython
```

and activating it:

```
source activate <env_name>
```
### External Prerequisites
mutacc makes use of two external packages, seqkit (>= v0.9) and picard
(>= v2.18). These can be installed within a conda environment:

```
conda install -c bioconda picard
conda install -c bioconda seqkit
```

### Install mutacc
Within the conda environment, do:

```
git clone
pip install -e mutacc
```
## Usage

### Configuration File

Some options are best passed to mutacc through a configuration file. Below is
an example of a config file, using the YAML format:

```yaml
host: <host>              # defaults to 'localhost'
port: <port>              # defaults to 27017
database: <database_name> # defaults to 'mutacc'
username: <username>
password: <password>
root_dir: <path_to_root>
```
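As an illustration of the defaults listed above, a config dict parsed from such a file could be merged with them like this (a Python sketch, not mutacc's own option handling):

```python
# Defaults stated in this README: host, port, and database fall back to
# 'localhost', 27017, and 'mutacc' when the config file omits them.
DEFAULTS = {"host": "localhost", "port": 27017, "database": "mutacc"}

def with_defaults(config: dict) -> dict:
    """Return the config with any missing connection setting filled in."""
    return {**DEFAULTS, **config}

print(with_defaults({"port": 27018}))
# → {'host': 'localhost', 'port': 27018, 'database': 'mutacc'}
```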

The 'root_dir' entry specifies an existing directory in the file system, where
all files generated by mutacc will be stored in corresponding subdirectories.
E.g. all generated fastq files will be stored in /.../root_dir/reads/
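Putting together the subdirectories mentioned throughout this README, the root_dir layout can be sketched as follows (illustrative Python; mutacc creates these directories itself as needed):

```python
from pathlib import Path

# Subdirectories of root_dir referenced in this README:
# reads/, imports/, variants/, queries/ and datasets/.
ROOT = Path("root_dir")
for name in ("reads", "imports", "variants", "queries", "datasets"):
    (ROOT / name).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in ROOT.iterdir()))
# → ['datasets', 'imports', 'queries', 'reads', 'variants']
```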

### Populate the mutacc database

To export data sets from the mutacc DB, the database must first be populated.
To extract the raw reads supporting a known variant, mutacc makes use of the
relevant files generated from an NGS experiment up to the variant calling
itself: the bam file, and a vcf file containing only the variants of interest.

This information is specified as a 'case', represented in YAML format:

```yaml
case_id: 'case123' # REQUIRED

samples:
  - sample_id: 'sample1' # REQUIRED
    analysis_type: 'wgs' # REQUIRED
    sex: 'male' # REQUIRED
    mother: 'sample2' # REQUIRED (can be 0 if no mother)
    father: 'sample3' # REQUIRED (can be 0 if no father)
    bam_file: /path/to/sorted_bam # REQUIRED
    phenotype: 'affected'

  - sample_id: 'sample2'
    analysis_type: 'wgs'
    sex: 'female'
    mother: '0' # 0 if no parent
    father: '0'
    bam_file: /path/to/sorted_bam
    phenotype: 'unaffected'

  - sample_id: 'sample3'
    analysis_type: 'wgs'
    sex: 'male'
    mother: '0'
    father: '0'
    bam_file: /path/to/sorted_bam
    phenotype: 'affected'

variants: /path/to/vcf
```
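A quick way to sanity-check a parsed case of this shape before running extract could look like the following sketch (assuming the sample list is parsed under a 'samples' key; the required keys are taken from the REQUIRED comments above, and this is not mutacc's own validation code):

```python
# Keys marked REQUIRED for each sample in the case file example above.
REQUIRED_SAMPLE_KEYS = {"sample_id", "analysis_type", "sex",
                        "mother", "father", "bam_file"}

def check_case(case: dict) -> list:
    """Return a list of problems found in a parsed case; empty means OK."""
    problems = []
    if not case.get("case_id"):
        problems.append("missing case_id")
    for sample in case.get("samples", []):
        missing = REQUIRED_SAMPLE_KEYS - sample.keys()
        if missing:
            problems.append(
                f"sample {sample.get('sample_id')}: missing {sorted(missing)}")
    if not case.get("variants"):
        problems.append("missing variants vcf path")
    return problems

# An empty dict fails every check:
print(check_case({}))  # → ['missing case_id', 'missing variants vcf path']
```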

This will find the reads in the bam files specified for each sample. If it
is desired that the reads are found in the fastq files instead, the fastq
files can be specified for a sample:

```yaml
- sample_id: 'sample1'
  analysis_type: 'wgs'
  sex: 'male'
  mother: 'sample2'
  father: 'sample3'
  bam_file: /path/to/sorted_bam
  fastq_files:
    - /path/to/fastq1
    - /path/to/fastq2
  phenotype: 'affected'
```
To extract the reads for the case:

```
mutacc --config-file <config_file> extract --padding 600 --case <case_file>
```

The --padding option takes the number of basepairs to pad the desired region
with.
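The effect of --padding can be illustrated with a little arithmetic (illustrative only; not mutacc's internal region handling):

```python
def padded_region(chrom: str, start: int, end: int, padding: int = 600):
    """Widen the region around a variant by `padding` basepairs on each
    side, clamping the start at 0."""
    return chrom, max(0, start - padding), end + padding

# A SNV at position 1,000,000 padded with the 600 bp used above:
print(padded_region("1", 1_000_000, 1_000_000))  # → ('1', 999400, 1000600)
```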

This will create a file <case_id>.mutacc stored in the /.../root_dir/imports/
directory.

To import the case into the database:

```
mutacc db import /.../root_dir/imports/<case_id>.mutacc
```

The db command is called each time mutacc needs to do any operation against the
database.

This will try to establish a connection to an instance of mongodb, by default
running on 'localhost' on port 27017. If this is not wanted, the host and port
can be specified with the --host and --port options:

```
mutacc db -h <host> -p <port> import case_id.mutacc
```

If authentication is required, this can be specified with the --username and
--password options, or in a configuration file, e.g.

```yaml
host: <host>
port: <port>
username: <username>
password: <password>
```

```
mutacc --config-file <config.yaml> db import case_id.mutacc
```

### Export datasets from the database
The datasets are exported one sample at a time. At the moment, mutacc only
supports father/mother/child trios and single samples. To export a synthetic
dataset, the export command is used together with the following options:


-m/--member [child|father|mother|affected] \
    Specifies which family member to create a dataset for. Finds the correct
    member in each case (if trio) in the database, and uses the reads from this
    sample only to enrich the background samples. If a single-sample dataset is
    required, the option can be passed the 'affected' argument, to use the
    reads from only one of the affected samples in each case.

-c/--case-query \
    Query to search among the case collection in the mongodb. A JSON string,
    in valid mongodb query language.

-v/--variant-query \
    Query to search among the variants collection.

-s/--sex [male|female] \
    Specify the sex of the sample.

-n/--sample-name \
    Name of the sample.

-p/--proband \
    This flag will mark the sample as 'proband', which forces all variants
    from single cases to be included in this sample.

--vcf-dir \
    Specify the directory where the vcf file (truth set) is stored. Defaults
    to /.../root_dir/variants/.
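Since -c/--case-query and -v/--variant-query take JSON strings in mongodb query language, such strings can be composed with json.dumps; the field name below is a hypothetical example, not a schema mutacc guarantees:

```python
import json

# '{}' matches every document in the collection, as used in this README.
case_query = json.dumps({})

# A hypothetical variant query using a standard mongodb operator:
variant_query = json.dumps({"padding": {"$gte": 300}})

print(case_query)     # → {}
print(variant_query)  # → {"padding": {"$gte": 300}}
```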


```
mutacc --config-file <config.yaml> db export -m affected -c '{}'
```

will find all the cases in the mutacc DB, and store this information in a file
/.../root_dir/queries/<sample_name>_query.mutacc.

To export an entire trio:

```
mutacc --config-file <config_file> db export -m child -c '{}' -p -n child
mutacc --config-file <config_file> db export -m father -c '{}' -n father
mutacc --config-file <config_file> db export -m mother -c '{}' -n mother
```

This will create three files: child_query.mutacc, father_query.mutacc, and
mother_query.mutacc.

The export subcommand will also generate a truth set vcf file for each exported
sample, containing all queried variants.

To make a dataset (fastq files) from a query file, the synthesize command is
used with the following options:

-b/--background-bam \
    Path to the bam file for the sample to be used as background.

-f/--background-fastq \
    Path to the fastq file for the sample to be used as background.

-f2/--background-fastq2 \
    Path to the second fastq file (if paired-end experiment).

-q/--query \
    Path to the query file created with the export command.

--dataset-dir \
    Directory where the fastq files will be stored. Defaults to
    /.../root_dir/datasets/.

For example, using the query files created above:

```
mutacc --config-file <config_file> synthesize -b <bam> -f <fastq1_child> -f2 <fastq2_child> -q child_query.mutacc
mutacc --config-file <config_file> synthesize -b <bam> -f <fastq1_father> -f2 <fastq2_father> -q father_query.mutacc
mutacc --config-file <config_file> synthesize -b <bam> -f <fastq1_mother> -f2 <fastq2_mother> -q mother_query.mutacc
```

The created fastq files will be stored in the directory /.../root_dir/datasets/
or in the directory specified by --dataset-dir.

### Remove case from database

To remove a case from the mutacc DB, along with all bam and fastq files
generated from that case on disk, the remove command is used:

```
mutacc --config-file <config.yaml> db remove <case_id>
```

## Limitations

mutacc is currently under development and only supports either single cases
(cases with one sample) or mother/father/child trios. Furthermore, all cases
uploaded to, and exported from, the mutacc DB are assumed to contain paired-end
reads.
