"Automated quality control for GenBank genomes."
GenBank Quality Control
Complete documentation lives at genbank-qc.readthedocs.io. It is a work in progress.
GenBankQC is an effort to address the quality control problem for public databases such as the National Center for Biotechnology Information’s GenBank. The goal is to offer a simple, efficient, and automated solution for assessing the quality of your genomes.
Note
GenBankQC is currently in beta. It began as a proof of concept for a specific use case and has limitations that users should be aware of. If there is interest, we will address these issues to make it more convenient to use. Please see the caveats for more details.
Features
Labelling/annotation-independent quality control based on:
Simple metrics
Genome distance estimation using MASH
Flag potential outliers so they can be excluded before they pollute your pipelines
The genbankqc workflow consists of the following steps:
Generate statistics for each genome based on the following metrics:
Number of unknown bases
Number of contigs
Assembly size
Average MASH distance compared to other genomes
Flag potential outliers based on these statistics:
Flag genomes containing more than a certain number of unknown bases.
Flag genomes outside of a range based on the median absolute deviation.
Applies to number of contigs and assembly size
Flag genomes whose average MASH distance exceeds the upper bound of the range based on the median absolute deviation.
Visualize the results with a color-coded tree
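The statistics and flagging steps above can be sketched in plain Python. This is a minimal illustration, not GenBankQC's actual code: the function names, the dictionary layout, and the thresholds (`max_unknown=200`, `n_devs=3.0`) are all assumptions made for the example, and each filter is applied independently to the full set of genomes, which may differ from how the tool sequences them.

```python
import statistics


def fasta_metrics(fasta_text):
    """Count contigs, total bases, and unknown (N) bases in one FASTA string."""
    contigs = 0
    assembly_size = 0
    unknown_bases = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            contigs += 1  # each header line starts a new contig
        else:
            assembly_size += len(line)
            unknown_bases += line.upper().count("N")
    return {
        "contigs": contigs,
        "assembly_size": assembly_size,
        "unknown_bases": unknown_bases,
    }


def mad_bounds(values, n_devs):
    """Return (low, high) = median -/+ n_devs * median absolute deviation."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med - n_devs * mad, med + n_devs * mad


def flag_outliers(stats, max_unknown=200, n_devs=3.0):
    """Map genome id -> list of metrics that flagged it.

    `stats` maps genome id -> dict with the keys 'unknown_bases',
    'contigs', 'assembly_size', and 'mash_distance' (hypothetical layout).
    The default thresholds are illustrative assumptions.
    """
    flagged = {}
    # Unknown bases: a fixed upper limit.
    for gid, m in stats.items():
        if m["unknown_bases"] > max_unknown:
            flagged.setdefault(gid, []).append("unknown_bases")
    # Contigs and assembly size: flag values outside the two-sided MAD range.
    for key in ("contigs", "assembly_size"):
        low, high = mad_bounds([m[key] for m in stats.values()], n_devs)
        for gid, m in stats.items():
            if not low <= m[key] <= high:
                flagged.setdefault(gid, []).append(key)
    # MASH distance: only the upper bound matters.
    _, high = mad_bounds([m["mash_distance"] for m in stats.values()], n_devs)
    for gid, m in stats.items():
        if m["mash_distance"] > high:
            flagged.setdefault(gid, []).append("mash_distance")
    return flagged
```

The median absolute deviation is used here rather than the standard deviation because a single extreme assembly barely moves the median, so the outlier does not widen the range that is meant to catch it.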
Usage
genbankqc /path/to/genomes
open /path/to/genomes/Escherichia_coli/qc/200_3.0_3.0_3.0/tree.svg
Installation
If you don’t yet have a functional conda environment, please download and install Miniconda.
conda create -n genbankqc -c etetoolkit -c biocore pip ete3 scikit-bio
source activate genbankqc
pip install genbankqc
Caveats
There are some arbitrary, hard-coded limitations regarding file names. This is because the project originally began as a part of the NCBI Tool Kit (NCBITK), which we use for downloading genomes from NCBI. NCBITK generates a specific directory structure and file naming scheme, which GenBankQC currently expects.
If you’d like to use GenBankQC without using NCBITK, all that is required is that your file names match the Python regular expression re.compile('.*(GCA_\d+\.\d.*)(.fasta)'). You can quickly test this by following my example at pythex.org.
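As a quick local alternative to an online tester, the pattern quoted above can be tried against your own file names. The helper name `matches_expected_name` and the sample file names are hypothetical, and using `match` (anchored at the start of the string) is an assumption about how the pattern is applied.

```python
import re

# The pattern quoted in the caveats, written as a raw string so that the
# backslashes reach the regex engine unchanged. Note that the "." in
# "(.fasta)" is unescaped in the quoted pattern, so it matches any single
# character; it is kept here exactly as quoted.
FILENAME_PATTERN = re.compile(r'.*(GCA_\d+\.\d.*)(.fasta)')


def matches_expected_name(file_name):
    """Return True if file_name matches the quoted naming pattern."""
    return FILENAME_PATTERN.match(file_name) is not None


if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    for name in (
        "GCA_000005845.2_ASM584v2_genomic.fasta",
        "genome.fasta",
        "GCA_000005845.2_genomic.fna",
    ):
        print(name, matches_expected_name(name))
```

In short, a name needs a `GCA_` accession with a version number (for example `GCA_000005845.2`) followed somewhere later by `fasta` preceded by one character, normally the dot of the extension.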
Source Distribution
File details
Details for the file GenBankQC-0.1a0.tar.gz.
File metadata
- Download URL: GenBankQC-0.1a0.tar.gz
- Upload date:
- Size: 11.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9d492b8199b62a5396006f8234cb88e89091cfc6123221e30242c2c51242a806
MD5 | ef2e58193a7d0ae26dd8eaa12436f585
BLAKE2b-256 | ce3567c554d2f973ec6b67073043e008b821fb3d2083be393947998bb75cfc2e