
"Automated quality control for GenBank genomes."

GenBank Quality Control

Complete documentation lives at genbank-qc.readthedocs.io. It is a work in progress.

GenBankQC is an effort to address the quality control problem for public databases such as the National Center for Biotechnology Information’s GenBank. The goal is to offer a simple, efficient, and automated solution for assessing the quality of your genomes.

Note

GenBankQC is currently in beta. It began as a proof of concept for a specific use case, so it has limitations that users should be aware of. If there is interest, we will address these issues to make it more convenient to use. Please see Caveats for more details.

Features

  • Labelling/annotation-independent quality control based on:

    • Simple metrics

    • Genome distance estimation using MASH

  • Flag potential outliers so that they do not pollute your pipelines

The genbankqc workflow consists of the following steps:

  1. Generate statistics for each genome based on the following metrics:

    • Number of unknown bases

    • Number of contigs

    • Assembly size

    • Average MASH distance compared to other genomes

  2. Flag potential outliers based on these statistics:

    • Flag genomes containing more than a certain number of unknown bases.

    • Flag genomes outside of a range based on the median absolute deviation.

      • Applies to number of contigs and assembly size

    • Flag genomes whose average MASH distance is greater than the upper end of the range based on the median absolute deviation.

  3. Visualize the results with a color-coded tree
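
The flagging rules in step 2 can be sketched roughly as follows. This is a minimal illustration, not GenBankQC's actual code: the function names, the `median ± k × MAD` form of the range, and the default thresholds (`n_deviations=3.0`, `max_unknowns=200`) are all assumptions made for the example.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of a list of numbers."""
    med = median(values)
    return median([abs(v - med) for v in values])


def flag_outside_range(values, n_deviations=3.0):
    """Two-sided rule (assumed for contig count and assembly size):
    flag values outside median +/- n_deviations * MAD."""
    med = median(values)
    spread = n_deviations * mad(values)
    return [v < med - spread or v > med + spread for v in values]


def flag_above_upper(values, n_deviations=3.0):
    """One-sided rule (assumed for average MASH distance):
    flag values above median + n_deviations * MAD."""
    upper = median(values) + n_deviations * mad(values)
    return [v > upper for v in values]


def flag_unknowns(counts, max_unknowns=200):
    """Flag genomes with more than max_unknowns unknown bases."""
    return [c > max_unknowns for c in counts]


# Hypothetical per-genome statistics, for illustration only.
contigs = [50, 52, 48, 51, 49, 400]
mash_distances = [0.01, 0.012, 0.011, 0.2]
unknown_bases = [0, 150, 201]

print(flag_outside_range(contigs))
print(flag_above_upper(mash_distances))
print(flag_unknowns(unknown_bases))
```

In this made-up data, only the last genome in each list is flagged. Using the median and the median absolute deviation rather than the mean and standard deviation keeps a single extreme genome from widening the accepted range.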

Usage

genbankqc /path/to/genomes
open /path/to/genomes/Escherichia_coli/qc/200_3.0_3.0_3.0/tree.svg

Installation

If you don’t yet have a functional conda environment, please download and install Miniconda.

conda create -n genbankqc -c etetoolkit -c biocore pip ete3 scikit-bio

source activate genbankqc

pip install genbankqc

Caveats

There are some arbitrary, hard-coded limitations regarding file names. This is because the project originally began as a part of the NCBI Tool Kit (NCBITK), which we use for downloading genomes from NCBI. NCBITK generates a specific directory structure and file naming scheme, which GenBankQC currently expects.

If you’d like to use GenBankQC without NCBITK, the only requirement is that your file names match the Python regular expression re.compile('.*(GCA_\d+\.\d.*)(.fasta)'). You can quickly test this by following my example at pythex.org.
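
You can also check file names locally. The sketch below uses the pattern quoted above; the helper name `matches_expected_name` and the sample file names are hypothetical, and whether GenBankQC applies the pattern with `match` or `search` is an assumption.

```python
import re

# Pattern quoted in the caveats above, written as a raw string.
FILENAME_PATTERN = re.compile(r'.*(GCA_\d+\.\d.*)(.fasta)')


def matches_expected_name(file_name):
    """Return True if file_name matches the expected naming scheme."""
    return FILENAME_PATTERN.match(file_name) is not None


# Hypothetical file names, for illustration only.
for name in [
    "GCA_000005845.2_ASM584v2_genomic.fasta",  # has accession and .fasta
    "my_genome.fasta",                         # no GCA_ accession
    "GCA_000005845.2_ASM584v2_genomic.fna",    # wrong extension
]:
    print(name, matches_expected_name(name))
```

Only the first name matches; the other two would need to be renamed before running genbankqc on them.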
