Indexes genomics data (nucleotide variants, kmers, MLST) for fast querying of features.
Project description
Genomics data index
This project provides a system for indexing large amounts of genomics data and enabling rapid querying of that data.
Indexing breaks genomes up into individual features (nucleotide mutations, k-mers, or genes/MLST) and stores the index in a directory that can easily be shared with other people. Indexes can be generated directly from sequence data or loaded from existing intermediate files (e.g., VCF files, MLST results).
# Analyze sequence data (reads/assemblies, compressed/uncompressed)
gdi analysis --reference-file genome.gbk.gz *.fasta.gz *.fastq.gz
# (Alternatively) Index features in previously computed files (VCF files, or MLST results)
gdi load vcf --reference-file reference.gbk.gz vcf-files.txt
gdi load mlst-tseemann mlst.tsv # Load from https://github.com/tseemann/mlst
gdi load mlst-sistr sistr-profiles.csv # Load from https://github.com/phac-nml/sistr_cmd
Querying provides both a Python API and a command-line interface for selecting sets of samples using this index or attached external data (e.g., phylogenetic trees or DataFrames of metadata).
Python API:
# Select samples with a D614G mutation on gene S
r = s.hasa('hgvs:MN996528.1:S:D614G')
# Select samples with Allele 100 for Locus (gene) adk in MLST scheme ecoli
r = s.hasa('ecoli:adk:100')
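For context, s above is a samples query object created from an existing index. A minimal sketch of how such an object might be obtained is shown below; the module path and the connect()/samples_query() entry points are assumptions based on the tutorials, so verify the exact names there.
# Sketch only: the entry-point names below are assumptions based on the tutorials.
import genomics_data_index.api as gdi_api

db = gdi_api.GenomicsDataIndex.connect('index')  # open an existing index directory
s = db.samples_query()                           # start a query over all indexed samples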
Summaries of the features (mutations, kmers, MLST) can be exported from a set of samples alongside nucleotide alignments, distance matrices or trees constructed from subsets of features.
r.summary_features()
| Mutation | Count |
|---|---|
| 10 G>T | 1 |
| 20 C>T | 3 |
| 30 A>G | 5 |
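Assuming summary_features() returns a pandas DataFrame, as the tabular output above suggests (check the tutorials for the exact return type), the summary can be exported like any other DataFrame:
# Assumption: summary_features() returns a pandas DataFrame of feature counts.
summary = r.summary_features()
summary.to_csv('feature_summary.tsv', sep='\t')  # write a tab-separated summary file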
Visualization of trees and sets of selected samples can be constructed using the provided Python API and the visualization tools provided by the ETE Toolkit.
r.tree_styler() \
    .highlight(set1) \
    .highlight(set2) \
    .render()  # additional .highlight() calls can be chained before .render()
You can see more examples of this software in action in the provided Tutorials.
1. Overview
The software is divided into two main components: (1) Indexing and (2) Querying.
1.1. Indexing
The indexing component provides a mechanism to break genomes up into individual features and store these features in a database. The types of features supported include: Nucleotide mutations, K-mers, and Genes/MLST.
1.1.1. Naming features
Indexing assigns names to the individual features, represented as strings inspired by the Sequence Position Deletion Insertion (SPDI) model.
- Nucleotide mutations: `sequence:position:deletion:insertion` (e.g., `ref:100:A:T`)
- Genes/MLST: `scheme:locus:allele` (e.g., `ecoli:adk:100`)
- Kmers: Not implemented yet

Alternatively, for nucleotide mutations, names can be given in HGVS notation (as output by SnpEff):

- Nucleotide mutations: `hgvs:sequence:gene:p.protein_change` (e.g., `hgvs:ref:geneX:p.P20H`)
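To make the naming convention concrete, the small helpers below build identifiers in the formats above. They are purely illustrative and are not part of the gdi API.
# Hypothetical helpers illustrating the SPDI-inspired identifier formats; not part of the gdi API.
def mutation_id(sequence: str, position: int, deletion: str, insertion: str) -> str:
    """Build a nucleotide mutation identifier: sequence:position:deletion:insertion."""
    return f"{sequence}:{position}:{deletion}:{insertion}"

def mlst_id(scheme: str, locus: str, allele: str) -> str:
    """Build a Genes/MLST identifier: scheme:locus:allele."""
    return f"{scheme}:{locus}:{allele}"

print(mutation_id("ref", 100, "A", "T"))  # ref:100:A:T
print(mlst_id("ecoli", "adk", "100"))     # ecoli:adk:100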
1.2. Querying
The querying component provides a Python API or command-line interface for executing queries on the genomics index. The primary type of query is a Samples query which returns sets of samples based on different criteria. These criteria are grouped into different Methods. Each method operates on a particular type of Data which could include features stored in the genomics index as well as trees or external metadata.
1.2.1. Python API
An example query on an existing set of samples s would be:
r = s.isa('B.1.1.7', isa_column='lineage') \
.isin(['SampleA'], distance=1, units='substitutions') \
.hasa('MN996528.1:26568:C:A')
This would be read as:

Select all samples in s which are a B.1.1.7 lineage as defined in some attached DataFrame (isa()), AND which are within 1 substitution of SampleA as defined on a phylogenetic tree (isin()), AND which have a MN996528.1:26568:C:A mutation (hasa()).
Note: I have left out some details in this query. Full examples for querying are available at Tutorial 1: Salmonella dataset.
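For readability, the same query can also be written one criterion at a time. This sketch only rearranges the chained example above and assumes the same attached DataFrame and phylogenetic tree; the chaining above suggests each call returns a new, narrowed query object, so the criteria combine with AND semantics.
# Same query as above, written step by step (criteria combine as AND).
lineage_b117 = s.isa('B.1.1.7', isa_column='lineage')        # lineage from the attached DataFrame
near_sample_a = lineage_b117.isin(['SampleA'], distance=1,
                                  units='substitutions')     # within 1 substitution on the tree
r = near_sample_a.hasa('MN996528.1:26568:C:A')               # has the specified mutation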
2. Background
A paper on this project is in progress. A detailed description is found in my Thesis.
Additionally, a poster on this project can be found at immem2022.
3. Installation
3.1. Conda
Conda is a package and environment management tool which makes it easy to install and maintain software dependencies without requiring administrator/root access. Conda packages are provided through different channels, and the bioconda channel contains a very large collection of bioinformatics software which can be installed automatically. To make use of conda you will first have to download and install conda. Once installed you can use the conda command to install software and manage conda environments.
To install this software, first create a conda environment with the necessary dependencies as follows (a full conda package is not yet available; see https://github.com/apetkau/genomics-data-index/issues/51).
conda create -c conda-forge -c bioconda -c defaults --name gdi python=3.8 pyqt bedtools iqtree 'bcftools>=1.13' 'htslib>=1.13'
# Activate environment. Needed to install additional Python dependencies below.
conda activate gdi
Now, you can install with:
pip install genomics-data-index
If everything is working you should be able to run:
gdi --version
You should see gdi, version 0.1.0 printed out.
Additional dependencies
For SnpEff to work you will need to install the package mkisofs on Ubuntu (e.g., sudo apt install mkisofs). I do not know the exact package name on other systems.
3.2. PyPI/pip
To install just the Python component of this project from PyPI you can run the following:
pip install genomics-data-index
Note that you will have to install some additional dependencies separately. Please see the conda-env.yaml environment file for details.
3.3. From GitHub
To install the project from the source on GitHub please first clone the git repository:
git clone https://github.com/apetkau/genomics-data-index.git
cd genomics-data-index
Now install all the dependencies using conda and bioconda with:
conda env create -f conda-env.yaml
conda activate gdi
Once these are installed you can set up the Python package with:
pip install .
4. Usage
The main command is called gdi. A quick overview of the usage is as follows:
4.1. Indexing
# Create new index in `index/`
# cd to `index/` to make next commands easier to run
gdi init index
cd index
# Creates an index of mutations (VCF files) and kmer sketches (sourmash)
gdi analysis --use-conda --include-kmer --kmer-size 31 --reference-file genome.gbk.gz *.fastq.gz
# (Optional) build tree from mutations (against reference genome `genome`) for phylogenetic querying
gdi rebuild tree --align-type full genome
The produced index will be in the directory index/.
4.2. Querying
# List indexed samples
gdi list samples
# Query for genomes with mutation
gdi query mutation 'genome:10:A:T'
4.3. Main usage statement
Usage: gdi [OPTIONS] COMMAND [ARGS]...
Options:
--project-dir TEXT A project directory containing the data and
connection information.
--ncores INTEGER RANGE Number of cores for any parallel processing
[default: 8]
--log-level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
Sets the log level [default: INFO]
--version Show the version and exit.
--config FILE Read configuration from FILE.
--help Show this message and exit.
Commands:
analysis
build
db
export
init
input
list
load
query
rebuild
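If you do not want to cd into the index directory, the --project-dir option shown above can point a command at the index instead; for example (assuming the option applies to all subcommands, as the usage statement suggests):
# Run a command against an index directory without changing into it
# (assumption: --project-dir applies to all subcommands, as the usage statement above suggests)
gdi --project-dir index list samples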
5. Tutorial
Tutorials and a demonstration of the software are available below (code in a separate repository). You can select the [launch | binder] badge to launch each of these tutorials in an interactive Jupyter environment in the cloud using Binder.
- Tutorial 1: Querying (Salmonella)
  - In case the GitHub link is not rendering, try here
- Tutorial 2: Indexing assemblies (SARS-CoV-2)
  - In case the GitHub link is not rendering, try here
- Tutorial 3: Querying overview
  - In case the GitHub link is not rendering, try here
Alternatively, you can run these tutorials on your local machine. To do so, you will first have to install the genomics-data-index software (see the Installation section for details) as well as Jupyter Lab. If you have already installed the genomics-data-index software with conda you can install Jupyter Lab as follows:
conda activate gdi
conda install jupyterlab
To run Jupyter you can run the following:
# Setting QT_QPA_PLATFORM avoids having to set the DISPLAY environment variable for Qt.
# You can skip this environment variable if you are running on a machine with an X server installed and configured.
QT_QPA_PLATFORM="offscreen" jupyter lab
Please see the instructions for Jupyter Lab for details.
6. Acknowledgements
I would like to acknowledge the Public Health Agency of Canada, the University of Manitoba, and the VADA Program for providing me with the opportunity, resources and training for working on this project.
Some icons used in this documentation are provided by Font Awesome and licensed under a Creative Commons Attribution 4.0 license.