A sketch-based surveillance platform
Mashpit
Create a database of mash signatures and find the most similar genomes to a target sample
Dependencies
- Python >= 3.8
- NCBI datasets
Installation
Install NCBI datasets
curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets'
chmod +x datasets
export PATH=$PATH:$PWD
Install mashpit using pip:
pip install mashpit
Or clone the repository from GitHub and install from source:
git clone https://github.com/tongzhouxu/mashpit.git
cd mashpit
pip install .
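Before building a database, it can help to confirm that both dependencies listed above are in place. The following is a minimal sketch, not part of mashpit: the function name `check_dependencies` is hypothetical, and the check only tests that the interpreter is at least Python 3.8 and that an executable named `datasets` is on `PATH`.

```python
import shutil
import sys


def check_dependencies(min_python=(3, 8), tool="datasets"):
    """Report whether the Python version and the NCBI datasets CLI look usable.

    Hypothetical helper for illustration; mashpit does not ship this function.
    """
    return {
        # Compare only (major, minor) against the minimum required version.
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        tool: shutil.which(tool) is not None,
    }


for name, ok in check_dependencies().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If `datasets` is reported as missing, the `export PATH=$PATH:$PWD` step above probably needs to be repeated in the current shell.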
Mashpit Database
A mashpit database is a directory containing:
$DB_NAME.db
$DB_NAME.sig
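The layout above can be verified with a few lines of Python. This is a sketch under the assumption that `$DB_NAME.db` and `$DB_NAME.sig` are regular files sitting directly inside the database directory; `is_mashpit_database` is a hypothetical helper name, not a mashpit API.

```python
from pathlib import Path


def is_mashpit_database(folder, db_name):
    """Return True if `folder` contains both <db_name>.db and <db_name>.sig.

    Hypothetical helper for illustration; it checks file presence only and
    does not validate the contents of either file.
    """
    folder = Path(folder)
    expected = [folder / f"{db_name}.db", folder / f"{db_name}.sig"]
    return all(path.is_file() for path in expected)


print(is_mashpit_database("path/to/database", "salmonella"))
```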
A mashpit database can be built from either:
- A taxonomic name
A standard database is a collection of representative genomes from each cluster on Pathogen Detection. By default, mashpit downloads the latest Pathogen Detection version for the specified species and finds the centroid of each SNP cluster (SNP tree).
- BioSample accessions
A custom database is a collection of genomes based on a provided list of BioSample accessions.
Usage
1. Build a mashpit database
usage: mashpit build [-h] [--quiet] [--number NUMBER] [--ksize KSIZE] [--species SPECIES] [--email EMAIL] [--key KEY] [--pd_version PD_VERSION] [--list LIST] {taxon,accession} name
positional arguments:
  {taxon,accession}     mashpit database type.
  name                  mashpit database name
optional arguments:
  -h, --help            show this help message and exit
  --quiet               disable logs
  --number NUMBER       maximum number of hashes for sourmash, default is 1000
  --ksize KSIZE         kmer size for sourmash, default is 31
  --species SPECIES     species name
  --email EMAIL         Entrez email
  --key KEY             Entrez api key
  --pd_version PD_VERSION
                        a specified Pathogen Detection version (PDG accession). Default is the latest.
  --list LIST           Path to a list of NCBI BioSample accessions
- Example command
mashpit build taxon salmonella --species Salmonella
Note: Supported species names can be found in this list
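To make the `--ksize` and `--number` options more concrete, here is a minimal pure-Python sketch of a bottom-k MinHash signature: every k-mer of the sequence is hashed and only the `number` smallest hash values are kept. It is an illustration only, not how mashpit or sourmash is implemented. In particular, it uses BLAKE2b as a stand-in hash function and ignores reverse complements, both of which are simplifying assumptions, and the function names are hypothetical.

```python
import hashlib


def kmers(sequence, ksize=31):
    """Yield every overlapping k-mer of length `ksize` in the sequence."""
    sequence = sequence.upper()
    for start in range(len(sequence) - ksize + 1):
        yield sequence[start:start + ksize]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, assumed for illustration)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(sequence, ksize=31, number=1000):
    """Return the `number` smallest distinct k-mer hashes, sorted ascending."""
    hashes = {hash_kmer(kmer) for kmer in kmers(sequence, ksize)}
    return sorted(hashes)[:number]


# A toy 80 bp sequence; real inputs would be whole genome assemblies.
toy_sequence = "ACGT" * 20
print(len(bottom_sketch(toy_sequence, ksize=31, number=1000)))
```

A larger `number` keeps more hashes per genome and so gives a finer similarity estimate at the cost of a larger signature file, while `ksize` controls how specific each k-mer is.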
2. Query against a mashpit database
usage: mashpit query [-h] [--number NUMBER] [--threshold THRESHOLD] [--annotation ANNOTATION] sample database
positional arguments:
  sample                path to query sample
  database              path to the database folder
optional arguments:
  -h, --help            show this help message and exit
  --number NUMBER       number of isolates in the query output, default is 200
  --threshold THRESHOLD
                        minimum Jaccard similarity for mashtree, default is 0.85
  --annotation ANNOTATION
                        mashtree tip annotation, default is none
- Example command
mashpit query sample.fasta path/to/database
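The roles of `--number` and `--threshold` can be illustrated with a small ranking sketch. It assumes that each genome is represented by a set of hash values, scores every database entry by plain set Jaccard similarity, keeps the `number` best hits, and then selects the hits at or above `threshold` as tree candidates. This is one plausible reading of the options above, not mashpit's actual code; `jaccard` and `rank_hits` are hypothetical names and the toy hash sets are made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of hash values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_hits(query, database, number=200, threshold=0.85):
    """Return (top hits, names passing the threshold).

    `database` maps a genome name to its collection of hash values.
    """
    scored = [(name, jaccard(query, sketch)) for name, sketch in database.items()]
    # Most similar genomes first.
    scored.sort(key=lambda item: item[1], reverse=True)
    top = scored[:number]
    tree_members = [name for name, score in top if score >= threshold]
    return top, tree_members


# Toy data: three database entries with hand-picked hash sets.
toy_db = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {9}}
top, tree_members = rank_hits({1, 2, 3, 4}, toy_db, number=2, threshold=0.85)
print(top)
print(tree_members)
```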
Optional: Update the database
usage: mashpit update [-h] [--metadata METADATA] [--quiet] database name
positional arguments:
  database              path for the database folder
  name                  database name
optional arguments:
  -h, --help            show this help message and exit
  --metadata METADATA   metadata file in csv format
  --quiet               disable logs
- Example command
mashpit update path/to/database salmonella
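When the build, query, and update steps are run from a script, it can be convenient to assemble the command lines programmatically. The sketch below only constructs argument lists that follow the usage strings shown above and prints them; it does not run mashpit. The helper names are hypothetical, and the paths and database name are the placeholders from the example commands.

```python
import shlex


def build_command(db_type, name, species=None, accession_list=None):
    """Argument list for `mashpit build` (hypothetical helper)."""
    if db_type not in ("taxon", "accession"):
        raise ValueError("db_type must be 'taxon' or 'accession'")
    cmd = ["mashpit", "build", db_type, name]
    if species is not None:
        cmd += ["--species", species]
    if accession_list is not None:
        cmd += ["--list", str(accession_list)]
    return cmd


def query_command(sample, database, number=None, threshold=None):
    """Argument list for `mashpit query` (hypothetical helper)."""
    cmd = ["mashpit", "query"]
    if number is not None:
        cmd += ["--number", str(number)]
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    return cmd + [str(sample), str(database)]


def update_command(database, name, metadata=None):
    """Argument list for `mashpit update` (hypothetical helper)."""
    cmd = ["mashpit", "update"]
    if metadata is not None:
        cmd += ["--metadata", str(metadata)]
    return cmd + [str(database), name]


for cmd in (
    build_command("taxon", "salmonella", species="Salmonella"),
    query_command("sample.fasta", "path/to/database"),
    update_command("path/to/database", "salmonella"),
):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once mashpit is installed.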