
Fast and accurate set similarity estimation via containment min hash (for genomic datasets).


# CMash

CMash is a fast and accurate way to estimate the similarity of two sets. It is a probabilistic data analysis approach based on containment min hashing. Please see the [associated paper](http://www.biorxiv.org/content/early/2017/09/04/184150) for further details (and please cite it if you use CMash):

>Improving Min Hash via the Containment Index with applications to Metagenomic Analysis
>David Koslicki, Hooman Zabeti
>bioRxiv 184150; doi: https://doi.org/10.1101/184150
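
To make the two quantities concrete, here is a minimal sketch that computes the exact containment index and Jaccard index of two small k-mer sets and checks the identity that relates them. It does not use CMash or any hashing: the helper names (`kmers`, `containment`, `jaccard`, `jaccard_from_containment`) and the toy sequences are illustrative assumptions, and CMash estimates these quantities from sketches rather than computing them exactly.

```python
from __future__ import division  # keeps "/" as true division on Python 2 as well


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Containment of a in b: |A & B| / |A|."""
    return len(a & b) / len(a)


def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


def jaccard_from_containment(c, size_a, size_b):
    """Recover the Jaccard index from the containment c of A in B.

    Uses |A & B| = c * |A| and |A | B| = |A| + |B| - |A & B|.
    """
    intersection = c * size_a
    return intersection / (size_a + size_b - intersection)


# Toy data (made up): a short "reference" that occurs inside a longer "sample".
reference = kmers("ACGTACGTGA", 4)
sample = kmers("TTACGTACGTGACCGGA", 4)

c = containment(reference, sample)
j_direct = jaccard(reference, sample)
j_via_c = jaccard_from_containment(c, len(reference), len(sample))
```

Because the reference is fully contained in the much larger sample, its containment index is high while its Jaccard index is low. This is the situation a metagenomic query typically produces.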

## Installation

The easiest way to install CMash is inside a [virtualenv](https://virtualenv.pypa.io/en/stable/):
```bash
virtualenv CMashVE
source CMashVE/bin/activate
pip install -U pip
pip install CMash
```
You can also just use `pip install CMash` if you don’t want to create a virtual environment.

To get the very latest version of CMash, you can build from the GitHub repository via:
```bash
virtualenv CMashVE
source CMashVE/bin/activate
pip install -U pip
git clone https://github.com/dkoslicki/CMash.git
cd CMash
pip install -r requirements.txt
```

Note that the Python code in this repository is compatible with both Python 2 and Python 3. The dependency khmer technically requires Python 3 (although khmer version 2.1.1 runs just fine under Python 2), while the external dependency pytst requires Python 2, so I’m making this a Python 2 repository.

## Usage

The basic paradigm is to create a reference/training database, form a sample bloom filter, and then query the database.

#### Forming a reference/training database

Say you have three reference fasta/q files: `ref1.fa`, `ref2.fa` and `ref3.fa`. In a file (here called `FileNames.txt`), place the absolute paths pointing to the fasta/q files:
```bash
cat FileNames.txt
# /abs/path/to/ref1.fa
# /abs/path/to/ref2.fa
# /abs/path/to/ref3.fa
```
Then you can create the training database via:
```bash
MakeDNADatabase.py FileNames.txt TrainingDatabase.h5
```
See `MakeDNADatabase.py -h` for more options when forming a database.
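
If the reference files live in a single directory, the list of absolute paths can be generated rather than typed by hand. The following is a minimal sketch using only the standard library. The function name `write_file_list` and the set of filename patterns are assumptions made for illustration and are not part of CMash.

```python
import glob
import os


def write_file_list(fasta_dir, out_path,
                    patterns=("*.fa", "*.fasta", "*.fq", "*.fastq")):
    """Write the absolute path of every matching file in fasta_dir to out_path.

    One path per line, sorted, so the result can be passed to MakeDNADatabase.py.
    Returns the list of paths that were written.
    """
    found = set()
    for pattern in patterns:
        found.update(glob.glob(os.path.join(fasta_dir, pattern)))
    abs_paths = sorted(os.path.abspath(p) for p in found)
    with open(out_path, "w") as handle:
        for path in abs_paths:
            handle.write(path + "\n")
    return abs_paths
```

For example, `write_file_list("references", "FileNames.txt")` would write the list for a (hypothetical) directory called `references`.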

#### Creating a sample bloom filter

Given a (large) query fasta/q file `Metagenome.fa`, you can optionally create a bloom filter via `MakeNodeGraph.py Metagenome.fa .`. See `MakeNodeGraph.py -h` for more details about this script.

This step is not strictly necessary (as the next step automatically forms a nodegraph/bloom filter if you didn’t already create one). However, I’ve provided this script in case you want to pre-process a bunch of metagenomes.
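
For pre-processing several metagenomes, one option is a small driver that builds one command per input file. The sketch below mirrors the invocation form shown above (`MakeNodeGraph.py <input> .`). Treating the second positional argument as an output location is an assumption, so check `MakeNodeGraph.py -h` before relying on it. The function names `node_graph_commands` and `run_all` are hypothetical helpers, not part of CMash.

```python
from __future__ import print_function

import subprocess


def node_graph_commands(metagenomes, out_dir="."):
    """Build one MakeNodeGraph.py command (as an argument list) per input file."""
    # Assumption: the second positional argument is where the output is written.
    return [["MakeNodeGraph.py", path, out_dir] for path in metagenomes]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Raises CalledProcessError if the script exits with a non-zero status.
            subprocess.check_call(cmd)
```

Running with `dry_run=True` first lets you inspect the commands before anything is executed.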

#### Query the database

To get containment and Jaccard index estimates of the reference files in your query file `Metagenome.fa`, use something like the following: `QueryDNADatabase.py Metagenome.fa TrainingDatabase.h5 Output.csv`.

There are a bunch of options available; see `QueryDNADatabase.py -h`. The output file is a CSV file with rows corresponding (in this case) to `ref1.fa`, `ref2.fa`, and `ref3.fa` and columns corresponding to the containment index estimate, intersection cardinality, and Jaccard index estimate.
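
Once the query has finished, the CSV can be post-processed with the standard library. The sketch below ranks rows by a numeric column chosen by the caller. The exact header names written by `QueryDNADatabase.py` are not stated here, so the function takes the column name as a parameter; inspect the first line of your own `Output.csv` to find it. The function name `top_hits` is an illustrative assumption.

```python
import csv


def top_hits(csv_path, sort_column, n=5):
    """Return the n rows with the largest value in sort_column.

    Each row is a dict keyed by the CSV header, with values kept as strings.
    """
    with open(csv_path) as handle:
        rows = list(csv.DictReader(handle))
    # Values are read as text, so convert to float for numeric ordering.
    rows.sort(key=lambda row: float(row[sort_column]), reverse=True)
    return rows[:n]
```

A call would look like `top_hits("Output.csv", "<containment column name>")`, with the placeholder replaced by the header found in your file.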

#### Other functionality

The module `MinHash` (imported in Python via `from CMash import MinHash as MH`) has a bunch more functionality, including (but not limited to!):

1. Fast updates to the training databases (via `help(MH.delete_from_database)`, `help(MH.insert_to_database)`, `help(MH.union_databases)`).
2. Ability to form a matrix of Jaccard indexes (for comparison of all pairwise Jaccard indexes of organisms in the training database). This is useful for identifying redundancies/patterns/structure in your training database: `help(MH.form_jaccard_count_matrix)` and `help(MH.form_jaccard_matrix)`.
3. Access to the k-mers that MinHash randomly selected (see the class `CountEstimator` and the associated `_kmers` data structure).
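
To illustrate how a pairwise Jaccard matrix helps spot redundancy, here is a minimal sketch on plain Python sets. It does not call `MH.form_jaccard_matrix` (see its `help` output for the real signature). The names `jaccard_matrix` and `redundant_pairs`, and the choice of threshold, are assumptions made for illustration.

```python
from __future__ import division  # keeps "/" as true division on Python 2 as well


def jaccard_matrix(kmer_sets):
    """Return a symmetric matrix (list of lists) of exact pairwise Jaccard indexes."""
    n = len(kmer_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            union = len(kmer_sets[i] | kmer_sets[j])
            # Convention chosen here: two empty sets get a Jaccard index of 0.0.
            value = len(kmer_sets[i] & kmer_sets[j]) / union if union else 0.0
            matrix[i][j] = matrix[j][i] = value
    return matrix


def redundant_pairs(matrix, threshold):
    """Return index pairs (i, j), i < j, whose Jaccard index is >= threshold."""
    n = len(matrix)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if matrix[i][j] >= threshold]
```

Pairs returned by `redundant_pairs` are candidates for near-duplicate entries in a training database.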

I’d encourage you to poke through the source code of MinHash.py and take a look at the scripts as well.

Protein databases (and for that matter, arbitrary K-length strings) coming soon…


Files for CMash, version 0.3.0

| Filename | Size | File type | Python version |
|---|---|---|---|
| CMash-0.3.0-py2-none-any.whl | 89.6 kB | Wheel | py2 |
| CMash-0.3.0.tar.gz | 81.7 kB | Source | None |
