sourmash plugin to do pangenomics.

sourmash_plugin_pangenomics: tools for sourmash-based pangenome analyses

Installation

pip install sourmash_plugin_pangenomics

Quickstart

You can run all of these commands in the test_workflow directory of the git repository.

Build a pangenome database using lineages

(CTB: explain contents!)

The following command builds a pangenome database for the species present in the lineages file gtdb-rs214-agatha.lineages.csv.gz (currently only s__Agathobacter faecis), using the sketches in gtdb-rs214-agatha-k21.zip.

sourmash scripts pangenome_createdb \
    gtdb-rs214-agatha-k21.zip \
    -t gtdb-rs214-agatha.lineages.csv.gz \
    -o agatha-merged.sig.zip --abund -k 21

The output file is agatha-merged.sig.zip and contains the following:

% sourmash sig summarize agatha-merged.sig.zip

...
num signatures: 1
** examining manifest...
total hashes: 27398
summary of sketches:
   1 sketches with DNA, k=21, scaled=1000, abund      27398 total hashes

Note: the command pangenome_merge (see below) will construct a pangenome sketch by merging all provided signatures.
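One way to picture merging with --abund is as counting, for each hash, how many input sketches contain it. The sketch below illustrates that idea on toy data with the Python standard library only. It is an assumption about the general approach, not the plugin's actual implementation; `merge_with_counts` and the hash values are hypothetical.

```python
from collections import Counter


def merge_with_counts(sketches):
    """Merge hash sets, recording how many sketches contain each hash.

    Hypothetical helper for illustration; not part of sourmash.
    """
    counts = Counter()
    for hashes in sketches:
        # set() ensures each hash is counted at most once per sketch
        counts.update(set(hashes))
    return counts


# Toy hash values (made up), standing in for three genome sketches.
genomes = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
merged = merge_with_counts(genomes)

for hashval, n_genomes in sorted(merged.items()):
    print(hashval, n_genomes)
```

In this toy example, hash 3 occurs in all three inputs, while hashes 1, 4, and 5 each occur in only one.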

Build a pangenome "ranktable"

A "ranktable" is our name for a database that assigns hashes a pangenomic "rank" - central core, external core, shell, inner cloud, or surface cloud.

The following command builds a ranktable for the species s__Agathobacter faecis, selected from the pangenome database created above:

sourmash scripts pangenome_ranktable \
    agatha-merged.sig.zip \
    -o test_output/agathobacter_faecis.csv \
    -k 21 -l 'GCF_020557615 s__Agathobacter faecis'

The output file is test_output/agathobacter_faecis.csv, and it contains two columns:

hashval,pangenome_classification
96834755571756,1
119187685848053,1
129679169912030,1
...
18440589591308259,4
18443409651295626,4
18446214016691046,4

where the first column is the hash value, and the second column is the pangenome rank for that hash.
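As a minimal sketch of how this file could be consumed, the following Python reads a ranktable CSV into a dictionary and tallies hashes per rank. The mapping from integer rank to rank name is an assumption based on the order in which the ranks are listed above (1 = central core through 5 = surface cloud); `load_ranktable` is a hypothetical helper, and the rows are copied from the example output.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: integer ranks follow the order in which the README lists them.
RANK_NAMES = {
    1: "central core",
    2: "external core",
    3: "shell",
    4: "inner cloud",
    5: "surface cloud",
}


def load_ranktable(fp):
    """Return {hashval: rank} from an open ranktable CSV file object."""
    reader = csv.DictReader(fp)
    return {
        int(row["hashval"]): int(row["pangenome_classification"])
        for row in reader
    }


# A few rows from the example above. A real file could be opened with
# open("test_output/agathobacter_faecis.csv", newline="") instead.
example = io.StringIO(
    "hashval,pangenome_classification\n"
    "96834755571756,1\n"
    "119187685848053,1\n"
    "18446214016691046,4\n"
)

ranktable = load_ranktable(example)
rank_counts = Counter(ranktable.values())
for rank, n in sorted(rank_counts.items()):
    print(f"{RANK_NAMES[rank]}: {n}")
```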

Summarize the ranks of the hashes in a sketch

We can now use our ranktable to summarize any sketch, including a metagenome. Here we use a human gut metagenome, SRR5650070:

sourmash scripts pangenome_classify \
    SRR5650070.trim.sig.zip \
    test_output/agathobacter_faecis.csv \
    -k 21

This will yield the following output:

For 'test_output/agathobacter_faecis.csv', signature 'SRR5650070' contains:
         497 (12.5%) hashes are classified as central core
         427 (10.8%) hashes are classified as external core
         1791 (45.2%) hashes are classified as shell
         1251 (31.5%) hashes are classified as inner cloud
         0 (0.0%) hashes are classified as surface cloud
         ...and 262716 hashes are NOT IN the csv file
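The percentages in this output appear to be computed relative to the hashes that were found in the ranktable (497 + 427 + 1791 + 1251 = 3966), not relative to all hashes in the sketch. The following sketch reproduces that arithmetic and shows one way such a tally could be made. `classify_hashes` and `percentages` are hypothetical helpers, not the plugin's code, and the rank numbering is an assumption based on the order of the ranks listed earlier.

```python
from collections import Counter

# ASSUMPTION: integer ranks follow the order in which the README lists them.
RANK_NAMES = {
    1: "central core",
    2: "external core",
    3: "shell",
    4: "inner cloud",
    5: "surface cloud",
}


def classify_hashes(sketch_hashes, ranktable):
    """Count sketch hashes per rank; also count hashes absent from the table."""
    counts = Counter()
    missing = 0
    for hashval in sketch_hashes:
        rank = ranktable.get(hashval)
        if rank is None:
            missing += 1
        else:
            counts[rank] += 1
    return counts, missing


def percentages(counts):
    """Express each rank's count as a percentage of all classified hashes."""
    total = sum(counts.values())
    return {
        rank: (100.0 * n / total if total else 0.0)
        for rank, n in counts.items()
    }


# Counts taken from the example output above.
reported = Counter({1: 497, 2: 427, 3: 1791, 4: 1251, 5: 0})
for rank, pct in sorted(percentages(reported).items()):
    print(f"{reported[rank]} ({pct:.1f}%) hashes are classified as {RANK_NAMES[rank]}")
```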

Build a pangenome sketch without using lineages

(CTB: explain contents!)

The following command builds a pangenome sketch by combining all provided sketches. Here we use the sketches present in the gtdb-rs214-agatha-k21.zip file:

sourmash scripts pangenome_merge \
    gtdb-rs214-agatha-k21.zip \
    -o agatha-merged-2.sig.zip -k 21

The output file is agatha-merged-2.sig.zip, which is identical to agatha-merged.sig.zip (as can be checked with, e.g., sourmash compare).
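For intuition about what "identical" means here: ignoring abundances, two sketches with the same hash content have a Jaccard similarity of 1.0. The toy example below computes Jaccard similarity on plain Python sets. It is a conceptual sketch only, it does not call sourmash, and the `jaccard` helper and hash values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of hash values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # treat two empty sets as identical
    return len(a & b) / len(a | b)


# Toy hash values (made up). The first two sets have the same content.
merged_1 = {11, 22, 33, 44}
merged_2 = {44, 33, 22, 11}
other = {11, 22, 55}

print(jaccard(merged_1, merged_2))
print(jaccard(merged_1, other))
```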

Support

We suggest filing issues in the main sourmash issue tracker as that receives more attention (and is monitored by the same people anyway)!

Dev docs

sourmash_plugin_pangenomics is developed at https://github.com/sourmash-bio/sourmash_plugin_pangenomics.

Testing

The current tests are implemented as a Snakemake workflow in test_workflow/. To run them, execute the following command in the main directory:

make cleanrun

Generating a release

Bump version number in pyproject.toml and push.

Make a new release on github.

Then pull, and:

python -m build

followed by twine upload dist/....


Download files

Source Distribution

sourmash_plugin_pangenomics-0.3.2.tar.gz (9.6 kB)

Built Distribution

sourmash_plugin_pangenomics-0.3.2-py3-none-any.whl (10.1 kB, Python 3)

File hashes

Hashes for sourmash_plugin_pangenomics-0.3.2.tar.gz

Algorithm    Hash digest
SHA256       e6bf6d5aa33ab5dd10f00c476bf14898b2d2b9431866e5f2b06c61c4abc51f22
MD5          c1f620a018d94bc74e4653a5deb3c812
BLAKE2b-256  23ca7d4cf22f0a4dbdaa71b8cad4f9b81fa22de210cc65c3d8de661d53df5d5b

Hashes for sourmash_plugin_pangenomics-0.3.2-py3-none-any.whl

Algorithm    Hash digest
SHA256       54a59c0b543b33db990a2db57d653eddc969f98d746e46e41304cd289229a78c
MD5          58e0633e4cedb1f90d735f71d2ccbf19
BLAKE2b-256  2c0be7078c930f037906ed6ced700daa6b3b6ea69b7bd83ec5b3c0a398d96fd7
