sourmash plugin for repeat-robust mutation rate estimation (r_pp, r_pc, r_cc).

sourmash-plugin-repeat-robust-mutation-rate-estimators

sourmash is a tool for biological sequence analysis and comparisons.

This plugin implements repeat-robust substitution rate estimators r_pp, r_pc, and r_cc based on FracMinHash sketches, as described in:

Wu, H. and Medvedev, P. (2026). Repeat-robust estimation of substitution rates from k-mer sketches. bioRxiv. https://www.biorxiv.org/content/10.64898/2026.04.01.715966v1

Installation

Install sourmash, then install this plugin:

# Option 1:
conda install -c conda-forge -c bioconda sourmash
pip install sourmash-plugin-repeat-robust-mutation-rate-estimators

# Option 2:
pip install sourmash
pip install sourmash-plugin-repeat-robust-mutation-rate-estimators

Verify the plugin is recognized:

sourmash scripts

You should see sketch and mutation_rate listed under available plugin commands.

Usage

Background

The three estimators treat the two input sequences asymmetrically: string t is assumed to be a mutated copy of string s.

If unsure which is s and which is t, use the longer sequence as s.
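
One way to apply this rule of thumb is to compare total sequence lengths directly. The sketch below is a minimal helper and not part of the plugin: the function names `fasta_len` and `pick_s` are hypothetical, and it assumes uncompressed FASTA input with file names that contain no spaces.

```shell
# Total number of sequence characters in a FASTA file (header lines skipped).
fasta_len() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Print the longer of two FASTA files first (candidate s), then the other (candidate t).
pick_s() {
    if [ "$(fasta_len "$1")" -ge "$(fasta_len "$2")" ]; then
        printf '%s %s\n' "$1" "$2"
    else
        printf '%s %s\n' "$2" "$1"
    fi
}
```

For example, `pick_s a.fa b.fa` prints the file to treat as s followed by the file to treat as t.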

Each estimator requires a specific sketch mode:

Estimator   s sketch mode   t sketch mode
r_pp        standard        standard
r_pc        standard        multiplicity
r_cc        extended        multiplicity

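In general, estimators that use more information achieve higher accuracy: r_pp uses only distinct k-mer hashes, r_pc adds per-hash counts for t, and r_cc additionally uses a correction constant computed from s.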

Step 1: Sketch your sequences

# For r_pp
sourmash scripts sketch s.fa --sketch-mode standard    -o s.sig -k 21 --scaled 1000
sourmash scripts sketch t.fa --sketch-mode standard    -o t.sig -k 21 --scaled 1000

# For r_pc
sourmash scripts sketch s.fa --sketch-mode standard    -o s.sig -k 21 --scaled 1000
sourmash scripts sketch t.fa --sketch-mode multiplicity -o t.sig -k 21 --scaled 1000

# For r_cc
sourmash scripts sketch s.fa --sketch-mode extended    -o s.sig -k 21 --scaled 1000
sourmash scripts sketch t.fa --sketch-mode multiplicity -o t.sig -k 21 --scaled 1000

Sketch modes:

  • standard: stores distinct k-mer hashes and L, where L = |x| - k + 1 is the total number of k-mers in string x. Use as both s and t for r_pp, and as s for r_pc.
  • multiplicity: stores k-mer hashes with per-hash counts and L. Use as t for r_pc and r_cc.
  • extended: stores distinct k-mer hashes, L, and a precomputed correction constant sum_occ_h1. Use as s for r_cc. Note: computing sum_occ_h1 requires reading the full sequence and may take longer for large genomes.
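
As a quick sanity check on L, the arithmetic can be done in the shell. This is an illustrative sketch only: `kmer_total` is a hypothetical helper, the sequence length 4800020 is a made-up value chosen for the example, and the input is treated as a single string of that length.

```shell
# L = |x| - k + 1: number of k-mers (counted with repeats) in a string of length |x|.
# Strings shorter than k contain no k-mers.
kmer_total() {
    len=$1; k=$2
    if [ "$len" -lt "$k" ]; then
        echo 0
    else
        echo $(( len - k + 1 ))
    fi
}

kmer_total 4800020 21   # prints 4800000
```

Because FracMinHash keeps roughly one in every `scaled` distinct k-mer hashes, a sketch of such a sequence at --scaled 1000 would be expected to hold on the order of 4800 hashes, or fewer when many k-mers repeat.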

Step 2: Estimate mutation rate

# Run only the command whose estimator matches the sketch modes used in Step 1
sourmash scripts mutation_rate --estimator r_pp --s-sig s.sig --t-sig t.sig
sourmash scripts mutation_rate --estimator r_pc --s-sig s.sig --t-sig t.sig
sourmash scripts mutation_rate --estimator r_cc --s-sig s.sig --t-sig t.sig
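
The two steps can also be generated together for a single estimator. The following is a minimal dry-run sketch that uses only the flags shown above. `plan_estimate` is a hypothetical helper, not a plugin command, and the per-estimator signature file names are an assumption made here so that sketches built in different modes do not overwrite each other. It prints the commands instead of running them.

```shell
# Print the sketch and estimate commands for one estimator (dry run).
# Usage: plan_estimate ESTIMATOR S_FASTA T_FASTA [K] [SCALED]
plan_estimate() {
    est=$1; s_fa=$2; t_fa=$3; k=${4:-21}; scaled=${5:-1000}
    # Map each estimator to the sketch modes it requires for s and t.
    case "$est" in
        r_pp) s_mode=standard; t_mode=standard ;;
        r_pc) s_mode=standard; t_mode=multiplicity ;;
        r_cc) s_mode=extended; t_mode=multiplicity ;;
        *) echo "unknown estimator: $est" >&2; return 1 ;;
    esac
    # Hypothetical naming scheme: one pair of signature files per estimator.
    s_sig="s.$est.sig"; t_sig="t.$est.sig"
    echo "sourmash scripts sketch $s_fa --sketch-mode $s_mode -o $s_sig -k $k --scaled $scaled"
    echo "sourmash scripts sketch $t_fa --sketch-mode $t_mode -o $t_sig -k $k --scaled $scaled"
    echo "sourmash scripts mutation_rate --estimator $est --s-sig $s_sig --t-sig $t_sig"
}
```

Running `plan_estimate r_cc s.fa t.fa` lists the three commands for r_cc; once they look right, the output can be piped to `sh` to execute them.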

Example output:

Estimator : r_cc
k         : 21
scaled    : 1000
L_s       : 4800000
Estimated mutation rate : 0.012345
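
For scripting, the estimate can be pulled out of this report with standard tools. This is a minimal sketch that assumes the output keeps the `label : value` layout shown above; `extract_rate` is a hypothetical helper, not a plugin command.

```shell
# Read a mutation_rate report on stdin and print only the estimated rate.
extract_rate() {
    awk -F':' '/^Estimated mutation rate/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example using the report shown above:
printf '%s\n' \
    'Estimator : r_cc' \
    'k         : 21' \
    'scaled    : 1000' \
    'L_s       : 4800000' \
    'Estimated mutation rate : 0.012345' | extract_rate   # prints 0.012345
```

Whether the plugin writes its report to stdout or stderr is not stated here; if piping the real command into `extract_rate` yields nothing, redirecting with `2>&1` before the pipe is worth trying.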

Support

Please file issues at https://github.com/Wu-Haonan/sourmash-plugin-repeat-robust-mutation-rate-estimators/issues

Dev docs

sourmash-plugin-repeat-robust-mutation-rate-estimators is developed at https://github.com/Wu-Haonan/sourmash-plugin-repeat-robust-mutation-rate-estimators.

Citation

If you use this plugin, please cite:

Wu, H. and Medvedev, P. (2026). Repeat-robust estimation of substitution rates
from k-mer sketches. bioRxiv.
https://www.biorxiv.org/content/10.64898/2026.04.01.715966v1

License

MIT License. See LICENSE for details.
