🧬 ont-end-reason
Comprehensive CLI for Oxford Nanopore end_reason analysis.
Discover, tag, filter, analyse, and visualise read-termination patterns.
🚀 → Interactive dashboard & tutorials
Companion to the end-reason paper.
Table of contents
- Why this tool
- Install
- Quickstart
- The headline result
- CLI surface
- Python API
- End_reason taxonomy
- How the UMC posterior works
- Testing
- Performance
- Cross-run atlas
- Lab infrastructure integration
- Status / roadmap
- Citing
- License
Why this tool
Oxford Nanopore sequencers tag every read with an end_reason explaining
why sequencing stopped. A read can have high base quality (Q>20) and
still be truncated or rejected by adaptive sampling — filtering by
Q-score alone is not enough for accurate downstream analysis.
ont-end-reason unifies the eight published analyses from the end-reason
paper into a single PyPI-installable CLI, including the paper's novel
posterior length model for adaptive-sampling-truncated reads.
Before this tool, the analyses lived in scattered scripts inside
End_Reason_Manuscript/pipeline/bin/
(now archived). Every script was promoted to this repo with provenance
headers crediting commit b47166a of the source. The package is the
canonical implementation going forward.
Install
# Static figures only (matplotlib)
pip install ont-end-reason
# + Plotly for interactive HTML reports
pip install "ont-end-reason[interactive]"
# Development (from source)
git clone https://github.com/Single-Molecule-Sequencing/ont-end-reason.git
cd ont-end-reason
pip install -e ".[dev,interactive]"
Python 3.10+ required. Tested on Linux + macOS, Python 3.10 through 3.13.
Quickstart
Five commands cover the canonical pipeline:
# 1. Inventory what's in a sequencing-run directory
ont-end-reason discover /path/to/run --manifest run.json
# 2. Tag a BAM with end_reason from sequencing_summary.txt
ont-end-reason tag --summary sequencing_summary.txt \
--bam aligned.bam --out tagged.bam
# 3. Filter to complete reads only (signal_positive)
ont-end-reason filter --bam tagged.bam --keep SP --out complete.bam
# 4. Run the paper's central novel analysis
ont-end-reason analyze umc-posterior sequencing_summary.txt --plot umc.pdf
# 5. Build a self-contained 6-section HTML report
ont-end-reason report interactive sequencing_summary.txt --out report.html
→ Full walkthrough with live charts on the dashboard.
The headline result
On the synthetic 5000-read test fixture:
$ ont-end-reason analyze umc-posterior tests/fixtures/sequencing_summary_synthetic.txt
UMC reads: 600
Prior class: signal_positive (log μ=8.488, log σ=0.600)
Observed mean length: 926.2 bp
Posterior expected mean: 5868.1 bp
Posterior bonus mean: 4941.9 bp/read
Posterior bonus total: 2,965,111 bp ← ~3 Mb of unobserved sequence
Adaptive-sampling truncation hides ~5× more sequence than the observed read length suggests. Scaled to a real PromethION run with millions of UMC reads, the recovered-sequence estimate grows linearly. This is exactly what the paper's central analysis is for — and the tool surfaces it as one command on any sequencing_summary.txt.
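The figures in the output above are tied together by simple arithmetic. The short check below uses only the numbers printed there; the variable names are illustrative and not part of the tool.

```python
# Values copied from the command output above.
n_umc = 600
observed_mean = 926.2      # bp
posterior_mean = 5868.1    # bp
bonus_total = 2_965_111    # bp, as reported

# Per-read bonus is the posterior expected length minus the observed length.
bonus_mean = posterior_mean - observed_mean

# Ratio of hidden sequence to observed sequence, per read.
hidden_ratio = bonus_mean / observed_mean

# Total recomputed from the rounded per-read mean. It differs slightly from
# the reported total because the printed means are rounded to 0.1 bp.
approx_total = n_umc * bonus_mean

print(f"bonus mean:      {bonus_mean:.1f} bp/read")
print(f"hidden/observed: {hidden_ratio:.2f}x")
print(f"approx total:    {approx_total:,.0f} bp (reported {bonus_total:,})")
```

The ratio of roughly 5.3 is where the "~5×" figure comes from.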
CLI surface
Discovery + filter operations
| Command | Purpose |
|---|---|
| ont-end-reason discover <path> | Walk a directory, inventory POD5 / Fast5 / summary / BAM / FASTQ files |
| ont-end-reason tag | Add end_reason tag to BAM reads from sequencing_summary.txt |
| ont-end-reason filter | Keep / drop BAM reads by end_reason short code (parallel sharded, --threads N) |
| ont-end-reason export-fastq | Convert filtered BAM → FASTQ for NanoPack tools |
| ont-end-reason stats | Streaming QC summary from sequencing_summary.txt |
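Both tag and stats start from the per-read rows of sequencing_summary.txt. As a rough illustration of that first step, the sketch below builds a read_id → end_reason lookup with the standard library. It assumes the summary is tab-separated with read_id and end_reason columns; the helper name and the demo rows are made up for this example and are not the package's own code.

```python
import csv
import io
from collections import Counter


def load_end_reasons(handle):
    """Map read_id -> end_reason from a tab-separated summary file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["read_id"]: row["end_reason"] for row in reader}


# Tiny in-memory stand-in for sequencing_summary.txt (illustrative rows only).
demo = io.StringIO(
    "read_id\tsequence_length_template\tend_reason\n"
    "r1\t5120\tsignal_positive\n"
    "r2\t640\tunblock_mux_change\n"
    "r3\t4800\tsignal_positive\n"
)

end_reasons = load_end_reasons(demo)
counts = Counter(end_reasons.values())
print(counts)
```

A tagging step would then look up each BAM record's read name in this mapping; a streaming summary would update the counter row by row instead of holding the full dictionary.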
Analysis (9 subcommands)
| Command | What it does |
|---|---|
| ont-end-reason analyze distribution | Per-end_reason counts + OK/CHECK/FAIL quality gate |
| ont-end-reason analyze length | Length distributions per end_reason (N50, percentiles) |
| ont-end-reason analyze quality | Q-score distributions with Gaussian Mixture Model fit |
| ont-end-reason analyze temporal | End_reason rates over sequencing-run time |
| ont-end-reason analyze hypothesis | Mann-Whitney U / KS tests with Cliff's Δ effect size |
| ont-end-reason analyze umc-posterior ⭐ | Bayesian posterior on truncated UMC length (paper's central analysis) |
| ont-end-reason analyze signal-trace | Raw POD5 current trace extraction for a single read |
| ont-end-reason analyze sma-metrics | Optional bridge to the smaseq-qc package |
| ont-end-reason analyze tables | Generate summary/per-class/quality tables (TSV/CSV/md/LaTeX) |
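For readers unfamiliar with the effect size reported by analyze hypothesis, Cliff's Δ is the probability that a value from one group exceeds a value from the other, minus the reverse probability. The sketch below is a direct pairwise version in plain Python; the function name and the sample lengths are illustrative, and the package may well compute it differently (for example from the Mann-Whitney U statistic, since Δ = 2U/(n·m) − 1).

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-sample pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Made-up read lengths (bp) for two end_reason classes.
sp_lengths = [5200, 4800, 6100, 5500, 7000]
umc_lengths = [600, 900, 750, 1200, 5300]

delta = cliffs_delta(sp_lengths, umc_lengths)
print(f"Cliff's delta (SP vs UMC): {delta:+.2f}")
```

Δ ranges from −1 to +1; values near ±1 mean the two groups barely overlap. The pairwise loop is O(n·m), so it suits small samples only.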
Paper-figure reproducers + reports
| Command | Output |
|---|---|
| ont-end-reason figure fig3 <source> | Paper Figure 3: distribution bar chart |
| ont-end-reason figure fig5 <source> | Paper Figure 5: Q-score violins |
| ont-end-reason figure fig6 <source> | Paper Figure 6: UMC posterior diagram |
| ont-end-reason report interactive | 6-section self-contained HTML report with embedded Plotly |
| ont-end-reason report static | Paginated PDF report (v0.3.0 roadmap) |
Run ont-end-reason <cmd> --help for full flag documentation. Examples and screenshots: dashboard.
Python API
Every CLI subcommand has a public Python API equivalent. Functions return typed dataclasses so callers can compose, persist, or pipe results without re-parsing CLI output:
from ont_end_reason import discover, classify
from ont_end_reason.analyze.distribution import distribution
from ont_end_reason.analyze.umc_posterior import umc_posterior
from ont_end_reason.viz.static import plot_umc_posterior
# Discovery → Manifest
manifest = discover("/path/to/sequencing_run")
print(f"Found {manifest.total_files()} files")
# Analysis → typed result
result = umc_posterior("sequencing_summary.txt")
print(f"Posterior bonus total: {result.posterior_bonus_total:,.0f} bp")
# Visualisation → matplotlib Figure
fig = plot_umc_posterior(result)
fig.savefig("umc.pdf")
Each analysis result has a .to_dict() for JSON serialisation and roundtrip.
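The round trip can be pictured with a small stand-in dataclass. Everything below is hypothetical apart from the posterior_bonus_total field name, which appears in the example above: the class, its other fields, and from_dict are assumptions made for illustration, not the package's actual result types.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DemoResult:
    """Hypothetical stand-in for an analysis result; not the package's class."""

    n_reads: int
    observed_mean: float
    posterior_bonus_total: float

    def to_dict(self) -> dict:
        # asdict() recursively converts dataclass fields to plain Python types.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "DemoResult":
        return cls(**data)


result = DemoResult(n_reads=600, observed_mean=926.2, posterior_bonus_total=2965111.0)

# Serialise to JSON text, then rebuild an equal object from it.
payload = json.dumps(result.to_dict())
restored = DemoResult.from_dict(json.loads(payload))
assert restored == result
```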
End_reason taxonomy
The lab's canonical 7-class taxonomy. Print from the CLI any time with ont-end-reason codes:
| Code | Full name | Class | Action |
|---|---|---|---|
| SP | signal_positive | keep | Complete read; always keep |
| UMC | unblock_mux_change | truncated | Filter unless studying artifacts |
| MC | mux_change | truncated | Filter |
| DUMC | data_service_unblock_mux_change | truncated | Filter (software-triggered) |
| PART | partial | truncated | Filter |
| SN | signal_negative | failed | Always filter |
| UNK | unknown | unknown | Investigate distribution |
--keep SP is the canonical recommendation (Table 1 of end-reason-paper). Use --keep SP,UMC to retain truncated reads for artifact studies.
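The table translates directly into a lookup. The sketch below encodes it as a dictionary and shows how a --keep style selection could be applied; the helper names are hypothetical, and mapping unrecognised strings to UNK is an assumption of this sketch rather than documented behaviour.

```python
# Short code -> (full end_reason name, class), copied from the table above.
TAXONOMY = {
    "SP": ("signal_positive", "keep"),
    "UMC": ("unblock_mux_change", "truncated"),
    "MC": ("mux_change", "truncated"),
    "DUMC": ("data_service_unblock_mux_change", "truncated"),
    "PART": ("partial", "truncated"),
    "SN": ("signal_negative", "failed"),
    "UNK": ("unknown", "unknown"),
}

# Reverse index: full name -> short code.
NAME_TO_CODE = {name: code for code, (name, _cls) in TAXONOMY.items()}


def short_code(end_reason: str) -> str:
    """Return the short code for a full end_reason string (UNK if unseen)."""
    return NAME_TO_CODE.get(end_reason, "UNK")


def keep_read(end_reason: str, keep_codes: set) -> bool:
    """True if the read's short code is in the requested keep set."""
    return short_code(end_reason) in keep_codes


print(short_code("unblock_mux_change"))
print(keep_read("signal_positive", {"SP"}))
print(keep_read("unblock_mux_change", {"SP", "UMC"}))
```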
How the UMC posterior works
The paper's novel analytic contribution, in one paragraph:
Given an observed UMC read of length o, the molecule's true length L is unknown but at least o (it was truncated, not foreshortened). Fitting a lognormal prior L ~ Lognormal(μ, σ²) to signal_positive reads gives the prior on what completed reads look like; the posterior on a UMC read's true length is then the prior left-truncated at the observation:
P(L | L ≥ o) ∝ Lognormal(L; μ, σ²) · 𝟙[L ≥ o]
The truncated mean has a closed form in terms of the standard normal CDF Φ:
E[L | L ≥ o] = exp(μ + σ²/2) · Φ(σ − z) / (1 − F(o)) where z = (log o − μ)/σ and F is the CDF of the lognormal prior, so 1 − F(o) = Φ(−z)
Implementation: scipy.stats.lognorm, vectorised over all UMC reads. O(n).
Aggregated, this is the paper's headline "sequence lost to adaptive sampling" estimate — runnable on any sequencing_summary.txt with one command.
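The closed form can be tried out in a few lines. This is a minimal sketch, not the package's implementation: it swaps scipy.stats.lognorm for math.erfc so that it runs with the standard library alone, and the function names and observed lengths are illustrative. Only μ and σ are taken from the headline example.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF via erfc, which stays accurate in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def truncated_lognormal_mean(o: float, mu: float, sigma: float) -> float:
    """E[L | L >= o] for L ~ Lognormal(mu, sigma^2)."""
    z = (math.log(o) - mu) / sigma
    numerator = math.exp(mu + sigma**2 / 2.0) * std_normal_cdf(sigma - z)
    survival = std_normal_cdf(-z)  # 1 - F(o), written to avoid cancellation
    return numerator / survival


mu, sigma = 8.488, 0.600           # prior parameters from the headline example
observed = [300.0, 926.0, 4000.0]  # illustrative UMC lengths, not real data

expected = [truncated_lognormal_mean(o, mu, sigma) for o in observed]
bonus = [e - o for e, o in zip(expected, observed)]

for o, e, b in zip(observed, expected, bonus):
    print(f"observed {o:7.0f} bp -> expected {e:8.1f} bp (bonus {b:8.1f} bp)")
```

Summing the per-read bonus over all UMC reads gives the aggregate estimate. For very short observations the posterior mean approaches the prior mean exp(μ + σ²/2), because the truncation removes almost none of the prior mass.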
Testing
pytest # 175 tests, ~40s
pytest --cov=ont_end_reason # with coverage (currently 71%)
ruff check . # lint
mypy src/ont_end_reason # type-check
Coverage gate is 69% in CI, set slightly below the actual figure to absorb cross-platform variance.
Tests run against:
- Synthetic fixture (5000 reads, deterministic distributions) for every analysis
- Hypothesis property tests for the SP/UMC/MC taxonomy (round-trips, classification disjointness)
- CliRunner integration tests for every subcommand's --help and dispatch
- Real-data smoke against the AWG074 MinION run (1,571 tagged reads → 1,451 SP / 89 UMC / 31 SN)
Performance
The filter subcommand runs sequentially by default; --threads N engages a parallel
sharded path for inputs above ~50k reads:
- Shard boundaries are placed by virtual offset (bam.tell()) during a single sequential scan; workers seek() directly to their slice, avoiding the original port's O(N²/2) linear-skip pattern.
- Shard merge uses pysam.cat, which splices BGZF blocks without re-decompressing.
- Auto-fallback to sequential below MIN_READS_FOR_PARALLEL (50k reads): worker-pool setup outweighs the gain on small inputs.
- Bit-identical output: enforced by an integration test that compares kept-read sets across both paths on every CI run.
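The shard-boundary idea in the first bullet can be sketched without a BAM file. Below, a list of integers stands in for the virtual offsets that bam.tell() would report before each record; the function name and the offsets are made up for illustration, and the real planner may differ in its details.

```python
def plan_shards(record_offsets, shard_size):
    """Group records into shards in one pass.

    Returns (start_offset, n_records) pairs. A worker can seek straight to
    start_offset and read n_records, with no need to skip earlier records.
    """
    shards = []
    for index, offset in enumerate(record_offsets):
        if index % shard_size == 0:
            shards.append([offset, 0])  # open a new shard at this record
        shards[-1][1] += 1              # count the record in the current shard
    return [tuple(shard) for shard in shards]


# Stand-in offsets: record i "starts" at 100 * i. Real BGZF virtual offsets
# are opaque integers, but they are handed to seek() in the same way.
offsets = [100 * i for i in range(10)]
shards = plan_shards(offsets, shard_size=4)
print(shards)
```

Because every boundary is recorded during the same scan, the planning cost is linear in the number of records.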
Synthetic micro-bench (bench/bench_parallel_filter.py, ONT-shaped 2 kb reads,
8-core dev machine):
| n_reads | threads | shard_size | shards | seq_s | par_s | speedup |
|---|---|---|---|---|---|---|
| 20,000 | 4 | 2,000 | 2 | 0.02 | 0.05 | 0.45× (below threshold) |
| 100,000 | 4 | 12,500 | 9 | 0.78 | 0.72 | 1.08× |
| 300,000 | 4 | 37,500 | 9 | 2.14 | 1.91 | 1.12× |
Speedup is modest at this workload: pysam already pipelines BGZF decompression internally, so the worker pool parallelizes only tag lookup and writing. Larger gains are expected on multi-GB real ONT BAMs, where per-record CPU cost is higher.
Cross-run atlas
The analyze atlas subcommand answers a question single-run analysis can't:
"is THIS run normal relative to all the lab's prior runs (and the public ONT
community)?"
# Aggregate across the qc_baseline store + external peer cache
ont-end-reason analyze atlas --json atlas.json --plot atlas.png
# Stratify on different metadata dimensions
ont-end-reason analyze atlas --strata flowcell_type,basecaller_model
# Tighten outlier flagging (default z >= 2.0)
ont-end-reason analyze atlas --z-threshold 3.0
Data sources:
- Internal lab peers: auto-populated into ~/.ont-qc-baselines/ by every ont-end-reason analyze distribution invocation (see --baseline-store).
- External peers: public ONT datasets (GIAB, hereditary-cancer ONT Open Data) cached as Parquet fingerprints at ~/.ont-qc-baselines/external_peers/, refreshed by the lab's /ont-public-data skill.
Backfill: one-time scripts/atlas_backfill.py --dry-run lists every
registry experiment eligible for ingest; drop the --dry-run to run them all.
Output shape: AtlasResult JSON with per_stratum (mean/median/std/min/max
for all 5 end-reason metrics per stratum), outliers (runs with composite
anomaly_score = max(|z_i|) >= --z-threshold), and a human-readable
interpretation. Designed for paper figure regenerators, dashboard panels,
and batch QC gates.
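The composite score anomaly_score = max(|z_i|) can be illustrated with the standard library. The sketch below assumes each z_i is a run's metric standardised against its peers in the same stratum; the metric names, the peer values, and the choice of population standard deviation are assumptions of this example, not taken from the atlas spec.

```python
from statistics import mean, pstdev


def anomaly_score(run, peers):
    """Largest absolute z-score of a run's metrics against its peer runs."""
    z_scores = []
    for metric, value in run.items():
        peer_values = [peer[metric] for peer in peers]
        spread = pstdev(peer_values)
        if spread == 0:
            continue  # no spread among peers; skip rather than divide by zero
        z_scores.append(abs(value - mean(peer_values)) / spread)
    return max(z_scores, default=0.0)


# Hypothetical per-run metrics (fractions of reads per end_reason).
peers = [
    {"sp_fraction": 0.90, "umc_fraction": 0.05},
    {"sp_fraction": 0.92, "umc_fraction": 0.04},
    {"sp_fraction": 0.88, "umc_fraction": 0.06},
    {"sp_fraction": 0.90, "umc_fraction": 0.05},
]
typical_run = {"sp_fraction": 0.90, "umc_fraction": 0.05}
odd_run = {"sp_fraction": 0.80, "umc_fraction": 0.06}

z_threshold = 2.0  # default stated above
for name, run in [("typical", typical_run), ("odd", odd_run)]:
    score = anomaly_score(run, peers)
    print(f"{name}: score={score:.2f} outlier={score >= z_threshold}")
```

Taking the maximum means a single badly off metric is enough to flag a run, which suits a QC gate better than an average that could dilute it.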
Spec: docs/superpowers/specs/2026-05-12-end-reason-atlas-design.md
Lab infrastructure integration
ont-end-reason is part of the Single-Molecule-Sequencing org's analytic toolchain:
| Repo | How it integrates |
|---|---|
| end-reason-paper | Companion paper. Claim atoms (results.alignment_rate_filtered, results.snv_f1_filtered, etc.) pin to this tool for reproducibility. |
| ont-ecosystem | Lab Claude Code skills /end-reason and /end-reason-filter will become thin wrappers that pip install ont-end-reason (tracked in issue #6). |
| lab-onboarding | Bundled in the canonical lab-repo manifest. Cloned automatically by bash wsl/bootstrap.sh on every new lab device. |
| End_Reason_Manuscript | Archived. Each script in this repo carries a provenance header crediting commit b47166a of that source. |
| smaseq-qc | Optional dependency for analyze sma-metrics. Tool detects-and-skips when missing. |
Status / roadmap
Current: v0.2.0a1 (alpha)
- ✅ 9 analysis subcommands fully implemented
- ✅ Bayesian posterior model for UMC truncation (paper's central novel analysis)
- ✅ Interactive HTML reports with embedded Plotly
- ✅ 143 tests, CI matrix on Python 3.10–3.13 × Ubuntu/macOS
- ✅ Interactive dashboard with live examples
- 🚧 Reproducibility CI against end-reason-paper claim atoms (#4)
- 🚧 Parallel sharded BAM filtering (#5)
- 🚧 Lab-skill thin-wrap migration after PyPI release (#6)
- ⏳ conda-forge feedstock (post-v0.1.0 PyPI)
See CHANGELOG.md for per-release detail and open issues for roadmap items.
Citing
If you use ont-end-reason in published work, please cite the companion paper:
Athey BD et al. (in preparation). End reason filtering for accurate analysis
of Oxford Nanopore sequencing data. Single-Molecule-Sequencing Lab,
University of Michigan.
https://github.com/Single-Molecule-Sequencing/end-reason-paper
Machine-readable citation metadata is in CITATION.cff.
License
MIT — see LICENSE.
Built by the Athey Lab at the University of Michigan.
Dashboard · Issues · CHANGELOG · Design spec