Metrics for nucleotide-level DNA segmentation benchmarking
DNA Segmentation Benchmark
Diagnostic evaluation toolkit for nucleotide-level DNA segmentation models (gene finders like Augustus, Helixer, Tiberius, SegmentNT) against reference annotations (e.g., GENCODE).
Goes beyond standard precision/recall with an 8-type INDEL error taxonomy, boundary bias/reliability landscapes, strict intron chain plus per-transcript soft exon distributions, transcript match classification, junction error diagnosis, and state transition analysis -- metrics not available in gffcompare, Mikado, or EGASP.
pip install dna-segmentation-benchmark
Quick Start
From GFF/GTF files
from dna_segmentation_benchmark import (
    LabelConfig, EvalMetrics, benchmark_from_gff, compare_multiple_predictions,
)
label_config = LabelConfig(
    labels={0: "CDS", 2: "INTRON", 8: "NONCODING"},
    background_label=8,
    coding_label=0,
)
results = benchmark_from_gff(
    gt_path="ground_truth.gtf",
    pred_paths={"augustus": "predictions.gff"},
    label_config=label_config,
    classes=[0],
    metrics=[EvalMetrics.REGION_DISCOVERY, EvalMetrics.NUCLEOTIDE_CLASSIFICATION],
    exclude_features=["gene"],
)
figures = compare_multiple_predictions(
    per_method_benchmark_res=results,
    label_config=label_config,
    classes=[0],
    metrics_to_eval=[EvalMetrics.REGION_DISCOVERY, EvalMetrics.NUCLEOTIDE_CLASSIFICATION],
)
From label arrays
import numpy as np
from dna_segmentation_benchmark import (
    LabelConfig, EvalMetrics, benchmark_gt_vs_pred_multiple, compare_multiple_predictions,
)
label_config = LabelConfig(
    labels={0: "EXON", 2: "INTRON", 8: "NONCODING"},
    background_label=8,
    coding_label=0,
)
results = benchmark_gt_vs_pred_multiple(
    gt_labels=gt_arrays,      # list[np.ndarray]
    pred_labels=pred_arrays,  # list[np.ndarray]
    label_config=label_config,
    classes=[0],
    metrics=[
        EvalMetrics.INDEL,
        EvalMetrics.REGION_DISCOVERY,
        EvalMetrics.BOUNDARY_EXACTNESS,
        EvalMetrics.NUCLEOTIDE_CLASSIFICATION,
        EvalMetrics.STRUCTURAL_COHERENCE,
        EvalMetrics.DIAGNOSTIC_DEPTH,
    ],
)
CLI
dna-benchmark run \
    --gt ground_truth.gtf \
    --pred augustus:predictions.gff \
    --config label_config.yaml \
    --classes 0 \
    --exclude-features gene \
    --output results.json
Metrics
Seven metric groups, each answering a distinct question about prediction quality:
| Group | Question |
|---|---|
| NUCLEOTIDE_CLASSIFICATION | Per-base, how accurate is it? |
| REGION_DISCOVERY | Did we find the right regions? |
| BOUNDARY_EXACTNESS | How precise are the boundaries? |
| INDEL | What structural errors exist? |
| FRAMESHIFT | Is the reading frame preserved? |
| STRUCTURAL_COHERENCE | Is the overall segment arrangement correct? |
| DIAGNOSTIC_DEPTH | Why is the prediction structurally wrong? |
Nucleotide Classification
Per-base TP/TN/FP/FN with precision, recall, and F1. The most basic metric -- treats each position independently.
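As an illustration of the per-base counting described above, here is a minimal sketch in plain Python. The function name `per_base_scores` is hypothetical and not part of the package API; it only shows how the four counts and the derived scores relate for one target class.

```python
def per_base_scores(gt, pred, target):
    """Count per-base outcomes for one class (hypothetical helper, not the package API)."""
    tp = fp = fn = tn = 0
    for g, p in zip(gt, pred):
        if p == target and g == target:
            tp += 1
        elif p == target:
            fp += 1
        elif g == target:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy label arrays using the README's label ids (0 = coding, 2 = intron, 8 = background).
gt = [8, 0, 0, 0, 2, 2, 0, 0, 8]
pred = [8, 8, 0, 0, 0, 2, 0, 0, 0]
scores = per_base_scores(gt, pred, target=0)
print(scores)
```

Because every position is scored independently, a prediction can reach a high F1 here while still fragmenting or merging sections, which is what the remaining metric groups are meant to expose.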
Region Discovery (4-level Precision/Recall)
Evaluates section matching at increasing strictness using 1:1 greedy matching by overlap length:
| Level | TP condition | What it forgives |
|---|---|---|
| neighborhood_hit | Any overlap | Over- and under-prediction |
| internal_hit | Prediction inside GT | Over-prediction |
| full_coverage_hit | Prediction covers GT | Under-prediction |
| perfect_boundary_hit | Exact match (sweep-based) | Nothing |
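The matching and the four TP conditions can be sketched as follows. This is an illustrative approximation, not the package's implementation: sections are assumed to be half-open `(start, end)` tuples, `greedy_match` and `hit_levels` are hypothetical names, and the exact-match level uses plain tuple equality rather than the sweep mentioned in the table.

```python
def overlap(a, b):
    """Length of the overlap between two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def greedy_match(gt_sections, pred_sections):
    """Pair sections 1:1, accepting the largest overlaps first (hypothetical helper)."""
    candidates = sorted(
        ((overlap(g, p), gi, pi)
         for gi, g in enumerate(gt_sections)
         for pi, p in enumerate(pred_sections)
         if overlap(g, p) > 0),
        key=lambda c: -c[0],
    )
    used_gt, used_pred, pairs = set(), set(), []
    for _, gi, pi in candidates:
        if gi not in used_gt and pi not in used_pred:
            used_gt.add(gi)
            used_pred.add(pi)
            pairs.append((gt_sections[gi], pred_sections[pi]))
    return pairs


def hit_levels(gt, pred):
    """Evaluate the four TP conditions from the table for one matched pair."""
    return {
        "neighborhood_hit": overlap(gt, pred) > 0,
        "internal_hit": gt[0] <= pred[0] and pred[1] <= gt[1],
        "full_coverage_hit": pred[0] <= gt[0] and gt[1] <= pred[1],
        "perfect_boundary_hit": gt == pred,
    }


gt_sections = [(10, 20), (30, 40), (50, 60)]
pred_sections = [(12, 18), (28, 45), (50, 60)]
pairs = greedy_match(gt_sections, pred_sections)
for gt_sec, pred_sec in pairs:
    print(gt_sec, pred_sec, hit_levels(gt_sec, pred_sec))
```

In this toy input each GT section passes a different combination of levels, which is why reporting precision and recall per level is more informative than a single overlap threshold.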
Boundary Exactness
How precise are predicted boundaries? Includes IoU distributions and two diagnostic matrices:
- Bias matrix (21x21): Signed boundary residuals revealing systematic directional errors (e.g., "predictions consistently start 2bp early")
- Reliability matrix (11x11): Cumulative recall at tolerances 0--10 bp, showing how quickly recall degrades as boundary tolerance tightens
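To make the two matrices concrete, the sketch below builds both from matched `(gt, pred)` section pairs. Several details are assumptions rather than documented behaviour: residuals are taken as `pred - gt` (so a negative start residual means the prediction starts early), rows index the start boundary and columns the end boundary, residuals beyond ±10 bp are clipped into the outer bins, and recall is divided by a caller-supplied GT section count `n_gt`. The name `boundary_matrices` is hypothetical.

```python
def boundary_matrices(pairs, n_gt, max_tol=10):
    """Build a signed bias histogram and a cumulative reliability grid (illustrative only)."""
    size = 2 * max_tol + 1
    bias = [[0] * size for _ in range(size)]
    reliability = [[0.0] * (max_tol + 1) for _ in range(max_tol + 1)]

    # Signed residuals: (start offset, end offset), prediction minus ground truth.
    residuals = [(p[0] - g[0], p[1] - g[1]) for g, p in pairs]

    for d_start, d_end in residuals:
        row = min(max(d_start, -max_tol), max_tol) + max_tol
        col = min(max(d_end, -max_tol), max_tol) + max_tol
        bias[row][col] += 1

    for tol_start in range(max_tol + 1):
        for tol_end in range(max_tol + 1):
            hits = sum(1 for ds, de in residuals
                       if abs(ds) <= tol_start and abs(de) <= tol_end)
            reliability[tol_start][tol_end] = hits / n_gt if n_gt else 0.0
    return bias, reliability


pairs = [((100, 200), (98, 200)), ((300, 400), (298, 403)), ((500, 600), (500, 600))]
bias, reliability = boundary_matrices(pairs, n_gt=4)
print(reliability[0][0], reliability[2][0], reliability[2][3])
```

Here two of the three matched predictions start 2 bp early, so the bias counts concentrate in the row for a start residual of -2, while the reliability grid shows recall rising from 0.25 at zero tolerance to 0.75 once 2 bp (start) and 3 bp (end) are allowed.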
INDEL Error Taxonomy
Classifies every contiguous mismatch region into one of 8 structural error types:
| Insertions (pred has class, GT does not) | Deletions (GT has class, pred does not) |
|---|---|
| 5' extension | 5' deletion |
| 3' extension | 3' deletion |
| Joined (merges two GT sections) | Split (splits one GT section) |
| Whole insertion (new section) | Whole deletion (missing section) |
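One way to read the table is as a rule on what flanks each mismatch run. The sketch below is an assumption-laden illustration, not the package's classifier: it works on binary per-class masks, treats the left side as 5' (forward strand), and decides the error type from whether the bases immediately left and right of the run are correctly predicted class bases. The name `classify_indel_runs` is hypothetical.

```python
def classify_indel_runs(gt, pred):
    """Label each contiguous mismatch run in two 0/1 masks (illustrative sketch)."""
    n = len(gt)

    def is_correct_class_base(k):
        # True when position k exists and both tracks carry the class there.
        return 0 <= k < n and gt[k] == 1 and pred[k] == 1

    errors = []
    i = 0
    while i < n:
        if gt[i] == pred[i]:
            i += 1
            continue
        insertion = pred[i] == 1  # prediction has the class, GT does not
        j = i
        while j < n and gt[j] != pred[j] and (pred[j] == 1) == insertion:
            j += 1
        left = is_correct_class_base(i - 1)
        right = is_correct_class_base(j)
        if left and right:
            kind = "joined" if insertion else "split"
        elif right:
            kind = "5' extension" if insertion else "5' deletion"
        elif left:
            kind = "3' extension" if insertion else "3' deletion"
        else:
            kind = "whole insertion" if insertion else "whole deletion"
        errors.append((kind, i, j))  # half-open run [i, j)
        i = j
    return errors


gt = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
pred = [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
errors = classify_indel_runs(gt, pred)
print(errors)
```

A run flanked on both sides by correct class bases bridges or breaks a section (joined or split), a run flanked on one side extends or trims a boundary, and an unflanked run is an entirely new or entirely missing section.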
Structural Coherence
Evaluates the predicted segment chain as a whole -- not per-section, but as a complete ordered arrangement.
Intron Chain Comparison (strict, gffcompare-style)
intron_chain emits per-sequence tp/fp/fn ∈ {0, 1}: a sequence counts as TP only if the entire set of GT introns equals the set of predicted introns. Aggregated across sequences this becomes the familiar corpus precision/recall — directly comparable to gffcompare's intron-chain P/R.
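The all-or-nothing rule and its corpus-level aggregation can be sketched as below. How the package assigns `fp` and `fn` to a non-matching sequence is not stated here, so this sketch assumes `fp = 1` when the prediction has any introns and `fn = 1` when the GT has any; the function names are hypothetical.

```python
def intron_chain_outcome(gt_introns, pred_introns):
    """Per-sequence tp/fp/fn in {0, 1} under a strict set-equality rule (sketch)."""
    gt_set, pred_set = set(gt_introns), set(pred_introns)
    if gt_set and gt_set == pred_set:
        return {"tp": 1, "fp": 0, "fn": 0}
    # Assumption: a non-matching chain counts once against precision and once against recall.
    return {"tp": 0, "fp": int(bool(pred_set)), "fn": int(bool(gt_set))}


def corpus_precision_recall(outcomes):
    """Sum the per-sequence counts and derive corpus-level precision and recall."""
    tp = sum(o["tp"] for o in outcomes)
    fp = sum(o["fp"] for o in outcomes)
    fn = sum(o["fn"] for o in outcomes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


outcomes = [
    intron_chain_outcome([(20, 40), (60, 80)], [(20, 40), (60, 80)]),  # exact chain
    intron_chain_outcome([(20, 40), (60, 80)], [(20, 40), (60, 81)]),  # one intron off by 1 bp
    intron_chain_outcome([(20, 40)], []),                              # nothing predicted
]
precision, recall = corpus_precision_recall(outcomes)
print(precision, recall)
```

Note that the second sequence, with a single 1 bp error, is scored exactly like a completely wrong chain.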
Per-transcript Soft Exon Metrics
The binary intron_chain metric hides "nearly right" predictions — a transcript with 9 of 10 exons correct scores the same as one with 0 correct. Two complementary per-transcript scalars surface this gradation and are kept as raw per-sequence lists so plotting can draw the distribution across transcripts:
- exon_recall_per_transcript: fraction in [0, 1] of GT exons whose (start, end) was recovered exactly. A transcript with 9/10 exons right scores 0.9. Transcripts with zero GT exons are excluded.
- hallucinated_exon_count_per_transcript: integer ≥ 0, the number of predicted exons whose (start, end) is absent from GT. Captures the precision side without conflating it with boundary errors.
Rendered as two overlaid histograms; a fat left tail of recall combined with a fat right tail of hallucinations flags a model that guesses rather than recovering true structure.
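Both scalars reduce to set operations on exact `(start, end)` tuples. The sketch below shows the arithmetic for a single transcript; the helper names are hypothetical, and returning `None` for a transcript without GT exons is just one way to represent the exclusion described above.

```python
def exon_recall(gt_exons, pred_exons):
    """Fraction of GT exons reproduced exactly; None when there are no GT exons."""
    gt_set = set(gt_exons)
    if not gt_set:
        return None
    return len(gt_set & set(pred_exons)) / len(gt_set)


def hallucinated_exon_count(gt_exons, pred_exons):
    """Number of predicted exons whose exact (start, end) does not occur in GT."""
    return len(set(pred_exons) - set(gt_exons))


# Ten GT exons; the prediction recovers nine and shifts the start of the last one.
gt_exons = [(i * 100, i * 100 + 50) for i in range(10)]
pred_exons = gt_exons[:9] + [(905, 950)]

recall = exon_recall(gt_exons, pred_exons)
hallucinated = hallucinated_exon_count(gt_exons, pred_exons)
print(recall, hallucinated)
```

This transcript would score zero under the strict intron-chain rule, yet it has an exon recall of 0.9 and a single hallucinated exon.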
Transcript Match Classification
Holistic structural classification of each (GT, prediction) pair into one of 6 categories:
| Class | Condition |
|---|---|
| exact | Identical segment chains |
| boundary_shift | Same segment count, shifted boundaries |
| missing_segments | Prediction is ordered subset of GT (segments skipped) |
| extra_segments | GT is ordered subset of prediction (segments inserted) |
| structurally_different | None of the above |
| missed | No prediction for this class |
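The table can be read as a decision cascade. The sketch below is one possible ordering of the checks, not necessarily the one the package uses; chains are assumed to be lists of `(start, end)` tuples, any same-length but non-identical chain is treated as `boundary_shift`, and the function names are hypothetical.

```python
def is_ordered_subset(short, long):
    """True if every segment of `short` appears in `long` in the same order."""
    remaining = iter(long)
    return all(seg in remaining for seg in short)


def classify_transcript(gt_chain, pred_chain):
    """Assign one of the six classes to a (GT, prediction) pair (illustrative sketch)."""
    if not pred_chain:
        return "missed"
    if pred_chain == gt_chain:
        return "exact"
    if len(pred_chain) == len(gt_chain):
        # Assumption: same count with any boundary difference counts as a shift.
        return "boundary_shift"
    if is_ordered_subset(pred_chain, gt_chain):
        return "missing_segments"
    if is_ordered_subset(gt_chain, pred_chain):
        return "extra_segments"
    return "structurally_different"


gt_chain = [(0, 10), (20, 30), (40, 50)]
examples = {
    "exact": [(0, 10), (20, 30), (40, 50)],
    "boundary_shift": [(0, 10), (20, 32), (40, 50)],
    "missing_segments": [(0, 10), (40, 50)],
    "extra_segments": [(0, 10), (20, 30), (40, 50), (60, 70)],
    "structurally_different": [(0, 10), (22, 30)],
    "missed": [],
}
for expected, pred_chain in examples.items():
    print(expected, classify_transcript(gt_chain, pred_chain))
```

A stricter variant might require each shifted segment to overlap its GT counterpart before calling it a boundary shift.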
Segment Count Delta
Difference between the predicted and GT segment counts: positive values indicate over-segmentation, negative values under-segmentation.
Diagnostic Depth
Causal diagnosis of structural errors -- answering why the prediction is wrong, not just that it is wrong.
Junction Error Taxonomy
| Error type | Description |
|---|---|
| Exon skip | GT segments merged (intervening segment absent) |
| Segment retention | GT segment absorbed by neighbours |
| Novel insertion | Extra segment splits a GT segment |
| Cascade shift | Boundary error propagates across 3+ segments |
| Compensating errors | Paired errors that cancel out |
Position Bias
Match rate stratified by genomic position (5' / interior / 3'), revealing whether errors concentrate at sequence ends.
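A rough sketch of the stratification is shown below. It assumes forward-strand chains of `(start, end)` tuples, counts a GT segment as matched only when the exact tuple is predicted, and places single-segment transcripts in the 5' bucket; all of these are illustrative choices, and `position_match_rates` is a hypothetical name.

```python
def position_match_rates(transcripts):
    """Exact-match rate of GT segments, bucketed by position in the chain (sketch)."""
    hits = {"5'": 0, "interior": 0, "3'": 0}
    totals = {"5'": 0, "interior": 0, "3'": 0}
    for gt_chain, pred_chain in transcripts:
        predicted = set(pred_chain)
        last = len(gt_chain) - 1
        for idx, segment in enumerate(gt_chain):
            if idx == 0:
                bucket = "5'"
            elif idx == last:
                bucket = "3'"
            else:
                bucket = "interior"
            totals[bucket] += 1
            hits[bucket] += segment in predicted
    return {b: (hits[b] / totals[b] if totals[b] else None) for b in totals}


transcripts = [
    ([(0, 10), (20, 30), (40, 50)], [(2, 10), (20, 30), (40, 50)]),
    ([(0, 10), (20, 30), (40, 50), (60, 70)], [(0, 10), (20, 30), (40, 50), (60, 75)]),
]
rates = position_match_rates(transcripts)
print(rates)
```

In this toy input every interior segment is recovered while half of the terminal segments are not, the pattern this metric is designed to reveal.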
Frameshift
Reading frame deviation (mod-3) between GT and predicted coding exons. Only valid on single-transcript sequences with coding_label configured.
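One way to picture a mod-3 frame deviation is to track how many coding bases each track has emitted so far and compare the counts wherever both are coding. This is a sketch of that idea under the single-transcript assumption stated above, not the package's exact definition; `frame_deviation` is a hypothetical name.

```python
def frame_deviation(gt, pred, coding_label=0):
    """Frame offset (0, 1 or 2) at each position where both tracks are coding (sketch)."""
    gt_count = pred_count = 0
    deviations = []
    for g, p in zip(gt, pred):
        g_coding = g == coding_label
        p_coding = p == coding_label
        if g_coding and p_coding:
            # Difference in the number of coding bases seen so far, modulo 3.
            deviations.append((pred_count - gt_count) % 3)
        gt_count += g_coding
        pred_count += p_coding
    return deviations


# The prediction extends the first exon by one base, shifting the frame of the second exon.
gt = [0, 0, 0, 2, 2, 0, 0, 0]
pred = [0, 0, 0, 0, 2, 0, 0, 0]
deviations = frame_deviation(gt, pred)
in_frame_fraction = deviations.count(0) / len(deviations)
print(deviations, in_frame_fraction)
```

A single extra coding base leaves every downstream shared coding position out of frame, even though the per-base error is only one nucleotide.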
State Transitions (always computed)
Two analyses run on every benchmark call:
- GT Transition Confusion Matrices: At every position where GT changes label, what did the predictor do? One heatmap per source label.
- False Transition Analysis: At positions where GT is stable (no label change), did the predictor introduce a spurious transition?
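Both analyses can be sketched as a single pass over adjacent label pairs. The data layout below (counters keyed by label tuples) is an assumption made for illustration, and `transition_analysis` is a hypothetical name rather than the package's function.

```python
from collections import Counter


def transition_analysis(gt, pred):
    """Tally predictor behaviour at GT transitions and at stable GT positions (sketch)."""
    gt_transitions = Counter()     # ((gt_from, gt_to), (pred_from, pred_to)) -> count
    false_transitions = Counter()  # (stable_gt_label, (pred_from, pred_to)) -> count
    for i in range(1, len(gt)):
        gt_step = (gt[i - 1], gt[i])
        pred_step = (pred[i - 1], pred[i])
        if gt_step[0] != gt_step[1]:
            gt_transitions[(gt_step, pred_step)] += 1
        elif pred_step[0] != pred_step[1]:
            false_transitions[(gt[i], pred_step)] += 1
    return gt_transitions, false_transitions


gt = [8, 8, 0, 0, 0, 2, 2, 0, 0, 8]
pred = [8, 8, 0, 0, 2, 2, 2, 0, 0, 0]
gt_transitions, false_transitions = transition_analysis(gt, pred)
print(dict(gt_transitions))
print(dict(false_transitions))
```

Grouping the first counter by `gt_from` would give one confusion matrix per source label; the second counter shows that this prediction leaves the coding state one base too early.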
Label Configuration
All metrics are label-agnostic. Define your own token mapping:
from dna_segmentation_benchmark import LabelConfig
config = LabelConfig(
    labels={0: "EXON", 1: "DONOR", 2: "INTRON", 3: "ACCEPTOR", 8: "NONCODING"},
    background_label=8,
    coding_label=0,           # Required for FRAMESHIFT
    splice_donor_label=1,     # Reserved for future splice metrics
    splice_acceptor_label=3,  # Reserved for future splice metrics
    intron_label=2,
)
A pre-built config for the BEND benchmark is available as BEND_LABEL_CONFIG.
W&B Integration
Log metrics during training and full diagnostic reports after:
from dna_segmentation_benchmark import init_wandb_with_presets, log_benchmark_scalars, log_benchmark_full
run = init_wandb_with_presets("my-project", "run-name", label_config, classes=[0])
# During training -- lightweight scalar logging per epoch
log_benchmark_scalars(val_results, label_config, step=epoch, method_prefix="val")
# After training -- full report with figures
log_benchmark_full({"my_model": final_results}, figures, label_config)
Install with: pip install dna-segmentation-benchmark[wandb]
Examples
See the examples/ folder:
- GTF programmatic example -- end-to-end GFF/GTF evaluation
- Array benchmark example -- starting from numpy label arrays
- W&B training loop -- integration with Weights & Biases
Documentation
- Metrics Reference -- complete documentation of every metric with formulas and aggregation details
- Design Rationale -- architectural decisions and comparison with gffcompare, Mikado, EGASP
Updating README Plots
The plots in this README are auto-generated from synthetic data. To refresh them:
python scripts/generate_readme_plots.py
This writes PNGs to docs/images/ which are referenced by the README.