Metrics for nucleotide-level DNA segmentation benchmarking

DNA Segmentation Benchmark

Diagnostic evaluation toolkit for nucleotide-level DNA segmentation models (gene finders like Augustus, Helixer, Tiberius, SegmentNT) against reference annotations (e.g., GENCODE).

Goes beyond standard precision/recall with an 8-type INDEL error taxonomy, boundary bias/reliability landscapes, gap chain analysis (label-agnostic intron chain comparison), transcript match classification, junction error diagnosis, and state transition analysis -- metrics not available in gffcompare, Mikado, or EGASP.

pip install dna-segmentation-benchmark

Quick Start

From GFF/GTF files

from dna_segmentation_benchmark import (
    LabelConfig, EvalMetrics, benchmark_from_gff, compare_multiple_predictions,
)

label_config = LabelConfig(
    labels={0: "CDS", 2: "INTRON", 8: "NONCODING"},
    background_label=8,
    coding_label=0,
)

results = benchmark_from_gff(
    gt_path="ground_truth.gtf",
    pred_paths={"augustus": "predictions.gff"},
    label_config=label_config,
    classes=[0],
    metrics=[EvalMetrics.REGION_DISCOVERY, EvalMetrics.NUCLEOTIDE_CLASSIFICATION],
    exclude_features=["gene"],
)

figures = compare_multiple_predictions(
    per_method_benchmark_res=results,
    label_config=label_config,
    classes=[0],
    metrics_to_eval=[EvalMetrics.REGION_DISCOVERY, EvalMetrics.NUCLEOTIDE_CLASSIFICATION],
)

From label arrays

import numpy as np
from dna_segmentation_benchmark import (
    LabelConfig, EvalMetrics, benchmark_gt_vs_pred_multiple, compare_multiple_predictions,
)

label_config = LabelConfig(
    labels={0: "EXON", 2: "INTRON", 8: "NONCODING"},
    background_label=8,
    coding_label=0,
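)

# Illustrative placeholder inputs: one integer label array per sequence,
# with GT and prediction arrays of equal length
gt_arrays = [np.array([8, 8, 0, 0, 0, 2, 2, 0, 0, 8])]
pred_arrays = [np.array([8, 0, 0, 0, 0, 2, 0, 0, 0, 8])]

results = benchmark_gt_vs_pred_multiple(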
    gt_labels=gt_arrays,       # list[np.ndarray]
    pred_labels=pred_arrays,   # list[np.ndarray]
    label_config=label_config,
    classes=[0],
    metrics=[
        EvalMetrics.INDEL,
        EvalMetrics.REGION_DISCOVERY,
        EvalMetrics.BOUNDARY_EXACTNESS,
        EvalMetrics.NUCLEOTIDE_CLASSIFICATION,
        EvalMetrics.STRUCTURAL_COHERENCE,
        EvalMetrics.DIAGNOSTIC_DEPTH,
    ],
)

CLI

dna-benchmark run \
    --gt ground_truth.gtf \
    --pred augustus:predictions.gff \
    --config label_config.yaml \
    --classes 0 \
    --exclude-features gene \
    --output results.json

Metrics

Seven metric groups, each answering a distinct question about prediction quality:

Group	Question
`NUCLEOTIDE_CLASSIFICATION`	Per-base, how accurate is it?
`REGION_DISCOVERY`	Did we find the right regions?
`BOUNDARY_EXACTNESS`	How precise are the boundaries?
`INDEL`	What structural errors exist?
`FRAMESHIFT`	Is the reading frame preserved?
`STRUCTURAL_COHERENCE`	Is the overall segment arrangement correct?
`DIAGNOSTIC_DEPTH`	Why is the prediction structurally wrong?

Nucleotide Classification

Per-base TP/TN/FP/FN with precision, recall, and F1. The most basic metric -- treats each position independently.
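
As a rough illustration of the per-base counts described above, the following sketch computes them for one class using plain Python lists in place of label arrays. The function name `per_base_scores` is hypothetical and is not part of the package API; it only shows the arithmetic.

```python
def per_base_scores(gt, pred, cls):
    """Per-base confusion counts plus precision, recall, and F1 for one class."""
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same length")
    pairs = list(zip(gt, pred))
    tp = sum(g == cls and p == cls for g, p in pairs)
    fp = sum(g != cls and p == cls for g, p in pairs)
    fn = sum(g == cls and p != cls for g, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy sequence: 0 = EXON, 2 = INTRON, 8 = NONCODING
gt = [8, 8, 0, 0, 0, 2, 2, 0, 0, 8]
pred = [8, 0, 0, 0, 0, 2, 0, 0, 0, 8]
scores = per_base_scores(gt, pred, cls=0)
print(scores)  # tp=5, fp=2, fn=0, tn=3
```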

[Figure: Nucleotide classification]

Region Discovery (4-level Precision/Recall)

Matches predicted sections to ground-truth (GT) sections 1:1 (greedy, by overlap length), then scores each match at four levels of strictness:

Level	TP condition	What it forgives
`neighborhood_hit`	Any overlap	Over- and under-prediction
`internal_hit`	Prediction inside GT	Over-prediction
`full_coverage_hit`	Prediction covers GT	Under-prediction
`perfect_boundary_hit`	Exact match (sweep-based)	Nothing
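
The four levels in the table can be sketched as follows, assuming half-open `(start, end)` intervals. All names here (`greedy_match`, `region_discovery`, `LEVELS`) are hypothetical, and the exact-match check is a direct tuple comparison rather than the sweep-based procedure the package mentions, so treat this as a reading of the table, not the library's implementation.

```python
def overlap(a, b):
    """Overlap length of two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def greedy_match(gt, pred):
    """1:1 matching: repeatedly take the remaining pair with the largest overlap."""
    candidates = []
    for gi, g in enumerate(gt):
        for pi, p in enumerate(pred):
            ov = overlap(g, p)
            if ov > 0:
                candidates.append((ov, gi, pi))
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_gt, used_pred, matches = set(), set(), []
    for _, gi, pi in candidates:
        if gi not in used_gt and pi not in used_pred:
            used_gt.add(gi)
            used_pred.add(pi)
            matches.append((gt[gi], pred[pi]))
    return matches


# TP condition per level, applied to a matched (gt, pred) pair
LEVELS = {
    "neighborhood_hit": lambda g, p: True,
    "internal_hit": lambda g, p: g[0] <= p[0] and p[1] <= g[1],
    "full_coverage_hit": lambda g, p: p[0] <= g[0] and g[1] <= p[1],
    "perfect_boundary_hit": lambda g, p: g == p,
}


def region_discovery(gt, pred):
    matches = greedy_match(gt, pred)
    out = {}
    for name, is_hit in LEVELS.items():
        tp = sum(is_hit(g, p) for g, p in matches)
        out[name] = {
            "tp": tp,
            "precision": tp / len(pred) if pred else 0.0,
            "recall": tp / len(gt) if gt else 0.0,
        }
    return out


gt = [(10, 20), (30, 40), (50, 60)]
pred = [(10, 20), (32, 38), (48, 62), (70, 80)]
res = region_discovery(gt, pred)
for level, vals in res.items():
    print(level, vals)
```

In this toy case all three GT sections are overlapped, but only the first prediction is an exact match, so recall drops from 1.0 at `neighborhood_hit` to 1/3 at `perfect_boundary_hit`.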

[Figure: Region discovery - neighborhood]

[Figure: Region discovery - perfect boundary]

Boundary Exactness

How precise are predicted boundaries? Includes IoU distributions and two diagnostic matrices:

Bias matrix (21x21): Signed boundary residuals revealing systematic directional errors (e.g., "predictions consistently start 2bp early")
Reliability matrix (11x11): Cumulative recall at tolerances 0--10 bp, showing how quickly recall degrades as boundary tolerance tightens
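
One plausible construction of the two matrices is sketched below. It assumes that rows index the start boundary and columns the end boundary, that residuals are `pred - gt`, and that residuals outside [-10, 10] are clipped into the outermost bins. None of these choices is confirmed by the text above, and the function names are hypothetical.

```python
MAX_SHIFT = 10  # residuals in [-10, 10] give 21 bins; tolerances 0..10 give 11


def iou(g, p):
    """Intersection over union of two half-open intervals."""
    inter = max(0, min(g[1], p[1]) - max(g[0], p[0]))
    union = (g[1] - g[0]) + (p[1] - p[0]) - inter
    return inter / union if union else 0.0


def clip(x):
    return max(-MAX_SHIFT, min(MAX_SHIFT, x))


def bias_matrix(pairs):
    """Count matched pairs by (start residual, end residual)."""
    size = 2 * MAX_SHIFT + 1
    m = [[0] * size for _ in range(size)]
    for g, p in pairs:
        ds, de = clip(p[0] - g[0]), clip(p[1] - g[1])
        m[ds + MAX_SHIFT][de + MAX_SHIFT] += 1
    return m


def reliability_matrix(pairs, n_gt):
    """Recall when start/end may deviate by at most ts/te bases."""
    size = MAX_SHIFT + 1
    if n_gt == 0:
        return [[0.0] * size for _ in range(size)]
    return [
        [
            sum(abs(p[0] - g[0]) <= ts and abs(p[1] - g[1]) <= te
                for g, p in pairs) / n_gt
            for te in range(size)
        ]
        for ts in range(size)
    ]


# Matched (gt, pred) pairs: exact, start 2 bp early, end 3 bp late
pairs = [((10, 20), (10, 20)), ((30, 40), (28, 40)), ((50, 60), (50, 63))]
bias = bias_matrix(pairs)
rel = reliability_matrix(pairs, n_gt=3)
print(rel[0][0], rel[2][3])
print(iou((30, 40), (28, 40)))
```

Reading the reliability matrix from the bottom-right corner towards `rel[0][0]` shows how recall falls as the tolerance tightens: here it goes from 1.0 at tolerances (2, 3) to 1/3 when both boundaries must be exact.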

[Figure: IoU average]

[Figure: IoU distribution]

INDEL Error Taxonomy

Classifies every contiguous mismatch region into one of 8 structural error types:

Insertions (pred has class, GT does not)	Deletions (GT has class, pred does not)
5' extension	5' deletion
3' extension	3' deletion
Joined (merges two GT sections)	Split (splits one GT section)
Whole insertion (new section)	Whole deletion (missing section)
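
The taxonomy can be read as a decision on what flanks each mismatch run. The sketch below is a minimal interpretation under two assumptions: the sequence is on the forward strand (so 5' is the left side), and a run "touches" a section when the neighbouring base has the class in both GT and prediction. The helper names are hypothetical.

```python
def mismatch_runs(gt, pred, cls):
    """Return maximal (kind, start, end) runs; end is exclusive."""
    def kind_at(i):
        g, p = gt[i] == cls, pred[i] == cls
        if p and not g:
            return "insertion"
        if g and not p:
            return "deletion"
        return None

    runs, i, n = [], 0, len(gt)
    while i < n:
        kind = kind_at(i)
        if kind is None:
            i += 1
            continue
        j = i
        while j < n and kind_at(j) == kind:
            j += 1
        runs.append((kind, i, j))
        i = j
    return runs


# (kind, agreed class on the left, agreed class on the right) -> error type
NAMES = {
    ("insertion", False, True): "5' extension",
    ("insertion", True, False): "3' extension",
    ("insertion", True, True): "joined",
    ("insertion", False, False): "whole insertion",
    ("deletion", False, True): "5' deletion",
    ("deletion", True, False): "3' deletion",
    ("deletion", True, True): "split",
    ("deletion", False, False): "whole deletion",
}


def classify_indels(gt, pred, cls):
    n = len(gt)

    def agree(i):
        return 0 <= i < n and gt[i] == cls and pred[i] == cls

    return [
        (NAMES[(kind, agree(s - 1), agree(e))], s, e)
        for kind, s, e in mismatch_runs(gt, pred, cls)
    ]


gt = [8, 0, 0, 8, 0, 0, 8, 0, 0, 8, 8, 8]
pred = [0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 0, 8]
errors = classify_indels(gt, pred, cls=0)
for err in errors:
    print(err)
```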

[Figure: INDEL error counts]

[Figure: INDEL error lengths]

Structural Coherence

Evaluates the predicted segment chain as a whole -- not per-section, but as a complete ordered arrangement.

Gap Chain Comparison

Compares ordered gaps between consecutive segments. For exons, gaps = introns -- making this the label-agnostic equivalent of intron chain comparison (gffcompare's key metric).

gap_chain_match_rate: Fraction of sequences with identical gap chains
gap_count_match_rate: Fraction with the same number of gaps
gap_chain_lcs_ratio: LCS-based partial credit (0--1), ordering-aware
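
A minimal sketch of the three gap chain metrics follows, assuming half-open `(start, end)` segments per sequence. The normalisation of the LCS ratio by the longer chain is an assumption (the text only states that the score lies in 0--1), and the function names are hypothetical.

```python
def gap_chain(segments):
    """Ordered gaps between consecutive segments (introns, for exons)."""
    return [(a[1], b[0]) for a, b in zip(segments, segments[1:])]


def lcs_len(a, b):
    """Length of the longest common subsequence (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def lcs_ratio(gt_gaps, pred_gaps):
    longest = max(len(gt_gaps), len(pred_gaps))
    return 1.0 if longest == 0 else lcs_len(gt_gaps, pred_gaps) / longest


def gap_chain_metrics(sequences):
    """sequences: list of (gt_segments, pred_segments), one pair per sequence."""
    chains = [(gap_chain(g), gap_chain(p)) for g, p in sequences]
    n = len(chains)
    return {
        "gap_chain_match_rate": sum(g == p for g, p in chains) / n,
        "gap_count_match_rate": sum(len(g) == len(p) for g, p in chains) / n,
        "gap_chain_lcs_ratio": sum(lcs_ratio(g, p) for g, p in chains) / n,
    }


sequences = [
    # third predicted exon ends 2 bp early, changing one of three gaps
    ([(0, 10), (20, 30), (40, 50), (60, 70)],
     [(2, 10), (20, 30), (40, 48), (60, 70)]),
    # identical gap chain
    ([(0, 5), (9, 12)], [(0, 5), (9, 12)]),
]
metrics = gap_chain_metrics(sequences)
print(metrics)
```

Note that the shifted start of the first exon in the first sequence does not affect the gap chain, since only the boundaries facing a gap are compared.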

[Figure: Gap chain metrics]

Transcript Match Classification

Holistic structural classification of each (GT, prediction) pair into one of 6 categories:

Class	Condition
`exact`	Identical segment chains
`boundary_shift`	Same segment count, shifted boundaries
`missing_segments`	Prediction is ordered subset of GT (segments skipped)
`extra_segments`	GT is ordered subset of prediction (segments inserted)
`structurally_different`	None of the above
`missed`	No prediction for this class
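
The six conditions can be expressed as a short decision function. This sketch assumes half-open `(start, end)` segments, reads "shifted boundaries" as "every i-th predicted segment overlaps the i-th GT segment", and treats two empty chains as `exact`; those details and the function names are assumptions rather than documented behaviour.

```python
def is_subsequence(a, b):
    """True if a can be obtained from b by deleting elements (order kept)."""
    it = iter(b)
    return all(x in it for x in a)


def overlaps(g, p):
    return min(g[1], p[1]) > max(g[0], p[0])


def classify_transcript(gt, pred):
    if gt == pred:
        return "exact"
    if not pred:
        return "missed"
    if len(pred) < len(gt) and is_subsequence(pred, gt):
        return "missing_segments"
    if len(pred) > len(gt) and is_subsequence(gt, pred):
        return "extra_segments"
    if len(pred) == len(gt) and all(overlaps(g, p) for g, p in zip(gt, pred)):
        return "boundary_shift"
    return "structurally_different"


gt = [(0, 10), (20, 30), (40, 50)]
cases = {
    "same": gt,
    "shifted": [(0, 10), (22, 30), (40, 52)],
    "skipped": [(0, 10), (40, 50)],
    "inserted": gt + [(60, 70)],
    "merged": [(0, 50)],
    "empty": [],
}
labels = {name: classify_transcript(gt, pred) for name, pred in cases.items()}
for name, label in labels.items():
    print(name, "->", label)
```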

[Figure: Transcript match classification]

Segment Count Delta

Signed difference in segment count between prediction and GT: positive values indicate over-segmentation, negative values under-segmentation.

[Figure: Segment count delta]
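
For a single sequence, the delta reduces to counting contiguous runs of the class in each label array. The sketch below assumes the delta is computed as predicted minus GT count, which matches the sign convention above; `segments` and `segment_count_delta` are hypothetical helper names.

```python
def segments(labels, cls):
    """Half-open (start, end) runs where labels == cls."""
    segs, start = [], None
    for i, lab in enumerate(labels):
        if lab == cls and start is None:
            start = i
        elif lab != cls and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs


def segment_count_delta(gt, pred, cls):
    """Positive: prediction has more segments than GT; negative: fewer."""
    return len(segments(pred, cls)) - len(segments(gt, cls))


gt = [8, 0, 0, 0, 0, 0, 8, 0, 0, 8]
pred = [8, 0, 0, 8, 0, 0, 8, 0, 0, 8]  # first GT exon is split in two
delta = segment_count_delta(gt, pred, cls=0)
print(delta)  # 1
```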

Diagnostic Depth

Causal diagnosis of structural errors -- answering why the prediction is wrong, not just that it is wrong.

Junction Error Taxonomy

Error type	Description
Exon skip	GT segments merged (intervening segment absent)
Segment retention	GT segment absorbed by neighbours
Novel insertion	Extra segment splits a GT segment
Cascade shift	Boundary error propagates across 3+ segments
Compensating errors	Paired errors that cancel out

[Figure: Junction error taxonomy]

Position Bias

Match rate stratified by genomic position (5' / interior / 3'), revealing whether errors concentrate at sequence ends.
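
A rough sketch of the stratification, assuming "match" means an exact segment match, that the sequence is on the forward strand, and that the first and last GT segments form the 5' and 3' strata. These choices and the function name `position_bias` are illustrative assumptions.

```python
def position_bias(gt_segs, pred_segs):
    """Exact-match rate of GT segments, grouped by position in the chain."""
    pred_set = set(pred_segs)
    hits = {"5'": [], "interior": [], "3'": []}
    last = len(gt_segs) - 1
    for i, seg in enumerate(gt_segs):
        if i == 0:
            key = "5'"
        elif i == last:
            key = "3'"
        else:
            key = "interior"
        hits[key].append(seg in pred_set)
    # None marks a stratum with no GT segments
    return {k: (sum(v) / len(v) if v else None) for k, v in hits.items()}


gt_segs = [(0, 10), (20, 30), (40, 50), (60, 70)]
pred_segs = [(2, 10), (20, 30), (40, 50), (60, 70)]  # only the first exon is off
bias = position_bias(gt_segs, pred_segs)
print(bias)
```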

[Figure: Position bias]

Frameshift

Reading frame deviation (mod-3) between GT and predicted coding exons. Only valid on single-transcript sequences with coding_label configured.
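
One way to picture a mod-3 frame deviation, offered as an assumption rather than the package's definition: count the coding bases seen so far in GT and in the prediction, and at every base that is coding in both, compare the two counts modulo 3. The function name `frame_deviation` is hypothetical.

```python
def frame_deviation(gt, pred, coding_label):
    """Per shared coding base: (coding bases so far in pred - in GT) mod 3."""
    gt_count = pred_count = 0
    deviations = []
    for g, p in zip(gt, pred):
        g_coding, p_coding = g == coding_label, p == coding_label
        if g_coding and p_coding:
            deviations.append((pred_count - gt_count) % 3)
        gt_count += g_coding
        pred_count += p_coding
    return deviations


# Prediction extends the first exon by one base, shifting the second exon's frame
gt = [0, 0, 0, 2, 2, 0, 0, 0]
pred = [0, 0, 0, 0, 2, 0, 0, 0]
dev = frame_deviation(gt, pred, coding_label=0)
in_frame = sum(d == 0 for d in dev) / len(dev)
print(dev, in_frame)
```

A deviation of 0 means the predicted base is read in the same frame as in GT; a one-base extension upstream leaves every later shared coding base with a deviation of 1.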

State Transitions (always computed)

Two analyses run on every benchmark call:

GT Transition Confusion Matrices: At every position where GT changes label, what did the predictor do? One heatmap per source label.
False Transition Analysis: At positions where GT is stable (no label change), did the predictor introduce a spurious transition?
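
Both analyses can be sketched in a single pass over adjacent positions. The layout below (one counter per GT source label, keyed by the GT target label and the predicted label at the same position) is an assumed reading of "one heatmap per source label", and `transition_analysis` is a hypothetical name.

```python
from collections import Counter, defaultdict


def transition_analysis(gt, pred):
    # source label -> Counter[(gt_target, pred_label_at_that_position)]
    confusion = defaultdict(Counter)
    stable = false_transitions = 0
    for i in range(1, len(gt)):
        if gt[i] != gt[i - 1]:
            # GT changes label here: record what the predictor emitted
            confusion[gt[i - 1]][(gt[i], pred[i])] += 1
        else:
            # GT is stable: a change in the prediction is spurious
            stable += 1
            false_transitions += pred[i] != pred[i - 1]
    rate = false_transitions / stable if stable else 0.0
    return confusion, false_transitions, rate


gt = [8, 8, 0, 0, 0, 2, 2, 0, 0, 8]
pred = [8, 0, 0, 0, 0, 2, 0, 0, 0, 8]
conf, n_false, false_rate = transition_analysis(gt, pred)
for source, counts in conf.items():
    print(source, dict(counts))
print(n_false, false_rate)
```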

Label Configuration

All metrics are label-agnostic. Define your own token mapping:

from dna_segmentation_benchmark import LabelConfig

config = LabelConfig(
    labels={0: "EXON", 1: "DONOR", 2: "INTRON", 3: "ACCEPTOR", 8: "NONCODING"},
    background_label=8,
    coding_label=0,           # Required for FRAMESHIFT
    splice_donor_label=1,     # Reserved for future splice metrics
    splice_acceptor_label=3,  # Reserved for future splice metrics
    intron_label=2,
)

A pre-built config for the BEND benchmark is available as BEND_LABEL_CONFIG.

W&B Integration

Log metrics during training and full diagnostic reports after:

from dna_segmentation_benchmark import init_wandb_with_presets, log_benchmark_scalars, log_benchmark_full

run = init_wandb_with_presets("my-project", "run-name", label_config, classes=[0])

# During training -- lightweight scalar logging per epoch
log_benchmark_scalars(val_results, label_config, step=epoch, method_prefix="val")

# After training -- full report with figures
log_benchmark_full({"my_model": final_results}, figures, label_config)

Install with: pip install "dna-segmentation-benchmark[wandb]" (the quotes stop shells such as zsh from interpreting the square brackets)

Examples

See the examples/ folder:

GTF programmatic example -- end-to-end GFF/GTF evaluation
Array benchmark example -- starting from numpy label arrays
W&B training loop -- integration with Weights & Biases

Documentation

Metrics Reference -- complete documentation of every metric with formulas and aggregation details
Design Rationale -- architectural decisions and comparison with gffcompare, Mikado, EGASP

Updating README Plots

The plots in this README are auto-generated from synthetic data. To refresh them:

python scripts/generate_readme_plots.py

This writes PNGs to docs/images/ which are referenced by the README.

Release history

0.1.2: May 4, 2026
0.1.1: Mar 24, 2026
0.1.0: Mar 24, 2026 (this version)
0.0.4: Oct 7, 2025
0.0.3: Jun 16, 2025
0.0.2: May 20, 2025
0.0.1: May 20, 2025

