Metrics for nucleotide-level DNA segmentation benchmarking

DNA Segmentation Benchmark

A diagnostic toolkit for evaluating nucleotide-level DNA segmentation models (gene finders such as Augustus, Helixer, Tiberius, and SegmentNT) against reference annotations (e.g., GENCODE).

Goes beyond standard precision/recall with an 8-type INDEL error taxonomy, boundary bias/reliability landscapes, gap chain analysis (label-agnostic intron chain comparison), transcript match classification, junction error diagnosis, and state transition analysis -- metrics not available in gffcompare, Mikado, or EGASP.

pip install dna-segmentation-benchmark

Quick Start

From GFF/GTF files

from dna_segmentation_benchmark import (
    LabelConfig, EvalMetrics, benchmark_from_gff, compare_multiple_predictions,
)

label_config = LabelConfig(
    labels={0: "CDS", 2: "INTRON", 8: "NONCODING"},
    background_label=8,
    coding_label=0,
)

results = benchmark_from_gff(
    gt_path="ground_truth.gtf",
    pred_paths={"augustus": "predictions.gff"},
    label_config=label_config,
    classes=[0],
    metrics=[EvalMetrics.REGION_DISCOVERY, EvalMetrics.NUCLEOTIDE_CLASSIFICATION],
    exclude_features=["gene"],
)

figures = compare_multiple_predictions(
    per_method_benchmark_res=results,
    label_config=label_config,
    classes=[0],
    metrics_to_eval=[EvalMetrics.REGION_DISCOVERY, EvalMetrics.NUCLEOTIDE_CLASSIFICATION],
)

From label arrays

import numpy as np
from dna_segmentation_benchmark import (
    LabelConfig, EvalMetrics, benchmark_gt_vs_pred_multiple, compare_multiple_predictions,
)

label_config = LabelConfig(
    labels={0: "EXON", 2: "INTRON", 8: "NONCODING"},
    background_label=8,
    coding_label=0,
)

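# Toy inputs for illustration only; replace with your own per-sequence label arrays.
gt_arrays = [np.array([8, 8, 0, 0, 0, 2, 2, 0, 0, 8])]
pred_arrays = [np.array([8, 8, 0, 0, 2, 2, 2, 0, 0, 8])]

results = benchmark_gt_vs_pred_multiple(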
    gt_labels=gt_arrays,       # list[np.ndarray]
    pred_labels=pred_arrays,   # list[np.ndarray]
    label_config=label_config,
    classes=[0],
    metrics=[
        EvalMetrics.INDEL,
        EvalMetrics.REGION_DISCOVERY,
        EvalMetrics.BOUNDARY_EXACTNESS,
        EvalMetrics.NUCLEOTIDE_CLASSIFICATION,
        EvalMetrics.STRUCTURAL_COHERENCE,
        EvalMetrics.DIAGNOSTIC_DEPTH,
    ],
)

CLI

dna-benchmark run \
    --gt ground_truth.gtf \
    --pred augustus:predictions.gff \
    --config label_config.yaml \
    --classes 0 \
    --exclude-features gene \
    --output results.json

Metrics

Seven metric groups, each answering a distinct question about prediction quality:

Group | Question
NUCLEOTIDE_CLASSIFICATION | Per-base, how accurate is it?
REGION_DISCOVERY | Did we find the right regions?
BOUNDARY_EXACTNESS | How precise are the boundaries?
INDEL | What structural errors exist?
FRAMESHIFT | Is the reading frame preserved?
STRUCTURAL_COHERENCE | Is the overall segment arrangement correct?
DIAGNOSTIC_DEPTH | Why is the prediction structurally wrong?

Nucleotide Classification

Per-base TP/TN/FP/FN with precision, recall, and F1. The most basic metric -- treats each position independently.

[Figure: Nucleotide classification]
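As a rough illustration of the per-base counts described above, here is a minimal NumPy sketch. The function name `per_base_metrics` is hypothetical and is not part of the package API; it only shows how TP/FP/FN/TN and the derived scores could be computed for one class.

```python
import numpy as np

def per_base_metrics(gt, pred, cls):
    """Hypothetical helper: per-base confusion counts and scores for one class."""
    g = np.asarray(gt) == cls      # positions where GT carries the class
    p = np.asarray(pred) == cls    # positions where the prediction carries it
    tp = int(np.sum(g & p))
    fp = int(np.sum(~g & p))
    fn = int(np.sum(g & ~p))
    tn = int(np.sum(~g & ~p))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Toy example: the predicted exon is shifted one base to the right.
metrics = per_base_metrics([8, 0, 0, 0, 8, 8], [8, 8, 0, 0, 0, 8], cls=0)
```

Because each position is scored independently, a one-base shift costs only one FP and one FN here, which is why the stricter region-level metrics are needed as well.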


Region Discovery (4-level Precision/Recall)

Evaluates section matching at increasing strictness using 1:1 greedy matching by overlap length:

Level | TP condition | What it forgives
neighborhood_hit | Any overlap | Over- and under-prediction
internal_hit | Prediction inside GT | Under-prediction
full_coverage_hit | Prediction covers GT | Over-prediction
perfect_boundary_hit | Exact match (sweep-based) | Nothing

[Figure: Region discovery - neighborhood]

[Figure: Region discovery - perfect boundary]
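The matching levels can be sketched on half-open `(start, end)` intervals as follows. This is an illustrative approximation, not the package's implementation: `overlap`, `greedy_match`, and `hit_levels` are hypothetical names, and the exact-match test uses plain equality rather than the sweep-based check mentioned above.

```python
def overlap(a, b):
    """Overlap length of two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def greedy_match(gt, pred):
    """1:1 greedy matching: take candidate pairs in order of decreasing overlap."""
    candidates = sorted(
        ((overlap(g, p), gi, pi)
         for gi, g in enumerate(gt)
         for pi, p in enumerate(pred)),
        reverse=True,
    )
    used_gt, used_pred, matches = set(), set(), []
    for ov, gi, pi in candidates:
        if ov > 0 and gi not in used_gt and pi not in used_pred:
            used_gt.add(gi)
            used_pred.add(pi)
            matches.append((gi, pi))
    return matches

def hit_levels(g, p):
    """Evaluate the four TP conditions for one matched (GT, prediction) pair."""
    return {
        "neighborhood_hit": overlap(g, p) > 0,
        "internal_hit": g[0] <= p[0] and p[1] <= g[1],       # prediction inside GT
        "full_coverage_hit": p[0] <= g[0] and g[1] <= p[1],  # prediction covers GT
        "perfect_boundary_hit": g == p,
    }

gt = [(0, 10), (20, 30)]
pred = [(2, 8), (18, 32)]
levels = {pair: hit_levels(gt[pair[0]], pred[pair[1]]) for pair in greedy_match(gt, pred)}
```

In this toy input the first prediction sits inside its GT section and the second one covers its GT section, so each pair passes a different intermediate level while neither is a perfect boundary hit.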


Boundary Exactness

How precise are predicted boundaries? Includes IoU distributions and two diagnostic matrices:

  • Bias matrix (21x21): Signed boundary residuals revealing systematic directional errors (e.g., "predictions consistently start 2bp early")
  • Reliability matrix (11x11): Cumulative recall at tolerances 0--10 bp, showing how quickly recall degrades as boundary tolerance tightens

[Figure: IoU average]

[Figure: IoU distribution]
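To make the two matrices more concrete, here is a hedged sketch. It assumes, without confirmation from the package, that the 21x21 bias matrix bins signed start and end residuals in the range -10..+10 bp, and that the 11x11 reliability matrix holds recall at every combination of start and end tolerance from 0 to 10 bp. All function names are hypothetical.

```python
import numpy as np

def iou(g, p):
    """Intersection over union of two half-open (start, end) intervals."""
    inter = max(0, min(g[1], p[1]) - max(g[0], p[0]))
    union = (g[1] - g[0]) + (p[1] - p[0]) - inter
    return inter / union if union else 0.0

def bias_matrix(pairs, max_res=10):
    """Assumed layout: rows = start residual, cols = end residual (pred - GT)."""
    size = 2 * max_res + 1
    m = np.zeros((size, size), dtype=int)
    for g, p in pairs:
        d_start, d_end = p[0] - g[0], p[1] - g[1]
        if abs(d_start) <= max_res and abs(d_end) <= max_res:
            m[d_start + max_res, d_end + max_res] += 1
    return m

def reliability_matrix(pairs, n_gt, max_tol=10):
    """Assumed layout: entry [ts, te] = recall with start tol ts and end tol te."""
    m = np.zeros((max_tol + 1, max_tol + 1))
    for ts in range(max_tol + 1):
        for te in range(max_tol + 1):
            hits = sum(abs(p[0] - g[0]) <= ts and abs(p[1] - g[1]) <= te
                       for g, p in pairs)
            m[ts, te] = hits / n_gt
    return m

# Matched (GT, prediction) pairs: one starts 2 bp early, one ends 5 bp late.
pairs = [((100, 200), (98, 200)), ((300, 400), (300, 405))]
bias = bias_matrix(pairs)
reliability = reliability_matrix(pairs, n_gt=2)
```

Under these assumptions, a cluster of counts away from the centre cell of `bias` would indicate a systematic directional error, and `reliability[0, 0]` is the exact-boundary recall.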


INDEL Error Taxonomy

Classifies every contiguous mismatch region into one of 8 structural error types:

Insertions (pred has class, GT does not) | Deletions (GT has class, pred does not)
5' extension | 5' deletion
3' extension | 3' deletion
Joined (merges two GT sections) | Split (splits one GT section)
Whole insertion (new section) | Whole deletion (missing section)

[Figure: INDEL error counts]

[Figure: INDEL error lengths]
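One way to picture the taxonomy is to look at what flanks each mismatch run. The sketch below is an assumption-laden illustration, not the package's algorithm: it assumes forward-strand coordinates (5' is the left side) and decides the type from whether the bases immediately left and right of the run are correctly predicted class positions. The function name `classify_indels` is hypothetical.

```python
import numpy as np

def classify_indels(gt, pred, cls):
    """Label each contiguous insertion/deletion run for one class (forward strand)."""
    g = np.asarray(gt) == cls
    p = np.asarray(pred) == cls
    agree = g & p                            # class present in both GT and prediction
    diff = p.astype(int) - g.astype(int)     # +1 insertion, -1 deletion, 0 agreement
    names = {
        (1, True, True): "joined",           (-1, True, True): "split",
        (1, False, True): "5' extension",    (-1, False, True): "5' deletion",
        (1, True, False): "3' extension",    (-1, True, False): "3' deletion",
        (1, False, False): "whole insertion", (-1, False, False): "whole deletion",
    }
    events, i, n = [], 0, len(diff)
    while i < n:
        if diff[i] == 0:
            i += 1
            continue
        j = i
        while j < n and diff[j] == diff[i]:  # extend the run of the same error sign
            j += 1
        left = bool(i > 0 and agree[i - 1])  # correct class base just before the run
        right = bool(j < n and agree[j])     # correct class base just after the run
        events.append((names[(int(diff[i]), left, right)], i, j))
        i = j
    return events

gt = [8, 0, 0, 0, 8, 8, 0, 0, 0, 8]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 8, 8]
events = classify_indels(gt, pred, cls=0)
# -> [("5' extension", 0, 1), ("joined", 4, 6), ("3' deletion", 8, 9)]
```

Each event is reported as `(type, start, end)` with a half-open range, so counts and lengths per type can be aggregated directly from the list.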


Structural Coherence

Evaluates the predicted segment chain as a whole -- not per-section, but as a complete ordered arrangement.

Gap Chain Comparison

Compares ordered gaps between consecutive segments. For exons, gaps = introns -- making this the label-agnostic equivalent of intron chain comparison (gffcompare's key metric).

  • gap_chain_match_rate: Fraction of sequences with identical gap chains
  • gap_count_match_rate: Fraction with the same number of gaps
  • gap_chain_lcs_ratio: LCS-based partial credit (0--1), ordering-aware

[Figure: Gap chain metrics]
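The three gap chain metrics can be sketched as below. This is a minimal illustration under stated assumptions: segments are half-open `(start, end)` tuples, a gap is the pair `(end of segment i, start of segment i+1)`, and the LCS ratio is normalised by the longer of the two chains. The normalisation and the helper names are assumptions, not taken from the package.

```python
def gap_chain(segments):
    """Ordered gaps between consecutive segments, as (gap_start, gap_end) tuples."""
    segs = sorted(segments)
    return [(a[1], b[0]) for a, b in zip(segs, segs[1:])]

def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def gap_chain_metrics(pairs):
    """Aggregate over (gt_segments, pred_segments) pairs, one pair per sequence."""
    exact = same_count = 0
    lcs_sum = 0.0
    for gt_segs, pred_segs in pairs:
        g, p = gap_chain(gt_segs), gap_chain(pred_segs)
        exact += g == p
        same_count += len(g) == len(p)
        longest = max(len(g), len(p))
        lcs_sum += lcs_len(g, p) / longest if longest else 1.0  # two empty chains agree
    n = len(pairs)
    return {
        "gap_chain_match_rate": exact / n,
        "gap_count_match_rate": same_count / n,
        "gap_chain_lcs_ratio": lcs_sum / n,
    }

gt = [(0, 10), (20, 30), (40, 50)]
shifted = [(0, 10), (20, 30), (45, 50)]   # second gap ends 5 bp late
result = gap_chain_metrics([(gt, gt), (gt, shifted)])
```

Note that a prediction such as `[(5, 10), (20, 30), (40, 55)]` has the same gap chain as `gt`, because only the outer boundaries differ; this mirrors how intron chain comparison ignores transcript ends.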

Transcript Match Classification

Holistic structural classification of each (GT, prediction) pair into one of 6 categories:

Class | Condition
exact | Identical segment chains
boundary_shift | Same segment count, shifted boundaries
missing_segments | Prediction is ordered subset of GT (segments skipped)
extra_segments | GT is ordered subset of prediction (segments inserted)
structurally_different | None of the above
missed | No prediction for this class

[Figure: Transcript match classification]
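The six categories can be read as a decision cascade. The sketch below is one plausible ordering of the checks, assuming segments are compared by exact `(start, end)` equality and that any non-identical pair with equal segment counts counts as a boundary shift. The function names are hypothetical and the package may apply additional conditions.

```python
def is_ordered_subset(small, big):
    """True if `small` appears in `big` in the same order (a subsequence)."""
    it = iter(big)
    return all(seg in it for seg in small)  # `in` consumes the iterator, preserving order

def classify_match(gt, pred):
    """Assign one (GT, prediction) segment-chain pair to a match class."""
    if not pred:
        return "missed"
    if gt == pred:
        return "exact"
    if len(gt) == len(pred):
        return "boundary_shift"
    if len(pred) < len(gt) and is_ordered_subset(pred, gt):
        return "missing_segments"
    if len(gt) < len(pred) and is_ordered_subset(gt, pred):
        return "extra_segments"
    return "structurally_different"

gt = [(0, 10), (20, 30), (40, 50)]
examples = {
    "exact": classify_match(gt, [(0, 10), (20, 30), (40, 50)]),
    "shift": classify_match(gt, [(0, 10), (22, 30), (40, 50)]),
    "skip": classify_match(gt, [(0, 10), (40, 50)]),
    "insert": classify_match(gt, [(0, 10), (20, 30), (32, 36), (40, 50)]),
    "none": classify_match(gt, []),
}
```

Because the classes are mutually exclusive, the per-class counts over a dataset sum to the number of evaluated pairs.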

Segment Count Delta

Difference between the predicted and GT segment counts: positive values indicate over-segmentation, negative values under-segmentation.

[Figure: Segment count delta]


Diagnostic Depth

Causal diagnosis of structural errors -- answering why the prediction is wrong, not just that it is wrong.

Junction Error Taxonomy

Error type | Description
Exon skip | GT segments merged (intervening segment absent)
Segment retention | GT segment absorbed by neighbours
Novel insertion | Extra segment splits a GT segment
Cascade shift | Boundary error propagates across 3+ segments
Compensating errors | Paired errors that cancel out

[Figure: Junction error taxonomy]

Position Bias

Match rate stratified by genomic position (5' / interior / 3'), revealing whether errors concentrate at sequence ends.

[Figure: Position bias]
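A simple way to picture this stratification is sketched below. It assumes, for illustration only, that the first GT segment is the 5' bucket, the last is the 3' bucket, everything else is interior, and that a segment counts as matched when an identical `(start, end)` interval exists in the prediction. The package's actual binning and match criterion may differ, and `position_bias` here is a hypothetical name.

```python
def position_bias(gt_segments, pred_segments):
    """Exact-match rate of GT segments, stratified by position in the chain."""
    pred_set = set(pred_segments)
    ordered = sorted(gt_segments)
    last = len(ordered) - 1
    counts = {"5'": [0, 0], "interior": [0, 0], "3'": [0, 0]}  # [matched, total]
    for i, seg in enumerate(ordered):
        bucket = "5'" if i == 0 else "3'" if i == last else "interior"
        counts[bucket][0] += seg in pred_set
        counts[bucket][1] += 1
    # None marks a bucket with no segments, so it is not confused with a 0% rate.
    return {k: (hit / total if total else None) for k, (hit, total) in counts.items()}

gt = [(0, 10), (20, 30), (40, 50), (60, 70)]
pred = [(0, 10), (20, 30), (40, 50), (60, 75)]   # only the last segment is off
rates = position_bias(gt, pred)
# -> {"5'": 1.0, "interior": 1.0, "3'": 0.0}
```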


Frameshift

Reading frame deviation (mod-3) between GT and predicted coding exons. Only valid on single-transcript sequences with coding_label configured.
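As a rough sketch of the idea, the reading frame at each coding base can be taken as the number of preceding coding bases modulo 3, and the deviation as the difference between predicted and GT frames wherever both call the base coding. This assumes a single forward-strand transcript; `frame_deviation` is a hypothetical name, not the package's function.

```python
import numpy as np

def frame_deviation(gt, pred, coding_label):
    """Mod-3 frame difference (pred - GT) at positions coding in both arrays."""
    g = (np.asarray(gt) == coding_label).astype(int)
    p = (np.asarray(pred) == coding_label).astype(int)
    g_frame = (np.cumsum(g) - g) % 3   # coding bases strictly before each position
    p_frame = (np.cumsum(p) - p) % 3
    both = (g == 1) & (p == 1)
    return ((p_frame - g_frame) % 3)[both]

# GT: 6 bp exon, 4 bp intron, 6 bp exon. Prediction ends the first exon 1 bp early.
gt = [0] * 6 + [2] * 4 + [0] * 6
pred = [0] * 5 + [2] * 5 + [0] * 6
dev = frame_deviation(gt, pred, coding_label=0)
in_frame_fraction = float(np.mean(dev == 0))
```

In this toy case the first exon is in frame, but losing a single base shifts every coding base of the second exon into a different frame, which a per-base F1 score would barely register.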


State Transitions (always computed)

Two analyses run on every benchmark call:

  • GT Transition Confusion Matrices: At every position where GT changes label, what did the predictor do? One heatmap per source label.
  • False Transition Analysis: At positions where GT is stable (no label change), did the predictor introduce a spurious transition?
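Both analyses can be approximated with a few array comparisons. The sketch below is illustrative only: it compares the label pair at positions `i` and `i + 1` in GT with the same pair in the prediction, and `transition_analysis` is a hypothetical name rather than a function exported by the package.

```python
from collections import Counter

import numpy as np

def transition_analysis(gt, pred):
    """Return (confusion counts at GT transitions, positions of false transitions)."""
    gt = np.asarray(gt)
    pred = np.asarray(pred)
    gt_change = gt[:-1] != gt[1:]        # True at i if GT changes between i and i+1
    pred_change = pred[:-1] != pred[1:]
    confusion = Counter()
    for i in np.flatnonzero(gt_change):
        gt_t = (int(gt[i]), int(gt[i + 1]))
        pred_t = (int(pred[i]), int(pred[i + 1]))
        confusion[(gt_t, pred_t)] += 1   # what the predictor did at a GT transition
    # GT is stable but the prediction switches label: a spurious transition.
    false_positions = np.flatnonzero(~gt_change & pred_change).tolist()
    return confusion, false_positions

gt = [8, 8, 0, 0, 2, 2, 0, 0, 8]
pred = [8, 8, 0, 0, 0, 2, 0, 8, 8]
confusion, false_positions = transition_analysis(gt, pred)
# false_positions -> [4, 6]
```

Grouping the `confusion` keys by the first label of the GT pair would give one table per source label, matching the per-label heatmaps described above.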

Label Configuration

All metrics are label-agnostic. Define your own token mapping:

from dna_segmentation_benchmark import LabelConfig

config = LabelConfig(
    labels={0: "EXON", 1: "DONOR", 2: "INTRON", 3: "ACCEPTOR", 8: "NONCODING"},
    background_label=8,
    coding_label=0,           # Required for FRAMESHIFT
    splice_donor_label=1,     # Reserved for future splice metrics
    splice_acceptor_label=3,  # Reserved for future splice metrics
    intron_label=2,
)

A pre-built config for the BEND benchmark is available as BEND_LABEL_CONFIG.


W&B Integration

Log metrics during training and full diagnostic reports after:

from dna_segmentation_benchmark import init_wandb_with_presets, log_benchmark_scalars, log_benchmark_full

run = init_wandb_with_presets("my-project", "run-name", label_config, classes=[0])

# During training -- lightweight scalar logging per epoch
log_benchmark_scalars(val_results, label_config, step=epoch, method_prefix="val")

# After training -- full report with figures
log_benchmark_full({"my_model": final_results}, figures, label_config)

Install with: pip install dna-segmentation-benchmark[wandb]


Examples

See the examples/ folder.


Documentation

  • Metrics Reference -- complete documentation of every metric with formulas and aggregation details
  • Design Rationale -- architectural decisions and comparison with gffcompare, Mikado, EGASP

Updating README Plots

The plots in this README are auto-generated from synthetic data. To refresh them:

python scripts/generate_readme_plots.py

This writes PNGs to docs/images/ which are referenced by the README.
