Pipeline-agnostic evaluation and observability for knowledge graph, RAG, and KOS pipelines

spindle-eval

Pipeline-agnostic evaluation and observability framework for knowledge graph, RAG, and KOS pipelines. spindle-eval wraps any pipeline defined as a sequence of Stage objects with structured experiment tracking, automated metrics, parameter sweeps, quality gates, baseline comparisons, and CI/CD regression detection.

Originally built for spindle (a Graph RAG pipeline), spindle-eval is designed to evaluate any pipeline — full end-to-end systems, individual stages, or partial subsets.

Why spindle-eval?

Multi-stage pipelines have many interacting parameters. Tuning them requires more than ad-hoc scripts. spindle-eval provides:

Stage-gated evaluation — each stage must meet quality thresholds before downstream stages run, enforcing upstream-first optimization
Pipeline-agnostic execution — define stages with the Stage protocol, wire them with StageDef, run them with PipelineExecutor
Composable configs — Hydra config groups for every pipeline aspect, enabling single runs or multi-dimensional parameter sweeps
Multiple tracking backends — MLflow for experiments, file-based for CI, composite for multi-backend, no-op for benchmarking
Structured events — thread-safe event store with duration analysis, token tracking, and error filtering
KOS metrics — intrinsic quality metrics for SKOS taxonomies and OWL ontologies (taxonomy depth, label quality, SHACL conformance, etc.)
Automated regression detection — CI compares metrics against baselines with bootstrap confidence intervals
Golden dataset management — versioned evaluation datasets with a question-type taxonomy and extensible reference fields for extraction and KOS evaluation
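
The stage-gated idea in the list above can be illustrated without the library. The following is a minimal sketch in plain Python, not spindle-eval's implementation: `run_gated`, the stage tuples, and the metric names are all hypothetical. It only shows the control flow, in which a failed gate stops every downstream stage.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes for illustration only; these are not spindle-eval types.
Metrics = Dict[str, float]
StageFn = Callable[[dict], Tuple[dict, Metrics]]
Gate = Callable[[Metrics], bool]


def run_gated(
    stages: List[Tuple[str, StageFn, Gate]],
) -> Tuple[Dict[str, Metrics], List[str]]:
    """Run stages in order and stop at the first stage whose gate fails."""
    context: dict = {}
    all_metrics: Dict[str, Metrics] = {}
    skipped: List[str] = []
    for i, (name, fn, gate) in enumerate(stages):
        outputs, metrics = fn(context)
        context[name] = outputs
        all_metrics[name] = metrics
        if not gate(metrics):
            # Downstream stages never run on low-quality upstream output.
            skipped = [n for n, _, _ in stages[i + 1:]]
            break
    return all_metrics, skipped


def chunk(ctx: dict) -> Tuple[dict, Metrics]:
    # Toy stage: pretends chunking produced poorly aligned boundaries.
    return {"chunks": ["a", "b"]}, {"coherence": 0.4}


def extract(ctx: dict) -> Tuple[dict, Metrics]:
    # Toy stage: would run only if the chunking gate passed.
    return {"triples": []}, {"f1": 0.0}


metrics, skipped = run_gated([
    ("chunking", chunk, lambda m: m["coherence"] >= 0.7),
    ("extraction", extract, lambda m: m["f1"] >= 0.5),
])
print(skipped)  # ['extraction']
```

Because the chunking gate fails, extraction is never executed, which is what pushes tuning effort toward the upstream stage first.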

Architecture overview

                    ┌─────────────────────────────┐
                    │     Hydra Configuration      │
                    │  (composable YAML per stage)  │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │     spindle-eval runner      │
                    │  (discovery + orchestration) │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │      PipelineExecutor        │
                    │  (stage wiring, metrics,     │
                    │   gates, event logging)      │
                    └──────────────┬──────────────┘
                                   │
          ┌────────────┬───────────┼───────────┬────────────┐
          ▼            ▼           ▼           ▼            ▼
      Stage 1      Stage 2     Stage 3     Stage N    Metric fns
      (any)        (any)       (any)       (any)     (attached)
          │            │           │           │            │
          └────────────┴───────────┴───────────┴────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │       Tracker backends       │
                    ├──────────┬─────────┬────────┤
                    ▼          ▼         ▼        ▼
                 MLflow     File     Langfuse   No-op
              (experiments) (JSON)  (traces)  (benchmarks)

Installation

pip install spindle-eval

For co-development with a pipeline package (editable install):

pip install -e ".[dev]"
pip install -e /path/to/your-pipeline

Quick start

Full pipeline evaluation

# Single evaluation run
python -m spindle_eval.runner retrieval=hybrid generation=claude evaluation=quick

# Parameter sweep
python -m spindle_eval.runner --multirun \
  preprocessing.chunk_size=256,512,1024 \
  retrieval.top_k=5,10,20
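
To get a feel for the size of that sweep: assuming Hydra's default (basic) sweeper is in use, comma-separated values are expanded into a Cartesian product, so the command above would launch 3 × 3 = 9 runs. The sketch below reproduces that expansion with the standard library only; it does not call Hydra.

```python
from itertools import product

# The two swept parameters from the command above.
sweep = {
    "preprocessing.chunk_size": [256, 512, 1024],
    "retrieval.top_k": [5, 10, 20],
}

keys = list(sweep)
# One dict of overrides per job, in Cartesian-product order.
jobs = [dict(zip(keys, combo)) for combo in product(*sweep.values())]

print(len(jobs))  # 9
for job in jobs[:3]:
    print(" ".join(f"{k}={v}" for k, v in job.items()))
```

Adding a third swept parameter multiplies the job count again, so it is worth estimating cost this way before starting a large sweep.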

Evaluate a single stage

from spindle_eval.pipeline import PipelineExecutor
from spindle_eval.protocols import StageDef, StageResult
from spindle_eval.tracking import create_tracker
from spindle_eval.metrics.chunk_metrics import boundary_coherence, size_distribution

class MyChunker:
    # This stage exposes a name and a run(inputs, cfg) method.
    name = "chunking"

    def run(self, inputs, cfg):
        # do_chunking stands in for your own chunking function (not shown).
        chunks = do_chunking(cfg)
        return StageResult(outputs={"chunks": chunks})

# cfg is not defined in this snippet: pass in your own run configuration.
tracker = create_tracker("file", output_dir="./results")
stages = [
    StageDef(
        name="chunking",
        stage=MyChunker(),
        metrics=[boundary_coherence, size_distribution],
    ),
]
result = PipelineExecutor(tracker).execute(stages, cfg)
tracker.end_run()

Evaluate a KOS builder

from spindle_eval.protocols import StageDef
from spindle_eval.metrics.kos_metrics import taxonomy_depth, label_quality, orphan_concept_ratio

# MyTaxonomyBuilder is your own stage class (not shown).
stages = [
    StageDef(
        name="taxonomy",
        stage=MyTaxonomyBuilder(),
        input_keys={"chunks": "preprocessing.chunks"},
        metrics=[taxonomy_depth, label_quality, orphan_concept_ratio],
        # Gate passes only when the orphan ratio is below 0.3.
        gate=lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3,
    ),
]
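
Two details of that StageDef are easy to miss, so here is a small standalone sketch of each. The gate uses a default of 1.0, which means a missing metric fails the gate rather than passing it. The `resolve` helper is hypothetical: it shows one plausible reading of `input_keys`, where "preprocessing.chunks" means the `chunks` output of a stage named `preprocessing`. How spindle-eval actually resolves these paths is not stated here, so treat that part as an assumption.

```python
from typing import Any, Dict


def gate(m: Dict[str, float]) -> bool:
    # Same condition as the lambda in the StageDef above.
    return m.get("orphan_concept_ratio", 1.0) < 0.3


print(gate({"orphan_concept_ratio": 0.12}))  # True
print(gate({"orphan_concept_ratio": 0.45}))  # False
print(gate({}))  # False: a missing metric fails closed


def resolve(path: str, upstream: Dict[str, Dict[str, Any]]) -> Any:
    """Hypothetical helper: look up '<stage>.<key>' in a dict of stage outputs."""
    stage, key = path.split(".", 1)
    return upstream[stage][key]


# Outputs of earlier stages, keyed by stage name (assumed layout).
upstream = {"preprocessing": {"chunks": ["c1", "c2"]}}
input_keys = {"chunks": "preprocessing.chunks"}
inputs = {name: resolve(path, upstream) for name, path in input_keys.items()}
print(inputs)  # {'chunks': ['c1', 'c2']}
```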

Configuration

Hydra config groups live in spindle_eval/conf/ (packaged with the install) and compose together:

Group	Options	Controls
`preprocessing`	`default`, `small_chunks`, `large_chunks`	Chunking strategy and size
`ontology`	`schema_first`, `schema_free`, `hybrid`	Entity/relation schema discovery
`extraction`	`llm`, `nlp`, `finetuned`	Triple extraction method
`retrieval`	`hybrid`, `local`, `global`, `drift`	Graph retrieval strategy
`generation`	`gpt4`, `claude`, `gemini`	LLM for answer generation
`evaluation`	`quick`, `full`	Number of evaluation examples
`sweep`	`none`, `er_threshold`, `retrieval`, `chunk_size`	Predefined sweep dimensions

Pipeline packages can register additional config groups via Hydra's SearchPathPlugin. See docs/hydra-config-conventions.md.

Metrics

RAG quality (via Ragas)

Faithfulness, context recall, context precision, answer correctness, answer relevancy.

Graph quality

Connectivity, modularity, B-Cubed clustering, CEAF entity alignment, subgraph completeness.
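
As an illustration of one of these metrics, the sketch below computes B-Cubed precision, recall, and F1 for an entity clustering against a gold clustering. It follows the standard per-item definition and is not taken from spindle-eval's source; the function name `b_cubed` and the example data are made up.

```python
from typing import Dict, Hashable, Tuple


def b_cubed(
    pred: Dict[Hashable, Hashable], gold: Dict[Hashable, Hashable]
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) averaged over items present in both maps.

    Each map assigns an item (e.g. an entity mention) to a cluster id.
    """
    items = [i for i in pred if i in gold]
    if not items:
        return 0.0, 0.0, 0.0
    p_sum = r_sum = 0.0
    for i in items:
        same_pred = {j for j in items if pred[j] == pred[i]}
        same_gold = {j for j in items if gold[j] == gold[i]}
        overlap = len(same_pred & same_gold)
        p_sum += overlap / len(same_pred)
        r_sum += overlap / len(same_gold)
    p, r = p_sum / len(items), r_sum / len(items)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Predicted clustering wrongly merges "Apple" with the IBM mentions.
pred = {"IBM": 0, "I.B.M.": 0, "Apple": 0, "Apple Inc.": 1}
gold = {"IBM": "ibm", "I.B.M.": "ibm", "Apple": "apple", "Apple Inc.": "apple"}
p, r, f1 = b_cubed(pred, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.75 0.706
```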

Extraction quality

Triple extraction precision, recall, and F1 — with configurable stage gates.
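
A minimal sketch of how triple-level precision, recall, and F1 can be computed, assuming exact matching on (subject, predicate, object) tuples. The matching rule, the function name `triple_prf`, and the 0.6 gate threshold are illustrative assumptions; the library may normalize or match triples differently.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def triple_prf(
    predicted: Iterable[Triple], reference: Iterable[Triple]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of triples."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = [
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "worked_with", "Charles Babbage"),
    ("Ada Lovelace", "born_in", "Paris"),  # wrong
]
reference = [
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "worked_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Ada Lovelace", "wrote_about", "Analytical Engine"),
]

precision, recall, f1 = triple_prf(predicted, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571

# A stage gate in the same style as the StageDef examples (threshold is made up).
gate_passed = f1 >= 0.6
print(gate_passed)  # False
```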

KOS quality

Taxonomy depth/breadth, label quality, definition completeness, thesaurus connectivity, orphan ratio, axiom density, SHACL conformance. See docs/kos-evaluation-guide.md.
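
To make two of these metrics concrete, here is a standalone sketch over a toy taxonomy stored as a concept-to-parent map (the analogue of `skos:broader`). The definitions used are assumptions for illustration: depth is the longest chain of broader links, and an orphan is a concept with neither a parent nor any children. The `_sketch` functions are not the library's `taxonomy_depth` or `orphan_concept_ratio`.

```python
from typing import Dict, Optional

# concept -> broader concept (None for top-level concepts)
broader: Dict[str, Optional[str]] = {
    "animal": None,
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "misc": None,  # no parent and no children
}


def depth_of(concept: str, broader: Dict[str, Optional[str]]) -> int:
    """Number of broader links between a concept and its top-level ancestor."""
    steps, seen = 0, {concept}
    while broader.get(concept) is not None:
        concept = broader[concept]
        if concept in seen:
            raise ValueError("cycle in broader relation")
        seen.add(concept)
        steps += 1
    return steps


def taxonomy_depth_sketch(broader: Dict[str, Optional[str]]) -> int:
    return max((depth_of(c, broader) for c in broader), default=0)


def orphan_ratio_sketch(broader: Dict[str, Optional[str]]) -> float:
    parents = {p for p in broader.values() if p is not None}
    orphans = [c for c, p in broader.items() if p is None and c not in parents]
    return len(orphans) / len(broader) if broader else 0.0


print(taxonomy_depth_sketch(broader))  # 2
print(orphan_ratio_sketch(broader))    # 0.2
```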

Chunk and provenance quality

Boundary coherence, size distribution, evidence span coverage.

Statistical rigor

Bootstrap confidence intervals for all metrics, used for regression detection in CI.
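
The sketch below shows one common way to build such an interval: a percentile bootstrap over per-example scores, followed by a simple regression rule. Both the resampling details and the rule (flag a regression only when the whole interval lies below the baseline) are assumptions for illustration, not necessarily what spindle-eval's CI check does.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_regression(baseline: float, scores: Sequence[float]) -> bool:
    """Flag a regression only if the entire interval sits below the baseline."""
    _, hi = bootstrap_ci(scores)
    return hi < baseline


per_example_f1 = [0.62, 0.71, 0.58, 0.66, 0.74, 0.69, 0.60, 0.65]
lo, hi = bootstrap_ci(per_example_f1)
print(f"95% CI for mean F1: [{lo:.3f}, {hi:.3f}]")
print(is_regression(0.90, per_example_f1))
```

Comparing the interval rather than the point estimate avoids failing a build over differences that are within sampling noise.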

Tracking backends

Backend	Class	Use case
MLflow	`MLflowTracker`	Production experiment tracking
File	`FileTracker`	Local development, CI
Langfuse	Via OpenTelemetry	Trace-level debugging
No-op	`NoOpTracker`	Benchmarking, unit tests
Composite	`CompositeTracker`	Fan out to multiple backends

from spindle_eval.tracking import create_tracker

# Choose one backend per run:
tracker = create_tracker("mlflow")                        # MLflow experiments
tracker = create_tracker("file", output_dir="./results")  # local files, CI
tracker = create_tracker("noop")                          # benchmarking, unit tests
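
The composite backend is described only as fanning out to multiple backends. The pattern can be sketched in plain Python as follows. Every class here is hypothetical, and `log_metric` is an assumed method name: only `end_run` appears in the examples in this README, so check the Tracker protocol in protocols.py for the real interface.

```python
from typing import Dict, Iterable, List, Protocol


class TrackerLike(Protocol):
    """Assumed minimal interface; not spindle-eval's Tracker protocol."""

    def log_metric(self, name: str, value: float) -> None: ...
    def end_run(self) -> None: ...


class InMemoryTracker:
    """Keeps metrics in a dict, standing in for a real backend."""

    def __init__(self) -> None:
        self.metrics: Dict[str, float] = {}
        self.ended = False

    def log_metric(self, name: str, value: float) -> None:
        self.metrics[name] = value

    def end_run(self) -> None:
        self.ended = True


class FanOutTracker:
    """Forwards every call to each wrapped backend, in order."""

    def __init__(self, backends: Iterable[TrackerLike]) -> None:
        self._backends: List[TrackerLike] = list(backends)

    def log_metric(self, name: str, value: float) -> None:
        for backend in self._backends:
            backend.log_metric(name, value)

    def end_run(self) -> None:
        for backend in self._backends:
            backend.end_run()


a, b = InMemoryTracker(), InMemoryTracker()
tracker = FanOutTracker([a, b])
tracker.log_metric("f1", 0.57)
tracker.end_run()
print(a.metrics == b.metrics, a.ended and b.ended)  # True True
```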

Documentation

Guide	Audience
Spindle Developer Guide	Pipeline developers integrating with spindle-eval
Custom Pipeline Guide	Developers building non-spindle pipelines
KOS Evaluation Guide	Developers evaluating SKOS/OWL knowledge structures
Hydra Config Conventions	Config authors and sweep designers
Tracking Setup	Setting up MLflow/Langfuse (GKE or local Docker)
PyPI Publishing	Building and uploading releases to PyPI

Requirements

Python 3.10+
Pipeline package (optional — mocks used if unavailable, controlled via runner.allow_mock_fallback)

Project structure

spindle-eval/
├── src/spindle_eval/
│   ├── runner.py           # Hydra entrypoint, pipeline discovery
│   ├── pipeline.py         # PipelineExecutor (stage wiring, metrics, gates)
│   ├── protocols.py        # Stage, StageDef, StageResult, Tracker protocols
│   ├── compat.py           # Legacy component dict → StageDef adapter
│   ├── mocks.py            # Mock Stage implementations for testing
│   ├── metrics/            # Ragas, graph, extraction, KOS, chunk, provenance
│   ├── tracking/           # MLflow, file, noop, composite trackers
│   ├── events/             # Event store, duration/token/error analysis
│   ├── datasets/           # Golden dataset loading, KOS reference extraction
│   ├── baselines/          # Baseline runner implementations
│   ├── ci/                 # Regression detection, PR report generation
│   ├── production/         # Feedback loops, staleness monitoring
│   ├── conf/               # Hydra config groups (packaged for pip install)
│   └── golden_data/        # Default evaluation datasets (JSONL)
├── docs/                   # Developer guides
├── baselines/              # Baseline metric snapshots
└── tests/

This version

0.1.0

Mar 11, 2026

Download files

Source Distribution

spindle_eval-0.1.0.tar.gz (57.0 kB)

Uploaded Mar 11, 2026 Source

Built Distribution

spindle_eval-0.1.0-py3-none-any.whl (57.1 kB)

Uploaded Mar 11, 2026 Python 3

File details

Details for the file spindle_eval-0.1.0.tar.gz.

File metadata

Download URL: spindle_eval-0.1.0.tar.gz
Upload date: Mar 11, 2026
Size: 57.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for spindle_eval-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7c079c1c42f003f655cd31a52d48375ed13d7a78441c22a6ed9cfd1e37762999`
MD5	`f3465476b0631233d1243c4f86a50b1a`
BLAKE2b-256	`ea0ac22dcfa392e7b2bf6f0864e0a8868fec8aa45c864df93892bb9b83bb5d0a`

File details

Details for the file spindle_eval-0.1.0-py3-none-any.whl.

File metadata

Download URL: spindle_eval-0.1.0-py3-none-any.whl
Upload date: Mar 11, 2026
Size: 57.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for spindle_eval-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`72be04f4a9f8c1f7debcff064ef0c5a72784aa0665c72460f09581517de4e380`
MD5	`d37cbb4c36cb87a3e7a981a19745eabd`
BLAKE2b-256	`e7d03861e718dc4818279016781c7d4be8ffabe3de329e1da60f7a7645627335`
