spindle-eval

Pipeline-agnostic evaluation and observability framework for knowledge graph, RAG, and KOS pipelines. spindle-eval wraps any pipeline defined as a sequence of Stage objects and adds structured experiment tracking, automated metrics, parameter sweeps, quality gates, baseline comparisons, and CI/CD regression detection.

Originally built for spindle (a Graph RAG pipeline), spindle-eval is designed to evaluate any pipeline — full end-to-end systems, individual stages, or partial subsets.

Why spindle-eval?

Multi-stage pipelines have many interacting parameters. Tuning them requires more than ad-hoc scripts. spindle-eval provides:

  • Stage-gated evaluation — each stage must meet quality thresholds before downstream stages run, enforcing upstream-first optimization
  • Pipeline-agnostic execution — define stages with the Stage protocol, wire them with StageDef, run them with PipelineExecutor
  • Composable configs — Hydra config groups for every pipeline aspect, enabling single runs or multi-dimensional parameter sweeps
  • Multiple tracking backends — MLflow for experiments, file-based for CI, composite for multi-backend, no-op for benchmarking
  • Structured events — thread-safe event store with duration analysis, token tracking, and error filtering
  • KOS metrics — intrinsic quality metrics for SKOS taxonomies and OWL ontologies (taxonomy depth, label quality, SHACL conformance, etc.)
  • Automated regression detection — CI compares metrics against baselines with bootstrap confidence intervals
  • Golden dataset management — versioned evaluation datasets with a question-type taxonomy and extensible reference fields for extraction and KOS evaluation
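
The stage-gating idea in the first bullet can be illustrated without the library. The sketch below is not spindle-eval's PipelineExecutor: SketchStage, run_gated, and the lambdas are hypothetical names, used only to show how a failed gate keeps downstream stages from running.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Metrics = dict[str, float]


@dataclass
class SketchStage:
    """Hypothetical stand-in for a stage definition (not the real StageDef)."""
    name: str
    run: Callable[[dict[str, Any]], dict[str, Any]]
    metric: Callable[[dict[str, Any]], Metrics]
    gate: Optional[Callable[[Metrics], bool]] = None


def run_gated(stages: list[SketchStage]) -> tuple[list[str], Metrics]:
    """Run stages in order and stop after the first stage whose gate fails."""
    outputs: dict[str, Any] = {}
    completed: list[str] = []
    all_metrics: Metrics = {}
    for stage in stages:
        outputs.update(stage.run(outputs))
        metrics = stage.metric(outputs)
        all_metrics.update(metrics)
        completed.append(stage.name)
        if stage.gate is not None and not stage.gate(metrics):
            break  # downstream stages never see the bad upstream output
    return completed, all_metrics


stages = [
    SketchStage(
        name="chunking",
        run=lambda out: {"chunks": ["a b", "c"]},
        metric=lambda out: {"chunk_count": float(len(out["chunks"]))},
        gate=lambda m: m["chunk_count"] >= 5,  # fails: only 2 chunks
    ),
    SketchStage(
        name="extraction",
        run=lambda out: {"triples": []},
        metric=lambda out: {"triple_count": 0.0},
    ),
]
completed, metrics = run_gated(stages)
print(completed, metrics)
```

Because the chunking gate fails, the extraction stage is skipped and only the chunking metrics are recorded.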

Architecture overview

                   ┌───────────────────────────────┐
                   │      Hydra Configuration      │
                   │  (composable YAML per stage)  │
                   └───────────────┬───────────────┘
                                   │
                   ┌───────────────▼───────────────┐
                   │      spindle-eval runner      │
                   │  (discovery + orchestration)  │
                   └───────────────┬───────────────┘
                                   │
                   ┌───────────────▼───────────────┐
                   │       PipelineExecutor        │
                   │    (stage wiring, metrics,    │
                   │     gates, event logging)     │
                   └───────────────┬───────────────┘
                                   │
          ┌────────────┬───────────┼───────────┬────────────┐
          ▼            ▼           ▼           ▼            ▼
       Stage 1      Stage 2     Stage 3     Stage N    Metric fns
        (any)        (any)       (any)       (any)     (attached)
          │            │           │           │            │
          └────────────┴───────────┼───────────┴────────────┘
                                   │
                   ┌───────────────▼───────────────┐
                   │       Tracker backends        │
                   ├──────────┬──────────┬─────────┤
                   ▼          ▼          ▼         ▼
                MLflow      File     Langfuse    No-op
             (experiments) (JSON)    (traces)  (benchmarks)

Installation

pip install spindle-eval

For co-development with a pipeline package (editable install):

pip install -e ".[dev]"
pip install -e /path/to/your-pipeline

Quick start

Full pipeline evaluation

# Single evaluation run
python -m spindle_eval.runner retrieval=hybrid generation=claude evaluation=quick

# Parameter sweep
python -m spindle_eval.runner --multirun \
  preprocessing.chunk_size=256,512,1024 \
  retrieval.top_k=5,10,20
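
With Hydra's default sweeper, --multirun launches one job per combination of the comma-separated values, so the sweep above expands to 3 × 3 = 9 runs. The sketch below only enumerates that grid with the standard library; it does not call Hydra or spindle-eval.

```python
from itertools import product

# Same overrides as in the --multirun command above.
sweep = {
    "preprocessing.chunk_size": [256, 512, 1024],
    "retrieval.top_k": [5, 10, 20],
}

keys = list(sweep)
# Cartesian product: one dict of overrides per job.
jobs = [dict(zip(keys, values)) for values in product(*sweep.values())]

for index, overrides in enumerate(jobs):
    print(index, " ".join(f"{key}={value}" for key, value in overrides.items()))
```

Adding a third swept parameter multiplies the job count again, which is worth checking before launching a sweep that calls an LLM in every run.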

Evaluate a single stage

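from omegaconf import OmegaConf

from spindle_eval.metrics.chunk_metrics import boundary_coherence, size_distribution
from spindle_eval.pipeline import PipelineExecutor
from spindle_eval.protocols import StageDef, StageResult
from spindle_eval.tracking import create_tracker


def do_chunking(cfg):
    """Placeholder chunker for illustration; replace with your own logic."""
    text = "spindle-eval evaluates pipelines stage by stage. " * 20
    size = cfg.preprocessing.chunk_size
    return [text[i:i + size] for i in range(0, len(text), size)]


class MyChunker:
    """A stage: a name plus a run(inputs, cfg) method returning a StageResult."""
    name = "chunking"

    def run(self, inputs, cfg):
        chunks = do_chunking(cfg)
        return StageResult(outputs={"chunks": chunks})


# cfg is normally the Hydra config handed to the runner. It is built inline
# here (illustrative keys) only so the snippet is self-contained.
cfg = OmegaConf.create({"preprocessing": {"chunk_size": 512}})

tracker = create_tracker("file", output_dir="./results")
stages = [
    StageDef(
        name="chunking",
        stage=MyChunker(),
        metrics=[boundary_coherence, size_distribution],
    ),
]
result = PipelineExecutor(tracker).execute(stages, cfg)
tracker.end_run()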

Evaluate a KOS builder

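from spindle_eval.metrics.kos_metrics import taxonomy_depth, label_quality, orphan_concept_ratio
from spindle_eval.protocols import StageDef

# MyTaxonomyBuilder is your own stage class (a name plus a run(inputs, cfg) method).
stages = [
    StageDef(
        name="taxonomy",
        stage=MyTaxonomyBuilder(),
        # Wire the "chunks" input to the chunks produced by the preprocessing stage.
        input_keys={"chunks": "preprocessing.chunks"},
        metrics=[taxonomy_depth, label_quality, orphan_concept_ratio],
        # Pass only if fewer than 30% of concepts are orphans.
        gate=lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3,
    ),
]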
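
The gate above is an ordinary callable over the stage's metrics. Assuming it receives a dict mapping metric names to floats (which the m.get call suggests), the default of 1.0 makes the gate fail closed when the metric is missing. The metric values below are made up for illustration.

```python
# Same gate expression as in the StageDef above.
gate = lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3

passes = gate({"orphan_concept_ratio": 0.12})   # few orphans: gate passes
fails = gate({"orphan_concept_ratio": 0.45})    # too many orphans: gate fails
missing = gate({"taxonomy_depth": 4.0})         # metric absent: default 1.0, gate fails

print(passes, fails, missing)
```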

Configuration

Hydra config groups live in spindle_eval/conf/ (packaged with the install) and compose together:

Group          Options                                    Controls
preprocessing  default, small_chunks, large_chunks        Chunking strategy and size
ontology       schema_first, schema_free, hybrid          Entity/relation schema discovery
extraction     llm, nlp, finetuned                        Triple extraction method
retrieval      hybrid, local, global, drift               Graph retrieval strategy
generation     gpt4, claude, gemini                       LLM for answer generation
evaluation     quick, full                                Number of evaluation examples
sweep          none, er_threshold, retrieval, chunk_size  Predefined sweep dimensions

Pipeline packages can register additional config groups via Hydra's SearchPathPlugin. See docs/hydra-config-conventions.md.

Metrics

RAG quality (via Ragas)

Faithfulness, context recall, context precision, answer correctness, answer relevancy.

Graph quality

Connectivity, modularity, B-CUBED clustering, CEAF entity alignment, subgraph completeness.
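
As a point of reference for the B-CUBED metric, the sketch below implements the usual per-item definition: for each item, precision is the share of its predicted cluster that belongs to its gold cluster, recall is the share of its gold cluster found in its predicted cluster, and both are averaged over all items. This is not spindle-eval's implementation; b_cubed and the example mentions are illustrative.

```python
def b_cubed(predicted: dict[str, str], gold: dict[str, str]) -> tuple[float, float]:
    """Return (precision, recall) for a clustering given item -> cluster id maps."""
    items = list(gold)
    precisions, recalls = [], []
    for item in items:
        same_pred = {other for other in items if predicted[other] == predicted[item]}
        same_gold = {other for other in items if gold[other] == gold[item]}
        overlap = len(same_pred & same_gold)
        precisions.append(overlap / len(same_pred))
        recalls.append(overlap / len(same_gold))
    return sum(precisions) / len(items), sum(recalls) / len(items)


# Made-up entity mentions: the resolver splits one gold entity into two clusters.
gold = {
    "IBM": "e1", "I.B.M.": "e1", "Intl Business Machines": "e1",
    "Apple": "e2", "Apple Inc.": "e2",
}
predicted = {
    "IBM": "c1", "I.B.M.": "c1", "Intl Business Machines": "c2",
    "Apple": "c3", "Apple Inc.": "c3",
}
precision, recall = b_cubed(predicted, gold)
print(round(precision, 3), round(recall, 3))
```

Every predicted cluster is pure, so precision stays at 1.0, while the split entity pulls recall down to 11/15.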

Extraction quality

Triple extraction precision, recall, and F1 — with configurable stage gates.
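
A minimal sketch of exact-match triple scoring, assuming triples are compared as (subject, predicate, object) tuples. spindle-eval's own matching rules may differ (for example in normalization); triple_prf and the example triples are illustrative.

```python
Triple = tuple[str, str, str]


def triple_prf(predicted: set[Triple], gold: set[Triple]) -> dict[str, float]:
    """Exact-match precision, recall and F1 over (subject, predicate, object)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {
    ("Ada Lovelace", "wrote", "Note G"),
    ("Note G", "describes", "Analytical Engine"),
    ("Charles Babbage", "designed", "Analytical Engine"),
}
predicted = {
    ("Ada Lovelace", "wrote", "Note G"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Ada Lovelace", "designed", "Analytical Engine"),  # spurious triple
    ("Note G", "mentions", "Bernoulli numbers"),        # not in the gold set
}
scores = triple_prf(predicted, gold)

# A stage gate over these scores could be as simple as:
gate_passed = scores["f1"] >= 0.5
print(scores, gate_passed)
```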

KOS quality

Taxonomy depth/breadth, label quality, definition completeness, thesaurus connectivity, orphan ratio, axiom density, SHACL conformance. See docs/kos-evaluation-guide.md.
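
To make two of these metrics concrete, the sketch below computes an orphan ratio and a maximum depth from a map of broader links. The definitions are assumptions for illustration: each concept has at most one broader concept (SKOS allows several), and an orphan is a concept with no broader link that is not a declared top concept. The library's own definitions may differ.

```python
def orphan_ratio(concepts: set[str], broader: dict[str, str], roots: set[str]) -> float:
    """Share of concepts that have no broader concept and are not top concepts."""
    if not concepts:
        return 0.0
    orphans = {c for c in concepts if c not in broader and c not in roots}
    return len(orphans) / len(concepts)


def max_depth(concepts: set[str], broader: dict[str, str]) -> int:
    """Longest chain of broader links from any concept up to a parentless concept."""
    deepest = 0
    for concept in concepts:
        depth, seen = 0, {concept}
        while concept in broader:
            concept = broader[concept]
            if concept in seen:  # guard against cycles in malformed data
                break
            seen.add(concept)
            depth += 1
        deepest = max(deepest, depth)
    return deepest


# Made-up taxonomy: "widget" is attached to nothing.
concepts = {"animal", "mammal", "dog", "cat", "widget"}
broader = {"mammal": "animal", "dog": "mammal", "cat": "mammal"}
roots = {"animal"}

ratio = orphan_ratio(concepts, broader, roots)
depth = max_depth(concepts, broader)
print(ratio, depth)
```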

Chunk and provenance quality

Boundary coherence, size distribution, evidence span coverage.

Statistical rigor

Bootstrap confidence intervals for all metrics, used for regression detection in CI.
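
A minimal sketch of a percentile bootstrap over per-example scores, using only the standard library. The regression rule shown (flag only when the whole interval lies below the baseline) is an assumption for illustration, not necessarily the rule spindle-eval's CI applies; bootstrap_ci, is_regression, and the scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


def is_regression(baseline_value, candidate_scores, **kwargs):
    """Flag a regression only when the whole interval sits below the baseline."""
    _, upper = bootstrap_ci(candidate_scores, **kwargs)
    return upper < baseline_value


# Made-up per-example faithfulness scores for a candidate run.
candidate = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.67, 0.75, 0.70]
lower, upper = bootstrap_ci(candidate)
print(f"95% CI for the mean: [{lower:.3f}, {upper:.3f}]")
print(is_regression(0.90, candidate), is_regression(0.70, candidate))
```

Comparing the interval rather than the point estimate keeps small, noisy differences from failing a build.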

Tracking backends

Backend    Class              Use case
MLflow     MLflowTracker      Production experiment tracking
File       FileTracker        Local development, CI
Langfuse   Via OpenTelemetry  Trace-level debugging
No-op      NoOpTracker        Benchmarking, unit tests
Composite  CompositeTracker   Fan out to multiple backends

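from spindle_eval.tracking import create_tracker

# Pick one backend by name; the calls below are alternatives, not a sequence.
tracker = create_tracker("mlflow")                        # MLflow experiment tracking
tracker = create_tracker("file", output_dir="./results")  # JSON files for local runs and CI
tracker = create_tracker("noop")                          # records nothing (benchmarks, unit tests)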
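
The composite backend's fan-out can be pictured with the sketch below. It does not use spindle-eval's classes: ListTracker and FanOutTracker are hypothetical, and log_metric is an assumed method name (only end_run appears in the examples above).

```python
class ListTracker:
    """Hypothetical in-memory backend used only for this sketch."""

    def __init__(self):
        self.metrics = []
        self.ended = False

    def log_metric(self, name, value):  # assumed method name
        self.metrics.append((name, value))

    def end_run(self):
        self.ended = True


class FanOutTracker:
    """Forwards every call to each child tracker, in order."""

    def __init__(self, *children):
        self.children = list(children)

    def log_metric(self, name, value):
        for child in self.children:
            child.log_metric(name, value)

    def end_run(self):
        for child in self.children:
            child.end_run()


first, second = ListTracker(), ListTracker()
tracker = FanOutTracker(first, second)
tracker.log_metric("faithfulness", 0.82)
tracker.end_run()
print(first.metrics, second.metrics)
```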

Documentation

Guide                     Audience
Spindle Developer Guide   Pipeline developers integrating with spindle-eval
Custom Pipeline Guide     Developers building non-spindle pipelines
KOS Evaluation Guide      Developers evaluating SKOS/OWL knowledge structures
Hydra Config Conventions  Config authors and sweep designers
Tracking Setup            Setting up MLflow/Langfuse (GKE or local Docker)
PyPI Publishing           Building and uploading releases to PyPI

Requirements

  • Python 3.10+
  • Pipeline package (optional — mocks used if unavailable, controlled via runner.allow_mock_fallback)

Project structure

spindle-eval/
├── src/spindle_eval/
│   ├── runner.py           # Hydra entrypoint, pipeline discovery
│   ├── pipeline.py         # PipelineExecutor (stage wiring, metrics, gates)
│   ├── protocols.py        # Stage, StageDef, StageResult, Tracker protocols
│   ├── compat.py           # Legacy component dict → StageDef adapter
│   ├── mocks.py            # Mock Stage implementations for testing
│   ├── metrics/            # Ragas, graph, extraction, KOS, chunk, provenance
│   ├── tracking/           # MLflow, file, noop, composite trackers
│   ├── events/             # Event store, duration/token/error analysis
│   ├── datasets/           # Golden dataset loading, KOS reference extraction
│   ├── baselines/          # Baseline runner implementations
│   ├── ci/                 # Regression detection, PR report generation
│   ├── production/         # Feedback loops, staleness monitoring
│   ├── conf/               # Hydra config groups (packaged for pip install)
│   └── golden_data/        # Default evaluation datasets (JSONL)
├── docs/                   # Developer guides
├── baselines/              # Baseline metric snapshots
└── tests/
