Pipeline-agnostic evaluation and observability for knowledge graph, RAG, and KOS pipelines
spindle-eval
Pipeline-agnostic evaluation and observability framework for knowledge graph, RAG, and KOS pipelines. spindle-eval wraps any pipeline defined as a sequence of Stage objects with structured experiment tracking, automated metrics, parameter sweeps, quality gates, baseline comparisons, and CI/CD regression detection.
Originally built for spindle (a Graph RAG pipeline), spindle-eval is designed to evaluate any pipeline — full end-to-end systems, individual stages, or partial subsets.
Why spindle-eval?
Multi-stage pipelines have many interacting parameters. Tuning them requires more than ad-hoc scripts. spindle-eval provides:
- Stage-gated evaluation — each stage must meet quality thresholds before downstream stages run, enforcing upstream-first optimization
- Pipeline-agnostic execution — define stages with the `Stage` protocol, wire them with `StageDef`, run them with `PipelineExecutor`
- Composable configs — Hydra config groups for every pipeline aspect, enabling single runs or multi-dimensional parameter sweeps
- Multiple tracking backends — MLflow for experiments, file-based for CI, composite for multi-backend, no-op for benchmarking
- Structured events — thread-safe event store with duration analysis, token tracking, and error filtering
- KOS metrics — intrinsic quality metrics for SKOS taxonomies and OWL ontologies (taxonomy depth, label quality, SHACL conformance, etc.)
- Automated regression detection — CI compares metrics against baselines with bootstrap confidence intervals
- Golden dataset management — versioned evaluation datasets with a question-type taxonomy and extensible reference fields for extraction and KOS evaluation
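To make the "structured events" item concrete, here is a minimal sketch of a lock-guarded event store with duration, token, and error queries. It is an illustration only: `SketchEvent`, `SketchEventStore`, and their fields are hypothetical names, not spindle-eval's actual event API.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchEvent:
    """Hypothetical event record (not the library's schema)."""
    stage: str
    duration_s: float
    tokens: int = 0
    error: Optional[str] = None


class SketchEventStore:
    """Append-only event list guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: list[SketchEvent] = []

    def record(self, event: SketchEvent) -> None:
        with self._lock:
            self._events.append(event)

    def errors(self) -> list[SketchEvent]:
        with self._lock:
            return [e for e in self._events if e.error is not None]

    def total_tokens(self) -> int:
        with self._lock:
            return sum(e.tokens for e in self._events)

    def duration_by_stage(self) -> dict[str, float]:
        with self._lock:
            totals: dict[str, float] = {}
            for e in self._events:
                totals[e.stage] = totals.get(e.stage, 0.0) + e.duration_s
            return totals


store = SketchEventStore()


def worker(i: int) -> None:
    # Simulate eight concurrent calls, one of which fails.
    error = "timeout" if i == 3 else None
    store.record(SketchEvent("extraction", duration_s=0.5, tokens=10, error=error))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store.total_tokens(), len(store.errors()), store.duration_by_stage())
```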
Architecture overview
```
               ┌─────────────────────────────┐
               │     Hydra Configuration     │
               │ (composable YAML per stage) │
               └──────────────┬──────────────┘
                              │
               ┌──────────────▼──────────────┐
               │     spindle-eval runner     │
               │ (discovery + orchestration) │
               └──────────────┬──────────────┘
                              │
               ┌──────────────▼──────────────┐
               │      PipelineExecutor       │
               │   (stage wiring, metrics,   │
               │    gates, event logging)    │
               └──────────────┬──────────────┘
                              │
     ┌────────────┬───────────┼───────────┬────────────┐
     ▼            ▼           ▼           ▼            ▼
  Stage 1      Stage 2     Stage 3     Stage N    Metric fns
   (any)        (any)       (any)       (any)     (attached)
     │            │           │           │            │
     └────────────┴───────────┴───────────┴────────────┘
                              │
               ┌──────────────▼──────────────┐
               │      Tracker backends       │
               ├──────────┬─────────┬────────┤
               ▼          ▼         ▼        ▼
            MLflow      File    Langfuse   No-op
         (experiments) (JSON)   (traces) (benchmarks)
```
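The flow in the diagram (run a stage, compute its attached metrics, log them, check the gate) can be sketched in plain Python. This is an illustration under stated assumptions, not spindle-eval's implementation: `SketchStageDef`, `RecordingTracker`, `run_stages`, and the `log_metrics` method are hypothetical names, and the real `PipelineExecutor` may pass outputs and handle failed gates differently.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

Metrics = dict[str, float]


@dataclass
class SketchStageDef:
    """Hypothetical stand-in for a stage definition."""
    name: str
    run: Callable[[dict[str, Any]], dict[str, Any]]
    metrics: list[Callable[[dict[str, Any]], Metrics]] = field(default_factory=list)
    gate: Optional[Callable[[Metrics], bool]] = None


class RecordingTracker:
    """Hypothetical tracker that keeps metrics in memory."""

    def __init__(self) -> None:
        self.logged: dict[str, Metrics] = {}

    def log_metrics(self, stage: str, metrics: Metrics) -> None:
        self.logged[stage] = metrics


def run_stages(stages: list[SketchStageDef], tracker: RecordingTracker) -> list[str]:
    """Run stages in order and stop at the first stage whose gate fails."""
    outputs: dict[str, Any] = {}
    completed: list[str] = []
    for stage in stages:
        outputs.update(stage.run(outputs))
        metrics: Metrics = {}
        for metric_fn in stage.metrics:
            metrics.update(metric_fn(outputs))
        tracker.log_metrics(stage.name, metrics)
        if stage.gate is not None and not stage.gate(metrics):
            break  # downstream stages never run
        completed.append(stage.name)
    return completed


tracker = RecordingTracker()
stages = [
    SketchStageDef(
        name="chunking",
        run=lambda outputs: {"chunks": ["a b c", "d e"]},
        metrics=[lambda outputs: {"n_chunks": float(len(outputs["chunks"]))}],
        gate=lambda m: m["n_chunks"] >= 3,  # fails here: only 2 chunks
    ),
    SketchStageDef(name="extraction", run=lambda outputs: {"triples": []}),
]
completed = run_stages(stages, tracker)
print(completed, tracker.logged)
```

Because the chunking gate fails, the extraction stage is skipped, which is the upstream-first behavior that stage gating is meant to enforce.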
Installation
```bash
pip install spindle-eval
```

For co-development with a pipeline package (editable install):

```bash
pip install -e ".[dev]"
pip install -e /path/to/your-pipeline
```
Quick start
Full pipeline evaluation
```bash
# Single evaluation run
python -m spindle_eval.runner retrieval=hybrid generation=claude evaluation=quick

# Parameter sweep
python -m spindle_eval.runner --multirun \
    preprocessing.chunk_size=256,512,1024 \
    retrieval.top_k=5,10,20
```
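With Hydra's default sweeper, comma-separated overrides expand into their Cartesian product, so the sweep above launches 3 × 3 = 9 runs. The sketch below reproduces that expansion in plain Python to show which combinations are covered. It does not call Hydra, and the job order shown is not necessarily the order Hydra uses.

```python
from itertools import product

sweep = {
    "preprocessing.chunk_size": [256, 512, 1024],
    "retrieval.top_k": [5, 10, 20],
}

# One job per combination of values.
jobs = [dict(zip(sweep, values)) for values in product(*sweep.values())]

for job in jobs:
    print(" ".join(f"{key}={value}" for key, value in job.items()))
```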
Evaluate a single stage
```python
from spindle_eval.pipeline import PipelineExecutor
from spindle_eval.protocols import StageDef, StageResult
from spindle_eval.tracking import create_tracker
from spindle_eval.metrics.chunk_metrics import boundary_coherence, size_distribution


# Example stage exposing `name` and `run(inputs, cfg)`.
class MyChunker:
    name = "chunking"

    def run(self, inputs, cfg):
        # do_chunking stands in for your own chunking logic.
        chunks = do_chunking(cfg)
        return StageResult(outputs={"chunks": chunks})


tracker = create_tracker("file", output_dir="./results")

stages = [
    StageDef(
        name="chunking",
        stage=MyChunker(),
        metrics=[boundary_coherence, size_distribution],
    ),
]

# cfg is the run configuration, defined elsewhere (for example, the Hydra config).
result = PipelineExecutor(tracker).execute(stages, cfg)
tracker.end_run()
```
Evaluate a KOS builder
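```python
from spindle_eval.protocols import StageDef
from spindle_eval.metrics.kos_metrics import taxonomy_depth, label_quality, orphan_concept_ratio

# MyTaxonomyBuilder stands in for your own stage class.
stages = [
    StageDef(
        name="taxonomy",
        stage=MyTaxonomyBuilder(),
        input_keys={"chunks": "preprocessing.chunks"},
        metrics=[taxonomy_depth, label_quality, orphan_concept_ratio],
        # Gate: pass only if fewer than 30% of concepts are orphans.
        gate=lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3,
    ),
]
```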
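In the example above, the gate is a callable that receives the stage's metrics dict and returns a boolean. When a stage needs several thresholds, a small factory keeps the gate readable. The helper below is a sketch, not part of spindle-eval: `max_thresholds_gate` is a hypothetical name, and treating a missing metric as a failure mirrors the lambda's default of `1.0`.

```python
from typing import Callable, Mapping


def max_thresholds_gate(
    limits: Mapping[str, float],
) -> Callable[[Mapping[str, float]], bool]:
    """Build a gate that passes only if every listed metric is below its limit."""

    def gate(metrics: Mapping[str, float]) -> bool:
        # A metric that was never computed counts as a failure.
        return all(
            name in metrics and metrics[name] < limit
            for name, limit in limits.items()
        )

    return gate


gate = max_thresholds_gate({"orphan_concept_ratio": 0.3})
print(gate({"orphan_concept_ratio": 0.12}))  # True
print(gate({"orphan_concept_ratio": 0.45}))  # False
print(gate({}))                              # False
```

The returned function has the same shape as the lambda, so under that assumption it could be passed as `gate=` in a `StageDef`.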
Configuration
Hydra config groups live in spindle_eval/conf/ (packaged with the install) and compose together:
| Group | Options | Controls |
|---|---|---|
| `preprocessing` | `default`, `small_chunks`, `large_chunks` | Chunking strategy and size |
| `ontology` | `schema_first`, `schema_free`, `hybrid` | Entity/relation schema discovery |
| `extraction` | `llm`, `nlp`, `finetuned` | Triple extraction method |
| `retrieval` | `hybrid`, `local`, `global`, `drift` | Graph retrieval strategy |
| `generation` | `gpt4`, `claude`, `gemini` | LLM for answer generation |
| `evaluation` | `quick`, `full` | Number of evaluation examples |
| `sweep` | `none`, `er_threshold`, `retrieval`, `chunk_size` | Predefined sweep dimensions |
Pipeline packages can register additional config groups via Hydra's SearchPathPlugin. See docs/hydra-config-conventions.md.
Metrics
RAG quality (via Ragas)
Faithfulness, context recall, context precision, answer correctness, answer relevancy.
Graph quality
Connectivity, modularity, B-CUBED clustering, CEAF entity alignment, subgraph completeness.
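As a reference point for the B-CUBED metric, the sketch below computes B-CUBED precision, recall, and F1 for an entity clustering against a gold clustering. It follows the standard per-item definition and is not taken from spindle-eval's code. The function name `b_cubed` and the item-to-cluster-id input format are assumptions.

```python
def b_cubed(predicted: dict[str, str], gold: dict[str, str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over items present in both mappings.

    Each mapping sends an item (for example, an entity mention) to a cluster id.
    """
    items = sorted(predicted.keys() & gold.keys())
    if not items:
        return 0.0, 0.0, 0.0

    precision = 0.0
    recall = 0.0
    for item in items:
        same_pred = {o for o in items if predicted[o] == predicted[item]}
        same_gold = {o for o in items if gold[o] == gold[item]}
        overlap = len(same_pred & same_gold)
        precision += overlap / len(same_pred)
        recall += overlap / len(same_gold)

    precision /= len(items)
    recall /= len(items)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {"a": "acme", "b": "acme", "c": "acme", "d": "globex", "e": "globex"}
predicted = {"a": "1", "b": "1", "c": "2", "d": "2", "e": "2"}
print(b_cubed(predicted, gold))
```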
Extraction quality
Triple extraction precision, recall, and F1 — with configurable stage gates.
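A minimal sketch of triple-level precision, recall, and F1 follows. It assumes exact matching on `(subject, predicate, object)` tuples; spindle-eval's own metric may normalize or match triples differently. `triple_prf` is a hypothetical name.

```python
Triple = tuple[str, str, str]


def triple_prf(predicted: set[Triple], gold: set[Triple]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over sets of triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {
    ("Ada", "wrote", "Notes"),
    ("Ada", "born_in", "London"),
    ("Babbage", "designed", "Engine"),
    ("Ada", "worked_with", "Babbage"),
}
predicted = {
    ("Ada", "wrote", "Notes"),
    ("Ada", "worked_with", "Babbage"),
    ("Ada", "born_in", "Paris"),  # wrong object, so not a match
}
scores = triple_prf(predicted, gold)
print(scores)
```

A stage gate over these scores could then be as simple as `lambda m: m["f1"] >= 0.5`, with the threshold chosen per project.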
KOS quality
Taxonomy depth/breadth, label quality, definition completeness, thesaurus connectivity, orphan ratio, axiom density, SHACL conformance. See docs/kos-evaluation-guide.md.
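To illustrate two of these metrics, the sketch below computes taxonomy depth and an orphan ratio over a simple concept-to-broader-concept mapping. The definitions are assumptions made for the example (depth as the longest chain of broader links, an orphan as a concept with neither a broader nor a narrower concept); the library's `taxonomy_depth` and `orphan_concept_ratio` may be defined differently and operate on different inputs.

```python
from typing import Optional


def sketch_taxonomy_depth(broader: dict[str, Optional[str]]) -> int:
    """Longest chain of broader links from any concept up to a root."""

    def depth(concept: str, seen: frozenset) -> int:
        parent = broader.get(concept)
        if parent is None or parent in seen:  # root reached, or cycle guard
            return 0
        return 1 + depth(parent, seen | {parent})

    return max((depth(c, frozenset({c})) for c in broader), default=0)


def sketch_orphan_ratio(broader: dict[str, Optional[str]]) -> float:
    """Share of concepts with no broader concept and no narrower concepts."""
    if not broader:
        return 0.0
    parents = {p for p in broader.values() if p is not None}
    orphans = [c for c, p in broader.items() if p is None and c not in parents]
    return len(orphans) / len(broader)


taxonomy = {
    "animal": None,
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "misc": None,  # isolated concept
}
print(sketch_taxonomy_depth(taxonomy), sketch_orphan_ratio(taxonomy))
```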
Chunk and provenance quality
Boundary coherence, size distribution, evidence span coverage.
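For intuition, a size-distribution metric can be as simple as summary statistics over chunk lengths. The sketch below counts whitespace-separated tokens, which is an assumption; the library's `size_distribution` may count characters or model tokens and return different keys.

```python
from statistics import mean, median, pstdev


def sketch_size_distribution(chunks: list[str]) -> dict[str, float]:
    """Summarize chunk sizes, counted in whitespace-separated tokens."""
    sizes = [len(chunk.split()) for chunk in chunks]
    if not sizes:
        return {}
    return {
        "min": float(min(sizes)),
        "max": float(max(sizes)),
        "mean": float(mean(sizes)),
        "median": float(median(sizes)),
        "stdev": pstdev(sizes),
    }


stats = sketch_size_distribution([
    "one two three",
    "four five",
    "six seven eight nine ten",
])
print(stats)
```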
Statistical rigor
Bootstrap confidence intervals for all metrics, used for regression detection in CI.
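The idea can be sketched with a percentile bootstrap over per-example scores. This is a generic illustration, not spindle-eval's implementation: `bootstrap_ci`, `is_regression`, the resample count, and the decision rule (flag a regression only when the whole interval lies below the baseline) are all assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(
    scores: list[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


def is_regression(candidate_scores: list[float], baseline_mean: float) -> bool:
    """Flag a regression only when the whole interval sits below the baseline."""
    _, upper = bootstrap_ci(candidate_scores)
    return upper < baseline_mean


scores = [0.80, 0.82, 0.79, 0.85, 0.81, 0.78, 0.83, 0.80]
lower, upper = bootstrap_ci(scores)
print(f"95% CI for the mean: [{lower:.3f}, {upper:.3f}]")
print(is_regression(scores, baseline_mean=0.95))
```

Comparing intervals rather than point estimates keeps CI from failing on noise when the evaluation set is small.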
Tracking backends
| Backend | Class | Use case |
|---|---|---|
| MLflow | `MLflowTracker` | Production experiment tracking |
| File | `FileTracker` | Local development, CI |
| Langfuse | Via OpenTelemetry | Trace-level debugging |
| No-op | `NoOpTracker` | Benchmarking, unit tests |
| Composite | `CompositeTracker` | Fan out to multiple backends |
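```python
from spindle_eval.tracking import create_tracker

# Pick the backend that fits the run:
tracker = create_tracker("mlflow")                        # experiment tracking
tracker = create_tracker("file", output_dir="./results")  # local development, CI
tracker = create_tracker("noop")                          # benchmarking, unit tests
```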
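The composite backend fans each call out to several trackers. The pattern can be sketched as follows. The class names and the `log_metrics` method are hypothetical (only `end_run` appears in the examples above), so treat this as an illustration of the pattern rather than the `CompositeTracker` API.

```python
from typing import Protocol


class SketchTracker(Protocol):
    """Hypothetical minimal tracker interface."""

    def log_metrics(self, metrics: dict[str, float]) -> None: ...
    def end_run(self) -> None: ...


class InMemoryTracker:
    """Keeps metrics in a dict; stands in for a real backend."""

    def __init__(self) -> None:
        self.metrics: dict[str, float] = {}
        self.ended = False

    def log_metrics(self, metrics: dict[str, float]) -> None:
        self.metrics.update(metrics)

    def end_run(self) -> None:
        self.ended = True


class FanOutTracker:
    """Forwards every call to each wrapped tracker, in order."""

    def __init__(self, *trackers: SketchTracker) -> None:
        self._trackers = trackers

    def log_metrics(self, metrics: dict[str, float]) -> None:
        for tracker in self._trackers:
            tracker.log_metrics(metrics)

    def end_run(self) -> None:
        for tracker in self._trackers:
            tracker.end_run()


first, second = InMemoryTracker(), InMemoryTracker()
combined = FanOutTracker(first, second)
combined.log_metrics({"faithfulness": 0.91})
combined.end_run()
print(first.metrics, second.metrics)
```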
Documentation
| Guide | Audience |
|---|---|
| Spindle Developer Guide | Pipeline developers integrating with spindle-eval |
| Custom Pipeline Guide | Developers building non-spindle pipelines |
| KOS Evaluation Guide | Developers evaluating SKOS/OWL knowledge structures |
| Hydra Config Conventions | Config authors and sweep designers |
| Tracking Setup | Setting up MLflow/Langfuse (GKE or local Docker) |
| PyPI Publishing | Building and uploading releases to PyPI |
Requirements
- Python 3.10+
- Pipeline package (optional; mocks are used if it is unavailable, controlled via `runner.allow_mock_fallback`)
Project structure
```
spindle-eval/
├── src/spindle_eval/
│   ├── runner.py          # Hydra entrypoint, pipeline discovery
│   ├── pipeline.py        # PipelineExecutor (stage wiring, metrics, gates)
│   ├── protocols.py       # Stage, StageDef, StageResult, Tracker protocols
│   ├── compat.py          # Legacy component dict → StageDef adapter
│   ├── mocks.py           # Mock Stage implementations for testing
│   ├── metrics/           # Ragas, graph, extraction, KOS, chunk, provenance
│   ├── tracking/          # MLflow, file, noop, composite trackers
│   ├── events/            # Event store, duration/token/error analysis
│   ├── datasets/          # Golden dataset loading, KOS reference extraction
│   ├── baselines/         # Baseline runner implementations
│   ├── ci/                # Regression detection, PR report generation
│   ├── production/        # Feedback loops, staleness monitoring
│   ├── conf/              # Hydra config groups (packaged for pip install)
│   └── golden_data/       # Default evaluation datasets (JSONL)
├── docs/                  # Developer guides
├── baselines/             # Baseline metric snapshots
└── tests/
```