DedeuceRL
A modular framework for active system identification benchmarks.
Benchmark LLMs on Active System Identification — probe hidden systems, form hypotheses, verify correctness.
pip install dedeucerl
dedeucerl-generate --skin mealy --seeds 0-4 --budget 25 --n-states 3 -o tasks.json
dedeucerl-eval --skin mealy --split tasks.json --model heuristic:none --out results.jsonl
dedeucerl-eval-parallel --jobs 4 --out results.jsonl --skin mealy --split tasks.json --model heuristic:none # merged output
Why DedeuceRL?
Modern LLMs excel at knowledge retrieval and static reasoning, but struggle with active exploration — systematically probing unknown systems and deducing their structure from observations.
DedeuceRL benchmarks this capability by requiring agents to:
| Capability | What We Test |
|---|---|
| Systematic Exploration | Strategically select probes to maximize information gain |
| Hypothesis Formation | Build mental models of hidden system dynamics |
| Efficient Verification | Minimize queries while ensuring correctness |
| Safety Awareness | Avoid dangerous "trap" states that penalize reward |
Research Context: Active system identification builds on Angluin's L* algorithm for active automata learning, conformance testing (W-method), and query-based learning theory. See Angluin (1987), Vaandrager (2017).
Table of Contents
- Installation
- Quickstart
- Available Skins
- Interactive Game
- Training with RL
- CLI Reference
- Creating New Skins
- Metrics
- Citation
- Contributing
Installation
pip install dedeucerl # Core
pip install "dedeucerl[openai]" # + OpenAI adapter
pip install "dedeucerl[all]" # All providers
pip install "dedeucerl[gemini]" # + Gemini adapter (google-genai)
pip install "dedeucerl[rl]" # Verifiers RL trainer extras
Development installation
git clone https://github.com/AashVed/DedeuceRL.git
cd DedeuceRL
pip install -e ".[dev]"
Requirements: Python 3.10+ · verifiers>=0.1.9 · datasets>=2.0
Quickstart
1. Generate a task split
dedeucerl-generate --skin mealy --seeds 0-9 --budget 25 --n-states 3 -o tasks.json
2. Evaluate a model
export OPENAI_API_KEY="sk-..."
dedeucerl-eval --skin mealy --split tasks.json --model openai:gpt-4o --out results.jsonl
3. View results
dedeucerl-aggregate results.jsonl --format markdown
Output:
| Model | Episodes | Success Rate | Trap Rate | Avg Queries | Avg Reward |
|---------------|----------|--------------|-----------|-------------|------------|
| openai:gpt-4o | 10 | 40.0% | 20.0% | 18.2 | 0.318 |
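The summary columns can also be computed straight from the results file. A minimal sketch, assuming each JSONL line carries the `success`, `trap_hit`, `queries_used`, and `reward` fields listed under Metrics (the helper name is illustrative, not part of the package):

```python
import json

def summarize(jsonl_lines):
    """Aggregate per-episode records into leaderboard-style stats."""
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    n = len(rows)
    return {
        "episodes": n,
        "success_rate": sum(r["success"] for r in rows) / n,
        "trap_rate": sum(r["trap_hit"] for r in rows) / n,
        "avg_queries": sum(r["queries_used"] for r in rows) / n,
        "avg_reward": sum(r["reward"] for r in rows) / n,
    }

# Two toy episodes: one clean success, one trap-terminated failure.
demo = [
    '{"success": 1, "trap_hit": 0, "queries_used": 12, "reward": 0.88}',
    '{"success": 0, "trap_hit": 1, "queries_used": 25, "reward": 0.0}',
]
stats = summarize(demo)
```

For real runs, `dedeucerl-aggregate` does this (and more) for you; the sketch just shows what the columns mean.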
Available Skins
DedeuceRL ships with multiple "skins" — domain-specific instantiations of the active identification paradigm:
| Skin | Domain | What the Agent Must Identify |
|---|---|---|
| mealy | Automata Theory | Hidden Mealy machine (state × input → output) |
| protocol | API Testing | REST API state-dependent behavior |
| apienv | SaaS Systems | API with methods, endpoints, variants, response schemas |
| exprpolicy | DSL Debugging | Typed policy expression (compile + test + submit) |
Skin details
Mealy (Reference Skin)
The agent identifies a hidden Mealy machine (finite-state transducer).
- Tools: act(symbol) → probe, submit_table(json) → submit hypothesis
- Features: Isomorphism checking, counterexample feedback, trap transitions
- Guarantees: Generated machines are minimal and fully reachable
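A Mealy machine is just a transition table mapping (state, input symbol) to (next state, output). The toy model below (its encoding is illustrative and not the library's internal or submit_table format) shows what the agent is probing with act:

```python
# Toy 3-state Mealy machine: (state, symbol) -> (next_state, output).
TRANS = {
    (0, "A"): (1, 1), (0, "B"): (0, 2),
    (1, "A"): (2, 0), (1, "B"): (0, 1),
    (2, "A"): (2, 1), (2, "B"): (1, 0),
}

def run(word, start=0):
    """Feed a word to the machine and collect the output sequence."""
    state, outputs = start, []
    for sym in word:
        state, out = TRANS[(state, sym)]
        outputs.append(out)
    return outputs
```

Each act(symbol) call reveals one output; identifying the machine means recovering the full table from such probes.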
Protocol
Reverse-engineer a stateful REST API.
- Tools: api_call(method, endpoint) → probe, submit_spec(json) → submit
- Features: State-dependent HTTP responses, behavioral equivalence
APIEnv
Realistic SaaS API identification with variants and response schemas.
- Tools: api_call(method, endpoint, variant) → probe, submit_spec(json) → submit
- Features: Complex multi-dimensional action space
ExprPolicy
Debug a typed policy DSL using compiler feedback and test suites.
- Tools: type_check(expr), run_tests(expr, suite), submit(expr)
- Features: Hidden tests, counterexample feedback, token constraints
Interactive Game
Play any skin as a human agent to understand the challenge:
Note: cliGame is a repo-only helper and is not installed via pip install dedeucerl.
python -m cliGame
🎮 DedeuceRL Interactive Game
Available skins: mealy, protocol, apienv, exprpolicy
Select skin [1-4]: 1
Enter seed (int): 42
=== SYSTEM PROMPT ===
You are identifying a hidden Mealy machine...
=== YOUR TURN ===
> act A
{"output": 1, "budget_left": 24, "trap_hit": false}
> act B
{"output": 2, "budget_left": 23, "trap_hit": false}
> submit_table {"n":3,"start":0,"trans":{...}}
{"ok": true}
Commands: :help :tools :prompt :state :quit
Generating Tasks
CLI Generator (recommended)
# Show available parameters for a skin
dedeucerl-generate --skin mealy --show-skin-params --seeds 0 --budget 25
# Generate 100-episode Mealy test split
dedeucerl-generate \
--skin mealy \
--seeds 0-99 \
--subset test \
--budget 100 \
--n-states 4 \
--no-trap \
-o seeds/mealy_test.json
# Generate Protocol split
dedeucerl-generate \
--skin protocol \
--seeds 0-99 \
--budget 120 \
--n-endpoints 5 \
--n-states 4 \
-o seeds/protocol_test.json
Python API
from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator
gen = TaskGenerator(MealyEnv)
split = gen.generate_split(
seeds=list(range(100)),
budget=25,
subset_name="test",
n_states=5,
trap=True,
)
gen.save_split(split, "seeds/mealy_test.json")
# Build HuggingFace Dataset
dataset = gen.build_dataset("seeds/mealy_test.json", "test", feedback=True)
Pre-built splits: 🤗 comfortably-dumb/DedeuceRL
Guide: Running Evaluations
Method 1: CLI (Recommended)
# Basic evaluation
dedeucerl-eval \
--skin mealy \
--split seeds/mealy_smoke.json \
--model openai:gpt-4o \
--out results.jsonl
# With all options
dedeucerl-eval \
--skin apienv \
--split seeds/apienv_smoke.json \
--model anthropic:claude-3-opus-20240229 \
--rollouts 3 \
--feedback \
--temperature 0.0 \
--verbose \
--out results/apienv_claude.jsonl
Supported Model Specs
| Provider | Format | Examples |
|---|---|---|
| OpenAI | openai:<model> | openai:gpt-4o, openai:gpt-4-turbo |
| Anthropic | anthropic:<model> | anthropic:claude-3-opus-20240229 |
| Gemini | gemini:<model> | gemini:gemini-1.5-pro |
| OpenRouter | openrouter:<model> | openrouter:meta-llama/llama-3-70b |
Method 2: Python API
from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator, make_rubric
from dedeucerl.adapters import get_adapter
# Setup
generator = TaskGenerator(MealyEnv)
dataset = generator.build_dataset("seeds/mealy_smoke.json", "dev", feedback=True)
rubric = make_rubric()
env = MealyEnv(dataset=dataset, rubric=rubric, feedback=True, max_turns=30)
# Get adapter
adapter = get_adapter("openai:gpt-4o", temperature=0.0)
# Run episode manually
item = dataset[0]
state = {"prompt": item["prompt"], "answer": item["answer"]}
# ... custom evaluation loop
Aggregating Results
# CSV (for spreadsheets)
dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv
# Markdown (for README/reports)
dedeucerl-aggregate results.jsonl --format markdown
# JSON (for programmatic use)
dedeucerl-aggregate results.jsonl --format json -o summary.json
# Multiple files
dedeucerl-aggregate results/*.jsonl --format markdown
Output columns: model, n_episodes, success_rate, trap_rate, avg_queries, avg_reward
Hugging Face Dataset
Public task splits (MIT-licensed) are available at 🤗 comfortably-dumb/DedeuceRL.
Training with RL
DedeuceRL environments inherit from verifiers.StatefulToolEnv, making them directly compatible with RL training frameworks.
Quick Start with vf.RLTrainer
# Install verifiers with RL support
uv add 'verifiers[rl]'
# Run training (create your own config based on verifiers docs)
uv run vf-rl @ your-config.toml
Sample configs are included in configs/vf-rl/ (e.g., dedeucerl-mealy.toml).
Example Configuration
# your-config.toml (example)
model = "Qwen/Qwen3-4B-Instruct"
[env]
id = "dedeucerl.vf_env"
[env.args]
skin = "mealy"
seeds = [0, 1, 2, 3, 4]
budget = 25
n_states = 4
subset = "train"
feedback = true
reward_mode = "train_dense"
[trainer.args]
run_name = "dedeucerl-mealy"
micro_batch_size = 4
rollouts_per_example = 16
batch_size = 1024
max_steps = 500
Custom Skins
If you create your own skin, you can pass it by import path:
[env.args]
skin = "my_pkg.my_skin:MySkinEnv"
Alternative Training Frameworks
DedeuceRL is also compatible with:
- prime-rl — Async RL at scale with FSDP2 + vLLM
- SkyRL — Verifiers integration
- Tinker — Verifiers recipes
Custom reward functions
from verifiers import Rubric, Parser
def efficiency_reward(completion, answer, state, parser):
"""Reward efficiency: fewer queries = higher reward."""
if not state.get("ok", False):
return 0.0
queries = state.get("queries_used", 0)
budget = state.get("budget_init", 25)
efficiency = 1.0 - (queries / budget)
trap_penalty = 0.5 if state.get("trap_hit", False) else 0.0
return efficiency - trap_penalty
custom_rubric = Rubric(
funcs=[efficiency_reward],
weights=[1.0],
parser=Parser(extract_fn=lambda s: s),
)
env = MealyEnv(dataset=dataset, rubric=custom_rubric, feedback=True, max_turns=30)
See verifiers training docs for complete setup instructions.
CLI Reference
dedeucerl-eval
Run evaluations on a skin.
dedeucerl-eval \
--skin mealy \
--split seeds/mealy_smoke.json \
--model openai:gpt-4o \
--rollouts 1 \
--out results.jsonl \
--feedback \
--temperature 0.0 \
--verbose
Supported model specs: openai:gpt-4o · anthropic:claude-3-opus-20240229 · gemini:gemini-1.5-pro · openrouter:<model>
Optional effort (supported models only): --effort high|xhigh|... (validated via a cheap probe; disable with --no-effort-probe)
Episode selection + sharding:
# Run only specific episodes
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --episodes 0-4,9
# Run shard 1 of 4 (0-based shard index)
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --shard 1/4
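One natural reading of --shard I/N is a round-robin partition of episode indices; the sketch below illustrates that idea only, and the assignment rule DedeuceRL actually uses is an assumption here:

```python
def shard_episodes(n_episodes, shard_index, n_shards):
    """Round-robin split of episode indices across shards (assumed scheme)."""
    return [i for i in range(n_episodes) if i % n_shards == shard_index]
```

Under this scheme every episode lands in exactly one shard, so running all N shards covers the split with no overlap.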
Resume runs (split-aware):
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --resume --out results.jsonl
Resume is safe across restarts because each result line includes a split_hash derived from the split file + subset.
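The split-hash idea can be illustrated as fingerprinting the split file contents together with the subset name, so stale results from a different split are never silently reused. The exact derivation DedeuceRL uses may differ from this sketch:

```python
import hashlib

def split_hash(split_json: str, subset: str) -> str:
    """Stable fingerprint of split file contents + subset (illustrative)."""
    h = hashlib.sha256()
    h.update(split_json.encode("utf-8"))
    h.update(b"\x00")  # separator so (file, subset) pairs can't collide
    h.update(subset.encode("utf-8"))
    return h.hexdigest()[:16]
```

Because the hash changes whenever the split file or subset changes, resumed runs only skip episodes recorded under the matching fingerprint.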
dedeucerl-eval-parallel
Run shard-parallel evals and merge results.
dedeucerl-eval-parallel \
--jobs 4 \
--out results.jsonl \
--skin mealy \
--split seeds/mealy_smoke.json \
--model openai:gpt-4o
This writes a single merged JSONL to --out. Per-shard part files are deleted by default (use --keep-parts to keep them). You can then run dedeucerl-aggregate results.jsonl as usual.
dedeucerl-aggregate
Aggregate results into a leaderboard.
dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv
dedeucerl-aggregate results.jsonl --format markdown
dedeucerl-aggregate results.jsonl --format json -o results_summary.json
dedeucerl-train
Generate a vf-rl config (and optionally run it).
dedeucerl-train --skin mealy --seeds 0-9 --budget 25 --out configs/tmp.toml
# Run training (requires verifiers[rl] + vf-rl)
dedeucerl-train --skin mealy --seeds 0-9 --budget 25 --run
dedeucerl-selfcheck
Validate installation.
dedeucerl-selfcheck --verbose
Creating New Skins
For detailed implementation guide, see docs/SKINS.md.
Quick reference
# dedeucerl/skins/myskin.py
from dedeucerl.core.env import HiddenSystemEnv
from dedeucerl.core.config import SkinConfig
class MySkinEnv(HiddenSystemEnv):
config = SkinConfig(skin_name="myskin", default_budget=30)
def _configure_from_metadata(self, meta): ... # Parse ground truth
def _get_start_state(self): ... # Initial state
def _get_tools(self): ... # [probe, submit]
@staticmethod
def generate_system_static(seed, **params): ... # Deterministic generation
@classmethod
def domain_spec(cls, **params): ... # Tool/observation schemas
Register in dedeucerl/skins/__init__.py and run dedeucerl-selfcheck --verbose.
Metrics
| Metric | Description |
|---|---|
| success | 1 if correct submission without trap hit, else 0 |
| queries_used | Total probe + submit calls consumed |
| trap_hit | 1 if dangerous state triggered |
| budget_remaining | Queries left at episode end |
| reward | 1.0 - 0.01 * queries_used if successful, else 0 |
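The reward row translates directly into code; this helper just restates the formula from the table:

```python
def episode_reward(success: bool, queries_used: int) -> float:
    """reward = 1.0 - 0.01 * queries_used on success, else 0."""
    return 1.0 - 0.01 * queries_used if success else 0.0
```

So a successful episode that used 18 queries scores 0.82, and any failure scores 0 regardless of queries spent.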
Project structure
DedeuceRL/
├── dedeucerl/
│ ├── core/ # HiddenSystemEnv, TaskGenerator, rubric
│ ├── skins/ # MealyEnv, ProtocolEnv, APIEnv, ExprPolicyEnv
│ ├── adapters/ # OpenAI, Anthropic, Gemini
│ ├── cli/ # dedeucerl-eval, dedeucerl-generate, etc.
│ └── utils/ # RNG utilities
├── seeds/ # Pre-built evaluation splits
└── tests/ # pytest suite
Running Tests
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=dedeucerl
Environment Variables
| Variable | Description |
|---|---|
| OPENAI_API_KEY | API key for OpenAI models |
| OPENAI_BASE_URL | Base URL for OpenAI-compatible APIs (e.g., OpenRouter) |
| ANTHROPIC_API_KEY | API key for Anthropic models |
| GOOGLE_API_KEY | API key for Google Gemini models |
License
MIT License. See LICENSE for details.
Citation
@software{dedeucerl2026,
title = {DedeuceRL: A Modular Framework for Active System Identification Benchmarks},
author = {Vedansh},
year = {2026},
url = {https://github.com/AashVed/DedeuceRL}
}
See CITATION.cff for full metadata.
Acknowledgments
Built on: verifiers · Angluin's L* algorithm · DedeuceBench