A modular framework for active system identification benchmarks

DedeuceRL

Benchmark LLMs on Active System Identification — probe hidden systems, form hypotheses, verify correctness.

Python 3.10+ · CI · PyPI · License: MIT · Dataset DOI

pip install dedeucerl
dedeucerl-generate --skin mealy --seeds 0-4 --budget 25 --n-states 3 -o tasks.json
dedeucerl-eval --skin mealy --split tasks.json --model heuristic:none --out results.jsonl
dedeucerl-eval-parallel --jobs 4 --out results.jsonl --skin mealy --split tasks.json --model heuristic:none  # merged output

Why DedeuceRL?

Modern LLMs excel at knowledge retrieval and static reasoning, but struggle with active exploration — systematically probing unknown systems and deducing their structure from observations.

DedeuceRL benchmarks this capability by requiring agents to:

| Capability              | What We Test                                              |
|-------------------------|-----------------------------------------------------------|
| Systematic Exploration  | Strategically select probes to maximize information gain  |
| Hypothesis Formation    | Build mental models of hidden system dynamics             |
| Efficient Verification  | Minimize queries while ensuring correctness               |
| Safety Awareness        | Avoid dangerous "trap" states that incur reward penalties |

Research Context: Active system identification builds on Angluin's L* algorithm for active automata learning, conformance testing (W-method), and query-based learning theory. See Angluin (1987), Vaandrager (2017).



Installation

pip install dedeucerl                   # Core
pip install "dedeucerl[openai]"         # + OpenAI adapter
pip install "dedeucerl[gemini]"         # + Gemini adapter (google-genai)
pip install "dedeucerl[rl]"             # + Verifiers RL trainer extras
pip install "dedeucerl[all]"            # All providers
Development installation
git clone https://github.com/AashVed/DedeuceRL.git
cd DedeuceRL
pip install -e ".[dev]"

Requirements: Python 3.10+ · verifiers>=0.1.9 · datasets>=2.0


Quickstart

1. Generate a task split

dedeucerl-generate --skin mealy --seeds 0-9 --budget 25 --n-states 3 -o tasks.json

2. Evaluate a model

export OPENAI_API_KEY="sk-..."
dedeucerl-eval --skin mealy --split tasks.json --model openai:gpt-4o --out results.jsonl

3. View results

dedeucerl-aggregate results.jsonl --format markdown

Output:

| Model         | Episodes | Success Rate | Trap Rate | Avg Queries | Avg Reward |
|---------------|----------|--------------|-----------|-------------|------------|
| openai:gpt-4o | 10       | 40.0%        | 20.0%     | 18.2        | 0.318      |

Available Skins

DedeuceRL ships with multiple "skins" — domain-specific instantiations of the active identification paradigm:

| Skin         | Domain          | What the Agent Must Identify                            |
|--------------|-----------------|---------------------------------------------------------|
| `mealy`      | Automata Theory | Hidden Mealy machine (state × input → output)           |
| `protocol`   | API Testing     | REST API state-dependent behavior                       |
| `apienv`     | SaaS Systems    | API with methods, endpoints, variants, response schemas |
| `exprpolicy` | DSL Debugging   | Typed policy expression (compile + test + submit)       |
Skin details

Mealy (Reference Skin)

The agent identifies a hidden Mealy machine (finite-state transducer).

  • Tools: act(symbol) → probe, submit_table(json) → submit hypothesis
  • Features: Isomorphism checking, counterexample feedback, trap transitions
  • Guarantees: Generated machines are minimal and fully reachable
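To make the identification target concrete, here is a minimal, self-contained Mealy machine simulator. The dict layout (`start`, `trans`) is illustrative only, not DedeuceRL's exact `submit_table` schema:

```python
def run_mealy(machine, inputs):
    """Feed a sequence of input symbols through a Mealy machine.

    machine["trans"] maps (state, symbol) -> (next_state, output),
    so each probe yields one output and advances the state.
    """
    state = machine["start"]
    outputs = []
    for sym in inputs:
        state, out = machine["trans"][(state, sym)]
        outputs.append(out)
    return outputs

# Two-state toggle: "A" flips the state, "B" leaves it alone;
# each output reports the state the machine was in before moving.
toggle = {
    "start": 0,
    "trans": {
        (0, "A"): (1, 0), (0, "B"): (0, 0),
        (1, "A"): (0, 1), (1, "B"): (1, 1),
    },
}

print(run_mealy(toggle, ["A", "B", "A"]))  # [0, 1, 1]
```

An agent's job is to reconstruct exactly such a transition table from a limited number of `act` probes.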

Protocol

Reverse-engineer a stateful REST API.

  • Tools: api_call(method, endpoint) → probe, submit_spec(json) → submit
  • Features: State-dependent HTTP responses, behavioral equivalence

APIEnv

Realistic SaaS API identification with variants and response schemas.

  • Tools: api_call(method, endpoint, variant) → probe, submit_spec(json) → submit
  • Features: Complex multi-dimensional action space

ExprPolicy

Debug a typed policy DSL using compiler feedback and test suites.

  • Tools: type_check(expr), run_tests(expr, suite), submit(expr)
  • Features: Hidden tests, counterexample feedback, token constraints

Interactive Game

Play any skin as a human agent to understand the challenge:

Note: cliGame is a repo-only helper and is not installed via pip install dedeucerl.

python -m cliGame
🎮 DedeuceRL Interactive Game
Available skins: mealy, protocol, apienv, exprpolicy

Select skin [1-4]: 1
Enter seed (int): 42

=== SYSTEM PROMPT ===
You are identifying a hidden Mealy machine...

=== YOUR TURN ===
> act A
{"output": 1, "budget_left": 24, "trap_hit": false}

> act B
{"output": 2, "budget_left": 23, "trap_hit": false}

> submit_table {"n":3,"start":0,"trans":{...}}
{"ok": true}

Commands: :help :tools :prompt :state :quit


Generating Tasks

CLI Generator (recommended)
# Show available parameters for a skin
dedeucerl-generate --skin mealy --show-skin-params --seeds 0 --budget 25

# Generate 100-episode Mealy test split
dedeucerl-generate \
  --skin mealy \
  --seeds 0-99 \
  --subset test \
  --budget 100 \
  --n-states 4 \
  --no-trap \
  -o seeds/mealy_test.json

# Generate Protocol split
dedeucerl-generate \
  --skin protocol \
  --seeds 0-99 \
  --budget 120 \
  --n-endpoints 5 \
  --n-states 4 \
  -o seeds/protocol_test.json
Python API
from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator

gen = TaskGenerator(MealyEnv)
split = gen.generate_split(
    seeds=list(range(100)),
    budget=25,
    subset_name="test",
    n_states=5,
    trap=True,
)
gen.save_split(split, "seeds/mealy_test.json")

# Build HuggingFace Dataset
dataset = gen.build_dataset("seeds/mealy_test.json", "test", feedback=True)

Pre-built splits: 🤗 comfortably-dumb/DedeuceRL


Guide: Running Evaluations

Method 1: CLI (Recommended)

# Basic evaluation
dedeucerl-eval \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o \
  --out results.jsonl

# With all options
dedeucerl-eval \
  --skin apienv \
  --split seeds/apienv_smoke.json \
  --model anthropic:claude-3-opus-20240229 \
  --rollouts 3 \
  --feedback \
  --temperature 0.0 \
  --verbose \
  --out results/apienv_claude.jsonl

Supported Model Specs

| Provider   | Format               | Examples                             |
|------------|----------------------|--------------------------------------|
| OpenAI     | `openai:<model>`     | `openai:gpt-4o`, `openai:gpt-4-turbo`|
| Anthropic  | `anthropic:<model>`  | `anthropic:claude-3-opus-20240229`   |
| Gemini     | `gemini:<model>`     | `gemini:gemini-1.5-pro`              |
| OpenRouter | `openrouter:<model>` | `openrouter:meta-llama/llama-3-70b`  |
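All specs share a `provider:model` shape, splitting on the first colon so the model part may itself contain `/`. A sketch of the parsing, not DedeuceRL's actual adapter code:

```python
def parse_model_spec(spec):
    """Split 'provider:model' on the first colon (illustrative helper)."""
    provider, sep, model = spec.partition(":")
    if not sep or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model

print(parse_model_spec("openrouter:meta-llama/llama-3-70b"))
# ('openrouter', 'meta-llama/llama-3-70b')
```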

Method 2: Python API

from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator, make_rubric
from dedeucerl.adapters import get_adapter

# Setup
generator = TaskGenerator(MealyEnv)
dataset = generator.build_dataset("seeds/mealy_smoke.json", "dev", feedback=True)
rubric = make_rubric()
env = MealyEnv(dataset=dataset, rubric=rubric, feedback=True, max_turns=30)

# Get adapter
adapter = get_adapter("openai:gpt-4o", temperature=0.0)

# Run episode manually
item = dataset[0]
state = {"prompt": item["prompt"], "answer": item["answer"]}
# ... custom evaluation loop

Aggregating Results

# CSV (for spreadsheets)
dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv

# Markdown (for README/reports)
dedeucerl-aggregate results.jsonl --format markdown

# JSON (for programmatic use)
dedeucerl-aggregate results.jsonl --format json -o summary.json

# Multiple files
dedeucerl-aggregate results/*.jsonl --format markdown

Output columns: model, n_episodes, success_rate, trap_rate, avg_queries, avg_reward
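If you prefer to post-process results yourself, the leaderboard columns can be recomputed from the raw JSONL. A sketch assuming each line carries the fields named in the Metrics section:

```python
import json

def summarize(path):
    """Recompute leaderboard columns from a results JSONL file.

    Field names (success, trap_hit, queries_used, reward) follow the
    Metrics section; treat this as a sketch, not the packaged tool.
    """
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    n = len(rows)
    return {
        "n_episodes": n,
        "success_rate": sum(r.get("success", 0) for r in rows) / n,
        "trap_rate": sum(r.get("trap_hit", 0) for r in rows) / n,
        "avg_queries": sum(r.get("queries_used", 0) for r in rows) / n,
        "avg_reward": sum(r.get("reward", 0.0) for r in rows) / n,
    }
```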


Hugging Face Dataset

Public task splits (MIT-licensed) are available at 🤗 comfortably-dumb/DedeuceRL on the Hugging Face Hub.


Training with RL

DedeuceRL environments inherit from verifiers.StatefulToolEnv, making them directly compatible with RL training frameworks.

Quick Start with vf.RLTrainer

# Install verifiers with RL support
uv add 'verifiers[rl]'

# Run training (create your own config based on verifiers docs)
uv run vf-rl @ your-config.toml

Sample configs are included in configs/vf-rl/ (e.g., dedeucerl-mealy.toml).

Example Configuration

# your-config.toml (example)
model = "Qwen/Qwen3-4B-Instruct"

[env]
id = "dedeucerl.vf_env"

[env.args]
skin = "mealy"
seeds = [0, 1, 2, 3, 4]
budget = 25
n_states = 4
subset = "train"
feedback = true
reward_mode = "train_dense"

[trainer.args]
run_name = "dedeucerl-mealy"
micro_batch_size = 4
rollouts_per_example = 16
batch_size = 1024
max_steps = 500

Custom Skins

If you create your own skin, you can pass it by import path:

[env.args]
skin = "my_pkg.my_skin:MySkinEnv"

Alternative Training Frameworks

Because every skin is a standard verifiers.StatefulToolEnv, DedeuceRL also works with other RL training frameworks that consume verifiers environments.

Custom reward functions
from verifiers import Rubric, Parser

def efficiency_reward(completion, answer, state, parser):
    """Reward efficiency: fewer queries = higher reward."""
    if not state.get("ok", False):
        return 0.0
    
    queries = state.get("queries_used", 0)
    budget = state.get("budget_init", 25)
    efficiency = 1.0 - (queries / budget)
    trap_penalty = 0.5 if state.get("trap_hit", False) else 0.0
    
    return efficiency - trap_penalty

custom_rubric = Rubric(
    funcs=[efficiency_reward],
    weights=[1.0],
    parser=Parser(extract_fn=lambda s: s),
)

env = MealyEnv(dataset=dataset, rubric=custom_rubric, feedback=True, max_turns=30)

See verifiers training docs for complete setup instructions.


CLI Reference

dedeucerl-eval

Run evaluations on a skin.

dedeucerl-eval \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o \
  --rollouts 1 \
  --out results.jsonl \
  --feedback \
  --temperature 0.0 \
  --verbose

Supported model specs: openai:gpt-4o · anthropic:claude-3-opus-20240229 · gemini:gemini-1.5-pro · openrouter:<model>

Optional effort (supported models only): --effort high|xhigh|... (validated via a cheap probe; disable with --no-effort-probe)

Episode selection + sharding:

# Run only specific episodes
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --episodes 0-4,9

# Run shard 1 of 4 (0-based shard index)
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --shard 1/4
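Shard `i/n` can be pictured as a partition of the episode list. One common scheme is round-robin by index — assumed here for illustration, not necessarily DedeuceRL's exact assignment:

```python
def shard(episodes, shard_index, n_shards):
    """Round-robin shard assignment (illustrative scheme only)."""
    return [e for i, e in enumerate(episodes) if i % n_shards == shard_index]

print(shard(list(range(10)), 1, 4))  # [1, 5, 9]
```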

Resume runs (split-aware):

dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --resume --out results.jsonl

Resume is safe across restarts because each result line includes a split_hash derived from the split file + subset.
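The split_hash mechanism can be sketched as fingerprinting the split file's bytes together with the subset name. The helper below is hypothetical, not DedeuceRL's actual derivation:

```python
import hashlib

def split_hash(split_path, subset):
    """Stable fingerprint of split file contents + subset (illustrative)."""
    h = hashlib.sha256()
    with open(split_path, "rb") as f:
        h.update(f.read())
    h.update(b"\x00" + subset.encode("utf-8"))
    return h.hexdigest()[:16]
```

On resume, result lines whose fingerprint no longer matches the current split file can be detected and skipped rather than silently mixed in.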

dedeucerl-eval-parallel

Run shard-parallel evals and merge results.

dedeucerl-eval-parallel \
  --jobs 4 \
  --out results.jsonl \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o

This writes a single merged JSONL to --out. Per-shard part files are deleted by default (use --keep-parts to keep them). You can then run dedeucerl-aggregate results.jsonl as usual.

dedeucerl-aggregate

Aggregate results into a leaderboard.

dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv
dedeucerl-aggregate results.jsonl --format markdown
dedeucerl-aggregate results.jsonl --format json -o results_summary.json

dedeucerl-train

Generate a vf-rl config (and optionally run it).

dedeucerl-train --skin mealy --seeds 0-9 --budget 25 --out configs/tmp.toml

# Run training (requires verifiers[rl] + vf-rl)
dedeucerl-train --skin mealy --seeds 0-9 --budget 25 --run

dedeucerl-selfcheck

Validate installation.

dedeucerl-selfcheck --verbose

Creating New Skins

For detailed implementation guide, see docs/SKINS.md.

Quick reference
# dedeucerl/skins/myskin.py
from dedeucerl.core.env import HiddenSystemEnv
from dedeucerl.core.config import SkinConfig

class MySkinEnv(HiddenSystemEnv):
    config = SkinConfig(skin_name="myskin", default_budget=30)
    
    def _configure_from_metadata(self, meta): ...  # Parse ground truth
    def _get_start_state(self): ...                # Initial state  
    def _get_tools(self): ...                      # [probe, submit]
    
    @staticmethod
    def generate_system_static(seed, **params): ...  # Deterministic generation
    
    @classmethod
    def domain_spec(cls, **params): ...  # Tool/observation schemas

Register in dedeucerl/skins/__init__.py and run dedeucerl-selfcheck --verbose.


Metrics

| Metric             | Description                                          |
|--------------------|------------------------------------------------------|
| `success`          | 1 if correct submission without a trap hit, else 0   |
| `queries_used`     | Total probe + submit calls consumed                  |
| `trap_hit`         | 1 if a dangerous state was triggered                 |
| `budget_remaining` | Queries left at episode end                          |
| `reward`           | `1.0 - 0.01 * queries_used` if successful, else 0    |
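The reward rule in the last row is simple enough to compute by hand:

```python
def episode_reward(success, queries_used):
    """Metrics-table reward: budget-discounted on success, else zero."""
    return 1.0 - 0.01 * queries_used if success else 0.0

print(episode_reward(True, 18))   # roughly 0.82
print(episode_reward(False, 18))  # 0.0
```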
Project structure
DedeuceRL/
├── dedeucerl/
│   ├── core/       # HiddenSystemEnv, TaskGenerator, rubric
│   ├── skins/      # MealyEnv, ProtocolEnv, APIEnv, ExprPolicyEnv
│   ├── adapters/   # OpenAI, Anthropic, Gemini
│   ├── cli/        # dedeucerl-eval, dedeucerl-generate, etc.
│   └── utils/      # RNG utilities
├── seeds/          # Pre-built evaluation splits
└── tests/          # pytest suite

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=dedeucerl

Environment Variables

| Variable            | Description                                             |
|---------------------|---------------------------------------------------------|
| `OPENAI_API_KEY`    | API key for OpenAI models                               |
| `OPENAI_BASE_URL`   | Base URL for OpenAI-compatible APIs (e.g., OpenRouter)  |
| `ANTHROPIC_API_KEY` | API key for Anthropic models                            |
| `GOOGLE_API_KEY`    | API key for Google Gemini models                        |

License

MIT License. See LICENSE for details.


Citation

@software{dedeucerl2026,
  title = {DedeuceRL: A Modular Framework for Active System Identification Benchmarks},
  author = {Vedansh},
  year = {2026},
  url = {https://github.com/AashVed/DedeuceRL}
}

See CITATION.cff for full metadata.


Acknowledgments

Built on: verifiers · Angluin's L* algorithm · DedeuceBench
