
A modular framework for active system identification benchmarks


DedeuceRL

Benchmark LLMs on Active System Identification — probe hidden systems, form hypotheses, verify correctness.


pip install dedeucerl
dedeucerl-generate --skin mealy --seeds 0-4 --budget 25 --n-states 3 -o tasks.json
dedeucerl-eval --skin mealy --split tasks.json --model heuristic:none --out results.jsonl
dedeucerl-eval-parallel --jobs 4 --out results.jsonl --skin mealy --split tasks.json --model heuristic:none  # merged output

Why DedeuceRL?

Modern LLMs excel at knowledge retrieval and static reasoning, but struggle with active exploration — systematically probing unknown systems and deducing their structure from observations.

DedeuceRL benchmarks this capability by requiring agents to:

| Capability             | What We Test                                              |
|------------------------|-----------------------------------------------------------|
| Systematic Exploration | Strategically select probes to maximize information gain  |
| Hypothesis Formation   | Build mental models of hidden system dynamics             |
| Efficient Verification | Minimize queries while ensuring correctness               |
| Safety Awareness       | Avoid dangerous "trap" states that penalize reward        |

Research Context: Active system identification builds on Angluin's L* algorithm for active automata learning, conformance testing (W-method), and query-based learning theory. See Angluin (1987), Vaandrager (2017).


Installation

pip install dedeucerl                   # Core
pip install "dedeucerl[openai]"         # + OpenAI adapter
pip install "dedeucerl[all]"            # All providers
Development installation
git clone https://github.com/AashVed/DedeuceRL.git
cd DedeuceRL
pip install -e ".[dev]"

Requirements: Python 3.10+ · verifiers>=0.1.9 · datasets>=2.0


Quickstart

1. Generate a task split

dedeucerl-generate --skin mealy --seeds 0-9 --budget 25 --n-states 3 -o tasks.json

2. Evaluate a model

export OPENAI_API_KEY="sk-..."
dedeucerl-eval --skin mealy --split tasks.json --model openai:gpt-4o --out results.jsonl

3. View results

dedeucerl-aggregate results.jsonl --format markdown

Output:

| Model         | Episodes | Success Rate | Trap Rate | Avg Queries | Avg Reward |
|---------------|----------|--------------|-----------|-------------|------------|
| openai:gpt-4o | 10       | 40.0%        | 20.0%     | 18.2        | 0.318      |

Available Skins

DedeuceRL ships with multiple "skins" — domain-specific instantiations of the active identification paradigm:

| Skin       | Domain          | What the Agent Must Identify                            |
|------------|-----------------|---------------------------------------------------------|
| mealy      | Automata Theory | Hidden Mealy machine (state × input → output)           |
| protocol   | API Testing     | REST API state-dependent behavior                       |
| apienv     | SaaS Systems    | API with methods, endpoints, variants, response schemas |
| exprpolicy | DSL Debugging   | Typed policy expression (compile + test + submit)       |
Skin details

Mealy (Reference Skin)

The agent identifies a hidden Mealy machine (finite-state transducer).

  • Tools: act(symbol) → probe, submit_table(json) → submit hypothesis
  • Features: Isomorphism checking, counterexample feedback, trap transitions
  • Guarantees: Generated machines are minimal and fully reachable
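The hypothesis passed to submit_table is a JSON object; a minimal sketch of assembling one in Python, following the payload shape shown in the interactive transcript ({"n": ..., "start": ..., "trans": ...}). The transition encoding below is illustrative, so check the skin's domain_spec for the exact schema:

```python
import json

# Hypothesis for a 3-state Mealy machine with inputs A/B.
# Field names mirror the submit_table payload from the interactive
# game; the "state,input" -> [next_state, output] encoding of
# "trans" is an assumption for illustration.
table = {
    "n": 3,        # number of states
    "start": 0,    # initial state
    "trans": {
        "0,A": [1, 1],
        "0,B": [0, 2],
        "1,A": [2, 0],
        "1,B": [0, 1],
        "2,A": [2, 1],
        "2,B": [1, 0],
    },
}

payload = json.dumps(table)  # string to pass to submit_table
```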

Protocol

Reverse-engineer a stateful REST API.

  • Tools: api_call(method, endpoint) → probe, submit_spec(json) → submit
  • Features: State-dependent HTTP responses, behavioral equivalence

APIEnv

Realistic SaaS API identification with variants and response schemas.

  • Tools: api_call(method, endpoint, variant) → probe, submit_spec(json) → submit
  • Features: Complex multi-dimensional action space

ExprPolicy

Debug a typed policy DSL using compiler feedback and test suites.

  • Tools: type_check(expr), run_tests(expr, suite), submit(expr)
  • Features: Hidden tests, counterexample feedback, token constraints

Interactive Game

Play any skin as a human agent to understand the challenge:

Note: cliGame is a repo-only helper and is not installed via pip install dedeucerl.

python -m cliGame
🎮 DedeuceRL Interactive Game
Available skins: mealy, protocol, apienv, exprpolicy

Select skin [1-4]: 1
Enter seed (int): 42

=== SYSTEM PROMPT ===
You are identifying a hidden Mealy machine...

=== YOUR TURN ===
> act A
{"output": 1, "budget_left": 24, "trap_hit": false}

> act B
{"output": 2, "budget_left": 23, "trap_hit": false}

> submit_table {"n":3,"start":0,"trans":{...}}
{"ok": true}

Commands: :help :tools :prompt :state :quit


Generating Tasks

CLI Generator (recommended)
# Show available parameters for a skin
dedeucerl-generate --skin mealy --show-skin-params --seeds 0 --budget 25

# Generate 100-episode Mealy test split
dedeucerl-generate \
  --skin mealy \
  --seeds 0-99 \
  --subset test \
  --budget 100 \
  --n-states 4 \
  --no-trap \
  -o seeds/mealy_test.json

# Generate Protocol split
dedeucerl-generate \
  --skin protocol \
  --seeds 0-99 \
  --budget 120 \
  --n-endpoints 5 \
  --n-states 4 \
  -o seeds/protocol_test.json
Python API
from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator

gen = TaskGenerator(MealyEnv)
split = gen.generate_split(
    seeds=list(range(100)),
    budget=25,
    subset_name="test",
    n_states=5,
    trap=True,
)
gen.save_split(split, "seeds/mealy_test.json")

# Build HuggingFace Dataset
dataset = gen.build_dataset("seeds/mealy_test.json", "test", feedback=True)

Pre-built splits: 🤗 comfortably-dumb/DedeuceRL


Guide: Running Evaluations

Method 1: CLI (Recommended)

# Basic evaluation
dedeucerl-eval \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o \
  --out results.jsonl

# With all options
dedeucerl-eval \
  --skin apienv \
  --split seeds/apienv_smoke.json \
  --model anthropic:claude-3-opus-20240229 \
  --rollouts 3 \
  --feedback \
  --temperature 0.0 \
  --verbose \
  --out results/apienv_claude.jsonl

Supported Model Specs

| Provider   | Format               | Examples                              |
|------------|----------------------|---------------------------------------|
| OpenAI     | openai:<model>       | openai:gpt-4o, openai:gpt-4-turbo     |
| Anthropic  | anthropic:<model>    | anthropic:claude-3-opus-20240229      |
| Gemini     | gemini:<model>       | gemini:gemini-1.5-pro                 |
| OpenRouter | openrouter:<model>   | openrouter:meta-llama/llama-3-70b     |
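Model specs are plain provider:model strings. A sketch of splitting one (a hypothetical helper for illustration; the packaged get_adapter in dedeucerl.adapters does the real resolution):

```python
def parse_model_spec(spec: str) -> tuple[str, str]:
    """Split a 'provider:model' spec into its two parts.

    Model IDs may contain slashes (e.g. openrouter:meta-llama/llama-3-70b),
    so split on the first ':' only.
    """
    provider, _, model = spec.partition(":")
    if not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model
```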

Method 2: Python API

from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator, make_rubric
from dedeucerl.adapters import get_adapter

# Setup
generator = TaskGenerator(MealyEnv)
dataset = generator.build_dataset("seeds/mealy_smoke.json", "dev", feedback=True)
rubric = make_rubric()
env = MealyEnv(dataset=dataset, rubric=rubric, feedback=True, max_turns=30)

# Get adapter
adapter = get_adapter("openai:gpt-4o", temperature=0.0)

# Run episode manually
item = dataset[0]
state = {"prompt": item["prompt"], "answer": item["answer"]}
# ... custom evaluation loop

Aggregating Results

# CSV (for spreadsheets)
dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv

# Markdown (for README/reports)
dedeucerl-aggregate results.jsonl --format markdown

# JSON (for programmatic use)
dedeucerl-aggregate results.jsonl --format json -o summary.json

# Multiple files
dedeucerl-aggregate results/*.jsonl --format markdown

Output columns: model, n_episodes, success_rate, trap_rate, avg_queries, avg_reward


Hugging Face Dataset

Public task splits (MIT-licensed) are available at 🤗 comfortably-dumb/DedeuceRL.


Training with RL

DedeuceRL environments inherit from verifiers.StatefulToolEnv, making them directly compatible with RL training frameworks.

Quick Start with vf.RLTrainer

# Install verifiers with RL support
uv add 'verifiers[rl]'

# Run training (create your own config based on verifiers docs)
uv run vf-rl @ your-config.toml

Example Configuration

# your-config.toml (example)
model = "Qwen/Qwen3-4B-Instruct"

[env]
path = "./your_env_module"

[env.args]
max_turns = 30

[trainer.args]
run_name = "dedeucerl-mealy"
micro_batch_size = 4
rollouts_per_example = 16
batch_size = 1024
max_steps = 500

Creating the Environment Module

# your_env_module.py
import verifiers as vf
from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator, make_rubric

def load_environment(split_path: str = "your_split.json") -> vf.Environment:
    gen = TaskGenerator(MealyEnv)
    dataset = gen.build_dataset(split_path, "train", feedback=True)
    rubric = make_rubric()
    return MealyEnv(dataset=dataset, rubric=rubric, feedback=True, max_turns=30)

Alternative Training Frameworks

DedeuceRL is also compatible with other RL training frameworks that can consume verifiers environments.

Custom reward functions
from verifiers import Rubric, Parser

def efficiency_reward(completion, answer, state, parser):
    """Reward efficiency: fewer queries = higher reward."""
    if not state.get("ok", False):
        return 0.0
    
    queries = state.get("queries_used", 0)
    budget = state.get("budget_init", 25)
    efficiency = 1.0 - (queries / budget)
    trap_penalty = 0.5 if state.get("trap_hit", False) else 0.0
    
    return efficiency - trap_penalty

custom_rubric = Rubric(
    funcs=[efficiency_reward],
    weights=[1.0],
    parser=Parser(extract_fn=lambda s: s),
)

env = MealyEnv(dataset=dataset, rubric=custom_rubric, feedback=True, max_turns=30)

See verifiers training docs for complete setup instructions.


CLI Reference

dedeucerl-eval

Run evaluations on a skin.

dedeucerl-eval \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o \
  --rollouts 1 \
  --out results.jsonl \
  --feedback \
  --temperature 0.0 \
  --verbose

Supported model specs: openai:gpt-4o · anthropic:claude-3-opus-20240229 · gemini:gemini-1.5-pro · openrouter:<model>

Episode selection + sharding:

# Run only specific episodes
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --episodes 0-4,9

# Run shard 1 of 4 (0-based shard index)
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --shard 1/4

Resume runs (split-aware):

dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --resume --out results.jsonl

Resume is safe across restarts because each result line includes a split_hash derived from the split file + subset.

dedeucerl-eval-parallel

Run shard-parallel evals and merge results.

dedeucerl-eval-parallel \
  --jobs 4 \
  --out results.jsonl \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o

This writes a single merged JSONL to --out. Per-shard part files are deleted by default (use --keep-parts to keep them). You can then run dedeucerl-aggregate results.jsonl as usual.

dedeucerl-aggregate

Aggregate results into a leaderboard.

dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv
dedeucerl-aggregate results.jsonl --format markdown
dedeucerl-aggregate results.jsonl --format json -o results_summary.json

dedeucerl-selfcheck

Validate installation.

dedeucerl-selfcheck --verbose

Creating New Skins

For detailed implementation guide, see docs/SKINS.md.

Quick reference
# dedeucerl/skins/myskin.py
from dedeucerl.core.env import HiddenSystemEnv
from dedeucerl.core.config import SkinConfig

class MySkinEnv(HiddenSystemEnv):
    config = SkinConfig(skin_name="myskin", default_budget=30)
    
    def _configure_from_metadata(self, meta): ...  # Parse ground truth
    def _get_start_state(self): ...                # Initial state  
    def _get_tools(self): ...                      # [probe, submit]
    
    @staticmethod
    def generate_system_static(seed, **params): ...  # Deterministic generation
    
    @classmethod
    def domain_spec(cls, **params): ...  # Tool/observation schemas

Register in dedeucerl/skins/__init__.py and run dedeucerl-selfcheck --verbose.


Metrics

| Metric           | Description                                      |
|------------------|--------------------------------------------------|
| success          | 1 if correct submission without trap hit, else 0 |
| queries_used     | Total probe + submit calls consumed              |
| trap_hit         | 1 if dangerous state triggered                   |
| budget_remaining | Queries left at episode end                      |
| reward           | 1.0 - 0.01 * queries_used if successful, else 0  |
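The default reward rule above can be written out as a small pure function (a sketch mirroring the stated formula, not the library's internal implementation):

```python
def default_reward(success: bool, queries_used: int) -> float:
    """Reward per the metrics table: 1.0 - 0.01 * queries_used on a
    correct, trap-free episode; 0.0 otherwise."""
    if not success:
        return 0.0
    return 1.0 - 0.01 * queries_used
```

For example, a correct submission after 18 queries scores about 0.82, so solving with fewer probes is directly rewarded.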
Project structure
DedeuceRL/
├── dedeucerl/
│   ├── core/       # HiddenSystemEnv, TaskGenerator, rubric
│   ├── skins/      # MealyEnv, ProtocolEnv, APIEnv, ExprPolicyEnv
│   ├── adapters/   # OpenAI, Anthropic, Gemini
│   ├── cli/        # dedeucerl-eval, dedeucerl-generate, etc.
│   └── utils/      # RNG utilities
├── seeds/          # Pre-built evaluation splits
└── tests/          # pytest suite

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=dedeucerl

Environment Variables

| Variable          | Description                                            |
|-------------------|--------------------------------------------------------|
| OPENAI_API_KEY    | API key for OpenAI models                              |
| OPENAI_BASE_URL   | Base URL for OpenAI-compatible APIs (e.g., OpenRouter) |
| ANTHROPIC_API_KEY | API key for Anthropic models                           |
| GOOGLE_API_KEY    | API key for Google Gemini models                       |

License

MIT License. See LICENSE for details.


Citation

@software{dedeucerl2026,
  title = {DedeuceRL: A Modular Framework for Active System Identification Benchmarks},
  author = {Vedansh},
  year = {2026},
  url = {https://github.com/AashVed/DedeuceRL}
}

See CITATION.cff for full metadata.


Acknowledgments

Built on: verifiers · Angluin's L* algorithm · DedeuceBench


Download files

Source distribution: dedeucerl-1.0.4.tar.gz (106.1 kB)

Built distribution: dedeucerl-1.0.4-py3-none-any.whl (108.1 kB)

File details: dedeucerl-1.0.4.tar.gz

  • Size: 106.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 6f8635e3eca52ad4c86f1e4106097bf4d94a90f1c872cf0ef032cfac3d0b5965 |
| MD5         | 272db3daff5f7cd1fcb5aa8c2ca89e34                                 |
| BLAKE2b-256 | 3c96d8b805e8d6f5748dfa3b7f31c8ba7f534311eab3fd7bda192a3dc1bf3f56 |

Provenance: attestation bundle published via pypi-publish.yml on AashVed/DedeuceRL.

File details: dedeucerl-1.0.4-py3-none-any.whl

  • Size: 108.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | e1c4d5efe7fa85672d8b6b466cac5a57ac7a84433c62afb77d1bcbe323509ad1 |
| MD5         | 6cc95d7ddd5981022496a84488e5a77e                                 |
| BLAKE2b-256 | 667c52eac36f05441e0bd8e8c0626e5e0763e256aa9eabf5579e1edc2c40d7ca |

Provenance: attestation bundle published via pypi-publish.yml on AashVed/DedeuceRL.
