A modular framework for active system identification benchmarks

DedeuceRL

Benchmark LLMs on Active System Identification — probe hidden systems, form hypotheses, verify correctness.

Python 3.10+ · CI · PyPI · License: MIT · Dataset DOI

pip install dedeucerl
dedeucerl-generate --skin mealy --seeds 0-4 --budget 25 --n-states 3 -o tasks.json
dedeucerl-eval --skin mealy --split tasks.json --model heuristic:none --out results.jsonl
dedeucerl-eval-parallel --jobs 4 --out results.jsonl --skin mealy --split tasks.json --model heuristic:none  # merged output

Why DedeuceRL?

Modern LLMs excel at knowledge retrieval and static reasoning, but struggle with active exploration — systematically probing unknown systems and deducing their structure from observations.

DedeuceRL benchmarks this capability by requiring agents to:

| Capability              | What We Test                                              |
|-------------------------|-----------------------------------------------------------|
| Systematic Exploration  | Strategically select probes to maximize information gain  |
| Hypothesis Formation    | Build mental models of hidden system dynamics             |
| Efficient Verification  | Minimize queries while ensuring correctness               |
| Safety Awareness        | Avoid dangerous "trap" states that incur reward penalties |

Research Context: Active system identification builds on Angluin's L* algorithm for active automata learning, conformance testing (W-method), and query-based learning theory. See Angluin (1987), Vaandrager (2017).



Installation

pip install dedeucerl                   # Core
pip install "dedeucerl[openai]"         # + OpenAI adapter
pip install "dedeucerl[gemini]"         # + Gemini adapter (google-genai)
pip install "dedeucerl[rl]"             # + Verifiers RL trainer extras
pip install "dedeucerl[all]"            # All providers
Development installation
git clone https://github.com/AashVed/DedeuceRL.git
cd DedeuceRL
pip install -e ".[dev]"

Requirements: Python 3.10+ · verifiers>=0.1.9 · datasets>=2.0


Quickstart

1. Generate a task split

dedeucerl-generate --skin mealy --seeds 0-9 --budget 25 --n-states 3 -o tasks.json

2. Evaluate a model

export OPENAI_API_KEY="sk-..."
dedeucerl-eval --skin mealy --split tasks.json --model openai:gpt-4o --out results.jsonl

3. View results

dedeucerl-aggregate results.jsonl --format markdown

Output:

| Model         | Episodes | Success Rate | Trap Rate | Avg Queries | Avg Reward |
|---------------|----------|--------------|-----------|-------------|------------|
| openai:gpt-4o | 10       | 40.0%        | 20.0%     | 18.2        | 0.318      |

Available Skins

DedeuceRL ships with multiple "skins" — domain-specific instantiations of the active identification paradigm:

| Skin         | Domain          | What the Agent Must Identify                            |
|--------------|-----------------|---------------------------------------------------------|
| `mealy`      | Automata Theory | Hidden Mealy machine (state × input → output)           |
| `protocol`   | API Testing     | REST API state-dependent behavior                       |
| `apienv`     | SaaS Systems    | API with methods, endpoints, variants, response schemas |
| `exprpolicy` | DSL Debugging   | Typed policy expression (compile + test + submit)       |
Skin details

Mealy (Reference Skin)

The agent identifies a hidden Mealy machine (finite-state transducer).

  • Tools: act(symbol) → probe, submit_table(json) → submit hypothesis
  • Features: Isomorphism checking, counterexample feedback, trap transitions
  • Guarantees: Generated machines are minimal and fully reachable
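To make the identification target concrete, here is a minimal, self-contained Mealy machine simulator. The dict layout (`start`, `trans`) is illustrative only, not DedeuceRL's exact `submit_table` schema:

```python
def run_mealy(machine, inputs):
    """Feed a sequence of input symbols through a Mealy machine.

    machine["trans"] maps (state, symbol) -> (next_state, output),
    so each probe yields one output and advances the state.
    """
    state = machine["start"]
    outputs = []
    for sym in inputs:
        state, out = machine["trans"][(state, sym)]
        outputs.append(out)
    return outputs

# Two-state toggle: "A" flips the state, "B" leaves it alone;
# each output reports the state the machine was in before moving.
toggle = {
    "start": 0,
    "trans": {
        (0, "A"): (1, 0), (0, "B"): (0, 0),
        (1, "A"): (0, 1), (1, "B"): (1, 1),
    },
}

print(run_mealy(toggle, ["A", "B", "A"]))  # [0, 1, 1]
```

An agent's job is to reconstruct exactly such a transition table from a limited number of `act` probes.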

Protocol

Reverse-engineer a stateful REST API.

  • Tools: api_call(method, endpoint) → probe, submit_spec(json) → submit
  • Features: State-dependent HTTP responses, behavioral equivalence

APIEnv

Realistic SaaS API identification with variants and response schemas.

  • Tools: api_call(method, endpoint, variant) → probe, submit_spec(json) → submit
  • Features: Complex multi-dimensional action space

ExprPolicy

Debug a typed policy DSL using compiler feedback and test suites.

  • Tools: type_check(expr), run_tests(expr, suite), submit(expr)
  • Features: Hidden tests, counterexample feedback, token constraints

Interactive Game

Play any skin as a human agent to understand the challenge:

Note: cliGame is a repo-only helper and is not installed via pip install dedeucerl.

python -m cliGame
🎮 DedeuceRL Interactive Game
Available skins: mealy, protocol, apienv, exprpolicy

Select skin [1-4]: 1
Enter seed (int): 42

=== SYSTEM PROMPT ===
You are identifying a hidden Mealy machine...

=== YOUR TURN ===
> act A
{"output": 1, "budget_left": 24, "trap_hit": false}

> act B
{"output": 2, "budget_left": 23, "trap_hit": false}

> submit_table {"n":3,"start":0,"trans":{...}}
{"ok": true}

Commands: :help :tools :prompt :state :quit


Generating Tasks

CLI Generator (recommended)
# Show available parameters for a skin
dedeucerl-generate --skin mealy --show-skin-params --seeds 0 --budget 25

# Generate 100-episode Mealy test split
dedeucerl-generate \
  --skin mealy \
  --seeds 0-99 \
  --subset test \
  --budget 100 \
  --n-states 4 \
  --no-trap \
  -o seeds/mealy_test.json

# Generate Protocol split
dedeucerl-generate \
  --skin protocol \
  --seeds 0-99 \
  --budget 120 \
  --n-endpoints 5 \
  --n-states 4 \
  -o seeds/protocol_test.json
Python API
from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator

gen = TaskGenerator(MealyEnv)
split = gen.generate_split(
    seeds=list(range(100)),
    budget=25,
    subset_name="test",
    n_states=5,
    trap=True,
)
gen.save_split(split, "seeds/mealy_test.json")

# Build HuggingFace Dataset
dataset = gen.build_dataset("seeds/mealy_test.json", "test", feedback=True)

Pre-built splits: 🤗 comfortably-dumb/DedeuceRL


Guide: Running Evaluations

Method 1: CLI (Recommended)

# Basic evaluation
dedeucerl-eval \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o \
  --out results.jsonl

# With all options
dedeucerl-eval \
  --skin apienv \
  --split seeds/apienv_smoke.json \
  --model anthropic:claude-3-opus-20240229 \
  --rollouts 3 \
  --feedback \
  --temperature 0.0 \
  --verbose \
  --out results/apienv_claude.jsonl

Supported Model Specs

| Provider   | Format               | Examples                             |
|------------|----------------------|--------------------------------------|
| OpenAI     | `openai:<model>`     | `openai:gpt-4o`, `openai:gpt-4-turbo`|
| Anthropic  | `anthropic:<model>`  | `anthropic:claude-3-opus-20240229`   |
| Gemini     | `gemini:<model>`     | `gemini:gemini-1.5-pro`              |
| OpenRouter | `openrouter:<model>` | `openrouter:meta-llama/llama-3-70b`  |
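All specs share a `provider:model` shape, splitting on the first colon so the model part may itself contain `/`. A sketch of the parsing, not DedeuceRL's actual adapter code:

```python
def parse_model_spec(spec):
    """Split 'provider:model' on the first colon (illustrative helper)."""
    provider, sep, model = spec.partition(":")
    if not sep or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model

print(parse_model_spec("openrouter:meta-llama/llama-3-70b"))
# ('openrouter', 'meta-llama/llama-3-70b')
```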

Method 2: Python API

from dedeucerl.skins import MealyEnv
from dedeucerl.core import TaskGenerator, make_rubric
from dedeucerl.adapters import get_adapter

# Setup
generator = TaskGenerator(MealyEnv)
dataset = generator.build_dataset("seeds/mealy_smoke.json", "dev", feedback=True)
rubric = make_rubric()
env = MealyEnv(dataset=dataset, rubric=rubric, feedback=True, max_turns=30)

# Get adapter
adapter = get_adapter("openai:gpt-4o", temperature=0.0)

# Run episode manually
item = dataset[0]
state = {"prompt": item["prompt"], "answer": item["answer"]}
# ... custom evaluation loop

Aggregating Results

# CSV (for spreadsheets)
dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv

# Markdown (for README/reports)
dedeucerl-aggregate results.jsonl --format markdown

# JSON (for programmatic use)
dedeucerl-aggregate results.jsonl --format json -o summary.json

# Multiple files
dedeucerl-aggregate results/*.jsonl --format markdown

Output columns: model, n_episodes, success_rate, trap_rate, avg_queries, avg_reward
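If you prefer to post-process results yourself, the leaderboard columns can be recomputed from the raw JSONL. A sketch assuming each line carries the fields named in the Metrics section:

```python
import json

def summarize(path):
    """Recompute leaderboard columns from a results JSONL file.

    Field names (success, trap_hit, queries_used, reward) follow the
    Metrics section; treat this as a sketch, not the packaged tool.
    """
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    n = len(rows)
    return {
        "n_episodes": n,
        "success_rate": sum(r.get("success", 0) for r in rows) / n,
        "trap_rate": sum(r.get("trap_hit", 0) for r in rows) / n,
        "avg_queries": sum(r.get("queries_used", 0) for r in rows) / n,
        "avg_reward": sum(r.get("reward", 0.0) for r in rows) / n,
    }
```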


Hugging Face Dataset

Public task splits (MIT-licensed) are available at 🤗 comfortably-dumb/DedeuceRL on the Hugging Face Hub.


Training with RL

DedeuceRL environments inherit from verifiers.StatefulToolEnv, making them directly compatible with RL training frameworks.

Quick Start with vf.RLTrainer

# Install verifiers with RL support
uv add 'verifiers[rl]'

# Run training (create your own config based on verifiers docs)
uv run vf-rl @ your-config.toml

Sample configs are included in configs/vf-rl/ (e.g., dedeucerl-mealy.toml).

Example Configuration

# your-config.toml (example)
model = "Qwen/Qwen3-4B-Instruct"

[env]
id = "dedeucerl.vf_env"

[env.args]
skin = "mealy"
seeds = [0, 1, 2, 3, 4]
budget = 25
n_states = 4
subset = "train"
feedback = true
reward_mode = "train_dense"

[trainer.args]
run_name = "dedeucerl-mealy"
micro_batch_size = 4
rollouts_per_example = 16
batch_size = 1024
max_steps = 500

Custom Skins

If you create your own skin, you can pass it by import path:

[env.args]
skin = "my_pkg.my_skin:MySkinEnv"

Alternative Training Frameworks

Because every skin is a standard verifiers.StatefulToolEnv, DedeuceRL also works with other RL training frameworks that consume verifiers environments.

Custom reward functions
from verifiers import Rubric, Parser

def efficiency_reward(completion, answer, state, parser):
    """Reward efficiency: fewer queries = higher reward."""
    if not state.get("ok", False):
        return 0.0
    
    queries = state.get("queries_used", 0)
    budget = state.get("budget_init", 25)
    efficiency = 1.0 - (queries / budget)
    trap_penalty = 0.5 if state.get("trap_hit", False) else 0.0
    
    return efficiency - trap_penalty

custom_rubric = Rubric(
    funcs=[efficiency_reward],
    weights=[1.0],
    parser=Parser(extract_fn=lambda s: s),
)

env = MealyEnv(dataset=dataset, rubric=custom_rubric, feedback=True, max_turns=30)

See verifiers training docs for complete setup instructions.


CLI Reference

dedeucerl-eval

Run evaluations on a skin.

dedeucerl-eval \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o \
  --rollouts 1 \
  --out results.jsonl \
  --feedback \
  --temperature 0.0 \
  --verbose

Supported model specs: openai:gpt-4o · anthropic:claude-3-opus-20240229 · gemini:gemini-1.5-pro · openrouter:<model>

Optional effort (supported models only): --effort high|xhigh|... (validated via a cheap probe; disable with --no-effort-probe)

Episode selection + sharding:

# Run only specific episodes
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --episodes 0-4,9

# Run shard 1 of 4 (0-based shard index)
dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --shard 1/4
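Shard `i/n` can be pictured as a partition of the episode list. One common scheme is round-robin by index — assumed here for illustration, not necessarily DedeuceRL's exact assignment:

```python
def shard(episodes, shard_index, n_shards):
    """Round-robin shard assignment (illustrative scheme only)."""
    return [e for i, e in enumerate(episodes) if i % n_shards == shard_index]

print(shard(list(range(10)), 1, 4))  # [1, 5, 9]
```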

Resume runs (split-aware):

dedeucerl-eval --skin mealy --split seeds/mealy_smoke.json --resume --out results.jsonl

Resume is safe across restarts because each result line includes a split_hash derived from the split file + subset.
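The split_hash mechanism can be sketched as fingerprinting the split file's bytes together with the subset name. The helper below is hypothetical, not DedeuceRL's actual derivation:

```python
import hashlib

def split_hash(split_path, subset):
    """Stable fingerprint of split file contents + subset (illustrative)."""
    h = hashlib.sha256()
    with open(split_path, "rb") as f:
        h.update(f.read())
    h.update(b"\x00" + subset.encode("utf-8"))
    return h.hexdigest()[:16]
```

On resume, result lines whose fingerprint no longer matches the current split file can be detected and skipped rather than silently mixed in.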

dedeucerl-eval-parallel

Run shard-parallel evals and merge results.

dedeucerl-eval-parallel \
  --jobs 4 \
  --out results.jsonl \
  --skin mealy \
  --split seeds/mealy_smoke.json \
  --model openai:gpt-4o

This writes a single merged JSONL to --out. Per-shard part files are deleted by default (use --keep-parts to keep them). You can then run dedeucerl-aggregate results.jsonl as usual.

dedeucerl-aggregate

Aggregate results into a leaderboard.

dedeucerl-aggregate results.jsonl --format csv > leaderboard.csv
dedeucerl-aggregate results.jsonl --format markdown
dedeucerl-aggregate results.jsonl --format json -o results_summary.json

dedeucerl-train

Generate a vf-rl config (and optionally run it).

dedeucerl-train --skin mealy --seeds 0-9 --budget 25 --out configs/tmp.toml

# Run training (requires verifiers[rl] + vf-rl)
dedeucerl-train --skin mealy --seeds 0-9 --budget 25 --run

dedeucerl-selfcheck

Validate installation.

dedeucerl-selfcheck --verbose

Creating New Skins

For detailed implementation guide, see docs/SKINS.md.

Quick reference
# dedeucerl/skins/myskin.py
from dedeucerl.core.env import HiddenSystemEnv
from dedeucerl.core.config import SkinConfig

class MySkinEnv(HiddenSystemEnv):
    config = SkinConfig(skin_name="myskin", default_budget=30)
    
    def _configure_from_metadata(self, meta): ...  # Parse ground truth
    def _get_start_state(self): ...                # Initial state  
    def _get_tools(self): ...                      # [probe, submit]
    
    @staticmethod
    def generate_system_static(seed, **params): ...  # Deterministic generation
    
    @classmethod
    def domain_spec(cls, **params): ...  # Tool/observation schemas

Register in dedeucerl/skins/__init__.py and run dedeucerl-selfcheck --verbose.


Metrics

| Metric             | Description                                          |
|--------------------|------------------------------------------------------|
| `success`          | 1 if correct submission without a trap hit, else 0   |
| `queries_used`     | Total probe + submit calls consumed                  |
| `trap_hit`         | 1 if a dangerous state was triggered                 |
| `budget_remaining` | Queries left at episode end                          |
| `reward`           | `1.0 - 0.01 * queries_used` if successful, else 0    |
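The reward rule in the last row is simple enough to compute by hand:

```python
def episode_reward(success, queries_used):
    """Metrics-table reward: budget-discounted on success, else zero."""
    return 1.0 - 0.01 * queries_used if success else 0.0

print(episode_reward(True, 18))   # roughly 0.82
print(episode_reward(False, 18))  # 0.0
```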
Project structure
DedeuceRL/
├── dedeucerl/
│   ├── core/       # HiddenSystemEnv, TaskGenerator, rubric
│   ├── skins/      # MealyEnv, ProtocolEnv, APIEnv, ExprPolicyEnv
│   ├── adapters/   # OpenAI, Anthropic, Gemini
│   ├── cli/        # dedeucerl-eval, dedeucerl-generate, etc.
│   └── utils/      # RNG utilities
├── seeds/          # Pre-built evaluation splits
└── tests/          # pytest suite

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=dedeucerl

Environment Variables

| Variable            | Description                                             |
|---------------------|---------------------------------------------------------|
| `OPENAI_API_KEY`    | API key for OpenAI models                               |
| `OPENAI_BASE_URL`   | Base URL for OpenAI-compatible APIs (e.g., OpenRouter)  |
| `ANTHROPIC_API_KEY` | API key for Anthropic models                            |
| `GOOGLE_API_KEY`    | API key for Google Gemini models                        |

License

MIT License. See LICENSE for details.


Citation

@software{dedeucerl2026,
  title = {DedeuceRL: A Modular Framework for Active System Identification Benchmarks},
  author = {Vedansh},
  year = {2026},
  url = {https://github.com/AashVed/DedeuceRL}
}

See CITATION.cff for full metadata.


Acknowledgments

Built on: verifiers · Angluin's L* algorithm · DedeuceBench
