Intelligent Simulation Orchestration for Large Language Models

ISOPro: A Reference Implementation of Grounded Continuous Evaluation

ISOPro is a simulation-based fine-tuning and evaluation framework for language models. It is the reference implementation of the Grounded Continuous Evaluation (GCE) framework described in:

Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier. Under review, NeurIPS 2026.

GCE argues that current LLM evaluation practice suffers from four structural validity failures — distributional, temporal, scope, and process invalidity — that compound in RLHF and make reward hacking a predictable consequence of evaluation design rather than an unpredictable training pathology. ISOPro demonstrates that these failures can be addressed architecturally, on a consumer laptop, by replacing the learned reward model with a deterministic verifier and updating LoRA adapter weights on CPU.


Headline Results

On resource-constrained project scheduling (RCPSP) with Qwen 2.5 3B Instruct, across six compositional difficulty tiers (T0–T5):

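| Method | T0 | T1 | T2 | T3 | T4 | T5 | Mean | Trains? |
|---|---|---|---|---|---|---|---|---|
| Zero-shot | 80% | 0% | 0% | 0% | 0% | 0% | 13.3% | No |
| 3-shot | 20% | 0% | 0% | 20% | 0% | 0% | 6.7% | No |
| Multi-turn (×3) | 100% | 0% | 0% | 0% | 0% | 0% | 16.7% | No |
| IsoZero (simulation) | 100% | 60% | 20% | 20% | 0% | 0% | 33.3% | No |
| ISOPro + LoRA | 100% | 66.7% | 0% | 66.7% | 0% | 0% | 39.8% ± 3.5 | Yes |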

Eval: 3–5 problems per tier, 3 seeds. ISOPro: 6 iterations, 504 rollouts, 119 correct traces. Hardware: Apple M1, 32GB unified memory, ~90 min, peak memory <8GB, 0.216% trainable parameters, no GPU required.

ISOPro + LoRA reaches a mean accuracy of 39.8%, a 3.0× improvement over zero-shot (13.3%), achieved without oracle solutions, without a reward model, and without a KL penalty.

[Figure: Per-tier accuracy across evaluation conditions]


What ISOPro Implements

ISOPro consists of three layers — a simulation environment layer with deterministic verifiers, an LLM agent layer, and a communication wrapper managing state, evaluation, and feedback loops — and four mechanisms that collectively instantiate GCE:

1. Gradient descent on correct reasoning traces

When the model produces a verified-correct answer, ISOPro runs a forward pass with prompt tokens masked (labels set to -100) and computes loss only on the generated tokens. The gradient signal is the reasoning trajectory that produced correctness. This is process-level supervision: the model is trained on the reasoning, not on correctness as a label.
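The prompt-masking step can be illustrated without any ML framework. The following is a minimal sketch, not ISOPro's trainer: `build_labels` and `masked_cross_entropy` are hypothetical helper names, and the logits are assumed to be already aligned with the labels (a real causal-LM loss also shifts positions by one token).

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss, as described above


def build_labels(prompt_ids, completion_ids):
    """Mask every prompt position so only generated tokens contribute to the loss."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    logits[i] is the score vector used to predict labels[i] (assumed pre-aligned).
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt token: no gradient signal from this position
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0
```

Because masked positions are skipped entirely, changing the logits at prompt positions leaves the loss unchanged; only the generated reasoning trace is scored.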

2. Rejection sampling as continuous self-filter

The model generates at high temperature (T = 0.8); a deterministic verifier accepts only correct responses into the replay buffer. Every iteration evaluates capability against ground truth, making training dynamics observable at a granularity that checkpoint evaluation cannot provide.
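A rough sketch of this accept/reject loop is shown below. It assumes two caller-supplied callables, `generate(problem, temperature, rng)` and `verify(problem, response)`; these names and signatures are illustrative and are not taken from the ISOPro API. The toy generator and checker exist only so the sketch runs without a model.

```python
import random


def rejection_sample(problems, generate, verify, n_samples=4, temperature=0.8, seed=0):
    """Sample n_samples responses per problem and keep only verifier-approved ones.

    Returns the accepted (problem, response) pairs and the rollout hit rate.
    """
    rng = random.Random(seed)
    accepted, attempts = [], 0
    for problem in problems:
        for _ in range(n_samples):
            response = generate(problem, temperature, rng)
            attempts += 1
            if verify(problem, response):  # deterministic pass/fail, no learned reward
                accepted.append((problem, response))
    hit_rate = len(accepted) / attempts if attempts else 0.0
    return accepted, hit_rate


# Toy stand-ins (hypothetical): a "model" that adds two numbers and is sometimes off by one.
def noisy_adder(problem, temperature, rng):
    a, b = problem
    # temperature is accepted for interface parity but ignored by this toy generator
    return a + b + (1 if rng.random() < 0.5 else 0)


def check_sum(problem, response):
    return response == sum(problem)


if __name__ == "__main__":
    kept, rate = rejection_sample([(1, 2), (3, 4)], noisy_adder, check_sum)
    print(f"accepted {len(kept)} rollouts, hit rate {rate:.2f}")
```

The hit rate returned here corresponds to the per-iteration signal the text describes: it is measured against ground truth on every pass rather than only at checkpoints.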

3. Implicit-curriculum replay buffer

Correct rollouts accumulate across iterations. Easy wins dominate early; harder problems enter as capability develops. The curriculum emerges from the model's own trajectory rather than from researcher curation. Ablation shows that removing accumulation drops mean accuracy by 12 pp and inflates seed-to-seed variance by roughly 4×.
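To make the accumulation behaviour concrete, here is a small sketch of a buffer that either keeps or discards earlier iterations. `ReplayBuffer`, its `accumulate` flag, and its method names are assumptions for illustration and may differ from the class in isopro/training.

```python
from collections import Counter


class ReplayBuffer:
    """Stores verified-correct traces tagged with iteration and difficulty tier."""

    def __init__(self, accumulate=True):
        self.accumulate = accumulate  # False mimics the "no accumulation" ablation
        self.traces = []              # list of (iteration, tier, trace)

    def start_iteration(self):
        """Call at the top of each iteration; clears the buffer when not accumulating."""
        if not self.accumulate:
            self.traces.clear()

    def add(self, iteration, tier, trace):
        self.traces.append((iteration, tier, trace))

    def composition(self):
        """Number of stored traces per tier: the emergent curriculum at this point."""
        return Counter(tier for _, tier, _ in self.traces)
```

With `accumulate=True`, early easy-tier traces stay in the training mix while harder tiers are added later; with `accumulate=False`, each iteration trains only on whatever it just produced.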

4. Activation-guided LoRA targeting

Top-K layers are identified by activation probing and receive the LoRA updates (6.6M params = 0.216% of 3.1B). LoRA weights update on CPU — the base model stays frozen in quantized form. This is what eliminates the dual-model VRAM constraint imposed by RLHF's KL penalty.
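The selection step can be sketched as a simple ranking. The README does not specify the probing statistic, so this example assumes mean absolute activation per layer as the score; `mean_abs_activation` and `top_k_layers` are hypothetical names.

```python
def mean_abs_activation(activations):
    """Assumed probe statistic: average magnitude of one layer's recorded activations."""
    return sum(abs(x) for x in activations) / len(activations)


def top_k_layers(layer_scores, k):
    """Return the k highest-scoring layer indices, best first; ties go to the lower index."""
    ranked = sorted(layer_scores.items(), key=lambda item: (-item[1], item[0]))
    return [layer for layer, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up activations for three layers, purely to exercise the functions.
    recorded = {0: [0.1, -0.1], 1: [0.9, -1.1], 2: [0.4, 0.6]}
    scores = {layer: mean_abs_activation(acts) for layer, acts in recorded.items()}
    print(top_k_layers(scores, k=2))
```

Only the layers returned by the ranking would receive adapters; every other layer stays frozen along with the quantized base weights.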

[Figures: Training dynamics (rollout hit rate and loss); Implicit curriculum (replay buffer composition)]

Left: rollout hit rate doubles between iterations 2 and 3, coincident with the sharpest loss decrease — an inflection visible only through continuous evaluation. Right: the implicit curriculum that forms in the replay buffer without researcher curation.

[Figure: Capability emergence heatmap]

Capability emergence: each cell marks the iteration in which a tier first produces correct traces. T2 and T5 remain unreached — the framework honestly reports capability boundaries.
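The quantity plotted in such a heatmap can be derived from a rollout log along the following lines. This is a sketch that assumes each log record is an `(iteration, tier, correct)` tuple; the actual log format used by ISOPro is not shown in this README.

```python
def first_emergence(rollouts, tiers):
    """Map each tier to the first iteration with a verified-correct trace, or None."""
    first = {tier: None for tier in tiers}
    for iteration, tier, correct in rollouts:
        if not correct or tier not in first:
            continue
        if first[tier] is None or iteration < first[tier]:
            first[tier] = iteration
    return first
```

Tiers that map to `None` are the unreached ones, which is how a capability boundary shows up in the data rather than being hidden by an aggregate score.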


ISOPro vs. RLHF vs. GRPO

| | ISOPro (ours) | RLHF (standard) | DeepSeek-R1 GRPO |
|---|---|---|---|
| Reward signal | Deterministic verifier | Learned reward model | Deterministic verifier |
| Stability mechanism | Rejection sampling | KL penalty (dual model) | Group-relative advantages |
| Models in memory | 1 | 2+ | 1 |
| Trainable parameters | 0.216% (6.6M) | 100% (full) | 100% (full) |
| Min. memory (reference) | ~6 GB (3B) | ~28 GB (7B × 2) | ~280 GB (70B) |
| Hardware | Consumer laptop | Data center GPU | GPU cluster |
| Reward hacking | Impossible (by construction) | Predictable | Impossible (by construction) |

ISOPro and DeepSeek-R1 GRPO converged on the same architectural insight at orders-of-magnitude-different scales: for verifiable-reward domains, the verifier is the reward signal, and the learned reward model is an unnecessary intermediary.


Installation

pip install isopro

For the paper's training pipeline (LoRA adapters, MLX backend, OR-Tools verifier):

pip install "isopro[train]"
pip install mlx mlx-lm ortools   # Apple Silicon; OR-Tools is the ground-truth solver

For adversarial / conversation / workflow-simulation features:

pip install opencv-python stable-baselines3 gymnasium tqdm

Optional: if you use the Claude-backed agents,

export ANTHROPIC_API_KEY=your_api_key_here

Quickstart (no GPU, no model download)

Three runnable examples that demonstrate ISOPro without a model in the loop:

python examples/quickstart_gce.py           # see the verifier reject reward hacking in 5 seconds
python examples/custom_verifier.py          # plug your own domain into the loop in <100 lines
python examples/watch_curriculum_emerge.py  # visualize the implicit curriculum from a saved log

The quickstart_gce.py script generates a real OR-Tools-solved scheduling problem, then runs three responses (oracle, constraint-violating, plausible hallucination) through the deterministic verifier so you can see — concretely — what "the verifier is the reward signal" means. custom_verifier.py shows the full pattern for extending ISOPro to any domain you can verify with a Python function. watch_curriculum_emerge.py reads a saved training log and renders the buffer composition over iterations as ASCII bars, reproducing Figure 3 from the paper in your terminal.
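As a rough idea of what "a domain you can verify with a Python function" looks like, the sketch below checks a toy sorting task. It is not the contents of custom_verifier.py; the function name `verify_sorted` and the `(problem, response) -> bool` shape are assumptions chosen for illustration.

```python
import re


def verify_sorted(items, response):
    """Accept a free-text response only if its integers are exactly sorted(items)."""
    parsed = [int(token) for token in re.findall(r"-?\d+", response)]
    # Exact match rules out wrong order, dropped elements, and invented elements.
    return parsed == sorted(items)


if __name__ == "__main__":
    problem = [3, 1, 2]
    for answer in ("1, 2, 3", "1, 3, 2", "1, 2"):
        print(repr(answer), "->", verify_sorted(problem, answer))
```

The verifier returns a plain boolean computed from the problem itself, so a fluent but wrong answer scores the same as an obviously wrong one.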


Reproducing the Paper

All experiments run on an Apple M1 with 32GB unified memory. Full pipeline completes in ~90 minutes.

Main scheduling experiment (Table 2)

python examples/run_scheduling_experiment.py                  # all five modes
python examples/run_scheduling_experiment.py --mode prompting # zero-shot + 3-shot baselines
python examples/run_scheduling_experiment.py --mode isopro    # ISOPro training loop
python examples/run_scheduling_experiment.py --mode multiturn # multi-turn revision (scope validity)

Alternate MLX-native runner (used in the paper's main results):

python examples/run_isopro_mlx.py

IsoZero simulation baseline (no training):

python examples/run_isozero_scheduling.py

Ablation study (Table 3 / Figure 5)

python examples/run_ablation_study.py

Reproduces: full ISOPro, no chain-of-thought (−8.3 pp), no buffer accumulation (−12.0 pp, 4× variance), and the random-layer control (+0.9 pp, ns). Seeds: 42, 123, 456.
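Multi-seed results of this kind are typically summarized as a mean and a spread. The helper below is a sketch under the assumption that the spread is the sample standard deviation; the README does not state which statistic its ± values use, and `summarize_seeds` is a hypothetical name.

```python
from statistics import mean, stdev


def summarize_seeds(accuracy_by_seed):
    """Return (mean, sample standard deviation) of per-seed accuracies.

    accuracy_by_seed maps a seed (for example 42, 123, 456) to that run's accuracy.
    """
    values = list(accuracy_by_seed.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Comparing the spread between the full configuration and an ablation is one way to check a variance claim such as the one reported for buffer accumulation.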

[Figure: Ablation results]

Scheduling domain

Tasks are generated programmatically at six difficulty tiers:

  • T0 — 4-job warmup, dependencies only
  • T1 — sequencing
  • T2 — resource allocation
  • T3 — deadline satisfaction
  • T4 — pairwise composition (two constraints)
  • T5 — full composition (all three; held out from training)

Ground truth is produced by an OR-Tools CP-SAT solver. The verifier checks precedence, resource capacity, and deadline satisfaction. Source: isopro/environments/tasks/scheduling_tasks.py, scheduling_verifier.py, scheduling_multiturn.py.
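The three checks can be sketched in plain Python as follows. This is an illustrative verifier, not the code in scheduling_verifier.py: it assumes a single renewable resource, integer start times, and a hypothetical job format with `duration`, `demand`, and `preds` fields.

```python
def verify_schedule(jobs, start, capacity, deadline=None):
    """Return True if `start` is a feasible schedule for `jobs`.

    jobs:  {name: {"duration": int, "demand": int, "preds": [names]}}
    start: {name: start_time}
    """
    if set(start) != set(jobs) or any(t < 0 for t in start.values()):
        return False  # every job must be scheduled exactly once, at a non-negative time
    end = {name: start[name] + jobs[name]["duration"] for name in jobs}

    # Precedence: a job may not start before all of its predecessors finish.
    for name, spec in jobs.items():
        if any(start[name] < end[pred] for pred in spec["preds"]):
            return False

    # Resource capacity: usage only rises at job starts, so checking those instants suffices.
    for t in start.values():
        usage = sum(spec["demand"] for name, spec in jobs.items()
                    if start[name] <= t < end[name])
        if usage > capacity:
            return False

    # Deadline: the makespan must not exceed the deadline, when one is given.
    if deadline is not None and max(end.values(), default=0) > deadline:
        return False
    return True
```

Each check is a hard constraint with a yes/no outcome, which is why a response that reads plausibly but overlaps two resource-hungry jobs is rejected outright.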


Additional Simulation Modules

ISOPro ships with simulation environments beyond the scheduling domain used in the paper. These are orthogonal to GCE and were developed for earlier work; they remain supported.

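Adversarial Simulation

from isopro.adversarial_simulation import AdversarialSimulator, AdversarialEnvironment
from isopro.agents.ai_agent import AI_Agent

# my_agent is assumed to be an agent wrapper you have already constructed,
# for example an AI_Agent instance (constructor arguments are not shown here).
adv_env = AdversarialEnvironment(
    agent_wrapper=my_agent,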
    num_adversarial_agents=2,
    attack_types=["textbugger", "deepwordbug"],
    attack_targets=["input", "output"],
)
simulator = AdversarialSimulator(adv_env)
results = simulator.run_simulation(
    ["What is the capital of France?", "How does photosynthesis work?"],
    num_steps=1,
)

Conversation Simulation

from isopro.conversation_simulation.conversation_simulator import ConversationSimulator

simulator = ConversationSimulator(
    ai_prompt="You are a customer service agent. Respond politely and professionally.",
)
history = simulator.run_simulation("upset", num_turns=3)

Workflow Simulation

from isopro.workflow_simulation import WorkflowAutomation

WorkflowAutomation(
    video="path/to/workflow.mp4",
    config="config.json",
    output="output_dir",
    logs="logs_dir",
).run()

AI Orchestration

from isopro.orchestration_simulation import OrchestrationEnv
from isopro.orchestration_simulation.components import LLaMAAgent, AnalysisAgent, WritingAgent
from isopro.orchestration_simulation.evaluator import Evaluator

env = OrchestrationEnv()
env.add_component(LLaMAAgent("Research", "analyze AI impact on labor markets"))
env.add_component(AnalysisAgent("Analysis"))
env.add_component(WritingAgent("Writing"))

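task = "analyze AI impact on labor markets"  # illustrative task string; replace with your own

# Run the same task under each orchestration mode and collect the results by mode.
results = {
    mode: env.run_simulation(mode=mode, input_data={"task": task})
    for mode in ("parallel", "sequence", "node")
}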
best_mode = Evaluator().evaluate(results)

Core Simulation API

pip install "isopro[api]"
python -m isopro.api_server

Standard response format:

{
  "run_id": "unique-identifier",
  "output": "simulation-specific-output",
  "metadata": { "timestamp": "...", "simulation_type": "..." }
}

Endpoints: GET /healthcheck, POST /simulate, POST /simulate/reason, POST /simulate/qa, POST /simulate/adversarial, POST /simulate/orchestration. See render.yaml for a Render deployment template.
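A client can check a response against the standard format before using it. The sketch below relies only on the three top-level keys shown above; `parse_simulation_response` is a hypothetical helper, and request payloads are not covered because their schema is not documented here.

```python
import json

REQUIRED_KEYS = ("run_id", "output", "metadata")  # keys from the standard response format


def parse_simulation_response(body):
    """Decode a JSON response body and confirm it has the documented top-level shape."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("response must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        raise ValueError(f"response is missing keys: {missing}")
    if not isinstance(payload["metadata"], dict):
        raise ValueError("metadata must be a JSON object")
    return payload
```

Validating the envelope once keeps endpoint-specific code focused on the `output` field, whose contents vary by simulation type.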


Repository Layout

isopro/
├── training/           # rejection-sampling trainer, replay buffer, GRPO trainer, config
├── environments/       # simulation environments
│   └── tasks/          # scheduling tasks, verifier, multi-turn harness
├── curriculum/         # scheduler for tier progression
├── metrics/            # evaluation
├── backends/           # MLX, Ollama, HF backends
├── rl/                 # RL wrappers (CartPole, car, LLM envs)
├── adversarial_simulation/
├── conversation_simulation/
├── workflow_simulation/
├── orchestration_simulation/
└── api_server.py       # RESTful simulation API

examples/
├── run_scheduling_experiment.py   # Table 2 main experiment
├── run_ablation_study.py          # Table 3 / Figure 5 ablations
├── run_isopro_mlx.py              # MLX-native training loop
└── run_isozero_scheduling.py      # IsoZero baseline

Scope and Limitations

  • Verifiable-reward domains only. Reward hacking elimination is an architectural guarantee of using a deterministic verifier; it does not extend to domains where no verifier exists (safety, style). GCE extends to such domains through rubric-based trajectory assessment, but this is left to future work.
  • Single-domain validation. Results in the paper are on RCPSP scheduling. Broader validation across tasks, model families, and scales is needed.
  • Small per-tier sample sizes (3–5 problems). Per-tier accuracies are directional; the main findings are confirmed through multi-seed averaging (n = 3) and ablation.
  • T2 and T5 at 0% across all configurations suggest resource reasoning requires more iterations, scaffolding, or stronger base models — this is reported honestly rather than papered over.

Citation

If you use ISOPro or the GCE framework, please cite:

@inproceedings{henry2026gce,
  title     = {Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier},
  author    = {Henry, Jazmia},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026},
  note      = {Under review}
}

@software{isopro,
  author    = {Henry, Jazmia},
  title     = {{ISOPro}: A Reference Implementation of Grounded Continuous Evaluation},
  year      = {2026},
  url       = {https://github.com/iso-ai/isopro}
}

License

Apache License 2.0 — see LICENSE.

Support

Questions, issues, or reproduction problems: please open an issue.
