Intelligent Simulation Orchestration for Large Language Models
ISOPro: A Reference Implementation of Grounded Continuous Evaluation
ISOPro is a simulation-based fine-tuning and evaluation framework for language models. It is the reference implementation of the Grounded Continuous Evaluation (GCE) framework described in:
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier. Under review, NeurIPS 2026.
GCE argues that current LLM evaluation practice suffers from four structural validity failures — distributional, temporal, scope, and process invalidity — that compound in RLHF and make reward hacking a predictable consequence of evaluation design rather than an unpredictable training pathology. ISOPro demonstrates that these failures can be addressed architecturally, on a consumer laptop, by replacing the learned reward model with a deterministic verifier and updating LoRA adapter weights on CPU.
Headline Results
On resource-constrained project scheduling (RCPSP) with Qwen 2.5 3B Instruct, across six compositional difficulty tiers (T0–T5):
| Method | T0 | T1 | T2 | T3 | T4 | T5 | Mean | Trains? |
|---|---|---|---|---|---|---|---|---|
| Zero-shot | 80% | 0% | 0% | 0% | 0% | 0% | 13.3% | No |
| 3-shot | 20% | 0% | 0% | 20% | 0% | 0% | 6.7% | No |
| Multi-turn (×3) | 100% | 0% | 0% | 0% | 0% | 0% | 16.7% | No |
| IsoZero (simulation) | 100% | 60% | 20% | 20% | 0% | 0% | 33.3% | No |
| ISOPro + LoRA | 100% | 66.7% | 0% | 66.7% | 0% | 0% | 39.8% ± 3.5 | Yes |
Eval: 3–5 problems per tier, 3 seeds. ISOPro: 6 iterations, 504 rollouts, 119 correct traces. Hardware: Apple M1, 32GB unified memory, ~90 min, peak memory <8GB, 0.216% trainable parameters, no GPU required.
ISOPro + LoRA reaches a 3.0× improvement in mean accuracy over zero-shot (39.8% vs. 13.3%) without oracle solutions, without a reward model, and without a KL penalty.
What ISOPro Implements
ISOPro consists of three layers — a simulation environment layer with deterministic verifiers, an LLM agent layer, and a communication wrapper managing state, evaluation, and feedback loops — and four mechanisms that collectively instantiate GCE:
1. Gradient descent on correct reasoning traces
When the model produces a verified-correct answer, ISOPro runs a forward pass with prompt tokens masked (labels set to -100) and computes loss only on the generated tokens. The gradient signal is the reasoning trajectory that produced correctness. This is process-level supervision: the model is trained on the reasoning, not on correctness as a label.
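The masking step can be illustrated with a minimal, framework-free sketch. It is not ISOPro's trainer: the helper names `build_labels` and `masked_nll`, the token ids, and the toy probabilities are all assumptions made for illustration, and the one-position shift a causal language model applies between inputs and targets is omitted for brevity.

```python
import math

IGNORE_INDEX = -100  # label value the loss skips, as described above


def build_labels(prompt_ids, completion_ids):
    """Mask every prompt position; keep generated tokens as targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)


def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    log_probs[i] maps a token id to the log-probability predicted at position i.
    """
    kept = [(lp, y) for lp, y in zip(log_probs, labels) if y != IGNORE_INDEX]
    if not kept:
        return 0.0
    return -sum(lp[y] for lp, y in kept) / len(kept)


prompt_ids = [11, 12, 13]  # hypothetical token ids for the prompt
completion_ids = [7, 8]    # hypothetical token ids for a verified-correct trace
labels = build_labels(prompt_ids, completion_ids)

# Toy per-position predictions; a real model would produce these.
log_probs = [
    {11: math.log(0.9)},
    {12: math.log(0.9)},
    {13: math.log(0.9)},
    {7: math.log(0.5)},
    {8: math.log(0.25)},
]
loss = masked_nll(log_probs, labels)
print(labels, round(loss, 4))
```

Only the last two positions contribute to `loss`; the three prompt positions are ignored regardless of how well they are predicted.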
2. Rejection sampling as continuous self-filter
The model generates at high temperature (T = 0.8); a deterministic verifier accepts only correct responses into the replay buffer. Every iteration evaluates capability against ground truth, making training dynamics observable at a granularity that checkpoint evaluation cannot provide.
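A minimal sketch of the accept/reject loop follows. The model is replaced by a toy sampler and the domain by integer addition, so `toy_generate`, `verifier`, and `rejection_sample` are illustrative names rather than ISOPro's API; only the structure (sample, verify deterministically, keep what passes, record the hit rate) mirrors the description above.

```python
import random


def verifier(problem, answer):
    """Deterministic check; a stand-in for a domain verifier (toy: a + b)."""
    a, b = problem
    return answer == a + b


def toy_generate(problem, rng, temperature=0.8):
    """Stand-in for sampling from a model: noisier at higher temperature."""
    a, b = problem
    noise = rng.choice([0, 0, 1, -1]) if rng.random() < temperature else 0
    return a + b + noise


def rejection_sample(problems, n_samples, rng):
    """Keep only verifier-accepted rollouts; return them with the hit rate."""
    accepted, total = [], 0
    for problem in problems:
        for _ in range(n_samples):
            answer = toy_generate(problem, rng)
            total += 1
            if verifier(problem, answer):
                accepted.append((problem, answer))
    hit_rate = len(accepted) / total if total else 0.0
    return accepted, hit_rate


rng = random.Random(42)
buffer, hit_rate = rejection_sample([(1, 2), (3, 4), (5, 6)], n_samples=8, rng=rng)
print(f"accepted {len(buffer)} of 24 rollouts, hit rate {hit_rate:.2f}")
```

Because the hit rate is computed against ground truth on every pass, it doubles as a per-iteration evaluation signal with no separate evaluation run.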
3. Implicit-curriculum replay buffer
Correct rollouts accumulate across iterations. Easy wins dominate early; harder problems enter as capability develops. The curriculum emerges from the model's own trajectory rather than from researcher curation. Ablation shows that removing accumulation drops mean accuracy by 12 pp and inflates seed-to-seed variance by roughly 4×.
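The accumulation behaviour can be sketched as a buffer that never evicts earlier traces and snapshots its tier composition after each iteration. The `ReplayBuffer` class below is a hypothetical illustration, not the class in `isopro/training/`, and the rollouts fed to it are made-up placeholders, not results from the paper.

```python
from collections import Counter


class ReplayBuffer:
    """Accumulates verified-correct traces across iterations."""

    def __init__(self):
        self.traces = []   # (iteration, tier, trace) tuples
        self.history = []  # tier composition snapshot after each iteration

    def add_iteration(self, iteration, correct_rollouts):
        """Append this iteration's accepted rollouts; keep all earlier ones."""
        for tier, trace in correct_rollouts:
            self.traces.append((iteration, tier, trace))
        self.history.append(Counter(tier for _, tier, _ in self.traces))

    def first_hit(self):
        """Iteration at which each tier first contributed a correct trace."""
        first = {}
        for iteration, tier, _ in self.traces:
            first.setdefault(tier, iteration)
        return first


buf = ReplayBuffer()
# Illustrative rollouts only: (tier, trace text) pairs per iteration.
buf.add_iteration(1, [("T0", "trace-a"), ("T0", "trace-b")])
buf.add_iteration(2, [("T0", "trace-c"), ("T1", "trace-d")])
buf.add_iteration(3, [("T1", "trace-e"), ("T3", "trace-f")])

for i, snapshot in enumerate(buf.history, start=1):
    bars = "  ".join(f"{tier} {'#' * n}" for tier, n in sorted(snapshot.items()))
    print(f"iter {i}: {bars}")
```

In this toy run the buffer starts as pure T0 and picks up T1 and T3 later, which is the shape of curriculum the paragraph above describes.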
4. Activation-guided LoRA targeting
Top-K layers are identified by activation probing and receive the LoRA updates (6.6M params = 0.216% of 3.1B). LoRA weights update on CPU — the base model stays frozen in quantized form. This is what eliminates the dual-model VRAM constraint imposed by RLHF's KL penalty.
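As a rough sketch of the selection step: given one probing score per layer, pick the K highest and attach adapters only there. The text above does not specify the probing metric, so the scores below are hypothetical, as are the helper names and the layer dimensions. The parameter count uses the standard LoRA factorisation, where a rank-r adapter on a d_in × d_out weight adds r × (d_in + d_out) parameters.

```python
def top_k_layers(scores, k):
    """Return the indices of the k highest-scoring layers, in layer order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def lora_param_count(d_in, d_out, rank):
    """Parameters added by one rank-`rank` adapter: A is r x d_in, B is d_out x r."""
    return rank * (d_in + d_out)


# Hypothetical per-layer activation scores (one value per transformer layer).
scores = [0.1, 0.7, 0.3, 0.9, 0.2, 0.8]
targets = top_k_layers(scores, k=3)

# Illustrative dimensions, not the Qwen 2.5 3B configuration.
per_layer = lora_param_count(d_in=2048, d_out=2048, rank=8)
trainable = per_layer * len(targets)
print(targets, per_layer, trainable)
```

Restricting adapters to the selected layers is what keeps the trainable set small enough to update on CPU while the quantized base model stays frozen.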
Left: rollout hit rate doubles between iterations 2 and 3, coincident with the sharpest loss decrease — an inflection visible only through continuous evaluation. Right: the implicit curriculum that forms in the replay buffer without researcher curation.
Capability emergence: each cell marks the iteration in which a tier first produces correct traces. T2 and T5 remain unreached — the framework honestly reports capability boundaries.
ISOPro vs. RLHF vs. GRPO
| | ISOPro (ours) | RLHF (standard) | DeepSeek-R1 GRPO |
|---|---|---|---|
| Reward signal | Deterministic verifier | Learned reward model | Deterministic verifier |
| Stability mechanism | Rejection sampling | KL penalty (dual model) | Group-relative advantages |
| Models in memory | 1 | 2+ | 1 |
| Trainable parameters | 0.216% (6.6M) | 100% (full) | 100% (full) |
| Min. memory (reference) | ~6 GB (3B) | ~28 GB (7B × 2) | ~280 GB (70B) |
| Hardware | Consumer laptop | Data center GPU | GPU cluster |
| Reward hacking | Impossible (by construction) | Predictable | Impossible (by construction) |
ISOPro and DeepSeek-R1 GRPO converged on the same architectural insight at orders-of-magnitude-different scales: for verifiable-reward domains, the verifier is the reward signal, and the learned reward model is an unnecessary intermediary.
Installation
pip install isopro
For the paper's training pipeline (LoRA adapters, MLX backend, OR-Tools verifier):
pip install "isopro[train]"
pip install mlx mlx-lm ortools # Apple Silicon; OR-Tools is the ground-truth solver
For adversarial / conversation / workflow-simulation features:
pip install opencv-python stable-baselines3 gymnasium tqdm
Optional: if you use the Claude-backed agents,
export ANTHROPIC_API_KEY=your_api_key_here
Quickstart (no GPU, no model download)
Three runnable examples that demonstrate ISOPro without a model in the loop:
python examples/quickstart_gce.py # see the verifier reject reward hacking in 5 seconds
python examples/custom_verifier.py # plug your own domain into the loop in <100 lines
python examples/watch_curriculum_emerge.py # visualize the implicit curriculum from a saved log
The quickstart_gce.py script generates a real OR-Tools-solved scheduling problem, then runs three responses (oracle, constraint-violating, plausible hallucination) through the deterministic verifier so you can see — concretely — what "the verifier is the reward signal" means. custom_verifier.py shows the full pattern for extending ISOPro to any domain you can verify with a Python function. watch_curriculum_emerge.py reads a saved training log and renders the buffer composition over iterations as ASCII bars, reproducing Figure 3 from the paper in your terminal.
Reproducing the Paper
All experiments run on an Apple M1 with 32GB unified memory. Full pipeline completes in ~90 minutes.
Main scheduling experiment (Table 2)
python examples/run_scheduling_experiment.py # all five modes
python examples/run_scheduling_experiment.py --mode prompting # zero-shot + 3-shot baselines
python examples/run_scheduling_experiment.py --mode isopro # ISOPro training loop
python examples/run_scheduling_experiment.py --mode multiturn # multi-turn revision (scope validity)
Alternate MLX-native runner (used in the paper's main results):
python examples/run_isopro_mlx.py
IsoZero simulation baseline (no training):
python examples/run_isozero_scheduling.py
Ablation study (Table 3 / Figure 5)
python examples/run_ablation_study.py
Reproduces: full ISOPro, no chain-of-thought (−8.3 pp), no buffer accumulation (−12.0 pp, 4× variance), and the random-layer control (+0.9 pp, not significant). Seeds: 42, 123, 456.
Scheduling domain
Tasks are generated programmatically at six difficulty tiers:
- T0 — 4-job warmup, dependencies only
- T1 — sequencing
- T2 — resource allocation
- T3 — deadline satisfaction
- T4 — pairwise composition (two constraints)
- T5 — full composition (all three; held out from training)
Ground truth is produced by an OR-Tools CP-SAT solver. The verifier checks precedence, resource capacity, and deadline satisfaction. Source: isopro/environments/tasks/scheduling_tasks.py, scheduling_verifier.py, scheduling_multiturn.py.
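The three checks named above can be sketched in plain Python. This is not the code in `scheduling_verifier.py`: the function name, the job dictionary layout, and the single shared resource are assumptions chosen to keep the example short.

```python
def verify_schedule(jobs, starts, capacity, deadline):
    """Return (ok, reason). jobs maps name -> {"duration", "demand", "preds"}."""
    if set(starts) != set(jobs):
        return False, "schedule must assign a start time to every job"
    if any(s < 0 for s in starts.values()):
        return False, "negative start time"
    end = {j: starts[j] + jobs[j]["duration"] for j in jobs}

    # Precedence: a job may not start before each predecessor has finished.
    for j, spec in jobs.items():
        for p in spec["preds"]:
            if starts[j] < end[p]:
                return False, f"precedence violated: {j} starts before {p} ends"

    # Resource capacity: usage only rises at job starts, so checking there suffices.
    for t in sorted(set(starts.values())):
        usage = sum(jobs[j]["demand"] for j in jobs if starts[j] <= t < end[j])
        if usage > capacity:
            return False, f"capacity exceeded at t={t}"

    # Deadline: the last job must finish in time.
    if max(end.values()) > deadline:
        return False, "deadline missed"
    return True, "ok"


# Hypothetical three-job instance with one shared resource of capacity 3.
jobs = {
    "A": {"duration": 2, "demand": 2, "preds": []},
    "B": {"duration": 3, "demand": 2, "preds": ["A"]},
    "C": {"duration": 2, "demand": 2, "preds": []},
}
print(verify_schedule(jobs, {"A": 0, "B": 2, "C": 5}, capacity=3, deadline=7))
print(verify_schedule(jobs, {"A": 0, "B": 1, "C": 5}, capacity=3, deadline=7))
print(verify_schedule(jobs, {"A": 0, "B": 2, "C": 0}, capacity=3, deadline=7))
```

A response that reads plausibly but overlaps two jobs or slips past the deadline fails the same way every time, which is the property the training loop relies on.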
Additional Simulation Modules
ISOPro ships with simulation environments beyond the scheduling domain used in the paper. These are orthogonal to GCE and were developed for earlier work; they remain supported.
Adversarial Simulation
from isopro.adversarial_simulation import AdversarialSimulator, AdversarialEnvironment
from isopro.agents.ai_agent import AI_Agent

# my_agent is an agent wrapper you construct beforehand (for example from AI_Agent);
# it is not defined in this snippet.
adv_env = AdversarialEnvironment(
    agent_wrapper=my_agent,
    num_adversarial_agents=2,
    attack_types=["textbugger", "deepwordbug"],
    attack_targets=["input", "output"],
)
simulator = AdversarialSimulator(adv_env)
results = simulator.run_simulation(
    ["What is the capital of France?", "How does photosynthesis work?"],
    num_steps=1,
)
Conversation Simulation
from isopro.conversation_simulation.conversation_simulator import ConversationSimulator

simulator = ConversationSimulator(
    ai_prompt="You are a customer service agent. Respond politely and professionally.",
)
history = simulator.run_simulation("upset", num_turns=3)
Workflow Simulation
from isopro.workflow_simulation import WorkflowAutomation

WorkflowAutomation(
    video="path/to/workflow.mp4",
    config="config.json",
    output="output_dir",
    logs="logs_dir",
).run()
AI Orchestration
from isopro.orchestration_simulation import OrchestrationEnv
from isopro.orchestration_simulation.components import LLaMAAgent, AnalysisAgent, WritingAgent
from isopro.orchestration_simulation.evaluator import Evaluator

env = OrchestrationEnv()
env.add_component(LLaMAAgent("Research", "analyze AI impact on labor markets"))
env.add_component(AnalysisAgent("Analysis"))
env.add_component(WritingAgent("Writing"))

task = "analyze AI impact on labor markets"  # example task string; replace with your own
# Run the same task under each execution mode, then let the evaluator pick the best.
results = {
    mode: env.run_simulation(mode=mode, input_data={"task": task})
    for mode in ("parallel", "sequence", "node")
}
best_mode = Evaluator().evaluate(results)
Core Simulation API
pip install "isopro[api]"
python -m isopro.api_server
Standard response format:
{
  "run_id": "unique-identifier",
  "output": "simulation-specific-output",
  "metadata": { "timestamp": "...", "simulation_type": "..." }
}
Endpoints: GET /healthcheck, POST /simulate, POST /simulate/reason, POST /simulate/qa, POST /simulate/adversarial, POST /simulate/orchestration. See render.yaml for a Render deployment template.
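A client may want to check a response against the standard format before using it. The sketch below treats every field shown above as required, which is an assumption rather than documented server behaviour; `parse_response` is a hypothetical helper and the sample body is invented for illustration.

```python
import json

REQUIRED_KEYS = {"run_id", "output", "metadata"}
REQUIRED_METADATA = {"timestamp", "simulation_type"}


def parse_response(body):
    """Parse a JSON response body and check it has the standard fields."""
    data = json.loads(body)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing top-level fields: {sorted(missing)}")
    meta_missing = REQUIRED_METADATA - data["metadata"].keys()
    if meta_missing:
        raise ValueError(f"missing metadata fields: {sorted(meta_missing)}")
    return data


# Invented example body in the documented shape.
sample = json.dumps({
    "run_id": "example-run",
    "output": "example output",
    "metadata": {"timestamp": "2026-01-01T00:00:00Z", "simulation_type": "qa"},
})
parsed = parse_response(sample)
print(parsed["run_id"], parsed["metadata"]["simulation_type"])
```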
Repository Layout
isopro/
├── training/ # rejection-sampling trainer, replay buffer, GRPO trainer, config
├── environments/ # simulation environments
│ └── tasks/ # scheduling tasks, verifier, multi-turn harness
├── curriculum/ # scheduler for tier progression
├── metrics/ # evaluation
├── backends/ # MLX, Ollama, HF backends
├── rl/ # RL wrappers (CartPole, car, LLM envs)
├── adversarial_simulation/
├── conversation_simulation/
├── workflow_simulation/
├── orchestration_simulation/
└── api_server.py # RESTful simulation API
examples/
├── run_scheduling_experiment.py # Table 2 main experiment
├── run_ablation_study.py # Table 3 / Figure 5 ablations
├── run_isopro_mlx.py # MLX-native training loop
└── run_isozero_scheduling.py # IsoZero baseline
Scope and Limitations
- Verifiable-reward domains only. Reward hacking elimination is an architectural guarantee of using a deterministic verifier; it does not extend to domains where no verifier exists (safety, style). GCE extends to such domains through rubric-based trajectory assessment, but this is left to future work.
- Single-domain validation. Results in the paper are on RCPSP scheduling. Broader validation across tasks, model families, and scales is needed.
- Small per-tier sample sizes (3–5 problems). Per-tier accuracies are directional; the main findings are confirmed through multi-seed averaging (n = 3) and ablation.
- T2 and T5 at 0% across all configurations suggest resource reasoning requires more iterations, scaffolding, or stronger base models — this is reported honestly rather than papered over.
Citation
If you use ISOPro or the GCE framework, please cite:
@inproceedings{henry2026gce,
title = {Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier},
author = {Henry, Jazmia},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2026},
note = {Under review}
}
@software{isopro,
author = {Henry, Jazmia},
title = {{ISOPro}: A Reference Implementation of Grounded Continuous Evaluation},
year = {2026},
url = {https://github.com/iso-ai/isopro}
}
License
Apache License 2.0 — see LICENSE.
Support
Questions, issues, or reproduction problems: please open an issue.