Intelligent Simulation Orchestration for Large Language Models

ISOPro: A Reference Implementation of Grounded Continuous Evaluation

ISOPro is a simulation-based fine-tuning and evaluation framework for language models. It is the reference implementation of the Grounded Continuous Evaluation (GCE) framework described in:

Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier. Under review, NeurIPS 2026.

GCE argues that current LLM evaluation practice suffers from four structural validity failures — distributional, temporal, scope, and process invalidity — that compound in RLHF and make reward hacking a predictable consequence of evaluation design rather than an unpredictable training pathology. ISOPro demonstrates that these failures can be addressed architecturally, on a consumer laptop, by replacing the learned reward model with a deterministic verifier and updating LoRA adapter weights on CPU.

Headline Results

On the resource-constrained project scheduling problem (RCPSP) with Qwen 2.5 3B Instruct, across six compositional difficulty tiers (T0–T5):

Method	T0	T1	T2	T3	T4	T5	Mean	Trains?
Zero-shot	80%	0%	0%	0%	0%	0%	13.3%	No
3-shot	20%	0%	0%	20%	0%	0%	6.7%	No
Multi-turn (×3)	100%	0%	0%	0%	0%	0%	16.7%	No
IsoZero (simulation)	100%	60%	20%	20%	0%	0%	33.3%	No
ISOPro + LoRA	100%	66.7%	0%	66.7%	0%	0%	39.8% ± 3.5	Yes

Eval: 3–5 problems per tier, 3 seeds. ISOPro: 6 iterations, 504 rollouts, 119 correct traces. Hardware: Apple M1, 32GB unified memory, ~90 min, peak memory <8GB, 0.216% trainable parameters, no GPU required.

A 3.0× improvement over zero-shot is achieved without oracle solutions, without a reward model, and without a KL penalty.
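The headline ratio can be checked against the table. The short sketch below recomputes the zero-shot mean from its per-tier row, divides the reported ISOPro + LoRA mean by it, and derives a rollout hit rate from the counts above. It assumes each of the 119 correct traces corresponds to one accepted rollout out of 504; the variable names are illustrative only.

```python
# Per-tier zero-shot accuracies (T0..T5), copied from the results table.
zero_shot = [80.0, 0.0, 0.0, 0.0, 0.0, 0.0]
zero_shot_mean = sum(zero_shot) / len(zero_shot)

# Reported multi-seed mean for ISOPro + LoRA, taken as given from the table.
isopro_mean = 39.8
improvement = isopro_mean / zero_shot_mean

# Assumption: one correct trace per accepted rollout.
hit_rate = 119 / 504

print(f"zero-shot mean:   {zero_shot_mean:.1f}%")   # 13.3%
print(f"improvement:      {improvement:.1f}x")      # 3.0x
print(f"rollout hit rate: {hit_rate:.1%}")          # 23.6%
```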

Figure: Per-tier accuracy across evaluation conditions.

What ISOPro Implements

ISOPro consists of three layers — a simulation environment layer with deterministic verifiers, an LLM agent layer, and a communication wrapper managing state, evaluation, and feedback loops — and four mechanisms that collectively instantiate GCE:

1. Gradient descent on correct reasoning traces

When the model produces a verified-correct answer, ISOPro runs a forward pass with the prompt tokens masked (labels set to -100) and computes the loss only on the generated tokens. The gradient signal therefore comes from the reasoning trajectory that led to the correct answer. This is process-level supervision: the model is trained on the reasoning itself, not on correctness as a label.
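To make the masking concrete, here is a minimal, dependency-free sketch of a cross-entropy loss that skips positions labeled -100. It is not ISOPro's training code: the helper names (`build_labels`, `masked_cross_entropy`) are hypothetical, and the one-position label shift that a causal language model applies is assumed to have been done already.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def build_labels(token_ids, prompt_len):
    """Copy token ids as labels, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(token_ids[prompt_len:])


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over unmasked positions.

    logits[i] is the list of vocabulary scores used to predict labels[i];
    the causal-LM shift is assumed to be applied already.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt token: contributes no loss and no gradient
        peak = max(scores)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy sequence: 3 prompt tokens followed by 2 generated tokens, vocabulary of 4.
token_ids = [0, 1, 2, 3, 1]
labels = build_labels(token_ids, prompt_len=3)
uniform_logits = [[0.0] * 4 for _ in token_ids]
loss = masked_cross_entropy(uniform_logits, labels)
print(labels, round(loss, 4))  # [-100, -100, -100, 3, 1] 1.3863
```

With uniform logits the loss equals ln 4, averaged over the two generated positions only; the three prompt positions are ignored.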

2. Rejection sampling as continuous self-filter

The model generates at high temperature (T = 0.8); a deterministic verifier accepts only correct responses into the replay buffer. Every iteration evaluates capability against ground truth, making training dynamics observable at a granularity that checkpoint evaluation cannot provide.
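The filter itself is simple enough to sketch. The example below is a toy under stated assumptions, not ISOPro's trainer: `rejection_sample`, `noisy_adder`, and `sum_verifier` are hypothetical names, and a noisy adder stands in for sampling from a model at T = 0.8.

```python
import random


def rejection_sample(generate, verify, problems, samples_per_problem, rng):
    """Keep only verified-correct (problem, response) pairs and report the hit rate."""
    accepted, attempts = [], 0
    for problem in problems:
        for _ in range(samples_per_problem):
            response = generate(problem, rng)
            attempts += 1
            if verify(problem, response):  # deterministic accept/reject decision
                accepted.append((problem, response))
    hit_rate = len(accepted) / attempts if attempts else 0.0
    return accepted, hit_rate


def noisy_adder(problem, rng):
    """Stand-in for a stochastic model: returns the sum, sometimes off by one."""
    a, b = problem
    return a + b + rng.choice([-1, 0, 0, 1])


def sum_verifier(problem, response):
    """Stand-in for a deterministic verifier: exact check against ground truth."""
    a, b = problem
    return response == a + b


rng = random.Random(0)
buffer, hit_rate = rejection_sample(noisy_adder, sum_verifier, [(1, 2), (3, 4)], 8, rng)
print(f"accepted {len(buffer)} of 16 rollouts (hit rate {hit_rate:.0%})")
```

The hit rate falls out of the same loop that fills the buffer, which is why each training iteration doubles as an evaluation.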

3. Implicit-curriculum replay buffer

Correct rollouts accumulate across iterations. Easy wins dominate early; harder problems enter as capability develops. The curriculum emerges from the model's own trajectory rather than from researcher curation. Ablation shows that removing accumulation drops mean accuracy by 12 pp and inflates seed-to-seed variance by roughly 4×.
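A buffer that accumulates rather than resets can be sketched in a few lines. The class below is a hypothetical illustration (ISOPro's own buffer may differ), and the accepted rollouts are invented placeholders, not data from the paper.

```python
from collections import Counter


class ReplayBuffer:
    """Accumulates verified-correct traces across iterations; never cleared."""

    def __init__(self):
        self.traces = []  # list of (tier, trace_text)

    def add(self, tier, trace):
        self.traces.append((tier, trace))

    def composition(self):
        """Fraction of buffered traces per tier."""
        total = len(self.traces)
        if total == 0:
            return {}
        counts = Counter(tier for tier, _ in self.traces)
        return {tier: n / total for tier, n in sorted(counts.items())}


buffer = ReplayBuffer()
# Invented accepted rollouts per iteration, for illustration only.
accepted_by_iteration = [
    [("T0", "trace a"), ("T0", "trace b"), ("T0", "trace c")],
    [("T0", "trace d"), ("T1", "trace e")],
    [("T1", "trace f"), ("T3", "trace g")],
]
history = []
for accepted in accepted_by_iteration:
    for tier, trace in accepted:
        buffer.add(tier, trace)
    history.append(buffer.composition())  # snapshot after each iteration

for i, snapshot in enumerate(history, start=1):
    print(i, {tier: round(frac, 2) for tier, frac in snapshot.items()})
```

Because earlier traces are kept, the share of easy tiers shrinks gradually as harder tiers enter, instead of being replaced outright each iteration.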

4. Activation-guided LoRA targeting

Top-K layers are identified by activation probing and receive the LoRA updates (6.6M params = 0.216% of 3.1B). LoRA weights update on CPU — the base model stays frozen in quantized form. This is what eliminates the dual-model VRAM constraint imposed by RLHF's KL penalty.
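The selection step and the adapter size can be illustrated as follows. This is a sketch under assumptions: the README does not say how probe scores are computed, so `scores` is a made-up list, and `top_k_layers` and `lora_param_count` are hypothetical helpers. The parameter formula is the standard count for a rank-r LoRA adapter on a single linear layer.

```python
def top_k_layers(scores, k):
    """Return the indices of the k highest-scoring layers, in layer order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def lora_param_count(d_in, d_out, rank):
    """A rank-r adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


scores = [0.12, 0.80, 0.33, 0.95, 0.41, 0.07]  # made-up probe scores, one per layer
selected = top_k_layers(scores, k=3)
print(selected)                           # [1, 3, 4]
print(lora_param_count(2048, 2048, 8))    # 32768
```

Only the adapters on the selected layers would be trainable; every other weight stays frozen.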

Figures: Training dynamics (rollout hit rate and loss); Implicit curriculum (replay buffer composition).

Left: rollout hit rate doubles between iterations 2 and 3, coincident with the sharpest loss decrease — an inflection visible only through continuous evaluation. Right: the implicit curriculum that forms in the replay buffer without researcher curation.

Figure: Capability emergence heatmap.

Capability emergence: each cell marks the iteration in which a tier first produces correct traces. T2 and T5 remain unreached — the framework honestly reports capability boundaries.
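The quantity plotted in such a heatmap is easy to derive from a per-iteration log. The sketch below assumes a hypothetical log shape (`{iteration: {tier: correct_count}}`) and uses invented counts; it does not reproduce the paper's training log.

```python
def first_emergence(log, tiers):
    """Map each tier to the first iteration with a correct trace, or None if unreached."""
    first = {tier: None for tier in tiers}
    for iteration in sorted(log):
        for tier, n_correct in log[iteration].items():
            if tier in first and first[tier] is None and n_correct > 0:
                first[tier] = iteration
    return first


tiers = ["T0", "T1", "T2", "T3", "T4", "T5"]
# Invented counts of correct traces per tier and iteration, for illustration only.
log = {
    1: {"T0": 3},
    2: {"T0": 4, "T1": 1},
    3: {"T1": 2, "T3": 1},
    4: {"T3": 2, "T4": 1},
}
emergence = first_emergence(log, tiers)
print(emergence)
```

Tiers that never produce a correct trace map to `None`, which is how an unreached tier shows up as an empty cell.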

ISOPro vs. RLHF vs. GRPO

	ISOPro (ours)	RLHF (standard)	DeepSeek-R1 GRPO
Reward signal	Deterministic verifier	Learned reward model	Deterministic verifier
Stability mechanism	Rejection sampling	KL penalty (dual model)	Group-relative advantages
Models in memory	1	2+	1
Trainable parameters	0.216% (6.6M)	100% (full)	100% (full)
Min. memory (reference)	~6 GB (3B)	~28 GB (7B × 2)	~280 GB (70B)
Hardware	Consumer laptop	Data center GPU	GPU cluster
Reward hacking	Impossible (by construction)	Predictable	Impossible (by construction)

ISOPro and DeepSeek-R1 GRPO converged on the same architectural insight at orders-of-magnitude-different scales: for verifiable-reward domains, the verifier is the reward signal, and the learned reward model is an unnecessary intermediary.

Installation

pip install isopro

For the paper's training pipeline (LoRA adapters, MLX backend, OR-Tools verifier):

pip install "isopro[train]"
pip install mlx mlx-lm ortools   # Apple Silicon; OR-Tools is the ground-truth solver

For adversarial / conversation / workflow-simulation features:

pip install opencv-python stable-baselines3 gymnasium tqdm

Optional: if you use the Claude-backed agents, set your API key:

export ANTHROPIC_API_KEY=your_api_key_here

Quickstart (no GPU, no model download)

Three runnable examples that demonstrate ISOPro without a model in the loop:

python examples/quickstart_gce.py           # see the verifier reject reward hacking in 5 seconds
python examples/custom_verifier.py          # plug your own domain into the loop in <100 lines
python examples/watch_curriculum_emerge.py  # visualize the implicit curriculum from a saved log

The quickstart_gce.py script generates a real OR-Tools-solved scheduling problem, then runs three responses (oracle, constraint-violating, plausible hallucination) through the deterministic verifier so you can see — concretely — what "the verifier is the reward signal" means. custom_verifier.py shows the full pattern for extending ISOPro to any domain you can verify with a Python function. watch_curriculum_emerge.py reads a saved training log and renders the buffer composition over iterations as ASCII bars, reproducing Figure 3 from the paper in your terminal.
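For a sense of what rendering buffer composition as ASCII bars involves, here is a small stand-alone sketch. It is not the code in watch_curriculum_emerge.py; the function name, bar style, and input shape are assumptions, and the fractions are placeholders.

```python
def ascii_bars(composition, width=20):
    """Render {label: fraction} as one fixed-width text bar per label."""
    lines = []
    for label, frac in composition.items():
        filled = round(frac * width)
        bar = "#" * filled + "." * (width - filled)
        lines.append(f"{label} |{bar}| {frac:.0%}")
    return "\n".join(lines)


# Placeholder composition for one iteration (fractions sum to 1).
chart = ascii_bars({"T0": 0.5, "T1": 0.25, "T3": 0.25})
print(chart)
# T0 |##########..........| 50%
# T1 |#####...............| 25%
# T3 |#####...............| 25%
```

Printing one such chart per iteration gives a terminal-friendly view of how the tier mix shifts over training.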

Reproducing the Paper

All experiments were run on an Apple M1 with 32GB unified memory. The full pipeline completes in ~90 minutes.

Main scheduling experiment (Table 2)

python examples/run_scheduling_experiment.py                  # all five modes
python examples/run_scheduling_experiment.py --mode prompting # zero-shot + 3-shot baselines
python examples/run_scheduling_experiment.py --mode isopro    # ISOPro training loop
python examples/run_scheduling_experiment.py --mode multiturn # multi-turn revision (scope validity)

Alternate MLX-native runner (used in the paper's main results):

python examples/run_isopro_mlx.py

IsoZero simulation baseline (no training):

python examples/run_isozero_scheduling.py

Ablation study (Table 3 / Figure 5)

python examples/run_ablation_study.py

Reproduces: full ISOPro, no chain-of-thought (−8.3 pp), no buffer accumulation (−12.0 pp, 4× variance), and the random-layer control (+0.9 pp, ns). Seeds: 42, 123, 456.

Figure: Ablation results.

Scheduling domain

Tasks are generated programmatically at six difficulty tiers:

T0 — 4-job warmup, dependencies only
T1 — sequencing
T2 — resource allocation
T3 — deadline satisfaction
T4 — pairwise composition (two constraints)
T5 — full composition (all three; held out from training)

Ground truth is produced by an OR-Tools CP-SAT solver. The verifier checks precedence, resource capacity, and deadline satisfaction. Source: isopro/environments/tasks/scheduling_tasks.py, scheduling_verifier.py, scheduling_multiturn.py.
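The three checks can be sketched as a plain Python function. This is an illustration, not the contents of scheduling_verifier.py: the data layout, the single shared resource, integer time steps, and the name `verify_schedule` are all assumptions.

```python
def verify_schedule(jobs, start, capacity, deadline=None):
    """Return a list of violations; an empty list means the schedule is valid.

    jobs:  {name: {"duration": int, "demand": int, "preds": [names]}}
    start: {name: integer start time}
    """
    if set(start) != set(jobs) or any(t < 0 for t in start.values()):
        return ["schedule must give every job one non-negative start time"]
    violations = []
    end = {j: start[j] + jobs[j]["duration"] for j in jobs}
    # Precedence: a job may not start before each predecessor has finished.
    for j, spec in jobs.items():
        for p in spec["preds"]:
            if start[j] < end[p]:
                violations.append(f"precedence: {j} starts before {p} ends")
    # Resource capacity: total demand of running jobs at every time step.
    makespan = max(end.values(), default=0)
    for t in range(makespan):
        used = sum(jobs[j]["demand"] for j in jobs if start[j] <= t < end[j])
        if used > capacity:
            violations.append(f"capacity: demand {used} > {capacity} at t={t}")
    # Deadline: the whole schedule must finish in time.
    if deadline is not None and makespan > deadline:
        violations.append(f"deadline: makespan {makespan} > {deadline}")
    return violations


jobs = {
    "A": {"duration": 2, "demand": 1, "preds": []},
    "B": {"duration": 3, "demand": 2, "preds": ["A"]},
    "C": {"duration": 2, "demand": 2, "preds": ["A"]},
}
good = verify_schedule(jobs, {"A": 0, "B": 2, "C": 5}, capacity=3, deadline=7)
bad = verify_schedule(jobs, {"A": 0, "B": 1, "C": 2}, capacity=3, deadline=7)
print(good)  # []
print(bad)
```

The second schedule is rejected on two grounds: B starts before A ends, and B and C together exceed the capacity while they overlap. A response that merely looks plausible gets no credit.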

Additional Simulation Modules

ISOPro ships with simulation environments beyond the scheduling domain used in the paper. These are orthogonal to GCE and were developed for earlier work; they remain supported.

Adversarial Simulation

from isopro.adversarial_simulation import AdversarialSimulator, AdversarialEnvironment
from isopro.agents.ai_agent import AI_Agent

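# my_agent is assumed to be an agent wrapper you have already constructed (for example, around AI_Agent).
adv_env = AdversarialEnvironment(
    agent_wrapper=my_agent,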
    num_adversarial_agents=2,
    attack_types=["textbugger", "deepwordbug"],
    attack_targets=["input", "output"],
)
simulator = AdversarialSimulator(adv_env)
results = simulator.run_simulation(
    ["What is the capital of France?", "How does photosynthesis work?"],
    num_steps=1,
)

Conversation Simulation

from isopro.conversation_simulation.conversation_simulator import ConversationSimulator

simulator = ConversationSimulator(
    ai_prompt="You are a customer service agent. Respond politely and professionally.",
)
history = simulator.run_simulation("upset", num_turns=3)

Workflow Simulation

from isopro.workflow_simulation import WorkflowAutomation

WorkflowAutomation(
    video="path/to/workflow.mp4",
    config="config.json",
    output="output_dir",
    logs="logs_dir",
).run()

AI Orchestration

from isopro.orchestration_simulation import OrchestrationEnv
from isopro.orchestration_simulation.components import LLaMAAgent, AnalysisAgent, WritingAgent
from isopro.orchestration_simulation.evaluator import Evaluator

env = OrchestrationEnv()
env.add_component(LLaMAAgent("Research", "analyze AI impact on labor markets"))
env.add_component(AnalysisAgent("Analysis"))
env.add_component(WritingAgent("Writing"))

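task = "analyze AI impact on labor markets"  # example task string; replace with your own

# Run the same task in each orchestration mode and collect the results by mode.
results = {
    mode: env.run_simulation(mode=mode, input_data={"task": task})
    for mode in ("parallel", "sequence", "node")
}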
best_mode = Evaluator().evaluate(results)

Core Simulation API

pip install "isopro[api]"
python -m isopro.api_server

Standard response format:

{
  "run_id": "unique-identifier",
  "output": "simulation-specific-output",
  "metadata": { "timestamp": "...", "simulation_type": "..." }
}

Endpoints: GET /healthcheck, POST /simulate, POST /simulate/reason, POST /simulate/qa, POST /simulate/adversarial, POST /simulate/orchestration. See render.yaml for a Render deployment template.
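On the client side, a response in the standard format can be parsed and checked with the standard library alone. The sketch below is an assumption-laden illustration: `SimulationResponse` and `parse_response` are hypothetical names, the sample body is invented, and real responses may carry additional metadata fields.

```python
import json
from dataclasses import dataclass


@dataclass
class SimulationResponse:
    run_id: str
    output: object
    timestamp: str
    simulation_type: str


def parse_response(body):
    """Parse a JSON body in the standard response format; raise ValueError if incomplete."""
    data = json.loads(body)
    missing = [key for key in ("run_id", "output", "metadata") if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    meta = data["metadata"]
    return SimulationResponse(
        run_id=data["run_id"],
        output=data["output"],
        timestamp=meta.get("timestamp", ""),
        simulation_type=meta.get("simulation_type", ""),
    )


# Invented sample body that follows the documented shape.
sample = json.dumps({
    "run_id": "abc-123",
    "output": "ok",
    "metadata": {"timestamp": "2026-05-06T00:00:00Z", "simulation_type": "qa"},
})
parsed = parse_response(sample)
print(parsed.run_id, parsed.simulation_type)  # abc-123 qa
```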

Repository Layout

isopro/
├── training/           # rejection-sampling trainer, replay buffer, GRPO trainer, config
├── environments/       # simulation environments
│   └── tasks/          # scheduling tasks, verifier, multi-turn harness
├── curriculum/         # scheduler for tier progression
├── metrics/            # evaluation
├── backends/           # MLX, Ollama, HF backends
├── rl/                 # RL wrappers (CartPole, car, LLM envs)
├── adversarial_simulation/
├── conversation_simulation/
├── workflow_simulation/
├── orchestration_simulation/
└── api_server.py       # RESTful simulation API

examples/
├── run_scheduling_experiment.py   # Table 2 main experiment
├── run_ablation_study.py          # Table 3 / Figure 5 ablations
├── run_isopro_mlx.py              # MLX-native training loop
└── run_isozero_scheduling.py      # IsoZero baseline

Scope and Limitations

Verifiable-reward domains only. Reward hacking elimination is an architectural guarantee of using a deterministic verifier; it does not extend to domains where no verifier exists (safety, style). GCE extends to such domains through rubric-based trajectory assessment, but this is left to future work.
Single-domain validation. Results in the paper are on RCPSP scheduling. Broader validation across tasks, model families, and scales is needed.
Small per-tier sample sizes (3–5 problems). Per-tier accuracies are directional; the main findings are confirmed through multi-seed averaging (n = 3) and ablation.
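T2 and T5 remain at 0% for ISOPro + LoRA (IsoZero reaches 20% on T2; T5 is at 0% for every method in the results table), which suggests that resource reasoning requires more iterations, scaffolding, or stronger base models. This is reported as a capability boundary rather than papered over.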

Citation

If you use ISOPro or the GCE framework, please cite:

@inproceedings{henry2026gce,
  title     = {Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier},
  author    = {Henry, Jazmia},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026},
  note      = {Under review}
}

@software{isopro,
  author    = {Henry, Jazmia},
  title     = {{ISOPro}: A Reference Implementation of Grounded Continuous Evaluation},
  year      = {2026},
  url       = {https://github.com/iso-ai/isopro}
}

License

Apache License 2.0 — see LICENSE.

Support

Questions, issues, or reproduction problems: please open an issue.


