unstable-baselines
An Async Online Multi-Agent RL library for training reasoning models on TextArena games.
Structure | Installation | Example | Collaboration | Citation
Updates
- 23/06/2025: Early release of the pip package (pip install UnstableBaselines)
- 22/06/2025: Early release of the code base
Introduction
unstable‑baselines is an experimental, asynchronous, online reinforcement‑learning framework for rapid prototyping of multi‑turn / multi‑agent algorithms on TextArena environments.
We tried to keep the code as straightforward as possible. It is currently around 1.2K lines long and semi-readable.
The main focus of unstable-baselines is to enable fast prototyping and research. For something a bit more production-ready, we recommend using oat or verifiers.
Work in progress — interfaces will change.
Key Features
- Asynchronous collection & learning – actors generate data while learners train.
- Multi‑agent, multi‑turn focus with self‑play or fixed opponents.
- LoRA‑first fine‑tuning workflow for fast, lightweight updates.
- Composable reward transforms at step, final, and sampling stages (sketched below).
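For example, the three transform stages compose as plain lists (a minimal sketch reusing the transform names from the training script further down; exact signatures may change while the library is under development):

import unstable.reward_transformations as retra

# each stage is a composition of transforms, applied in order
final_t = retra.ComposeFinalRewardTransforms([retra.RoleAdvantageByEnvFormatter()])
step_t = retra.ComposeStepRewardTransforms([
    retra.RewardForFormat(1.5),              # bonus for well-formatted output
    retra.PenaltyForInvalidMove(1.0, -1.0),  # reward valid moves, penalize invalid ones
])
sampling_t = retra.ComposeSamplingRewardTransforms([retra.NormalizeRewardsByEnv(True)])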
Structure
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ │ Register new lora │ │ Get Loss & │ │
│ Model Pool │◀──────────────────────────│ Learner │◀─────────────────────────▶│ Algorithm │
│ │ checkpoint │ │ update weights │ │
└───────────────┘ └───────────────┘ └───────────────┘
▲ │ ▲ │
│ │ Sample If enough │ │ Check if enough
Update │ │ Opponent data, pull │ │ data for training
Trueskill │ │ the next batch │ │ is available
│ ▼ │ ▼
┌───────────────┐ ┌───────────────┐
│ │ Process and store │ │
│ Collector │──────────────────────────▶│ StepBuffer │
│ │ collected Trajectories │ │
└───────────────┘ └───────────────┘
▲ │
│ │ Maintain
return │ │ Pool of
Trajectory │ │ n parallel
│ │ workers
│ ▼
┌─────────────┐
│ run_game() │
│ train/eval │
└─────────────┘
Installation
# build TextArena v0.6.9 (until it’s on PyPI)
git clone https://github.com/LeonGuertler/TextArena.git
cd TextArena
git checkout v0.6.9
pip install -e .
cd ..
# install UnstableBaselines
pip install UnstableBaselines
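# quick sanity check that both packages import
# (assumes the module names textarena and unstable, as used in the example below)
python3 -c "import textarena, unstable; print('ok')"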
Example
To get you started, this short example walks through training Qwen3-1.7B-Base via mirror self-play on SimpleTak and evaluating it against google/gemini-2.0-flash-lite-001 on SimpleTak and KuhnPoker. We ran the experiments on 3x RTX 6000 Ada GPUs. If you are limited to 24 GB of VRAM, you can reduce MAX_TRAIN_SEQ_LEN to around 2500. This means the model will only be trained on the first 2500 prompt+answer tokens, but can still generate answers that are longer than that; since (in our experience) models tend to shorten their reasoning throughout training, this works very well.
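Concretely, on a single 24 GB GPU the only change to the script below would be the learner's max_train_len argument (a sketch, assuming max_train_len is the parameter that MAX_TRAIN_SEQ_LEN refers to):

MAX_TRAIN_SEQ_LEN = 2500  # train on the first 2500 prompt+answer tokens only
# ...then pass max_train_len=MAX_TRAIN_SEQ_LEN to StandardLearner instead of None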
Training script
import ray, unstable
import unstable.reward_transformations as retra
ray.init(namespace="unstable")
tracker = unstable.Tracker.options(name="Tracker").remote(run_name="demo", wandb_project="UB")
step_buffer = unstable.StepBuffer.options(name="StepBuffer").remote(
max_buffer_size=768,
tracker=tracker,
final_reward_transformation=retra.ComposeFinalRewardTransforms([retra.RoleAdvantageByEnvFormatter()]),
step_reward_transformation=retra.ComposeStepRewardTransforms([retra.RewardForFormat(1.5), retra.PenaltyForInvalidMove(1.0, -1.0)]),
sampling_reward_transformation=retra.ComposeSamplingRewardTransforms([retra.NormalizeRewardsByEnv(True)]),
)
model_pool = unstable.ModelPool.options(name="ModelPool").remote(sample_mode="mirror", max_active_lora=3, tracker=tracker)
ray.get(model_pool.add_checkpoint.remote(path=None, iteration=-1)) # set initial checkpoint as no LoRA
lora_cfg = {
"lora_rank": 32, "lora_alpha": 32, "lora_dropout": 0.0,
"target_modules": ["q_proj","k_proj","v_proj","o_proj","gate_proj", "up_proj","down_proj"]
}
collector = unstable.Collector.options(name="Collector").remote(
num_actors=2,
step_buffer=step_buffer,
model_pool=model_pool,
tracker=tracker,
vllm_config={
"model_name": "Qwen/Qwen3-1.7B-base",
"max_parallel_seq": 128,
"max_tokens": 4096,
"max_loras": 5,
"lora_config": lora_cfg,
"max_model_len": 8192
},
training_envs=[("SimpleTak-v0-train", 2, "qwen3-zs")], # (env-id, num players, prompt template)
evaluation_envs=[("SimpleTak-v0-train", 2, "qwen3-zs"), ("KuhnPoker-v0-train", 2, "qwen3-zs")],
evaluation_opponent="google/gemini-2.0-flash-lite-001",
)
learner = unstable.StandardLearner.options(num_gpus=1, name="Learner").remote(
model_name="Qwen/Qwen3-1.7B-base",
step_buffer=step_buffer,
model_pool=model_pool,
tracker=tracker,
algorithm=unstable.algorithms.Reinforce(),
batch_size=384,
mini_batch_size=1,
learning_rate=1e-5,
grad_clip=0.2,
lora_cfg=lora_cfg,
activation_checkpointing=False,
gradient_checkpointing=False,
max_train_len=None, # always train on the full sequence
max_generation_len=4096, # important for Dr. GRPO
)
# start the collection and training loops
collector.collect.remote(num_workers=384, num_eval_workers=16)
ray.get(learner.train.remote(200)) # total update steps
In a nutshell, the collector maintains 384 collection games and 16 evaluation games running in parallel. Whenever a game finishes, the trajectory is passed to the StepBuffer and a new game is started. The StepBuffer splits each trajectory into steps and applies the specified reward transformations.
The Learner periodically (once every 0.2 seconds) checks whether the StepBuffer has accumulated enough data for training. If so, it requests a full training batch from the StepBuffer, trains on the data, and pushes the new set of LoRA weights to the ModelPool.
The collector keeps collecting episodes until the Learner tells it to stop (in this case, after 200 update steps).
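If it helps to picture the handshake, here is a self-contained toy stand-in for that producer/consumer pattern (illustrative only, not the library's actual code; UnstableBaselines uses Ray actors rather than threads):

import queue, random, threading, time

buf = queue.Queue()    # stands in for the StepBuffer
BATCH = 8              # stands in for batch_size=384

def collector():       # stands in for the Collector's parallel game workers
    while True:
        buf.put(random.random())  # a finished game contributes its steps
        time.sleep(0.01)

threading.Thread(target=collector, daemon=True).start()

for step in range(3):  # stands in for learner.train(200)
    while buf.qsize() < BATCH:    # poll until enough data has accumulated
        time.sleep(0.2)           # the Learner checks every 0.2 seconds
    batch = [buf.get() for _ in range(BATCH)]
    print(f"update {step}: trained on {len(batch)} steps")  # train + push new LoRA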
Monitoring Progress
If you want to monitor key metrics (in addition to logging them via W&B) during training, you can run the following command in a separate terminal:
python3 -m unstable.terminal_interface
The rendered interface currently looks something like this (please note that it might change in the future, as UnstableBaselines is very much still under development):
The .gif doesn't do it justice; it looks nicer when you run it yourself haha.
Results
TODO add some comments about the results
Collaboration
Developed in partnership with PlasticLabs.
Paper & Citation
We built this codebase as part of our research on self-play for reasoning models on text-based games. We hope to finish and release those works within the next couple of weeks!