
invokerl

Hackable and performant RL post-training for LLMs.

Install

pip install invokerl

Quick start on a single GPU

import invokerl as rl

MODEL = "Qwen/Qwen3-0.6B"

generator = rl.VLLMGenerator(MODEL, gpu_memory_utilization=0.3, max_model_len=2048)
policy = rl.Policy(MODEL)
ref_policy = rl.Policy(MODEL).freeze()  # frozen ref for KL

trainer = rl.Trainer(
    config=rl.TrainerConfig(
        model_name_or_path=MODEL, total_steps=200, lr=5e-6,
        batch_size=1, group_size=4, accumulation_steps=4,
    ),
    algorithm=rl.algorithms.GRPO(clip_eps=0.2, beta=0.04),
    generator=generator, policy=policy, ref_policy=ref_policy,
    reward_fn=rl.rewards.ExactMatch(),
    dataset=rl.datasets.GSM8K("train"),
    eval_dataset=rl.datasets.GSM8K("test"),
)
trainer.train()

Full runnable: examples/train_grpo_gsm8k.py
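rl.rewards.ExactMatch() is rule-based. As a rough standalone illustration of what such a reward typically checks on GSM8K-style data (the real rl.rewards interface is not shown here, so the signature below is an assumption, not invokerl's API):

```python
import re

def exact_match_reward(completion: str, answer: str) -> float:
    """Return 1.0 if the last number in the completion equals the reference
    answer (GSM8K-style rule-based reward), else 0.0."""
    # Strip thousands separators, then grab every integer or decimal.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == answer else 0.0
```

A binary signal like this is sparse, which is exactly why GRPO's group_size matters: several completions per prompt give the group-relative baseline something to compare.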

Multi-GPU

The same trainer.train() call works; you just pass different objects:

# Disagg (generation on cuda:0, training on cuda:1)
pipeline = rl.DisaggPipeline(...)
trainer.train(pipeline=pipeline)

# FSDP (launch with torchrun)
policy = rl.Policy(MODEL).fsdp()     # auto-inits torch.distributed
trainer.train()                      # FSDP auto-detected from the policy

Full runnable: examples/train_disagg.py, examples/train_fsdp.py
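The disaggregated setup is at heart a bounded producer-consumer loop: the generator GPU stays a step or two ahead while the trainer GPU consumes rollouts. A toy sketch of that pattern with stdlib threads (purely illustrative, not invokerl's implementation):

```python
import queue
import threading

def disagg_loop(n_steps=4):
    rollouts = queue.Queue(maxsize=2)  # bounded: generator runs at most 2 steps ahead
    trained = []

    def generator():  # would live on cuda:0, producing rollout batches
        for step in range(n_steps):
            rollouts.put(f"batch-{step}")
        rollouts.put(None)  # sentinel: no more rollouts

    def trainer():  # would live on cuda:1, consuming batches as they arrive
        while (batch := rollouts.get()) is not None:
            trained.append(batch)

    threads = [threading.Thread(target=generator), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trained

steps = disagg_loop()
```

The bounded queue is the important design choice: it caps policy staleness, since generation can only outrun training by the queue depth.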

Profiling is first-class

with rl.profile() as p:
    trainer.step()

p.summary()                   # wall / CPU / CUDA / unaccounted + per-phase
p.export_trace("trace.json")  # open at ui.perfetto.dev

Also works with nsys: the NVTX markers are emitted unconditionally, so no extra flag is needed:

nsys profile --trace=cuda,nvtx python examples/train_grpo_gsm8k.py

Full runnable: examples/profile_step.py
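The per-phase breakdown in p.summary() reduces to named timed spans. A minimal stdlib sketch of the idea (not invokerl's profiler, which additionally tracks CPU/CUDA time and emits NVTX ranges):

```python
import time
from contextlib import contextmanager

class PhaseTimer:
    """Accumulate wall time per named phase via a context manager."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        t0 = time.perf_counter()
        yield
        self.phases[name] = self.phases.get(name, 0.0) + time.perf_counter() - t0

    def summary(self):
        # Phases sorted by time spent, largest first.
        return dict(sorted(self.phases.items(), key=lambda kv: -kv[1]))

p = PhaseTimer()
with p.phase("generate"):
    time.sleep(0.05)  # stand-in for rollout generation
with p.phase("train"):
    time.sleep(0.01)  # stand-in for the optimizer step
slowest = next(iter(p.summary()))
```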

Writing a new algorithm

Every algorithm implements two methods:

from torch import Tensor

from invokerl import BaseAlgorithm, RolloutBatch

class MyAlgorithm(BaseAlgorithm):
    def compute_advantages(self, batch: RolloutBatch) -> Tensor:
        """Turn rewards into per-token learning signals. The credit
        assignment hook — override for group normalization, GAE,
        token-level shaping, PRM scores, etc."""
        ...

    def compute_loss(self, new_log_probs, batch, advantages):
        """The policy objective. Return (loss, metrics)."""
        ...

Pass it to Trainer:

trainer = rl.Trainer(..., algorithm=MyAlgorithm(...))
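For concreteness, here is what GRPO-style credit assignment inside compute_advantages boils down to: z-scoring each reward within its prompt group. This is a standalone sketch in plain Python, independent of invokerl's actual classes:

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, group_ids, eps=1e-6):
    """GRPO-style advantages: normalize each reward against the mean and
    std of the other completions for the same prompt."""
    groups = {}
    for r, g in zip(rewards, group_ids):
        groups.setdefault(g, []).append(r)
    stats = {g: (mean(rs), pstdev(rs)) for g, rs in groups.items()}
    # eps guards against zero std when all completions in a group tie.
    return [(r - stats[g][0]) / (stats[g][1] + eps)
            for r, g in zip(rewards, group_ids)]

# Two prompts, two completions each; one success and one failure per group.
adv = group_normalized_advantages([1.0, 0.0, 1.0, 0.0], [0, 0, 1, 1])
```

The group-relative baseline is what lets GRPO skip a learned value function: within each prompt, better-than-average completions get positive advantage and worse ones negative.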

Five reference implementations ship with the library: GRPO, DPO, PPO, SimPO, DAPO.

RolloutBatch

The data contract between the trainer and your algorithm:

Field          Shape   Description
token_ids      [B, T]  Prompt + completion token IDs
response_mask  [B, T]  True for generated tokens
rewards        [B]     Per-sequence scalar rewards
token_rewards  [B, T]  Optional per-token rewards
old_log_probs  [B, T]  Log-probs from policy at generation time
ref_log_probs  [B, T]  Log-probs from frozen reference model
group_ids      [B]     Which prompt each completion belongs to
group_size     int     Completions per prompt
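To show how these fields combine in compute_loss, here is a per-token sketch of a GRPO-style objective: a clipped surrogate plus a k3 KL penalty against the frozen reference. The defaults mirror the clip_eps and beta arguments from the quick start, but this is an illustration, not invokerl's actual loss code:

```python
import math

def grpo_token_loss(new_lp, old_lp, ref_lp, adv, clip_eps=0.2, beta=0.04):
    """Per-token GRPO-style loss from log-probs and an advantage.

    new_lp : log-prob under the current policy
    old_lp : old_log_probs entry (policy at generation time)
    ref_lp : ref_log_probs entry (frozen reference)
    adv    : per-token advantage from compute_advantages
    """
    ratio = math.exp(new_lp - old_lp)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    surrogate = -min(ratio * adv, clipped * adv)  # PPO-style clipped objective
    # k3 KL estimator to the reference: exp(log_r) - log_r - 1 >= 0.
    log_r = ref_lp - new_lp
    kl = math.exp(log_r) - log_r - 1
    return surrogate + beta * kl
```

In a real compute_loss, this would be evaluated on [B, T] tensors, masked by response_mask so that prompt tokens contribute nothing, then averaged.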

Project structure

invokerl/
├── __init__.py       # public API (rl.Trainer, rl.Policy, rl.algorithms.GRPO, ...)
├── trainer.py        # Trainer: train() dispatches to internal standard/disagg/FSDP paths
├── policy.py         # PolicyModel + .fsdp() for distributed
├── generator.py      # VLLMGenerator
├── pipeline.py       # DisaggPipeline (optional, for 2-GPU async)
├── distributed.py    # FSDP init helpers
├── profiling.py      # rl.profile() context manager
├── algorithms/       # base + GRPO, DPO, PPO, SimPO, DAPO
├── data/             # base + GSM8K
└── rewards/          # base + rule-based exact match

examples/
├── train_grpo_gsm8k.py     # single GPU
├── train_disagg.py         # 2 GPUs async
├── train_fsdp.py           # FSDP multi-GPU
├── profile_step.py         # profiling
└── sweep_grpo_lr.py        # hyperparameter sweep

License

MIT
