
invokerl

Hackable and performant RL post-training for LLMs.

Install

pip install invokerl

Quick start on a single GPU

import invokerl as rl

MODEL = "Qwen/Qwen3-0.6B"

generator = rl.VLLMGenerator(MODEL, gpu_memory_utilization=0.3, max_model_len=2048)
policy = rl.Policy(MODEL)
ref_policy = rl.Policy(MODEL).freeze()  # frozen ref for KL

trainer = rl.Trainer(
    config=rl.TrainerConfig(
        model_name_or_path=MODEL, total_steps=200, lr=5e-6,
        batch_size=1, group_size=4, accumulation_steps=4,
    ),
    algorithm=rl.algorithms.GRPO(clip_eps=0.2, beta=0.04),
    generator=generator, policy=policy, ref_policy=ref_policy,
    reward_fn=rl.rewards.ExactMatch(),
    dataset=rl.datasets.GSM8K("train"),
    eval_dataset=rl.datasets.GSM8K("test"),
)
trainer.train()

Full runnable: examples/train_grpo_gsm8k.py
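The reward_fn argument suggests rewards are ordinary callables; rl.rewards.ExactMatch() above scores a completion against the dataset label. As a sketch of what a custom rule-based reward might look like (the exact reward_fn signature isn't documented here, so the (completion, reference) -> float interface below is an assumption):

```python
import re

def numeric_exact_match(completion: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the last number in the
    completion equals the reference answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    prediction = numbers[-1] if numbers else None
    return 1.0 if prediction == reference else 0.0
```

Assuming the trainer accepts any callable, this would be wired in with reward_fn=numeric_exact_match in place of the built-in.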

Multi-GPU

Same trainer.train() call; just pass different objects:

# Disagg (generation on cuda:0, training on cuda:1)
pipeline = rl.DisaggPipeline(...)
trainer.train(pipeline=pipeline)

# FSDP (launch with torchrun)
policy = rl.Policy(MODEL).fsdp()     # auto-inits torch.distributed
trainer.train()                      # FSDP auto-detected from the policy

Full runnable: examples/train_disagg.py, examples/train_fsdp.py
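Since .fsdp() auto-initializes torch.distributed, the FSDP script is launched with a standard torchrun invocation; for example (the GPU count here is illustrative):

```shell
# 4-way FSDP on a single node; --nproc_per_node matches the GPU count
torchrun --nproc_per_node=4 examples/train_fsdp.py
```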

Profiling is first-class

with rl.profile() as p:
    trainer.step()

p.summary()                   # wall / CPU / CUDA / unaccounted + per-phase
p.export_trace("trace.json")  # open at ui.perfetto.dev

Also works with nsys — the NVTX markers are emitted unconditionally, no extra flag needed:

nsys profile --trace=cuda,nvtx python examples/train_grpo_gsm8k.py

Full runnable: examples/profile_step.py

Writing a new algorithm

Every algorithm implements two methods:

from torch import Tensor

from invokerl import BaseAlgorithm, RolloutBatch

class MyAlgorithm(BaseAlgorithm):
    def compute_advantages(self, batch: RolloutBatch) -> Tensor:
        """Turn rewards into per-token learning signals. The credit
        assignment hook — override for group normalization, GAE,
        token-level shaping, PRM scores, etc."""
        ...

    def compute_loss(self, new_log_probs, batch, advantages):
        """The policy objective. Return (loss, metrics)."""
        ...

Pass it to Trainer:

trainer = rl.Trainer(..., algorithm=MyAlgorithm(...))

Five reference algorithms ship with the library: GRPO, DPO, PPO, SimPO, DAPO.
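As a concrete sketch of what the two hooks compute in the GRPO case (group-normalized advantages, then a clipped surrogate), here is the core math in plain torch, outside the BaseAlgorithm plumbing. The function names and the eps constant are illustrative, not invokerl API, and the KL penalty against the reference policy is omitted:

```python
import torch

def group_normalized_advantages(rewards, group_ids, response_mask):
    """compute_advantages, GRPO-style: normalize each completion's reward
    by the mean/std of its prompt group, then broadcast to generated tokens."""
    adv = torch.zeros_like(rewards)
    for g in group_ids.unique():
        in_group = group_ids == g
        r = rewards[in_group]
        adv[in_group] = (r - r.mean()) / (r.std() + 1e-8)
    return adv.unsqueeze(1) * response_mask.float()  # [B] -> [B, T]

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages,
                           response_mask, clip_eps=0.2):
    """compute_loss core: PPO-style clipped objective, averaged over
    generated tokens only."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    mask = response_mask.float()
    return (per_token * mask).sum() / mask.sum()
```

A custom compute_advantages only has to preserve the [B, T] output shape; the trainer feeds its result straight into compute_loss.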

RolloutBatch

The data contract between the trainer and your algorithm:

Field           Shape    Description
token_ids       [B, T]   Prompt + completion token IDs
response_mask   [B, T]   True for generated tokens
rewards         [B]      Per-sequence scalar rewards
token_rewards   [B, T]   Optional per-token rewards
old_log_probs   [B, T]   Log-probs from the policy at generation time
ref_log_probs   [B, T]   Log-probs from the frozen reference model
group_ids       [B]      Which prompt each completion belongs to
group_size      int      Completions per prompt
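The shape relationships can be made concrete with a toy batch. A sketch, with a plain dict standing in for RolloutBatch (2 prompts with group_size=4 completions each, so B = 8):

```python
import torch

B, T, G = 8, 16, 4  # 2 prompts x 4 completions, max sequence length 16
batch = {
    "token_ids":     torch.zeros(B, T, dtype=torch.long),
    "response_mask": torch.zeros(B, T, dtype=torch.bool),
    "rewards":       torch.zeros(B),
    "token_rewards": torch.zeros(B, T),
    "old_log_probs": torch.zeros(B, T),
    "ref_log_probs": torch.zeros(B, T),
    # completions generated from the same prompt share a group id
    "group_ids":     torch.repeat_interleave(torch.arange(B // G), G),
    "group_size":    G,
}
```

This is why GRPO-style algorithms need only rewards plus group_ids: group membership is enough to normalize each completion against its siblings.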

Project structure

invokerl/
├── __init__.py       # public API (rl.Trainer, rl.Policy, rl.algorithms.GRPO, ...)
├── trainer.py        # Trainer: train() dispatches to internal standard/disagg/FSDP paths
├── policy.py         # PolicyModel + .fsdp() for distributed
├── generator.py      # VLLMGenerator
├── pipeline.py       # DisaggPipeline (optional, for 2-GPU async)
├── distributed.py    # FSDP init helpers
├── profiling.py      # rl.profile() context manager
├── algorithms/       # base + GRPO, DPO, PPO, SimPO, DAPO
├── data/             # base + GSM8K
└── rewards/          # base + rule-based exact match

examples/
├── train_grpo_gsm8k.py     # single GPU
├── train_disagg.py         # 2 GPUs async
├── train_fsdp.py           # FSDP multi-GPU
├── profile_step.py         # profiling
└── sweep_grpo_lr.py        # hyperparameter sweep

License

MIT
