
Actors: Multi‑(Agent, Turn, Env) RL


A hackable library for doing Multi‑Turn Multi‑Agent RL with LLMs for the GPU poor and middle class. Supports some fun environments and makes it very easy to add new ones.



Multi‑Trainable‑Agents

This library supports training multiple different models together using Accelerate.

This allows you to do some very fun stuff, such as adversarial training, collaborative problem solving, multi‑agent collaboration, etc.

Here is a quick simplified example for collaborative problem solving:

# Assumed imports (adjust to your install):
# from actors import vLLMActor, CollaborativeActorConfig, CollaborativeEnvironment
from datasets import load_dataset

# 2 completely different models, both trainable.
bob_actor = vLLMActor(
  name="Bob",
  model_path="Qwen/Qwen2.5-7B-Instruct",
)
alice_actor = vLLMActor(
  name="Alice",
  model_path="meta-llama/Llama-3.1-8B-Instruct",
)

# Load a math dataset.
ds = load_dataset('rl-actors/GSM8K-Easy-Math')

# In this environment they will take turns improving their solution.
env = CollaborativeEnvironment(
  actor_cfgs=[
    CollaborativeActorConfig(
      actor=alice_actor,
      system_prompt="You are Alice",
    ),
    CollaborativeActorConfig(
      actor=bob_actor,
      system_prompt="You are Bob",
    ),
  ],
  reward_functions=[
    # Omitted for brevity.
  ],
  # The order of the rounds is specified with a tiny DSL:
  # Bob starts, then Alice, followed by 5 random turns.
  round_spec='Bob -> Alice -> (Bob/Alice)*5',
  train_dataset=ds
)

Installation

You can install the library from source for the latest features and bug fixes:

git clone https://github.com/RD211/actors.git
cd actors
pip install .

Or install from PyPI:

pip install rl-actors

To use all the features of the library, run your script with Accelerate using a ZeRO‑3 configuration:

accelerate launch --config_file zero3.yaml your_script.py
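If you don't already have one, a minimal `zero3.yaml` might look like the following. The values here are illustrative, not the library's recommended settings; you can generate an equivalent file interactively with `accelerate config`.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: true
  offload_optimizer_device: cpu
  offload_param_device: cpu
mixed_precision: bf16
num_machines: 1
num_processes: 2
```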

The library uses Accelerate, DeepSpeed, bitsandbytes, vLLM, and PEFT, and supports LoRA and QLoRA training.

Some quickstart examples can be found at examples/.


Environments

We plan to have the following environments; suggestions for new environments are welcome:

  • SingleTurnEnvironment (single trainable agent): standard environment with only one actor and one turn.
  • CollaborativeEnvironment (multiple trainable agents): the actors iterate on a task together in alternating turns.
  • ParallelEnvironment (multiple trainable agents): samples multiple solutions in parallel and combines them at the end. This is probably what Grok 4 Heavy does.
  • JailbreakEnvironment (fun): one trainable actor tries to convince a frozen actor to do unsafe things, with prompts drawn from a jailbreak dataset.
  • CodeforcesParallelEnvironment (fun): same as ParallelEnvironment but with code‑execution feedback.

Creating a new environment

It is pretty easy to add a new environment, and we recommend making a new environment rather than trying to adapt the current environments for specific tasks.

class CustomEnv(Environment):
  async def generate(self, batch: dict[str, Any]) -> EnvironmentOutput:
    # 1. Sample using your actor.
    problems = batch['problem']
    generations = await alice_actor.agenerate(problems)
    txt_gen = [gen.outputs[0].text for gen in generations]

    # 2. Compute rewards (simplified: 1 if the answer appears in the text).
    answers = batch['answer']
    rewards = [float(answer in txt) for answer, txt in zip(answers, txt_gen)]

    # 3. Return the environment results.
    tok = alice_actor.tokenizer

    alice_output = ActorOutput(
      input_ids=tok(txt_gen)['input_ids'],
      rewards=rewards,
    )

    return EnvironmentOutput(
      actors={'Alice': alice_output},
    )

Combining environments

Combining environments is pretty cool. There are two major use cases we see:

  • Training on multiple different tasks with different rewards and completely different goals. Coding + Math, Coding + Creative Writing, etc.
  • Easily adding evaluation environments to your training.

Here are some examples:

# Training env for Codeforces.
codeforces_env = CodeforcesParallelEnvironment(
  actors=[bob_actor],
  reward_functions=[codeforces_reward]
)

# Training env for math.
math_env = SingleTurnEnvironment(
  actors=[bob_actor],
  reward_functions=[math_correctness],
  prompt_column='problem',
  train_data=load_dataset('rl-actors/GSM8K-Easy-Math', split='train'),
  eval_data={
    'gsm8k': load_dataset('rl-actors/GSM8K-Easy-Math', split='test')
  }
)

# Evaluation environment for AIME.
aime_eval = SingleTurnEnvironment(
  actors=[bob_actor],
  reward_functions=[math_correctness],
  prompt_column='problem',
  eval_data={
    'aime25': load_dataset('math-ai/aime25')
  }
)

# Final combined environment.
env = codeforces_env + math_env + aime_eval

Rewards

We do not provide many predefined reward functions yet, but they are easy to create. The reward system is designed to support judges and complex workflows with minimal effort. If you create your own environment, you do not even need an explicit reward function: rewards can be computed directly inside the environment.

However, for our predefined environments you can make rewards as follows:

# Single-turn reward
@reward_function(name='length_reward', weight=1.0)
def length_reward(prompt: str, completion: str) -> float:
  return -len(completion) / 1024

# We support batched rewards and weights too.
@conversation_reward_function(name='math_reward', weight=1.0, batched=True)
def math_reward(conversation: list,
                problem: list,  # Dataset field
                answer: list,   # Also dataset field
                actor_name: list  # allows actor-specific rewards.
              ) -> list[float]:
  # Batched reward functions are designed for Judges.
  # You can use Actors freely in the reward function.
  # ...
  return rewards

# Reward-function parameters are filled in automatically:
# - Single-turn rewards always receive `prompt` and `completion`.
# - Conversation rewards always receive `conversation` and `actor_name`.
# - Both also receive all dataset columns, such as `answer` for math data.
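The parameter injection described above can be sketched with `inspect.signature`: match a reward function's parameter names against the available fields and pass only what it asks for. This is an illustration of the idea, not the library's implementation.

```python
import inspect
from typing import Any, Callable

def call_with_matching_fields(fn: Callable[..., float], fields: dict[str, Any]) -> float:
    """Call `fn` with only the keyword arguments its signature declares."""
    wanted = inspect.signature(fn).parameters
    kwargs = {name: fields[name] for name in wanted if name in fields}
    return fn(**kwargs)

def length_penalty(prompt: str, completion: str) -> float:
    return -len(completion) / 1024

fields = {
    "prompt": "What is 2 + 2?",
    "completion": "4",
    "answer": "4",  # extra dataset column, ignored by this reward
}
print(call_with_matching_fields(length_penalty, fields))  # -0.0009765625
```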

Memory efficiency

Training multiple models at the same time requires a lot of careful VRAM management. We have thus implemented the following features:

  • Full offloading of optimizer states and parameters, both during inference and when switching between models during training.
  • A Triton kernel for computing log‑probabilities, which helps somewhat with long contexts.
  • Liger kernels for computing the GRPO loss.
  • An efficient streamed implementation for updating vLLM weights, along with in‑memory LoRA updates.

Debugging VRAM

To debug memory issues, try running with ACTORS_LOGGING_LEVEL='verbose'.

Memory can also become badly fragmented and cause OOM errors when switching to the inference phase. Running with PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.3,max_split_size_mb:64 may fix the problem.

Sometimes, after a failed run, memory might remain allocated for a while. Make sure to terminate all previous processes before starting a new run.


RL algorithms

Currently there are GRPO and GSPO implementations; each has both a plain torch version and a Liger-Kernel chunked version.

[!NOTE] You can also obtain many other algorithms, such as DAPO and Dr. GRPO, just by configuring the existing losses and advantage function.
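As a quick illustration of why these variants are mostly a matter of configuration, here is the group-relative advantage at the heart of GRPO: each completion's reward is normalized against its group's mean and, optionally, standard deviation (Dr. GRPO, for example, drops the std normalization). A minimal sketch, not the library's implementation:

```python
def group_relative_advantages(rewards: list[float], normalize_std: bool = True) -> list[float]:
    """GRPO-style advantages: reward minus group mean, optionally divided by group std.

    normalize_std=False gives the Dr. GRPO-style variant (mean-centering only).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = (sum(c * c for c in centered) / n) ** 0.5
    return [c / (std + 1e-8) for c in centered]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # approximately [1.0, -1.0, 1.0, -1.0]
```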


Actors

We support both hosted API actors and local/trainable actors.

# OpenAI‑style API actor (frozen or for judgment / orchestration)
openai_actor = OpenAIActor(
  name="Judge",
  api_key=os.environ["OPENAI_API_KEY"],
  # base_url can be customized to point at compatible endpoints
)

# Trainable vLLM actors
train_cfg = ActorTrainCfg(
  learning_rate=1e-6,
  beta=0.01,                      # Controls KL
  peft_config=LoraConfig(r=16),   # pass a PEFT/LoRA config if desired
  offload_optimizer=True,
  offload_model=True,
)

bob = vLLMActor(
  name="Bob",
  model_path="Qwen/Qwen2.5-7B-Instruct",
  gpu_groups=[[0, 1]],            # on what GPUs we put the model; allows data‑parallel
  training_config=train_cfg,
)

alice = vLLMActor(
  name="Alice",
  model_path="meta-llama/Llama-3.1-8B-Instruct",
  gpu_groups=1,
  training_config=train_cfg,
)
  • The gpu_groups argument controls which GPUs the vLLMActor's model is placed on; passing multiple groups enables data parallelism.

Inspiration

Inspired by TRL, Unsloth, OpenRLHF and Verifiers.
