
RL2: Ray Less Reinforcement Learning


A concise library for post-training large language models.

This is the right library for you if you want to learn reinforcement learning for large language models or quickly test your own algorithm. We deliver a clear implementation without complicated abstractions.

Despite the simplicity, you should be able to scale up to moderate-sized language models, e.g., 72B.

We also support

  • Balanced sequence packing for higher throughput
  • Multi-turn rollout with SGLang async inference engine
  • GEM (OpenAI Gym like) Agentic Environments

RL2 is a production-ready library! Check our wandb report on OpenThoughts, SkyworkRM, UltraFeedback, TinyZero, LetterCounting, and SearchR1.

Upcoming Features

  • Initialize model on meta device to decrease RAM consumption
  • Support partial rollout to increase GPU utilization
  • Use SGLang Router to forward requests for load balance between inference engines
  • Integrate GEM to scale environments

Getting Started

Installation

pip install rl-square

Data Preparation [Examples]

Hugging Face datasets and various file types, i.e., JSON, JSONL, CSV, Parquet, and Arrow, are accepted. All trainers support both the raw-text and the messages format. The former is more flexible but may be model-specific.

SFT

[
    {
        "prompt": "The capital of China is",
        "response": "Beijing."
    }
]
[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"},
            {"role": "assistant", "content": "Beijing."}
        ]
    }
]

Multi-turn is only supported by the latter format.
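As a minimal sketch (hypothetical file name and content), either format can be produced with the Python standard library; the snippet below writes an SFT dataset in the messages format as a JSON file, one of the accepted input types:

```python
import json

# A minimal SFT dataset in the "messages" format (hypothetical content).
records = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"},
            {"role": "assistant", "content": "Beijing."},
        ]
    }
]

# Write as a JSON file; JSONL, CSV, Parquet, and Arrow are also accepted.
with open("sft_data.json", "w") as f:
    json.dump(records, f, indent=4)

# Round-trip check: the file parses back to the same structure.
with open("sft_data.json") as f:
    assert json.load(f) == records
```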

RM and DPO

[
    {
        "prompt": "The capital of China is",
        "chosen": "Beijing.",
        "rejected": "Shanghai."
    }
]
[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"}
        ],
        "chosen": "Beijing.",
        "rejected": "Shanghai."
    }
]

PPO

[
    {
        "prompt": "The capital of China is",
        "extra_info": {
            "answer": "Beijing"
        }
    }
]
[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"}
        ],
        "extra_info": {
            "answer": "Beijing"
        }
    }
]

Environments [Examples]

In PPO, the language model interacts with the environment through a user-defined function, step, with the following signature.

from typing import Dict

async def step(
    state: str, action: str, extra_info: Dict
) -> Dict:
    action_type = parse_action_type(action)
    env_response = {
        "next_state": None,
        "reward": 0.0,
        "score": 0.0,
        "done": False,
        "extra_info": extra_info
    }
    if action_type == "search":
        query = parse_query(action)
        passage = await search_result(query)
        env_response["next_state"] = state + action + passage
    elif action_type == "answer":
        pred = parse_pred(action)
        reward = float(is_equivalent(pred, extra_info["answer"]))
        env_response["reward"] = reward
        env_response["score"] = reward
        env_response["done"] = True
    return env_response
  • state and action are the input and output of the language model in the last turn, and next_state is its input in the next turn. When state + action is a prefix of next_state, the two turns will be processed in a single sequence.
  • reward is used to compute advantages (and subsequently update the model), while score is used to log model performance. The two values may differ when needed.
  • done indicates whether to proceed to the next turn.
  • extra_info contains everything not aforementioned, e.g., answer.

The function should be included in a Python script where the path is specified by actor.rollout.env_path.
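To sketch how such a function is driven, here is a self-contained toy step that treats every action as a final answer and grades it. The parsing helpers from the example above are replaced by simplified stand-ins, so this is an illustration rather than RL2's actual rollout loop:

```python
import asyncio
from typing import Dict

async def step(state: str, action: str, extra_info: Dict) -> Dict:
    # Toy environment: treat any action as a final answer and grade it.
    env_response = {
        "next_state": None,
        "reward": 0.0,
        "score": 0.0,
        "done": True,
        "extra_info": extra_info,
    }
    pred = action.strip().rstrip(".")  # simplified stand-in for parse_pred
    reward = float(pred == extra_info["answer"])
    env_response["reward"] = reward
    env_response["score"] = reward
    return env_response

# Simulate one turn of interaction.
response = asyncio.run(
    step("The capital of China is", " Beijing.", {"answer": "Beijing"})
)
print(response["reward"], response["done"])  # 1.0 True
```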

Launch [Examples]

Use torchrun to launch the trainer. For example, on a single node

torchrun \
    --nproc_per_node=<number of GPUs> \
    -m RL2.trainer.ppo \
    <args>

For multiple nodes

torchrun \
    --nnodes=<number of nodes> \
    --node_rank=<rank of node> \
    --nproc_per_node=<number of GPUs on a node> \
    --master_addr=<address of master node> \
    --master_port=<port of master node> \
    -m RL2.trainer.ppo \
    <args>

Hyper-Parameters

Training Engine Partition

By default, i.e., ddp_size=1, tp_size=1, your model will be partitioned via ZeRO stage 3. ddp_size specifies the number of model parameter copies. Larger ddp_size leads to higher memory consumption and lower communication cost. For large models, you may specify tp_size > 1 to enable tensor parallelism. The product of ddp_size and tp_size should be a factor of the total number of GPUs.
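The divisibility constraint, and the resulting sharding degree, can be sketched as follows (an assumption on my part: GPUs not consumed by parameter copies or tensor parallelism shard the model via ZeRO stage 3):

```python
def zero_shard_degree(world_size: int, ddp_size: int = 1, tp_size: int = 1) -> int:
    """Per-copy ZeRO-3 sharding degree, assuming the GPUs left over after
    ddp_size parameter copies and tp_size-way tensor parallelism shard
    the remaining model states."""
    if world_size % (ddp_size * tp_size) != 0:
        raise ValueError("ddp_size * tp_size must divide the GPU count")
    return world_size // (ddp_size * tp_size)

print(zero_shard_degree(8))                         # 8: pure ZeRO-3 (defaults)
print(zero_shard_degree(8, ddp_size=2, tp_size=2))  # 2
```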

Sequence Length

For SFT, RM, and DPO, max_length is used to truncate sequences. In RM and DPO, the chosen and rejected sequences will be packed together, so the actual sequence length can be up to twice max_length. For PPO, max_new_tokens is used to terminate generations. The length of any sequence cannot exceed sp_size * tp_size * max_length_per_device.
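The PPO length bound is simple arithmetic, restated here as a sketch (the helper name is hypothetical):

```python
def max_sequence_length(sp_size: int, tp_size: int, max_length_per_device: int) -> int:
    # Upper bound on a single sequence in PPO: it must fit across
    # the sequence- and tensor-parallel group.
    return sp_size * tp_size * max_length_per_device

# E.g., with 4-way sequence parallelism, 2-way tensor parallelism, and
# 8192 tokens per device, a sequence can be up to 65536 tokens.
print(max_sequence_length(4, 2, 8192))  # 65536
```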

Algorithm

The default algorithm is Dr. GRPO, where the loss is averaged at the token level and the advantage is not divided by the standard deviation.

  • To use OpenAI PPO, set kl.type=reward, kl.reward_estimator=k1, and adv.estimator=gae
  • To use DeepSeek GRPO, set actor.avg_level=sequence, kl.type=loss, kl.loss_estimator=k3, and adv.norm_var=true
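To illustrate the difference in advantage normalization, here is a sketch (not RL2's implementation) contrasting Dr. GRPO's group-mean baseline with GRPO's additional division by the group's standard deviation (the adv.norm_var=true behavior):

```python
import statistics

def group_advantages(rewards, norm_var=False):
    # Group-relative baseline: subtract the mean reward of the group.
    mean = statistics.fmean(rewards)
    adv = [r - mean for r in rewards]
    if norm_var:
        # GRPO additionally divides by the group's standard deviation;
        # Dr. GRPO omits this step.
        std = statistics.pstdev(rewards)
        adv = [a / (std + 1e-6) for a in adv]
    return adv

rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))                 # Dr. GRPO: [0.5, -0.5, -0.5, 0.5]
print(group_advantages(rewards, norm_var=True))  # GRPO: approximately [1.0, -1.0, -1.0, 1.0]
```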

Acknowledgement

This project is built on the basis of many remarkable projects. In particular, we thank OpenRLHF and veRL for their pioneering work.

Citation

If you find this library useful, please cite it as follows

@misc{Tan2025RL2,
    author={Chenmien Tan and Simon Yu and Lanbo Lin and Ze Zhang and Yuanwu Xu and Chenhao Jiang and Tianyuan Yang and Sicong Xie and Guannan Zhang},
    title={RL2: Ray Less Reinforcement Learning},
    note={GitHub repository},
    howpublished={\url{https://github.com/ChenmienTan/RL2}},
    year={2025}
}


Download files


Source Distribution

rl_square-0.0.2.tar.gz (33.9 kB)

Uploaded Source

Built Distribution


rl_square-0.0.2-py3-none-any.whl (43.4 kB)

Uploaded Python 3

File details

Details for the file rl_square-0.0.2.tar.gz.

File metadata

  • Download URL: rl_square-0.0.2.tar.gz
  • Upload date:
  • Size: 33.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for rl_square-0.0.2.tar.gz

  • SHA256: d0c79b516c21dfd0f6c00ba67eefb69ec501fb036ef21cd3cba4c52d510f5331
  • MD5: f72d52276b5b370739c2ee11a34ead80
  • BLAKE2b-256: f860840eb86f6d7819f84cae1831fe08f0384daa1d8fbb3a056cacddbcfdd16e


File details

Details for the file rl_square-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: rl_square-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 43.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for rl_square-0.0.2-py3-none-any.whl

  • SHA256: 6a5ff0614ef6033d76d002510657a8eeeb990609880cd0424407f41af4360f24
  • MD5: d28191cfa799a32b85fddbe2285d010c
  • BLAKE2b-256: 3ebb36d3de8d8cff2a48dd5925f01ea37e8d6a6e1174ff5694ee585914c1c154

