Skip to main content

A Gym for Generalist LLMs.

Project description

GEM: A Gym for Generalist LLMs

Notion blog 🌐 Axon-RL Hugging Face Collection Documentation

We’re entering the era of experience, where LLM training moves beyond static datasets, towards LLM agents learning from experience gathered in complex, expressive environments. As a step towards this we introduce GEM, our open-source General Experience Maker.

Like OpenAI Gym for traditional RL, GEM is a dedicated environment simulator for the age of LLMs. GEM offers a diverse range of environments with clean, standardized interfaces, making it easy to integrate with existing RL training frameworks (Oat, Verl, etc.). In addition, GEM features tool integration, flexible and easy-to-modify wrappers, async vectorized environment execution to maximize throughput, multi-environment training, and more … everything you need to make LLM agent RL training simple.

Links

Installation

We recommend using uv for dependency management:

uv pip install gem-llm

To use the search tool, run:

uv pip install 'gem-llm[search]'
conda install -c pytorch -c nvidia faiss-gpu=1.8.0

Interface

GEM's interface closely follows Gym's API. Here's an example using the "game:GuessTheNumber-v0" environment:

import gem

# List all supported environments
gem.print_envs()

# Initialize the environment
env = gem.make("game:GuessTheNumber-v0")

# Reset the environment to generate the first observation
observation, info = env.reset()

# Start the agent-environment loop
while True:
    action = env.sample_random_action() # insert policy here, e.g.,
    # (pseudocode) action = llm.generate(observation)

    # apply action and receive next observation, reward
    # and whether the episode has ended
    next_observation, reward, terminated, truncated, info = env.step(action)
    print("OBS", observation)
    print("ACT", action)

    # update the policy (online) here
    # e.g., policy = learn(policy, observation, action, reward, info)

    observation = next_observation
    # Exit when the episode terminates
    if terminated or truncated:
        break

Tool Integration Examples

Below are examples for enabling tools within environments.

Example using the Python tool:

from transformers import AutoTokenizer

import gem
from gem.tools.python_code_tool import PythonCodeTool
from gem.tools.tool_env_wrapper import ToolEnvWrapper
from gem.wrappers.wrapper_factory import WRAPPER_FACTORY

env = gem.make("math:GSM8K")
tool = PythonCodeTool()
wrapped_env = ToolEnvWrapper(env, tools=[tool])
wrapped_env = WRAPPER_FACTORY["concat_chat"](
    wrapped_env, tokenizer=AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
)
obs, info = wrapped_env.reset()

# we ignore the obs and use a dummy action
dummy_action = "<think>Let me compare 9.9 and 0.11 using python.</think><python>print('9.9 > 9.11?', 9.9 > 9.11)</python>"
obs, reward, terminated, truncated, info = wrapped_env.step(dummy_action)
print(obs)
# continue to sample the next response given the tool results ...

wrapped_env.close()

Example using the search tool:

# assume you have search server running

env = gem.make("game:GuessTheNumber-v0", max_turns=2)
tool = SearchTool(search_url="http://localhost:8000/retrieve", topk=2)
wrapped_env = ToolEnvWrapper(env, tools=[tool], max_tool_uses=1)
wrapped_env = WRAPPER_FACTORY['concat_chat'](wrapped_env, tokenizer=AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B"))
wrapped_env.reset()

dummy_action = "<think>I need to search for Python list comprehension examples</think><search>Python list comprehension examples</search>"
obs, reward, terminated, truncated, info = wrapped_env.step(dummy_action)
print(obs)
Click to get the complete runnable code
import subprocess
import time

from transformers import AutoTokenizer

import gem
from gem.tools.search_tool import SearchTool
from gem.tools.tool_env_wrapper import ToolEnvWrapper
from gem.wrappers.wrapper_factory import WRAPPER_FACTORY

# start the search server
serp_api_key = "add you api key" # get api at https://serpapi.com/manage-api-key
server_process = subprocess.Popen([
    'python', '-m', 'gem.tools.search_engine.serp_search_server',
    '--search_url', 'https://serpapi.com/search',
    '--topk', '2', '--serp_api_key', serp_api_key
])
time.sleep(5)

# interact using search tool
env = gem.make("game:GuessTheNumber-v0", max_turns=2)
tool = SearchTool(search_url="http://localhost:8000/retrieve", topk=2)
wrapped_env = ToolEnvWrapper(env, tools=[tool], max_tool_uses=1)
wrapped_env = WRAPPER_FACTORY['concat_chat'](wrapped_env, tokenizer=AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B"))
wrapped_env.reset()

dummy_action = "<think>I need to search for Python list comprehension examples</think><search>Python list comprehension examples</search>"
obs, reward, terminated, truncated, info = wrapped_env.step(dummy_action)
print(obs)

Integration Examples

We demonstrate how to leverage existing LLM RL infrastructure to train agents with GEM. First, we show how to train game agents using Oat.

Before running the training, ensure you set up the development environment by following the instructions.

Run the following command to train an agent for the game environment game:GuessTheNumber-v0:

python train.py \
    --env_id game:GuessTheNumber-v0 \
    --wrappers concat \
    --gamma 0.9 \
    --norm_adv \
    --gpus 8 \
    --gradient-checkpointing \
    --num_samples 1 \
    --rollout_batch_size 128 \
    --num_envs 2 \
    --rollout_batch_size_per_device 16 \
    --pi_buffer_maxlen_per_device 16 \
    --pretrain Qwen/Qwen3-1.7B-Base \
    --enable_prefix_caching \
    --collocate \
    --vllm_sleep \
    --vllm_gpu_ratio 0.45 \
    --rnd-seed \
    --learning_rate 0.000001 \
    --lr_scheduler constant \
    --lr_warmup_ratio 0 \
    --num_ppo_epochs 2 \
    --train_batch_size 128 \
    --train_batch_size_per_device 1 \
    --beta 0 \
    --max_model_len 12800 \
    --generate_max_length 4096 \
    --temperature 1.0 \
    --top_p 1 \
    --eval_steps -1 \
    --save_steps -1 \
    --eval_temperature 0.6 \
    --eval_top_p 0.95 \
    --eval_generate_max_length 4096 \
    --max_train 65000 \
    --max_save_num 30 \
    --use-wb \
    --wb-run-name oat-qwen3-1.7b-base-game:GuessTheNumber-v0 \
    --wb_project gem \
    --debug

We also provide sample code for math, code, and general QA in the examples directory. In addition to Oat integration, you can find examples of RL training with Verl here.

Roadmap

As our next step, we plan to integrate the following environments (among others):

  • Terminal-Bench
  • SWE-Gym
  • ...

Contributing

We welcome all forms of contribution — from adding new environments to integrating additional training frameworks. We're planning to write a community-driven technical report, and major contributors will be recognized with authorship. Join discord to discuss more!

Acknowledgement

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gem_llm-0.0.3.post1-py3-none-any.whl (75.9 kB view details)

Uploaded Python 3

File details

Details for the file gem_llm-0.0.3.post1-py3-none-any.whl.

File metadata

  • Download URL: gem_llm-0.0.3.post1-py3-none-any.whl
  • Upload date:
  • Size: 75.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for gem_llm-0.0.3.post1-py3-none-any.whl
Algorithm Hash digest
SHA256 2246228c1eb060e94e8463be63821ba6066be4e5bbd479a02351ef56209597b2
MD5 8d7f341b61c84372dd551ebdc2b34a3d
BLAKE2b-256 702fe0b84fcb792f3a5ba815c350d7b5ec086d380b7086bb7fc15c8d85c6b287

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page