
AgileRL is a deep reinforcement learning library focused on improving RL development through RLOps.


Reinforcement learning streamlined.
Easier and faster reinforcement learning with RLOps. Visit our website. View documentation.
Join the Discord Server for questions, help and collaboration.


🚀 Train super-fast for free on Arena, the RLOps platform from AgileRL 🚀


AgileRL is a Deep Reinforcement Learning library focused on improving development by introducing RLOps - MLOps for reinforcement learning.

This library is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning.
Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.
We are constantly adding more algorithms and features. AgileRL already includes state-of-the-art evolvable on-policy, off-policy, offline, multi-agent and contextual multi-armed bandit reinforcement learning algorithms with distributed training.

AgileRL offers 10x faster hyperparameter optimization than state-of-the-art (SOTA) methods.


Benchmarks

Reinforcement learning algorithms and libraries are usually benchmarked once the optimal hyperparameters for training are known, but it often takes hundreds or thousands of experiments to discover these. This is unrealistic and does not reflect the true, total time taken for training. What if we could remove the need to conduct all these prior experiments?

In the charts below, a single AgileRL run, which tunes hyperparameters automatically, is benchmarked against the multiple Optuna training runs traditionally required for hyperparameter optimization, demonstrating the real time savings possible. Global steps is the sum of every step taken by any agent in the environment, counted across the entire population.

AgileRL offers an order of magnitude speed up in hyperparameter optimization vs popular reinforcement learning training frameworks combined with Optuna. Remove the need for multiple training runs and save yourself hours.
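To make the global-steps accounting concrete, the sketch below tallies steps for one population-based run and for a sequential multi-trial search. Every number in it is a hypothetical placeholder chosen for illustration, not a benchmark result, and the function names are not part of AgileRL.

```python
def global_steps(steps_per_agent):
    """Sum every environment step taken by every agent in a population."""
    return sum(steps_per_agent)


def sequential_search_steps(num_trials, steps_per_trial):
    """Total steps when each hyperparameter trial is a separate training run."""
    return num_trials * steps_per_trial


# Hypothetical numbers, for illustration only.
population_run = global_steps([250_000] * 4)         # four agents in one run
trial_search = sequential_search_steps(50, 250_000)  # fifty independent trials

print(population_run)  # 1000000
print(trial_search)    # 12500000
```

Counting steps this way charges a population run for every agent it trains, so the comparison with a multi-trial search is made on the same footing.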

AgileRL also supports multi-agent reinforcement learning using the PettingZoo-style parallel API. The charts below highlight the performance of our MADDPG and MATD3 algorithms with evolutionary hyperparameter optimization (HPO), benchmarked against epymarl's MADDPG algorithm with grid-search HPO on the simple speaker listener and simple spread environments.

Get Started

To see the full AgileRL documentation, including tutorials, visit our documentation site. To ask questions and get help, collaborate, or discuss anything related to reinforcement learning, join the AgileRL Discord Server.

Install as a package with pip:

pip install agilerl

Or install in development mode:

git clone https://github.com/AgileRL/AgileRL.git && cd AgileRL
pip install -e .

AgileRL ships optional dependency groups that you can install as needed:

Installation      Description
agilerl[box2d]    Box2D physics engine for Gymnasium environments.
agilerl[arena]    Arena SDK and CLI. Validate custom environments, and train and deploy agents on managed cloud infrastructure.
agilerl[llm]      LLM reinforcement fine-tuning.
agilerl[all]      All optional AgileRL functionality.

In development mode, quote the extras:

pip install -e ".[arena]"

To install the nightly version of AgileRL with the latest features, use:

pip install git+https://github.com/AgileRL/AgileRL.git@nightly
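After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library; the helper name installed_version is hypothetical, and it assumes the distribution is registered under the name "agilerl" as in the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "agilerl") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version())
```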

Training Locally

AgileRL provides the tools to train RL algorithms in a variety of ways, focusing on flexibility and modularity as a stepping stone for efficiently training arbitrarily large populations of agents in a distributed manner on Arena.

Training a Single Agent without Evolutionary HPO

The simplest way to train an RL agent with AgileRL is through the LocalTrainer. Here is an example of training a DQN agent on the LunarLander-v3 environment:

from agilerl.training.trainer import LocalTrainer

trainer = LocalTrainer(algorithm="DQN", environment="LunarLander-v3")
population, fitnesses = trainer.train()

With no other arguments provided, LocalTrainer defaults to 1,000,000 steps with a single agent and the algorithm's default hyperparameters — no evolutionary HPO is applied.

Training a Population with Evolutionary HPO

To unlock AgileRL's evolutionary hyperparameter optimization, train a population of agents whose hyperparameters will evolve and mutate towards their optimal values:

from agilerl import LocalTrainer
from agilerl.models import TrainingSpec

trainer = LocalTrainer(
    algorithm="DQN",
    environment="LunarLander-v3",
    training=TrainingSpec(pop_size=4), # Train four agents simultaneously
    hpo=True, # Enable evolutionary HPO using default settings
)
population, fitnesses = trainer.train()

This trains a population of four DQN agents that share experiences but learn individually. Every 10,000 steps (default value for evo_steps in TrainingSpec), tournament selection identifies the best performers and mutations are applied to explore the hyperparameter space. See Evolutionary Hyperparameter Optimization for details on how evolutionary HPO works in AgileRL.
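The select-then-mutate cycle described above can be sketched in plain Python. This is a simplified illustration, not AgileRL's implementation: agents are represented as dictionaries of hyperparameters, the fitness values are made up, and the helper names tournament_select and mutate_lr, the 20% step size and the learning-rate bounds are all assumptions.

```python
import random


def tournament_select(population, fitnesses, tournament_size=2, elitism=True, rng=random):
    """Return a new population of the same size, chosen by repeated tournaments."""
    indices = range(len(population))
    new_pop = []
    if elitism:
        # The fittest agent always survives unchanged.
        best = max(indices, key=lambda i: fitnesses[i])
        new_pop.append(dict(population[best]))
    while len(new_pop) < len(population):
        # Sample contenders at random and keep a copy of the fittest one.
        contenders = rng.sample(indices, tournament_size)
        winner = max(contenders, key=lambda i: fitnesses[i])
        new_pop.append(dict(population[winner]))
    return new_pop


def mutate_lr(agent, lr_min=6.25e-5, lr_max=1e-2, rng=random):
    """Scale the learning rate by 0.8 or 1.2 and clip it to [lr_min, lr_max]."""
    factor = rng.choice([0.8, 1.2])  # hypothetical step size
    agent["lr"] = min(lr_max, max(lr_min, agent["lr"] * factor))
    return agent


rng = random.Random(42)
population = [{"lr": 1e-4}, {"lr": 6.3e-4}, {"lr": 1e-3}, {"lr": 5e-3}]
fitnesses = [12.0, 180.0, 95.0, -40.0]  # hypothetical evaluation scores

next_gen = tournament_select(population, fitnesses, rng=rng)
# Leave the elite (index 0) untouched and mutate everyone else.
next_gen = [next_gen[0]] + [mutate_lr(agent, rng=rng) for agent in next_gen[1:]]
print(next_gen)
```

Copying each winner before mutating it keeps the elite agent's hyperparameters intact even when the same agent wins several tournaments.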

Or via a YAML manifest:

DQN-LunarLander-v3 manifest (configs/training/dqn/dqn.yaml)
---
algorithm:
    name: DQN
    batch_size: 128
    lr: 6.3e-4
    learn_step: 4
    gamma: 0.99
    tau: 0.001
    double: false
    cudagraphs: false

environment:
    name: LunarLander-v3
    num_envs: 16

mutation:
    probabilities:
        no_mut: 0.4
        arch_mut: 0.2
        new_layer: 0.2
        params_mut: 0.2
        act_mut: 0.2
        rl_hp_mut: 0.2
    rl_hp_selection:
        lr:
            min: 0.0000625
            max: 0.01
        batch_size:
            min: 8
            max: 512
        learn_step:
            min: 1
            max: 10
    mutation_sd: 0.1
    rand_seed: 42

network:
    latent_dim: 128
    arch: mlp
    encoder_config:
        hidden_size:
            - 128
    head_config:
        hidden_size:
            - 128

replay_buffer:
    max_size: 100_000

tournament_selection:
    tournament_size: 2
    elitism: true

training:
    max_steps: 1_000_000
    target_score: 200.0
    pop_size: 4
    evo_steps: 10_000
    eval_steps:
    eval_loop: 1
    learning_delay: 0
    eps_start: 1.0
    eps_end: 0.1
    eps_decay: 0.99

Python

from agilerl import LocalTrainer

trainer = LocalTrainer.from_manifest("configs/training/dqn/dqn.yaml")
population, fitnesses = trainer.train()

CLI

python -m agilerl.train configs/training/dqn/dqn.yaml
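Before launching a run, a manifest can be sanity-checked by hand. The sketch below copies a few values from the manifest above into plain dictionaries and checks that each initial hyperparameter lies inside its mutation range. It also counts how many decay updates epsilon needs to reach its floor, assuming epsilon is multiplied by eps_decay at each update and clipped at eps_end; when AgileRL applies that update is not stated here, and both helper names are hypothetical.

```python
# Values copied from the manifest above.
algorithm = {"lr": 6.3e-4, "batch_size": 128, "learn_step": 4}
rl_hp_selection = {
    "lr": {"min": 0.0000625, "max": 0.01},
    "batch_size": {"min": 8, "max": 512},
    "learn_step": {"min": 1, "max": 10},
}


def out_of_range(initial, limits):
    """Return the hyperparameters whose initial value lies outside its range."""
    return [
        name
        for name, bounds in limits.items()
        if not bounds["min"] <= initial[name] <= bounds["max"]
    ]


def updates_until_floor(eps_start, eps_end, eps_decay):
    """Count multiplicative decay updates needed for epsilon to reach eps_end."""
    eps, updates = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps * eps_decay)  # assumed update rule
        updates += 1
    return updates


print(out_of_range(algorithm, rl_hp_selection))  # []
print(updates_until_floor(1.0, 0.1, 0.99))       # 230
```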

Every aspect of the training pipeline is customisable — from modifying hyperparameters and mutation strategies in our off-the-shelf tools, to implementing your own evolvable algorithms, network architectures, and training loops.

Custom Training Pipelines

For full control over training, you can build each component individually:

Custom RL pipeline example
import torch

from agilerl.algorithms import DQN
from agilerl.utils.utils import make_vect_envs
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.hpo.tournament import TournamentSelection
from agilerl.hpo.mutation import Mutations
from agilerl.training.train_off_policy import train_off_policy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize environment
env = make_vect_envs(env_name="LunarLander-v3", num_envs=16)

# Network configuration
net_config = {
    "latent_dim": 64,
    "encoder_config": {"hidden_size": [64]},
    "head_config": {"hidden_size": [64]}
}

# Algorithm hyperparameters
init_hp = {
    "double": True,
    "batch_size": 256,
    "lr": 1e-3,
    "gamma": 0.99,
    "learn_step": 1,
    "tau": 1e-3
}

# Create a population of DQN agents
population_size = 6
agent_pop = DQN.population(
    size=population_size,
    observation_space=env.single_observation_space,
    action_space=env.single_action_space,
    net_config=net_config,
    device=device,
    **init_hp
)

# Replay buffer
memory = ReplayBuffer(max_size=10_000, device=device)

# Evolutionary HPO
tournament = TournamentSelection(
    tournament_size=2,
    elitism=True,
    population_size=population_size
)
mutations = Mutations(
    no_mutation=0.4,
    architecture=0.2,
    new_layer_prob=0.2,
    parameters=0.2,
    activation=0.0,
    rl_hp=0.2,
    mutation_sd=0.1,
    rand_seed=42,
    device=device,
)

trained_pop, pop_fitnesses = train_off_policy(
    env=env,
    env_name="LunarLander-v3",
    algo="DQN",
    pop=agent_pop,
    memory=memory,
    max_steps=1_000_000,
    evo_steps=10_000,
    target=200.0,
    tournament=tournament,
    mutation=mutations,
)

This approach gives you the flexibility to swap in your own Gymnasium or PettingZoo environments, custom evolvable networks, or entirely custom training loops while still leveraging AgileRL's evolutionary HPO.
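One way to read the Mutations arguments in the example above is as relative weights for choosing a mutation type for each agent. The sketch below illustrates that reading with the standard library only. It is an assumption about how the weights are used, not a description of AgileRL's internals, and it leaves out new_layer_prob on the further assumption that it only matters once an architecture mutation has been chosen.

```python
import random
from collections import Counter

# Weights copied from the Mutations(...) call above.
weights = {
    "no_mutation": 0.4,
    "architecture": 0.2,
    "parameters": 0.2,
    "activation": 0.0,
    "rl_hp": 0.2,
}


def sample_mutations(weights, population_size, rng):
    """Draw one mutation type per agent, proportionally to the given weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=population_size)


rng = random.Random(42)
draws = sample_mutations(weights, population_size=6, rng=rng)
print(draws)

# Over many draws the frequencies approach the weights, and a zero weight
# means that mutation type is never selected.
counts = Counter(sample_mutations(weights, 10_000, rng))
print(counts)
```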

Training on Arena

Arena is the RLOps platform from AgileRL. We provide tools to create and validate custom reinforcement learning environments on the platform and train RL agents on managed cloud infrastructure specifically tailored to RL workloads.

AgileRL ships a Python SDK and a CLI for interacting with the platform through the agilerl-arena package. It is a separate PyPI distribution that contributes the agilerl.arena namespace. Install it directly, or via the AgileRL extra:

pip install agilerl-arena
# or
pip install "agilerl[arena]"

Python

Use the ArenaClient to interact with Arena programmatically from scripts or notebooks:

from agilerl.arena import ArenaClient

client = ArenaClient()
client.login()

# Register and validate a custom environment
client.validate_environment(source="path/to/my_env.py")

# Train on validated custom environment
client.submit_experiment(
    manifest="path/to/manifest.yaml",
    project="my-project",
)

Arena CLI

The same operations are available from the command line:

# Authenticate with Arena
arena login

# Upload and validate
arena env validate --source path/to/my_env.py

# Train on validated custom environment
arena experiments submit path/to/manifest.yaml --project my-project

For the full CLI and Python SDK reference—including authentication, environment validation, experiments, and deployment—see the Arena Client documentation.

Tutorials

We are constantly updating our tutorials to showcase the latest features of AgileRL and how users can leverage our evolutionary HPO to achieve 10x faster hyperparameter optimization. Please see the available tutorials below.

Tutorial Type: Single-agent tasks
Description: Guides for training both on-policy and off-policy agents to beat a variety of Gymnasium environments.
Tutorials: PPO - Acrobot; TD3 - Lunar Lander; Rainbow DQN - CartPole; Recurrent PPO - Masked Pendulum

Tutorial Type: Multi-agent tasks
Description: Use of PettingZoo environments, such as training DQN to play Connect Four with curriculum learning and self-play, and multi-agent tasks in MPE environments.
Tutorials: DQN - Connect Four; MADDPG - Space Invaders; MATD3 - Speaker Listener

Tutorial Type: Hierarchical curriculum learning
Description: Shows how to teach agents skills and combine them to achieve an end goal.
Tutorials: PPO - Lunar Lander

Tutorial Type: Contextual multi-armed bandits
Description: Learn to make the correct decision in environments that only have one timestep.
Tutorials: NeuralUCB - Iris Dataset; NeuralTS - PenDigits

Tutorial Type: Custom Modules & Networks
Description: Learn how to create custom evolvable modules and networks for RL algorithms.
Tutorials: Dueling Distributional Q Network; EvolvableSimBa

Tutorial Type: Training on Arena
Description: Upload and validate custom environments, submit training jobs on managed cloud infrastructure, and deploy trained agents for inference.
Tutorials: PPO - Acrobot Custom Environment

Tutorial Type: LLM Finetuning
Description: Learn how to finetune an LLM using AgileRL.
Tutorials: GRPO

Evolvable Algorithms (more coming soon!)

Single-agent

RL            Algorithm
On-Policy     Proximal Policy Optimization (PPO)
Off-Policy    Deep Q Learning (DQN)
              Rainbow DQN
              Deep Deterministic Policy Gradient (DDPG)
              Twin Delayed Deep Deterministic Policy Gradient (TD3)
Offline       Conservative Q-Learning (CQL)
              Implicit Language Q-Learning (ILQL)

Multi-agent

RL            Algorithm
Multi-agent   Multi-Agent Deep Deterministic Policy Gradient (MADDPG)
              Multi-Agent Twin-Delayed Deep Deterministic Policy Gradient (MATD3)
              Independent Proximal Policy Optimization (IPPO)

Contextual multi-armed bandit

RL            Algorithm
Bandits       Neural Contextual Bandits with UCB-based Exploration (NeuralUCB)
              Neural Contextual Bandits with Thompson Sampling (NeuralTS)

LLM Fine-tuning

RL            Algorithm
On-Policy     Group Relative Policy Optimization (GRPO)
              Clipped Importance Sampling Policy Optimization (CISPO)
              Group Sequence Policy Optimization (GSPO)
              LLM Proximal Policy Optimization (LLM PPO)
              LLM REINFORCE
Off-Policy    Direct Preference Optimization (DPO)

Citing AgileRL

If you use AgileRL in your work, please cite the repository:

@software{Ustaran-Anderegg_AgileRL,
author = {Ustaran-Anderegg, Nicholas and Pratt, Michael and Sabal-Bermudez, Jaime},
license = {Apache-2.0},
title = {{AgileRL}},
url = {https://github.com/AgileRL/AgileRL}
}


