Skip to main content

A lightweight implementation of Policy Dual Averaging (PDA) for reinforcement learning in PyTorch.

Project description

Policy Dual Averaging (PDA)

PyPI Downloads PyPI License

Policy Dual Averaging (PDA) is an on-policy reinforcement learning algorithm with theoretical guarantees and competitive empirical performance. This package is a lightweight PyTorch implementation of PDA and actor-accelerated PDA for problems with discrete and continuous action space. The environment wrappers and on-policy training loops are adapted from CleanRL and Tianshou. We use Gymnasium for environment simulation.

The algorithm is based on paper Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces, and Policy Optimization over General State and Action Spaces.

PDA

On-policy RL methods (e.g., TRPO, PPO) under the Policy Mirror Descent (PMD) framework regularize the new policy toward the previous iterate. PDA instead regularizes (through Bregman divergence $D(\cdot,\cdot)$ ) toward a fixed prox-center policy $\pi_0$ and accumulates all past advantages $\psi^{\pi_t}$ with weights $\beta_t$:

$$ \pi_{k+1}(s) = \arg\min_{a \in \mathcal{U}} \sum_{t=0}^{k} \beta_t , \psi^{\pi_t}(s, a) + \lambda_k , D\bigl(\pi_0(s), a\bigr). $$

Acceleration: PDA's policy is defined implicitly. Every time an action $\pi_{k+1}(s)$ is needed (at every environment step), we have to solve the optimization subproblem over the action space $\mathcal{U}$, which is theoretically clean but computationally challenging for deep RL at scale. Actor-accelerated PDA addresses this by training a neural network actor $\hat\pi_\theta$ to approximate the minimizer of the cumulative regularized objective: $\hat\pi_{k+1}(s;\theta^\pi) \approx \pi_{k+1}(s)$.

Benchmarks

Comparison with on-policy methods: Actor-accelerated PDA consistently outperforms on-policy baselines (PPO, TRPO, NPG) on MuJoCo-v4 and OR-Gymnasium. Hyperparameters are kept fixed across all environments. Benchmarks are conducted using our Tianshou (V1.2.0) implementation using Tianshou tuned hyperparameters. Benchmark results table records average episodic return over the final 5 epochs, 10 seeds × 10 evaluations per seed, 1M (or 3M) environment steps.

Continuous control benchmark (MuJoCo & Box2D): PDA vs PPO / TRPO / NPG

Environment NPG PPO TRPO PDA
HalfCheetah-v4 3556.4 ± 837.9 4067.5 ± 572.3 4496.8 ± 969.7 5174.6 ± 686.7
Ant-v4 2002.5 ± 300.6 2589.5 ± 445.8 2807.1 ± 645.5 3568.2 ± 449.7
Hopper-v4 1650.8 ± 726.4 2329.7 ± 572.9 2017.0 ± 849.0 2944.3 ± 787.4
Walker2d-v4 2923.2 ± 707.2 3277.0 ± 762.6 4128.8 ± 638.1 4367.1 ± 915.9
InvertedPendulum-v4 1000.0 ± 0.0 993.3 ± 20.2 380.3 ± 391.6 1000.0 ± 0.0
InvertedDoublePendulum-v4 7967.2 ± 1205.3 7926.8 ± 1241.8 2723.4 ± 2393.5 9167.5 ± 552.6
Reacher-v4 -5.7 ± 0.9 -18.7 ± 17.5 -6.1 ± 1.0 -4.0 ± 0.4
Swimmer-v4 25.1 ± 9.7 54.2 ± 14.2 35.0 ± 28.3 111.3 ± 29.7
Humanoid-v4 (1M) 745.1 ± 141.4 669.2 ± 107.4 719.5 ± 107.0 760.1 ± 139.9
Humanoid-v4 (3M) 4650.6 ± 686.8 933.3 ± 294.5 4745.1 ± 592.5 5020.0 ± 501.5
HumanoidStandup-v4 38364.7 ± 3926.9 135853.7 ± 21036.7 36737.7 ± 2413.4 161184.5 ± 3275.5
LunarLander-v3 34.1 ± 59.5 204.4 ± 44.7 -83.0 ± 53.3 204.7 ± 54.6
BipedalWalker-v3 212.3 ± 39.8 251.6 ± 44.8 106.3 ± 103.0 149.0 ± 117.4
NewsvendorEnv-v0 (3M) -1.0e7 ± 2.3e7 -37225.9 ± 48015.8 -5.3e7 ± 10.0e7 21857.9 ± 11382.1
PortfolioOptEnv-v0 10095.3 ± 1522.9 10062.3 ± 1230.8 10195.6 ± 1496.7 10550.9 ± 1277.1
InvManagementBacklogEnv-v0 430.4 ± 31.0 496.4 ± 6.8 397.5 ± 25.4 491.6 ± 11.8
InvManagementLostSalesEnv-v0 431.5 ± 7.0 447.4 ± 16.8 432.3 ± 8.8 472.3 ± 10.3

Comparison with off-policy methods: PDA is an on-policy algorithm yet remains competitive with SAC and TD3 on many tasks, with 10–40× faster wall-clock time.

Continuous control benchmark (MuJoCo & Box2D): PDA vs SAC / TD3 / PPO

Environment SAC TD3 PPO PDA
HalfCheetah-v4 11892.1 ± 789.8 10228.4 ± 636.2 4067.5 ± 572.3 5174.6 ± 686.7
Ant-v4 5764.8 ± 358.6 3914.1 ± 1330.9 2589.5 ± 445.8 3568.2 ± 449.7
Hopper-v4 3387.0 ± 205.7 2896.6 ± 801.0 2329.7 ± 572.9 2944.3 ± 787.4
Walker2d-v4 3864.3 ± 816.6 3435.5 ± 712.5 3277.0 ± 762.6 4367.1 ± 915.9
InvertedPendulum-v4 1000.0 ± 0.0 783.5 ± 401.9 993.3 ± 20.2 1000.0 ± 0.0
InvertedDoublePendulum-v4 9183.2 ± 524.0 8134.8 ± 2914.5 7926.8 ± 1241.8 9167.5 ± 552.6
Reacher-v4 -3.6 ± 0.4 -3.7 ± 0.5 -18.7 ± 17.5 -4.0 ± 0.4
Swimmer-v4 43.5 ± 1.5 93.3 ± 25.2 54.2 ± 14.2 111.3 ± 29.7
Humanoid-v4 (1M) 4785.3 ± 953.5 4846.9 ± 437.6 669.2 ± 107.4 760.1 ± 139.9
HumanoidStandup-v4 146249.0 ± 12118.1 107311.1 ± 18845.0 135853.7 ± 21036.7 161184.5 ± 3275.5
LunarLander-v3 284.6 ± 5.1 229.0 ± 98.1 204.4 ± 44.7 204.7 ± 54.6
BipedalWalker-v3 -9.1 ± 14.2 295.2 ± 23.9 251.6 ± 44.8 149.0 ± 117.4
NewsvendorEnv-v0 -19212.1 ± 4213.5 -19758.2 ± 3821.4 -1.9e5 ± 4.1e5 -1.5e5 ± 1.9e5
PortfolioOptEnv-v0 9010.2 ± 1255.1 10062.3 ± 1230.8 10550.9 ± 1277.1
InvManagementBacklogEnv-v0 281.8 ± 37.9 -1575.7 ± 665.3 496.4 ± 6.8 491.6 ± 11.8
InvManagementLostSalesEnv-v0 356.5 ± 13.0 -311.9 ± 469.4 447.4 ± 16.8 472.3 ± 10.3

Algorithm runtime on MuJoCo-v4: Average runtime (5 parallel runs) for 1M environment steps of training and testing. GPU is used for off-policy methods (SAC and TD3) for acceleration due to larger neural network sizes. Times are reported in seconds (mean ± standard deviation). Hardware: 14700K, RTX 4070.

Algorithm MuJoCo-v4 (w/o humanoid) MuJoCo-v4 (humanoid variants)
PDA 423.4 ± 104.8 (1.00x) 1480.7 ± 178.2 (1.00x)
PPO 932.1 ± 120.1 (2.20x) 1960.8 ± 167.2 (1.32x)
NPG 344.1 ± 108.4 (0.81x) 1011.3 ± 137.0 (0.68x)
TRPO 349.5 ± 112.6 (0.83x) 1009.6 ± 127.6 (0.68x)
SAC 20335.9 ± 1878.3 (48.03x) 19531.4 ± 2181.6 (13.19x)
TD3 11571.7 ± 510.6 (27.33x) 29113.7 ± 596.8 (19.66x)

Installation

Install with:

pip install pdarl
pip install "pdarl[mujoco]" # MuJoCo envs
pip install "pdarl[track]" # Weights & Biases
pip install "pdarl[all]" # MuJoCo + W&B

For other envs, see Gymnasium for installation details.

We included CleanRL style single-file RL script for PDA_ACT and PDA_DSC in folder cleanrl_pda. The files can be copied and ran directly when dependencies are installed.

Quickstart

Copy the config folder or create your own and train actor-accelerated PDA (default) with a config file:

pdarl-run --config config/pda_act.yaml

Use the config files for other PDA variants and PMD:

pdarl-run --config config/pda_opt_rgd.yaml
pdarl-run --config config/pda_opt_acfgm.yaml
pdarl-run --config config/pda_bcd.yaml
pdarl-run --config config/pda_dsc.yaml
pdarl-run --config config/pmd_act.yaml

Customization

To customize the agents and training process, customize agents like PDA_ACT and Trainer class directly. See pdarl/trainer.py for more details.

import numpy as np
 
from pdarl.utils.args import load_args_from_yaml
from pdarl.trainer import Trainer
from pdarl import PDA_ACT

args = load_args_from_yaml("config/pda_act.yaml")

class CustomPDA(PDA_ACT):
    # Override agent for custom behavior
    def __init__(self, obs_dim, act_dim, args, device):
        super().__init__(obs_dim, act_dim, args, device)
        print("Custom PDA initialized!")
    
    def compute_act(self, obs):
        return super().compute_act(obs)

class CustomTrainer(Trainer):
    # Override trainer setup to use CustomPDA
    def _setup_agent(self):
        obs_space = self.env.env.observation_space
        act_space = self.env.env.action_space
        obs_dim = int(np.array(obs_space.shape).prod())
        act_dim = int(np.prod(act_space.shape))
        self.agent = CustomPDA(obs_dim, act_dim, self.args, self.device).to(self.device)
        print("Custom Trainer agent changed to CustomPDA!")

trainer = CustomTrainer(args)
trainer.setup()
trainer.run()

Logging

When logging: true (default), TensorBoard scalars are written under runs/. View them with:

tensorboard --logdir runs

Set track: true in a config file to log to Weights & Biases (wandb_project_name, wandb_entity).

Agents

agent_name Description
PDA_ACT PDA with actor acceleration: a learned policy network approximates the action subproblem (default).
PDA_DSC PDA for native discrete action spaces.
PDA_OPT Direct subproblem optimization per action via randomized gradient descent (RGD) or Auto-Conditioned Fast Gradient Method (AC-FGM).
PDA_BCD Direct subproblem optimization per action via Block-Coordinate Descent (BCD) on a discretized action grid.
PMD_ACT Policy Mirror Descent (PMD) with actor acceleration.

Key configs

See pdarl/utils/args.py and the files under config for the full schema.

Parameter Role
exp_name Experiment name used for logging and tracking
agent_name Agent variant (see table above)
seed Random seed for reproducibility
torch_deterministic Whether to use deterministic PyTorch operations
device Compute device for training (e.g., cpu, cuda, or mps)
logging Enable or disable TensorBoard logging
env_id Gymnasium environment id
env_kargs Additional keyword arguments for the environment
total_timesteps Training horizon in environment steps
learning_rate Base learning rate for neural network optimizers
num_envs, num_steps Number of parallel environments and steps per rollout
num_test_envs Number of parallel environments used for evaluation
test_interval How often (in epochs) to run evaluation tests
anneal_lr Whether to linearly anneal the learning rate during training
gamma, gae_lambda Discount factor and Generalized Advantage Estimation lambda
num_minibatches, update_epochs Minibatch splits and number of optimizer passes per rollout
hidden_sizes, activation MLP architecture for value / sum-advantage / actor
max_grad_norm Maximum gradient norm for clipping
obs_norm, ret_norm, adv_norm Observation, return, and advantage normalization
recompute_ret Recompute returns based on value function during updates
step_size PDA regularization coefficient
act_noise Exploration noise scale on actions
action_opt For PDA_OPT: RGD or ACFGM
action_opt_itr Iterations for PDA_OPT / PDA_BCD inner solves
action_opt_params Parameters for the inner action solver
discretization Grid resolution for PDA_BCD

To cite this repository

@misc{gao2026actorpda,
  title={Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces},
  author={Gao, Ji and Ju, Caleb and Lan, Guanghui and Tong, Zhaohui},
  year={2026},
  eprint={2603.10199},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.10199}
}
@misc{ju2022policyoptimizationgeneralstate,
  title={Policy Optimization over General State and Action Spaces},
  author={Ju, Caleb and Lan, Guanghui},
  year={2022},
  eprint={2211.16715},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2211.16715}
}

References

@article{huang2022cleanrl,
  author  = {Shengyi Huang and Rousslan Fernand Julien Dossa and Chang Ye and Jeff Braga and Dipam Chakraborty and Kinal Mehta and João G.M. Araújo},
  title   = {CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {274},
  pages   = {1--18},
  url     = {http://jmlr.org/papers/v23/21-1342.html}
}
@article{weng2022tianshou,
  title   = {Tianshou: A Highly Modularized Deep Reinforcement Learning Library},
  author  = {Weng, Jiayu and Chen, Hang and Yan, Dong and You, Kaiwen and Zhang, Chen and Gong, Yunchang and Zhu, Rongqing and Li, Sijin and Zhou, Yiwen and Liu, Qian and Song, Yueyang},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {267},
  pages   = {1--6},
  url     = {http://jmlr.org/papers/v23/21-1127.html}
}
@misc{towers2024gymnasium,
  author = {Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and De Cola, Gianluca and Deleu, Tristan and Goul{\~a}o, Manuel and Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},
  title  = {Gymnasium: A Standard Interface for Reinforcement Learning Environments},
  year   = {2024},
  eprint = {arXiv:2407.17032}
}
@misc{HubbsOR-Gym,
    author={Christian D. Hubbs and Hector D. Perez and Owais Sarwar and Nikolaos V. Sahinidis and Ignacio E. Grossmann and John M. Wassick},
    title={OR-Gym: A Reinforcement Learning Library for Operations Research Problems},
    year={2020},
    Eprint={arXiv:2008.06319}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdarl-0.1.2.tar.gz (2.9 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pdarl-0.1.2-py3-none-any.whl (32.8 kB view details)

Uploaded Python 3

File details

Details for the file pdarl-0.1.2.tar.gz.

File metadata

  • Download URL: pdarl-0.1.2.tar.gz
  • Upload date:
  • Size: 2.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdarl-0.1.2.tar.gz
Algorithm Hash digest
SHA256 c417a9f367c4b310f81fb6261165eb87e0d4a7ae73259ad03a113384e796b12c
MD5 12d2ecf7409379280b7691babbb155a0
BLAKE2b-256 12fd1fa7e96ef50cd92456c1c29a38dc471cf8246caa880244864887fed0c0e0

See more details on using hashes here.

Provenance

The following attestation bundles were made for pdarl-0.1.2.tar.gz:

Publisher: workflow.yml on JGIoA/pdarl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pdarl-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: pdarl-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 32.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdarl-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 c1311fd9f7ea9ad67e96532fe16e2a979dbac02a2f015f647890fbb42537a479
MD5 7432f330f202609b103816cb17282875
BLAKE2b-256 f13bfe26fbb9c369472d7b6195611c08b99c6b2a4076135a93e5e2e23061f2df

See more details on using hashes here.

Provenance

The following attestation bundles were made for pdarl-0.1.2-py3-none-any.whl:

Publisher: workflow.yml on JGIoA/pdarl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page