A lightweight implementation of Policy Dual Averaging (PDA) for reinforcement learning in PyTorch.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

JGIoA

These details have not been verified by PyPI

Project description

Policy Dual Averaging (PDA)

License

Policy Dual Averaging (PDA) is an on-policy reinforcement learning algorithm with theoretical guarantees and competitive empirical performance. This package is a lightweight PyTorch implementation of PDA and actor-accelerated PDA for problems with discrete and continuous action space. The environment wrappers and on-policy training loops are adapted from CleanRL and Tianshou. We use Gymnasium for environment simulation.

The algorithm is based on paper Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces, and Policy Optimization over General State and Action Spaces.

PDA

On-policy RL methods (e.g., TRPO, PPO) under the Policy Mirror Descent (PMD) framework regularize the new policy toward the previous iterate. PDA instead regularizes (through Bregman divergence $D(\cdot,\cdot)$ ) toward a fixed prox-center policy $\pi_0$ and accumulates all past advantages $\psi^{\pi_t}$ with weights $\beta_t$:

$$ \pi_{k+1}(s) = \arg\min_{a \in \mathcal{U}} \sum_{t=0}^{k} \beta_t , \psi^{\pi_t}(s, a) + \lambda_k , D\bigl(\pi_0(s), a\bigr). $$

Acceleration: PDA's policy is defined implicitly. Every time an action $\pi_{k+1}(s)$ is needed (at every environment step), we have to solve the optimization subproblem over the action space $\mathcal{U}$, which is theoretically clean but computationally challenging for deep RL at scale. Actor-accelerated PDA addresses this by training a neural network actor $\hat\pi_\theta$ to approximate the minimizer of the cumulative regularized objective: $\hat\pi_{k+1}(s;\theta^\pi) \approx \pi_{k+1}(s)$.

Benchmarks

Comparison with on-policy methods: Actor-accelerated PDA consistently outperforms on-policy baselines (PPO, TRPO, NPG) on MuJoCo-v4 and OR-Gymnasium. Hyperparameters are kept fixed across all environments. Benchmarks are conducted using our Tianshou (V1.2.0) implementation using Tianshou tuned hyperparameters. Benchmark results table records average episodic return over the final 5 epochs, 10 seeds × 10 evaluations per seed, 1M (or 3M) environment steps.

Continuous control benchmark (MuJoCo & Box2D): PDA vs PPO / TRPO / NPG

Environment	NPG	PPO	TRPO	PDA
HalfCheetah-v4	3556.4 ± 837.9	4067.5 ± 572.3	4496.8 ± 969.7	5174.6 ± 686.7
Ant-v4	2002.5 ± 300.6	2589.5 ± 445.8	2807.1 ± 645.5	3568.2 ± 449.7
Hopper-v4	1650.8 ± 726.4	2329.7 ± 572.9	2017.0 ± 849.0	2944.3 ± 787.4
Walker2d-v4	2923.2 ± 707.2	3277.0 ± 762.6	4128.8 ± 638.1	4367.1 ± 915.9
InvertedPendulum-v4	1000.0 ± 0.0	993.3 ± 20.2	380.3 ± 391.6	1000.0 ± 0.0
InvertedDoublePendulum-v4	7967.2 ± 1205.3	7926.8 ± 1241.8	2723.4 ± 2393.5	9167.5 ± 552.6
Reacher-v4	-5.7 ± 0.9	-18.7 ± 17.5	-6.1 ± 1.0	-4.0 ± 0.4
Swimmer-v4	25.1 ± 9.7	54.2 ± 14.2	35.0 ± 28.3	111.3 ± 29.7
Humanoid-v4 (1M)	745.1 ± 141.4	669.2 ± 107.4	719.5 ± 107.0	760.1 ± 139.9
Humanoid-v4 (3M)	4650.6 ± 686.8	933.3 ± 294.5	4745.1 ± 592.5	5020.0 ± 501.5
HumanoidStandup-v4	38364.7 ± 3926.9	135853.7 ± 21036.7	36737.7 ± 2413.4	161184.5 ± 3275.5
LunarLander-v3	34.1 ± 59.5	204.4 ± 44.7	-83.0 ± 53.3	204.7 ± 54.6
BipedalWalker-v3	212.3 ± 39.8	251.6 ± 44.8	106.3 ± 103.0	149.0 ± 117.4
NewsvendorEnv-v0 (3M)	-1.0e7 ± 2.3e7	-37225.9 ± 48015.8	-5.3e7 ± 10.0e7	21857.9 ± 11382.1
PortfolioOptEnv-v0	10095.3 ± 1522.9	10062.3 ± 1230.8	10195.6 ± 1496.7	10550.9 ± 1277.1
InvManagementBacklogEnv-v0	430.4 ± 31.0	496.4 ± 6.8	397.5 ± 25.4	491.6 ± 11.8
InvManagementLostSalesEnv-v0	431.5 ± 7.0	447.4 ± 16.8	432.3 ± 8.8	472.3 ± 10.3

Comparison with off-policy methods: PDA is an on-policy algorithm yet remains competitive with SAC and TD3 on many tasks, with 10–40× faster wall-clock time.

Continuous control benchmark (MuJoCo & Box2D): PDA vs SAC / TD3 / PPO

Environment	SAC	TD3	PPO	PDA
HalfCheetah-v4	11892.1 ± 789.8	10228.4 ± 636.2	4067.5 ± 572.3	5174.6 ± 686.7
Ant-v4	5764.8 ± 358.6	3914.1 ± 1330.9	2589.5 ± 445.8	3568.2 ± 449.7
Hopper-v4	3387.0 ± 205.7	2896.6 ± 801.0	2329.7 ± 572.9	2944.3 ± 787.4
Walker2d-v4	3864.3 ± 816.6	3435.5 ± 712.5	3277.0 ± 762.6	4367.1 ± 915.9
InvertedPendulum-v4	1000.0 ± 0.0	783.5 ± 401.9	993.3 ± 20.2	1000.0 ± 0.0
InvertedDoublePendulum-v4	9183.2 ± 524.0	8134.8 ± 2914.5	7926.8 ± 1241.8	9167.5 ± 552.6
Reacher-v4	-3.6 ± 0.4	-3.7 ± 0.5	-18.7 ± 17.5	-4.0 ± 0.4
Swimmer-v4	43.5 ± 1.5	93.3 ± 25.2	54.2 ± 14.2	111.3 ± 29.7
Humanoid-v4 (1M)	4785.3 ± 953.5	4846.9 ± 437.6	669.2 ± 107.4	760.1 ± 139.9
HumanoidStandup-v4	146249.0 ± 12118.1	107311.1 ± 18845.0	135853.7 ± 21036.7	161184.5 ± 3275.5
LunarLander-v3	284.6 ± 5.1	229.0 ± 98.1	204.4 ± 44.7	204.7 ± 54.6
BipedalWalker-v3	-9.1 ± 14.2	295.2 ± 23.9	251.6 ± 44.8	149.0 ± 117.4
NewsvendorEnv-v0	-19212.1 ± 4213.5	-19758.2 ± 3821.4	-1.9e5 ± 4.1e5	-1.5e5 ± 1.9e5
PortfolioOptEnv-v0	9010.2 ± 1255.1	—	10062.3 ± 1230.8	10550.9 ± 1277.1
InvManagementBacklogEnv-v0	281.8 ± 37.9	-1575.7 ± 665.3	496.4 ± 6.8	491.6 ± 11.8
InvManagementLostSalesEnv-v0	356.5 ± 13.0	-311.9 ± 469.4	447.4 ± 16.8	472.3 ± 10.3

Algorithm runtime on MuJoCo-v4: Average runtime (5 parallel runs) for 1M environment steps of training and testing. GPU is used for off-policy methods (SAC and TD3) for acceleration due to larger neural network sizes. Times are reported in seconds (mean ± standard deviation). Hardware: 14700K, RTX 4070.

Algorithm	MuJoCo-v4 (w/o humanoid)	MuJoCo-v4 (humanoid variants)
PDA	423.4 ± 104.8 (1.00x)	1480.7 ± 178.2 (1.00x)
PPO	932.1 ± 120.1 (2.20x)	1960.8 ± 167.2 (1.32x)
NPG	344.1 ± 108.4 (0.81x)	1011.3 ± 137.0 (0.68x)
TRPO	349.5 ± 112.6 (0.83x)	1009.6 ± 127.6 (0.68x)
SAC	20335.9 ± 1878.3 (48.03x)	19531.4 ± 2181.6 (13.19x)
TD3	11571.7 ± 510.6 (27.33x)	29113.7 ± 596.8 (19.66x)

Installation

Install with:

pip install pdarl
pip install "pdarl[mujoco]" # MuJoCo envs
pip install "pdarl[track]" # Weights & Biases
pip install "pdarl[all]" # MuJoCo + W&B

For other envs, see Gymnasium for installation details.

We included CleanRL style single-file RL script for PDA_ACT and PDA_DSC in folder cleanrl_pda. The files can be copied and ran directly when dependencies are installed.

Quickstart

Copy the config folder or create your own and train actor-accelerated PDA (default) with a config file:

pdarl-run --config config/pda_act.yaml

Use the config files for other PDA variants and PMD:

pdarl-run --config config/pda_opt_rgd.yaml
pdarl-run --config config/pda_opt_acfgm.yaml
pdarl-run --config config/pda_bcd.yaml
pdarl-run --config config/pda_dsc.yaml
pdarl-run --config config/pmd_act.yaml

Customization

To customize the agents and training process, customize agents like PDA_ACT and Trainer class directly. See pdarl/trainer.py for more details.

import numpy as np
 
from pdarl.utils.args import load_args_from_yaml
from pdarl.trainer import Trainer
from pdarl import PDA_ACT

args = load_args_from_yaml("config/pda_act.yaml")

class CustomPDA(PDA_ACT):
    # Override agent for custom behavior
    def __init__(self, obs_dim, act_dim, args, device):
        super().__init__(obs_dim, act_dim, args, device)
        print("Custom PDA initialized!")
    
    def compute_act(self, obs):
        return super().compute_act(obs)

class CustomTrainer(Trainer):
    # Override trainer setup to use CustomPDA
    def _setup_agent(self):
        obs_space = self.env.env.observation_space
        act_space = self.env.env.action_space
        obs_dim = int(np.array(obs_space.shape).prod())
        act_dim = int(np.prod(act_space.shape))
        self.agent = CustomPDA(obs_dim, act_dim, self.args, self.device).to(self.device)
        print("Custom Trainer agent changed to CustomPDA!")

trainer = CustomTrainer(args)
trainer.setup()
trainer.run()

Logging

When logging: true (default), TensorBoard scalars are written under runs/. View them with:

tensorboard --logdir runs

Set track: true in a config file to log to Weights & Biases (wandb_project_name, wandb_entity).

Agents

`agent_name`	Description
`PDA_ACT`	PDA with actor acceleration: a learned policy network approximates the action subproblem (default).
`PDA_DSC`	PDA for native discrete action spaces.
`PDA_OPT`	Direct subproblem optimization per action via randomized gradient descent (RGD) or Auto-Conditioned Fast Gradient Method (AC-FGM).
`PDA_BCD`	Direct subproblem optimization per action via Block-Coordinate Descent (BCD) on a discretized action grid.
`PMD_ACT`	Policy Mirror Descent (PMD) with actor acceleration.

Key configs

See pdarl/utils/args.py and the files under config for the full schema.

Parameter	Role
`exp_name`	Experiment name used for logging and tracking
`agent_name`	Agent variant (see table above)
`seed`	Random seed for reproducibility
`torch_deterministic`	Whether to use deterministic PyTorch operations
`device`	Compute device for training (e.g., `cpu`, `cuda`, or `mps`)
`logging`	Enable or disable TensorBoard logging
`env_id`	Gymnasium environment id
`env_kargs`	Additional keyword arguments for the environment
`total_timesteps`	Training horizon in environment steps
`learning_rate`	Base learning rate for neural network optimizers
`num_envs`, `num_steps`	Number of parallel environments and steps per rollout
`num_test_envs`	Number of parallel environments used for evaluation
`test_interval`	How often (in epochs) to run evaluation tests
`anneal_lr`	Whether to linearly anneal the learning rate during training
`gamma`, `gae_lambda`	Discount factor and Generalized Advantage Estimation lambda
`num_minibatches`, `update_epochs`	Minibatch splits and number of optimizer passes per rollout
`hidden_sizes`, `activation`	MLP architecture for value / sum-advantage / actor
`max_grad_norm`	Maximum gradient norm for clipping
`obs_norm`, `ret_norm`, `adv_norm`	Observation, return, and advantage normalization
`recompute_ret`	Recompute returns based on value function during updates
`step_size`	PDA regularization coefficient
`act_noise`	Exploration noise scale on actions
`action_opt`	For `PDA_OPT`: `RGD` or `ACFGM`
`action_opt_itr`	Iterations for `PDA_OPT` / `PDA_BCD` inner solves
`action_opt_params`	Parameters for the inner action solver
`discretization`	Grid resolution for `PDA_BCD`

To cite this repository

@misc{gao2026actorpda,
  title={Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces},
  author={Gao, Ji and Ju, Caleb and Lan, Guanghui and Tong, Zhaohui},
  year={2026},
  eprint={2603.10199},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.10199}
}

@misc{ju2022policyoptimizationgeneralstate,
  title={Policy Optimization over General State and Action Spaces},
  author={Ju, Caleb and Lan, Guanghui},
  year={2022},
  eprint={2211.16715},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2211.16715}
}

References

@article{huang2022cleanrl,
  author  = {Shengyi Huang and Rousslan Fernand Julien Dossa and Chang Ye and Jeff Braga and Dipam Chakraborty and Kinal Mehta and João G.M. Araújo},
  title   = {CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {274},
  pages   = {1--18},
  url     = {http://jmlr.org/papers/v23/21-1342.html}
}

@article{weng2022tianshou,
  title   = {Tianshou: A Highly Modularized Deep Reinforcement Learning Library},
  author  = {Weng, Jiayu and Chen, Hang and Yan, Dong and You, Kaiwen and Zhang, Chen and Gong, Yunchang and Zhu, Rongqing and Li, Sijin and Zhou, Yiwen and Liu, Qian and Song, Yueyang},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {267},
  pages   = {1--6},
  url     = {http://jmlr.org/papers/v23/21-1127.html}
}

@misc{towers2024gymnasium,
  author = {Towers, Mark and Kwiatkowski, Ariel and Terry, Jordan and Balis, John U and De Cola, Gianluca and Deleu, Tristan and Goul{\~a}o, Manuel and Kallinteris, Andreas and Krimmel, Markus and KG, Arjun and others},
  title  = {Gymnasium: A Standard Interface for Reinforcement Learning Environments},
  year   = {2024},
  eprint = {arXiv:2407.17032}
}

@misc{HubbsOR-Gym,
    author={Christian D. Hubbs and Hector D. Perez and Owais Sarwar and Nikolaos V. Sahinidis and Ignacio E. Grossmann and John M. Wassick},
    title={OR-Gym: A Reinforcement Learning Library for Operations Research Problems},
    year={2020},
    Eprint={arXiv:2008.06319}
}

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

JGIoA

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.2

May 24, 2026

0.1.0

May 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdarl-0.1.2.tar.gz (2.9 MB view details)

Uploaded May 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pdarl-0.1.2-py3-none-any.whl (32.8 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file pdarl-0.1.2.tar.gz.

File metadata

Download URL: pdarl-0.1.2.tar.gz
Upload date: May 24, 2026
Size: 2.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdarl-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`c417a9f367c4b310f81fb6261165eb87e0d4a7ae73259ad03a113384e796b12c`
MD5	`12d2ecf7409379280b7691babbb155a0`
BLAKE2b-256	`12fd1fa7e96ef50cd92456c1c29a38dc471cf8246caa880244864887fed0c0e0`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pdarl-0.1.2.tar.gz:

Publisher: workflow.yml on JGIoA/pdarl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdarl-0.1.2.tar.gz
- Subject digest: c417a9f367c4b310f81fb6261165eb87e0d4a7ae73259ad03a113384e796b12c
- Sigstore transparency entry: 1624557301
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: JGIoA/pdarl@2fa321e0175fd6d5f4e21aeab39fb3e061010d1b
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/JGIoA
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@2fa321e0175fd6d5f4e21aeab39fb3e061010d1b
- Trigger Event: push

File details

Details for the file pdarl-0.1.2-py3-none-any.whl.

File metadata

Download URL: pdarl-0.1.2-py3-none-any.whl
Upload date: May 24, 2026
Size: 32.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdarl-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c1311fd9f7ea9ad67e96532fe16e2a979dbac02a2f015f647890fbb42537a479`
MD5	`7432f330f202609b103816cb17282875`
BLAKE2b-256	`f13bfe26fbb9c369472d7b6195611c08b99c6b2a4076135a93e5e2e23061f2df`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pdarl-0.1.2-py3-none-any.whl:

Publisher: workflow.yml on JGIoA/pdarl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdarl-0.1.2-py3-none-any.whl
- Subject digest: c1311fd9f7ea9ad67e96532fe16e2a979dbac02a2f015f647890fbb42537a479
- Sigstore transparency entry: 1624557338
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: JGIoA/pdarl@2fa321e0175fd6d5f4e21aeab39fb3e061010d1b
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/JGIoA
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@2fa321e0175fd6d5f4e21aeab39fb3e061010d1b
- Trigger Event: push

pdarl 0.1.2

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Policy Dual Averaging (PDA)

PDA

Benchmarks

Installation

Quickstart

Customization

Logging

Agents

Key configs

To cite this repository

References

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance