Multi-Agent RLRM Framework

Multi-Agent RLRM

Introduction

The Multi-Agent RLRM (Reinforcement Learning with Reward Machines) Framework is a library for formulating multi-agent problems and solving them with reinforcement learning. It integrates Reward Machines (RMs), which give a modular and flexible way to define complex tasks as a set of objectives and rules.

Installation

Option A — PyPI (recommended)

pip install multiagent-rl-rm

The import path is multiagent_rlrm (with an underscore), which differs from the distribution name multiagent-rl-rm.
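
Because the distribution name and the import name differ, a quick check after installation can confirm that the module resolves. The sketch below uses only the standard library; the helper name is_importable is illustrative and not part of the framework.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Expected to report True once `pip install multiagent-rl-rm` has succeeded.
    print("multiagent_rlrm importable:", is_importable("multiagent_rlrm"))
```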

Option B — From source (for development)

To install the framework, follow these steps:

git clone https://github.com/Alee08/multi-agent-rl-rm.git
cd multi-agent-rl-rm
pip install -r requirements.txt
pip install -e .

Installation with Docker

Build the container image from the repository root, then start an interactive Python session inside it:

docker build -f docker/Dockerfile -t multiagent-rlrm .
docker run --rm -it multiagent-rlrm python

More details (compose, examples, troubleshooting) are available in docker/README.md.

Usage

Below is a compact end-to-end example for two agents in the Frozen Lake environment, each with its own Reward Machine (RM) and tabular Q-learning.

Step 1: Environment Setup

First, import the necessary modules and initialize the MultiAgentFrozenLake environment with the grid size and hole locations. Here, holes is the list of cell coordinates that the agents must avoid. The attributes set after construction enable stochastic (slippery) dynamics, set the penalty for falling into a hole, and optionally turn on a "wait" bias.

from multiagent_rlrm.environments.frozen_lake.ma_frozen_lake import MultiAgentFrozenLake
from multiagent_rlrm.environments.frozen_lake.action_encoder_frozen_lake import ActionEncoderFrozenLake

W, H = 10, 10
holes = [(2,3), (2,4), (7,0), (7,1), (7,2), (7,3), (7,4), (7,8)]
env = MultiAgentFrozenLake(width=W, height=H, holes=holes)
env.frozen_lake_stochastic = True      # slip/stochastic dynamics
env.penalty_amount = 0      # penalty when falling into a hole
env.delay_action = False    # optional "wait" bias if True

Step 2: Define Agents and Action/State Encoders

Create agent instances, set their initial positions, and attach domain-specific encoders for state and actions. In Frozen Lake, the StateEncoderFrozenLake maps grid positions (and RM state) to tabular indices, while ActionEncoderFrozenLake registers the discrete actions (up, down, left, right) for each agent. Finally, register the agents with the environment so reset/step include them.

from multiagent_rlrm.multi_agent.agent_rl import AgentRL
from multiagent_rlrm.multi_agent.action_rl import ActionRL
from multiagent_rlrm.environments.frozen_lake.state_encoder_frozen_lake import StateEncoderFrozenLake

a1, a2 = AgentRL("a1", env), AgentRL("a2", env)
a1.set_initial_position(4, 0)
a2.set_initial_position(6, 2)

for ag in (a1, a2):
    ag.add_state_encoder(StateEncoderFrozenLake(ag))
    ag.add_action_encoder(ActionEncoderFrozenLake(ag))

env.add_agent(a1)
env.add_agent(a2)

Step 3: Define Reward Machines (one per agent)

You define the task as a small automaton (the Reward Machine). The PositionEventDetector turns grid visits into events; here, reaching (4,4) triggers a transition from q0→q1 (+0), then reaching (0,0) triggers q1→qf (+1, final). Each agent gets its own RM (rm1, rm2), so progress and rewards are tracked independently even in the same environment. This cleanly separates what should be achieved (waypoints/sequence) from how the agent moves in a stochastic world, and you can extend it by adding more waypoints, branches, or different detectors.

from multiagent_rlrm.multi_agent.reward_machine import RewardMachine
from multiagent_rlrm.environments.frozen_lake.detect_event import PositionEventDetector
# Define Reward Machine transitions
# visit cells in sequence to progress and collect rewards
e1, e2 = (4,4), (0,0)

# {(current_state, event): (next_state, reward)}
transitions = {
    ("q0", e1): ("q1", 0),
    ("q1", e2): ("qf", 1),  # final RM state
}
detector = PositionEventDetector({e1, e2})

rm1 = RewardMachine(transitions, detector)
rm2 = RewardMachine(transitions, detector)
a1.set_reward_machine(rm1)
a2.set_reward_machine(rm2)
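
To make the automaton semantics concrete, the following is a minimal standalone sketch of how a transitions dictionary of this shape can be stepped. It is not the framework's RewardMachine class: the function rm_step is hypothetical, every visited cell is treated as a candidate event, and the choice to self-loop with zero reward on unmatched events is an assumption.

```python
from typing import Dict, Hashable, Tuple

Transitions = Dict[Tuple[str, Hashable], Tuple[str, float]]


def rm_step(transitions: Transitions, state: str, event: Hashable) -> Tuple[str, float]:
    """Advance the automaton by one event.

    Assumption: a (state, event) pair with no entry leaves the state
    unchanged and yields zero reward.
    """
    return transitions.get((state, event), (state, 0.0))


e1, e2 = (4, 4), (0, 0)
transitions = {
    ("q0", e1): ("q1", 0),
    ("q1", e2): ("qf", 1),
}

# Replay a sequence of visited cells and accumulate the RM reward.
state, total = "q0", 0.0
for cell in [(0, 0), (4, 4), (3, 4), (0, 0)]:
    state, reward = rm_step(transitions, state, cell)
    total += reward

print(state, total)  # qf 1.0
```

Note that the first visit to (0, 0) has no effect because the machine is still in q0; only the visit after (4, 4) completes the task. This ordering constraint is what makes the reward non-Markovian with respect to position alone.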

Step 4: Wrap env with RM and set learners

Wrap the base environment with RMEnvironmentWrapper so RM logic is applied automatically at every step: it detects events, updates each agent’s RM state, and merges env reward + RM reward (and termination). The learner’s state size must include RM states (W*H*rm.numbers_state()), because policies depend on both position and RM progress. Assign a separate Q-learning instance per agent. Optional knobs: use_qrm=True for counterfactual RM updates and use_rsh=True for potential-based shaping.

from multiagent_rlrm.multi_agent.wrappers.rm_environment_wrapper import RMEnvironmentWrapper
from multiagent_rlrm.learning_algorithms.qlearning import QLearning

rm_env = RMEnvironmentWrapper(env, [a1, a2])

def make_ql(rm):  # state size includes RM states
    return QLearning(
        state_space_size=W * H * rm.numbers_state(),
        action_space_size=4,
        learning_rate=0.2,
        gamma=0.99,
        action_selection="greedy",
        epsilon_start=0.01, epsilon_end=0.01, epsilon_decay=0.9995,
        use_qrm=True, use_rsh=False  # optional: counterfactuals & RM shaping
    )

a1.set_learning_algorithm(make_ql(rm1))
a2.set_learning_algorithm(make_ql(rm2))
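
The state_space_size of W * H * rm.numbers_state() corresponds to the product of grid cells and RM states. One way to flatten such a product state into a single table index is sketched below. The function encode_state, the row-major ordering, and the assumption of three RM states (q0, q1, qf) are illustrative; StateEncoderFrozenLake may lay out indices differently.

```python
def encode_state(x: int, y: int, rm_index: int, width: int, height: int) -> int:
    """Map (x, y, rm_index) to a unique integer in [0, width * height * n_rm_states)."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("position outside the grid")
    cell = y * width + x                      # row-major index of the grid cell
    return rm_index * width * height + cell   # one block of cells per RM state


W, H, N_RM = 10, 10, 3                        # assumed: 3 RM states (q0, q1, qf)
indices = {
    encode_state(x, y, q, W, H)
    for q in range(N_RM) for y in range(H) for x in range(W)
}
# 300 inputs cover exactly the 300 table rows, so the mapping is one-to-one.
assert indices == set(range(W * H * N_RM))
```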

Step 5: Training Loop

Standard episodic loop. On each episode, reset initializes env + each agent’s RM state. Every step: each agent picks an action from the raw env state; the wrapped env executes them, detects events, and returns env+RM rewards plus per-agent termination flags. Then each agent calls update_policy(...) to learn from (s, a, r, s') (the learner/encoder handle RM progress internally). The loop stops when all agents are done (hole/time-limit or final RM state).

import copy

EPISODES = 1000
for ep in range(EPISODES):
    obs, infos = rm_env.reset(seed=123 + ep)
    done = {ag.name: False for ag in rm_env.agents}

    while not all(done.values()):
        actions = {}
        for ag in rm_env.agents:
            s = rm_env.env.get_state(ag)          # raw env state for the agent
            actions[ag.name] = ag.select_action(s)

        next_obs, rewards, terms, truncs, infos = rm_env.step(actions)

        for ag in rm_env.agents:
            terminated = terms[ag.name] or truncs[ag.name]
            ag.update_policy(
                state=obs[ag.name],
                action=actions[ag.name],
                reward=rewards[ag.name],           # env + RM reward
                next_state=next_obs[ag.name],
                terminated=terminated,
                infos=infos[ag.name],              # includes RM fields
            )
            done[ag.name] = terminated

        obs = copy.deepcopy(next_obs)

In this loop, each agent observes its state, selects an action, and learns from the outcome. The rm_env.step(actions) call executes the joint action, detects events, advances each agent's RM, and returns the new observations, rewards, and termination flags. Policy updates are not part of step; they happen in the separate update_policy(...) call for each agent.
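
For reference, the (s, a, r, s') update that a tabular Q-learning learner applies can be sketched as follows. This is the textbook rule on a plain list-of-lists table, using the learning rate and discount from Step 4 as defaults; the function q_update is hypothetical and is not the framework's QLearning implementation.

```python
from typing import List


def q_update(q: List[List[float]], s: int, a: int, r: float, s_next: int,
             terminated: bool, alpha: float = 0.2, gamma: float = 0.99) -> float:
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Do not bootstrap from the next state once the episode has terminated.
    target = r if terminated else r + gamma * max(q[s_next])
    td_error = target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


q_table = [[0.0] * 4 for _ in range(3)]       # 3 states, 4 actions
q_update(q_table, s=1, a=2, r=1.0, s_next=2, terminated=True)
q_update(q_table, s=0, a=0, r=0.0, s_next=1, terminated=False)
print(q_table[1][2], q_table[0][0])
```

The second call shows how value propagates backwards: state 0 receives no reward of its own, but its estimate rises because state 1 already has a positive value.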

Implemented learning algorithms

All algorithms live in multiagent_rlrm/learning_algorithms and expose a common interface via choose_action(...) and update(...).

Algorithm	Type	Short description
`QLearning`	Model-free, tabular	Standard tabular Q-learning with ε-greedy or softmax exploration; supports Reward Machines via QRM-style counterfactual updates and optional potential-based reward shaping.
`QLambda`	Model-free, tabular, eligibility traces	Q-learning with eligibility traces (λ): propagates TD errors backwards along recent state–action pairs, enabling faster credit assignment over multi-step trajectories and often speeding up learning in sparse-reward settings.
`QRM`	Model-free, RM-aware	Q-learning over Reward Machines: augments the state with the RM automaton state and uses counterfactual updates across compatible automaton states to reuse experience under non-Markovian rewards.
`RMax`	Model-based, optimistic	Classic R-Max algorithm: learns an explicit tabular transition/reward model, treats unknown state–action pairs as maximally rewarding, and plans via value iteration to drive directed exploration.
`RMaxRM`	Model-based, RM-aware	R-Max on the product space S×Q: uses the Reward Machine to augment the MDP state but does not factorise environment and automaton dynamics; serves as an RM-aware model-based baseline.
`QRMax`	Model-based, factored, RM-aware	R-Max-style model-based RL for non-Markovian rewards via Reward Machines; factorises environment dynamics and RM dynamics, reuses each learned environment transition across RM states, and preserves PAC-style sample-efficiency guarantees. The algorithm only requires the current RM state and reward signal, not the full RM description.
`QRMaxRM`	Model-based, RM-aware (extra RM experience)	Extension of `QRMax` that also leverages additional experience generated from the known Reward Machine, applying the same factorised updates to both real and counterfactual transitions to further improve sample efficiency.
`PSRL`	Model-based, posterior sampling	Posterior Sampling for RL (Thompson sampling over MDPs): maintains Bayesian posteriors over transitions and rewards, samples an MDP each episode, and follows its optimal policy.
`OPSRL`	Model-based, optimistic posterior sampling	Optimistic PSRL variant with Dirichlet/Beta priors and optimistic treatment of under-explored transitions, encouraging exploration by biasing sampled models toward rewarding but uncertain dynamics.
`UCBVI`	Model-based, UCB-style (base class)	Base implementation of tabular UCB Value Iteration for finite-horizon MDPs: empirical models plus step-wise exploration bonuses and backward value iteration. Concrete variants differ only in the bonus definition.
`UCBVI-sB`	Model-based, UCBVI (simplified Bernstein)	UCBVI with simplified Bernstein bonuses as in Azar et al. (2017), trading off tightness of confidence intervals and implementation simplicity.
`UCBVI-B`	Model-based, UCBVI (Bernstein)	UCBVI variant using full Bernstein-style bonuses, yielding tighter confidence bounds and typically stronger theoretical guarantees.
`UCBVI-H`	Model-based, UCBVI (Hoeffding)	UCBVI variant with Hoeffding-style bonuses, using simpler but more conservative confidence intervals.
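
The counterfactual idea behind QRM-style updates can be illustrated in a few lines: a single environment step (s, a, s') and its detected event are replayed from every RM state, so each automaton state learns from the same experience. The sketch below is a simplified illustration under stated assumptions (a dictionary-based Q-table, the RM given as a transitions dictionary, unmatched events self-looping with zero reward). The function qrm_update is hypothetical and is not the framework's QRM class.

```python
from collections import defaultdict

ACTIONS = range(4)


def qrm_update(q, transitions, rm_states, final_states, s, a, s_next, event,
               alpha=0.2, gamma=0.99):
    """Update Q for every RM state u as if the agent had been in (s, u)."""
    for u in rm_states:
        if u in final_states:
            continue                          # nothing to learn once the task is done
        # Assumption: unmatched (u, event) pairs self-loop with zero reward.
        u_next, r = transitions.get((u, event), (u, 0.0))
        if u_next in final_states:
            target = r                        # final RM state: no bootstrapping
        else:
            target = r + gamma * max(q[(s_next, u_next, b)] for b in ACTIONS)
        q[(s, u, a)] += alpha * (target - q[(s, u, a)])


transitions = {("q0", (4, 4)): ("q1", 0.0), ("q1", (0, 0)): ("qf", 1.0)}
q = defaultdict(float)
# One real step that lands on (0, 0), replayed from both q0 and q1.
qrm_update(q, transitions, ["q0", "q1", "qf"], {"qf"},
           s=(0, 1), a=0, s_next=(0, 0), event=(0, 0))
print(q[((0, 1), "q1", 0)], q[((0, 1), "q0", 0)])
```

After this single step, the entry for RM state q1 has moved towards the reward of 1 while the entry for q0 is unchanged, even if the agent was actually in q0 when the step was taken.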

License

Multi-Agent RLRM is released under the Apache 2.0 License.
See the LICENSE file for details.

Release history

0.2.0: Dec 16, 2025
0.1.2: Dec 6, 2025 (this version)
0.1.1: Oct 27, 2025
0.1: Oct 27, 2025
