A PyTorch implementation of Pensieve using Proximal Policy Optimization (PPO) for neural adaptive video streaming

Pensieve PPO

A user-friendly PyTorch implementation of Pensieve [1], a neural adaptive video streaming system. This implementation uses Proximal Policy Optimization (PPO) instead of the original A3C algorithm, achieving improved training stability and comparable performance.

Features

Modern PyTorch Implementation: Clean, modular codebase using PyTorch 2.0+
Gymnasium Environment: Standard RL environment interface for easy integration
Parallel Training: Multi-worker distributed training support
TensorBoard Integration: Real-time training visualization
Extensible Architecture: Easy to add new agents and environments

Installation

From Source

git clone https://github.com/yindaheng98/Pensieve-PPO.git
cd Pensieve-PPO
pip install -e .
# or, to install into the current directory without dependencies:
pip install --target . --upgrade . --no-deps

Dependencies

pip install torch numpy gymnasium tensorboard tqdm transformers peft

Quick Start

Using the Gymnasium Environment

from pensieve_ppo.gym import ABREnv
from pensieve_ppo.defaults import create_env_with_default

# Create environment with default Pensieve parameters
env = create_env_with_default(train=True)

# Standard Gymnasium API
obs, info = env.reset()
while True:
    action = env.action_space.sample()  # or use your agent
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

Training a PPO Agent

# Train with default settings
python -m pensieve_ppo.train

# Train with custom parameters
python -m pensieve_ppo.train \
    --parallel-workers 16 \
    --train-epochs 500000 \
    --model-save-interval 300 \
    --output-dir ./ppo

Testing a Trained Model

# Test with a trained model
python -m pensieve_ppo.test --model-path ./ppo/nn_model_ep_300.pth

# Test with custom trace folder
python -m pensieve_ppo.test \
    --model-path ./ppo/nn_model_ep_300.pth \
    --test-trace-folder ./src/test/

Data path note: By default, the training traces, test traces, and video chunk size files are read from the legacy src/ tree: ./src/train/, ./src/test/, and files whose names start with ./src/envivio/video_size_. Keep those folders in place when using the default configuration, or pass explicit --train-trace-folder and --test-trace-folder arguments and a video_size_file_prefix value.

Package Structure

pensieve_ppo/
├── core/                  # Core simulation components
│   ├── trace/             # Network trace handling (bandwidth, latency)
│   ├── video/             # Video processing (chunk sizes, bitrates, playback)
│   └── simulator/         # Combines trace & video into ABR simulator
├── gym/                   # Gymnasium environment wrapper
│   ├── env.py             # ABREnv - standard RL environment interface
│   └── imitate.py         # Imitation observer wrapper
├── agent/                 # Agent implementations and training loops
│   ├── abc.py             # Abstract base classes
│   ├── registry.py        # Agent factory and registration
│   ├── trainer.py         # Distributed training framework
│   ├── bba/               # Buffer-based baseline
│   ├── mpc/               # MPC and oracle MPC baselines
│   ├── netllm/            # NetLLM-style Decision Transformer agents
│   └── rl/                # PPO, A3C, and DQN agents
├── exp_pool/              # Experience pool data and offline trainer
├── defaults.py            # Default parameters and factory functions
├── train.py               # Training script
└── test.py                # Testing script

Architecture

Agent Class Hierarchy

The agent system follows a hierarchical inheritance structure:

AbstractAgent
    ├── select_action(state) -> (action, action_prob)
    │
    └── AbstractTrainableAgent
        ├── select_action_for_training(state) -> (action, action_prob)
        ├── produce_training_batch(trajectory, done) -> TrainingBatch
        ├── train_batch(training_batches, epoch) -> metrics
        ├── get_params() -> params
        ├── set_params(params) -> None
        ├── save(path) -> None
        └── load(path) -> None
            │
            └── AbstractRLAgent
                ├── train(s_batch, a_batch, p_batch, v_batch, epoch) -> metrics
                ├── compute_v(s_batch, a_batch, r_batch, terminal) -> v_batch
                ├── produce_training_batch(trajectory, done) -> RLTrainingBatch  # Implemented
                └── train_batch(training_batches, epoch) -> metrics  # Implemented

AbstractAgent (pensieve_ppo.agent.abc.AbstractAgent):

Base class for all agents
Defines the minimal interface: select_action(state) method
Used for inference-only agents that don't require training

AbstractTrainableAgent (pensieve_ppo.agent.trainable.AbstractTrainableAgent):

Extends AbstractAgent with training infrastructure
Adds methods for:
- Training-time action selection (select_action_for_training)
- Converting trajectories to training batches (produce_training_batch)
- Training on batches (train_batch)
- Model persistence (save, load, get_params, set_params)
Abstract methods must be implemented by subclasses

AbstractRLAgent (pensieve_ppo.agent.rl.abc.AbstractRLAgent):

Extends AbstractTrainableAgent with RL-specific functionality
Implements produce_training_batch and train_batch using RL methods
Requires subclasses to implement:
- train(): Core training logic (e.g., PPO, A3C, DQN)
- compute_v(): Value target computation (returns/advantages)
Concrete implementations: PPOAgent, A3CAgent, DQNAgent, etc.
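
As a rough illustration of the hierarchy above, the sketch below rebuilds the first two layers with Python's abc module. The classes are local stand-ins rather than the package's own; the method bodies, the integer action index, the hypothetical ConstantAgent, and the omission of the persistence methods are all assumptions made for brevity.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Tuple

State = Any  # mirrors the documented alias State = Any


class AbstractAgent(ABC):
    """Stand-in for the inference-only base class."""

    @abstractmethod
    def select_action(self, state: State) -> Tuple[int, List[float]]:
        ...


class AbstractTrainableAgent(AbstractAgent):
    """Stand-in that adds the training hooks (save/load/params omitted)."""

    def select_action_for_training(self, state: State):
        # Assumption: fall back to the inference policy by default.
        return self.select_action(state)

    @abstractmethod
    def produce_training_batch(self, trajectory, done):
        ...

    @abstractmethod
    def train_batch(self, training_batches, epoch):
        ...


class ConstantAgent(AbstractAgent):
    """Hypothetical inference-only agent: always picks the lowest bitrate."""

    def __init__(self, action_dim: int = 6):
        self.action_dim = action_dim

    def select_action(self, state: State):
        probs = [0.0] * self.action_dim
        probs[0] = 1.0
        return 0, probs


agent = ConstantAgent()
action, action_prob = agent.select_action(state=None)
print(action, action_prob)  # 0 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

An inference-only agent needs nothing beyond select_action, while instantiating the trainable layer without its abstract methods raises TypeError.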

Note on RL Agent Implementations: The agents in pensieve_ppo/agent/rl/ (PPO, A3C, DQN) are reinforcement learning algorithms, not imitation learning algorithms. While the framework technically allows running them with pensieve_ppo.imitate, this is not recommended as these algorithms are designed to learn from rewards through environment interaction, not from teacher demonstrations.

Note on A3C Implementation: The A3CAgent implementation is based on the A3C (Asynchronous Advantage Actor-Critic) algorithm, but the actual Trainer performs synchronous updates rather than asynchronous updates. This means all workers synchronize before each parameter update, which differs from the original A3C paper's asynchronous design.

Agent Statefulness

Design Principle: Logically, an Agent should be stateless. All historical information needed by the Agent should be collected by the Observer and passed through the State object. However, in some special cases, an Agent may need to maintain its own "internal state".

Stateless Agents: The agents in pensieve_ppo/agent/rl/ (PPO, A3C, DQN), pensieve_ppo/agent/mpc/, and pensieve_ppo/agent/bba/ are all stateless - they do not maintain any "internal state" between select_action calls. Each action is computed purely from the current input state.

"Stateful" Agents: In certain cases, agents need to maintain "internal state". For example, in NetLLM (pensieve_ppo/agent/netllm/), the large language model needs to cache embeddings of historical states to avoid redundant computation. Since the embedding model is a trainable part of the policy, it cannot be moved into the Observer. In such cases, the Agent must maintain its own "internal state" (again, not the actual environment state), i.e., some special internal data structures that accelerate computation.

Reference: The NetLLM implementation follows the architecture from NetLLM's OfflineRLPolicy, which uses deques (states_dq, returns_dq, actions_dq) to cache embeddings for autoregressive inference.

Technical Details: The "internal state" in NetLLM is an embedding cache, not the actual environment state; it only borrows a "state-style" management approach to maintain internal acceleration data structures. In principle, the agent should support out-of-order select_action calls by looking up pre-computed embeddings based on the input state. However, since the current codebase never calls select_action out of order, the implementation assumes sequential calls only.

Tradeoffs of using "state management" to handle embedding caches:

Pros: Eliminates embedding cache lookup steps; allows precise control of cache size since we know exactly which embeddings are needed; better performance optimization.
Cons: If the same state appears at distant timesteps, the embedding must be recomputed rather than retrieved from cache.

Reset Method: The AbstractAgent.reset() method should be called at the beginning of each episode to clear any "internal state" (e.g., embedding caches). For stateless agents, this is a no-op. For "stateful" agents like NetLLM, this clears the embedding caches via clear_dq().
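
The idea of a bounded, sequential embedding cache with a per-episode reset can be sketched as follows. This is not the NetLLM code: CachingAgent, embeddings_dq, the toy embedding, and the toy policy are hypothetical names and logic chosen only to show the pattern.

```python
from collections import deque


class CachingAgent:
    """Hypothetical agent that caches per-step embeddings in a bounded deque."""

    def __init__(self, history_len: int = 3):
        # maxlen makes the deque drop the oldest embedding automatically.
        self.embeddings_dq = deque(maxlen=history_len)

    def _embed(self, state):
        # Placeholder for an expensive, trainable embedding model.
        return sum(state) / len(state)

    def select_action(self, state):
        # Only the newest state is embedded; older embeddings are reused.
        self.embeddings_dq.append(self._embed(state))
        context = list(self.embeddings_dq)
        # Toy policy: action 1 when the mean cached embedding exceeds 0.5.
        action = int(sum(context) / len(context) > 0.5)
        return action, context

    def reset(self):
        # Call at the start of each episode to drop stale embeddings.
        self.embeddings_dq.clear()


agent = CachingAgent(history_len=3)
for state in ([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]):
    action, context = agent.select_action(state)
print(action, context)  # 1 [1.0, 1.0, 1.0]

agent.reset()
print(len(agent.embeddings_dq))  # 0
```

Because the cache is ordered by arrival rather than keyed by state, it only works for sequential calls, which matches the assumption described above.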

Agent, Observer, and Trainer Relationships

AbstractABRStateObserver (pensieve_ppo.gym.env.AbstractABRStateObserver):

Abstract interface for state observation and reward calculation
Decouples state representation from environment dynamics
Methods:
- reset(env, initial_bit_rate) -> (state, info): Initialize state
- observe(env, bit_rate, result) -> (state, reward, info): Update state and compute reward
Implementations:
- RLABRStateObserver: For RL agents (returns np.ndarray states)
- MPCABRStateObserver: For MPC-based agents (returns MPCState dataclass)
- NetLLMABRStateObserver: For NetLLM agents (returns NetLLMState)

ABREnv (pensieve_ppo.gym.env.ABREnv):

Gymnasium-compatible environment wrapper
Uses an AbstractABRStateObserver instance to:
- Observe states from simulator results
- Compute rewards based on actions and results
The observer is injected via constructor, allowing different state representations for different agent types
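
A minimal sketch of observer injection, assuming the reset/observe signatures listed above. ObserverSketch, LastBitrateObserver, ToyEnv, the dict-shaped result, and the initial bitrate of 1 are illustrative stand-ins, not the package's classes or defaults.

```python
from abc import ABC, abstractmethod


class ObserverSketch(ABC):
    """Stand-in for the observer interface: state and reward live here."""

    @abstractmethod
    def reset(self, env, initial_bit_rate):
        ...

    @abstractmethod
    def observe(self, env, bit_rate, result):
        ...


class LastBitrateObserver(ObserverSketch):
    """Hypothetical observer whose state is just the last chosen level."""

    def reset(self, env, initial_bit_rate):
        return [initial_bit_rate], {}

    def observe(self, env, bit_rate, result):
        # Toy reward: quality in Mbps minus a rebuffering penalty.
        reward = env.levels_quality[bit_rate] / 1000.0 - 4.3 * result["rebuffer"]
        return [bit_rate], reward, {}


class ToyEnv:
    """Stand-in for ABREnv: dynamics here, representation in the observer."""

    levels_quality = [300.0, 750.0, 1200.0, 1850.0, 2850.0, 4300.0]  # Kbps

    def __init__(self, observer):
        self.observer = observer  # injected via the constructor

    def reset(self):
        return self.observer.reset(self, initial_bit_rate=1)

    def step(self, action):
        result = {"rebuffer": 0.0}  # a real simulator would compute this
        state, reward, info = self.observer.observe(self, action, result)
        return state, reward, False, False, info


env = ToyEnv(LastBitrateObserver())
state, info = env.reset()
state, reward, terminated, truncated, info = env.step(2)
print(state, round(reward, 3))  # [2] 1.2
```

Swapping in a different observer changes the state type the agent sees without touching the environment dynamics.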

Trainer (pensieve_ppo.agent.trainer.Trainer):

Coordinates distributed training with multiple parallel workers
Architecture:
- Central Agent: Aggregates experiences from workers, updates model, distributes parameters
- Worker Agents: Collect experiences by interacting with environments
Uses AbstractTrainableAgent interface:
- Workers call select_action_for_training() for exploration
- Workers call produce_training_batch() to convert trajectories
- Central agent calls train_batch() to update the model
- Parameters synchronized via get_params() and set_params()

Relationship Flow:

Trainer
  ├── Creates multiple (env, agent) pairs via factories
  ├── Workers: env.step(action) -> observer.observe() -> state, reward
  ├── Workers: agent.select_action_for_training(state) -> action
  ├── Workers: agent.produce_training_batch(trajectory) -> TrainingBatch
  └── Central: agent.train_batch(batches) -> updates model
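
The flow above can be sketched in a single process. The real Trainer runs workers in parallel; here ToyAgent, rollout, and the tuple-shaped trajectory are hypothetical simplifications that keep only the broadcast, collect, and synchronous update steps.

```python
class ToyAgent:
    """Hypothetical agent whose only parameter is a running reward estimate."""

    def __init__(self):
        self.param = 0.0

    def get_params(self):
        return self.param

    def set_params(self, params):
        self.param = params

    def select_action_for_training(self, state):
        return 0, [1.0]

    def produce_training_batch(self, trajectory, done):
        # Keep only the rewards; a real batch would carry states and actions.
        return [reward for (_state, _action, reward) in trajectory]

    def train_batch(self, training_batches, epoch):
        rewards = [r for batch in training_batches for r in batch]
        self.param = sum(rewards) / len(rewards)
        return {"mean_reward": self.param}


def rollout(agent, worker_id, steps=4):
    """Stand-in for env interaction: the reward depends only on worker_id."""
    trajectory = []
    for step in range(steps):
        action, _prob = agent.select_action_for_training(state=step)
        trajectory.append((step, action, float(worker_id)))
    return trajectory


central = ToyAgent()
workers = [ToyAgent() for _ in range(3)]
for epoch in range(2):
    batches = []
    for worker_id, worker in enumerate(workers):
        worker.set_params(central.get_params())   # broadcast parameters
        trajectory = rollout(worker, worker_id)    # collect experience
        batches.append(worker.produce_training_batch(trajectory, done=True))
    metrics = central.train_batch(batches, epoch)  # one synchronous update

print(metrics)  # {'mean_reward': 1.0}
```

Every worker finishes its rollout before the central agent updates, which is the synchronous behavior noted for the A3C agent.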

State, Step, and TrainingBatch

State (pensieve_ppo.gym.env.State):

Type alias: State = Any
Represents environment observations
Concrete type depends on the observer:
- RLState (np.ndarray) for RL agents: shape (S_INFO, S_LEN) = (6, 8)
- MPCState (dataclass) for MPC agents
- NetLLMState (dataclass) for NetLLM agents
Used in:
- AbstractAgent.select_action(state)
- AbstractTrainableAgent.select_action_for_training(state)
- Step.state field

Step (pensieve_ppo.agent.trainable.Step):

Dataclass representing a single environment step
Fields:
- state: State: Observation at this step
- action: List[int]: One-hot encoded action (e.g., [0, 0, 1, 0, 0, 0])
- action_prob: List[float]: Action probability distribution from agent
- reward: float: Reward received
- step: int: Step index within trajectory
- done: bool: Whether episode terminated/truncated
Usage:
- Collected during environment rollout in Trainer._agent_worker()
- Stored in trajectory: List[Step]
- Converted to TrainingBatch via produce_training_batch()
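
A minimal sketch of the Step record and the one-hot action encoding, assuming the fields listed above. The dataclass is a local mirror rather than an import from the package, and one_hot is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Step:
    """Local mirror of the documented fields, not the package's class."""
    state: Any
    action: List[int]
    action_prob: List[float]
    reward: float
    step: int
    done: bool


def one_hot(index: int, size: int) -> List[int]:
    """Encode a bitrate index as a one-hot list."""
    vec = [0] * size
    vec[index] = 1
    return vec


trajectory: List[Step] = []
probs = [0.1, 0.1, 0.5, 0.1, 0.1, 0.1]
trajectory.append(
    Step(state=None, action=one_hot(2, 6), action_prob=probs,
         reward=1.2, step=0, done=False)
)
print(trajectory[0].action)  # [0, 0, 1, 0, 0, 0]
```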

TrainingBatch (pensieve_ppo.agent.trainable.TrainingBatch):

Abstract base class for training data containers
Subclasses define algorithm-specific fields:
- RLTrainingBatch (pensieve_ppo.agent.rl.abc.RLTrainingBatch):
  - s_batch: List[RLState]: States
  - a_batch: List[List[int]]: One-hot actions
  - p_batch: List[List[float]]: Action probabilities
  - v_batch: List[float]: Computed value targets (returns)
- NetLLMTrainingBatch (pensieve_ppo.agent.netllm.abc.NetLLMTrainingBatch):
  - states: List[torch.Tensor]: State tensors
  - actions: List[int]: Action indices
  - returns: List[float]: Return-to-go values
  - timesteps: List[int]: Timestep indices
  - labels: List[int]: Target labels
Usage:
- Created by produce_training_batch(trajectory) from List[Step]
- Multiple batches aggregated in train_batch(List[TrainingBatch])
- Converted to numpy arrays/tensors for actual training

Data Flow:

Environment Step
  → Step(state, action, action_prob, reward, step, done)
  → Collected in trajectory: List[Step]
  → produce_training_batch(trajectory)
  → TrainingBatch (e.g., RLTrainingBatch)
  → train_batch([TrainingBatch, ...])
  → Model update
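
One plausible way to turn a trajectory into an RL batch is to compute discounted returns backwards through the rewards. The sketch below assumes a terminal trajectory with no critic bootstrap; RLBatchSketch, discounted_returns, and the tuple-shaped steps are hypothetical, and the package's compute_v may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RLBatchSketch:
    """Local mirror of the documented RLTrainingBatch fields."""
    s_batch: list
    a_batch: List[List[int]]
    p_batch: List[List[float]]
    v_batch: List[float]


def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    """Compute R_t = r_t + gamma * R_{t+1}, iterating backwards."""
    returns = [0.0] * len(rewards)
    running = bootstrap
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def produce_training_batch(trajectory, gamma=0.99):
    """trajectory: list of (state, one_hot_action, action_prob, reward)."""
    states, actions, probs, rewards = zip(*trajectory)
    return RLBatchSketch(list(states), list(actions), list(probs),
                         discounted_returns(list(rewards), gamma))


trajectory = [
    ("s0", [1, 0], [0.7, 0.3], 1.0),
    ("s1", [0, 1], [0.4, 0.6], 0.0),
    ("s2", [1, 0], [0.9, 0.1], 2.0),
]
batch = produce_training_batch(trajectory, gamma=0.5)
print(batch.v_batch)  # [1.5, 1.0, 2.0]
```

With gamma=0.5 the last step keeps its reward of 2.0, the middle step gets 0.0 + 0.5 * 2.0 = 1.0, and the first step gets 1.0 + 0.5 * 1.0 = 1.5.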

Imitation Learning

ImitationObserver (pensieve_ppo.gym.imitate.ImitationObserver):

Combines two observers (student and teacher) in the same environment
Both observers observe the same environment state and teacher's actions
Returns ImitationState containing:
- student_state: State for training the student agent (neural network)
- teacher_state: State for teacher agent's decision-making

How It Works:

Two Observers in Same Environment: The ImitationObserver wraps both a student_observer and a teacher_observer, both observing the same ABREnv instance
Teacher Makes Decisions: The teacher agent uses teacher_state to select actions
Teacher Actions Are Executed: The teacher's selected actions are executed in the environment
Student Learns from Teacher: The student agent receives student_state and learns to imitate the teacher's decisions through behavioral cloning

ImitationState (pensieve_ppo.gym.imitate.ImitationState):

Dataclass containing both student_state and teacher_state
Both states are generated from the same environment step and teacher action
student_state is used for training (e.g., RL policy updates)
teacher_state is used by the teacher agent for action selection

ImitationTrainer (pensieve_ppo.agent.imitate.ImitationTrainer):

Extends Trainer for distributed imitation learning
Architecture:
- Central Agent (Student): Neural network that learns to imitate teacher decisions
- Worker Agents (Teacher): Expert agents (e.g., BBA, MPC, LLM-based) that generate trajectories
Workflow:
- Workers use teacher agent with teacher_state to select actions
- Teacher actions are executed in the environment
- Both observers update their states from the same environment result
- Student receives student_state and teacher's action for training
- No parameter synchronization between student and teacher (different agent types)
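
Behavioral cloning is commonly implemented as a negative log-likelihood of the teacher's actions under the student's policy. The sketch below shows that loss in plain Python; behavioral_cloning_loss is a hypothetical helper, and the student agents in this package may compute their loss differently.

```python
import math


def behavioral_cloning_loss(student_probs, teacher_actions):
    """Mean negative log-likelihood of teacher actions under the student."""
    total = 0.0
    for probs, action in zip(student_probs, teacher_actions):
        total += -math.log(probs[action])
    return total / len(teacher_actions)


# Student action distributions for two steps, and the teacher's choices.
student_probs = [[0.5, 0.5], [0.25, 0.75]]
teacher_actions = [0, 1]
loss = behavioral_cloning_loss(student_probs, teacher_actions)
print(round(loss, 4))  # 0.4904
```

The loss falls to zero only when the student assigns probability 1 to every teacher action, so minimizing it pulls the student's policy toward the teacher's decisions.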

Example Usage:

from pensieve_ppo.gym import ABREnv
from pensieve_ppo.gym.imitate import ImitationObserver
from pensieve_ppo.agent.rl.observer import RLABRStateObserver
from pensieve_ppo.agent.bba.observer import BBAStateObserver

# Bitrate levels in Kbps (the values used elsewhere in this README)
VIDEO_BIT_RATE = [300., 750., 1200., 1850., 2850., 4300.]

# Create observers
student_observer = RLABRStateObserver(levels_quality=VIDEO_BIT_RATE)
teacher_observer = BBAStateObserver(levels_quality=VIDEO_BIT_RATE)

# Combine for imitation learning
imitation_observer = ImitationObserver(
    student_observer=student_observer,
    teacher_observer=teacher_observer,
)

# Use in environment
# `simulator` is assumed to be an already-constructed ABR simulator
# (see pensieve_ppo.core.simulator); its construction is not shown here.
env = ABREnv(simulator=simulator, observer=imitation_observer)
state, info = env.reset()
# state.student_state: RLState for training RL agent
# state.teacher_state: BBAState for BBA agent's decision

Training with Imitation Learning:

# Train student agent (PPO) to imitate teacher agent (BBA)
python -m pensieve_ppo.imitate \
    --agent-name ppo \
    --teacher-agent-name bba \
    --parallel-workers 16 \
    --train-epochs 500000

Warning: The RL agents (ppo, a3c, dqn in pensieve_ppo/agent/rl/) are reinforcement learning algorithms designed to learn from reward signals, not from teacher demonstrations. Although the framework allows running them with pensieve_ppo.imitate, this is not recommended for production use. For proper imitation learning, consider using agents specifically designed for behavioral cloning or other imitation learning methods (e.g., netllm).

API Reference

Creating Environment and Agent

from pensieve_ppo.defaults import (
    create_env_with_default,
    create_env_agent_with_default,
    create_env_agent_factory_with_default,
)

# Create just the environment
env = create_env_with_default(
    levels_quality=[300., 750., 1200., 1850., 2850., 4300.],  # Kbps
    trace_folder='./src/train/',
    train=True,
)

# Create compatible env and agent pair
env, agent = create_env_agent_with_default(
    name='ppo',
    model_path='./ppo/nn_model_ep_300.pth',  # Optional: load pretrained weights
    device='cuda',
)

# Create factories for distributed training
env_factory, agent_factory = create_env_agent_factory_with_default(
    name='ppo',
    train=True,
)

Using the Agent

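from pensieve_ppo.agent import create_agent
from pensieve_ppo.defaults import create_env_with_default

# Create agent directly
agent = create_agent(
    name='ppo',
    state_dim=(6, 8),  # (S_INFO, S_LEN)
    action_dim=6,       # Number of bitrate levels
    device='cuda',
    learning_rate=1e-4,
    gamma=0.99,
)

# Predict action on the first observation of a default environment
env = create_env_with_default(train=True)
state, info = env.reset()
action, action_prob = agent.select_action(state)

# Train on batch; s_batch, a_batch, p_batch, and v_batch are the
# RLTrainingBatch fields described above, epoch is the current epoch index
metrics = agent.train(s_batch, a_batch, p_batch, v_batch, epoch)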

Registering Custom Agents

The register function allows you to register custom agent implementations with the Pensieve-PPO framework. Once registered, your custom agent can be used with all factory functions (create_agent, create_env, etc.) and command-line tools.

Function Signature:

from typing import Optional, Type

from pensieve_ppo.agent import register
from pensieve_ppo.agent.abc import AbstractAgent
from pensieve_ppo.agent.trainable import AbstractTrainableAgent
from pensieve_ppo.gym.env import AbstractABRStateObserver

register(
    name: str,
    agent_cls: Type[AbstractAgent],
    observer_cls: Type[AbstractABRStateObserver],
    trainable_agent_cls: Optional[Type[AbstractTrainableAgent]] = None,
) -> None

Parameters:

name: Name to register the agent under (case-sensitive). This name will be used in create_agent(), create_env(), and command-line arguments.
agent_cls: The agent class to register. Must be a subclass of AbstractAgent.
observer_cls: The observer class associated with this agent. Must be a subclass of AbstractABRStateObserver. The observer handles state observation and reward calculation for the agent.
trainable_agent_cls: Optional trainable agent class. If not provided, will be automatically set to agent_cls if it's a subclass of AbstractTrainableAgent.

Example: Registering a Custom Agent:

from pensieve_ppo.agent import register
from pensieve_ppo.agent.abc import AbstractAgent
from pensieve_ppo.gym.env import AbstractABRStateObserver

# Define your custom agent
class MyCustomAgent(AbstractAgent):
    def select_action(self, state):
        # Your implementation; must return (action, action_prob)
        pass

# Define your custom observer
class MyCustomObserver(AbstractABRStateObserver):
    def reset(self, env, initial_bit_rate):
        # Your implementation; must return (state, info)
        pass

    def observe(self, env, bit_rate, result):
        # Your implementation; must return (state, reward, info)
        pass

# Register the agent
register("my-custom-agent", MyCustomAgent, MyCustomObserver)

# Now you can use it with factory functions
# (`...` stands for the remaining arguments, which are not shown here)
from pensieve_ppo.agent import create_agent, create_env

agent = create_agent(name="my-custom-agent", ...)
env = create_env(name="my-custom-agent", ...)

Example: Registering a Trainable Agent:

from pensieve_ppo.agent import register
from pensieve_ppo.agent.trainable import AbstractTrainableAgent

class MyTrainableAgent(AbstractTrainableAgent):
    # Implement all required methods
    pass

# Register with trainable agent class
register("my-trainable", MyTrainableAgent, MyCustomObserver)

# Can be used for training
from pensieve_ppo.defaults import create_env_agent_with_default
env, agent = create_env_agent_with_default(name="my-trainable")

Checking Available Agents:

from pensieve_ppo.agent import get_available_agents, get_available_trainable_agents

# Get all registered agents
all_agents = get_available_agents()
print(all_agents)  # ['ppo', 'bba', 'mpc', 'dqn', 'a3c', ...]

# Get only trainable agents
trainable_agents = get_available_trainable_agents()
print(trainable_agents)  # ['ppo', 'dqn', 'a3c', ...]

Note: Agent registration typically happens in the agent module's __init__.py file. When you import the module, the registration is automatically executed. For example, importing pensieve_ppo.agent.rl automatically registers all RL agents (ppo, dqn, a3c).

Environment Details

Observation Space (Box(6, 8)):

Index   Description
0       Last quality normalized by max quality
1       Buffer size normalized by buffer_norm_factor
2       Throughput (chunk_size / delay) in Mbps
3       Delay normalized by buffer_norm_factor
4       Next chunk sizes at each bitrate level (MB)
5       Remaining chunks normalized by total

Action Space (Discrete(6)): Select bitrate level (0-5)
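
To make the (6, 8) layout concrete, the sketch below keeps a rolling history with the newest sample in the last column. It is an approximation under stated assumptions, not the package's observer: update_state is a hypothetical helper, BUFFER_NORM_FACTOR = 10.0 and the input units (bytes, milliseconds, seconds) are assumed, and row 4 is assumed to hold the six next-chunk sizes padded with zeros rather than a history.

```python
S_INFO, S_LEN = 6, 8
LEVELS_KBPS = [300.0, 750.0, 1200.0, 1850.0, 2850.0, 4300.0]
BUFFER_NORM_FACTOR = 10.0  # assumed value, in seconds


def update_state(state, last_level, buffer_s, chunk_bytes, delay_ms,
                 next_sizes_bytes, remaining, total):
    """Shift the history left by one step and fill the newest column."""
    new = [row[1:] + [0.0] for row in state]
    new[0][-1] = LEVELS_KBPS[last_level] / max(LEVELS_KBPS)
    new[1][-1] = buffer_s / BUFFER_NORM_FACTOR
    # Throughput: bits transferred divided by seconds elapsed, in Mbps.
    new[2][-1] = chunk_bytes * 8 / 1e6 / (delay_ms / 1000.0)
    new[3][-1] = delay_ms / 1000.0 / BUFFER_NORM_FACTOR
    # Row 4 is overwritten each step: one entry per bitrate level, in MB.
    sizes_mb = [size / 1e6 for size in next_sizes_bytes]
    new[4] = sizes_mb + [0.0] * (S_LEN - len(sizes_mb))
    new[5][-1] = remaining / total
    return new


state = [[0.0] * S_LEN for _ in range(S_INFO)]
state = update_state(state, last_level=2, buffer_s=5.0,
                     chunk_bytes=1_000_000, delay_ms=2000.0,
                     next_sizes_bytes=[100_000] * 6,
                     remaining=24, total=48)
print(len(state), len(state[0]))  # 6 8
print(state[2][-1])               # 4.0
```

An agent then picks one of the six levels (0-5) from this matrix; downloading 1 MB in 2 s corresponds to the 4.0 Mbps shown in row 2.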

Reward: quality - 4.3 * rebuffer - 1.0 * |quality_change|
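
As a worked example of the formula, the sketch below assumes quality is the bitrate in Mbps and rebuffer is measured in seconds, as in the original Pensieve linear QoE metric. qoe_reward and smooth_penalty are hypothetical names; rebuf_penalty matches the env option shown under Command Line Options.

```python
def qoe_reward(bitrate_kbps, last_bitrate_kbps, rebuffer_s,
               rebuf_penalty=4.3, smooth_penalty=1.0):
    """quality - rebuf_penalty * rebuffer - smooth_penalty * |quality_change|."""
    quality = bitrate_kbps / 1000.0            # Mbps (assumed unit)
    last_quality = last_bitrate_kbps / 1000.0  # Mbps (assumed unit)
    return (quality
            - rebuf_penalty * rebuffer_s
            - smooth_penalty * abs(quality - last_quality))


# Switching from 750 Kbps to 1850 Kbps with 0.5 s of rebuffering:
# 1.85 - 4.3 * 0.5 - 1.0 * |1.85 - 0.75| = 1.85 - 2.15 - 1.10 = -1.40
reward = qoe_reward(1850.0, 750.0, rebuffer_s=0.5)
print(round(reward, 2))  # -1.4
```

Half a second of stalling costs more than the entire quality gain here, which is why the agent learns to avoid aggressive upswitches when the buffer is low.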

NetLLM Agents

NetLLM-style agents are registered under names such as netllm-gpt2, netllm-llama, and netllm-gpt2-lora. They use the NetLLMABRStateObserver and the model wrappers in pensieve_ppo.agent.netllm.

When creating a NetLLM agent, provide the reward normalization range required by the offline RL data processing:

python -m pensieve_ppo.imitate_exp_pool \
    --agent-name netllm-gpt2 \
    --state-history-len 6 \
    -o min_reward=-10.0 max_reward=10.0 plm_size='small'

NetLLM currently expects a state history length of 6, while the default PPO observer uses 8. Pass --state-history-len 6 for NetLLM runs unless you are using a custom NetLLM-compatible state encoder.

Command Line Options

Training (`pensieve_ppo.train`)

--train-trace-folder    Training trace folder (default: ./src/train/)
--output-dir            Output directory (default: ./ppo)
--parallel-workers      Number of parallel workers (default: 16)
--max-steps-per-epoch   Maximum steps per epoch per worker (default: 1000)
--train-epochs          Total training epochs (default: 500000)
--model-save-interval   Model checkpoint interval (default: 300)
--model-path            Resume from pretrained model

Testing (`pensieve_ppo.test`)

--test-trace-folder     Test trace folder (default: ./src/test/)
--model-path            Path to trained model weights
--test-log-file-prefix  Prefix for test log files

Common Options

--agent-name            Registered agent name, e.g. ppo, a3c, dqn (default: ppo)
--device                PyTorch device (cuda/cpu)
--levels-quality        Bitrate levels in Kbps
--state-history-len     State history length (default: 8)
--random-seed           Random seed (default: 42)
-o, --agent-options     Extra agent kwargs (e.g., learning_rate=1e-4)
-e, --env-options       Extra env kwargs (e.g., rebuf_penalty=4.3)

Values passed through --agent-options and --env-options are parsed as Python expressions. Quote string values, for example plm_size='small'.
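
One way such key=value options could be parsed is with ast.literal_eval, which accepts Python literals only. This is a sketch of the idea, not the package's parser: parse_options is a hypothetical helper, and the real implementation may evaluate expressions differently.

```python
import ast


def parse_options(pairs):
    """Turn ['key=value', ...] into a dict, parsing values as literals."""
    options = {}
    for pair in pairs:
        key, _, raw = pair.partition("=")
        options[key] = ast.literal_eval(raw)
    return options


opts = parse_options(["min_reward=-10.0", "max_reward=10.0", "plm_size='small'"])
print(opts)  # {'min_reward': -10.0, 'max_reward': 10.0, 'plm_size': 'small'}

# An unquoted string is a bare name, not a literal, so it is rejected.
try:
    parse_options(["plm_size=small"])
except ValueError as exc:
    print("rejected:", type(exc).__name__)  # rejected: ValueError
```

Under this scheme the inner quotes must reach the program. Many shells strip one level of quoting, so an argument written as "plm_size='small'" preserves them.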

TensorBoard Monitoring

Monitor training in real-time:

tensorboard --logdir=./ppo

Original README

Updates

Jan. 18, 2025: We removed the rate-based method and added NetLLM [4].

May 4, 2024: We removed Elastic, revised BOLA, and added the new baselines Comyco [3] and Genet [2].

Jan. 26, 2024: We are excited to announce significant updates to Pensieve-PPO! We have replaced TensorFlow with PyTorch, achieving a similar training speed while training models with comparable performance.

For the TensorFlow version, please check Pensieve-PPO TF Branch.

Dec. 28, 2021: In a previous update, we enhanced Pensieve-PPO with several state-of-the-art technologies, including Dual-Clip PPO and adaptive entropy decay.

About Pensieve-PPO

Pensieve-PPO is a user-friendly PyTorch implementation of Pensieve [1], a neural adaptive video streaming system. Unlike the original Pensieve, which is trained with A3C, we use the Proximal Policy Optimization (PPO) algorithm for training.

This stable version of Pensieve-PPO includes both the training and test datasets.

You can run the repository by executing the following command:

python train.py

The results will be evaluated on the test set (from HSDPA) every 300 epochs.

TensorBoard Integration

To monitor the training process in real time, you can use TensorBoard. Simply run the following command:

tensorboard --logdir=./

Pretrained Model

We have also added a pretrained model, which can be found at this link. This model demonstrates a substantial improvement of 7.03% (from 0.924 to 0.989) in average Quality of Experience (QoE) compared to the original Pensieve model [1]. For a more detailed performance analysis, refer to the figures below:

If you have any questions or require further assistance, please don't hesitate to reach out.

Additional Reinforcement Learning Algorithms

For more implementations of reinforcement learning algorithms, please visit the following branches:

DQN: Pensieve-PPO DQN Branch
SAC: Pensieve-PPO SAC Branch or Pensieve-SAC Repository

[1] Mao, Hongzi, Ravi Netravali, and Mohammad Alizadeh. "Neural adaptive video streaming with Pensieve." Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 2017: 197-210.

[2] Xia, Zhengxu, et al. "Genet: automatic curriculum generation for learning adaptation in networking." Proceedings of the ACM SIGCOMM 2022 Conference. 2022.

[3] Huang, Tianchi, et al. "Comyco: Quality-aware adaptive video streaming via imitation learning." Proceedings of the 27th ACM international conference on multimedia. 2019.

[4] Wu, Duo, et al. "NetLLM: Adapting large language models for networking." Proceedings of the ACM SIGCOMM 2024 Conference. 2024.

We use the following command to test on all traces in the dataset.

python run_plm.py --test --plm-type llama --plm-size base --rank 128 --device cuda:0 --trace-num -1 --model-dir data/ft_plms/try_llama2_7b

Release history

2.0.1 (this version): Jun 8, 2026
2.0.0: Jun 8, 2026

