MAVRL - Unified Multi-modal Feedback using Amortized Variational Inference

This package implements a variational inference approach for learning reward functions from multiple types of feedback (preferences, demonstrations, etc.).

Repository layout

The repository ships two top-level Python packages:

  • mavrl/ — the algorithm itself: encoders, feedback models, datasets, losses, environment wrappers, retraining utilities. Importable as import mavrl.
  • mavrl_experiments/ — the infrastructure that runs the algorithm: Optuna search, distributed file queues, table printers, Slack watchers, CLI entry points, and the experiment configs themselves (mavrl_experiments/configs/{experiments,optuna}/). Importable as import mavrl_experiments and invoked via python -m mavrl_experiments.<module>.

mavrl_experiments depends on mavrl (one-way); mavrl never imports from mavrl_experiments. The split keeps the algorithm package focused and lets infrastructure evolve without touching algorithm code.

Top-level entry-point scripts (train.py, transfer.py, evaluate_reward_model.py, train_online.py) live at the repo root.

Installation

Make sure you are using Python 3.11.6. On Euler, load the correct Python version using:

module load stack/2024-06 python/3.11.6

Ensure that you are at the root of this project. Create a fresh virtual environment with this exact name:

python -m venv venv/

.gitignore will ignore this virtual environment. Activate the virtual environment:

source venv/bin/activate

Install all required dependencies:

pip install -r requirements.txt
pip install -e .

The first command installs all Python packages except mavrl. The second installs an editable version of mavrl.

Running a single trial

To run a single trial, execute

python train.py

Running an experiment

Instead of running just a single trial, you can run a potentially large number of trials through our CLI. Here is an overview of the process:

1. Specifying all configurations

Specify all experimental configurations using the ExperimentGrid class. This will exhaustively run all valid combinations of the specified parameters. For an example of how to specify a grid of configurations, see mavrl_experiments/configs/experiments/sweep_grid_trap.py. You can specify configurations in four ways:

  1. By passing the base_config to the ExperimentGrid constructor. These are parameters that are shared between all configurations.
  2. By adding a parameter sweep with grid.add. Values are specified as lists.
  3. By adding a conditional parameter with grid.add_conditional. Supply a boolean function to the condition argument that defines whether a configuration fulfills the condition to contain these parameter values.
  4. By removing invalid configurations with grid.add_validator.
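The four mechanisms above can be pictured with a small stand-alone sketch. It does not use the real ExperimentGrid class: TinyGrid, its method signatures, and the parameter names (feedback, lr, n_demos) are hypothetical and only mimic the behaviour described in the list. See sweep_grid_trap.py for the actual API.

```python
from itertools import product


class TinyGrid:
    """Hypothetical stand-in that mimics the four mechanisms described above."""

    def __init__(self, base_config):
        self.base_config = dict(base_config)   # shared by every configuration
        self.sweeps = {}                       # name -> list of values
        self.conditionals = []                 # (name, values, condition)
        self.validators = []                   # config -> bool

    def add(self, name, values):
        self.sweeps[name] = list(values)

    def add_conditional(self, name, values, condition):
        self.conditionals.append((name, list(values), condition))

    def add_validator(self, validator):
        self.validators.append(validator)

    def configs(self):
        names = list(self.sweeps)
        # Exhaustive cross product of all swept parameters.
        expanded = [
            {**self.base_config, **dict(zip(names, combo))}
            for combo in product(*(self.sweeps[n] for n in names))
        ]
        # Conditional parameters expand only the configurations that match.
        for name, values, condition in self.conditionals:
            nxt = []
            for cfg in expanded:
                if condition(cfg):
                    nxt.extend({**cfg, name: v} for v in values)
                else:
                    nxt.append(cfg)
            expanded = nxt
        # Validators drop invalid configurations.
        return [c for c in expanded if all(v(c) for v in self.validators)]


grid = TinyGrid({"env": "grid_trap"})
grid.add("feedback", ["pref", "demo"])
grid.add("lr", [1e-3, 1e-4])
grid.add_conditional("n_demos", [2, 4], condition=lambda c: c["feedback"] == "demo")
grid.add_validator(lambda c: not (c["feedback"] == "pref" and c["lr"] == 1e-4))
configs = grid.configs()
print(len(configs))  # 5
```

The sweep yields 4 combinations, the conditional parameter doubles the two demo configurations (6 in total), and the validator removes one, leaving 5.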

NOTE: Any paths that are specified in the grid should be absolute paths for the machine that you plan to run the experiment on. Otherwise paths will not be correctly recognized.

Once your grid is set up, populate the database with experiments:

python -m mavrl_experiments.cli add-grid <your_config_name> --seeds 5

This will create a database containing all configuration parameters that will be read out by the workers, but no results yet.

NOTE: Populating the database might take a long time on Euler, while it might only take a few seconds on your local system. Consider populating the database locally and copying it to Euler afterwards.

This command is idempotent: pre-existing entries with equivalent configurations will not be deleted by issuing it again; only new configurations will be added.

--seeds specifies the number of trials (differing by seed) that are run per configuration. So if you have 100 distinct configurations, --seeds 5 will result in 500 trials.
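One way to picture the idempotent behaviour and the seed multiplication is the sketch below. It is not the package's database code: the table layout, the column names, and the hashing scheme are assumptions chosen for illustration.

```python
import hashlib
import json
import sqlite3


def config_key(config):
    """Stable hash of a configuration (key order does not matter)."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def add_grid(conn, configs, seeds):
    """Insert one pending row per (configuration, seed); existing rows are kept."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials ("
        "config_hash TEXT, seed INTEGER, config TEXT, status TEXT, "
        "UNIQUE(config_hash, seed))"
    )
    for cfg in configs:
        for seed in range(seeds):
            # INSERT OR IGNORE skips rows that violate the UNIQUE constraint.
            conn.execute(
                "INSERT OR IGNORE INTO trials VALUES (?, ?, ?, 'pending')",
                (config_key(cfg), seed, json.dumps(cfg, sort_keys=True)),
            )
    conn.commit()


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]


conn = sqlite3.connect(":memory:")
configs = [{"lr": 1e-3}, {"lr": 1e-4}]
add_grid(conn, configs, seeds=5)
first = count(conn)                                # 2 configs x 5 seeds
add_grid(conn, configs + [{"lr": 1e-5}], seeds=5)  # re-run with one new config
second = count(conn)                               # only the new config is added
print(first, second)  # 10 15
```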

2. Checking experiment status

At any point during the experiment, you can check the progress using

python -m mavrl_experiments.cli status

Since you haven't started yet, you will see something like this.

Experiment Queue Status (rb_experiment_001.db)
========================================
  Pending:    22320
  Running:        0
  Completed:      0
  Failed:         0
----------------------------------------
  Total:      22320

  Progress: [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.0%

Do not forget to specify the correct database path with this command in case you use a custom path.

3. Submit experiment

Now you can have workers pick up tasks from the queue (see scripts/ for cluster submission scripts).

Hyperparameter search (Optuna)

For finding good multi-modal feedback allocations under a fixed sample budget, use the Optuna-based search in mavrl_experiments/optuna_search.py. It samples a Dirichlet-distributed allocation over modalities (always summing exactly to --budget) and jointly searches over reward-model and PPO retraining hyperparameters defined in the env config.
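As a rough illustration of an allocation that always sums exactly to the budget, the sketch below draws Dirichlet weights and converts them to integer sample counts with largest-remainder rounding. The rounding scheme, the function name sample_allocation, and the alpha parameter are assumptions; optuna_search.py may do this differently.

```python
import random


def sample_allocation(modalities, budget, rng, alpha=1.0):
    """Draw Dirichlet weights and round them to integers that sum to budget."""
    # A Dirichlet draw is a vector of Gamma draws divided by their sum.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in modalities]
    total = sum(gammas)
    shares = [budget * g / total for g in gammas]

    counts = [int(s) for s in shares]          # floor of each share
    leftover = budget - sum(counts)
    # Largest-remainder rounding: give the leftover samples to the
    # modalities whose shares lost the most when flooring.
    order = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in order[:leftover]:
        counts[i] += 1
    return dict(zip(modalities, counts))


rng = random.Random(0)
allocation = sample_allocation(["pref", "demo", "rating", "stop"], budget=256, rng=rng)
print(allocation, sum(allocation.values()))
```

Whatever the individual counts turn out to be, their sum equals the budget, which is the property the search relies on.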

Configs live at mavrl_experiments/configs/optuna/<env>.py (override the root via MAVRL_CONFIG_ROOT). Each config defines:

  • BASE_CONFIG — fixed parameters,
  • MODALITY_PARAMS — per-modality hyperparameters applied when that modality has samples > 0,
  • HYPERPARAM_SEARCH_SPACE — the search space (categorical lists or (low, high, log) continuous ranges),
  • MODALITIES — the ordered list of modality sample-count keys.
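A config with these four names might look roughly like the sketch below. The four top-level names and retrain_n_timesteps come from this README; every other key and value (n_pref, n_demo, pref_loss_weight, demo_loss_weight, the ranges) is hypothetical, as are the two helper functions, which only illustrate how such a config could be consumed.

```python
import math
import random

BASE_CONFIG = {"retrain_n_timesteps": 1_000_000}   # fixed parameters

MODALITIES = ["n_pref", "n_demo"]                  # hypothetical sample-count keys

MODALITY_PARAMS = {                                # hypothetical per-modality settings
    "n_pref": {"pref_loss_weight": 1.0},
    "n_demo": {"demo_loss_weight": 0.5},
}

HYPERPARAM_SEARCH_SPACE = {
    "batch_size": [32, 64, 128],                   # categorical list
    "lr": (1e-5, 1e-2, True),                      # (low, high, log)
}


def sample_hyperparams(space, rng):
    """Sample categorical lists uniformly and ranges (log-)uniformly."""
    out = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            out[name] = rng.choice(spec)
        else:
            low, high, log = spec
            if log:
                out[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
            else:
                out[name] = rng.uniform(low, high)
    return out


def build_trial_config(allocation, hyperparams):
    """Merge everything; modality params apply only when samples > 0."""
    cfg = {**BASE_CONFIG, **hyperparams, **allocation}
    for key in MODALITIES:
        if allocation.get(key, 0) > 0:
            cfg.update(MODALITY_PARAMS[key])
    return cfg


hp = sample_hyperparams(HYPERPARAM_SEARCH_SPACE, random.Random(0))
cfg = build_trial_config({"n_pref": 200, "n_demo": 0}, hp)
print(sorted(cfg))
```

Because n_demo is 0 in this example, the demo-specific parameter is left out of the resulting trial config.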

Local end-to-end test

The recipe below runs a minimal single-worker search on lunar_lander_v3. A full trial does a complete PPO retrain (1M timesteps by default), which is slow on a laptop. To iterate faster locally, temporarily add "retrain_n_timesteps": 100_000 to BASE_CONFIG in mavrl_experiments/configs/optuna/lunar_lander_v3.py (don't commit that — it's just for testing).

# 1. Pre-generate cached datasets (1 seed is enough for a smoke test).
#    --gen_samples should be >= the budget you plan to test.
python scripts/pregenerate_datasets.py \
    --config lunar_lander_v3 \
    --cache_dir dataset_cache/lander_local \
    --seeds 1 \
    --gen_samples 256 \
    --gen_samples_demo 256

# 2. Run a small search (single worker, few trials, one seed per trial).
python -m mavrl_experiments.optuna_search \
    --study-name lander_b256_local \
    --storage optuna_journal_lander_local.log \
    --env-config lunar_lander_v3 \
    --budget 256 \
    --n-seeds 1 \
    --n-trials 5 \
    --dataset-cache-dir dataset_cache/lander_local

# 3. Inspect the results (passing --env-config enables the normalized-score column).
python -m mavrl_experiments.optuna_search \
    --study-name lander_b256_local \
    --storage optuna_journal_lander_local.log \
    --env-config lunar_lander_v3 \
    --show-results

The journal file (optuna_journal_lander_local.log) is append-only and NFS-safe, so re-running step 2 with the same --study-name and --storage will continue the same study.

Cluster submission

scripts/submit_optuna.sh runs an Optuna worker as a SLURM array task. Every array element is an independent worker; they coordinate through a shared journal file (NFS-safe, append-only), so there is no central scheduler. Each worker fits its own TPE model from the shared trial history and proposes its own next trial.
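The coordination pattern can be sketched without Optuna: every worker replays the shared append-only file to rebuild the trial history, decides on its own next trial, and appends the result. This is a simplified illustration, not Optuna's journal format. The record layout and the propose placeholder (which stands in for the TPE sampler) are assumptions, and the file locking a real backend needs for concurrent writers is omitted.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def replay(path):
    """Rebuild the trial history by reading the journal from the top."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def worker_step(path, worker_id, propose):
    history = replay(path)                 # every worker sees all recorded trials
    params, value = propose(history)       # each worker decides on its own
    append_record(path, {"worker": worker_id, "params": params, "value": value})


def propose(history):
    """Placeholder for a sampler: a deterministic toy proposal."""
    x = len(history) * 0.1
    return {"x": x}, (x - 0.3) ** 2


with tempfile.TemporaryDirectory() as tmp:
    journal = os.path.join(tmp, "study.log")
    for _ in range(3):
        for worker_id in (0, 1):           # two workers, no central scheduler
            worker_step(journal, worker_id, propose)
    history = replay(journal)

print(len(history))  # 6
```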

Prerequisites

  1. Virtual environment. The script activates venv/ (or ../venv/) automatically. Create it as described in Installation.
  2. Journal directory. Pick a path on a shared filesystem reachable from all compute nodes (e.g. $SCRATCH/mavrl/optuna_studies/). The journal file will be created on first run.
  3. Dataset cache (recommended). Pre-generate datasets once so trials don't redo expensive sample generation. --gen_samples should be at least the budget you intend to search:
    python scripts/pregenerate_datasets.py \
        --config lunar_lander_v3 \
        --cache_dir $SCRATCH/mavrl/dataset_cache/lander \
        --seeds 3 \
        --gen_samples 256 \
        --gen_samples_demo 256
    
    Use the same --seeds value as your trial N_SEEDS (workers seed trials as 0..N_SEEDS-1).

Submission

The script reads its configuration from environment variables. Required:

Variable      Meaning
STUDY_NAME    Optuna study name. Use a fresh name per (metric, direction, budget); load_if_exists=True silently reuses an existing study's direction.
ENV_CONFIG    Config name under mavrl_experiments/configs/optuna/ (e.g. lunar_lander_v3).
BUDGET        Total feedback samples per trial (sum across modalities).
STORAGE_PATH  Path to the journal .log file.

Optional:

Variable           Default      Meaning
N_SEEDS            3            Seeds evaluated per trial; the trial value is the mean across seeds.
N_TRIALS           20           Trials per worker. With a 32-task array, total trials ≈ 32 × N_TRIALS.
METRIC             eval/regret  Final-evaluation key to optimize (e.g. eval/mean_rew, eval/discounted_value).
DIRECTION          minimize     minimize or maximize. Pair with METRIC correctly.
SINGLE_MODALITY    unset        If set to pref/demo/rating/stop, the entire BUDGET is allocated to that modality. Useful for single-modality baselines.
WANDB_PROJECT      unset        Log every trial run to this wandb project.
DATASET_CACHE_DIR  unset        Point trials at a pre-generated dataset cache.
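The interplay of N_SEEDS, N_TRIALS, METRIC, and DIRECTION can be worked through in a few lines. This is only an illustrative sketch: the helper names are hypothetical and the per-seed numbers are made up, not results.

```python
from statistics import mean


def trial_value(per_seed_values):
    """A trial's value is the mean of the metric across its seeds."""
    return mean(per_seed_values)


def pick_best(trial_values, direction):
    """Select the best trial value for the given optimization direction."""
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {direction}")
    return min(trial_values) if direction == "minimize" else max(trial_values)


# Made-up per-seed metric values for two trials with N_SEEDS=3.
values = [trial_value([10.0, 20.0, 30.0]), trial_value([40.0, 50.0, 60.0])]

# A regret-like metric should be minimized, a reward-like metric maximized.
# Pairing them the wrong way round would select the worst trial.
best_if_regret = pick_best(values, "minimize")
best_if_reward = pick_best(values, "maximize")

# Defaults from the tables above: 32 workers, 20 trials each, 3 seeds per trial.
array_size, n_trials, n_seeds = 32, 20, 3
total_trials = array_size * n_trials
total_runs = total_trials * n_seeds
print(best_if_regret, best_if_reward, total_trials, total_runs)  # 20.0 50.0 640 1920
```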

Combined-modality run (Dirichlet allocation across all modalities):

STUDY_NAME=lander_b256_meanrew \
ENV_CONFIG=lunar_lander_v3 \
BUDGET=256 \
STORAGE_PATH=$SCRATCH/mavrl/optuna_studies/lander_b256_meanrew.log \
METRIC=eval/mean_rew DIRECTION=maximize \
N_SEEDS=3 N_TRIALS=20 \
DATASET_CACHE_DIR=$SCRATCH/mavrl/dataset_cache/lander \
sbatch scripts/submit_optuna.sh

Single-modality baseline (e.g. all-preferences) under the same budget, for comparison:

STUDY_NAME=lander_b256_meanrew_prefonly \
ENV_CONFIG=lunar_lander_v3 \
BUDGET=256 \
STORAGE_PATH=$SCRATCH/mavrl/optuna_studies/lander_b256_meanrew_prefonly.log \
METRIC=eval/mean_rew DIRECTION=maximize \
SINGLE_MODALITY=pref \
N_SEEDS=3 N_TRIALS=20 \
DATASET_CACHE_DIR=$SCRATCH/mavrl/dataset_cache/lander \
sbatch scripts/submit_optuna.sh

Adjusting array size and resources

The script defaults to --array=0-31 (32 workers), 4 CPUs each, 4 hours wall time. Override at submit time:

sbatch --array=0-15 --time=08:00:00 scripts/submit_optuna.sh   # 16 workers, 8h
sbatch --array=0-63 --cpus-per-task=8 scripts/submit_optuna.sh # 64 workers, 8 CPUs each

Logs land in logs/slurm/optuna_<jobid>_<taskid>.out|err.

Monitoring & inspecting results

While running, the journal file is readable:

python -m mavrl_experiments.optuna_search \
    --study-name lander_b256_meanrew \
    --storage $SCRATCH/mavrl/optuna_studies/lander_b256_meanrew.log \
    --env-config lunar_lander_v3 \
    --show-results

This works mid-run (you'll just see partial results) and after completion. Passing --env-config enables a normalized-score column when results/normalization_values.json has entries for the env.

Two main tables: equal-budget and fixed-allocation

There are two pre-built launchers that each submit 66 Optuna studies (6 envs × 11 modality subsets). They answer different questions:

Launcher                          Allocation               Question
launch_equal_budget_table.sh      Dirichlet over budget    Are modalities complementary when you spend a fixed total budget?
launch_fixed_allocation_table.sh  Prescribed per-modality  Can MAVRL combine arbitrary offline feedback datasets to produce gains?

Both share the same 11-subset layout (pref, demo, rating, stop, all 6 pairs, and pdrs = all four). The two are designed to live side-by-side in $STORAGE_ROOT — study suffixes differ (_b<N> vs _fixed), so they don't collide.
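The counts can be checked with a short enumeration. The study_name helper below is hypothetical; its naming pattern is inferred from the study names that appear in this README (for example lunar_lander_v3_pdrs_b256 and grid_trap_pdrs_fixed), and the labels used for the pair subsets are not shown here, so the sketch only names single-modality and pdrs studies.

```python
from itertools import combinations

MODALITIES = ["pref", "demo", "rating", "stop"]

singles = [(m,) for m in MODALITIES]           # 4 single-modality subsets
pairs = list(combinations(MODALITIES, 2))      # all 6 unordered pairs
everything = [tuple(MODALITIES)]               # pdrs = all four
subsets = singles + pairs + everything

n_envs = 6
n_studies = n_envs * len(subsets)


def study_name(env, subset_label, budget=None):
    """Hypothetical helper: _b<N> for equal-budget, _fixed for fixed-allocation."""
    suffix = f"b{budget}" if budget is not None else "fixed"
    return f"{env}_{subset_label}_{suffix}"


print(len(singles), len(pairs), len(subsets), n_studies)  # 4 6 11 66
print(study_name("lunar_lander_v3", "pdrs", budget=256))  # lunar_lander_v3_pdrs_b256
print(study_name("grid_trap", "pdrs"))                    # grid_trap_pdrs_fixed
```

Because the suffixes differ, an equal-budget study and a fixed-allocation study for the same env and subset never share a name.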

1. Equal-budget table — modality complementarity

For each env, fix a single total feedback budget and let Optuna's Dirichlet allocation split it across whichever modalities are active in the study. Tests whether two modalities together at total budget B beat the best single modality at B.

# Submit all 66 studies (default per-env budgets: grid=64, control=64, lander=256)
bash scripts/launch_equal_budget_table.sh

# Filter to a subset of envs/subsets / dry-run
ENVS="grid_trap"  SUBSETS="pdrs pref"  bash scripts/launch_equal_budget_table.sh
DRY_RUN=1         bash scripts/launch_equal_budget_table.sh

# Override per-env-group budgets
BUDGET_GRID=128   bash scripts/launch_equal_budget_table.sh

Snapshot the current best value of every cell into one printed table (safe mid-optimization; reads the journal files):

python -m mavrl_experiments.equal_budget_table \
    --storage-root $SCRATCH/mavrl/optuna_studies

Cells render as normalized percentages (uniform=0%, optimal=100%) when results/normalization_values.json covers the env. Filter with --envs grid_cliff lunar_lander_v3 to print a subset of rows.
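A normalization with these two anchor points can be sketched as a linear rescaling. This assumes linear interpolation between the uniform and optimal scores. The function name and the example scores are hypothetical, and the layout of results/normalization_values.json is not shown here.

```python
def normalized_percent(score, uniform, optimal):
    """Map score linearly so that uniform -> 0% and optimal -> 100%."""
    if optimal == uniform:
        raise ValueError("optimal and uniform scores must differ")
    return 100.0 * (score - uniform) / (optimal - uniform)


# Made-up reference scores for illustration only.
uniform_score, optimal_score = -200.0, 200.0
print(normalized_percent(-200.0, uniform_score, optimal_score))  # 0.0
print(normalized_percent(0.0, uniform_score, optimal_score))     # 50.0
print(normalized_percent(200.0, uniform_score, optimal_score))   # 100.0
```

Under this mapping, a score below the uniform baseline would show up as a negative percentage.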

2. Fixed-allocation table — gains from heterogeneous offline data

For each env, prescribe per-modality sample counts in mavrl_experiments/configs/optuna/<env>_fixed.py:FIXED_SAMPLE_COUNTS. Each study uses exactly those counts (no Dirichlet, no shared budget); Optuna instead searches the optimizer/loss hyperparameters that combine the modalities: td_error_weight, kl_weight, use_importance_weights, lr, batch_size, encoder_hidden_sizes (and the PPO retraining hparams for non-tabular envs). Tests the "you have offline data of various kinds lying around — can our method turn it into a better reward model than any single-modality alternative?" story.

# Submit all 66 studies using prescribed counts from <env>_fixed.py
bash scripts/launch_fixed_allocation_table.sh

# Filter / dry-run (same hooks as the equal-budget launcher)
ENVS="grid_trap acrobot_v1"  bash scripts/launch_fixed_allocation_table.sh
DRY_RUN=1                    bash scripts/launch_fixed_allocation_table.sh

Default FIXED_SAMPLE_COUNTS (small values, totals near a power of 2; tune in the <env>_fixed.py config to match your offline-data scenario):

env              pref  demo  rating  stop  total
grid_*             23     2      23    16     64
acrobot_v1         23     2      23    16     64
cartpole_v1        23     2      23    16     64
lunar_lander_v3    92     8      92    64    256
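The default counts can be written down and their totals checked as follows. The dictionary layout and key names are assumptions for illustration; the actual FIXED_SAMPLE_COUNTS in each <env>_fixed.py may be structured differently.

```python
# Counts copied from the table above; the dict layout is hypothetical.
SMALL_ENV_COUNTS = {"pref": 23, "demo": 2, "rating": 23, "stop": 16}
LANDER_COUNTS = {"pref": 92, "demo": 8, "rating": 92, "stop": 64}


def total_samples(counts):
    """Total offline feedback samples across all modalities."""
    return sum(counts.values())


def active_modalities(counts):
    """Modalities that actually contribute samples to a study."""
    return [name for name, n in counts.items() if n > 0]


print(total_samples(SMALL_ENV_COUNTS))    # 64
print(total_samples(LANDER_COUNTS))       # 256
print(active_modalities(LANDER_COUNTS))   # ['pref', 'demo', 'rating', 'stop']
```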

To inspect any individual study's best trial (works for both tables):

python -m mavrl_experiments.optuna_search \
    --study-name grid_trap_pdrs_fixed \
    --storage $SCRATCH/mavrl/optuna_studies/grid_trap/grid_trap_pdrs_fixed.log \
    --env-config grid_trap_fixed --show-results

Plotting a study

scripts/plot_optuna_study.py writes interactive Plotly HTML files (optimization history, param importances, slice, parallel coordinates, contour) under figures/optuna/<study_name>/. Safe to run mid-study — the journal backend tolerates concurrent reads.

# Equal-budget joint study, lunar_lander_v3 (pdrs at budget 256)
python scripts/plot_optuna_study.py \
    --study-name lunar_lander_v3_pdrs_b256 \
    --storage-dir $SCRATCH/mavrl/optuna_studies/lunar_lander_v3

Substitute the study name to plot any other env / subset / budget. To sweep all five "tracked" subsets for one env quickly:

for sub in pref demo rating stop pdrs; do
    python scripts/plot_optuna_study.py \
        --study-name lunar_lander_v3_${sub}_b256 \
        --storage-dir $SCRATCH/mavrl/optuna_studies/lunar_lander_v3
done

Then scp the figures/optuna/ tree back to your laptop and open the HTMLs in a browser. The optimization-history plot is usually the most informative for "is the search still improving or has it plateaued."

Resuming and adding more trials

To add more trials to an existing study, resubmit with the same STUDY_NAME and STORAGE_PATH. Workers will load the existing study (load_if_exists=True), fit TPE on the existing history, and append new trials. The original direction/metric is preserved — you cannot change them mid-study; start a fresh study instead.

Tips

  • Test the configuration locally with --n-trials 1 --n-seeds 1 before submitting an array job. Most config errors (typos, missing policies, invalid hyperparam ranges) surface in the first trial.
  • The first few trials in any new study are random startup samples (n_startup_trials); TPE only kicks in after enough completed trials are visible across all workers.
  • Slurm logs print the resolved per-trial allocation as Allocation: {...} at the end of each --show-results invocation, which is the most useful artifact for downstream sweeps.
