Unified Multi-modal Feedback using Amortized Variational Inference
MAVRL - Unified Multi-modal Feedback using Amortized Variational Inference
This package implements a variational inference approach for learning reward functions from multiple types of feedback (preferences, demonstrations, etc.).
Repository layout
The repository ships two top-level Python packages:
- mavrl/: the algorithm itself (encoders, feedback models, datasets, losses, environment wrappers, retraining utilities). Importable as import mavrl.
- mavrl_experiments/: the infrastructure that runs the algorithm (Optuna search, distributed file queues, table printers, Slack watchers, CLI entry points, and the experiment configs themselves in mavrl_experiments/configs/{experiments,optuna}/). Importable as import mavrl_experiments and invoked via python -m mavrl_experiments.<module>.
mavrl_experiments depends on mavrl (one-way); mavrl never imports from
mavrl_experiments. The split keeps the algorithm package focused and lets
infrastructure evolve without touching algorithm code.
Top-level entry-point scripts (train.py, transfer.py,
evaluate_reward_model.py, train_online.py) live at the repo root.
Installation
Ensure that you are using Python 3.11.6. On Euler, load the correct Python version using:
module load stack/2024-06 python/3.11.6
Ensure that you are at the root of this project. Create a fresh virtual environment with this exact name:
python -m venv venv/
.gitignore will ignore this virtual environment.
Activate the virtual environment:
source venv/bin/activate
Install all required dependencies:
pip install -r requirements.txt
pip install -e .
The first command installs all Python packages except mavrl. The second installs an editable version of mavrl.
Running a single trial
To run a single trial, execute
python train.py
Running an experiment
Instead of running just a single trial, you can run a potentially large number of trials through our CLI. Here is an overview of the process:
1. Specifying all configurations
Specify all experimental configurations using the ExperimentGrid class. This will exhaustively run all valid combinations of the specified parameters.
For an example on how to specify a grid of configurations, see mavrl_experiments/configs/experiments/sweep_grid_trap.py.
You can specify configurations in four ways:
- By passing the base_config to the ExperimentGrid constructor. These are parameters that are shared between all configurations.
- By adding a parameter sweep with grid.add. Values are specified as lists.
- By adding a conditional parameter with grid.add_conditional. Supply a boolean function to the condition argument that defines whether a configuration fulfills the condition to contain these parameter values.
- By removing invalid configurations with grid.add_validator.
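The four mechanisms can be illustrated with a small stand-in. The sketch below does not use the package's ExperimentGrid, whose exact signatures are not shown here; expand_grid and all parameter names (n_pref, n_demo, pref_noise) are hypothetical. It only mirrors the semantics described above: shared base values, a Cartesian sweep, a conditional parameter, and a validator.

```python
from itertools import product


def expand_grid(base, sweeps, conditionals=(), validators=()):
    """Expand a parameter grid into a list of config dicts.

    Hypothetical stand-in for the semantics described above, not the
    package's ExperimentGrid implementation.
    """
    names = list(sweeps)
    configs = []
    for values in product(*(sweeps[n] for n in names)):
        partial = {**base, **dict(zip(names, values))}
        # Each conditional adds its values only where its condition holds.
        variants = [partial]
        for name, cond_values, condition in conditionals:
            expanded = []
            for cfg in variants:
                if condition(cfg):
                    expanded.extend({**cfg, name: v} for v in cond_values)
                else:
                    expanded.append(cfg)
            variants = expanded
        # Validators drop configurations that should never be run.
        configs.extend(c for c in variants if all(ok(c) for ok in validators))
    return configs


configs = expand_grid(
    base={"env": "grid_trap", "lr": 1e-3},
    sweeps={"n_pref": [0, 16], "n_demo": [0, 2]},
    conditionals=[("pref_noise", [0.0, 0.1], lambda c: c["n_pref"] > 0)],
    validators=[lambda c: c["n_pref"] + c["n_demo"] > 0],
)
print(len(configs))
```

Here the sweep yields four combinations, the conditional doubles the two that contain preferences, and the validator removes the one without any feedback, so only valid combinations remain.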
NOTE: Any paths that are specified in the grid should be absolute paths for the machine that you plan to run the experiment on. Otherwise paths will not be correctly recognized.
Once your grid is set up, populate the database with experiments:
python -m mavrl_experiments.cli add-grid <your_config_name> --seeds 5
This will create a database containing all configuration parameters that will be read out by the workers, but no results yet.
NOTE: Populating the database might take a long time on Euler, while it might only take a few seconds on your local system. Consider populating the database locally and copying it to Euler after.
This command is idempotent: issuing it again does not delete pre-existing entries with equivalent configurations; only new configurations are added.
--seeds specifies the number of trials (differing by seed) that are run per configuration. So if you have 100 distinct configurations, --seeds 5 will result in 500 trials.
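To make the idempotency and the seed arithmetic concrete, here is a minimal sketch assuming a SQLite-style queue. The table layout, the add_trials helper, and the parameter names are hypothetical; the package's real database schema may differ.

```python
import json
import sqlite3


def add_trials(conn, configs, n_seeds):
    """Insert one pending row per (config, seed); existing rows are kept.

    Hypothetical schema for illustration only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials ("
        "config TEXT NOT NULL, seed INTEGER NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "UNIQUE (config, seed))"
    )
    rows = [
        # sort_keys gives equivalent configs the same canonical string.
        (json.dumps(cfg, sort_keys=True), seed)
        for cfg in configs
        for seed in range(n_seeds)
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO trials (config, seed) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]


conn = sqlite3.connect(":memory:")
configs = [{"lr": lr, "n_pref": n} for lr in (1e-3, 1e-4) for n in (8, 16)]
first = add_trials(conn, configs, n_seeds=5)   # 4 configs x 5 seeds
second = add_trials(conn, configs, n_seeds=5)  # same grid again: nothing new
third = add_trials(conn, configs + [{"lr": 1e-2, "n_pref": 8}], n_seeds=5)
print(first, second, third)
```

The unique constraint on (config, seed) is what makes the repeated call a no-op, while a newly added configuration contributes exactly n_seeds further rows.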
2. Checking experiment status
At any point during the experiment, you can check the progress using
python -m mavrl_experiments.cli status
Since you haven't started yet, you will see something like this.
Experiment Queue Status (rb_experiment_001.db)
========================================
Pending: 22320
Running: 0
Completed: 0
Failed: 0
----------------------------------------
Total: 22320
Progress: [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.0%
Do not forget to specify the correct database path with this command in case you use a custom path.
3. Submit experiment
Now you can have workers pick up tasks from the queue (see scripts/ for cluster submission scripts).
Hyperparameter search (Optuna)
For finding good multi-modal feedback allocations under a fixed sample budget,
use the Optuna-based search in mavrl_experiments/optuna_search.py. It samples
a Dirichlet-distributed allocation over modalities (always summing exactly to
--budget) and jointly searches over reward-model and PPO retraining
hyperparameters defined in the env config.
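One way to turn Dirichlet proportions into integer sample counts that sum exactly to the budget is sketched below. This is not the package's implementation: the largest-remainder rounding, the concentration alpha, and the function name sample_allocation are assumptions made for illustration.

```python
import random


def sample_allocation(modalities, budget, rng, alpha=1.0):
    """Split an integer budget across modalities via Dirichlet proportions.

    Sketch only: the rounding scheme (largest remainder) and alpha are
    assumptions, not the package's documented choice.
    """
    # Normalised Gamma(alpha, 1) draws follow a symmetric Dirichlet(alpha).
    draws = [rng.gammavariate(alpha, 1.0) for _ in modalities]
    total = sum(draws)
    shares = [budget * d / total for d in draws]
    counts = [int(s) for s in shares]  # floor, since shares are non-negative
    # Hand the leftover samples to the largest fractional parts.
    leftover = budget - sum(counts)
    by_fraction = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return dict(zip(modalities, counts))


rng = random.Random(0)
allocation = sample_allocation(
    ["pref", "demo", "rating", "stop"], budget=256, rng=rng
)
print(allocation, sum(allocation.values()))
```

Rounding each share independently would usually miss the budget by a few samples; distributing the remainder keeps the total fixed, which is the property the search relies on.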
Configs live at mavrl_experiments/configs/optuna/<env>.py (override the root via
MAVRL_CONFIG_ROOT). Each config defines:
- BASE_CONFIG: fixed parameters,
- MODALITY_PARAMS: per-modality hyperparameters applied when that modality has samples > 0,
- HYPERPARAM_SEARCH_SPACE: the search space (categorical lists or (low, high, log) continuous ranges),
- MODALITIES: the ordered list of modality sample-count keys.
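A config along these lines might look like the sketch below. Only the four top-level names and the (low, high, log) convention come from the description above; every key and value inside them (n_pref, pref_loss_weight, and so on) is a hypothetical placeholder, and the two helpers are illustrative rather than part of the package. In the real search Optuna's sampler proposes the values; the random draws here only show the shape of the search space.

```python
import math
import random

# All keys and values below are illustrative placeholders, not the
# package's real config contents.
BASE_CONFIG = {"env": "lunar_lander_v3", "retrain_n_timesteps": 1_000_000}
MODALITIES = ["n_pref", "n_demo"]  # hypothetical sample-count keys
MODALITY_PARAMS = {
    "n_pref": {"pref_loss_weight": 1.0},
    "n_demo": {"demo_loss_weight": 0.5},
}
HYPERPARAM_SEARCH_SPACE = {
    "batch_size": [32, 64, 128],    # categorical list
    "lr": (1e-5, 1e-2, True),       # (low, high, log) continuous range
    "kl_weight": (0.0, 1.0, False),
}


def sample_hyperparams(space, rng):
    """Draw one value per entry: a choice for lists, a float for ranges."""
    params = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            params[name] = rng.choice(spec)
        else:
            low, high, log = spec
            if log:  # uniform in log-space
                params[name] = math.exp(
                    rng.uniform(math.log(low), math.log(high))
                )
            else:
                params[name] = rng.uniform(low, high)
    return params


def build_trial_config(allocation, hyperparams):
    """Merge fixed, per-modality, and sampled parameters for one trial."""
    config = {**BASE_CONFIG, **allocation, **hyperparams}
    for key in MODALITIES:
        if allocation.get(key, 0) > 0:  # only active modalities contribute
            config.update(MODALITY_PARAMS[key])
    return config


rng = random.Random(0)
config = build_trial_config(
    {"n_pref": 256, "n_demo": 0},
    sample_hyperparams(HYPERPARAM_SEARCH_SPACE, rng),
)
print(config)
```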
Local end-to-end test
The recipe below runs a minimal single-worker search on lunar_lander_v3.
A full trial does a complete PPO retrain (1M timesteps by default), which
is slow on a laptop. To iterate faster locally, temporarily add
"retrain_n_timesteps": 100_000 to BASE_CONFIG in
mavrl_experiments/configs/optuna/lunar_lander_v3.py (don't commit that — it's just for testing).
# 1. Pre-generate cached datasets (1 seed is enough for a smoke test).
# --gen_samples should be >= the budget you plan to test.
python scripts/pregenerate_datasets.py \
--config lunar_lander_v3 \
--cache_dir dataset_cache/lander_local \
--seeds 1 \
--gen_samples 256 \
--gen_samples_demo 256
# 2. Run a small search (single worker, few trials, one seed per trial).
python -m mavrl_experiments.optuna_search \
--study-name lander_b256_local \
--storage optuna_journal_lander_local.log \
--env-config lunar_lander_v3 \
--budget 256 \
--n-seeds 1 \
--n-trials 5 \
--dataset-cache-dir dataset_cache/lander_local
# 3. Inspect the results (passing --env-config enables the normalized-score column).
python -m mavrl_experiments.optuna_search \
--study-name lander_b256_local \
--storage optuna_journal_lander_local.log \
--env-config lunar_lander_v3 \
--show-results
The journal file (optuna_journal_lander_local.log) is append-only and
NFS-safe, so re-running step 2 with the same --study-name and --storage
will continue the same study.
Cluster submission
scripts/submit_optuna.sh runs an Optuna worker as a SLURM array task.
Every array element is an independent worker; they coordinate through a
shared journal file (NFS-safe, append-only), so there is no central
scheduler. Each worker fits its own TPE model from the shared trial
history and proposes its own next trial.
Prerequisites
- Virtual environment. The script activates venv/ (or ../venv/) automatically. Create it as described in Installation.
- Journal directory. Pick a path on a shared filesystem reachable from all compute nodes (e.g. $SCRATCH/mavrl/optuna_studies/). The journal file will be created on first run.
- Dataset cache (recommended). Pre-generate datasets once so trials don't redo expensive sample generation. --gen_samples should be at least the budget you intend to search:
  python scripts/pregenerate_datasets.py \
  --config lunar_lander_v3 \
  --cache_dir $SCRATCH/mavrl/dataset_cache/lander \
  --seeds 3 \
  --gen_samples 256 \
  --gen_samples_demo 256
  Use the same --seeds value as your trial N_SEEDS (workers seed trials as 0..N_SEEDS-1).
Submission
The script reads its configuration from environment variables. Required:
| Variable | Meaning |
|---|---|
| STUDY_NAME | Optuna study name. Use a fresh name per (metric, direction, budget): load_if_exists=True silently reuses an existing study's direction. |
| ENV_CONFIG | Config name under mavrl_experiments/configs/optuna/ (e.g. lunar_lander_v3). |
| BUDGET | Total feedback samples per trial (sum across modalities). |
| STORAGE_PATH | Path to the journal .log file. |
Optional:
| Variable | Default | Meaning |
|---|---|---|
| N_SEEDS | 3 | Seeds evaluated per trial; the trial value is the mean across seeds. |
| N_TRIALS | 20 | Trials per worker. With a 32-task array, total trials ≈ 32 × N_TRIALS. |
| METRIC | eval/regret | Final-evaluation key to optimize (e.g. eval/mean_rew, eval/discounted_value). |
| DIRECTION | minimize | minimize or maximize. Pair with METRIC correctly. |
| SINGLE_MODALITY | unset | If set to pref/demo/rating/stop, the entire BUDGET is allocated to that modality. Useful for single-modality baselines. |
| WANDB_PROJECT | unset | Log every trial run to this wandb project. |
| DATASET_CACHE_DIR | unset | Point trials at a pre-generated dataset cache. |
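Before submitting, it can help to check the variables in Python. The helper below is a hypothetical pre-flight check, not part of the repository: it applies the defaults from the table above, rejects missing required variables, and estimates the total number of trials for a given array size. The storage path in the example is a placeholder.

```python
REQUIRED = ("STUDY_NAME", "ENV_CONFIG", "BUDGET", "STORAGE_PATH")
DEFAULTS = {
    "N_SEEDS": "3",
    "N_TRIALS": "20",
    "METRIC": "eval/regret",
    "DIRECTION": "minimize",
}


def resolve_settings(env):
    """Return the effective settings for a submission, or raise on bad input."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    known = {k: v for k, v in env.items() if k in REQUIRED or k in DEFAULTS}
    settings = {**DEFAULTS, **known}
    if settings["DIRECTION"] not in ("minimize", "maximize"):
        raise ValueError("DIRECTION must be 'minimize' or 'maximize'")
    for name in ("BUDGET", "N_SEEDS", "N_TRIALS"):
        settings[name] = int(settings[name])
    return settings


def total_trials(settings, array_size=32):
    """Approximate trial count: every array task runs N_TRIALS trials."""
    return array_size * settings["N_TRIALS"]


example = resolve_settings({
    "STUDY_NAME": "lander_b256_meanrew",
    "ENV_CONFIG": "lunar_lander_v3",
    "BUDGET": "256",
    "STORAGE_PATH": "/tmp/lander_b256_meanrew.log",  # placeholder path
    "METRIC": "eval/mean_rew",
    "DIRECTION": "maximize",
})
print(example["N_SEEDS"], total_trials(example))
```

Passing os.environ instead of the literal dict checks the variables exported in the current shell.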
Combined-modality run (Dirichlet allocation across all modalities):
STUDY_NAME=lander_b256_meanrew \
ENV_CONFIG=lunar_lander_v3 \
BUDGET=256 \
STORAGE_PATH=$SCRATCH/mavrl/optuna_studies/lander_b256_meanrew.log \
METRIC=eval/mean_rew DIRECTION=maximize \
N_SEEDS=3 N_TRIALS=20 \
DATASET_CACHE_DIR=$SCRATCH/mavrl/dataset_cache/lander \
sbatch scripts/submit_optuna.sh
Single-modality baseline (e.g. all-preferences) under the same budget, for comparison:
STUDY_NAME=lander_b256_meanrew_prefonly \
ENV_CONFIG=lunar_lander_v3 \
BUDGET=256 \
STORAGE_PATH=$SCRATCH/mavrl/optuna_studies/lander_b256_meanrew_prefonly.log \
METRIC=eval/mean_rew DIRECTION=maximize \
SINGLE_MODALITY=pref \
N_SEEDS=3 N_TRIALS=20 \
DATASET_CACHE_DIR=$SCRATCH/mavrl/dataset_cache/lander \
sbatch scripts/submit_optuna.sh
Adjusting array size and resources
The script defaults to --array=0-31 (32 workers), 4 CPUs each,
4 hours wall time. Override at submit time:
sbatch --array=0-15 --time=08:00:00 scripts/submit_optuna.sh # 16 workers, 8h
sbatch --array=0-63 --cpus-per-task=8 scripts/submit_optuna.sh # 64 workers, 8 CPUs each
Logs land in logs/slurm/optuna_<jobid>_<taskid>.out|err.
Monitoring & inspecting results
While running, the journal file is readable:
python -m mavrl_experiments.optuna_search \
--study-name lander_b256_meanrew \
--storage $SCRATCH/mavrl/optuna_studies/lander_b256_meanrew.log \
--env-config lunar_lander_v3 \
--show-results
This works mid-run (you'll just see partial results) and after
completion. Passing --env-config enables a normalized-score column
when results/normalization_values.json has entries for the env.
Two main tables: equal-budget and fixed-allocation
There are two pre-built launchers that each submit 66 Optuna studies (6 envs × 11 modality subsets). They answer different questions:
| Launcher | Allocation | Question |
|---|---|---|
| launch_equal_budget_table.sh | Dirichlet over budget | Are modalities complementary when you spend a fixed total budget? |
| launch_fixed_allocation_table.sh | Prescribed per-modality | Can MAVRL combine arbitrary offline feedback datasets to produce gains? |
Both share the same 11-subset layout (pref, demo, rating, stop, all
6 pairs, and pdrs = all four). The two are designed to live side-by-side
in $STORAGE_ROOT — study suffixes differ (_b<N> vs _fixed), so they
don't collide.
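The subset layout and the naming scheme can be enumerated in a few lines. In this sketch the study-name pattern follows the examples used elsewhere in this README (such as lunar_lander_v3_pdrs_b256 and grid_trap_pdrs_fixed), while the two-letter names for pairs are an assumption made by analogy with pdrs; both helper functions are hypothetical.

```python
from itertools import combinations

MODALITIES = ["pref", "demo", "rating", "stop"]


def subset_names():
    """Four singles, six pairs, and the full set: 11 subsets in total."""
    singles = list(MODALITIES)
    # Assumption: pairs are abbreviated by initials, mirroring "pdrs".
    pairs = ["".join(m[0] for m in pair) for pair in combinations(MODALITIES, 2)]
    full = ["".join(m[0] for m in MODALITIES)]
    return singles + pairs + full


def study_name(env, subset, budget=None):
    """Equal-budget studies end in _b<N>; fixed-allocation ones in _fixed."""
    suffix = f"b{budget}" if budget is not None else "fixed"
    return f"{env}_{subset}_{suffix}"


subsets = subset_names()
print(len(subsets), subsets)
print(study_name("lunar_lander_v3", "pdrs", budget=256))
print(study_name("grid_trap", "pdrs"))
```

With 6 envs, the 11 subsets give the 66 studies per launcher, and the differing suffixes keep the two tables apart in the same storage root.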
1. Equal-budget table — modality complementarity
For each env, fix a single total feedback budget and let Optuna's
Dirichlet allocation split it across whichever modalities are active in
the study. Tests whether two modalities together at total budget B beat
the best single modality at B.
# Submit all 66 studies (default per-env budgets: grid=64, control=64, lander=256)
bash scripts/launch_equal_budget_table.sh
# Filter to a subset of envs/subsets / dry-run
ENVS="grid_trap" SUBSETS="pdrs pref" bash scripts/launch_equal_budget_table.sh
DRY_RUN=1 bash scripts/launch_equal_budget_table.sh
# Override per-env-group budgets
BUDGET_GRID=128 bash scripts/launch_equal_budget_table.sh
Snapshot the current best value of every cell into one printed table (safe mid-optimization; reads the journal files):
python -m mavrl_experiments.equal_budget_table \
--storage-root $SCRATCH/mavrl/optuna_studies
Cells render as normalized percentages (uniform=0%, optimal=100%) when
results/normalization_values.json covers the env. Filter with
--envs grid_cliff lunar_lander_v3 to print a subset of rows.
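The percentage scale can be read as a linear rescaling between the two reference values. The function below is a sketch under that assumption; the reference numbers are made up, and the actual layout of results/normalization_values.json is not shown here.

```python
def normalized_percent(value, uniform, optimal):
    """Map a raw score so that uniform -> 0% and optimal -> 100%.

    Assumes a linear rescaling between the two reference values.
    """
    if optimal == uniform:
        raise ValueError("reference values must differ")
    return 100.0 * (value - uniform) / (optimal - uniform)


# Hypothetical reference values for one env.
refs = {"uniform": -200.0, "optimal": 250.0}
print(normalized_percent(25.0, **refs))
```

A value halfway between the two references maps to 50%, and values outside the reference range fall below 0% or above 100%.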
2. Fixed-allocation table — gains from heterogeneous offline data
For each env, prescribe per-modality sample counts in
mavrl_experiments/configs/optuna/<env>_fixed.py:FIXED_SAMPLE_COUNTS. Each study uses
exactly those counts (no Dirichlet, no shared budget); Optuna instead
searches the optimizer/loss hyperparameters that combine the modalities:
td_error_weight, kl_weight, use_importance_weights, lr,
batch_size, encoder_hidden_sizes (and the PPO retraining hparams for
non-tabular envs). Tests the "you have offline data of various kinds
lying around — can our method turn it into a better reward model than any
single-modality alternative?" story.
# Submit all 66 studies using prescribed counts from <env>_fixed.py
bash scripts/launch_fixed_allocation_table.sh
# Filter / dry-run (same hooks as the equal-budget launcher)
ENVS="grid_trap acrobot_v1" bash scripts/launch_fixed_allocation_table.sh
DRY_RUN=1 bash scripts/launch_fixed_allocation_table.sh
Default FIXED_SAMPLE_COUNTS (small values, totals near a power of 2;
tune in the <env>_fixed.py config to match your offline-data scenario):
| env | pref | demo | rating | stop | total |
|---|---|---|---|---|---|
| grid_* | 23 | 2 | 23 | 16 | 64 |
| acrobot_v1 | 23 | 2 | 23 | 16 | 64 |
| cartpole_v1 | 23 | 2 | 23 | 16 | 64 |
| lunar_lander_v3 | 92 | 8 | 92 | 64 | 256 |
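As a quick sanity check of the table, the counts can be written as a dict and summed. The dict layout is hypothetical (only the numbers come from the table above), and zeroing inactive modalities for a subset study is an assumption about how subsets are applied.

```python
# Hypothetical dict layout; only the numbers come from the table above.
FIXED_SAMPLE_COUNTS = {
    "acrobot_v1": {"pref": 23, "demo": 2, "rating": 23, "stop": 16},
    "lunar_lander_v3": {"pref": 92, "demo": 8, "rating": 92, "stop": 64},
}


def counts_for_subset(env, subset):
    """Keep the prescribed counts for active modalities, zero the rest."""
    return {
        m: (n if m in subset else 0)
        for m, n in FIXED_SAMPLE_COUNTS[env].items()
    }


totals = {env: sum(c.values()) for env, c in FIXED_SAMPLE_COUNTS.items()}
print(totals)
print(counts_for_subset("lunar_lander_v3", {"pref", "demo"}))
```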
To inspect any individual study's best trial (works for both tables):
python -m mavrl_experiments.optuna_search \
--study-name grid_trap_pdrs_fixed \
--storage $SCRATCH/mavrl/optuna_studies/grid_trap/grid_trap_pdrs_fixed.log \
--env-config grid_trap_fixed --show-results
Plotting a study
scripts/plot_optuna_study.py writes interactive Plotly HTML files
(optimization history, param importances, slice, parallel coordinates,
contour) under figures/optuna/<study_name>/. Safe to run mid-study —
the journal backend tolerates concurrent reads.
# Equal-budget joint study, lunar_lander_v3 (pdrs at budget 256)
python scripts/plot_optuna_study.py \
--study-name lunar_lander_v3_pdrs_b256 \
--storage-dir $SCRATCH/mavrl/optuna_studies/lunar_lander_v3
Substitute the study name to plot any other env / subset / budget. To sweep all five "tracked" subsets for one env quickly:
for sub in pref demo rating stop pdrs; do
python scripts/plot_optuna_study.py \
--study-name lunar_lander_v3_${sub}_b256 \
--storage-dir $SCRATCH/mavrl/optuna_studies/lunar_lander_v3
done
Then scp the figures/optuna/ tree back to your laptop and open the
HTMLs in a browser. The optimization-history plot is usually the most
informative for "is the search still improving or has it plateaued."
Resuming and adding more trials
To add more trials to an existing study, resubmit with the same
STUDY_NAME and STORAGE_PATH. Workers will load the existing study
(load_if_exists=True), fit TPE on the existing history, and append
new trials. The original direction/metric is preserved — you cannot
change them mid-study; start a fresh study instead.
Tips
- Test the configuration locally with --n-trials 1 --n-seeds 1 before submitting an array job. Most config errors (typos, missing policies, invalid hyperparam ranges) surface in the first trial.
- The first few trials in any new study are random startup samples (n_startup_trials); TPE only kicks in after enough completed trials are visible across all workers.
- Slurm logs print the resolved per-trial allocation as Allocation: {...} at the end of each --show-results invocation, which is the most useful artifact for downstream sweeps.