Inference-time value steering for vLLM: dynamic abstention + value-filtered decoding.
value-steer
Inference-time value steering for vLLM: two decode-time interventions driven by a shared scalar value head.
- Dynamic abstention (Knowing When to Quit, ICML 2026) — gate generation to EOS when the value crosses a calibrated threshold.
- Value-filtered decoding (Selective Safety Steering via Value-Filtered Decoding) — at each step, sample K candidates and commit one by a safety value, keeping the natural sample when it is already safe.
Both score the same feature (the backbone's final post-norm hidden state, the tensor
lm_head consumes) with the same head, so one trained probe serves either mode.
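The two decision rules can be sketched in a few lines of plain Python. This is an illustration only, not the package's runner code: the function names are hypothetical, the value is assumed to mean P(unsafe) (higher is worse), "crosses" is assumed to mean `>=`, candidate 0 is assumed to be the natural sample, and the fallback of committing the lowest-value candidate is one plausible reading of "commit one by a safety value".

```python
from typing import Sequence


def should_abstain(value: float, threshold: float) -> bool:
    """Dynamic abstention: gate generation to EOS once the value crosses the threshold.

    Assumption: "crosses" means value >= threshold.
    """
    return value >= threshold


def value_filtered_pick(
    candidates: Sequence[int], values: Sequence[float], threshold: float
) -> int:
    """Value-filtered decoding for one step.

    Assumptions: candidates[0] is the natural sample, values[i] scores
    candidates[i] as P(unsafe), and the fallback is the lowest-value candidate.
    """
    if len(candidates) != len(values) or not candidates:
        raise ValueError("need one value per candidate and at least one candidate")
    if values[0] < threshold:
        # The natural sample is already safe: keep it untouched.
        return candidates[0]
    # Otherwise commit the candidate the head considers safest.
    safest = min(range(len(candidates)), key=lambda i: values[i])
    return candidates[safest]
```

Because both rules read the same scalar, swapping modes changes only the decision applied to the value, not the head that produces it.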
Install
pip install value-steer # core (torch, numpy) — pure modules + training/calibration
pip install "value-steer[vllm]" # + the vLLM runtime (serving / decoding)
pip install "value-steer[train]" # + probe training (transformers)
pip install "value-steer[dev]" # + pytest, ruff, build, twine
vLLM is an optional dependency pinned to the behaviorally-validated span
(>=0.19.1,<0.20); install it to match your CUDA driver, and run value-steer-compat
before widening the pin (see Compatibility). The pure modules (value head, steering
ops, calibration, probe training) import without vLLM, so training/calibration boxes
need only the core install.
A pre-trained safety value head (Mistral-7B-Instruct-v0.3 backbone, hh-rlhf labels via a
Llama-3.1 judge) is published at
HenDav/value-steer-safety-head —
see its model card for the feature contract and a ready-to-use config snippet.
Use
Both modes plug in via vLLM's supported --worker-cls surface — no monkeypatching.
Abstention:
vllm serve <model> \
--worker-cls value_steer.worker.ValueSteerWorker \
--additional-config '{"abstain": {"value_head_path": "vhead.pt", "threshold": 0.5}}'
Value-filtered decoding (run with speculative decoding OFF — VFD owns the decode forward):
vllm serve <model> \
--worker-cls value_steer.worker.ValueSteerWorker \
--additional-config '{"vfd": {"value_head_path": "vhead.pt", "threshold": 0.3, "num_candidates": 8}}'
Override the threshold per request via SamplingParams.extra_args (keys abstain_threshold / vfd_threshold).
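The two serve invocations above differ only in the JSON passed to --additional-config. A small helper can build that JSON and quote it for the shell, which avoids hand-escaping mistakes. The helper name `build_serve_command` is hypothetical; the flags and config keys are the ones shown above.

```python
import json
import shlex
from typing import Optional


def build_serve_command(
    model: str,
    mode: str,
    value_head_path: str,
    threshold: float,
    num_candidates: Optional[int] = None,
) -> str:
    """Return a shell-safe `vllm serve` command for one steering mode."""
    if mode not in ("abstain", "vfd"):
        raise ValueError("mode must be 'abstain' or 'vfd'")
    section = {"value_head_path": value_head_path, "threshold": threshold}
    if mode == "vfd" and num_candidates is not None:
        section["num_candidates"] = num_candidates
    config = json.dumps({mode: section})
    return " ".join(
        [
            "vllm",
            "serve",
            shlex.quote(model),
            "--worker-cls",
            "value_steer.worker.ValueSteerWorker",
            "--additional-config",
            shlex.quote(config),  # single-quotes the JSON for the shell
        ]
    )


print(build_serve_command("my-model", "vfd", "vhead.pt", 0.3, num_candidates=8))
```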
Train a probe
from value_steer.train_probe import train_probe, save_probe_checkpoint
from value_steer.value_probe import ValueHead
head = ValueHead(hidden_size) # shared head; fp32 on the post-norm feature
train_probe(backbone, head, train_loader, loss_name="focal", use_td=True, coh_weight=0.1)
save_probe_checkpoint("vhead.pt", head, threshold=calibrated_c, meta={"loss": "focal"})
save_probe_checkpoint writes the bare head weights to vhead.pt (loaded by the
runners) plus a vhead.pt.meta.json sidecar with the feature spec, calibrated
threshold, and metadata. The objective is label-agnostic — your labels define whether
the value means P(unsafe) or P(should-quit).
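For readers unfamiliar with the "focal" option, the standard binary focal loss is sketched below in plain Python. It is the textbook formula, not necessarily what train_probe computes: the default gamma of 2.0, the clamping constant, and the absence of class weighting or TD targets are all assumptions made for illustration.

```python
import math


def binary_focal_loss(p: float, label: int, gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Textbook binary focal loss for one prediction.

    p is the predicted probability of the positive class (for example P(unsafe)),
    label is 0 or 1. gamma=0 reduces to ordinary cross-entropy.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    p_t = p if label == 1 else 1.0 - p  # probability assigned to the true class
    # The (1 - p_t)^gamma factor down-weights examples that are already easy.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

The formula never asks what the positive class means, which is why the same objective can train a P(unsafe) head or a P(should-quit) head.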
Calibrate the threshold
from value_steer.calibration import posterior_threshold # VFD / posterior filter
from value_steer.calibration import martingale_threshold # time-to-unsafe martingale
c = posterior_threshold(safe_labels, trajectories, tau=0.05)
Given held-out (label, per-step value trajectory) pairs, this returns the threshold
with a finite-sample bound on false interventions: P_H0(max_t p_t ≥ c) ≤ tau. That is
the guarantee the threshold is supposed to carry — not a hand-tuned number.
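One standard way to obtain a bound of this shape is a split-conformal quantile over the per-trajectory maxima of held-out null (safe) trajectories. The sketch below shows that construction; it is not claimed to be how posterior_threshold is implemented. The function name is hypothetical, the input is assumed to be already filtered to null trajectories, and the bound holds for strict exceedance (max_t p_t > c) under exchangeability, which coincides with the `≥` form when values have no ties.

```python
import math
from typing import Sequence


def max_value_threshold(null_trajectories: Sequence[Sequence[float]], tau: float) -> float:
    """Conformal threshold c with P(max_t p_t > c) <= tau on a fresh null trajectory.

    Assumption: calibration and test trajectories are exchangeable.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    # One score per trajectory: the largest value the head ever emitted.
    peaks = sorted(max(t) for t in null_trajectories if len(t) > 0)
    n = len(peaks)
    # Finite-sample correction: use the ceil((n + 1)(1 - tau))-th smallest peak.
    k = math.ceil((n + 1) * (1.0 - tau))
    if k > n:
        # Too few calibration trajectories for this tau: never intervene.
        return math.inf
    return peaks[k - 1]
```

With 10 null trajectories, tau=0.2 selects the 9th smallest peak, while tau=0.05 cannot be certified and returns infinity; that is the finite-sample correction at work rather than a tuning choice.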
Compatibility
The runners bind to a few vLLM internals. compat_check.py is the version gate:
value-steer-compat # static contract checks (needs only `import vllm`)
value-steer-compat abstain # abstention subset
Run it on every vLLM bump. Static checks fail loudly if a bound internal moved. Behavioral checks (GPU) assert that the feature actually fires; they are necessary because the runner hooks swallow errors in production, so "it ran" is not "it worked." Order each per-version run accordingly: static first (no GPU; pinpoints the broken contract), then GPU behavioral only if static is green.
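The idea behind a static contract check can be sketched as follows: import a module and walk a dotted attribute path, reporting the first missing piece. This is a generic illustration, not the contents of compat_check.py; the function name is hypothetical and the example targets are standard-library names rather than the vLLM internals the runners actually bind to.

```python
import importlib
from typing import Tuple


def check_contract(module_name: str, attr_path: str) -> Tuple[bool, str]:
    """Return (ok, message) for whether module_name exposes attr_path."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError as exc:
        return False, f"cannot import {module_name}: {exc}"
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            # Name the exact missing piece so the broken contract is obvious.
            return False, f"{module_name}.{attr_path}: missing '{part}'"
        obj = getattr(obj, part)
    return True, f"{module_name}.{attr_path}: ok"


# Stand-in targets from the standard library, for demonstration only.
for module_name, attr_path in [("json", "JSONDecoder.decode"), ("json", "no_such_name")]:
    ok, message = check_contract(module_name, attr_path)
    print("PASS" if ok else "FAIL", message)
```

A check like this needs no GPU and fails at the first moved name, which is why it makes sense to run it before any behavioral test.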
Tests
pytest -q # pure-logic suite (no GPU, no vLLM): ops, calibration, training, allocator
Status
| Component | State |
|---|---|
| value head, steering ops, calibration, training | complete, CPU-tested |
| abstention runner | complete vs pinned APIs; EOS-fires check is the GPU behavioral test |
| VFD runner | complete and GPU-validated (A100, vLLM 0.19.1, Mistral-7B): single-forward K-candidate decode, end-to-end safer outputs under a Llama-3.1 judge; no silent gaps remain |
| --worker-cls entry point, packaging, compat harness, version registry | complete |
The VFD candidate forward goes through _model_forward + the attention-metadata builder
(standard paged decode); the KV cache-write is backend-specific and requires FlashAttention
v2's KV layout (compute capability ≥ 8.0). GPU behavioral tests live in
tests/test_gpu_behavioral.py (marked gpu, skipped without CUDA) and assert the features
fire, not merely run.
Limitations
- vLLM pin. Bound to the behaviorally-validated span >=0.19.1,<0.20; the runners ground against vLLM internals that shift across minor versions. The registry in value_steer/validated_versions.json is authoritative and warns at runtime for untested in-range versions; widen only after value-steer-compat passes on a GPU box.
- Serving default is eager. The VFD CUDA-graph/compile path is single-stream only (it corrupts concurrent requests under cudagraphs); enforce_eager=True is correct for all batch sizes and is the serving default. The compile speedup is an explicit opt-in (vfd.single_stream=True + one request at a time) for offline/benchmark use.
- VFD threshold. The head steers around threshold 0.3; the conformal posterior_threshold in a head's sidecar is conservative (bounds false interventions) and can sit higher. Start at 0.3 and tune. See docs/training-a-value-head.md.
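The three states implied by the pin (validated, in range but untested, out of range) can be sketched as a small classifier. This is an illustration only: the function names are hypothetical, the set of validated versions stands in for whatever schema validated_versions.json actually uses, and the version parsing is a simplification that ignores most of PEP 440.

```python
from itertools import takewhile
from typing import Iterable, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '0.19.1rc1' -> (0, 19, 1)."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def pin_status(
    version: str,
    validated: Iterable[str],
    lower: Tuple[int, ...] = (0, 19, 1),
    upper: Tuple[int, ...] = (0, 20),
) -> str:
    """Classify a vLLM version against the span >=0.19.1,<0.20."""
    release = release_tuple(version)
    if not (lower <= release < upper):
        return "out-of-range"
    # In range: only versions listed as validated are considered tested.
    return "validated" if version in set(validated) else "untested-in-range"
```

An "untested-in-range" result corresponds to the runtime warning described above: the version satisfies the pin but has not passed value-steer-compat on a GPU box.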
Citation
If you use value-steer, please cite the software and the two papers it implements; see CITATION.cff.