Inference-time value steering for vLLM: dynamic abstention + value-filtered decoding.

value-steer

Inference-time value steering for vLLM: two decode-time interventions, dynamic abstention and value-filtered decoding (VFD), driven by a shared scalar value head.

Both score the same feature (the backbone's final post-norm hidden state, the tensor lm_head consumes) with the same head, so one trained probe serves either mode.
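
To make the shared-head idea concrete, here is a minimal sketch of a scalar head applied to one final hidden state. It assumes a single linear layer followed by a sigmoid, which is a common probe shape but is not confirmed to be what ValueHead does; the function name and arguments are hypothetical.

```python
import math
from typing import Sequence


def value_score(hidden: Sequence[float], weight: Sequence[float], bias: float = 0.0) -> float:
    """Map one hidden-state vector to a scalar in (0, 1).

    Assumption: linear projection + sigmoid. The real ValueHead may differ.
    """
    if len(hidden) != len(weight):
        raise ValueError("hidden state and weight must have the same length")
    logit = sum(h * w for h, w in zip(hidden, weight)) + bias
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

Because both modes would read the same scalar, only the decision rule applied to it changes between abstention and VFD.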

Install

pip install value-steer              # core (torch, numpy) — pure modules + training/calibration
pip install "value-steer[vllm]"      # + the vLLM runtime (serving / decoding)
pip install "value-steer[train]"     # + probe training (transformers)
pip install "value-steer[dev]"       # + pytest, ruff, build, twine

vLLM is an optional dependency pinned to the behaviorally-validated span (>=0.19.1,<0.20); install it to match your CUDA driver, and run value-steer-compat before widening the pin (see Compatibility). The pure modules (value head, steering ops, calibration, probe training) import without vLLM, so training/calibration boxes need only the core install.

A pre-trained safety value head (Mistral-7B-Instruct-v0.3 backbone, hh-rlhf labels via a Llama-3.1 judge) is published at HenDav/value-steer-safety-head — see its model card for the feature contract and a ready-to-use config snippet.

Use

Both modes plug in via vLLM's supported --worker-cls surface — no monkeypatching.

Abstention:

vllm serve <model> \
  --worker-cls value_steer.worker.ValueSteerWorker \
  --additional-config '{"abstain": {"value_head_path": "vhead.pt", "threshold": 0.5}}'
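
As a rough illustration of what an abstention threshold does, the sketch below replays a sequence of (token, value) pairs and ends generation with EOS at the first step whose value reaches the threshold. The comparison direction (>=), the timing of the score relative to the token, and the helper name are assumptions made for illustration, not the runner's actual logic.

```python
from typing import Iterable, List, Tuple


def apply_abstention(
    steps: Iterable[Tuple[int, float]],
    threshold: float,
    eos_token_id: int,
) -> List[int]:
    """Return emitted token ids, cutting off with EOS once value >= threshold.

    Each step is (token_id, value), where value is assumed to be the head's
    score for the state that would produce token_id.
    """
    emitted: List[int] = []
    for token_id, value in steps:
        if value >= threshold:
            emitted.append(eos_token_id)  # abstain: stop instead of emitting token_id
            break
        emitted.append(token_id)
    return emitted
```

With threshold 0.5 and EOS id 2, the steps (5, 0.1), (6, 0.2), (7, 0.9) yield [5, 6, 2]: the third token is replaced by EOS.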

Value-filtered decoding (run with speculative decoding OFF — VFD owns the decode forward):

vllm serve <model> \
  --worker-cls value_steer.worker.ValueSteerWorker \
  --additional-config '{"vfd": {"value_head_path": "vhead.pt", "threshold": 0.3, "num_candidates": 8}}'
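
The filtering step can be pictured as follows: score K candidate tokens with the value head, drop those at or above the threshold, and choose among the rest. This is a simplified sketch under stated assumptions (greedy choice by log-probability, fallback to the lowest-value candidate when everything is filtered); the package's selection and fallback rules are not described here and may differ.

```python
from typing import Sequence, Tuple

# (token_id, logprob, value): a hypothetical per-candidate record.
Candidate = Tuple[int, float, float]


def pick_filtered(candidates: Sequence[Candidate], threshold: float) -> int:
    """Pick a token id from K candidates after value filtering."""
    if not candidates:
        raise ValueError("need at least one candidate")
    allowed = [c for c in candidates if c[2] < threshold]
    if allowed:
        # Among candidates that pass the filter, take the most likely one.
        return max(allowed, key=lambda c: c[1])[0]
    # Assumed fallback: nothing passed, so take the least risky candidate.
    return min(candidates, key=lambda c: c[2])[0]
```

A sampling-based variant would draw from the renormalized distribution over the allowed candidates instead of taking the argmax.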

Override the threshold per request via SamplingParams.extra_args, using the key abstain_threshold (abstention) or vfd_threshold (VFD).
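
A small sketch of how such an override could be resolved against the server-wide default. The helper below is hypothetical and only illustrates the precedence (request value first, configured threshold otherwise); the comment shows how the keys named above would be passed on the client side.

```python
from typing import Any, Mapping, Optional


def effective_threshold(
    server_default: float,
    extra_args: Optional[Mapping[str, Any]],
    key: str,
) -> float:
    """Return the per-request threshold if present, else the server default."""
    if extra_args and key in extra_args:
        return float(extra_args[key])
    return server_default


# Client side, with vLLM installed (not executed here):
#   from vllm import SamplingParams
#   params = SamplingParams(max_tokens=64, extra_args={"vfd_threshold": 0.25})
```

A request that sets only vfd_threshold leaves the abstention threshold at its configured value.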

Train a probe

from value_steer.train_probe import train_probe, save_probe_checkpoint
from value_steer.value_probe import ValueHead

head = ValueHead(hidden_size)            # shared head; fp32 on the post-norm feature
train_probe(backbone, head, train_loader, loss_name="focal", use_td=True, coh_weight=0.1)
save_probe_checkpoint("vhead.pt", head, threshold=calibrated_c, meta={"loss": "focal"})

save_probe_checkpoint writes the bare head weights to vhead.pt (loaded by the runners) plus a vhead.pt.meta.json sidecar with the feature spec, calibrated threshold, and metadata. The objective is label-agnostic — your labels define whether the value means P(unsafe) or P(should-quit).
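
For readers unfamiliar with the focal option, the sketch below shows the standard binary focal loss for a single prediction. It assumes loss_name="focal" refers to this standard form; the package's exact variant (class weighting, gamma default, interaction with use_td and coh_weight) is not shown here, and the function is illustrative only.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Standard binary focal loss: -(1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    With gamma = 0 this reduces to ordinary binary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

The (1 - p_t)^gamma factor shrinks the loss on examples the head already classifies confidently, which concentrates training on hard steps.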

Calibrate the threshold

from value_steer.calibration import posterior_threshold   # VFD / posterior filter
from value_steer.calibration import martingale_threshold  # time-to-unsafe martingale

c = posterior_threshold(safe_labels, trajectories, tau=0.05)

Given held-out (label, per-step value trajectory) pairs, this returns the threshold with a finite-sample bound on false interventions: P_H0(max_t p_t ≥ c) ≤ tau. That is the guarantee the threshold is supposed to carry — not a hand-tuned number.
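
One standard way to obtain a bound of this form is a split-conformal quantile over the per-trajectory maximum of safe (H0) trajectories. The sketch below illustrates that construction under an exchangeability assumption; it is not the implementation of posterior_threshold, and the function name and argument layout are hypothetical.

```python
import math
from typing import Sequence


def max_stat_threshold(
    is_safe: Sequence[bool],
    trajectories: Sequence[Sequence[float]],
    tau: float,
) -> float:
    """Threshold c such that P(max_t p_t >= c) <= tau on a new safe trajectory.

    Assumes the calibration safe trajectories and the new one are exchangeable.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must be in (0, 1)")
    maxima = sorted(
        max(traj) for safe, traj in zip(is_safe, trajectories) if safe and len(traj) > 0
    )
    n = len(maxima)
    k = math.ceil((n + 1) * (1.0 - tau))  # conformal rank with finite-sample correction
    if k > n:
        return math.inf  # too few safe trajectories to certify any finite threshold
    # Step just above the k-th smallest maximum so the bound holds for ">= c".
    return math.nextafter(maxima[k - 1], math.inf)
```

Note how the (n + 1) correction makes small calibration sets conservative: with three safe trajectories and tau = 0.05 no finite threshold can be certified, so the sketch returns infinity.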

Compatibility

The runners bind to a few vLLM internals. compat_check.py is the version gate:

value-steer-compat            # static contract checks (needs only `import vllm`)
value-steer-compat abstain    # abstention subset

Run it on every vLLM bump. Static checks fail loudly if a bound internal moved; behavioral checks (GPU) assert that the feature actually fires. The behavioral checks are necessary because the runner hooks swallow errors in production, so "it ran" is not "it worked." For each version, run the static checks first (no GPU; they pinpoint the broken contract) and the GPU behavioral checks only if static is green.
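
The idea behind a static contract check can be sketched in a few lines: resolve each (module, attribute path) pair the runners depend on and report the ones that no longer exist. The helper below is a hypothetical illustration, not compat_check.py itself, and the example uses standard-library modules rather than guessing at which vLLM internals are bound.

```python
import importlib
from typing import List, Tuple


def missing_contracts(contracts: List[Tuple[str, str]]) -> List[str]:
    """Return "module:attr.path" for every binding that fails to resolve."""
    missing: List[str] = []
    for module_name, attr_path in contracts:
        try:
            obj = importlib.import_module(module_name)
            for part in attr_path.split("."):
                obj = getattr(obj, part)  # walk nested attributes, e.g. Class.method
        except (ImportError, AttributeError):
            missing.append(f"{module_name}:{attr_path}")
    return missing


# Demo on stdlib names; a real gate would list the bound vLLM internals instead.
report = missing_contracts([("json", "dumps"), ("json", "no_such_name")])
```

A check like this needs only an importable package, which matches the "no GPU" property of the static stage; it cannot tell whether a resolved internal still behaves the same, which is what the behavioral stage is for.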

Tests

pytest -q          # pure-logic suite (no GPU, no vLLM): ops, calibration, training, allocator

Status

Component | State
value head, steering ops, calibration, training | complete, CPU-tested
abstention runner | complete vs pinned APIs; EOS-fires check is the GPU behavioral test
VFD runner | complete and GPU-validated (A100, vLLM 0.19.1, Mistral-7B): single-forward K-candidate decode, end-to-end safer outputs under a Llama-3.1 judge; no silent gaps remain
--worker-cls entry point, packaging, compat harness, version registry | complete

The VFD candidate forward goes through _model_forward + the attention-metadata builder (standard paged decode); the KV cache-write is backend-specific and requires FlashAttention v2's KV layout (compute capability ≥ 8.0). GPU behavioral tests live in tests/test_gpu_behavioral.py (marked gpu, skipped without CUDA) and assert the features fire, not merely run.

Limitations

  • vLLM pin. Bound to the behaviorally-validated span >=0.19.1,<0.20, because the runners bind to vLLM internals that shift across minor versions. The registry in value_steer/validated_versions.json is authoritative and warns at runtime for untested in-range versions; widen the pin only after value-steer-compat passes on a GPU box.
  • Serving default is eager. The VFD CUDA-graph/compile path is single-stream only (it corrupts concurrent requests under cudagraphs); enforce_eager=True is correct for all batch sizes and is the serving default. The compile speedup is an explicit opt-in (vfd.single_stream=True + one request at a time) for offline/benchmark use.
  • VFD threshold. The head steers around threshold 0.3; the conformal posterior_threshold in a head's sidecar is conservative (bounds false interventions) and can sit higher — start at 0.3 and tune. See docs/training-a-value-head.md.
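
The pin in the first limitation can be checked before launching a server. The sketch below tests whether a version string falls inside >=0.19.1,<0.20 using plain tuple comparison; it is a hypothetical helper and does not read validated_versions.json, whose schema is not described here.

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release segment, e.g. "0.19.1.post1" -> (0, 19, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_validated_span(
    version: str,
    lower: Tuple[int, ...] = (0, 19, 1),
    upper: Tuple[int, ...] = (0, 20),
) -> bool:
    """True if lower <= version < upper, comparing release tuples."""
    release = parse_release(version)
    return lower <= release < upper
```

This ignores pre-release ordering (a release candidate of 0.19.1 would be treated as 0.19.1); a production check would more likely use a full PEP 440 parser such as packaging.version.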

Citation

If you use value-steer, please cite the software and the two papers it implements; see CITATION.cff.
