Inference-time value steering for vLLM: dynamic abstention + value-filtered decoding.

Maintainers

HenDav

Project description

value-steer

Inference-time value steering for vLLM: two decode-time interventions driven by a shared scalar value head.

- Dynamic abstention (Knowing When to Quit, ICML 2026): gate generation to EOS when the value crosses a calibrated threshold.
- Value-filtered decoding (Selective Safety Steering via Value-Filtered Decoding): at each step, sample K candidates and commit one according to its safety value, keeping the natural sample when it is already safe.

Both score the same feature (the backbone's final post-norm hidden state, the tensor lm_head consumes) with the same head, so one trained probe serves either mode.
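
The two decision rules can be sketched in a few lines of pure Python. This is a minimal illustration, not the package's API: the function names are hypothetical, and it assumes that a higher value means "less safe", that abstention fires when the value is at or above the threshold, that candidate 0 is the natural sample, and that the fallback picks the lowest-value candidate.

```python
from typing import Sequence


def abstain_gate(value: float, threshold: float, next_token: int, eos_token: int) -> int:
    """Return EOS once the value reaches the threshold, else the sampled token.

    Assumption: higher value = stronger signal to stop.
    """
    return eos_token if value >= threshold else next_token


def vfd_select(candidates: Sequence[int], values: Sequence[float], threshold: float) -> int:
    """Commit one of K candidate tokens; candidates[0] is the natural sample.

    Assumption: if the natural sample is already below the threshold it is kept,
    otherwise the candidate with the lowest value is committed.
    """
    if not candidates or len(candidates) != len(values):
        raise ValueError("need at least one candidate and one value per candidate")
    if values[0] < threshold:
        return candidates[0]
    best = min(range(len(candidates)), key=lambda i: values[i])
    return candidates[best]


# Natural sample is safe: keep it even though another candidate scores lower.
print(vfd_select([10, 11, 12], [0.10, 0.05, 0.90], threshold=0.3))
# Natural sample is unsafe: fall back to the lowest-value candidate.
print(vfd_select([10, 11, 12], [0.60, 0.20, 0.90], threshold=0.3))
```

Keeping the natural sample whenever it passes is what makes the steering selective: the output distribution is only altered on steps where the head flags a problem.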

Install

pip install value-steer              # core (torch, numpy) — pure modules + training/calibration
pip install "value-steer[vllm]"      # + the vLLM runtime (serving / decoding)
pip install "value-steer[train]"     # + probe training (transformers)
pip install "value-steer[dev]"       # + pytest, ruff, build, twine

vLLM is an optional dependency, pinned to the behaviorally validated span (>=0.19.1,<0.20). Install a build that matches your CUDA driver, and run value-steer-compat before widening the pin (see Compatibility). The pure modules (value head, steering ops, calibration, probe training) import without vLLM, so machines used only for training or calibration need just the core install.

A pre-trained safety value head (Mistral-7B-Instruct-v0.3 backbone, hh-rlhf labels via a Llama-3.1 judge) is published at HenDav/value-steer-safety-head — see its model card for the feature contract and a ready-to-use config snippet.

Use

Both modes plug in via vLLM's supported --worker-cls surface — no monkeypatching.

Abstention:

vllm serve <model> \
  --worker-cls value_steer.worker.ValueSteerWorker \
  --additional-config '{"abstain": {"value_head_path": "vhead.pt", "threshold": 0.5}}'

Value-filtered decoding (run with speculative decoding OFF — VFD owns the decode forward):

vllm serve <model> \
  --worker-cls value_steer.worker.ValueSteerWorker \
  --additional-config '{"vfd": {"value_head_path": "vhead.pt", "threshold": 0.3, "num_candidates": 8}}'

Per-request override via SamplingParams.extra_args (abstain_threshold / vfd_threshold).
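
A per-request override might look like the sketch below. The two key names come from the line above; the helper `steering_overrides` is hypothetical, the threshold value is illustrative, and the vLLM import is guarded so the snippet also runs on a core-only install.

```python
from typing import Dict, Optional


def steering_overrides(
    abstain_threshold: Optional[float] = None,
    vfd_threshold: Optional[float] = None,
) -> Dict[str, float]:
    """Build the extra_args dict, including only the thresholds that were set."""
    overrides: Dict[str, float] = {}
    if abstain_threshold is not None:
        overrides["abstain_threshold"] = float(abstain_threshold)
    if vfd_threshold is not None:
        overrides["vfd_threshold"] = float(vfd_threshold)
    return overrides


overrides = steering_overrides(vfd_threshold=0.25)

try:
    from vllm import SamplingParams  # present only with the [vllm] extra
except ImportError:
    SamplingParams = None

if SamplingParams is not None:
    # extra_args is passed through to the worker alongside the usual sampling settings.
    params = SamplingParams(max_tokens=128, extra_args=overrides)
```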

Train a probe

from value_steer.train_probe import train_probe, save_probe_checkpoint
from value_steer.value_probe import ValueHead

# hidden_size, backbone, train_loader, and calibrated_c are placeholders you supply.
head = ValueHead(hidden_size)            # shared head; fp32 on the post-norm feature
train_probe(backbone, head, train_loader, loss_name="focal", use_td=True, coh_weight=0.1)
save_probe_checkpoint("vhead.pt", head, threshold=calibrated_c, meta={"loss": "focal"})

save_probe_checkpoint writes the bare head weights to vhead.pt (loaded by the runners) plus a vhead.pt.meta.json sidecar with the feature spec, calibrated threshold, and metadata. The objective is label-agnostic — your labels define whether the value means P(unsafe) or P(should-quit).

Calibrate the threshold

from value_steer.calibration import posterior_threshold   # VFD / posterior filter
from value_steer.calibration import martingale_threshold  # time-to-unsafe martingale

c = posterior_threshold(safe_labels, trajectories, tau=0.05)

Given held-out (label, per-step value trajectory) pairs, this returns the threshold c with a finite-sample bound on false interventions: P_H0(max_t p_t ≥ c) ≤ tau, where p_t is the value at step t and H0 is the null case in which no intervention is warranted. The threshold therefore carries a statistical guarantee rather than being a hand-tuned number.
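
One standard way to obtain a bound of this shape is a split-conformal quantile over the running maximum of each safe trajectory. The sketch below illustrates that general technique only; it is not claimed to be what posterior_threshold does internally, the function name is hypothetical, and the guarantee assumes exchangeable trajectories with ties having probability zero.

```python
import math
from typing import Sequence


def conformal_max_threshold(safe_trajectories: Sequence[Sequence[float]], tau: float) -> float:
    """Split-conformal threshold on max_t p_t over held-out safe trajectories.

    Returns the k-th smallest trajectory maximum with k = ceil((n + 1) * (1 - tau)).
    A fresh safe trajectory then exceeds the result with probability at most tau.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    maxima = sorted(max(traj) for traj in safe_trajectories)
    n = len(maxima)
    k = math.ceil((n + 1) * (1.0 - tau))  # 1-indexed rank of the quantile
    if k > n:
        # Too few calibration trajectories to certify this tau: never intervene.
        return math.inf
    return maxima[k - 1]


safe = [[0.1, 0.2], [0.05, 0.3], [0.2, 0.1], [0.4, 0.15]]
print(conformal_max_threshold(safe, tau=0.25))
```

With only four calibration trajectories, a tau of 0.05 cannot be certified and the sketch returns infinity, which shows why the size of the held-out set limits how small tau can usefully be.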

Compatibility

The runners bind to a few vLLM internals. compat_check.py is the version gate:

value-steer-compat            # static contract checks (needs only `import vllm`)
value-steer-compat abstain    # abstention subset

Run it on every vLLM bump. Static checks fail loudly if a bound internal moved. Behavioral checks (GPU) assert that the feature actually fires; they are necessary because the runner hooks swallow errors in production, so "it ran" is not "it worked." For each version, run the static checks first (no GPU needed; they pinpoint the broken contract) and the GPU behavioral checks only once static is green.
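
The static-then-behavioral ordering could be scripted roughly as follows. This is a sketch under assumptions: `compat_gate` is a hypothetical helper, value-steer-compat is assumed to exit non-zero on failure, and the pytest invocation is built from the `gpu` marker and the tests/test_gpu_behavioral.py path mentioned under Status.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def compat_gate(has_gpu: bool, run: Runner = _run) -> str:
    """Run static checks first; run GPU behavioral tests only if static is green."""
    if run(["value-steer-compat"]) != 0:
        return "static-failed"  # a bound internal moved; skip the GPU run
    if not has_gpu:
        return "static-only"  # contracts hold, but firing is still unverified
    gpu_cmd = ["pytest", "-q", "-m", "gpu", "tests/test_gpu_behavioral.py"]
    return "validated" if run(gpu_cmd) == 0 else "behavioral-failed"
```

Returning "static-only" rather than a plain success keeps the distinction between "it ran" and "it worked" visible to whoever reads the result.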

Tests

pytest -q          # pure-logic suite (no GPU, no vLLM): ops, calibration, training, allocator

Status

Component	State
value head, steering ops, calibration, training	complete, CPU-tested
abstention runner	complete vs pinned APIs; EOS-fires check is the GPU behavioral test
VFD runner	complete and GPU-validated (A100, vLLM 0.19.1, Mistral-7B): single-forward K-candidate decode, end-to-end safer outputs under a Llama-3.1 judge; no silent gaps remain
`--worker-cls` entry point, packaging, compat harness, version registry	complete

The VFD candidate forward goes through _model_forward + the attention-metadata builder (standard paged decode); the KV cache-write is backend-specific and requires FlashAttention v2's KV layout (compute capability ≥ 8.0). GPU behavioral tests live in tests/test_gpu_behavioral.py (marked gpu, skipped without CUDA) and assert the features fire, not merely run.
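
A quick pre-flight check for the compute-capability requirement might look like this. The helper names are hypothetical; the only external call is torch.cuda.get_device_capability, and the torch import is guarded so the snippet also runs on machines without it.

```python
from typing import Optional, Tuple

MIN_CAPABILITY = (8, 0)  # the requirement stated above


def meets_min_capability(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is at least 8.0."""
    return tuple(capability) >= MIN_CAPABILITY


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the current device's capability, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


cap = current_capability()
if cap is None:
    print("no CUDA device visible; VFD decoding cannot run here")
else:
    print(f"compute capability {cap}: ok={meets_min_capability(cap)}")
```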

Limitations

- vLLM pin. Bound to the behaviorally validated span >=0.19.1,<0.20, because the runners bind to vLLM internals that shift across minor versions. The registry in value_steer/validated_versions.json is authoritative and warns at runtime for untested in-range versions. Widen the pin only after value-steer-compat passes on a GPU box.
- Serving default is eager. The VFD CUDA-graph/compile path is single-stream only (it corrupts concurrent requests under CUDA graphs); enforce_eager=True is correct for all batch sizes and is the serving default. The compile speedup is an explicit opt-in (vfd.single_stream=True plus one request at a time) for offline or benchmark use.
- VFD threshold. The head steers at a threshold of around 0.3. The conformal posterior_threshold stored in a head's sidecar is conservative (it bounds false interventions) and can sit higher, so start at 0.3 and tune. See docs/training-a-value-head.md.
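
To see whether an installed vLLM falls inside the pinned span before serving, a small standalone check like the one below can help. It is a simplified sketch with hypothetical helper names: it compares only the leading X.Y.Z release numbers (so pre-release and local suffixes are ignored, unlike full PEP 440 matching), and it does not consult value_steer/validated_versions.json.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

LOWER = (0, 19, 1)  # inclusive lower bound of the pin
UPPER = (0, 20, 0)  # exclusive upper bound of the pin


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading 'X.Y.Z' release segment; missing parts count as 0."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def in_pinned_span(version: str) -> bool:
    """True if the release numbers satisfy >=0.19.1,<0.20."""
    return LOWER <= release_tuple(version) < UPPER


def installed_vllm_version() -> Optional[str]:
    """Return the installed vLLM version string, or None if vLLM is absent."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


found = installed_vllm_version()
if found is None:
    print("vLLM not installed (core-only environment)")
else:
    print(f"vLLM {found}: in pinned span = {in_pinned_span(found)}")
```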

Citation

If you use value-steer, please cite the software and the two papers it implements; see CITATION.cff.

Release history

This version

0.1.0

Jun 25, 2026

Download files

Source Distribution

value_steer-0.1.0.tar.gz (71.4 kB view details)

Uploaded Jun 25, 2026 Source

Built Distribution

value_steer-0.1.0-py3-none-any.whl (55.7 kB view details)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file value_steer-0.1.0.tar.gz.

File metadata

Download URL: value_steer-0.1.0.tar.gz
Upload date: Jun 25, 2026
Size: 71.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for value_steer-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`69217ef296c9d41b6eba1c4954c1f532d77512edc2b57e6a8b64321d9089370b`
MD5	`c6a598a42182641c9466102b033daff7`
BLAKE2b-256	`dbdc2931af2a5eb4432b2b763abbe94a0af5989349901166f4473498967470e8`

Provenance

The following attestation bundles were made for value_steer-0.1.0.tar.gz:

Publisher: release.yml on HenDav/value-steering

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: value_steer-0.1.0.tar.gz
- Subject digest: 69217ef296c9d41b6eba1c4954c1f532d77512edc2b57e6a8b64321d9089370b
- Sigstore transparency entry: 1951578469
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: HenDav/value-steering@0b023a7f0e6d914c36a13f60f2435c750401ac39
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/HenDav
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0b023a7f0e6d914c36a13f60f2435c750401ac39
- Trigger Event: push

File details

Details for the file value_steer-0.1.0-py3-none-any.whl.

File metadata

Download URL: value_steer-0.1.0-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 55.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for value_steer-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1266a2a2e0254c1e83b01ad2a6875e3230b66c473fc3e976de2b8991c9fa97db`
MD5	`867b7bc667e9b45308be3a36f714aa9a`
BLAKE2b-256	`8ccff6146fe6c2f253a8cb8dab315f0d76f65694f5661b1fb1b0071c33722271`

Provenance

The following attestation bundles were made for value_steer-0.1.0-py3-none-any.whl:

Publisher: release.yml on HenDav/value-steering

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: value_steer-0.1.0-py3-none-any.whl
- Subject digest: 1266a2a2e0254c1e83b01ad2a6875e3230b66c473fc3e976de2b8991c9fa97db
- Sigstore transparency entry: 1951578667
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: HenDav/value-steering@0b023a7f0e6d914c36a13f60f2435c750401ac39
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/HenDav
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0b023a7f0e6d914c36a13f60f2435c750401ac39
- Trigger Event: push
