Hallucination neuron discovery and causal validation for transformer LLMs
hprobes
Discover and causally validate hallucination-associated FFN neurons (H-Neurons) in transformer LLMs.
Based on arXiv:2512.01797.
Install
pip install hprobes
# or
uv add hprobes
Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer
from hprobes import HProbe
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# samples: list of dicts, each holding a question, its answer options
# (under "choices" here), and the correct answer (under "answer")
probe = HProbe(model, tokenizer)
probe.fit(samples, options_key="choices", answer_key="answer")
print(probe.n_neurons_, probe.layer_distribution_)
results = probe.score()
print(f"AUROC {results['auroc']:.3f} gap {results['auroc_gap']:+.3f}")
probe.causal_validate()
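As background for the AUROC printed above, here is a minimal, library-independent sketch of what the metric measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The label convention (1 = hallucinated) and the toy scores are assumptions for illustration; this is not how hprobes computes its results, and it says nothing about how `auroc_gap` is defined.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed): 1 = hallucinated response, 0 = faithful response.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.6, 0.2]
print(f"AUROC {auroc(labels, scores):.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the quickstart output should be read.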
CLI
# Fit and score on an MCQ dataset
hprobes run --model google/gemma-3-4b-it --data dataset.jsonl --samples 500
# Transfer: score a saved probe on a different model
hprobes transfer --probe results/probe --model google/gemma-3-4b --data dataset.jsonl
# Fit from pre-generated responses with judge labels
hprobes responses --model google/gemma-3-4b-it --data responses.jsonl
Supported formats
Input files: .jsonl, .json, .parquet
Auto-detected dataset formats: mmlu, medqa, medmcqa. For any other format, pass options_key and answer_key explicitly.
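For a dataset that is not auto-detected, the records might be prepared as in the sketch below. The field names `opts` and `label`, the index-based answer encoding, and the helper `load_jsonl` are all hypothetical; check which answer encoding hprobes expects before relying on this layout.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line (hypothetical helper, stdlib only)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical records: "opts" and "label" are made-up key names,
# and "label" is assumed to be an index into "opts".
rows = [
    {"question": "Which vitamin deficiency causes scurvy?",
     "opts": ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
     "label": 1},
    {"question": "Which organ produces insulin?",
     "opts": ["Liver", "Kidney", "Pancreas", "Spleen"],
     "label": 2},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "custom.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    samples = load_jsonl(path)

print(len(samples), samples[0]["opts"][samples[0]["label"]])
```

With records shaped like this, the key names would be passed through the documented parameters, for example `probe.fit(samples, options_key="opts", answer_key="label")`.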
Key options
| Parameter | Default | Description |
|---|---|---|
| l1_C | 0.01 | Inverse L1 strength (lower = fewer neurons) |
| contrastive | True | 3-vs-1 labeling at the generated answer token |
| layer_stride | 1 | Sample every Nth layer (2 = faster) |
| validation_split | 0.2 | Holdout fraction for scoring |
| max_tokens | 1024 | Truncation length |
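To see why a lower l1_C leaves fewer neurons, the sketch below applies soft-thresholding, the shrinkage step that an L1 penalty induces, to a handful of made-up weights. It assumes a scikit-learn-style parameterization in which the effective penalty grows like 1/C; the weights, the `STEP` scale factor, and both helper functions are hypothetical, and this is not the solver hprobes uses.

```python
def soft_threshold(weights, lam):
    """Shrink each weight toward zero by lam; weights within [-lam, lam] become 0."""
    out = []
    for w in weights:
        if w > lam:
            out.append(w - lam)
        elif w < -lam:
            out.append(w + lam)
        else:
            out.append(0.0)
    return out


def count_nonzero(weights):
    """Number of weights that survived shrinkage."""
    return sum(1 for w in weights if w != 0.0)


# Hypothetical per-neuron weights before the penalty is applied.
raw = [0.9, -0.4, 0.05, -0.02, 0.3]
STEP = 0.001  # hypothetical scale linking C to the shrinkage amount

for C in (0.01, 0.002):
    lam = STEP / C  # smaller C -> larger shrinkage
    kept = count_nonzero(soft_threshold(raw, lam))
    print(f"C={C}: {kept} of {len(raw)} weights remain non-zero")
```

Only weights whose magnitude exceeds the shrinkage amount survive, so tightening C prunes the weakest neurons first.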
Save & load
probe.save("results/gemma_medqa") # writes .json + .pkl
probe = HProbe.load("results/gemma_medqa", model, tokenizer)
probe.score_on(new_samples, options_key="choices", answer_key="answer")
Acknowledgements
This research is conducted in collaboration with the Great Ormond Street Hospital DRIVE Unit.
Contributors
- Huseyin Cavus — Core Contributor
- Dr. Pavithra Rajendran — Machine Learning Lead, GOSH DRIVE
- Sebin Sabu — Senior AI Scientist, GOSH DRIVE
- Jaskaran Singh Kawatra — ML Engineer, GOSH DRIVE