FlashTrace
Fast token attribution for reasoning language models.
FlashTrace traces generated answers back to the prompt tokens that shaped them. Use it from Python or the command line, export JSON traces, and render standalone HTML heatmaps for inspection and sharing.
Paper | Quickstart | CLI | Citation
Why FlashTrace
Reasoning models produce long generated chains whose final answers and intermediate spans deserve targeted inspection. FlashTrace gives researchers a package-first workflow for tracing a selected generated span back to its supporting prompt tokens.
You get:
- top-k prompt tokens ranked by attribution score
- JSON traces for downstream analysis
- standalone HTML token heatmaps
- optional per-hop attribution panels
- inclusive generation-token span controls for answer and reasoning segments
Install
From a local checkout:
pip install -e .
For development:
pip install -e ".[dev]"
FlashTrace uses PyTorch, Transformers, Accelerate, NumPy, and tqdm. A CUDA-capable GPU is recommended for full-size public Hugging Face models.
Quickstart
from flashtrace import FlashTrace, load_model_and_tokenizer
prompt = """Context: Paris is the capital of France.
Question: What is the capital of France?"""
target = "Paris"
model, tokenizer = load_model_and_tokenizer("Qwen/Qwen3-8B", device_map="auto")
tracer = FlashTrace(model, tokenizer, chunk_tokens=128, sink_chunk_tokens=32)
trace = tracer.trace(
prompt=prompt,
target=target,
output_span=(0, 0),
hops=1,
)
print(trace.topk_inputs(10))
trace.to_json("trace.json")
trace.to_html("trace.html")
trace.topk_inputs(10) returns TokenScore objects aligned to prompt-token indices:
rank index token score
1 2 Paris 0.184
2 7 capital 0.131
3 10 France 0.119
trace.html is a standalone heatmap that highlights prompt tokens by final attribution score and includes trace metadata for the selected generated span.
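Exported JSON traces can be consumed without FlashTrace installed. A minimal sketch, assuming the trace stores `prompt_tokens` and `scores` as parallel arrays (field names taken from the TraceResult attributes documented under Python API; verify against a real trace.json before relying on them):

```python
import json

# Toy trace with the assumed schema: parallel arrays of prompt
# tokens and their final attribution scores.
sample = {
    "prompt_tokens": ["Context", ":", " Paris", " is", " the", " capital"],
    "scores": [0.01, 0.00, 0.42, 0.05, 0.02, 0.31],
}

# Round-trip through JSON, as a downstream analysis script would.
trace = json.loads(json.dumps(sample))

# Rank prompt-token indices by attribution score, descending.
ranked = sorted(range(len(trace["scores"])),
                key=lambda i: trace["scores"][i], reverse=True)
top3 = [(i, trace["prompt_tokens"][i], trace["scores"][i]) for i in ranked[:3]]
print(top3)  # [(2, ' Paris', 0.42), (5, ' capital', 0.31), (3, ' is', 0.05)]
```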
Command Line
Create prompt and target files:
printf "Context: Paris is the capital of France.\nQuestion: What is the capital of France?\n" > prompt.txt
printf "Paris" > target.txt
Run a trace:
flashtrace trace \
--model Qwen/Qwen3-8B \
--prompt prompt.txt \
--target target.txt \
--output-span 0:0 \
--hops 1 \
--html trace.html \
--json trace.json
The command prints a compact top-k table and writes the requested artifacts.
Useful flags:
- --model: Hugging Face model id or local model path
- --target: UTF-8 target text file
- --output-span: inclusive START:END indices over generated tokens
- --reasoning-span: inclusive START:END indices for a reasoning segment
- --method: flashtrace, ifr-span, or ifr-matrix
- --recompute-attention: lower-memory attention recomputation path
- --device-map: Transformers device map, default auto
- --dtype: auto, float16, bfloat16, or float32
Token Spans
output_span and reasoning_span use inclusive generation-token indices. The first generated token has index 0.
Use an initial trace to inspect tokenization:
for index, token in enumerate(trace.generation_tokens):
print(index, repr(token))
Then choose spans:
trace = tracer.trace(
prompt=prompt,
target=target,
reasoning_span=(0, 79),
output_span=(80, 85),
hops=1,
)
Scores are aligned to trace.prompt_tokens. trace.per_hop_scores stores the same prompt-token alignment for each hop.
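Inclusive spans differ from Python's half-open slicing, so (80, 85) covers six generated tokens, not five. A small illustrative helper (not part of the package) makes the convention explicit:

```python
def span_tokens(tokens, span):
    """Return the tokens covered by an inclusive (start, end) span."""
    start, end = span
    # +1 converts the inclusive end index to Python's exclusive slice bound.
    return tokens[start:end + 1]

generated = ["The", " capital", " is", " Paris", "."]
print(span_tokens(generated, (3, 4)))           # [' Paris', '.']
print(len(span_tokens(list(range(100)), (80, 85))))  # 6
```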
Interpreting Results
High-scoring prompt tokens are the tokens FlashTrace attributes most strongly to the selected generated span. For answer inspection, use output_span around the final answer tokens. For chain-of-thought or reasoning inspection, use reasoning_span around the generated reasoning segment.
Recommended workflow:
- Run a trace with your prompt and target.
- Inspect trace.generation_tokens.
- Select the answer or reasoning span.
- Export trace.html.
- Compare top-k tokens with the source prompt and any expected evidence.
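The last step, comparing top-k tokens against expected evidence, can be automated. A sketch assuming scores come as a list aligned to prompt tokens; the helper evidence_in_topk is illustrative, not a FlashTrace API:

```python
def evidence_in_topk(prompt_tokens, scores, evidence, k=5):
    """Report whether each expected evidence word appears among the top-k prompt tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Strip leading spaces that BPE-style tokenizers attach to tokens.
    top = {prompt_tokens[i].strip() for i in ranked[:k]}
    return {word: word in top for word in evidence}

tokens = ["Context", ":", " Paris", " is", " the", " capital", " of", " France", "."]
scores = [0.01, 0.00, 0.18, 0.02, 0.01, 0.13, 0.02, 0.12, 0.00]
print(evidence_in_topk(tokens, scores, ["Paris", "France", "Berlin"], k=3))
# {'Paris': True, 'France': True, 'Berlin': False}
```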
Supported Models
FlashTrace targets Llama/Qwen-style decoder-only Hugging Face causal LMs with:
- model.layers
- Q/K/V/O attention projections
- RMSNorm or LayerNorm
- RoPE metadata
Validated model families for the first public release:
- Qwen2
- Qwen3
- Llama
Python API
The public package exports:
from flashtrace import FlashTrace, TraceResult, load_model_and_tokenizer
FlashTrace.trace(...) accepts:
- prompt: str
- target: str | None
- output_span: tuple[int, int] | None
- reasoning_span: tuple[int, int] | None
- hops: int
- method: "flashtrace" | "ifr-span" | "ifr-matrix"
- renorm_threshold: float | None
TraceResult includes:
- prompt_tokens
- generation_tokens
- scores
- per_hop_scores
- thinking_ratios
- output_span
- reasoning_span
- method
- metadata
Export helpers:
trace.topk_inputs(20)
trace.to_dict()
trace.to_json("trace.json")
trace.to_html("trace.html")
Examples
python examples/quickstart.py --help
python examples/quickstart.py \
--model Qwen/Qwen3-8B \
--prompt prompt.txt \
--target target.txt \
--html trace.html
Heavy model examples are intended for GPU environments. CPU smoke tests use tiny randomly initialized models.
Repository Map
- flashtrace/: reusable Python package
- examples/: public quickstarts
- tests/: CPU smoke tests
- exp/: paper experiments and research artifacts
- docs/superpowers/: design and implementation planning documents
Research Experiments
The exp/ directory contains the paper-era experiment runners, case studies, and saved artifacts. The public package API lives in flashtrace/; experiment scripts keep compatibility imports during the package migration.
Troubleshooting
CUDA memory
Use smaller models, lower precision, device_map="auto", shorter prompts, or --recompute-attention.
Span selection
Print trace.generation_tokens and select inclusive generated-token indices. Tokenization can split visible words into multiple model tokens.
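When a visible word is split across model tokens, its inclusive token span can be recovered from character offsets. A sketch under the assumption that concatenating the tokens reproduces the visible text (true for typical BPE tokenizers); the helper is illustrative, not part of the package:

```python
def word_to_token_span(tokens, word):
    """Map a visible word to the inclusive (start, end) token span that covers it."""
    text = "".join(tokens)
    pos = text.find(word)
    if pos < 0:
        return None
    # Character offset range of each token within the concatenated text.
    offsets, cursor = [], 0
    for tok in tokens:
        offsets.append((cursor, cursor + len(tok)))
        cursor += len(tok)
    start = next(i for i, (a, b) in enumerate(offsets) if a <= pos < b)
    end = next(i for i, (a, b) in enumerate(offsets) if a < pos + len(word) <= b)
    return (start, end)

# "Paris" is split into " Par" + "is" by this toy tokenization.
tokens = ["The", " answer", " is", " Par", "is", "."]
print(word_to_token_span(tokens, "Paris"))  # (3, 4)
```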
Deterministic generation
Pass a target file for attribution against a known output. Leave --target out when you want the CLI to generate with deterministic defaults.
Tokenizer alignment
Inspect trace.prompt_tokens and trace.generation_tokens when scores appear shifted from visible text. Attribution scores follow tokenizer-level alignment.
HTML export
trace.to_html("trace.html") writes a standalone file that can be opened locally or shared as an artifact.
Paper
FlashTrace implements the method described in Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs.
Citation
@misc{pan2026flashtrace,
title={Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs},
author={Pan, Wenbo and Liu, Zhichao and Wang, Xianlong and Yu, Haining and Jia, Xiaohua},
year={2026},
eprint={2602.01914},
archivePrefix={arXiv},
primaryClass={cs.LG}
}