FlashTrace
Fast token attribution for reasoning language models.
FlashTrace traces generated answers back to the prompt tokens that shaped them. Use it from Python or the command line, export JSON traces, and render standalone HTML heatmaps for inspection and sharing.
Paper | Quickstart | CLI | Citation
Why FlashTrace
Reasoning models produce long generated chains whose final answers and intermediate spans deserve targeted inspection. FlashTrace gives researchers a package-first workflow for tracing a selected generated span back to its supporting prompt tokens.
You get:
- top-k prompt tokens ranked by attribution score
- JSON traces for downstream analysis
- standalone HTML token heatmaps
- optional per-hop attribution panels
- inclusive generation-token span controls for answer and reasoning segments
Install
From a local checkout:
pip install -e .
For development:
pip install -e ".[dev]"
FlashTrace uses PyTorch, Transformers, Accelerate, NumPy, and tqdm. A CUDA-capable GPU is recommended for full-size public Hugging Face models.
Quickstart
from flashtrace import FlashTrace, load_model_and_tokenizer
prompt = """Context: Paris is the capital of France.
Question: What is the capital of France?"""
target = "Paris"
model, tokenizer = load_model_and_tokenizer("Qwen/Qwen3-8B", device_map="auto")
tracer = FlashTrace(model, tokenizer, chunk_tokens=128, sink_chunk_tokens=32)
trace = tracer.trace(
prompt=prompt,
target=target,
output_span=(0, 0),
hops=1,
)
print(trace.topk_inputs(10))
trace.to_json("trace.json")
trace.to_html("trace.html")
trace.topk_inputs(10) returns TokenScore objects aligned to prompt-token indices:
rank index token score
1 2 Paris 0.184
2 7 capital 0.131
3 10 France 0.119
trace.html is a standalone heatmap that highlights prompt tokens by final attribution score and includes trace metadata for the selected generated span.
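Exported JSON traces can be consumed without FlashTrace installed. A minimal sketch, assuming the trace stores `prompt_tokens` and `scores` as parallel arrays (field names taken from the TraceResult attributes documented under Python API; verify against a real trace.json before relying on them):

```python
import json

# Toy trace with the assumed schema: parallel arrays of prompt
# tokens and their final attribution scores.
sample = {
    "prompt_tokens": ["Context", ":", " Paris", " is", " the", " capital"],
    "scores": [0.01, 0.00, 0.42, 0.05, 0.02, 0.31],
}

# Round-trip through JSON, as a downstream analysis script would.
trace = json.loads(json.dumps(sample))

# Rank prompt-token indices by attribution score, descending.
ranked = sorted(range(len(trace["scores"])),
                key=lambda i: trace["scores"][i], reverse=True)
top3 = [(i, trace["prompt_tokens"][i], trace["scores"][i]) for i in ranked[:3]]
print(top3)  # [(2, ' Paris', 0.42), (5, ' capital', 0.31), (3, ' is', 0.05)]
```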
Command Line
Create prompt and target files:
printf "Context: Paris is the capital of France.\nQuestion: What is the capital of France?\n" > prompt.txt
printf "Paris" > target.txt
Run a trace:
flashtrace trace \
--model Qwen/Qwen3-8B \
--prompt prompt.txt \
--target target.txt \
--output-span 0:0 \
--hops 1 \
--html trace.html \
--json trace.json
The command prints a compact top-k table and writes the requested artifacts.
Useful flags:
- --model: Hugging Face model id or local model path
- --target: UTF-8 target text file
- --output-span: inclusive START:END indices over generated tokens
- --reasoning-span: inclusive START:END indices for a reasoning segment
- --method: flashtrace, ifr-span, or ifr-matrix
- --recompute-attention: lower-memory attention recomputation path
- --device-map: Transformers device map, default auto
- --dtype: auto, float16, bfloat16, or float32
Token Spans
output_span and reasoning_span use inclusive generation-token indices. The first generated token has index 0.
Use an initial trace to inspect tokenization:
for index, token in enumerate(trace.generation_tokens):
print(index, repr(token))
Then choose spans:
trace = tracer.trace(
prompt=prompt,
target=target,
reasoning_span=(0, 79),
output_span=(80, 85),
hops=1,
)
Scores are aligned to trace.prompt_tokens. trace.per_hop_scores stores the same prompt-token alignment for each hop.
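Inclusive spans differ from Python's half-open slicing, so (80, 85) covers six generated tokens, not five. A small illustrative helper (not part of the package) makes the convention explicit:

```python
def span_tokens(tokens, span):
    """Return the tokens covered by an inclusive (start, end) span."""
    start, end = span
    # +1 converts the inclusive end index to Python's exclusive slice bound.
    return tokens[start:end + 1]

generated = ["The", " capital", " is", " Paris", "."]
print(span_tokens(generated, (3, 4)))           # [' Paris', '.']
print(len(span_tokens(list(range(100)), (80, 85))))  # 6
```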
Interpreting Results
High-scoring prompt tokens are the tokens FlashTrace attributes most strongly to the selected generated span. For answer inspection, use output_span around the final answer tokens. For chain-of-thought or reasoning inspection, use reasoning_span around the generated reasoning segment.
Recommended workflow:
- Run a trace with your prompt and target.
- Inspect trace.generation_tokens.
- Select the answer or reasoning span.
- Export trace.html.
- Compare top-k tokens with the source prompt and any expected evidence.
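The last step, comparing top-k tokens against expected evidence, can be automated. A sketch assuming scores come as a list aligned to prompt tokens; the helper evidence_in_topk is illustrative, not a FlashTrace API:

```python
def evidence_in_topk(prompt_tokens, scores, evidence, k=5):
    """Report whether each expected evidence word appears among the top-k prompt tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Strip leading spaces that BPE-style tokenizers attach to tokens.
    top = {prompt_tokens[i].strip() for i in ranked[:k]}
    return {word: word in top for word in evidence}

tokens = ["Context", ":", " Paris", " is", " the", " capital", " of", " France", "."]
scores = [0.01, 0.00, 0.18, 0.02, 0.01, 0.13, 0.02, 0.12, 0.00]
print(evidence_in_topk(tokens, scores, ["Paris", "France", "Berlin"], k=3))
# {'Paris': True, 'France': True, 'Berlin': False}
```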
Supported Models
FlashTrace targets Llama/Qwen-style decoder-only Hugging Face causal LMs with:
- model.layers
- Q/K/V/O attention projections
- RMSNorm or LayerNorm
- RoPE metadata
Validated model families for the first public release:
- Qwen2
- Qwen3
- Llama
Python API
The public package exports:
from flashtrace import FlashTrace, TraceResult, load_model_and_tokenizer
FlashTrace.trace(...) accepts:
- prompt: str
- target: str | None
- output_span: tuple[int, int] | None
- reasoning_span: tuple[int, int] | None
- hops: int
- method: "flashtrace" | "ifr-span" | "ifr-matrix"
- renorm_threshold: float | None
TraceResult includes:
- prompt_tokens
- generation_tokens
- scores
- per_hop_scores
- thinking_ratios
- output_span
- reasoning_span
- method
- metadata
Export helpers:
trace.topk_inputs(20)
trace.to_dict()
trace.to_json("trace.json")
trace.to_html("trace.html")
Examples
python examples/quickstart.py --help
python examples/quickstart.py \
--model Qwen/Qwen3-8B \
--prompt prompt.txt \
--target target.txt \
--html trace.html
Heavy model examples are intended for GPU environments. CPU smoke tests use tiny randomly initialized models.
Repository Map
- flashtrace/: reusable Python package
- examples/: public quickstarts
- tests/: CPU smoke tests
- exp/: paper experiments and research artifacts
- docs/superpowers/: design and implementation planning documents
Research Experiments
The exp/ directory contains the paper-era experiment runners, case studies, and saved artifacts. The public package API lives in flashtrace/; experiment scripts keep compatibility imports during the package migration.
Troubleshooting
CUDA memory
Use smaller models, lower precision, device_map="auto", shorter prompts, or --recompute-attention.
Span selection
Print trace.generation_tokens and select inclusive generated-token indices. Tokenization can split visible words into multiple model tokens.
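When a visible word is split across model tokens, its inclusive token span can be recovered from character offsets. A sketch under the assumption that concatenating the tokens reproduces the visible text (true for typical BPE tokenizers); the helper is illustrative, not part of the package:

```python
def word_to_token_span(tokens, word):
    """Map a visible word to the inclusive (start, end) token span that covers it."""
    text = "".join(tokens)
    pos = text.find(word)
    if pos < 0:
        return None
    # Character offset range of each token within the concatenated text.
    offsets, cursor = [], 0
    for tok in tokens:
        offsets.append((cursor, cursor + len(tok)))
        cursor += len(tok)
    start = next(i for i, (a, b) in enumerate(offsets) if a <= pos < b)
    end = next(i for i, (a, b) in enumerate(offsets) if a < pos + len(word) <= b)
    return (start, end)

# "Paris" is split into " Par" + "is" by this toy tokenization.
tokens = ["The", " answer", " is", " Par", "is", "."]
print(word_to_token_span(tokens, "Paris"))  # (3, 4)
```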
Deterministic generation
Pass a target file for attribution against a known output. Leave --target out when you want the CLI to generate with deterministic defaults.
Tokenizer alignment
Inspect trace.prompt_tokens and trace.generation_tokens when scores appear shifted from visible text. Attribution scores follow tokenizer-level alignment.
HTML export
trace.to_html("trace.html") writes a standalone file that can be opened locally or shared as an artifact.
Paper
FlashTrace implements the method described in Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs.
Citation
@misc{pan2026flashtrace,
title={Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs},
author={Pan, Wenbo and Liu, Zhichao and Wang, Xianlong and Yu, Haining and Jia, Xiaohua},
year={2026},
eprint={2602.01914},
archivePrefix={arXiv},
primaryClass={cs.LG}
}