
FlashTrace

Fast token attribution for reasoning language models.

FlashTrace traces generated answers back to the prompt tokens that shaped them. Use it from Python or the command line, export JSON traces, and render standalone HTML heatmaps for inspection and sharing.

Paper | Quickstart | CLI | Citation

Why FlashTrace

Reasoning models produce long generated chains whose final answers and intermediate spans deserve targeted inspection. FlashTrace gives researchers a package-first workflow for tracing a selected generated span back to its supporting prompt tokens.

You get:

  • top-k prompt tokens ranked by attribution score
  • JSON traces for downstream analysis
  • standalone HTML token heatmaps
  • optional per-hop attribution panels
  • inclusive generation-token span controls for answer and reasoning segments

Install

From a local checkout:

pip install -e .

For development:

pip install -e ".[dev]"

FlashTrace uses PyTorch, Transformers, Accelerate, NumPy, and tqdm. A CUDA-capable GPU is recommended for full-size Hugging Face models.

Quickstart

from flashtrace import FlashTrace, load_model_and_tokenizer

prompt = """Context: Paris is the capital of France.
Question: What is the capital of France?"""
target = "Paris"

model, tokenizer = load_model_and_tokenizer("Qwen/Qwen3-8B", device_map="auto")
tracer = FlashTrace(model, tokenizer, chunk_tokens=128, sink_chunk_tokens=32)

trace = tracer.trace(
    prompt=prompt,
    target=target,
    output_span=(0, 0),
    hops=1,
)

print(trace.topk_inputs(10))
trace.to_json("trace.json")
trace.to_html("trace.html")

trace.topk_inputs(10) returns TokenScore objects aligned to prompt-token indices:

rank  index  token      score
1     2      Paris      0.184
2     7      capital    0.131
3     10     France     0.119

trace.html is a standalone heatmap that highlights prompt tokens by final attribution score and includes trace metadata for the selected generated span.

Command Line

Create prompt and target files:

printf "Context: Paris is the capital of France.\nQuestion: What is the capital of France?\n" > prompt.txt
printf "Paris" > target.txt

Run a trace:

flashtrace trace \
  --model Qwen/Qwen3-8B \
  --prompt prompt.txt \
  --target target.txt \
  --output-span 0:0 \
  --hops 1 \
  --html trace.html \
  --json trace.json

The command prints a compact top-k table and writes the requested artifacts.

Useful flags (see the combined example after this list):

  • --model: Hugging Face model id or local model path
  • --target: UTF-8 target text file
  • --output-span: inclusive START:END indices over generated tokens
  • --reasoning-span: inclusive START:END indices for a reasoning segment
  • --method: flashtrace, ifr-span, or ifr-matrix
  • --recompute-attention: lower-memory attention recomputation path
  • --device-map: Transformers device map, default auto
  • --dtype: auto, float16, bfloat16, or float32
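
For example, a lower-memory run that selects the span-based IFR method might combine several of these flags. A sketch (the boolean form of --recompute-attention is assumed from the flag description above):

flashtrace trace \
  --model Qwen/Qwen3-8B \
  --prompt prompt.txt \
  --target target.txt \
  --method ifr-span \
  --recompute-attention \
  --dtype bfloat16 \
  --json trace.json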

Token Spans

output_span and reasoning_span use inclusive generation-token indices. The first generated token has index 0.

Use an initial trace to inspect tokenization:

for index, token in enumerate(trace.generation_tokens):
    print(index, repr(token))

Then choose spans:

trace = tracer.trace(
    prompt=prompt,
    target=target,
    reasoning_span=(0, 79),
    output_span=(80, 85),
    hops=1,
)

Scores are aligned to trace.prompt_tokens. trace.per_hop_scores stores the same prompt-token alignment for each hop.
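
A minimal sketch of consuming these alignments, assuming scores and per_hop_scores are sequences of floats parallel to trace.prompt_tokens:

# Pair each prompt token with its final attribution score
# (assumes trace.scores is parallel to trace.prompt_tokens).
for index, (token, score) in enumerate(zip(trace.prompt_tokens, trace.scores)):
    if score > 0.05:  # illustrative threshold
        print(index, repr(token), round(score, 3))

# Per-hop breakdown: one score list per hop, in the same prompt-token order
# (assumes trace.per_hop_scores is a list of such lists).
for hop, hop_scores in enumerate(trace.per_hop_scores, start=1):
    top = max(range(len(hop_scores)), key=hop_scores.__getitem__)
    print(f"hop {hop}: strongest prompt token ->", repr(trace.prompt_tokens[top]))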

Interpreting Results

High-scoring prompt tokens are the tokens FlashTrace attributes most strongly to the selected generated span. For answer inspection, use output_span around the final answer tokens. For chain-of-thought or reasoning inspection, use reasoning_span around the generated reasoning segment.

Recommended workflow (sketched in code after this list):

  1. Run a trace with your prompt and target.
  2. Inspect trace.generation_tokens.
  3. Select the answer or reasoning span.
  4. Export trace.html.
  5. Compare top-k tokens with the source prompt and any expected evidence.
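
A minimal sketch of that workflow, reusing the Quickstart objects (span indices are illustrative; read them off your own tokenization):

# Steps 1-2: run an initial trace, then inspect the generated tokens.
first = tracer.trace(prompt=prompt, target=target, hops=1)
for index, token in enumerate(first.generation_tokens):
    print(index, repr(token))

# Steps 3-5: re-trace with the chosen spans and export the heatmap.
focused = tracer.trace(
    prompt=prompt,
    target=target,
    reasoning_span=(0, 79),  # illustrative indices
    output_span=(80, 85),    # illustrative indices
    hops=1,
)
focused.to_html("trace.html")
print(focused.topk_inputs(10))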

Supported Models

FlashTrace targets Llama/Qwen-style decoder-only Hugging Face causal LMs with:

  • model.layers
  • Q/K/V/O attention projections
  • RMSNorm or LayerNorm
  • RoPE metadata

Validated model families for the first public release:

  • Qwen2
  • Qwen3
  • Llama
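
Before tracing an unfamiliar checkpoint, you can probe the structural requirements above with plain Transformers attribute checks. A rough sketch (attribute layout varies by architecture; any small compatible checkpoint works for a quick check):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
core = getattr(model, "model", model)  # decoder core on Llama/Qwen-style models

layers = getattr(core, "layers", None)
attn = getattr(layers[0], "self_attn", None) if layers is not None else None
has_qkvo = attn is not None and all(
    hasattr(attn, name) for name in ("q_proj", "k_proj", "v_proj", "o_proj")
)
print("model.layers present:", layers is not None)
print("Q/K/V/O projections present:", has_qkvo)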

Python API

The public package exports:

from flashtrace import FlashTrace, TraceResult, load_model_and_tokenizer

FlashTrace.trace(...) accepts:

  • prompt: str
  • target: str | None
  • output_span: tuple[int, int] | None
  • reasoning_span: tuple[int, int] | None
  • hops: int
  • method: "flashtrace" | "ifr-span" | "ifr-matrix"
  • renorm_threshold: float | None
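
A hedged sketch exercising the optional parameters above (the renorm_threshold value is illustrative, not a tuned default):

trace = tracer.trace(
    prompt=prompt,
    target=target,
    output_span=(0, 0),
    hops=2,
    method="ifr-span",
    renorm_threshold=0.01,  # illustrative value
)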

TraceResult includes:

  • prompt_tokens
  • generation_tokens
  • scores
  • per_hop_scores
  • thinking_ratios
  • output_span
  • reasoning_span
  • method
  • metadata

Export helpers:

trace.topk_inputs(20)
trace.to_dict()
trace.to_json("trace.json")
trace.to_html("trace.html")
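
Downstream tooling can read the JSON trace directly. A minimal consumer, assuming the file mirrors the TraceResult fields above:

import json

with open("trace.json") as f:
    data = json.load(f)

# Field names assumed to mirror TraceResult / to_dict().
print(data["method"], data["output_span"])
print(len(data["prompt_tokens"]), "prompt tokens scored")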

Examples

python examples/quickstart.py --help
python examples/quickstart.py \
  --model Qwen/Qwen3-8B \
  --prompt prompt.txt \
  --target target.txt \
  --html trace.html

Heavy model examples are intended for GPU environments. CPU smoke tests use tiny randomly initialized models.

Repository Map

  • flashtrace/: reusable Python package
  • examples/: public quickstarts
  • tests/: CPU smoke tests
  • exp/: paper experiments and research artifacts
  • docs/superpowers/: design and implementation planning documents

Research Experiments

The exp/ directory contains the paper-era experiment runners, case studies, and saved artifacts. The public package API lives in flashtrace/; experiment scripts keep compatibility imports during the package migration.

Troubleshooting

CUDA memory

Use smaller models, lower precision, device_map="auto", shorter prompts, or --recompute-attention.
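
If the bundled helper does not expose the precision control you need, one option is to load with Transformers directly and hand the pair to FlashTrace, which accepts a model/tokenizer pair as in the Quickstart. A sketch:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from flashtrace import FlashTrace

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,  # lower precision to reduce memory
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
tracer = FlashTrace(model, tokenizer, chunk_tokens=128, sink_chunk_tokens=32)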

Span selection

Print trace.generation_tokens and select inclusive generated-token indices. Tokenization can split visible words into multiple model tokens.
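
A hedged helper for picking spans, assuming trace.generation_tokens are decoded string pieces that concatenate back to the visible text (many tokenizers instead emit marker characters, so verify against the printed tokens):

def find_span(tokens, needle):
    """Return inclusive (start, end) token indices covering needle, or None."""
    text, starts = "", []
    for token in tokens:
        starts.append(len(text))  # character offset where each token begins
        text += token
    pos = text.find(needle)
    if pos < 0:
        return None
    start = max(i for i, off in enumerate(starts) if off <= pos)
    end = max(i for i, off in enumerate(starts) if off < pos + len(needle))
    return start, end

print(find_span(trace.generation_tokens, "Paris"))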

Deterministic generation

Pass a target file for attribution against a known output. Leave --target out when you want the CLI to generate with deterministic defaults.

Tokenizer alignment

Inspect trace.prompt_tokens and trace.generation_tokens when scores appear shifted from visible text. Attribution scores follow tokenizer-level alignment.

HTML export

trace.to_html("trace.html") writes a standalone file that can be opened locally or shared as an artifact.

Paper

FlashTrace implements the method described in "Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs."

Citation

@misc{pan2026flashtrace,
  title={Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs},
  author={Pan, Wenbo and Liu, Zhichao and Wang, Xianlong and Yu, Haining and Jia, Xiaohua},
  year={2026},
  eprint={2602.01914},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
