A high-performance instrumentation framework for capturing, streaming, and analyzing internal activations of large language models

Maintainers

rubenfb23

LLM Instrumentation Framework

A high-performance instrumentation framework for LLM interpretability and observability.

Objectives

Throughput: Maintain ≥ 90% of un-instrumented inference speed.
Data rate: Sustain ≥ 2 GB/s activation streaming to disk.
Compression: Achieve ≥ 3× reduction with lossy error < 1e-3 when enabled.
Memory: Keep host RAM usage ≤ 24 GB with backpressure and buffering.
Resilience: Automatic checkpoints safeguard long generations from data loss.

Stack

Runtime: PyTorch, asyncio, threading.
GPU: Optional CUDA streams and pinned buffers (see memory/cuda_manager.py).
Compression: LZ4, Zstd, optional no-op.
Analysis: Hooks for downstream causal graphs and SAE-based features.

Install

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Quick Usage

import torch
from llm_instrumentation import capture_activations
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

with capture_activations(model, preset="balanced", output_path="output.stream") as framework:
    _ = model(torch.randint(0, 100, (1, 16)))

analysis = framework.analyze_activations("output.stream")

Per-token Tracking & Checkpointing (opt-in)

Enable lightweight token boundary tracking without affecting the compression/streaming pipeline. Token metadata is stored in memory and saved to `{output_path}_tokens.json` on context exit. Optional checkpoints persist snapshots every _N_ tokens to make long streaming sessions resumable.

```python
import torch
from llm_instrumentation import analyze_activations_with_tokens

# Assumes `framework`, `model`, and `tokenizer` have already been created.
with framework.capture_activations("gen.stream", track_per_token=True) as tracker:
    ids = torch.randint(0, 100, (1, 8))
    for _ in range(32):
        with torch.no_grad():
            out = model(ids)
            # Greedy decoding: take the highest-scoring next token.
            next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tracker.record_token(next_tok[0].item(), tokenizer.decode(next_tok[0]))
        ids = torch.cat([ids, next_tok], dim=-1)
        if next_tok[0].item() == tokenizer.eos_token_id:
            break

analysis = analyze_activations_with_tokens("gen.stream", framework)
print("bytes_per_token:", analysis.get("bytes_per_token"))

# Persist checkpoints every 64 tokens to allow resuming long captures.
with framework.capture_activations(
    "gen.stream",
    track_per_token=True,
    checkpoint_interval_tokens=64,
) as tracker:
    ...

# Resume the same capture later (appends to the existing stream file).
with framework.capture_activations(
    "gen.stream",
    track_per_token=True,
    checkpoint_interval_tokens=64,
    resume_from_checkpoint=True,
) as tracker:
    ...
```

Checkpoint files default to `{output_path}.ckpt.json`; override with `checkpoint_path` to control placement. Successful captures clean up checkpoints after flushing token metadata.
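
The checkpoint-and-resume behaviour described above can be pictured with a small, library-independent sketch. The `TokenCheckpointer` class, its method names, and its JSON layout are hypothetical and are not taken from llm_instrumentation; the sketch only illustrates the general pattern of persisting a snapshot every N tokens, reloading it on resume, and deleting it after a clean finish.

```python
import json
import os
import tempfile


class TokenCheckpointer:
    """Hypothetical helper: snapshot token metadata every `interval` tokens."""

    def __init__(self, path, interval, resume=False):
        self.path = path
        self.interval = interval
        self.tokens = []
        # On resume, start from whatever the last snapshot contained.
        if resume and os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.tokens = json.load(fh)["tokens"]

    def record(self, token_id, text):
        self.tokens.append({"id": token_id, "text": text})
        if len(self.tokens) % self.interval == 0:
            self._save()

    def _save(self):
        # Write to a temporary file and rename it into place, so a crash
        # never leaves a half-written checkpoint behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump({"tokens": self.tokens}, fh)
        os.replace(tmp_path, self.path)

    def finish(self):
        # A successful capture no longer needs its checkpoint.
        if os.path.exists(self.path):
            os.remove(self.path)


ckpt_path = os.path.join(tempfile.mkdtemp(), "gen.stream.ckpt.json")

first_run = TokenCheckpointer(ckpt_path, interval=2)
for i in range(5):
    first_run.record(i, f"tok{i}")
# Simulated crash here: tokens 0-3 were persisted, token 4 was not.

resumed = TokenCheckpointer(ckpt_path, interval=2, resume=True)
print("tokens recovered:", len(resumed.tokens))
```

Only tokens recorded before the last snapshot survive a crash in this sketch, which is why a smaller interval trades extra writes for less lost work.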

Configuration

InstrumentationConfig.fast_capture() - minimal overhead capture without compression.
InstrumentationConfig.balanced() - default preset balancing throughput and compression.
InstrumentationConfig.max_compression() - prioritize disk footprint with Zstd.
InstrumentationConfig.attention_analysis() / .mlp_analysis() - capture subsets for focused studies.
Builder-style overrides: e.g. InstrumentationConfig.balanced().with_compression("zstd").with_memory_limit(16).
Direct parameters:
- granularity (HookGranularity): FULL_TENSOR, SAMPLED_SLICES, ATTENTION_ONLY, MLP_ONLY.
- compression_algorithm (str): "lz4", "zstd", or "none".
- target_throughput_gbps (float): Desired streaming rate for tuning.
- max_memory_gb (float|None): Budget for host buffering policies.

Refer to docs/API.md for full API details.
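
To show how preset-plus-override chaining of this kind can work in general, here is a minimal sketch using a frozen dataclass. `SketchConfig` is a hypothetical stand-in, not the real `InstrumentationConfig`; its default values are placeholders chosen for illustration, and only the method names and the allowed compression strings come from the text above.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Values listed under "Direct parameters" above.
VALID_ALGORITHMS = ("lz4", "zstd", "none")


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a preset-plus-overrides config object."""

    compression_algorithm: str = "lz4"    # placeholder default
    target_throughput_gbps: float = 2.0   # placeholder default
    max_memory_gb: Optional[float] = None

    @classmethod
    def balanced(cls):
        # A preset is just a named set of defaults.
        return cls()

    def with_compression(self, algorithm):
        if algorithm not in VALID_ALGORITHMS:
            raise ValueError(f"unknown compression algorithm: {algorithm!r}")
        # `replace` returns a modified copy; the preset itself is untouched.
        return replace(self, compression_algorithm=algorithm)

    def with_memory_limit(self, gigabytes):
        return replace(self, max_memory_gb=float(gigabytes))


base = SketchConfig.balanced()
tuned = base.with_compression("zstd").with_memory_limit(16)
print(tuned)
```

Returning a new object from each `with_*` call keeps presets reusable, since an override applied in one place cannot leak into another capture.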

Automatic Configuration

The llm_instrumentation.core.auto_detect module can derive sensible defaults from a model instance:

from llm_instrumentation import capture_activations
from llm_instrumentation.core.auto_detect import create_optimized_config, detect_model_architecture

arch = detect_model_architecture(model)
config = create_optimized_config(arch, purpose="performance_analysis")
with capture_activations(model, config=config, output_path=f"{arch}.stream"):
    ...

This path keeps manual overrides available via the builder helpers while accelerating setup for common analysis workflows.

Stream Format

Each packet:

Header: network byte order (big-endian), struct format !HI → (name_len: uint16, data_len: uint32), 6 bytes in total
Name: UTF-8 layer/module name (name_len bytes)
Data: compressed tensor bytes (data_len bytes)

See docs/STREAM_FORMAT.md for a parsing example.
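
As a rough illustration of the packet layout above, the following sketch writes and reads packets with Python's `struct` module. The helper names `write_packet` and `read_packets` are made up for this example and are not part of the package; the payload is treated as opaque bytes, so decompressing it (LZ4, Zstd, or none) is left to the caller.

```python
import io
import struct

# "!HI": network byte order, uint16 name_len followed by uint32 data_len.
HEADER = struct.Struct("!HI")


def write_packet(fh, name, payload):
    """Write one packet: header, UTF-8 name, then payload bytes."""
    encoded = name.encode("utf-8")
    fh.write(HEADER.pack(len(encoded), len(payload)))
    fh.write(encoded)
    fh.write(payload)


def read_packets(fh):
    """Yield (name, payload) pairs until the stream is exhausted."""
    while True:
        header = fh.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise ValueError("truncated packet header")
        name_len, data_len = HEADER.unpack(header)
        name = fh.read(name_len)
        payload = fh.read(data_len)
        if len(name) < name_len or len(payload) < data_len:
            raise ValueError("truncated packet body")
        yield name.decode("utf-8"), payload


buffer = io.BytesIO()
write_packet(buffer, "layers.0.mlp", b"\x01\x02\x03")
write_packet(buffer, "layers.0.attn", b"")
buffer.seek(0)

packets = list(read_packets(buffer))
for name, payload in packets:
    print(name, len(payload))
```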

Architecture

End-to-end path: PyTorch forward hooks → async enqueue → compression workers → ring buffer → async file writer. See docs/ARCHITECTURE.md.
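
The shape of that pipeline can be sketched with the standard library alone. This is a simplified model under several assumptions, not the framework's implementation: a plain generator of byte strings stands in for the forward hooks, `zlib` stands in for LZ4/Zstd, the ring buffer and file writer are collapsed into a single in-memory sink, and all function names are invented for the example.

```python
import asyncio
import io
import struct
import zlib

HEADER = struct.Struct("!HI")  # same packet header as the stream format


async def producer(queue, activations):
    # Stands in for the forward hooks. When the queue is full, `put` waits,
    # which is the backpressure that bounds host memory.
    for name, raw in activations:
        await queue.put((name, raw))
    await queue.put(None)  # sentinel: no more data


async def compress_and_write(queue, sink):
    loop = asyncio.get_running_loop()
    while True:
        item = await queue.get()
        if item is None:
            break
        name, raw = item
        # Compress in a worker thread so the event loop stays responsive.
        payload = await loop.run_in_executor(None, zlib.compress, raw)
        encoded = name.encode("utf-8")
        sink.write(HEADER.pack(len(encoded), len(payload)) + encoded + payload)


async def run_pipeline(activations, max_pending=4):
    queue = asyncio.Queue(maxsize=max_pending)
    sink = io.BytesIO()
    await asyncio.gather(
        producer(queue, activations),
        compress_and_write(queue, sink),
    )
    return sink.getvalue()


fake_activations = [(f"layer.{i}", bytes(1024)) for i in range(8)]
stream_bytes = asyncio.run(run_pipeline(fake_activations))
print("raw bytes:", 8 * 1024, "streamed bytes:", len(stream_bytes))
```

The bounded queue is the key design point: if compression or disk writes fall behind, the producer slows down instead of buffering without limit.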

Benchmarks & Performance

Run scripts/run_benchmarks.sh and see docs/PERFORMANCE.md for targets, methodology, and how to generate reports.

Block I/O Instrumentation

Overview

STRAP-LLM includes eBPF-based block I/O monitoring to correlate disk performance with activation streaming:

scripts/tracepoints.py: Captures latency histograms and queue depth using stable kernel tracepoints (block:block_rq_issue/block:block_rq_complete)
scripts/analyze_tracepoints.py: Generates summaries and PNG visualizations from persisted JSONL snapshots

Quick Start

Collect I/O metrics:

sudo python3 scripts/tracepoints.py --interval 5 --output tracepoints.jsonl

Analyze results:

python3 scripts/analyze_tracepoints.py \
  --input tracepoints.jsonl \
  --output-dir ../benchmarks/systems/I-O

Features

Low overhead: < 1% CPU usage, ~100 ns per I/O request
Stable ABI: Uses kernel tracepoints (no kprobes)
Async persistence: Memory-mapped JSONL writer with batch flushes
Log₂ histograms: Constant memory usage at any IOPS level
Queue depth tracking: In-flight request monitoring per device
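
The constant-memory property of log₂ histograms follows from the bucketing rule: each latency maps to the index of its highest set bit, so the number of buckets depends on the value range, not on the number of requests. A minimal sketch of that rule is below; the function names are illustrative, and clamping sub-microsecond values into slot 0 is an assumption made here rather than documented behaviour of tracepoints.py.

```python
from collections import Counter


def log2_slot(latency_us):
    """Return the log2 bucket index for a latency in microseconds."""
    if latency_us < 1:
        return 0  # assumption: sub-microsecond values share slot 0
    return int(latency_us).bit_length() - 1


def bucket_bounds(slot):
    """Inclusive (low, high) microsecond range covered by a slot."""
    return 1 << slot, (1 << (slot + 1)) - 1


samples_us = [3, 17, 20, 31, 32, 900]
histogram = Counter(log2_slot(value) for value in samples_us)

for slot in sorted(histogram):
    low, high = bucket_bounds(slot)
    print(f"slot {slot}: {low}-{high} us, count={histogram[slot]}")
```

With this rule, slot 4 covers 16 to 31 µs, and a 64-bit latency counter never needs more than 64 slots per device.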

CLI Options

| Flag | Description | Default |
| --- | --- | --- |
| `--interval` | Sampling interval (seconds) | 5.0 |
| `--output` | JSONL output file | `tracepoints.jsonl` |
| `--no-output` | Disable file output | False |
| `--flush-every` | Snapshots per flush | 12 |
| `--fsync` | Force fsync after flush | False |

Output Format

Each JSONL line contains:

Timestamp (Unix epoch + ISO 8601)
Per-device latency histogram (log₂ buckets in μs)
Per-device in-flight request count

Example:

{
  "timestamp": 1696262400.123,
  "iso_timestamp": "2023-10-02T16:00:00.123000+00:00",
  "interval_s": 5.0,
  "latency_histogram": [
    {
      "device_name": "nvme0n1",
      "total": 45123,
      "buckets": [
        {"slot": 4, "count": 12000, "bucket_low": 16, "bucket_high": 31}
      ]
    }
  ],
  "inflight": [
    {"device_name": "nvme0n1", "count": 24}
  ]
}
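
One way to consume such a snapshot is to estimate latency percentiles from the bucket counts. The sketch below uses only the field names from the example above; the bucket counts are made-up sample data, and `percentile_upper_bound` is a hypothetical helper, not something shipped in scripts/analyze_tracepoints.py. Because the buckets are log₂-sized, the result is an upper bound on the percentile, not an exact value.

```python
import json

# Made-up snapshot using the field names from the example above.
snapshot_line = json.dumps({
    "timestamp": 1696262400.123,
    "interval_s": 5.0,
    "latency_histogram": [{
        "device_name": "nvme0n1",
        "total": 100,
        "buckets": [
            {"slot": 4, "count": 60, "bucket_low": 16, "bucket_high": 31},
            {"slot": 6, "count": 35, "bucket_low": 64, "bucket_high": 127},
            {"slot": 10, "count": 5, "bucket_low": 1024, "bucket_high": 2047},
        ],
    }],
    "inflight": [{"device_name": "nvme0n1", "count": 24}],
})


def percentile_upper_bound(device_hist, fraction):
    """Upper edge (in microseconds) of the bucket holding the percentile."""
    total = sum(bucket["count"] for bucket in device_hist["buckets"])
    threshold = fraction * total
    seen = 0
    for bucket in sorted(device_hist["buckets"], key=lambda b: b["slot"]):
        seen += bucket["count"]
        if seen >= threshold:
            return bucket["bucket_high"]
    return None  # no buckets recorded for this device


snapshot = json.loads(snapshot_line)
per_device = {
    hist["device_name"]: {
        "p50_us": percentile_upper_bound(hist, 0.50),
        "p99_us": percentile_upper_bound(hist, 0.99),
    }
    for hist in snapshot["latency_histogram"]
}
print(per_device)
```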

Documentation

See docs/BLOCK_IO_TRACEPOINTS.md for:

Prerequisites and installation
Detailed usage examples
Integration with LLM workflows
Troubleshooting guide
Performance characteristics
Advanced customization

CPU & Memory Metrics

Overview

scripts/system_metrics.py: Hooks the exceptions:page_fault_user and sched:sched_switch tracepoints to capture per-PID page faults, off-CPU time, and PSI pressure for CPU, I/O, and memory.
It runs as root and persists JSONL snapshots with the fields off_cpu_ns, page_faults, and pressure every N seconds.
The output complements the latency and queue histograms produced by tracepoints.py, so that service latency can be correlated with CPU contention, swapping, and system-wide pressure.

Quick Start

sudo python3 scripts/system_metrics.py --interval 5 --output system_metrics.jsonl

Each JSON line includes timestamp, iso_timestamp, interval_s, an off_cpu_ns map (PID → nanoseconds off CPU), page_faults (PID → user page faults), and a pressure structure with PSI metrics for CPU, I/O, and memory.

To print samples to the screen only, add --no-output. Use --flush-every and --fsync to control asynchronous flushing to disk.

CLI Options

| Flag | Description | Default |
| --- | --- | --- |
| `--interval` | Interval between snapshots (seconds) | 5.0 |
| `--output` | JSONL output file | `system_metrics.jsonl` |
| `--no-output` | Disable writing to disk | False |
| `--flush-every` | Snapshots per flush | 12 |
| `--fsync` | Force fsync after each flush | False |

Correlation

Combine system_metrics.jsonl and tracepoints.jsonl with scripts/analyze_tracepoints.py, or load them into pandas yourself, to attribute latency to CPU contention, page faults, PSI pressure, or disk I/O.
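
A minimal sketch of such a correlation, using only the standard library, is shown below. It pairs each block I/O snapshot with the system snapshot closest in time and puts the headline numbers side by side. The field names follow the two output formats described in this document; the sample lines, the `max_skew_s` tolerance, and the helper names are assumptions made for illustration, and the `pressure` structure is left out because its layout is not specified here.

```python
import json


def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def correlate(io_snaps, sys_snaps, max_skew_s=2.5):
    """Join I/O and system snapshots whose timestamps are close together."""
    rows = []
    if not sys_snaps:
        return rows
    for io_snap in io_snaps:
        ts = io_snap["timestamp"]
        sys_snap = min(sys_snaps, key=lambda s: abs(s["timestamp"] - ts))
        if abs(sys_snap["timestamp"] - ts) > max_skew_s:
            continue  # no system sample close enough to this I/O sample
        rows.append({
            "timestamp": ts,
            "io_requests": sum(h["total"] for h in io_snap["latency_histogram"]),
            "inflight": sum(d["count"] for d in io_snap["inflight"]),
            "off_cpu_ms": sum(sys_snap["off_cpu_ns"].values()) / 1e6,
            "page_faults": sum(sys_snap["page_faults"].values()),
        })
    return rows


# Made-up sample lines; real files come from the two collector scripts.
io_lines = [
    '{"timestamp": 100.0, "latency_histogram": [{"device_name": "nvme0n1", '
    '"total": 500, "buckets": []}], "inflight": [{"device_name": "nvme0n1", "count": 3}]}',
    '{"timestamp": 105.0, "latency_histogram": [{"device_name": "nvme0n1", '
    '"total": 900, "buckets": []}], "inflight": [{"device_name": "nvme0n1", "count": 12}]}',
]
sys_lines = [
    '{"timestamp": 100.4, "off_cpu_ns": {"4242": 2000000}, "page_faults": {"4242": 10}}',
    '{"timestamp": 105.2, "off_cpu_ns": {"4242": 8000000, "77": 1000000}, '
    '"page_faults": {"4242": 250}}',
]

rows = correlate(load_jsonl(io_lines), load_jsonl(sys_lines))
for row in rows:
    print(row)
```

The resulting rows can be passed to `pandas.DataFrame(rows)` for plotting or further analysis if pandas is available.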

Development

Tests: run pytest -q from the repo root or the package directory.
Examples: examples/basic_usage.py.
Contributions: PRs welcome. Keep changes focused and covered by tests.

Release history

1.3.1 (Nov 5, 2025)
1.3.0 (Nov 5, 2025)
1.2.11 (Nov 3, 2025)
1.2.0 (Nov 3, 2025, this version)
1.1.0 (Nov 2, 2025)
0.1.0 (Nov 2, 2025)

Download files

Source Distribution

llm_instrumentation-1.2.0.tar.gz (67.5 kB)

Uploaded Nov 3, 2025 (Source)

Built Distribution

llm_instrumentation-1.2.0-py3-none-any.whl (29.9 kB)

Uploaded Nov 3, 2025 (Python 3)

File details

Details for the file llm_instrumentation-1.2.0.tar.gz.

File metadata

Download URL: llm_instrumentation-1.2.0.tar.gz
Upload date: Nov 3, 2025
Size: 67.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_instrumentation-1.2.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4ac4fc13676eada2eb2a570ea484239a63a55ac6fa810c2dcf4465b617e9ddda` |
| MD5 | `02e869ea639e71315d0864bb8c775e51` |
| BLAKE2b-256 | `ad9524f6ffc06773b325faa45fb58fcdcff6e658dc376693b3d03e9c482243b3` |
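
A downloaded file can be checked against a published SHA256 digest with a few lines of Python. This is a generic sketch using `hashlib`; the helper names are made up for the example, and the demonstration hashes a small temporary file rather than the real archive, so substitute the actual path and the digest listed above when verifying a download.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's SHA256 digest with the published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demonstration on a throwaway file; not the real distribution archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
sample_path = tmp.name

sample_ok = matches_published_hash(sample_path, hashlib.sha256(b"hello").hexdigest())
sample_bad = matches_published_hash(sample_path, "0" * 64)
os.remove(sample_path)
print("match:", sample_ok, "mismatch detected:", not sample_bad)
```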

Provenance

The following attestation bundles were made for llm_instrumentation-1.2.0.tar.gz:

Publisher: python-publish.yml on rubenfb23/STRAP-LLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_instrumentation-1.2.0.tar.gz
- Subject digest: 4ac4fc13676eada2eb2a570ea484239a63a55ac6fa810c2dcf4465b617e9ddda
- Sigstore transparency entry: 662443246
- Sigstore integration time: Nov 3, 2025
Source repository:
- Permalink: rubenfb23/STRAP-LLM@ec18a16292ba75b8cb2560153f94804fed7771ce
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/rubenfb23
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@ec18a16292ba75b8cb2560153f94804fed7771ce
- Trigger Event: release

File details

Details for the file llm_instrumentation-1.2.0-py3-none-any.whl.

File metadata

Download URL: llm_instrumentation-1.2.0-py3-none-any.whl
Upload date: Nov 3, 2025
Size: 29.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_instrumentation-1.2.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `d18a21ef324aac7773964c198a94885f4d86c1f359b97d41fcbd3fc6e64d6c3d` |
| MD5 | `0dbb175d11977f61c14dc0b500137cb7` |
| BLAKE2b-256 | `61907dd0bbe72149c5a9d0a8633f0c74e9ac2b7ae2ba4f76204117cdde65ae5b` |

Provenance

The following attestation bundles were made for llm_instrumentation-1.2.0-py3-none-any.whl:

Publisher: python-publish.yml on rubenfb23/STRAP-LLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_instrumentation-1.2.0-py3-none-any.whl
- Subject digest: d18a21ef324aac7773964c198a94885f4d86c1f359b97d41fcbd3fc6e64d6c3d
- Sigstore transparency entry: 662443250
- Sigstore integration time: Nov 3, 2025
Source repository:
- Permalink: rubenfb23/STRAP-LLM@ec18a16292ba75b8cb2560153f94804fed7771ce
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/rubenfb23
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@ec18a16292ba75b8cb2560153f94804fed7771ce
- Trigger Event: release

llm-instrumentation 1.2.0
