A high-performance instrumentation framework for capturing, streaming, and analyzing internal activations of large language models

LLM Instrumentation Framework

A high-performance instrumentation framework for LLM interpretability and observability.

Objectives

  • Throughput: Maintain ≥ 90% of un-instrumented inference speed.
  • Data rate: Sustain ≥ 2 GB/s activation streaming to disk.
  • Compression: Achieve ≥ 3× reduction with lossy error < 1e-3 when enabled.
  • Memory: Keep host RAM usage ≤ 24 GB with backpressure and buffering.

Stack

  • Runtime: PyTorch, asyncio, threading.
  • GPU: Optional CUDA streams and pinned buffers (see memory/cuda_manager.py).
  • Compression: LZ4, Zstd, optional no-op.
  • Analysis: Hooks for downstream causal graphs and SAE-based features.

Install

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Quick Usage

import torch
from llm_instrumentation import (
    InstrumentationFramework,
    InstrumentationConfig,
    HookGranularity,
)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = InstrumentationConfig(
    granularity=HookGranularity.ATTENTION_ONLY,
    compression_algorithm="lz4",  # or "zstd" or "none"
    target_throughput_gbps=2.0,
    max_memory_gb=24,
)

framework = InstrumentationFramework(config)
framework.instrument_model(model)

with framework.capture_activations("output.stream"):
    _ = model(torch.randint(0, 100, (1, 16)))

analysis = framework.analyze_activations("output.stream")

## Per-token Tracking (opt-in)

Enable lightweight token boundary tracking without affecting the compression/streaming pipeline. Token metadata is stored in memory and saved to `{output_path}_tokens.json` on context exit.

```python
from transformers import AutoTokenizer
from llm_instrumentation import analyze_activations_with_tokens

# Tokenizer matching the model loaded in Quick Usage.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

with framework.capture_activations("gen.stream", track_per_token=True) as tracker:
    ids = torch.randint(0, 100, (1, 8))  # random prompt of 8 token IDs
    for _ in range(32):  # greedy decoding, at most 32 new tokens
        with torch.no_grad():
            out = model(ids)
            next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # Pass the token ID and its decoded text to the tracker.
        tracker.record_token(next_tok[0].item(), tokenizer.decode(next_tok[0]))
        ids = torch.cat([ids, next_tok], dim=-1)
        if next_tok[0].item() == tokenizer.eos_token_id:
            break

analysis = analyze_activations_with_tokens("gen.stream", framework)
print("bytes_per_token:", analysis.get("bytes_per_token"))
```

## Configuration

- `granularity` (`HookGranularity`):
  - `FULL_TENSOR`: Capture all supported layer outputs.
  - `SAMPLED_SLICES`: Randomly samples elements by `sampling_rate`.
  - `ATTENTION_ONLY`: Only layers whose names include `attn`.
  - `MLP_ONLY`: Only layers whose names include `mlp`.
- `compression_algorithm` (`str`): `"lz4"`, `"zstd"`, or `"none"`.
- `target_throughput_gbps` (`float`): Desired streaming rate for tuning.
- `max_memory_gb` (`float|None`): Budget for host buffering policies.

Refer to `docs/API.md` for full API details.

## Stream Format

Each packet:

- Header: network-endian `!HI` → `(name_len: uint16, data_len: uint32)`
- Name: UTF-8 layer/module name (`name_len` bytes)
- Data: compressed tensor bytes (`data_len` bytes)

See `docs/STREAM_FORMAT.md` for a parsing example.
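
As a rough illustration of the packet layout above, the following sketch reads packets with the standard `struct` module. It assumes packets are written back to back with no file-level header, which the README does not state; `iter_packets` and `encode_packet` are hypothetical helper names, not part of the package. The payload is returned still compressed, so decompression with the configured algorithm would be a separate step.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple

# Header layout from the section above: network-endian uint16 + uint32.
HEADER = struct.Struct("!HI")  # 6 bytes, no padding


def iter_packets(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (layer_name, compressed_payload) for each packet.

    Assumption: packets are stored back to back with no file-level header.
    """
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise ValueError("truncated packet header")
        name_len, data_len = HEADER.unpack(header)
        name = stream.read(name_len)
        data = stream.read(data_len)
        if len(name) < name_len or len(data) < data_len:
            raise ValueError("truncated packet body")
        yield name.decode("utf-8"), data


def encode_packet(name: str, payload: bytes) -> bytes:
    """Build one packet; used here only to produce sample input."""
    raw_name = name.encode("utf-8")
    return HEADER.pack(len(raw_name), len(payload)) + raw_name + payload


# Round-trip two synthetic packets through an in-memory stream.
blob = (
    encode_packet("model.layers.0.self_attn", b"\x01\x02\x03")
    + encode_packet("model.layers.0.mlp", b"")
)
packets = list(iter_packets(io.BytesIO(blob)))
```

For a real capture, the same generator could be fed `open("output.stream", "rb")`, provided the back-to-back assumption holds for that file.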

## Architecture

E2E path: PyTorch forward hooks → async enqueue → compression workers → ring buffer → async file writer. See `docs/ARCHITECTURE.md`.
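
To make the shape of that path concrete, here is a stripped-down sketch using only the standard library. It is not the framework's implementation: it uses threads and bounded queues in place of asyncio and the ring buffer, swaps in `zlib` for LZ4/Zstd (which need third-party packages), and feeds the pipeline from a plain loop instead of forward hooks. All names are hypothetical.

```python
import io
import queue
import struct
import threading
import zlib
from typing import BinaryIO

HEADER = struct.Struct("!HI")  # same framing as the stream format
_STOP = object()               # sentinel that shuts a stage down


def compress_worker(raw_q: queue.Queue, out_q: queue.Queue) -> None:
    """Compress (name, payload) items and frame them as packets."""
    while True:
        item = raw_q.get()
        if item is _STOP:
            out_q.put(_STOP)  # propagate shutdown to the writer
            return
        name, payload = item
        data = zlib.compress(payload, 1)  # zlib stands in for LZ4/Zstd
        raw_name = name.encode("utf-8")
        out_q.put(HEADER.pack(len(raw_name), len(data)) + raw_name + data)


def writer(out_q: queue.Queue, sink: BinaryIO) -> None:
    """Drain framed packets into the sink until the sentinel arrives."""
    while True:
        packet = out_q.get()
        if packet is _STOP:
            return
        sink.write(packet)


# Bounded queues give backpressure: put() blocks once a queue is full.
raw_q: queue.Queue = queue.Queue(maxsize=8)
out_q: queue.Queue = queue.Queue(maxsize=8)
sink = io.BytesIO()

threads = [
    threading.Thread(target=compress_worker, args=(raw_q, out_q)),
    threading.Thread(target=writer, args=(out_q, sink)),
]
for t in threads:
    t.start()

# Stand-in for forward hooks: enqueue four fake activation buffers.
for i in range(4):
    raw_q.put((f"layer.{i}.attn", bytes(1024)))
raw_q.put(_STOP)

for t in threads:
    t.join()
```

With a single compression worker, packets reach the sink in enqueue order; adding more workers would need an explicit ordering or sequence-number scheme, which this sketch leaves out.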

## Benchmarks & Performance

Run `scripts/run_benchmarks.sh` and see `docs/PERFORMANCE.md` for targets, methodology, and how to generate reports.

## Block I/O Instrumentation

### Overview

STRAP-LLM includes eBPF-based block I/O monitoring to correlate disk performance with activation streaming:

- **`scripts/tracepoints.py`**: Captures latency histograms and queue depth using stable kernel tracepoints (`block:block_rq_issue`/`block:block_rq_complete`)
- **`scripts/analyze_tracepoints.py`**: Generates summaries and PNG visualizations from persisted JSONL snapshots

### Quick Start

**Collect I/O metrics:**

```bash
sudo python3 scripts/tracepoints.py --interval 5 --output tracepoints.jsonl
```

Analyze results:

python3 scripts/analyze_tracepoints.py \
  --input tracepoints.jsonl \
  --output-dir ../benchmarks/systems/I-O

Features

  • Low overhead: < 1% CPU usage, ~100ns per I/O request
  • Stable ABI: Uses kernel tracepoints (no kprobes)
  • Async persistence: Memory-mapped JSONL writer with batch flushes
  • Log₂ histograms: Constant memory usage at any IOPS level
  • Queue depth tracking: In-flight request monitoring per device

CLI Options

| Flag | Description | Default |
| --- | --- | --- |
| `--interval` | Sampling interval (seconds) | `5.0` |
| `--output` | JSONL output file | `tracepoints.jsonl` |
| `--no-output` | Disable file output | `False` |
| `--flush-every` | Snapshots per flush | `12` |
| `--fsync` | Force fsync after flush | `False` |

Output Format

Each JSONL line contains:

  • Timestamp (Unix epoch + ISO 8601)
  • Per-device latency histogram (log₂ buckets in μs)
  • Per-device in-flight request count

Example:

{
  "timestamp": 1696262400.123,
  "iso_timestamp": "2025-10-02T14:20:00.123000+00:00",
  "interval_s": 5.0,
  "latency_histogram": [
    {
      "device_name": "nvme0n1",
      "total": 45123,
      "buckets": [
        {"slot": 4, "count": 12000, "bucket_low": 16, "bucket_high": 31}
      ]
    }
  ],
  "inflight": [
    {"device_name": "nvme0n1", "count": 24}
  ]
}
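
One way to use these snapshots is to turn a log₂ histogram into latency percentiles. The sketch below is an illustration only: `percentile_upper_bound` is a hypothetical helper, the sample counts are invented, and it sums the bucket counts itself rather than relying on `total`. Because each bucket covers a range, the result is an upper bound on the percentile, not an exact latency.

```python
import json
import math
from typing import Dict, List


def percentile_upper_bound(buckets: List[Dict[str, int]], q: float) -> int:
    """Return bucket_high (microseconds) of the bucket holding quantile q.

    Log2 buckets only bound the value, so this is an upper bound.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets, key=lambda b: b["slot"])
    total = sum(b["count"] for b in ordered)
    if total == 0:
        raise ValueError("empty histogram")
    target = math.ceil(q * total)  # rank of the requested sample
    seen = 0
    for bucket in ordered:
        seen += bucket["count"]
        if seen >= target:
            return bucket["bucket_high"]
    # Not reached for q <= 1; kept so the function always returns an int.
    return ordered[-1]["bucket_high"]


# Synthetic snapshot line in the shape shown above (counts are made up).
line = json.dumps({
    "latency_histogram": [{
        "device_name": "nvme0n1",
        "total": 1000,
        "buckets": [
            {"slot": 4, "count": 700, "bucket_low": 16, "bucket_high": 31},
            {"slot": 5, "count": 250, "bucket_low": 32, "bucket_high": 63},
            {"slot": 7, "count": 50, "bucket_low": 128, "bucket_high": 255},
        ],
    }]
})

snapshot = json.loads(line)
results = {}
for hist in snapshot["latency_histogram"]:
    p50 = percentile_upper_bound(hist["buckets"], 0.50)
    p99 = percentile_upper_bound(hist["buckets"], 0.99)
    results[hist["device_name"]] = (p50, p99)
    print(hist["device_name"], "p50 <=", p50, "us, p99 <=", p99, "us")
```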

Documentation

See docs/BLOCK_IO_TRACEPOINTS.md for:

  • Prerequisites and installation
  • Detailed usage examples
  • Integration with LLM workflows
  • Troubleshooting guide
  • Performance characteristics
  • Advanced customization

CPU & Memory Metrics

Overview

  • scripts/system_metrics.py: Hooks the exceptions:page_fault_user and sched:sched_switch tracepoints to capture per-PID page faults, off-CPU time, and PSI pressure for CPU, I/O, and memory.
  • Runs as root and persists JSONL snapshots with the fields off_cpu_ns, page_faults, and pressure every N seconds.
  • The output complements the latency and queue-depth histograms produced by tracepoints.py, so service latency can be correlated with CPU contention, swapping, and system-wide pressure.

Quick Start

sudo python3 scripts/system_metrics.py --interval 5 --output system_metrics.jsonl

Each JSON line includes timestamp, iso_timestamp, interval_s, an off_cpu_ns map (PID → nanoseconds off CPU), a page_faults map (PID → user page faults), and a pressure structure with PSI metrics for CPU, I/O, and memory.

To print samples to the screen only, add --no-output. Use --flush-every and --fsync to control asynchronous flushing to disk.

CLI Options

| Flag | Description | Default |
| --- | --- | --- |
| `--interval` | Interval between snapshots (seconds) | `5.0` |
| `--output` | JSONL output file | `system_metrics.jsonl` |
| `--no-output` | Disable writing to disk | `False` |
| `--flush-every` | Snapshots per flush | `12` |
| `--fsync` | Force fsync after each flush | `False` |

Correlation

Combine system_metrics.jsonl and tracepoints.jsonl using scripts/analyze_tracepoints.py or custom pandas loads to attribute latency to CPU contention, page faults, PSI pressure, or disk I/O.
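
A minimal standard-library sketch of such a join is shown below, pairing each I/O snapshot with the nearest system snapshot by `timestamp`. The helper names, the `max_skew_s` tolerance (half of the default 5 s interval), and the sample values are all assumptions for illustration. It touches only the `timestamp`, `latency_histogram[].total`, `off_cpu_ns`, and `page_faults` fields described above.

```python
import bisect
import json
from typing import Iterable, List


def load_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL from any iterable of lines (an open file works)."""
    return [json.loads(line) for line in lines if line.strip()]


def correlate(io_snaps: List[dict], sys_snaps: List[dict],
              max_skew_s: float = 2.5) -> List[dict]:
    """Pair each I/O snapshot with the nearest system snapshot in time."""
    sys_snaps = sorted(sys_snaps, key=lambda s: s["timestamp"])
    times = [s["timestamp"] for s in sys_snaps]
    rows = []
    for snap in io_snaps:
        t = snap["timestamp"]
        i = bisect.bisect_left(times, t)
        # The nearest neighbour is either just before or just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(times[k] - t))
        if abs(times[j] - t) > max_skew_s:
            continue  # no system snapshot close enough in time
        nearest = sys_snaps[j]
        rows.append({
            "timestamp": t,
            "io_requests": sum(h["total"] for h in snap["latency_histogram"]),
            "off_cpu_ns": sum(nearest["off_cpu_ns"].values()),
            "page_faults": sum(nearest["page_faults"].values()),
        })
    return rows


# Synthetic lines standing in for tracepoints.jsonl and system_metrics.jsonl.
io_lines = [
    json.dumps({"timestamp": ts, "latency_histogram": [
        {"device_name": "nvme0n1", "total": total, "buckets": []}]})
    for ts, total in [(100.0, 500), (105.0, 800), (110.0, 300)]
]
sys_lines = [
    json.dumps({"timestamp": 100.4,
                "off_cpu_ns": {"4242": 1000000, "17": 250000},
                "page_faults": {"4242": 12}}),
    json.dumps({"timestamp": 105.3,
                "off_cpu_ns": {"4242": 9000000},
                "page_faults": {"4242": 340, "17": 2}}),
    json.dumps({"timestamp": 130.0, "off_cpu_ns": {}, "page_faults": {}}),
]

rows = correlate(load_jsonl(io_lines), load_jsonl(sys_lines))
```

With real captures, `load_jsonl(open("tracepoints.jsonl"))` and `load_jsonl(open("system_metrics.jsonl"))` would replace the synthetic lists. The snapshot at 110.0 is dropped here because no system snapshot lies within the tolerance.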

Development

  • Tests: run pytest -q from the repo root or the package directory.
  • Examples: examples/basic_usage.py.
  • Contributions: PRs welcome. Keep changes focused and covered by tests.

Source Distribution

llm_instrumentation-1.1.0.tar.gz (60.7 kB)

Uploaded Source

Built Distribution

llm_instrumentation-1.1.0-py3-none-any.whl (24.6 kB)

Uploaded Python 3

File details

Details for the file llm_instrumentation-1.1.0.tar.gz.

File metadata

  • Download URL: llm_instrumentation-1.1.0.tar.gz
  • Size: 60.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_instrumentation-1.1.0.tar.gz
Algorithm Hash digest
SHA256 5ec298bf22fd0c3fca8af397bee5809354c37fa54b4c1f7d0606998f41ee239c
MD5 0ac65f7a40785f62bfb300289f7b68e9
BLAKE2b-256 22e96bf0d302cd4b3650554099c437e0519d2e15aae90fea5309dbfd345059cc

Provenance

The following attestation bundles were made for llm_instrumentation-1.1.0.tar.gz:

Publisher: python-publish.yml on rubenfb23/STRAP-LLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_instrumentation-1.1.0-py3-none-any.whl.

File hashes

Hashes for llm_instrumentation-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b7c4ce8baf7f2e3900ceeb56cd820880fc7ddb7f499973df64aaa9e47bc2f730
MD5 1fcf6f611d6966d95ec42bde415993a1
BLAKE2b-256 7cdde36e8505a98f6579fe4a21a386d739cc9f5f71ff3281e9ae375df301c6d8

Provenance

The following attestation bundles were made for llm_instrumentation-1.1.0-py3-none-any.whl:

Publisher: python-publish.yml on rubenfb23/STRAP-LLM
