Zero-config runtime observability for SGLang inference: attack-surface, supply-chain, anomaly, and PII detectors emitted as structured events.

sglang-lens

Zero-config runtime observability for SGLang inference, with opt-in interventions for teams that need to block, redact, or rate-limit.

Try it in 30 seconds

A checked-in synthetic CVE-2026-5760 canary lives at demo/malicious-model/. From a clean Python 3.10+ environment:

git clone https://github.com/glenfmessenger/sglang-lens
cd sglang-lens
pip install .
sglang-lens scan-model demo/malicious-model/

You'll get a supply_chain/jinja_ssti detection at severity critical and a non-zero exit code. Re-run against any real HuggingFace tokenizer directory and the same command exits cleanly. See demo/README.md for the full canary write-up.
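Because the scan signals a detection through a non-zero exit code, it can gate a CI job or a model-promotion script. The helper below is a minimal sketch of that idea; `scan_passes` is a hypothetical name, and the only behaviour it relies on is the exit-code contract described above.

```python
import subprocess
import sys
from typing import Sequence


def scan_passes(command: Sequence[str]) -> bool:
    """Run a scanner command and report whether it exited with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Forward the scanner's own output so the CI log shows the detection.
        sys.stderr.write(result.stdout + result.stderr)
    return result.returncode == 0


# Assumed usage, reusing the quick-start command:
#   if not scan_passes(["sglang-lens", "scan-model", "demo/malicious-model/"]):
#       sys.exit(1)
```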


What it does

sglang-lens is a Python observability layer for LLM-inference paths. The detectors and interventions are usable today in two shapes:

  • As an in-process layer — call Lens.decide_request(...) from your own request handler. This is the surface the unit tests cover.
  • As a sidecar HTTP proxy in front of any OpenAI-compatible inference server. A reference harness ships in examples/proxy_with_lens.py and is the only path that has been end-to-end-tested against a real SGLang serving stack.

Direct integration with the SGLang Model Gateway — as either a native IOChain plugin or a WASM Middleware module — is on the roadmap. src/sglang_lens/middleware.py implements the IOChain contract we believe the Gateway expects; src/sglang_lens/wasm_bridge.py is host-side scaffolding for a WASM module that hasn't been built yet. Neither has been validated against a real Gateway instance. See GitHub Issues for status.

There are two tiers:

  • Tier 1 (observability) is on by default. Detectors inspect every request and response, scan models on load, and emit structured events. They never modify the request or block it.
  • Tier 2 (interventions) is off by default. Each intervention has its own enabled: false flag. When enabled, an intervention may block a request, redact its body, or attach an X-Lens-Triggered header.

SGLANG_LENS=1 with no config gets you Tier 1 only. Tier 2 requires an explicit YAML opt-in per feature. Nothing is suppressed without you asking for it.


Why

The March 2026 SGLang RCEs. On 12 March 2026, two unauthenticated remote code execution CVEs were assigned against SGLang's ZMQ-based control plane — CVE-2026-3059 and CVE-2026-3060. Both stem from pickle deserialisation on the broker socket, reachable via the multimodal and disaggregation transports respectively. Both were exploitable against any cluster that enabled multimodal generation or encoder parallel disaggregation with the ZMQ broker reachable outside localhost — which is the default network binding when those features are on. Default text-only deployments are not affected. Operators who didn't know those ports were open found out from incident response. sglang-lens detects and logs the risky surface at startup; with Tier 2's network_safety in enforce mode it refuses to start at all.
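To make the "reachable outside localhost" condition concrete, here is a small sketch of a loopback check on a ZMQ-style endpoint string. It is an illustration, not the detector sglang-lens ships; `is_loopback_endpoint` is a hypothetical helper, and treating `ipc`/`inproc` transports as local is an assumption.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_endpoint(endpoint: str) -> bool:
    """Return True only when the endpoint cannot be reached from another host."""
    parts = urlsplit(endpoint)
    if parts.scheme in ("ipc", "inproc"):
        return True  # assumption: non-network transports count as local
    host = parts.hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Wildcards such as "*" or unresolved hostnames: treat as exposed.
        return False
```

A wildcard bind such as `tcp://0.0.0.0:30001` (the endpoint used in the sample events in this README) is reported as non-loopback, while `tcp://127.0.0.1:30001` is not.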

Supply-chain risk in model weights — CVE-2026-5760. GGUF and Safetensors files distribute Jinja2 chat templates alongside the tensor data. The canonical exploit is CVE-2026-5760 (April 2026): a malicious tokenizer.chat_template shipped with a model weight triggers Jinja2 SSTI when SGLang renders it at /v1/rerank, allowing arbitrary Python execution the moment the model is hit with a request. sglang-lens scans every model on load and emits a structured event for any pattern matching known-bad template signatures; the Tier 2 model-source allow-list refuses to load anything outside an explicit set of prefixes.

Compliance requirements that post-hoc log scraping can't satisfy. Regulated environments need an auditable record that PII was observed leaving the model, with correlation IDs that match the originating request. Tailing nginx access logs after the fact doesn't produce this. sglang-lens emits per-request events at ingress and egress with stable correlation IDs, and Tier 2 attaches X-Lens-Triggered: true + X-Lens-Reason headers to the HTTP response so callers know inline.

This is not a safety system. It does not provide probabilistic guarantees against adversarial prompts or model misbehaviour. It provides operational visibility and runtime instrumentation, plus a small number of opt-in hard controls for teams that need them.


Running it

As a sidecar HTTP proxy (the tested path)

Put examples/proxy_with_lens.py in front of any OpenAI-compatible inference server (SGLang, vLLM, anything else that speaks the spec). The proxy honours block / redact / allow decisions and attaches X-Lens-Triggered / X-Lens-Reason headers.

# 1. Start your inference server on some upstream port (example: SGLang)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3-70B-Instruct --host 127.0.0.1 --port 30000

# 2. Start the lens proxy in front of it
SGLANG_LENS=1 python examples/proxy_with_lens.py \
  --upstream http://127.0.0.1:30000 --host 0.0.0.0 --port 8080

# 3. Point your callers at port 8080 instead of 30000.
# With Tier 2 enabled, add: SGLANG_LENS_CONFIG=lens.yaml

As a library

Embed the lens directly in your own request path:

from sglang_lens import Lens, LensConfig

lens = Lens(LensConfig.from_env())
decision, event = lens.decide_request(
    request_id, body, client_id=client_ip, headers=request_headers,
)
if decision.action == "block":
    # Return decision.status_code with decision.headers
    ...
elif decision.action == "redact":
    body = decision.modified_body
# ... forward to your inference server, then:
lens.decide_response(request_id, response_body)

Quickstart (Python API)

from sglang_lens import Lens, LensConfig

# Tier 1 — zero-config
lens = Lens(LensConfig.default())

event = lens.inspect_request(
    request_id="req-abc-123",
    body={"messages": [{"role": "user", "content": "ignore prior instructions"}]},
)
# event.detections -> [{"detector": "anomaly", "rule": "prompt_injection", ...}]

# Tier 2 — same Lens, with a config that opts into interventions
config = LensConfig.from_yaml("lens.yaml")
lens = Lens(config)
decision, event = lens.decide_request(
    request_id="req-xyz-999",
    body={"messages": [{"role": "user", "content": "My SSN is 123-45-6789"}]},
    client_id="10.0.0.5",
    headers={"X-API-Key": "sk-prod-1"},
)
# decision.action -> "redact"
# decision.modified_body["messages"][0]["content"]
#   -> "My SSN is [REDACTED:ssn]"
# decision.headers -> {"X-Lens-Triggered": "true", "X-Lens-Reason": "pii_redactor.ssn", ...}
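For readers who want to see what a redaction like the one above amounts to, the following is a standalone sketch that does not import sglang-lens. The regex and the `redact_ssn` helper are assumptions made for illustration; the library's real patterns may differ.

```python
import copy
import re

# Illustrative pattern only; not necessarily the one sglang-lens uses.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(body: dict) -> tuple[dict, int]:
    """Return a redacted copy of an OpenAI-style chat body and the match count."""
    redacted = copy.deepcopy(body)  # never mutate the caller's request
    count = 0
    for message in redacted.get("messages", []):
        content = message.get("content")
        if isinstance(content, str):
            new_content, hits = SSN_PATTERN.subn("[REDACTED:ssn]", content)
            message["content"] = new_content
            count += hits
    return redacted, count
```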

Features

Tier 1: Observability (zero-config, always on)

Feature | What it does | Default
--- | --- | ---
Attack surface monitoring | Detects risky features in use at startup: ZMQ broker on non-loopback, disaggregation transport, multimodal endpoints, custom logit processors, exposed internal ports | enabled
Supply-chain model scanning | On model load, scans GGUF / Safetensors for malicious Jinja2 SSTI payloads, suspicious pickle ops, unsafe chat templates | enabled
Request anomaly detection | Flags prompt-injection signatures, base64/DAN-style jailbreaks, token-exhaustion patterns, concurrency spikes | enabled
PII / sensitive data detection | Real-time regex scan on ingress and egress: SSN, credit cards, emails, phone numbers, IP addresses, custom patterns | enabled
Structured security events | Every detection is a JSON event with correlation ID, timestamp, request hash, and reason | enabled
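The base64 jailbreak check in the anomaly detector can be pictured as "decode anything that looks like base64, then run the same signatures on the result". The sketch below follows that idea under stated assumptions: the single signature and the 16-character minimum token length are placeholders, not the rule set sglang-lens ships. The returned rule names mirror the ones that appear in the Prometheus labels.

```python
import base64
import binascii
import re
from typing import Iterator, Optional

# Placeholder signature list for illustration.
SIGNATURES = (re.compile(r"ignore (?:all )?(?:prior|previous) instructions", re.I),)
# Runs of base64 alphabet characters, optionally padded.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_candidates(text: str) -> Iterator[str]:
    """Yield UTF-8 strings hidden inside base64-looking tokens."""
    for token in B64_TOKEN.findall(text):
        try:
            yield base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded


def find_injection(text: str) -> Optional[str]:
    """Return a rule name when a signature matches in clear or decoded text."""
    if any(sig.search(text) for sig in SIGNATURES):
        return "prompt_injection"
    for decoded in decoded_candidates(text):
        if any(sig.search(decoded) for sig in SIGNATURES):
            return "jailbreak_base64"
    return None
```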

Tier 2: Interventions (off by default, opt-in per feature)

Feature | What it does | Default
--- | --- | ---
Multi-tenancy + auth | Per-tenant API keys, optional JWT (HS256/RS256), RBAC by model. Mode: block (4xx) or log. | disabled
Smart rate limiting | Token bucket keyed by tenant, client, and/or model, with prompt-size-aware cost. Mode: throttle (sends Retry-After) or block. | disabled
Hard PII redaction | Replaces matched PII in the request body with [REDACTED:<type>] before forwarding upstream. Mode: redact or block. | disabled
Prompt-injection blocking | Signature + base64 decode of canonical injection / jailbreak patterns. Mode: block (returns 400) or log. | disabled
Network / supply-chain safety | Hard kill-list of risky features + strict model-source allow-list. Mode: enforce (raises at boot) or log. | disabled
Circuit breaker + fail-closed | Auto-opens on upstream error rate; optionally opens preemptively when an attack is in progress. | disabled
Audit signaling | Also surfaces Tier 1 detections in the X-Lens-Triggered / X-Lens-Reason response headers. Tier 2 decisions set X-Lens-Triggered, X-Lens-Reason, and X-Lens-Action regardless of this flag. | disabled

Every Tier 2 block in the YAML carries its own enabled flag. Turning on one does not turn on any other. Run any new intervention in log / throttle mode against production traffic before flipping to block / enforce.
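As a rough picture of the "prompt-size-aware" token bucket behind the rate limiter, consider the sketch below. The parameter names follow the `rate_limit` keys in `lens.yaml`, but the cost formula (one token per request plus prompt length divided by `size_divisor`) is an assumption made for illustration, as are the class and the injectable `clock`.

```python
import time
from typing import Callable, Tuple


class TokenBucket:
    def __init__(self, capacity: float = 60.0, refill_per_second: float = 1.0,
                 size_divisor: int = 1000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.size_divisor = size_divisor
        self._clock = clock
        self._tokens = capacity
        self._updated_at = clock()

    def cost(self, prompt_chars: int) -> float:
        # Assumed formula: larger prompts drain the bucket faster.
        return 1.0 + prompt_chars / self.size_divisor

    def try_acquire(self, prompt_chars: int) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds)."""
        now = self._clock()
        elapsed = max(0.0, now - self._updated_at)
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        self._updated_at = now
        needed = self.cost(prompt_chars)
        if self._tokens >= needed:
            self._tokens -= needed
            return True, 0.0
        # Time until enough tokens have refilled; usable as a Retry-After value.
        return False, (needed - self._tokens) / self.refill_per_second
```

In throttle mode the second element of the tuple is the natural value for a `Retry-After` header; one bucket per tenant, client, or model key would be kept in a dictionary.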


When events fire

Every detector emits a JSON event when it matches. Events go to the configured destination (stderr by default) and to Prometheus counters. The shape is stable across detectors.

Sample events (illustrative — fields and structure are real, the specific correlation IDs / hashes / timestamps shown here are synthetic):

{"event": "request_inspected", "request_id": "req-abc-123", "correlation_id": "8f3a...", "request_hash": "sha256:9b1d...", "detections": [{"detector": "anomaly", "rule": "prompt_injection", "match": "ignore prior instructions", "severity": "medium"}], "timestamp": 1769420401.3}
{"event": "response_inspected", "request_id": "req-abc-123", "correlation_id": "8f3a...", "detections": [{"detector": "pii", "type": "ssn", "match_count": 1, "severity": "high"}], "timestamp": 1769420402.1}
{"event": "attack_surface_scan", "correlation_id": "boot-1769420400", "detections": [{"detector": "attack_surface", "rule": "zmq_non_loopback", "endpoint": "tcp://0.0.0.0:30001", "severity": "high"}], "timestamp": 1769420400.0}
{"event": "model_scan", "correlation_id": "load-1769420400", "model_path": "/models/llama-3-70b", "detections": [{"detector": "supply_chain", "rule": "jinja_ssti", "file": "tokenizer_config.json", "severity": "critical"}], "timestamp": 1769420400.2}

correlation_id is stable across the request / response pair so the two events can be joined. request_hash is a SHA-256 of the raw request body — useful for deduping retries and for matching against external audit logs without keeping the prompt itself.
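Both ideas are easy to reproduce outside the lens when reconciling against an external audit log. The sketch below computes a hash in the same `sha256:<hex>` form shown in the sample events and groups JSON log lines by `correlation_id`; the function names are hypothetical.

```python
import hashlib
import json
from collections import defaultdict
from typing import Dict, Iterable


def request_hash(raw_body: bytes) -> str:
    """SHA-256 of the raw request bytes, prefixed as in the sample events."""
    return "sha256:" + hashlib.sha256(raw_body).hexdigest()


def join_by_correlation_id(log_lines: Iterable[str]) -> Dict[str, Dict[str, dict]]:
    """Group JSON events so each correlation_id maps event name -> event."""
    joined: Dict[str, Dict[str, dict]] = defaultdict(dict)
    for line in log_lines:
        event = json.loads(line)
        joined[event["correlation_id"]][event["event"]] = event
    return dict(joined)
```

Hashing the raw bytes rather than a re-serialised body matters: two JSON documents that differ only in key order or whitespace hash differently.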

Inline signaling — Tier 2

When a Tier 2 intervention fires, the lens also signals to the caller inline:

Action | HTTP status | Headers set on the response
--- | --- | ---
allow (Tier 1 detection only; requires tier2.audit_signaling.enabled: true) | 200 (upstream) | X-Lens-Triggered: true, X-Lens-Reason: <detector>.<rule>,...
redact (PII redactor) | 200 (upstream, but body was scrubbed before reaching the model) | X-Lens-Triggered: true, X-Lens-Action: redact, X-Lens-Reason: pii_redactor.<type>
block (auth, rate limit, injection, circuit) | 400 / 401 / 403 / 429 / 503 | X-Lens-Triggered: true, X-Lens-Action: block, X-Lens-Reason: <rule>, Retry-After: <s> (for rate limit)

The response body for a blocked request uses an OpenAI-style error envelope, with the details nested under a top-level error object:

{"error": {"type": "lens_blocked", "reason": "rate_limited", "triggered_by": ["rate_limit.bucket_empty"]}}

Tier 1 detections in audit headers

Tier 2 decisions always set the audit headers. Tier 1 detections (PII, prompt-injection, etc., with no Tier 2 intervention firing) only set the headers when tier2.audit_signaling.enabled: true is in the YAML. With the flag on, an ingress detection like anomaly.prompt_injection or an egress detection like pii.ssn shows up in X-Lens-Triggered: true + X-Lens-Reason: <detector>.<rule>,... even on otherwise-allowed responses. With the flag off (default), Tier 1 detections are still recorded in the JSON log and Prometheus counters — they just don't leak into HTTP headers.

The header names themselves are independently configurable — audit_signaling.triggered_header / reason_header / action_header override the X-Lens-* defaults whether or not enabled is set.

Limitations

  • The Tier 2 PII redactor only rewrites the request body; it does not redact streaming response chunks. Tier 1 egress detection scans streaming responses per chunk, so a PII pattern that straddles a chunk boundary can be missed.
  • Auth in block mode does not implement OAuth2 token introspection. JWT validation is local (HS256 by default), API keys are looked up against the static YAML list.
  • The GGUF / safetensors supply-chain scanner is byte-level, not library-backed. It catches the canonical CVE-2026-5760 shape (chat_template in tokenizer_config.json, in safetensors __metadata__, or in the first ~1MB of a GGUF file) but can miss templates that sit beyond that prefix window or that are split across structured key/value boundaries the regex cannot span. Using the official gguf and safetensors libraries is on the roadmap.
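To illustrate the kind of check the last bullet describes, here is a simplified, stdlib-only sketch that looks at `chat_template` in `tokenizer_config.json` and in the safetensors `__metadata__` header. It relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header). The regex, the size cap, and all function names are assumptions for illustration; they are not the signatures sglang-lens ships, and the GGUF path is omitted.

```python
import json
import re
import struct
from pathlib import Path
from typing import Dict, List

# Illustrative signatures only: dunder traversal typical of Jinja2 SSTI payloads.
SUSPICIOUS = re.compile(
    r"__(?:class|globals|subclasses|builtins|import)__|os\.popen|subprocess")
MAX_HEADER_BYTES = 100 * 1024 * 1024  # hypothetical sanity cap


def template_looks_unsafe(template: object) -> bool:
    return isinstance(template, str) and bool(SUSPICIOUS.search(template))


def safetensors_metadata(path: Path) -> dict:
    """Read only the JSON header of a safetensors file, never the tensors."""
    with open(path, "rb") as fh:
        prefix = fh.read(8)
        if len(prefix) < 8:
            return {}
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > MAX_HEADER_BYTES:
            return {}
        header = json.loads(fh.read(header_len))
    return header.get("__metadata__") or {}


def scan_model_dir(model_dir: str) -> List[Dict[str, str]]:
    findings: List[Dict[str, str]] = []
    root = Path(model_dir)
    config = root / "tokenizer_config.json"
    if config.is_file():
        data = json.loads(config.read_text(encoding="utf-8"))
        if template_looks_unsafe(data.get("chat_template")):
            findings.append({"rule": "jinja_ssti", "file": config.name})
    for weights in sorted(root.glob("*.safetensors")):
        if template_looks_unsafe(safetensors_metadata(weights).get("chat_template")):
            findings.append({"rule": "jinja_ssti", "file": weights.name})
    return findings
```

A signature list like this is easy to evade, which is the same caveat the bullet above makes about the real scanner: treat a clean result as "nothing known-bad found", not as proof of safety.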

Configuration

YAML config

Tier 1 stays at its defaults if you don't override. Tier 2 stays off if you don't override. The example below shows the shape of every block; see lens.yaml in the repo for the fully-commented version.

# lens.yaml

# Tier 1 — observability (defaults shown)
attack_surface:  { enabled: true }
supply_chain:    { enabled: true, scan_on_load: true }
anomaly:         { enabled: true }
pii:             { enabled: true, scan_ingress: true, scan_egress: true }
prometheus:      { enabled: true, port: 9092 }
logging:         { enabled: true, destination: stderr, format: json }
alerts:          { enabled: false, slack_webhook: "" }

# Tier 2 — interventions (every block defaults to disabled)
tier2:
  auth:
    enabled: false
    mode: block                       # block | log
    tenants:
      - name: prod
        api_keys: ["sk-prod-1"]
        roles: ["prod"]
    model_required_roles:
      "meta-llama/Llama-3-70B": ["prod"]

  rate_limit:
    enabled: false
    mode: throttle                    # throttle | block
    capacity: 60
    refill_per_second: 1.0
    size_divisor: 1000
    key_by_tenant: true
    key_by_client: true

  pii_redaction:
    enabled: false
    mode: redact                      # redact | block
    patterns:
      - type: ssn
      - type: credit_card
      - type: email

  injection_block:
    enabled: false
    mode: block                       # block | log
    check_base64: true

  network_safety:
    enabled: false
    mode: log                         # log | enforce
    kill_list:
      - block_zmq_non_loopback
      - block_custom_logit_processor
      - block_disaggregation
    allowed_model_prefixes:
      - /models/approved/

  circuit_breaker:
    enabled: false
    window_seconds: 30
    min_samples: 20
    error_rate_threshold: 0.5
    cooldown_seconds: 30
    fail_closed_on_attack: false

  audit_signaling:
    enabled: false
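The `circuit_breaker` keys above map onto a conventional sliding-window breaker. The sketch below shows one plausible reading of them; the comparison with `error_rate_threshold` (greater than or equal), the half-open behaviour, and the class itself are assumptions, not the library's implementation. State names follow the `sglang_lens_circuit_state` metric (closed, half_open, open).

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class CircuitBreaker:
    def __init__(self, window_seconds: float = 30.0, min_samples: int = 20,
                 error_rate_threshold: float = 0.5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.min_samples = min_samples
        self.error_rate_threshold = error_rate_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._samples: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)
        self._opened_at: Optional[float] = None

    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half_open"  # cooldown elapsed: let a probe request through
        return "open"

    def allow_request(self) -> bool:
        return self.state() != "open"

    def record(self, is_error: bool) -> None:
        now = self._clock()
        if self.state() == "half_open":
            # The probe decides: failure re-opens, success closes and resets.
            if is_error:
                self._opened_at = now
            else:
                self._opened_at = None
                self._samples.clear()
            return
        self._samples.append((now, is_error))
        while self._samples and now - self._samples[0][0] > self.window_seconds:
            self._samples.popleft()  # drop samples older than the window
        if self._opened_at is None and len(self._samples) >= self.min_samples:
            errors = sum(1 for _, failed in self._samples if failed)
            if errors / len(self._samples) >= self.error_rate_threshold:
                self._opened_at = now
```

`min_samples` keeps a handful of early failures from tripping the breaker on a quiet service.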

Inline config

from sglang_lens.config import (
    LensConfig, Tier2Config,
    AuthConfig, TenantConfig,
    RateLimitConfig,
    PIIRedactionConfig, PIIPattern,
)

config = LensConfig(
    tier2=Tier2Config(
        auth=AuthConfig(
            enabled=True,
            tenants=[TenantConfig(name="prod", api_keys=["sk-prod-1"], roles=["prod"])],
            model_required_roles={"meta-llama/Llama-3-70B": ["prod"]},
        ),
        rate_limit=RateLimitConfig(enabled=True, capacity=120, refill_per_second=2.0),
        pii_redaction=PIIRedactionConfig(
            enabled=True,
            mode="redact",
            patterns=[PIIPattern(type="ssn"), PIIPattern(type="email")],
        ),
    ),
)

PII patterns

Built-in patterns for common PII types. The same set is used by the Tier 1 detector and the Tier 2 redactor.

Type | Example match
--- | ---
ssn | 123-45-6789
credit_card | 4111 1111 1111 1111 (Luhn-validated)
phone_us | (555) 867-5309
phone_intl | +44 7911 123456
email | user@example.com
ip_address | 192.168.1.1
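"Luhn-validated" means a candidate digit run only counts as a credit card if it passes the Luhn checksum, which filters out most random 16-digit numbers. A minimal version of that checksum is sketched below; the 13 to 19 digit length bound is an assumption about typical card lengths, and `luhn_valid` is a hypothetical helper rather than the library's function.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True when the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # assumed plausible card-number lengths
        return False
    total = 0
    # Walk from the rightmost digit; double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```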

Limitations: detection is regex-based and runs on the decoded request/response body, not on the token stream. Streaming responses are inspected per chunk, so a pattern that straddles a chunk boundary can be missed.
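The chunk-boundary limitation is easy to reproduce. The sketch below contrasts a naive per-chunk scan with a variant that carries a short tail of the previous chunk forward. The overlap approach is shown only to illustrate the problem and one possible mitigation; it is not something sglang-lens is stated to implement, and the SSN regex is a placeholder.

```python
import re
from typing import Iterable

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # placeholder pattern
MAX_MATCH_LEN = 11  # length of "123-45-6789"


def count_per_chunk(chunks: Iterable[str]) -> int:
    """Scan each chunk in isolation, as a per-chunk detector would."""
    return sum(len(SSN.findall(chunk)) for chunk in chunks)


def count_with_overlap(chunks: Iterable[str]) -> int:
    """Scan each chunk together with the tail of the text before it."""
    tail = ""
    total = 0
    for chunk in chunks:
        window = tail + chunk
        for match in SSN.finditer(window):
            # Matches that end inside the carried tail were counted already.
            if match.end() > len(tail):
                total += 1
        tail = window[-MAX_MATCH_LEN:]
    return total
```

With the stream split as `"My SSN is 123-4"` followed by `"5-6789, thanks"`, the per-chunk scan finds nothing while the overlapping scan finds the one SSN.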


Observability

Prometheus metrics

Scrape at http://localhost:9092/metrics.

Tier 1:

sglang_lens_attack_surface_detections_total{rule="zmq_non_loopback|..."}
sglang_lens_supply_chain_detections_total{rule="jinja_ssti|pickle_reduce|..."}
sglang_lens_anomaly_detections_total{rule="prompt_injection|jailbreak_base64|..."}
sglang_lens_pii_detections_total{type="ssn|email|...",direction="ingress|egress"}
sglang_lens_requests_inspected_total
sglang_lens_responses_inspected_total
sglang_lens_inspection_duration_seconds{stage="ingress|egress"}

Tier 2 (stays at zero unless an intervention is enabled):

sglang_lens_tier2_blocked_total{reason="rate_limited|pii_in_request|prompt_injection|circuit_open|..."}
sglang_lens_tier2_redacted_total{reason="pii_redactor|..."}
sglang_lens_auth_result_total{outcome="granted|missing_or_invalid_credential|rbac_denied"}
sglang_lens_rate_limit_total{outcome="bucket_empty",key="t=prod|c=10.0.0.5|m=..."}
sglang_lens_circuit_state                     # 0=closed, 1=half_open, 2=open
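For quick checks without a Prometheus server, the text exposition format served at `/metrics` can be read directly. The sketch below totals one counter by one label using only the standard library; it is a deliberately simplified parser (it does not handle `}` inside label values), and `sum_by_label` is a hypothetical helper.

```python
import re
from collections import Counter

SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)")
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(exposition: str, metric: str, label: str) -> Counter:
    """Total the samples of `metric`, grouped by the value of `label`."""
    totals: Counter = Counter()
    for line in exposition.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return totals
```

Feeding it the body of `http://localhost:9092/metrics` (fetched with `urllib.request.urlopen`, for instance) and asking for `sglang_lens_tier2_blocked_total` by `reason` gives a per-reason block count.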

Multiprocess gateway: if the gateway forks workers, set PROMETHEUS_MULTIPROC_DIR before starting so metrics from all workers are merged:

mkdir -p /tmp/prometheus_multiproc
export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc

OpenTelemetry (planned for v0.3.0)

The OTel exporter scaffold ships in src/sglang_lens/otel.py and the [otel] extra installs the SDK, but the exporter is not yet wired into the orchestrator — events still only reach Prometheus and the JSON log. Setting otel.enabled: true in lens.yaml is a no-op today.

The planned shape (subject to change once integration testing happens):

otel:
  enabled: true
  endpoint: http://localhost:4318
  service_name: sglang-lens
  export_traces: true
  export_metrics: true

Each inspected request will become a span with detector results as span events; correlation IDs will propagate as the span's trace_id. Until the wiring lands, use the Prometheus exporter and JSON event log together — they cover the same information modulo distributed-tracing context propagation.

Slack / webhook alerts

alerts:
  enabled: true
  slack_webhook: https://hooks.slack.com/services/...
  cooldown_seconds: 300
  alert_on:
    - supply_chain
    - attack_surface
    - anomaly

When alert_on is omitted, alerts fire for supply_chain and attack_surface only; the example above adds anomaly explicitly. PII detections are intentionally excluded from the default because they fire often and create noise: log them, dashboard them, but don't page on them.


Performance

Overhead numbers aren't yet published. A reproducible benchmark suite (real SGLang serving traffic, with and without the lens attached, for both Tier 1 and per-Tier-2-feature) is the next planned piece of work after the v0.2.0 launch — tracked in the project's GitHub Issues.

Qualitatively: Tier 1 inspection runs synchronously in the request path and is dominated by regex execution on the response body. Disabling PII egress scanning (pii.scan_egress: false) is the biggest single lever. Tier 2 adds measurable cost on top — most of it from pii_redaction, which walks the message list and re-serialises. auth, rate_limit, and circuit_breaker are essentially free.

We'll update this section with real numbers once the benchmark suite lands.


CLI

sglang-lens validate lens.yaml             # validate config before deploying
sglang-lens scan-model /path/to/model      # one-shot supply-chain scan, no gateway needed
sglang-lens check                          # check that the lens is loaded and metrics are up
sglang-lens version

scan-model is the most useful entry point during model evaluation: point it at a freshly downloaded checkpoint and get a structured event for anything suspicious before you load the model into a server.


Requirements

  • Python ≥ 3.10
  • For the proxy harness (examples/proxy_with_lens.py): fastapi, uvicorn
  • Optional: pyjwt if tier2.auth.jwt_enabled: true
  • Optional: opentelemetry-* packages via pip install "sglang-lens[otel]" (the scaffold ships but is not yet wired — see the OpenTelemetry section)

No SGLang version pin: the lens runs as a proxy or library and doesn't import SGLang. Compatibility with a future native SGLang Model Gateway plugin path will be stated explicitly when that integration lands.

Maintenance and compatibility

This is a v0.2.0 release. The detectors, Tier 2 interventions, configuration schema, Prometheus exporter, and CLI are stable. The sidecar proxy harness has been end-to-end-tested against SGLang serving a Qwen2.5-0.5B-Instruct model on an A100.

Direct SGLang Model Gateway integration (WASM Middleware or native IOChain) is tracked in GitHub Issues. The architecture is set up so that swapping out the proxy harness for a Gateway-side plugin replaces only the request/response plumbing — detectors, decision composition, metrics, and events stay the same.

If you find a bug, open an issue with a reproducer. If you want a feature, file an issue first so we can talk about scope before you write code.


Development

git clone https://github.com/glenfmessenger/sglang-lens
cd sglang-lens
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest tests/ -v
ruff check src/
mypy src/sglang_lens/

96 tests covering every detector rule and every Tier 2 intervention:

  • Tier 1 detectors (tests/test_anomaly.py, test_pii.py, test_attack_surface.py, test_supply_chain.py) — every rule has at least one positive test, most have explicit negative controls. The supply-chain suite exercises HF tokenizer configs, safetensors __metadata__, and the GGUF prefix path with fixture files generated at test time.
  • Lens orchestrator (tests/test_lens.py, test_config.py) — correlation IDs, request hashing, YAML roundtrip, defaults invariant.
  • Tier 2 interventions (tests/interventions/) — every intervention covers both block/log modes plus the disabled-passthrough case. PII redactor verifies multi-pattern messages.
  • Decision composition (tests/test_decide.py) — the orchestration path through Lens.decide_request: short-circuit on block, header merging, audit-headers-absent-when-nothing-fires, and the fail-closed-on-attack circuit-breaker path.

License

Apache 2.0

