
Compress logs for LLM analysis (Rust-powered)

logzip (Rust)

Python 3.9+ | License: MIT | Rust

Compress logs before sending them to an LLM. Powered by Rust and PyO3.

raw log → [logzip compress] → compressed text → LLM (Claude Code / Cursor / API)

Before / After

Raw Log (Uvicorn):

INFO: 127.0.0.1:45678 - "GET /api/v1/status HTTP/1.1" 200 OK
INFO: 127.0.0.1:45679 - "GET /api/v1/status HTTP/1.1" 200 OK
... (100 similar lines) ...

logzip output:

--- PREFIX ---
INFO: 127.0.0.1:
--- LEGEND ---
#0# = - "GET /api/v1/status HTTP/1.1" 200 OK
--- BODY ---
45678 #0#
45679 #0#
...

Typical savings: 52–58% on structured logs (systemd, uvicorn, docker).
Anomalies and unique lines stay uncompressed — visible at a glance in the BODY.
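To make the layout above concrete, here is a toy Python sketch that reproduces the PREFIX / LEGEND / BODY shape for the two Uvicorn lines. It is not logzip's algorithm (the real pipeline lives in the Rust core); `toy_compress` and its single-entry legend are illustrative assumptions.

```python
import os
from collections import Counter


def toy_compress(lines):
    """Toy illustration of the PREFIX / LEGEND / BODY layout (not logzip's real algorithm)."""
    # Longest prefix shared by every line, trimmed back to a delimiter
    # so that a port like 45678 is not split in the middle.
    prefix = os.path.commonprefix(lines)
    while prefix and prefix[-1].isalnum():
        prefix = prefix[:-1]
    rest = [line[len(prefix):] for line in lines]

    # Treat everything after the first space as a candidate repeated suffix.
    suffixes = Counter(r.partition(" ")[2] for r in rest)
    suffix, count = suffixes.most_common(1)[0]

    legend = {}
    if count > 1 and suffix:
        legend["#0#"] = suffix
        rest = [r[: -len(suffix)] + "#0#" if r.endswith(suffix) else r for r in rest]

    out = ["--- PREFIX ---", prefix, "--- LEGEND ---"]
    out += [f"{tag} = {text}" for tag, text in legend.items()]
    out += ["--- BODY ---", *rest]
    return "\n".join(out)


sample = [
    'INFO: 127.0.0.1:45678 - "GET /api/v1/status HTTP/1.1" 200 OK',
    'INFO: 127.0.0.1:45679 - "GET /api/v1/status HTTP/1.1" 200 OK',
]
print(toy_compress(sample))
```

Lines that do not end with the repeated suffix pass through unchanged, which is the same property that keeps anomalies visible in the real output.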

Why use logzip? (RAG & LLM)

When working with logs in LLMs (Claude, GPT, RAG systems), you face two problems:

  1. Context Limit: Logs are huge. A 10MB log is ~2.5M tokens.
  2. Noise: 90% of the log consists of repeating INFO and identical requests that drown out the real error.

logzip is well suited to RAG pipelines: it compresses the context before it reaches the model, which cuts token costs and can improve answer accuracy because anomalies stand out.


Performance (7.96 MB Log, ~2M tokens)

Benchmarked on a real 7.96 MB production log.

logzip modes

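Mode                | CLI                                 | Time (ms) | Size (KB) | Saved (%) | Output type
--------------------|-------------------------------------|-----------|-----------|-----------|------------
fast                | --quality fast                      | ~200      | ~4,900    | ~40%      | text/LLM
balanced            | --quality balanced                  | 404       | 3,928     | 52%       | text/LLM
balanced + 2 passes | --quality balanced --bpe-passes 2   | 418       | 3,404     | 58%       | text/LLM
max                 | --quality max                       | 507       | 3,511     | 57%       | text/LLM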

Recommended: balanced + 2 passes. The second compression pass finds repeated token sequences in the already-compressed text, adding 14 ms and about 6 percentage points of savings over balanced (52% → 58%).

--quality max uses a larger legend (512 vs 128 entries) which adds overhead without a second pass benefit. Use --bpe-passes 2 with balanced instead.
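The "Saved (%)" column can be re-derived from the sizes. The sketch below assumes the raw size of 8,151 KB quoted in the Economic Impact box; `saved_percent` is a hypothetical helper, not part of logzip.

```python
RAW_KB = 8151  # raw size quoted in the Economic Impact box


def saved_percent(compressed_kb, raw_kb=RAW_KB):
    """Share of the raw size removed by compression, in percent."""
    return 100.0 * (1.0 - compressed_kb / raw_kb)


sizes_kb = {"balanced": 3928, "balanced + 2 passes": 3404, "max": 3511}
for mode, kb in sizes_kb.items():
    print(f"{mode}: {saved_percent(kb):.1f}% saved")

# Extra savings from the second pass, in percentage points.
gain = saved_percent(3404) - saved_percent(3928)
print(f"second pass gain: {gain:.1f} points")
```

On these figures the second pass adds roughly 6 percentage points for 14 ms of extra time, which is why it beats --quality max on both size and speed.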

vs. binary compressors (for context)

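Tool                 | Time (ms) | Size (KB) | Saved (%) | LLM-readable?
---------------------|-----------|-----------|-----------|--------------
lz4                  | 6         | 1,280     | 84%       | No
zstd (lvl 3)         | 14        | 819       | 90%       | No
zlib (lvl 6)         | 69        | 840       | 90%       | No
logzip (recommended) | 418       | 3,404     | 58%       | Yes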

Binary compressors produce opaque binary output that LLMs cannot read. logzip gives up roughly 30 percentage points of savings (58% vs. about 90%) in exchange for output that both humans and LLMs can read.

Token estimation: 1 token ≈ 4 characters (rough estimate for English-like logs).
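As a quick sanity check of that rule of thumb, here is a minimal sketch. It assumes 1 MB = 1,000,000 characters (which matches the ~1.99M figure used for the 7.96 MB log); `estimate_tokens` is a hypothetical helper, and real tokenizers will give different counts.

```python
CHARS_PER_TOKEN = 4  # rough ratio stated above; real tokenizers vary


def estimate_tokens(num_chars):
    """Rough token count for English-like log text."""
    return num_chars // CHARS_PER_TOKEN


print(estimate_tokens(10_000_000))  # 10 MB log  -> 2500000
print(estimate_tokens(7_960_000))   # 7.96 MB log -> 1990000
```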

Economic Impact

┌──────────────────────────────────────────────────────────┐
│  logzip Savings (7.96 MB Production Log)                 │
├──────────────────────────────────────────────────────────┤
│  Raw Size:        8,151 KB  (~1,990,000 tokens)          │
│  After balanced:  3,928 KB  (~959,000 tokens,  -52%)     │
│  After 2 passes:  3,404 KB  (~831,000 tokens,  -58%)     │
├──────────────────────────────────────────────────────────┤
│  Cost Before:     $5.97                                  │
│  Cost After:      $2.49      (Claude 3.5 Sonnet Input)   │
│  LLM Efficiency:  2.4x larger context for the same price │
└──────────────────────────────────────────────────────────┘
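The dollar figures in the box follow from a flat per-token price. The sketch below assumes $3.00 per million input tokens, a value inferred from $5.97 / 1.99M tokens rather than stated in the source; check current pricing before relying on it.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD; assumed, inferred from the box above


def input_cost(tokens, price=PRICE_PER_MILLION_INPUT_TOKENS):
    """Input cost in USD for a given token count."""
    return tokens / 1_000_000 * price


before = input_cost(1_990_000)  # raw log
after = input_cost(831_000)     # balanced + 2 passes
print(f"before ${before:.2f}, after ${after:.2f}, ratio {before / after:.1f}x")
# before $5.97, after $2.49, ratio 2.4x
```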

Install

Python API + logzip-py CLI:

pip install logzip

Rust CLI + MCP Server:

cargo install logzip

CLI

Two CLIs are available. Both provide compress and decompress subcommands with identical flags.

Rust binary (cargo install logzip, installs the logzip command):

# stdin → stdout
logzip compress < app.log

# quality preset (fast|balanced|max)
logzip compress --quality balanced < app.log

# recommended: balanced + second pass
logzip compress --quality balanced --bpe-passes 2 < app.log

# with preamble (LLM decode instructions at the top)
logzip compress --preamble < app.log > compressed.txt

# save + show stats
logzip compress --stats -i app.log -o app.logzip

# lossless timestamps: keep full sub-second precision (default trims to milliseconds)
logzip compress --exact-timestamps -i app.log -o app.logzip

# decompress
logzip decompress -i app.logzip

Python CLI (pip install logzip, installs the logzip-py command):

# same flags as above, plus:

# explicit profile (otherwise auto-detected)
logzip-py compress --profile journalctl < /tmp/syslog.txt
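If you drive the CLI from a script, one option is a thin subprocess wrapper. The sketch below uses only flags shown above and assumes they can be combined freely; `build_compress_cmd` and `compress_via_cli` are hypothetical helper names, and the binary must be on PATH.

```python
import shutil
import subprocess


def build_compress_cmd(quality="balanced", bpe_passes=None, preamble=False, binary="logzip"):
    """Assemble a `logzip compress` command line from flags shown above."""
    cmd = [binary, "compress", "--quality", quality]
    if bpe_passes is not None:
        cmd += ["--bpe-passes", str(bpe_passes)]
    if preamble:
        cmd.append("--preamble")
    return cmd


def compress_via_cli(raw_log, **options):
    """Pipe text through the CLI (stdin -> stdout); requires the binary on PATH."""
    cmd = build_compress_cmd(**options)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    done = subprocess.run(cmd, input=raw_log, capture_output=True, text=True, check=True)
    return done.stdout
```

Passing `binary="logzip-py"` would target the Python CLI instead, since both expose the same compress subcommand.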

Python API

from logzip import compress, decompress

# compress
result = compress(raw_log_text)
print(result.render(with_preamble=True))   # → for LLM
print(result.stats_str())                  # → for logs

# fine-grained control
result = compress(
    raw_log_text,
    max_legend_entries=128,   # legend size
    bpe_passes=2,             # second-pass compression (compresses repeated token sequences)
    do_normalize=True,        # collapse timestamps, ANSI, IPs
    do_templates=True,        # structural template extraction
    exact_timestamps=False,   # True → keep full sub-second timestamp precision
)

# decompress
original = decompress(result.render())
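One way to wire the API into a prompt-building step is sketched below. The `compress_fn` parameter stands in for `logzip.compress` so the helper can be exercised without the package installed; `build_prompt` and the fallback rule are hypothetical, not part of logzip.

```python
def build_prompt(raw_log, question, compress_fn):
    """Return an LLM prompt that embeds the compressed log, falling back to raw text.

    compress_fn is expected to behave like logzip.compress: it returns an object
    with a render(with_preamble=True) method, as in the example above.
    """
    rendered = compress_fn(raw_log).render(with_preamble=True)
    # Very short or non-repetitive logs may not shrink; keep whichever is smaller.
    payload = rendered if len(rendered) < len(raw_log) else raw_log
    return f"{question}\n\n{payload}"
```

With the package installed, the call would look like `build_prompt(text, "Any errors in this log?", compress)` after `from logzip import compress`.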

MCP Server (Local — Claude Desktop / Claude Code)

Requires the Rust binary (cargo install logzip, see Install).

Add to your claude_desktop_config.json:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "logzip": {
      "command": "logzip",
      "args": ["mcp", "--allow-dir", "/var/log", "--allow-dir", "/home/user/logs"]
    }
  }
}
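If you manage several log directories, the entry can be generated rather than typed by hand. This is a minimal sketch; `logzip_mcp_entry` is a hypothetical helper, and merging the result into an existing config file is left out.

```python
import json


def logzip_mcp_entry(allowed_dirs):
    """Build the "logzip" server entry with one --allow-dir flag per directory."""
    args = ["mcp"]
    for directory in allowed_dirs:
        args += ["--allow-dir", directory]
    return {"mcpServers": {"logzip": {"command": "logzip", "args": args}}}


print(json.dumps(logzip_mcp_entry(["/var/log", "/home/user/logs"]), indent=2))
```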

Or add via Claude Code CLI:

claude mcp add logzip -- logzip mcp --allow-dir /var/log

Available tools

Tool                                | Description
------------------------------------|------------
compress_content(content, quality)  | Compress log text pasted directly into the conversation
get_stats(path)                     | File size, token estimate, detected profile; call first to decide strategy
compress_file(path, quality)        | Compress the entire file (for files under 200K tokens)
compress_tail(path, lines, quality) | Compress the last N lines (efficient for large files)
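The intended flow (stats first, then whole file or tail) can be written as a small decision rule. This is a sketch of the strategy described in the table, not code from the server; `choose_tool` and the `tail_lines` default are hypothetical.

```python
FILE_TOKEN_LIMIT = 200_000  # threshold from the compress_file row above


def choose_tool(token_estimate, tail_lines=2000):
    """Pick an MCP tool name and arguments from a get_stats token estimate.

    tail_lines is an arbitrary illustrative default, not a logzip setting.
    """
    if token_estimate < FILE_TOKEN_LIMIT:
        return "compress_file", {}
    return "compress_tail", {"lines": tail_lines}


print(choose_tool(50_000))     # small file: compress it whole
print(choose_tool(2_000_000))  # large file: compress only the tail
```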

Available prompts

Prompt       | Description
-------------|------------
analyze_logs | Compresses the log server-side and prepares an SRE analysis context

How to use

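LLMs don't automatically pick up MCP tools; by default you need to reference them explicitly (Options A and B), or you can install the skill so that Claude reaches for them on its own (Option C):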

Option A — explicit ask (works everywhere):

"Use logzip to analyze /var/log/syslog"

Option B — analyze_logs prompt (Claude Code):

/mcp → logzip → analyze_logs → path: /var/log/syslog

This compresses the log server-side and drops an SRE-ready context into the conversation.

Option C — install the log-analysis skill (Claude Code, recommended):

The skill makes Claude automatically reach for logzip whenever you mention a log file — no explicit instruction needed.

# 1. Register this repo as a single-plugin marketplace
#    (reads .claude-plugin/marketplace.json from the repo root)
claude plugin marketplace add NailShakurov/logzip

# 2. Install the logzip plugin from that marketplace
claude plugin install logzip@logzip

The install target is <plugin-name>@<marketplace-name> — both are logzip here (the marketplace name comes from the name field in marketplace.json, not the repo path). To pull later updates, run claude plugin marketplace update logzip.

After that, asking "what's in /var/log/syslog?" is enough — Claude calls get_stats and compress_tail on its own.

Security

The MCP server only reads files inside directories specified via --allow-dir.
If no --allow-dir is given, defaults to the current working directory.
All paths are canonicalized before comparison to prevent path traversal attacks.
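The canonicalize-then-compare check can be sketched in Python as follows. This illustrates the general technique only; the actual server is written in Rust, and `is_allowed` is a hypothetical name.

```python
from pathlib import Path


def is_allowed(requested, allowed_dirs):
    """True if the canonical form of `requested` lies inside an allowed directory."""
    # resolve() collapses ".." and follows symlinks, so traversal tricks are
    # neutralized before the containment check.
    real = Path(requested).resolve()
    for directory in allowed_dirs:
        root = Path(directory).resolve()
        if real == root or root in real.parents:
            return True
    return False
```

Comparing resolved path components, rather than raw string prefixes, also rejects sibling directories such as /var/logs-other when only /var/log is allowed.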


Through the eyes of an LLM

Unlike gzip/zstd which produce binary noise, logzip produces structured text. The model reads the legend once and works with the compressed body directly — it doesn't need to expand every token to understand the log.

Input for LLM:

This is a compressed log. Rules: #0# is replaced by GET /api/v1/status.

--- BODY ---
12:00:01 #0# 200 OK
12:00:02 #0# 500 ERR   <-- Boom, anomaly!

The model instantly spots the 500 error without wading through thousands of identical successful requests.
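The legend rule is a plain lookup, which is why a model can apply it mentally. The toy decoder below shows that lookup for the `#n#` tags in the example; it is an illustration only, and the supported way to restore a log is `logzip decompress` or `decompress()`.

```python
import re


def expand(body, legend):
    """Replace each #n# tag in the body with its legend entry (toy decoder)."""
    return re.sub(r"#(\d+)#", lambda m: legend.get(m.group(1), m.group(0)), body)


legend = {"0": "GET /api/v1/status"}
print(expand("12:00:02 #0# 500 ERR", legend))  # 12:00:02 GET /api/v1/status 500 ERR
```

Tags without a legend entry are left untouched rather than dropped.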

Architecture & Safety

flowchart TD
    A([raw log text]) --> B

    subgraph pipe["Compression Pipeline"]
        B["① Profile Detection\njournalctl · docker · uvicorn · nodejs · plain"]
        C["② Normalizer\nANSI · timestamps · hex zeros · common prefix"]
        D["③ Frequency Analysis\nparallel n-gram counting — rayon"]
        E["④ Preserve Filter\nUUID · IPv4 · hex≥16 · custom regex\nkeeps diagnostic IDs in body"]
        F["⑤ Greedy Legend Builder\nO(N) positional index — up to 512 entries"]
        G["⑥ AhoCorasick Substitution\nsingle-pass k-way merge"]
        H{bpe_passes > 0?}
        I["⑦ Recursive BPE\n2nd-pass on compressed body"]
        J["⑧ Template Extraction\nstructural repeats → &tag = value"]

        B --> C --> D --> E --> F --> G --> H
        H -->|yes| I --> J
        H -->|no| J
    end

    J --> K([CompressResult])
    K --> L[body]
    K --> M[legend]
    K --> N[templates]
    K --> O[stats]

Pipeline stages (after profile detection):

  1. Normalizer: Collapses ANSI, timestamps, IPs, and common prefixes.
  2. Frequency Analysis: Parallel n-gram counting using rayon.
  3. Preserve Filter: Skips UUID, IPv4, long hex, and custom patterns — keeps them visible in the body for LLM analysis.
  4. Greedy Legend: Optimized selection using a positional index (O(N)).
  5. Direct Replacement: Fast substitution without re-scanning.
  6. Second Pass: Compresses repeated token sequences in the already-compressed body.
  7. Templates: Structural template extraction.
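Steps 2 to 5 can be illustrated with a deliberately simplified sketch. It swaps in single-threaded word n-gram counting and one regex pass where the real pipeline uses rayon and Aho-Corasick, and it does not re-count candidates after each pick. All names (`PRESERVE`, `build_legend`, `substitute`) and the regex shapes for UUID and IPv4 are assumptions.

```python
import re
from collections import Counter

# Patterns kept verbatim in the body (assumed shapes for UUID and IPv4).
PRESERVE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    r"|\b\d{1,3}(?:\.\d{1,3}){3}\b"
)


def build_legend(lines, n=3, max_entries=4, min_count=2):
    """Count word n-grams, drop preserved ones, keep the biggest savers."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i : i + n])] += 1

    candidates = [
        (gram, c) for gram, c in counts.items()
        if c >= min_count and not PRESERVE.search(gram)
    ]
    # Greedy choice: rank by characters saved (length x occurrences).
    candidates.sort(key=lambda gc: len(gc[0]) * gc[1], reverse=True)
    return {f"#{i}#": gram for i, (gram, _) in enumerate(candidates[:max_entries])}


def substitute(lines, legend):
    """Replace legend phrases in one regex pass, longest phrase first."""
    if not legend:
        return list(lines)
    by_phrase = {phrase: tag for tag, phrase in legend.items()}
    ordered = sorted(by_phrase, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    return [pattern.sub(lambda m: by_phrase[m.group(0)], line) for line in lines]


logs = ["GET /health 200 OK from 10.0.0.1"] * 3
legend = build_legend(logs)
print(legend)
print(substitute(logs, legend)[0])
```

In this run the repeated IP address never enters the legend, so it stays readable in the body even though it occurs as often as the other phrases.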

Safety First

  • Pure Rust: Core logic is 100% Rust.
  • Zero unsafe: The codebase contains no unsafe blocks, ensuring memory safety within the Python runtime.
  • Stress-tested: Handled multi-GB logs without memory leaks or crashes.

Reproducibility

Want to verify our benchmarks? Run the included script:

python benchmark.py

Roadmap

Priority:

  • Streaming mode for multi-GB logs

Planned:

  • MCP server for Claude Code
  • Suffix automaton for arbitrary repetition search


Download files

Source distribution:

  • logzip-2.1.3.tar.gz (25.4 kB): source

Built distributions:

  • logzip-2.1.3-cp39-abi3-win_amd64.whl (845.4 kB): CPython 3.9+, Windows x86-64
  • logzip-2.1.3-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (918.0 kB): CPython 3.9+, manylinux (glibc 2.17+), x86-64
  • logzip-2.1.3-cp39-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (1.7 MB): CPython 3.9+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
