
Cordon

Semantic anomaly detection for system log files

Cordon uses transformer-based embeddings and density-based scoring to identify semantically unusual patterns in large log files, reducing massive logs to their most anomalous sections for analysis.

Key principle: Repetitive patterns (even errors) are considered "normal background." Cordon surfaces unusual, rare, or clustered events that stand out semantically from the bulk of the logs.

Features

  • Semantic Analysis: Uses transformer models to understand log content meaning, not just keyword matching
  • Density-Based Scoring: Identifies anomalies using k-NN distance in embedding space
  • Noise Reduction: Filters out repetitive logs, keeping only unusual patterns
  • Multiple Backends: sentence-transformers (default) or llama.cpp for containers

Installation

From PyPI (Recommended)

# With uv (recommended)
uv pip install cordon

# With pip
pip install cordon

From Source

# Clone the repository
git clone https://github.com/calebevans/cordon.git
cd cordon

# With uv (recommended)
uv pip install -e .

# With pip
pip install -e .

For development:

uv pip install -e ".[dev]"
pre-commit install

For llama.cpp backend (GPU acceleration in containers):

uv pip install -e ".[llama-cpp]"

Container Installation

make container-build

See Container Guide for GPU support and advanced usage.

Quick Start

Command Line

# Basic usage
cordon system.log

# Multiple files
cordon app.log error.log

# With options
cordon --window-size 10 --k-neighbors 10 --anomaly-percentile 0.05 app.log

# Save results to file
cordon --output anomalies.xml system.log

# Show detailed statistics and save results
cordon --detailed --output results.xml app.log

# llama.cpp backend (for containers)
cordon --backend llama-cpp system.log

Python Library

from pathlib import Path
from cordon import SemanticLogAnalyzer, AnalysisConfig

# Basic usage
analyzer = SemanticLogAnalyzer()
output = analyzer.analyze_file(Path("system.log"))
print(output)

# Advanced configuration
config = AnalysisConfig(
    window_size=10,
    k_neighbors=10,
    anomaly_percentile=0.05,
    device="cuda",
    batch_size=64
)
analyzer = SemanticLogAnalyzer(config)
result = analyzer.analyze_file_detailed(Path("app.log"))

Backend Options

sentence-transformers (Default)

Best for native installations with GPU access.

cordon system.log  # Auto-detects GPU (MPS/CUDA)
cordon --device cuda system.log
cordon --device cpu system.log

llama.cpp Backend

Best for container deployments with GPU acceleration via Vulkan.

# Auto-downloads model on first run
cordon --backend llama-cpp system.log

# With GPU acceleration
cordon --backend llama-cpp --n-gpu-layers 10 system.log

# Custom model
cordon --backend llama-cpp --model-path ./model.gguf system.log

See llama.cpp Guide for details on models, performance, and GPU setup.

Container Usage

Build

# Build locally
make container-build

Run

# Pull published image from GitHub Container Registry
podman pull ghcr.io/calebevans/cordon:latest  # or :dev for development builds

# Run with published image
podman run --rm -v /path/to/logs:/logs:Z ghcr.io/calebevans/cordon:latest /logs/system.log

# Run with locally built image
make container-run DIR=/path/to/logs ARGS="/logs/system.log"

# With GPU (requires Podman with libkrun)
podman run --device /dev/dri -v /path/to/logs:/logs:Z ghcr.io/calebevans/cordon:latest \
  --backend llama-cpp --n-gpu-layers 10 /logs/system.log

See Container Guide for full details.

Primary Use Case: LLM Context Reduction

Log files are often far too large for LLM context windows. Cordon addresses this by reducing them to their semantically significant sections.

Real-world reduction rates from benchmarks:

  • 1M-line HDFS logs → 20K lines (98% reduction with p=0.02 threshold)
  • 5M-line HDFS logs → 100K lines (98% reduction with p=0.02 threshold)
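The reduction rates above are simple arithmetic. A hypothetical helper (the name is illustrative, not part of Cordon's API) makes the calculation explicit:

```python
def reduction_rate(original_lines, kept_lines):
    """Fraction of the log removed; 0.98 means a 98% reduction."""
    return 1 - kept_lines / original_lines

# 1M-line HDFS log reduced to 20K lines -> 98% reduction
assert abs(reduction_rate(1_000_000, 20_000) - 0.98) < 1e-9
```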

Example workflow:

# Extract anomalies
analyzer = SemanticLogAnalyzer()
anomalies = analyzer.analyze_file(Path("production.log"))

# Send curated context to LLM (now fits in context window)

The output is intentionally lossy—it discards repetitive patterns to focus on semantically unusual events.

How It Works

Pipeline

  1. Ingestion: Read log file line-by-line
  2. Segmentation: Create non-overlapping windows of N lines
  3. Vectorization: Embed windows using transformer models
  4. Scoring: Calculate k-NN density scores
  5. Thresholding: Select top X% based on scores
  6. Merging: Combine adjacent significant windows into contiguous blocks
  7. Formatting: Generate XML-tagged output
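Steps 1-2 can be sketched in plain Python (a simplified illustration, not Cordon's implementation):

```python
def make_windows(lines, window_size=5):
    """Split log lines into fixed-size, non-overlapping windows of text."""
    return ["\n".join(lines[i:i + window_size])
            for i in range(0, len(lines), window_size)]

lines = [f"2024-01-01 service[1]: event {i}" for i in range(12)]
windows = make_windows(lines, window_size=5)
# 12 lines with window_size=5 -> 3 windows (the last holds the 2 leftover lines)
assert len(windows) == 3
```

Each window then becomes one unit for embedding and scoring in the later steps.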

Scoring

  • Higher score = Semantically unique = Anomalous
  • Lower score = Repetitive = Normal background noise

The score for each window is the average cosine distance to its k nearest neighbors in the embedding space.

Important: Repetitive patterns are filtered even if critical. The same FATAL error repeated 100 times scores as "normal" because it's semantically similar to itself.
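The scoring rule can be illustrated with a small NumPy sketch (function names here are illustrative, not Cordon's API):

```python
import numpy as np

def knn_density_scores(embeddings, k=5):
    """Average cosine distance from each row to its k nearest neighbors."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - X @ X.T              # pairwise cosine distances
    np.fill_diagonal(dist, np.inf)    # a window is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Nine near-identical "repeated error" embeddings plus one outlier:
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal([1.0, 0.0, 0.0], 0.01, size=(9, 3)),
                 [[0.0, 1.0, 0.0]]])
scores = knn_density_scores(emb, k=2)
# The repeated pattern scores near 0 ("normal"); the outlier scores highest.
assert scores.argmax() == 9
```

This is exactly why the repeated FATAL error scores low: its nearest neighbors are copies of itself, so its average distance is near zero.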

See Cordon's architecture for full details.

Configuration

Analysis Parameters

| Parameter | Default | CLI Flag | Description |
|---|---|---|---|
| window_size | 5 | --window-size | Lines per window (non-overlapping) |
| k_neighbors | 5 | --k-neighbors | Number of neighbors for density calculation |
| anomaly_percentile | 0.1 | --anomaly-percentile | Top N% to keep (0.1 = 10%) |
| batch_size | 32 | --batch-size | Batch size for embedding generation |
| scoring_workers | Auto | --workers | Parallel workers for k-NN scoring (default: half of CPU cores) |

Backend Options

| Parameter | Default | CLI Flag | Description |
|---|---|---|---|
| backend | sentence-transformers | --backend | Embedding backend |
| model_name | all-MiniLM-L6-v2 | --model-name | HuggingFace model |
| device | Auto | --device | Device (cuda/mps/cpu) |
| model_path | None | --model-path | GGUF model path (llama-cpp) |
| n_gpu_layers | 0 | --n-gpu-layers | GPU layers (llama-cpp) |

Output Options

| Parameter | Default | CLI Flag | Description |
|---|---|---|---|
| detailed | False | --detailed | Show detailed statistics (timing, score distribution) |
| output | None | --output, -o | Save anomalous blocks to file (default: stdout) |

Run cordon --help for full CLI documentation.

⚠️ Important: Token Limits and Window Sizing

Transformer models have token limits that affect how much of each window is analyzed. Windows exceeding the limit are automatically truncated to the first N tokens.

Cordon will warn you if significant truncation is detected and suggest better settings for your logs.

Default model (all-MiniLM-L6-v2) has a 256-token limit:

  • Compact logs (20-30 tokens/line): Default window_size=5 works perfectly
  • Standard logs (40-50 tokens/line): Default settings work well
  • Verbose logs (50-70 tokens/line): Consider larger window with a bigger model
  • Very verbose logs (80+ tokens/line): Use a larger-context model
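One rough way to pick a window size for your logs is to estimate tokens per line, assuming the common rule of thumb of ~4 characters per token (these helpers are hypothetical, not part of Cordon):

```python
def estimate_tokens_per_line(lines, chars_per_token=4):
    """Crude token estimate: average line length divided by chars per token."""
    return sum(len(line) for line in lines) / len(lines) / chars_per_token

def max_window_size(lines, token_limit=256):
    """Largest window_size expected to fit under the model's token limit."""
    return max(1, int(token_limit // estimate_tokens_per_line(lines)))

verbose_lines = ["x" * 200] * 100   # ~50 tokens/line -> window_size up to 5
assert max_window_size(verbose_lines) == 5
```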

For verbose system logs, use larger-context models:

# BAAI/bge-base-en-v1.5 supports 512 tokens (~8-10 verbose lines)
cordon --model-name "BAAI/bge-base-en-v1.5" --window-size 8 your.log

See Configuration Guidelines for detailed recommendations.

Use Cases

What Cordon Is Good For

  • LLM Pre-processing: Reduce large logs to small anomalous sections prior to analysis
  • Initial Triage: First-pass screening of unfamiliar logs to find "what's unusual here?"
  • Anomaly Detection: Surface semantically unique events (rare errors, state transitions, unusual clusters)
  • Exploratory Analysis: Discover unexpected patterns without knowing what to search for

What Cordon Is NOT Good For

  • Complete error analysis (repetitive errors filtered)
  • Specific error hunting (use grep/structured logging)
  • Compliance logging (this is lossy by design)

Performance

Cordon automatically chooses the best approach:

| Strategy | When | RAM Usage | Speed |
|---|---|---|---|
| In-Memory | <50k windows | ~200-500 MB | Fastest |
| Memory-Mapped | ≥50k windows | ~50-100 MB | Moderate |

What's a "window"? A window is a non-overlapping chunk of N consecutive log lines (default: 5 lines). A 10,000-line log with window_size=5 creates 2,000 windows.
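The strategy choice in the table follows directly from the window count. A sketch of the rule as described (hypothetical names, not Cordon's internals):

```python
import math

def windows_count(n_lines, window_size=5):
    """Number of non-overlapping windows covering n_lines."""
    return math.ceil(n_lines / window_size)

def storage_strategy(n_windows, threshold=50_000):
    """In-memory below the threshold, memory-mapped at or above it."""
    return "in-memory" if n_windows < threshold else "memory-mapped"

n = windows_count(10_000, window_size=5)   # -> 2000 windows
assert storage_strategy(n) == "in-memory"
```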

