Cordon
Semantic anomaly detection for system log files
Cordon uses transformer-based embeddings and density-based scoring to identify semantically unusual patterns in large log files, reducing massive logs to their most anomalous sections for analysis.
Key principle: Repetitive patterns (even errors) are considered "normal background." Cordon surfaces unusual, rare, or clustered events that stand out semantically from the bulk of the logs.
Features
- Semantic Analysis: Uses transformer models to understand log content meaning, not just keyword matching
- Density-Based Scoring: Identifies anomalies using k-NN distance in embedding space
- Noise Reduction: Filters out repetitive logs, keeping only unusual patterns
- Multiple Backends: sentence-transformers (default) or llama.cpp for containers
Installation
From PyPI (Recommended)
# With uv (recommended)
uv pip install cordon
# With pip
pip install cordon
From Source
# Clone the repository
git clone https://github.com/calebevans/cordon.git
cd cordon
# With uv (recommended)
uv pip install -e .
# With pip
pip install -e .
For development:
uv pip install -e ".[dev]"
pre-commit install
For llama.cpp backend (GPU acceleration in containers):
uv pip install -e ".[llama-cpp]"
For FAISS support (better performance on large logs):
uv pip install -e ".[faiss-cpu]" # CPU
uv pip install -e ".[faiss-gpu]" # GPU
Container Installation
make container-build
See Container Guide for GPU support and advanced usage.
Quick Start
Command Line
# Basic usage
cordon system.log
# Multiple files
cordon app.log error.log
# With options
cordon --window-size 10 --k-neighbors 10 --anomaly-percentile 0.05 app.log
# With FAISS for large logs
cordon --use-faiss large.log
# llama.cpp backend (for containers)
cordon --backend llama-cpp --use-faiss system.log
Python Library
from pathlib import Path
from cordon import SemanticLogAnalyzer, AnalysisConfig
# Basic usage
analyzer = SemanticLogAnalyzer()
output = analyzer.analyze_file(Path("system.log"))
print(output)
# Advanced configuration
config = AnalysisConfig(
window_size=10,
stride=5,
k_neighbors=10,
anomaly_percentile=0.05,
device="cuda"
)
analyzer = SemanticLogAnalyzer(config)
result = analyzer.analyze_file_detailed(Path("app.log"))
Backend Options
sentence-transformers (Default)
Best for native installations with GPU access.
cordon system.log # Auto-detects GPU (MPS/CUDA)
cordon --device cuda system.log
cordon --device cpu system.log
llama.cpp Backend
Best for container deployments with GPU acceleration via Vulkan.
# Auto-downloads model on first run
cordon --backend llama-cpp --use-faiss system.log
# With GPU acceleration
cordon --backend llama-cpp --use-faiss --n-gpu-layers 10 system.log
# Custom model
cordon --backend llama-cpp --use-faiss --model-path ./model.gguf system.log
See llama.cpp Guide for details on models, performance, and GPU setup.
Container Usage
Build
# Build locally
make container-build
Run
# Pull published image from GitHub Container Registry
podman pull ghcr.io/calebevans/cordon:latest # or :dev for development builds
# Run with published image
podman run --rm -v /path/to/logs:/logs:Z ghcr.io/calebevans/cordon:latest /logs/system.log
# Run with locally built image
make container-run DIR=/path/to/logs ARGS="/logs/system.log"
# With GPU (requires Podman with libkrun)
podman run --device /dev/dri -v /path/to/logs:/logs:Z ghcr.io/calebevans/cordon:latest \
--backend llama-cpp --use-faiss --n-gpu-layers 10 /logs/system.log
See Container Guide for full details.
Primary Use Case: LLM Context Reduction
Cordon addresses the case where log files are too large to fit in an LLM context window:
Problem: 50GB production log → Can't fit in any LLM context window
Solution: Cordon → 12 anomalous blocks (few KB) → Send to LLM for analysis
Example workflow:
# Extract anomalies
analyzer = SemanticLogAnalyzer()
anomalies = analyzer.analyze_file(Path("production.log"))
# Send curated context to LLM (now fits in context window)
The output is intentionally lossy—it discards repetitive patterns to focus on semantically unusual events.
How It Works
Pipeline
- Ingestion: Read log file line-by-line
- Segmentation: Create overlapping windows of N lines
- Vectorization: Embed windows using transformer models
- Scoring: Calculate k-NN density scores
- Thresholding: Select top X% based on scores
- Merging: Combine overlapping significant windows
- Formatting: Generate XML-tagged output
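The segmentation step above can be sketched in a few lines. This is an illustrative helper (`make_windows` is hypothetical, not part of Cordon's API) showing how overlapping windows are produced from a window size and stride:

```python
def make_windows(lines, window_size=5, stride=2):
    """Create overlapping windows of consecutive log lines."""
    windows = []
    for start in range(0, max(len(lines) - window_size + 1, 1), stride):
        windows.append(lines[start:start + window_size])
    return windows

log = [f"line {i}" for i in range(11)]
wins = make_windows(log, window_size=5, stride=2)
# Windows start at lines 0, 2, 4, 6 -> 4 windows, each 5 lines long.
```

Because the stride is smaller than the window size, adjacent windows share lines, which is why the pipeline later merges overlapping significant windows.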
Scoring
- Higher score = Semantically unique = Anomalous
- Lower score = Repetitive = Normal background noise
The score for each window is the average cosine distance to its k nearest neighbors in the embedding space.
Important: Repetitive patterns are filtered even if critical. The same FATAL error repeated 100 times scores as "normal" because it's semantically similar to itself.
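A toy illustration of the scoring idea (a sketch, not Cordon's implementation): with three near-identical embedding vectors and one outlier, the outlier's mean cosine distance to its k nearest neighbors is largest, so it is flagged as anomalous while the repeated vectors score as "normal."

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def knn_density_scores(embeddings, k=2):
    """Score each vector by its mean cosine distance to its k nearest neighbors."""
    scores = []
    for i, e in enumerate(embeddings):
        dists = sorted(cosine_distance(e, o) for j, o in enumerate(embeddings) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Three near-identical vectors and one outlier: the outlier scores highest.
vecs = [[1.0, 0.0], [0.99, 0.01], [1.0, 0.02], [0.0, 1.0]]
scores = knn_density_scores(vecs, k=2)
assert max(range(len(vecs)), key=lambda i: scores[i]) == 3
```

This also shows why 100 copies of the same FATAL error score low: each copy's nearest neighbors are its own duplicates, at distance near zero.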
See Cordon's architecture for full details.
Configuration
Analysis Parameters
| Parameter | Default | CLI Flag | Description |
|---|---|---|---|
| window_size | 5 | --window-size | Lines per window |
| stride | 2 | --stride | Lines to skip between windows |
| k_neighbors | 5 | --k-neighbors | Number of neighbors for density calculation |
| anomaly_percentile | 0.1 | --anomaly-percentile | Top N% to keep (0.1 = 10%) |
Backend Options
| Parameter | Default | CLI Flag | Description |
|---|---|---|---|
| backend | sentence-transformers | --backend | Embedding backend |
| model_name | all-MiniLM-L6-v2 | --model-name | HuggingFace model |
| device | Auto | --device | Device (cuda/mps/cpu) |
| model_path | None | --model-path | GGUF model path (llama-cpp) |
| n_gpu_layers | 0 | --n-gpu-layers | GPU layers (llama-cpp) |
| use_faiss | False | --use-faiss | Use FAISS for large logs |
Run cordon --help for full CLI documentation.
⚠️ Important: Token Limits and Window Sizing
Transformer models have token limits that affect how much of each window is analyzed. Windows exceeding the limit are automatically truncated to the first N tokens.
Cordon will warn you if significant truncation is detected and suggest better settings for your logs.
Default model (all-MiniLM-L6-v2) has a 256-token limit:
- Compact logs (20-30 tokens/line): Default window_size=5 works perfectly
- Standard logs (40-50 tokens/line): Default settings work well
- Verbose logs (50-70 tokens/line): Consider a larger window with a bigger model
- Very verbose logs (80+ tokens/line): Use a larger-context model
For verbose system logs, use larger-context models:
# BAAI/bge-base-en-v1.5 supports 512 tokens (~8-10 verbose lines)
cordon --model-name "BAAI/bge-base-en-v1.5" --window-size 8 --stride 4 your.log
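To pick a window size for your own logs, a rough heuristic is to estimate tokens per line from average line length (a sketch using the common ~4 characters-per-token rule of thumb; real tokenizers vary, and `estimate_tokens_per_line` / `max_window_for_limit` are hypothetical helpers, not Cordon functions):

```python
def estimate_tokens_per_line(lines, chars_per_token=4):
    """Rough token estimate: average characters per line / ~4 chars per token."""
    if not lines:
        return 0.0
    avg_chars = sum(len(line) for line in lines) / len(lines)
    return avg_chars / chars_per_token

def max_window_for_limit(tokens_per_line, token_limit=256):
    """Largest window size that stays under the model's token limit."""
    return max(int(token_limit // tokens_per_line), 1)

lines = ["2024-01-01T00:00:00 INFO service started on port 8080"] * 100
tpl = estimate_tokens_per_line(lines)  # ~13 tokens/line for this sample
print(max_window_for_limit(tpl))
```

If the result is smaller than your desired window size, switch to a larger-context model as shown above rather than shrinking the window.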
See Configuration Guidelines for detailed recommendations.
Use Cases
What Cordon Is Good For
- LLM Pre-processing: Reduce large logs to small anomalous sections prior to analysis
- Initial Triage: First-pass screening of unfamiliar logs to find "what's unusual here?"
- Anomaly Detection: Surface semantically unique events (rare errors, state transitions, unusual clusters)
- Exploratory Analysis: Discover unexpected patterns without knowing what to search for
What Cordon Is NOT Good For
- Complete error analysis (repetitive errors filtered)
- Specific error hunting (use grep/structured logging)
- Compliance logging (this is lossy by design)
Performance
Cordon automatically chooses the best approach:
| Strategy | When | RAM Usage | Speed |
|---|---|---|---|
| In-Memory | <50k windows | ~200-500MB | Fastest |
| Memory-Mapped | 50k-500k windows | ~50-100MB | Moderate |
| FAISS | >500k windows | ~50MB | Fast |
What's a "window"? A window is a sliding chunk of N consecutive log lines (default: 5 lines). A 10,000-line log with window_size=10 and stride=5 creates ~2,000 windows.
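The window count behind the table's thresholds follows directly from the sliding-window arithmetic. A quick sketch (the formula is standard; `pick_strategy` is illustrative only, mirroring the thresholds in the table above rather than Cordon's internal logic):

```python
def window_count(n_lines, window_size=5, stride=2):
    """Number of sliding windows over a log of n_lines lines."""
    if n_lines <= window_size:
        return 1
    return (n_lines - window_size) // stride + 1

def pick_strategy(n_windows):
    """Illustrative strategy choice matching the thresholds in the table above."""
    if n_windows < 50_000:
        return "in-memory"
    if n_windows <= 500_000:
        return "memory-mapped"
    return "faiss"

n = window_count(10_000, window_size=10, stride=5)
print(n)                  # 1999 windows (~2,000)
print(pick_strategy(n))   # in-memory
```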
See Test Examples for real-world results across 9 log types.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file cordon-0.1.1.tar.gz.
File metadata
- Download URL: cordon-0.1.1.tar.gz
- Upload date:
- Size: 100.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 03f57a90776afdb92f1ab32d57c105f86a43b5a54650758666054a16ee5d0d16 |
| MD5 | 5ba61662200918550968714fa840079b |
| BLAKE2b-256 | 19c51f97bc99d22e9d25c02acee5b7998b7a09db16143c5ca688ef2ba5f776d5 |
Provenance
The following attestation bundles were made for cordon-0.1.1.tar.gz:
Publisher: release.yml on calebevans/cordon
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cordon-0.1.1.tar.gz
- Subject digest: 03f57a90776afdb92f1ab32d57c105f86a43b5a54650758666054a16ee5d0d16
- Sigstore transparency entry: 722201135
- Sigstore integration time:
- Permalink: calebevans/cordon@aa4c3974f667a247e4d87fc2db940c54006300d6
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/calebevans
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@aa4c3974f667a247e4d87fc2db940c54006300d6
- Trigger Event: release
File details
Details for the file cordon-0.1.1-py3-none-any.whl.
File metadata
- Download URL: cordon-0.1.1-py3-none-any.whl
- Upload date:
- Size: 27.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4fe5cecd44363d60045e7a0a692060dce013a6172e95768f4a1cea92e9653aa5 |
| MD5 | 951b93d7c8bda70a30c3cd701dc71c1a |
| BLAKE2b-256 | 592090dc437635dec9c2e8820ac72daf8f0a49a5e026584d8a00936f95c51734 |
Provenance
The following attestation bundles were made for cordon-0.1.1-py3-none-any.whl:
Publisher: release.yml on calebevans/cordon
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cordon-0.1.1-py3-none-any.whl
- Subject digest: 4fe5cecd44363d60045e7a0a692060dce013a6172e95768f4a1cea92e9653aa5
- Sigstore transparency entry: 722201178
- Sigstore integration time:
- Permalink: calebevans/cordon@aa4c3974f667a247e4d87fc2db940c54006300d6
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/calebevans
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@aa4c3974f667a247e4d87fc2db940c54006300d6
- Trigger Event: release