Cordon

Semantic anomaly detection for system log files

Cordon uses transformer-based embeddings and density-based scoring to identify semantically unusual patterns in large log files. It is designed to reduce massive logs to their most anomalous sections for analysis.

Key principle: Repetitive patterns (even errors) are considered "normal background." Cordon surfaces unusual, rare, or clustered events that stand out semantically from the bulk of the logs.

Features

Semantic Analysis: Uses transformer models to understand log content meaning, not just keyword matching
Density-Based Scoring: Identifies anomalies using k-NN distance in embedding space
Noise Reduction: Filters out repetitive logs, keeping only unusual patterns
Multiple Backends: sentence-transformers (default) or llama.cpp for containers

Installation

Native Installation

# With uv (recommended)
uv pip install -e .

# With pip
pip install -e .

For development:

uv pip install -e ".[dev]"
pre-commit install

For llama.cpp backend (GPU acceleration in containers):

uv pip install -e ".[llama-cpp]"

For FAISS support (better performance on large logs):

uv pip install -e ".[faiss-cpu]"  # CPU
uv pip install -e ".[faiss-gpu]"  # GPU

Container Installation

make container-build

See Container Guide for GPU support and advanced usage.

Quick Start

Command Line

# Basic usage
cordon system.log

# Multiple files
cordon app.log error.log

# With options
cordon --window-size 10 --k-neighbors 10 --anomaly-percentile 0.05 app.log

# With FAISS for large logs
cordon --use-faiss large.log

# llama.cpp backend (for containers)
cordon --backend llama-cpp --use-faiss system.log

Python Library

from pathlib import Path
from cordon import SemanticLogAnalyzer, AnalysisConfig

# Basic usage
analyzer = SemanticLogAnalyzer()
output = analyzer.analyze_file(Path("system.log"))
print(output)

# Advanced configuration
config = AnalysisConfig(
    window_size=10,
    stride=5,
    k_neighbors=10,
    anomaly_percentile=0.05,
    device="cuda"
)
analyzer = SemanticLogAnalyzer(config)
result = analyzer.analyze_file_detailed(Path("app.log"))

Backend Options

sentence-transformers (Default)

Best for native installations with GPU access.

cordon system.log  # Auto-detects GPU (MPS/CUDA)
cordon --device cuda system.log
cordon --device cpu system.log

llama.cpp Backend

Best for container deployments with GPU acceleration via Vulkan.

# Auto-downloads model on first run
cordon --backend llama-cpp --use-faiss system.log

# With GPU acceleration
cordon --backend llama-cpp --use-faiss --n-gpu-layers 10 system.log

# Custom model
cordon --backend llama-cpp --use-faiss --model-path ./model.gguf system.log

See llama.cpp Guide for details on models, performance, and GPU setup.

Container Usage

# Build
make container-build

# Run (specify directory with your logs)
make container-run DIR=/path/to/logs ARGS="/logs/system.log"

# With GPU (requires Podman with libkrun)
podman run --device /dev/dri -v "$(pwd)/logs:/logs" cordon:latest \
  --backend llama-cpp --use-faiss --n-gpu-layers 10 /logs/system.log

See Container Guide for full details.

Primary Use Case: LLM Context Reduction

Cordon addresses the case where a log file is too large to fit in an LLM context window:

Problem: 50GB production log → Can't fit in any LLM context window
Solution: Cordon → 12 anomalous blocks (few KB) → Send to LLM for analysis

Example workflow:

from pathlib import Path
from cordon import SemanticLogAnalyzer

# Extract anomalies
analyzer = SemanticLogAnalyzer()
anomalies = analyzer.analyze_file(Path("production.log"))

# Send curated context to LLM (now fits in context window)

The output is intentionally lossy: it discards repetitive patterns to focus on semantically unusual events.
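The last step of the workflow, handing the extracted text to an LLM, could look roughly like the sketch below. It assumes `analyze_file` returns a plain string (as the `print(output)` call in Quick Start suggests). The function name `build_triage_prompt`, the instruction wording, and the `max_chars` guard are hypothetical and not part of Cordon.

```python
def build_triage_prompt(anomalies: str, max_chars: int = 20_000) -> str:
    """Wrap extracted anomaly text in a short instruction for an LLM.

    max_chars is a crude, hypothetical size guard; real context limits
    are measured in tokens, not characters.
    """
    if len(anomalies) > max_chars:
        # Keep the head of the text and mark that something was cut.
        anomalies = anomalies[:max_chars] + "\n[truncated]"
    instruction = (
        "The following log excerpts were flagged as semantically unusual.\n"
        "Summarize what happened and list likely root causes.\n\n"
    )
    return instruction + anomalies
```

The returned string can then be passed to whichever LLM client you use.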

How It Works

Pipeline

Ingestion: Read log file line-by-line
Segmentation: Create overlapping windows of N lines
Vectorization: Embed windows using transformer models
Scoring: Calculate k-NN density scores
Thresholding: Select top X% based on scores
Merging: Combine overlapping significant windows
Formatting: Generate XML-tagged output
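The segmentation, thresholding, and merging steps above can be sketched in plain Python. This is an illustration under stated assumptions, not Cordon's implementation: the function names are hypothetical, trailing lines that do not fill a whole window are dropped for simplicity, and the scores are taken as given rather than computed from embeddings.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) line indices, end exclusive


def make_windows(n_lines: int, window_size: int, stride: int) -> List[Range]:
    """Overlapping windows of window_size lines, advancing by stride."""
    if n_lines <= 0:
        return []
    if n_lines <= window_size:
        return [(0, n_lines)]
    return [(s, s + window_size)
            for s in range(0, n_lines - window_size + 1, stride)]


def select_top(scores: List[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring windows, keeping about `fraction` of them."""
    if not scores:
        return []
    keep = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Merge overlapping or touching ranges into contiguous blocks."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For a 12-line log with `window_size=4` and `stride=2`, `make_windows` yields five windows. If the second and third score highest and 40% are kept, `merge_ranges` combines them into a single block covering lines 2 through 7.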

Scoring

Higher score = Semantically unique = Anomalous
Lower score = Repetitive = Normal background noise

The score for each window is the average cosine distance to its k nearest neighbors in the embedding space.

Important: Repetitive patterns are filtered even if critical. The same FATAL error repeated 100 times scores as "normal" because it's semantically similar to itself.
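A brute-force sketch of this score, using only the standard library, is shown below. It assumes each window is excluded from its own neighbor list and that no embedding is the zero vector. The function names are illustrative, and a real implementation would likely use vectorized or approximate search instead of this O(n²) loop.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_scores(embeddings: List[Sequence[float]], k: int) -> List[float]:
    """Mean cosine distance from each vector to its k nearest other vectors.

    Requires at least two embeddings.
    """
    scores = []
    for i, vec in enumerate(embeddings):
        dists = sorted(
            cosine_distance(vec, other)
            for j, other in enumerate(embeddings)
            if j != i  # assumption: a window is not its own neighbor
        )
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores
```

With four identical vectors and one orthogonal vector, the identical ones score near zero (their nearest neighbors are copies of themselves) while the orthogonal one scores near 1. This is the behavior described above: a repeated FATAL line looks like background.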

See Cordon's architecture for full details.

Configuration

Analysis Parameters

Parameter	Default	CLI Flag	Description
`window_size`	5	`--window-size`	Lines per window
`stride`	2	`--stride`	Lines to advance between the starts of consecutive windows
`k_neighbors`	5	`--k-neighbors`	Number of neighbors for density calculation
`anomaly_percentile`	0.1	`--anomaly-percentile`	Fraction of highest-scoring windows to keep (0.1 = top 10%)

Backend Options

Parameter	Default	CLI Flag	Description
`backend`	`sentence-transformers`	`--backend`	Embedding backend
`model_name`	`all-MiniLM-L6-v2`	`--model-name`	HuggingFace model
`device`	Auto	`--device`	Device (cuda/mps/cpu)
`model_path`	None	`--model-path`	GGUF model path (llama-cpp)
`n_gpu_layers`	0	`--n-gpu-layers`	GPU layers (llama-cpp)
`use_faiss`	False	`--use-faiss`	Use FAISS for large logs

Run cordon --help for full CLI documentation.

⚠️ Important: Token Limits and Window Sizing

Transformer models have token limits that affect how much of each window is analyzed. Windows exceeding the limit are automatically truncated to the first N tokens.

Cordon will warn you if significant truncation is detected and suggest better settings for your logs.

Default model (all-MiniLM-L6-v2) has a 256-token limit:

Compact logs (20-30 tokens/line): Default window_size=5 fits comfortably within the limit
Standard logs (40-50 tokens/line): Default settings work well
Verbose logs (50-70 tokens/line): Default windows may be truncated; consider a larger-context model, which also allows a larger window
Very verbose logs (80+ tokens/line): Use a larger-context model
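The guidance above follows from simple arithmetic: estimated tokens per window is roughly `window_size` times the average tokens per line. The sketch below makes that explicit. The helper names are hypothetical, and actual token counts depend on the model's tokenizer, so treat the results as rough estimates.

```python
def estimated_window_tokens(window_size: int, tokens_per_line: float) -> float:
    """Rough token count for one window of log lines."""
    return window_size * tokens_per_line


def max_window_size(token_limit: int, tokens_per_line: float) -> int:
    """Largest window_size whose estimated token count stays within the limit."""
    return max(1, int(token_limit // tokens_per_line))
```

For example, at 50 tokens per line a 256-token model fits about 5 lines per window, while at 80 tokens per line it fits only about 3.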

For verbose system logs, use larger-context models:

# BAAI/bge-base-en-v1.5 supports 512 tokens (~8-10 verbose lines)
cordon --model-name "BAAI/bge-base-en-v1.5" --window-size 8 --stride 4 your.log

See Configuration Guidelines for detailed recommendations.

Use Cases

What Cordon Is Good For

LLM Pre-processing: Reduce large logs to small anomalous sections prior to analysis
Initial Triage: First-pass screening of unfamiliar logs to find "what's unusual here?"
Anomaly Detection: Surface semantically unique events (rare errors, state transitions, unusual clusters)
Exploratory Analysis: Discover unexpected patterns without knowing what to search for

What Cordon Is NOT Good For

Complete error analysis (repetitive errors filtered)
Specific error hunting (use grep/structured logging)
Compliance logging (this is lossy by design)

Performance

Cordon automatically selects a scoring strategy based on the number of windows:

Strategy	When	RAM Usage	Speed
In-Memory	<50k windows	~200-500MB	Fastest
Memory-Mapped	50k-500k windows	~50-100MB	Moderate
FAISS	>500k windows	~50MB	Fast

What's a "window"? A window is a sliding chunk of N consecutive log lines (default: 5 lines). A 10,000-line log with window_size=10 and stride=5 creates ~2,000 windows.
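The window count can be estimated directly from the line count, window size, and stride, and then compared against the thresholds in the table. The sketch below assumes only full windows are counted. Both function names are hypothetical, and the thresholds are copied from the table rather than from Cordon's source.

```python
def count_windows(n_lines: int, window_size: int, stride: int) -> int:
    """Number of full sliding windows over n_lines lines."""
    if n_lines <= 0:
        return 0
    if n_lines <= window_size:
        return 1
    return (n_lines - window_size) // stride + 1


def pick_strategy(n_windows: int) -> str:
    """Map a window count to the strategy listed in the table above."""
    if n_windows < 50_000:
        return "in-memory"
    if n_windows <= 500_000:
        return "memory-mapped"
    return "faiss"
```

`count_windows(10_000, 10, 5)` gives 1,999, matching the "~2,000 windows" figure, which falls in the in-memory range.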

See Test Examples for real-world results across 9 log types.

Release history

Version	Release date
1.0.2	Apr 4, 2026
1.0.1	Mar 30, 2026
1.0.0	Mar 25, 2026
0.3.3	Jan 20, 2026
0.3.2	Jan 16, 2026
0.3.1	Dec 20, 2025
0.3.0	Dec 17, 2025
0.2.1	Dec 2, 2025
0.2.0	Dec 2, 2025
0.1.3	Dec 1, 2025
0.1.2	Dec 1, 2025
0.1.1	Nov 24, 2025
0.1.0 (this version)	Nov 24, 2025

Download files

Source Distribution

cordon-0.1.0.tar.gz (100.2 kB)

Uploaded Nov 24, 2025 Source

Built Distribution

cordon-0.1.0-py3-none-any.whl (27.4 kB)

Uploaded Nov 24, 2025 Python 3

File details

Details for the file cordon-0.1.0.tar.gz.

File metadata

Download URL: cordon-0.1.0.tar.gz
Upload date: Nov 24, 2025
Size: 100.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cordon-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`198c9829dc53c379bf0bbdd2ceb535397ce8786e174f7053e21ee6bc0913434e`
MD5	`213191713c3b26e1f61d12683e29b622`
BLAKE2b-256	`5d657075e9b1f62ff3154b3b3e12d530151ef9bff7a47c66ce5e1c4663bb4a22`
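One way to check a downloaded file against the SHA256 digest listed above is with Python's standard `hashlib` module. This is a generic sketch: the helper names are illustrative, and the file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call `matches_published_hash(Path("cordon-0.1.0.tar.gz"), "<SHA256 from the table>")` after downloading; a `False` result means the file differs from the one that was published.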

Provenance

The following attestation bundles were made for cordon-0.1.0.tar.gz:

Publisher: release.yml on calebevans/cordon

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cordon-0.1.0.tar.gz
- Subject digest: 198c9829dc53c379bf0bbdd2ceb535397ce8786e174f7053e21ee6bc0913434e
- Sigstore transparency entry: 721398671
- Sigstore integration time: Nov 24, 2025
Source repository:
- Permalink: calebevans/cordon@9c39f169465c2725e8e44f8d138053038522f2f0
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/calebevans
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9c39f169465c2725e8e44f8d138053038522f2f0
- Trigger Event: release

File details

Details for the file cordon-0.1.0-py3-none-any.whl.

File metadata

Download URL: cordon-0.1.0-py3-none-any.whl
Upload date: Nov 24, 2025
Size: 27.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cordon-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`147efef9149ce4a506d79a06c58adcc86247a6c6547fb78e13da1eae7e18909a`
MD5	`4410c2c24dfb959ea2010d6dede95d20`
BLAKE2b-256	`5953f38814efdd68336f628df7ce3f7dba4b6e37cffdd4fd0c89496a5de82e14`

Provenance

The following attestation bundles were made for cordon-0.1.0-py3-none-any.whl:

Publisher: release.yml on calebevans/cordon

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cordon-0.1.0-py3-none-any.whl
- Subject digest: 147efef9149ce4a506d79a06c58adcc86247a6c6547fb78e13da1eae7e18909a
- Sigstore transparency entry: 721398676
- Sigstore integration time: Nov 24, 2025
Source repository:
- Permalink: calebevans/cordon@9c39f169465c2725e8e44f8d138053038522f2f0
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/calebevans
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@9c39f169465c2725e8e44f8d138053038522f2f0
- Trigger Event: release

cordon 0.1.0
