log-sculptor
Parse unstructured logs by learning patterns automatically. No regex required.
Features
- Automatic pattern learning - Analyzes log files and learns structural patterns
- Smart field naming - Infers meaningful field names from context (method, status, path, etc.)
- Type detection - Automatically detects timestamps, IPs, URLs, UUIDs, numbers, booleans
- Multi-line support - Handles stack traces and continuation lines
- Format drift detection - Detects when log formats change mid-file
- Multiple outputs - JSON Lines, SQLite, DuckDB, Parquet
- Performance optimized - Streaming processing and parallel learning for large files
- Testing utilities - Mocks, fixtures, and sample data generators for reliable integrations
Installation
pip install log-sculptor
# With optional outputs
pip install log-sculptor[duckdb] # DuckDB support
pip install log-sculptor[parquet] # Parquet support
pip install log-sculptor[all] # All optional dependencies
Quick Start
# Learn and parse in one step
log-sculptor auto server.log -f jsonl -o parsed.jsonl
# Or separate steps for reuse
log-sculptor learn server.log -o patterns.json
log-sculptor parse server.log -p patterns.json -f jsonl -o parsed.jsonl
CLI Commands
auto
Learn patterns and parse in one step.
log-sculptor auto server.log -f jsonl -o output.jsonl
log-sculptor auto server.log -f sqlite -o logs.db
log-sculptor auto server.log -f duckdb -o logs.duckdb # requires [duckdb]
log-sculptor auto server.log -f parquet -o logs.parquet # requires [parquet]
# With multi-line support (stack traces, continuations)
log-sculptor auto server.log --multiline -f jsonl -o output.jsonl
learn
Learn patterns from a log file.
log-sculptor learn server.log -o patterns.json
# With clustering for similar patterns
log-sculptor learn server.log -o patterns.json --cluster
# Incremental learning (update existing patterns)
log-sculptor learn new.log --update patterns.json -o patterns.json
# Handle multi-line entries
log-sculptor learn server.log -o patterns.json --multiline
parse
Parse a log file using learned patterns.
log-sculptor parse server.log -p patterns.json -f jsonl -o output.jsonl
log-sculptor parse server.log -p patterns.json -f sqlite -o logs.db --include-raw
show
Display patterns from a patterns file.
log-sculptor show patterns.json
validate
Validate patterns against a log file.
log-sculptor validate patterns.json server.log
# Exit codes: 0 = all matched, 1 = partial match, 2 = no matches
merge
Merge similar patterns in a patterns file.
log-sculptor merge patterns.json -o merged.json --threshold 0.8
drift
Detect format changes in a log file.
log-sculptor drift server.log -p patterns.json
log-sculptor drift server.log -p patterns.json --window 50
fast-learn
Learn patterns using parallel processing (for large files).
log-sculptor fast-learn large.log -o patterns.json --workers 4
generate
Generate sample log data for testing and demos.
log-sculptor generate sample.log -t app -n 1000
log-sculptor generate apache.log -t apache -n 500 --seed 42
log-sculptor generate mixed.log -t mixed -n 1000 # For drift testing
Available types: app, apache, syslog, json, mixed
Output Formats
JSON Lines (jsonl)
{"line_number": 1, "pattern_id": "a1b2c3", "matched": true, "fields": {"timestamp": "2024-01-15T10:30:00", "level": "INFO", "message": "Server started"}}
SQLite
Creates two tables:
- patterns - Pattern metadata (id, frequency, confidence, structure, example)
- logs - Parsed records with extracted fields as columns
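Because extracted fields become columns on the logs table, the database can be queried directly with the standard sqlite3 module. The sketch below builds a tiny in-memory stand-in for that table; the real column names depend on which fields log-sculptor extracted from your logs ("level" and "message" here are illustrative assumptions):

```python
import sqlite3

# In-memory stand-in for the "logs" table that `-f sqlite` produces;
# actual columns depend on the fields learned from your log file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (line_number INTEGER, pattern_id TEXT, level TEXT, message TEXT)"
)
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?, ?)",
    [
        (1, "a1b2c3", "INFO", "Server started"),
        (2, "a1b2c3", "ERROR", "Connection refused"),
        (3, "a1b2c3", "INFO", "Request handled"),
    ],
)

# Count entries per level, as you would against a real logs.db
rows = list(
    conn.execute(
        "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY COUNT(*) DESC"
    )
)
for level, n in rows:
    print(level, n)
conn.close()
```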
DuckDB
Same schema as SQLite, optimized for analytical queries. Requires pip install log-sculptor[duckdb].
Parquet
Columnar format for efficient analytics. Creates output.parquet for logs and output_patterns.parquet for patterns. Requires pip install log-sculptor[parquet].
Python API
from log_sculptor.core import learn_patterns, parse_logs, PatternSet
# Learn patterns
patterns = learn_patterns("server.log")
patterns.save("patterns.json")
# Parse logs
for record in parse_logs("server.log", patterns):
print(record.fields)
# Load existing patterns
patterns = PatternSet.load("patterns.json")
# Incremental learning
new_patterns = learn_patterns("new_logs.log")
patterns.update(new_patterns, merge=True)
# Merge similar patterns
patterns.merge_similar(threshold=0.8)
Streaming for Large Files
from log_sculptor.core.streaming import stream_parse, parallel_learn
# Memory-efficient parsing
for record in stream_parse("large.log", patterns):
process(record)
# Parallel pattern learning
patterns = parallel_learn("large.log", num_workers=4)
Format Drift Detection
from log_sculptor.core import detect_drift
report = detect_drift("server.log", patterns)
print(f"Format changes: {len(report.format_changes)}")
for change in report.format_changes:
    print(f"  Line {change.line_number}: {change.old_pattern_id} -> {change.new_pattern_id}")
How It Works
1. Tokenization - Lines are split into typed tokens (TIMESTAMP, IP, QUOTED, BRACKET, NUMBER, WORD, PUNCT, WHITESPACE)
2. Clustering - Lines with identical token signatures are grouped together
3. Pattern Generation - Each cluster becomes a pattern with fields for variable tokens
4. Smart Naming - Field names are inferred from context:
   - Previous token as indicator ("status 200" -> field named "status")
   - Value patterns (GET/POST -> "method", 404 -> "status", /api/users -> "path")
   - Token types (timestamps, IPs, UUIDs get appropriate names)
5. Type Detection - Field values are typed:
   - Timestamps (ISO 8601, Apache CLF, syslog, Unix epoch)
   - IPs, URLs, UUIDs
   - Integers, floats, booleans
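The tokenize-then-cluster idea can be illustrated in a few lines. This toy sketch is not log-sculptor's actual tokenizer (which recognizes many more types), but it shows how mapping each token to a coarse type and grouping lines by the resulting signature yields one cluster per structural pattern:

```python
import re
from collections import defaultdict

# Toy tokenizer: map each whitespace-separated token to a coarse type.
# The real tokenizer is richer (TIMESTAMP, QUOTED, BRACKET, PUNCT, ...).
def token_type(tok):
    if re.fullmatch(r"\d+\.\d+\.\d+\.\d+", tok):
        return "IP"
    if re.fullmatch(r"\d+", tok):
        return "NUMBER"
    return "WORD"

def signature(line):
    return tuple(token_type(t) for t in line.split())

lines = [
    "GET /index 200 10.0.0.1",
    "POST /login 401 10.0.0.2",
    "worker started",
]

# Lines with identical signatures form one cluster -> one pattern
clusters = defaultdict(list)
for line in lines:
    clusters[signature(line)].append(line)

for sig, members in clusters.items():
    print(sig, len(members))
```

The two HTTP-style lines share the signature (WORD, WORD, NUMBER, IP) and land in one cluster; "worker started" forms its own. Variable positions within a cluster (the NUMBER and IP slots) become the pattern's fields.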
Testing Utilities
log-sculptor includes comprehensive testing utilities for building reliable integrations.
Sample Data Generation
from log_sculptor.testing import (
    generate_apache_logs,
    generate_syslog,
    generate_json_logs,
    generate_app_logs,
    write_sample_logs,
)
# Generate Apache logs
for line in generate_apache_logs(count=100, seed=42):
    print(line)
# Generate JSON structured logs
for line in generate_json_logs(count=50):
    print(line)
# Write directly to file
write_sample_logs("test.log", generator="app", count=1000, seed=42)
Mock Objects
from pathlib import Path

from log_sculptor.testing import (
    MockFileReader,
    MockFileWriter,
    MockPattern,
    MockPatternMatcher,
    MockTypeDetector,
)
# Mock file reader
reader = MockFileReader()
reader.add_file("/test.log", ["line1", "line2", "line3"])
lines = reader.read_lines(Path("/test.log"))
assert reader.read_count == 1
# Mock pattern matcher with custom responses
matcher = MockPatternMatcher()
matcher.add_response("GET /api", MockPattern(id="http"), {"method": "GET"})
pattern, fields = matcher.match("GET /api")
Test Fixtures
from pathlib import Path

from log_sculptor.testing import (
    create_test_patterns,
    create_test_log_file,
    SandboxContext,
    isolated_test,
)
# Create test patterns
patterns = create_test_patterns(count=3, with_examples=True)
# Create test log file
create_test_log_file(Path("test.log"), generator="apache", count=100)
# Isolated test environment with mocks
with isolated_test() as ctx:
    log_file = ctx.create_log_file("test.log", generator="app", count=50)
    ctx.add_mock_file("/virtual.log", ["mock line 1", "mock line 2"])
    # Tests run in isolation, temp files cleaned up automatically
Dependency Injection
from log_sculptor import get_container, register, resolve, reset_container
from log_sculptor.di import FileReader, FileWriter
# Register custom implementations
class MyFileReader:
    def read_lines(self, path):
        # Custom implementation
        pass
register(FileReader, lambda: MyFileReader())
reader = resolve(FileReader)
# Reset for test isolation
reset_container()
pytest-agents Integration
log-sculptor integrates with pytest-agents for enhanced test organization and AI-powered testing capabilities.
import pytest

@pytest.mark.unit
def test_pattern_learning(tmp_path):
    """Unit test with pytest-agents marker."""
    from log_sculptor.core.patterns import learn_patterns
    from log_sculptor.testing.generators import write_sample_logs

    log_file = tmp_path / "test.log"
    write_sample_logs(log_file, generator="apache", count=50, seed=42)

    patterns = learn_patterns(log_file)
    assert len(patterns.patterns) > 0

@pytest.mark.integration
def test_full_workflow(tmp_path):
    """Integration test with pytest-agents marker."""
    # Full learn -> save -> load -> parse workflow
    pass

@pytest.mark.performance
def test_large_file_parsing(tmp_path):
    """Performance benchmark test."""
    pass
Install with: pip install pytest-agents (requires Python 3.11+)
License
MIT
File details
Details for the file log_sculptor-0.1.0.tar.gz.
File metadata
- Download URL: log_sculptor-0.1.0.tar.gz
- Upload date:
- Size: 58.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e97dda403dd54becc1849864d755403ece48feb000d5607b095d0fa0796d8da7 |
| MD5 | d35243ba054ba9c52b1ea1a5aef56d21 |
| BLAKE2b-256 | 205d1cc7ee7187d95ee5dc46ebb5068a9493a393dd3595ed6d7280da33d9c78f |
File details
Details for the file log_sculptor-0.1.0-py3-none-any.whl.
File metadata
- Download URL: log_sculptor-0.1.0-py3-none-any.whl
- Upload date:
- Size: 44.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 120ae786db92ed2db4abf682f345560ae358388c433a67db9146688167177599 |
| MD5 | ed313a9acbb248f6d537c5c0d45fd712 |
| BLAKE2b-256 | 12744fe45c48b224015f0c977cfe631387a04687f97a6fad6fd051c7698e9e14 |