
log-sculptor

Python 3.10+ · MIT License

Parse unstructured logs by learning patterns automatically. No regex required.

Features

  • Automatic pattern learning - Analyzes log files and learns structural patterns
  • Smart field naming - Infers meaningful field names from context (method, status, path, etc.)
  • Type detection - Automatically detects timestamps, IPs, URLs, UUIDs, numbers, booleans
  • Multi-line support - Handles stack traces and continuation lines
  • Format drift detection - Detects when log formats change mid-file
  • Multiple outputs - JSON Lines, SQLite, DuckDB, Parquet
  • Performance optimized - Streaming processing and parallel learning for large files
  • Testing utilities - Mocks, fixtures, and sample data generators for reliable integrations

Installation

pip install log-sculptor

# With optional outputs
pip install log-sculptor[duckdb]    # DuckDB support
pip install log-sculptor[parquet]   # Parquet support
pip install log-sculptor[all]       # All optional dependencies

Quick Start

# Learn and parse in one step
log-sculptor auto server.log -f jsonl -o parsed.jsonl

# Or separate steps for reuse
log-sculptor learn server.log -o patterns.json
log-sculptor parse server.log -p patterns.json -f jsonl -o parsed.jsonl

CLI Commands

auto

Learn patterns and parse in one step.

log-sculptor auto server.log -f jsonl -o output.jsonl
log-sculptor auto server.log -f sqlite -o logs.db
log-sculptor auto server.log -f duckdb -o logs.duckdb  # requires [duckdb]
log-sculptor auto server.log -f parquet -o logs.parquet  # requires [parquet]

# With multi-line support (stack traces, continuations)
log-sculptor auto server.log --multiline -f jsonl -o output.jsonl

learn

Learn patterns from a log file.

log-sculptor learn server.log -o patterns.json

# With clustering for similar patterns
log-sculptor learn server.log -o patterns.json --cluster

# Incremental learning (update existing patterns)
log-sculptor learn new.log --update patterns.json -o patterns.json

# Handle multi-line entries
log-sculptor learn server.log -o patterns.json --multiline

parse

Parse a log file using learned patterns.

log-sculptor parse server.log -p patterns.json -f jsonl -o output.jsonl
log-sculptor parse server.log -p patterns.json -f sqlite -o logs.db --include-raw

show

Display patterns from a patterns file.

log-sculptor show patterns.json

validate

Validate patterns against a log file.

log-sculptor validate patterns.json server.log
# Exit codes: 0 = all matched, 1 = partial match, 2 = no matches
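
The exit code makes validate easy to gate a CI step on. A minimal sketch (the log and patterns paths are illustrative):

import subprocess
import sys

# 0 = all matched, 1 = partial match, 2 = no matches (per the CLI docs above)
result = subprocess.run(["log-sculptor", "validate", "patterns.json", "server.log"])

if result.returncode == 1:
    print("partial match - patterns may need updating", file=sys.stderr)
elif result.returncode == 2:
    print("no matches - wrong patterns file?", file=sys.stderr)
sys.exit(result.returncode)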

merge

Merge similar patterns in a patterns file.

log-sculptor merge patterns.json -o merged.json --threshold 0.8

drift

Detect format changes in a log file.

log-sculptor drift server.log -p patterns.json
log-sculptor drift server.log -p patterns.json --window 50

fast-learn

Learn patterns using parallel processing (for large files).

log-sculptor fast-learn large.log -o patterns.json --workers 4

generate

Generate sample log data for testing and demos.

log-sculptor generate sample.log -t app -n 1000
log-sculptor generate apache.log -t apache -n 500 --seed 42
log-sculptor generate mixed.log -t mixed -n 1000  # For drift testing

Available types: app, apache, syslog, json, mixed

Output Formats

JSON Lines (jsonl)

{"line_number": 1, "pattern_id": "a1b2c3", "matched": true, "fields": {"timestamp": "2024-01-15T10:30:00", "level": "INFO", "message": "Server started"}}

SQLite

Creates two tables:

  • patterns - Pattern metadata (id, frequency, confidence, structure, example)
  • logs - Parsed records with extracted fields as columns (queried in the sketch below)
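
Since extracted fields become columns on the logs table, the database can be queried directly with the standard sqlite3 module. A sketch, assuming the logs table carries a pattern_id column matching the JSONL output above:

import sqlite3

conn = sqlite3.connect("logs.db")
# Count parsed records per learned pattern.
for pattern_id, count in conn.execute(
    "SELECT pattern_id, COUNT(*) FROM logs GROUP BY pattern_id"
):
    print(pattern_id, count)
conn.close()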

DuckDB

Same schema as SQLite, optimized for analytical queries. Requires pip install log-sculptor[duckdb].

Parquet

Columnar format for efficient analytics. Creates output.parquet for logs and output_patterns.parquet for patterns. Requires pip install log-sculptor[parquet].
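
The Parquet files drop straight into standard analytics tooling. A minimal sketch with pandas (any Parquet reader works; pandas is an assumption here, not a dependency of log-sculptor):

import pandas as pd

# File names follow the convention described above.
logs = pd.read_parquet("output.parquet")
patterns = pd.read_parquet("output_patterns.parquet")
print(f"{len(logs)} records parsed against {len(patterns)} patterns")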

Python API

from log_sculptor.core import learn_patterns, parse_logs, PatternSet

# Learn patterns
patterns = learn_patterns("server.log")
patterns.save("patterns.json")

# Parse logs
for record in parse_logs("server.log", patterns):
    print(record.fields)

# Load existing patterns
patterns = PatternSet.load("patterns.json")

# Incremental learning
new_patterns = learn_patterns("new_logs.log")
patterns.update(new_patterns, merge=True)

# Merge similar patterns
patterns.merge_similar(threshold=0.8)

Streaming for Large Files

from log_sculptor.core.streaming import stream_parse, parallel_learn

# Memory-efficient parsing
for record in stream_parse("large.log", patterns):
    process(record)

# Parallel pattern learning
patterns = parallel_learn("large.log", num_workers=4)

Format Drift Detection

from log_sculptor.core import detect_drift

report = detect_drift("server.log", patterns)
print(f"Format changes: {len(report.format_changes)}")
for change in report.format_changes:
    print(f"  Line {change.line_number}: {change.old_pattern_id} -> {change.new_pattern_id}")

How It Works

  1. Tokenization - Lines are split into typed tokens (TIMESTAMP, IP, QUOTED, BRACKET, NUMBER, WORD, PUNCT, WHITESPACE)

  2. Clustering - Lines with identical token signatures are grouped together (see the sketch after this list)

  3. Pattern Generation - Each cluster becomes a pattern with fields for variable tokens

  4. Smart Naming - Field names are inferred from context:

    • Previous token as indicator ("status 200" -> field named "status")
    • Value patterns (GET/POST -> "method", 404 -> "status", /api/users -> "path")
    • Token types (timestamps, IPs, UUIDs get appropriate names)

  5. Type Detection - Field values are typed:

    • Timestamps (ISO 8601, Apache CLF, syslog, Unix epoch)
    • IPs, URLs, UUIDs
    • Integers, floats, booleans
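
To make steps 1-3 concrete, here is a self-contained sketch of signature-based clustering. The tokenizer is deliberately crude and illustrative only; it is not the library's implementation:

import re
from collections import defaultdict

def signature(line: str) -> tuple:
    """Step 1: classify each whitespace-separated token by shape."""
    sig = []
    for tok in line.split():
        if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", tok):
            sig.append("IP")
        elif re.fullmatch(r"\d+", tok):
            sig.append("NUMBER")
        else:
            sig.append("WORD")
    return tuple(sig)

# Step 2: lines with identical signatures fall into the same cluster.
clusters = defaultdict(list)
for line in ["GET /a 200", "POST /b 404", "server started"]:
    clusters[signature(line)].append(line)

# Step 3: each cluster would become one pattern, with the variable
# tokens (here the NUMBERs and path WORDs) promoted to fields.
for sig, lines in clusters.items():
    print(sig, "->", lines)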

Testing Utilities

log-sculptor includes comprehensive testing utilities for building reliable integrations.

Sample Data Generation

from log_sculptor.testing import (
    generate_apache_logs,
    generate_syslog,
    generate_json_logs,
    generate_app_logs,
    write_sample_logs,
)

# Generate Apache logs
for line in generate_apache_logs(count=100, seed=42):
    print(line)

# Generate JSON structured logs
for line in generate_json_logs(count=50):
    print(line)

# Write directly to file
write_sample_logs("test.log", generator="app", count=1000, seed=42)

Mock Objects

from pathlib import Path

from log_sculptor.testing import (
    MockFileReader,
    MockFileWriter,
    MockPattern,  # assumed to live alongside the other mocks
    MockPatternMatcher,
    MockTypeDetector,
)

# Mock file reader
reader = MockFileReader()
reader.add_file("/test.log", ["line1", "line2", "line3"])
lines = reader.read_lines(Path("/test.log"))
assert reader.read_count == 1

# Mock pattern matcher with custom responses
matcher = MockPatternMatcher()
matcher.add_response("GET /api", MockPattern(id="http"), {"method": "GET"})
pattern, fields = matcher.match("GET /api")

Test Fixtures

from pathlib import Path

from log_sculptor.testing import (
    create_test_patterns,
    create_test_log_file,
    SandboxContext,
    isolated_test,
)

# Create test patterns
patterns = create_test_patterns(count=3, with_examples=True)

# Create test log file
create_test_log_file(Path("test.log"), generator="apache", count=100)

# Isolated test environment with mocks
with isolated_test() as ctx:
    log_file = ctx.create_log_file("test.log", generator="app", count=50)
    ctx.add_mock_file("/virtual.log", ["mock line 1", "mock line 2"])
    # Tests run in isolation, temp files cleaned up automatically

Dependency Injection

from log_sculptor import get_container, register, resolve, reset_container
from log_sculptor.di import FileReader, FileWriter

# Register custom implementations
class MyFileReader:
    def read_lines(self, path):
        # Custom implementation
        pass

register(FileReader, lambda: MyFileReader())
reader = resolve(FileReader)

# Reset for test isolation
reset_container()

pytest-agents Integration

log-sculptor integrates with pytest-agents for enhanced test organization and AI-powered testing capabilities.

import pytest

@pytest.mark.unit
def test_pattern_learning(tmp_path):
    """Unit test with pytest-agents marker."""
    from log_sculptor.core.patterns import learn_patterns
    from log_sculptor.testing.generators import write_sample_logs

    log_file = tmp_path / "test.log"
    write_sample_logs(log_file, generator="apache", count=50, seed=42)

    patterns = learn_patterns(log_file)
    assert len(patterns.patterns) > 0

@pytest.mark.integration
def test_full_workflow(tmp_path):
    """Integration test with pytest-agents marker."""
    # Full learn -> save -> load -> parse workflow
    pass

@pytest.mark.performance
def test_large_file_parsing(tmp_path):
    """Performance benchmark test."""
    pass

Install with: pip install pytest-agents (requires Python 3.11+)

License

MIT
