Semantic-aware log compression with automatic schema extraction and queryable storage

LogPress - Semantic Log Compression System

Master's Thesis Research Project: Automatic schema extraction from unstructured system logs using constraint-based parsing and semantic-aware compression.

🎯 Research Goals

  • Automatic Schema Discovery: Extract implicit log schemas without manual annotation
  • Semantic-Aware Compression: Achieve 8-30× compression while maintaining queryability
  • Real-World Validation: Tested on diverse log sources (2M+ entries)

🚀 Quick Start

Installation

Preferred: Install from PyPI

# Install from PyPI (recommended)
pip install LogPress

Alternative: Docker (no Python setup required)

# Interactive mode
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

From source (developer mode)

# Clone repository
git clone https://github.com/adam-bouafia/LogPress.git
cd LogPress

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

Interactive Mode (Recommended)

# Beautiful terminal UI with dataset auto-discovery
python -m logpress.cli.interactive

Features:

  • 🔍 Auto-discovers datasets in data/datasets/
  • 📊 Real-time compression progress
  • 🎨 Rich terminal UI with tables and progress bars
  • ⚡ Query compressed logs interactively

Command-Line Usage

# Compress logs
python -m logpress compress \
  -i data/datasets/Apache/Apache_full.log \
  -o evaluation/compressed/apache.lsc \
  --min-support 3 \
  -m

# Query compressed logs
python -m logpress query \
  -c evaluation/compressed/apache.lsc \
  --severity ERROR \
  --limit 20

# Run full evaluation
python evaluation/run_full_evaluation.py

Docker Usage

# Interactive mode (Python rich UI)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Bash menu (alternative)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive-bash

# Run specific command
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc -m

Pre-built Docker Image (GHCR & Docker Hub)

We publish pre-built Docker images to the GitHub Container Registry (GHCR) and mirror to Docker Hub. There are two ways to run LogPress with Docker:

  1. From a local clone (recommended for development):
# Clone repository and run with docker-compose
git clone https://github.com/adam-bouafia/LogPress.git
cd LogPress
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive
  2. Use pre-built images from GHCR or Docker Hub (recommended for quick start):
# Pull the image from GHCR
docker pull ghcr.io/adam-bouafia/logpress:latest

# Or pull from Docker Hub mirror
docker pull adambouafia/logpress:latest

# Run the CLI (example: show version)
docker run --rm ghcr.io/adam-bouafia/logpress:latest python -m logpress --version

# Run a compress command using the GHCR image
docker run --rm \
  -v "$(pwd)/data:/app/data:ro" \
  -v "$(pwd)/evaluation:/app/evaluation:rw" \
  ghcr.io/adam-bouafia/logpress:latest \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc -m

If you prefer Docker Hub, the same images are mirrored there as adambouafia/logpress:latest and under version-specific tags such as adambouafia/logpress:1.0.1.

If you prefer Docker Hub, you or the CI workflow can mirror the image to Docker Hub with the adambouafia/logpress:latest tag. For example:

# (Optional) Tag and push to Docker Hub (requires Docker Hub credentials)
docker tag ghcr.io/adam-bouafia/logpress:latest adambouafia/logpress:latest
docker login --username <docker-hub-username>
docker push adambouafia/logpress:latest

📁 Project Structure (MCP Architecture)

LogPress/
├── logpress/                  # Core Python package (Model-Context-Protocol)
│   ├── models/             # Data structures (Token, LogTemplate, CompressedLog)
│   ├── protocols/          # Abstract interfaces (EncoderProtocol, CompressorProtocol)
│   ├── context/           # Business logic
│   │   ├── tokenization/  # Smart log tokenization (FSM-based)
│   │   ├── extraction/    # Template generation (log alignment algorithm)
│   │   ├── classification/# Semantic type recognition (pattern-based)
│   │   └── encoding/      # Compression codecs (delta, dictionary, varint)
│   ├── services/          # High-level orchestration
│   │   ├── compressor.py  # 6-stage compression pipeline
│   │   ├── query_engine.py# Queryable decompression
│   │   └── evaluator.py   # Accuracy metrics vs ground truth
│   ├── cli/              # User interfaces
│   │   ├── interactive.py # Rich terminal UI
│   │   └── commands.py    # Click-based CLI
│   └── tests/            # Test suite (25 tests, 100% passing)
│       ├── unit/         # Component testing
│       ├── integration/  # Workflow testing
│       ├── e2e/          # End-to-end testing
│       └── performance/  # Benchmarks
│
├── data/                  # Input data
│   ├── datasets/         # 8 real-world log sources (~1.07M entries)
│   │   ├── Apache/       # Web server logs (52K lines)
│   │   ├── HealthApp/    # Android health tracking (212K lines)
│   │   ├── HPC/          # High-performance computing cluster logs (433K lines)
│   │   ├── Linux/        # Linux system logs (26K lines)
│   │   ├── Mac/          # macOS system logs (117K lines)
│   │   ├── OpenStack/    # Cloud infrastructure logs (137K lines)
│   │   ├── Proxifier/    # Network proxy logs (21K lines)
│   │   └── Zookeeper/    # Distributed coordination logs (74K lines)
│   └── ground_truth/     # Manual annotations for validation
│
├── evaluation/           # Outputs & results
│   ├── compressed/       # .lsc compressed files
│   ├── results/          # Evaluation metrics (JSON/Markdown)
│   └── schema_versions/  # Schema evolution tracking
│
├── deployment/          # Infrastructure
│   ├── Dockerfile       # Container image
│   ├── docker-compose.yml# Service orchestration
│   └── Makefile         # Build automation
│
├── documentation/       # Project documentation
│   ├── README.md        # Documentation index
│   ├── TESTING.md       # Test strategy
│   ├── MCP_ARCHITECTURE.md # System design
│   └── API.md           # Python API reference
│
└── scripts/            # Automation scripts
    ├── logpress-interactive.sh  # Bash interactive menu
    ├── run-tests.sh           # Test suite runner
    └── run-pre-production-tests.sh # Validation

See individual README files in each directory for detailed information.

🔬 Research Methodology

1. Schema Extraction Pipeline

6-Stage Process:

  1. Tokenization: FSM-based parser handles diverse log formats
  2. Semantic Classification: Pattern-based field type detection (timestamp, IP, severity, etc.)
  3. Field Grouping: Identify related fields (ip+port, user+action)
  4. Template Generation: Log alignment algorithm extracts schemas
  5. Schema Versioning: Track format evolution over time
  6. Validation: Compare against manual ground truth (precision/recall)

Example:

Raw Logs:
  [Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP
  [Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable
  
Extracted Template:
  [TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
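
As a rough illustration of the example above, the following sketch classifies the two bracketed fields with regular expressions and replaces the varying message tail with a placeholder. The patterns and the `to_template` helper are assumptions made for illustration; they are not LogPress's actual tokenizer or log alignment algorithm.

```python
import re

# Hypothetical patterns for illustration only; not LogPress's real classifier.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")
SEVERITY = re.compile(r"^(emerg|alert|crit|error|warn|notice|info|debug)$")
LINE = re.compile(
    r"^\[(?P<first>[^\]]+)\] \[(?P<second>[^\]]+)\] (?P<prefix>\w+): (?P<rest>.*)$"
)


def to_template(line: str) -> str:
    """Map one Apache-style error log line to a template string."""
    match = LINE.match(line)
    if match is None:
        return "[MESSAGE]"  # fall back to an untyped message
    first = "TIMESTAMP" if TIMESTAMP.match(match["first"]) else "UNKNOWN"
    second = "SEVERITY" if SEVERITY.match(match["second"]) else "UNKNOWN"
    return f"[{first}] [{second}] {match['prefix']}: [MESSAGE]"


logs = [
    "[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP",
    "[Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable",
]
# Both lines collapse to the same template, so the set has one element.
templates = {to_template(line) for line in logs}
print(templates)
```

A real pipeline would infer the line shape from the data instead of hard-coding it; the fixed `LINE` pattern here only keeps the sketch short.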

2. Semantic-Aware Compression

Category-Specific Codecs:

  • Timestamps: Delta encoding (8-10× compression)
  • Severity/Status: Dictionary encoding (5-7× compression)
  • Metrics: Gorilla time-series compression (3-5× compression)
  • Messages: Token pool with references (variable)
  • Stack traces: Reference tracking (store once, reuse pointer)
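
To make the first two codecs in the list concrete, here is a minimal sketch of delta encoding for integer timestamps and dictionary encoding for severity strings. The function names are hypothetical and the sketch omits the byte-level packing (for example varint) that a real codec would apply afterwards; it is not taken from the LogPress source.

```python
from typing import Dict, List, Tuple


def delta_encode(timestamps: List[int]) -> List[int]:
    """Keep the first value, then store each difference to the previous one."""
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    values: List[int] = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each distinct string with a small integer code."""
    codes: Dict[str, int] = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return list(codes), encoded


stamps = [1118297224, 1118297225, 1118297225, 1118297230]
deltas = delta_encode(stamps)  # small numbers after the first entry
vocab, codes = dictionary_encode(["notice", "notice", "error", "notice"])
```

Because consecutive log timestamps are usually close together, the deltas are small integers that pack into far fewer bytes than the absolute values.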

Queryable Index: Columnar storage enables filtering without full decompression.
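
The idea behind columnar filtering can be sketched as follows: the filter scans only the compact severity column, and the message column is touched only for rows that matched. The in-memory layout and the `query_severity` helper are assumptions for illustration; the actual .lsc format and query engine may work differently.

```python
from typing import Dict, List

# Hypothetical in-memory columnar layout; the real .lsc format may differ.
columns: Dict[str, list] = {
    "severity_dict": ["notice", "error"],
    "severity": [0, 0, 1, 0, 1],  # dictionary codes, one per row
    "message": ["a", "b", "disk full", "c", "timeout"],
}


def query_severity(cols: Dict[str, list], level: str, limit: int) -> List[str]:
    """Return up to `limit` messages whose severity equals `level`."""
    if level not in cols["severity_dict"]:
        return []
    code = cols["severity_dict"].index(level)
    # Scan only the small integer column to find matching row ids.
    rows = [i for i, c in enumerate(cols["severity"]) if c == code][:limit]
    # Touch the message column only for the rows that matched.
    return [cols["message"][i] for i in rows]


errors = query_severity(columns, "error", limit=20)
print(errors)
```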

3. Evaluation Metrics

Accuracy (vs manual annotations):

  • Precision: % of extracted fields that are correct
  • Recall: % of actual fields that were found
  • F1-Score: Harmonic mean
  • Target: >90% accuracy
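
The three accuracy metrics follow directly from set overlap between extracted and annotated fields. The sketch below assumes fields are compared by name; the field names are made up for the example, and the project's evaluator may match fields differently.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from two sets of field names."""
    true_pos = len(extracted & truth)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative field names only: 3 of 4 extracted fields are correct,
# and 3 of 4 annotated fields were found.
extracted = {"timestamp", "severity", "component", "pid"}
truth = {"timestamp", "severity", "component", "message"}
p, r, f1 = precision_recall_f1(extracted, truth)
```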

Compression Performance:

  • Compression ratio vs gzip baseline
  • Query latency overhead
  • Target: >10× compression, <2× query slowdown
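
A compression ratio is the original size divided by the compressed size. The sketch below measures a gzip baseline with the standard library on a synthetic, highly repetitive sample; the sample data and the `compression_ratio` helper are assumptions for illustration and say nothing about the ratios LogPress achieves.

```python
import gzip


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Original bytes divided by compressed bytes (higher is better)."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


# Synthetic sample: one line repeated, so gzip compresses it unusually well.
raw = b"[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP\n" * 1000
gzip_ratio = compression_ratio(len(raw), len(gzip.compress(raw)))
print(f"gzip baseline: {gzip_ratio:.1f}x")
```

To compare against a .lsc file, the same helper can be fed the file sizes reported by `os.path.getsize` for the raw log and the compressed output.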

🧪 Testing

Run Complete Test Suite

# All tests with coverage
bash scripts/run-tests.sh

# View coverage report
firefox htmlcov/index.html

Pre-Production Validation

# Validate before deployment
bash scripts/run-pre-production-tests.sh

Test Status: ✅ 25/25 tests passing (100%)

  • Unit tests: 9 tests
  • Integration tests: 8 tests
  • E2E tests: 3 tests
  • Performance benchmarks: 5 tests

Performance Benchmarks

# Run benchmarks
python -m pytest logpress/tests/performance/ --benchmark-only

# Expected results:
# - Compression: >500 ops/sec
# - Template extraction: >900 ops/sec
# - Linear scalability: 100 → 10,000 logs

📚 Documentation

🎓 Research Context

Master's Thesis: Automatic Schema Extraction from Unstructured System Logs
Duration: 26 weeks (4 phases)
Target Venues: VLDB, SIGMOD, IEEE BigData
Novel Contribution: Semantic-aware compression adapting to log content types

Related Work

  • Log Parsing: Drain, Spell, LogPai
  • Schema Inference: Lakehouse formats (Parquet, ORC)
  • Compression: Generic (gzip, zstd) vs specialized (LogShrink)

Key Differentiators

  • ✅ No ML models (constraint-based approach)
  • ✅ Semantic awareness (field-type-specific compression)
  • ✅ Query preservation (columnar indexes)
  • ✅ Schema evolution tracking
  • ✅ Lossless compression (exact reconstruction)

🛠️ Development

Setup Development Environment

# Install test dependencies
pip install pytest pytest-cov pytest-benchmark pytest-mock

# Run tests on file changes (watch mode)
pip install pytest-watch
ptw logpress/tests/ -- -v

Contribution Workflow

  1. Create feature branch: git checkout -b feature/new-encoder
  2. Make changes and add tests
  3. Run validation: bash scripts/run-pre-production-tests.sh
  4. Submit PR (GitHub Actions runs full test suite)

Adding New Semantic Type Patterns

# logpress/context/classification/semantic_types.py

import re
from typing import Tuple


def recognize_custom_field(token: str) -> Tuple[str, float]:
    """
    Add pattern for new field type.

    Returns:
        (field_type, confidence_score)
    """
    # Matches codes such as "ABC-1234": three capitals, a hyphen, four digits
    if re.match(r'^[A-Z]{3}-\d{4}$', token):
        return ('ERROR_CODE', 0.95)  # High confidence
    return ('UNKNOWN', 0.0)
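
One way to combine several recognizers of this shape is to run each on the token and keep the most confident answer. The `classify` helper and the `recognize_ipv4` recognizer below are hypothetical; how LogPress registers and dispatches its recognizers is not shown in this README.

```python
import re
from typing import Callable, List, Tuple

Recognizer = Callable[[str], Tuple[str, float]]


def recognize_error_code(token: str) -> Tuple[str, float]:
    """Same pattern as recognize_custom_field above."""
    if re.match(r"^[A-Z]{3}-\d{4}$", token):
        return ("ERROR_CODE", 0.95)
    return ("UNKNOWN", 0.0)


def recognize_ipv4(token: str) -> Tuple[str, float]:
    """Hypothetical second recognizer: a loose dotted-quad check."""
    if re.match(r"^\d{1,3}(\.\d{1,3}){3}$", token):
        return ("IP_ADDRESS", 0.9)
    return ("UNKNOWN", 0.0)


def classify(token: str, recognizers: List[Recognizer]) -> Tuple[str, float]:
    """Return the (field_type, confidence) pair with the highest confidence."""
    results = [recognize(token) for recognize in recognizers]
    return max(results, key=lambda pair: pair[1], default=("UNKNOWN", 0.0))


recognizers = [recognize_error_code, recognize_ipv4]
print(classify("ABC-1234", recognizers))
print(classify("10.0.0.1", recognizers))
```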

Adding New Compression Codecs

# logpress/context/encoding/custom_encoder.py

from typing import Any, List

from logpress.protocols import EncoderProtocol


class CustomEncoder(EncoderProtocol):
    def encode(self, values: List[Any]) -> bytes:
        # Your encoding logic
        raise NotImplementedError

    def decode(self, data: bytes) -> List[Any]:
        # Your decoding logic
        raise NotImplementedError
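
For a filled-in example, the sketch below implements a simple run-length encoder with the same `encode`/`decode` shape. To stay self-contained it declares a stand-in `EncoderProtocol` instead of importing the real one, and it serializes with JSON for readability; both choices are assumptions, and the real protocol may require more than these two methods.

```python
import json
from typing import Any, List, Protocol


class EncoderProtocol(Protocol):
    """Stand-in for logpress.protocols.EncoderProtocol (assumed shape)."""

    def encode(self, values: List[Any]) -> bytes: ...

    def decode(self, data: bytes) -> List[Any]: ...


class RunLengthEncoder:
    """Collapse consecutive repeats into [value, count] pairs.

    Intended for JSON-serializable values of a single type, such as strings.
    """

    def encode(self, values: List[Any]) -> bytes:
        runs: List[List[Any]] = []
        for value in values:
            if runs and runs[-1][0] == value:
                runs[-1][1] += 1  # extend the current run
            else:
                runs.append([value, 1])  # start a new run
        return json.dumps(runs).encode("utf-8")

    def decode(self, data: bytes) -> List[Any]:
        runs = json.loads(data.decode("utf-8"))
        return [value for value, count in runs for _ in range(count)]


encoder: EncoderProtocol = RunLengthEncoder()
levels = ["INFO", "INFO", "INFO", "ERROR", "INFO"]
# Round trip must reproduce the input exactly (lossless).
assert encoder.decode(encoder.encode(levels)) == levels
```

Run-length encoding pays off on columns with long runs of identical values, such as severity levels in quiet periods; on high-cardinality columns it can make the output larger.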

📦 Dependencies

Core Libraries

msgpack>=1.0.0          # Serialization
zstandard>=0.21.0       # Compression baseline
python-dateutil>=2.8.0  # Timestamp parsing
regex>=2023.0.0         # Advanced pattern matching
rich>=13.0.0            # Terminal UI
click>=8.1.0            # CLI framework

Testing

pytest>=7.4.0           # Test framework
pytest-cov>=4.1.0       # Coverage reporting
pytest-benchmark>=4.0.0 # Performance testing
pytest-mock>=3.12.0     # Mocking utilities

Optional Tools

# Baseline comparison
gzip --version

# Command-line benchmarking
cargo install hyperfine

# Memory profiling
pip install memory-profiler

🐳 Docker Deployment

Build & Run

# Build all services
docker-compose -f deployment/docker-compose.yml build

# Run interactive CLI
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Run compression
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc

Environment Variables

# Set in docker-compose.yml
PYTHONUNBUFFERED=1      # Real-time output
TERM=xterm-256color     # Colored terminal
MIN_SUPPORT=3           # Template extraction threshold
ZSTD_LEVEL=15           # Compression level (1-22)
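
If you want to honor the same variables in your own scripts, a minimal sketch could read them with the defaults listed above and validate the compression level. The `load_settings` helper is hypothetical; how LogPress itself consumes these variables is not documented here.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read MIN_SUPPORT and ZSTD_LEVEL, falling back to the values listed above."""
    min_support = int(env.get("MIN_SUPPORT", "3"))
    zstd_level = int(env.get("ZSTD_LEVEL", "15"))
    if not 1 <= zstd_level <= 22:
        raise ValueError(f"ZSTD_LEVEL must be in 1-22, got {zstd_level}")
    return {"min_support": min_support, "zstd_level": zstd_level}


# Passing an explicit mapping keeps the example independent of the real environment.
print(load_settings({"ZSTD_LEVEL": "19"}))
```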

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

  • Additional semantic type patterns
  • New compression codecs
  • Query optimization
  • Schema visualization
  • Performance improvements

📄 License

MIT License - see LICENSE file for details.

🔗 Links

📞 Contact


Status: ✅ Production Ready | 🧪 All Tests Passing (25/25) | 📊 Coverage: 42%

Built with ❤️ for research in log analysis and semantic compression.
