Semantic-aware log compression with automatic schema extraction and queryable storage

LogPress - Semantic Log Compression System

Master's Thesis Research Project: Automatic schema extraction from unstructured system logs using constraint-based parsing and semantic-aware compression.

🎯 Research Goals

  • Automatic Schema Discovery: Extract implicit log schemas without manual annotation
  • Semantic-Aware Compression: Achieve 8-30× compression while maintaining queryability
  • Real-World Validation: Tested on diverse log sources (2M+ entries)

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/adam-bouafia/LogPress.git
cd LogPress

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

Interactive Mode (Recommended)

# Beautiful terminal UI with dataset auto-discovery
python -m logpress.cli.interactive

Features:

  • 🔍 Auto-discovers datasets in data/datasets/
  • 📊 Real-time compression progress
  • 🎨 Rich terminal UI with tables and progress bars
  • ⚡ Query compressed logs interactively

Command-Line Usage

# Compress logs
python -m logpress compress \
  -i data/datasets/Apache/Apache_full.log \
  -o evaluation/compressed/apache.lsc \
  --min-support 3 \
  -m

# Query compressed logs
python -m logpress query \
  -c evaluation/compressed/apache.lsc \
  --severity ERROR \
  --limit 20

# Run full evaluation
python evaluation/run_full_evaluation.py

Docker Usage

# Interactive mode (Python rich UI)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Bash menu (alternative)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive-bash

# Run specific command
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc -m

📁 Project Structure (MCP Architecture)

LogPress/
├── logpress/                  # Core Python package (Model-Context-Protocol)
│   ├── models/             # Data structures (Token, LogTemplate, CompressedLog)
│   ├── protocols/          # Abstract interfaces (EncoderProtocol, CompressorProtocol)
│   ├── context/           # Business logic
│   │   ├── tokenization/  # Smart log tokenization (FSM-based)
│   │   ├── extraction/    # Template generation (log alignment algorithm)
│   │   ├── classification/# Semantic type recognition (pattern-based)
│   │   └── encoding/      # Compression codecs (delta, dictionary, varint)
│   ├── services/          # High-level orchestration
│   │   ├── compressor.py  # 6-stage compression pipeline
│   │   ├── query_engine.py# Queryable decompression
│   │   └── evaluator.py   # Accuracy metrics vs ground truth
│   ├── cli/              # User interfaces
│   │   ├── interactive.py # Rich terminal UI
│   │   └── commands.py    # Click-based CLI
│   └── tests/            # Test suite (25 tests, 100% passing)
│       ├── unit/         # Component testing
│       ├── integration/  # Workflow testing
│       ├── e2e/          # End-to-end testing
│       └── performance/  # Benchmarks
│
├── data/                  # Input data
│   ├── datasets/         # 5 real-world log sources (497K entries)
│   │   ├── Apache/       # Web server logs (52K lines)
│   │   ├── HealthApp/    # Android health tracking (212K lines)
│   │   ├── Zookeeper/    # Distributed coordination (74K lines)
│   │   ├── OpenStack/    # Cloud infrastructure (137K lines)
│   │   └── Proxifier/    # Network proxy (21K lines)
│   └── ground_truth/     # Manual annotations for validation
│
├── evaluation/           # Outputs & results
│   ├── compressed/       # .lsc compressed files
│   ├── results/          # Evaluation metrics (JSON/Markdown)
│   └── schema_versions/  # Schema evolution tracking
│
├── deployment/          # Infrastructure
│   ├── Dockerfile       # Container image
│   ├── docker-compose.yml# Service orchestration
│   └── Makefile         # Build automation
│
├── documentation/       # Project documentation
│   ├── README.md        # Documentation index
│   ├── TESTING.md       # Test strategy
│   ├── MCP_ARCHITECTURE.md # System design
│   └── API.md           # Python API reference
│
└── scripts/            # Automation scripts
    ├── logpress-interactive.sh  # Bash interactive menu
    ├── run-tests.sh           # Test suite runner
    └── run-pre-production-tests.sh # Validation

See individual README files in each directory for detailed information.

🔬 Research Methodology

1. Schema Extraction Pipeline

6-Stage Process:

  1. Tokenization: FSM-based parser handles diverse log formats
  2. Semantic Classification: Pattern-based field type detection (timestamp, IP, severity, etc.)
  3. Field Grouping: Identify related fields (ip+port, user+action)
  4. Template Generation: Log alignment algorithm extracts schemas
  5. Schema Versioning: Track format evolution over time
  6. Validation: Compare against manual ground truth (precision/recall)

Example:

Raw Logs:
  [Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP
  [Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable
  
Extracted Template:
  [TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
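
The example above can be reproduced with a very small sketch. This is not LogPress's alignment algorithm: the regular expression, the function name `extract_template`, and the "longest common token prefix" rule are all assumptions made for illustration, and the sketch only handles Apache-style lines that start with two bracketed fields.

```python
import re
from typing import List

# Hypothetical pattern for lines shaped like "[timestamp] [severity] body".
APACHE_LINE = re.compile(r"^\[(?P<timestamp>[^\]]+)\] \[(?P<severity>\w+)\] (?P<body>.*)$")


def extract_template(lines: List[str]) -> str:
    """Build one template from lines that share the bracketed header layout."""
    bodies = []
    for line in lines:
        match = APACHE_LINE.match(line.strip())
        if match is None:
            raise ValueError(f"line does not fit the assumed layout: {line!r}")
        bodies.append(match.group("body").split())

    # Keep the tokens that are identical at the start of every body.
    prefix = []
    for column in zip(*bodies):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])

    # Anything after the shared prefix is treated as the variable message part.
    has_variable_tail = any(len(body) > len(prefix) for body in bodies)
    parts = ["[TIMESTAMP]", "[SEVERITY]"] + prefix
    if has_variable_tail:
        parts.append("[MESSAGE]")
    return " ".join(parts)


logs = [
    "[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP",
    "[Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable",
]
print(extract_template(logs))  # [TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
```

A real pipeline would first classify tokens by semantic type rather than hard-coding the header layout; the sketch only shows how constant tokens survive and variable ones collapse into a placeholder.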

2. Semantic-Aware Compression

Category-Specific Codecs:

  • Timestamps: Delta encoding (8-10× compression)
  • Severity/Status: Dictionary encoding (5-7× compression)
  • Metrics: Gorilla time-series compression (3-5× compression)
  • Messages: Token pool with references (variable)
  • Stack traces: Reference tracking (store once, reuse pointer)
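
To make the timestamp codec concrete, here is a minimal sketch of delta encoding combined with variable-length integers. It assumes timestamps are non-decreasing Unix epoch seconds; the function names are hypothetical and the byte layout is not the `.lsc` format.

```python
from typing import List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only supports non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def delta_encode(timestamps: List[int]) -> bytes:
    """Store each timestamp as the varint-encoded difference from the previous one."""
    out = bytearray()
    previous = 0
    for ts in timestamps:
        out += encode_varint(ts - previous)
        previous = ts
    return bytes(out)


def delta_decode(data: bytes) -> List[int]:
    """Invert delta_encode by accumulating the decoded differences."""
    timestamps, current, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            timestamps.append(current)
            value, shift = 0, 0
    return timestamps


stamps = [1118297224, 1118297225, 1118297225, 1118297290]
encoded = delta_encode(stamps)
assert delta_decode(encoded) == stamps
```

Because consecutive log lines are usually seconds apart, most deltas fit in a single byte, which is where the savings come from.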

Queryable Index: Columnar storage enables filtering without full decompression.
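
As an illustration of why a columnar layout helps queries, the sketch below dictionary-encodes a severity column and then filters on the integer codes alone, without touching any other column. The function names are hypothetical and this is not the LogPress query engine.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each distinct string with a small integer code."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def matching_rows(dictionary: List[str], codes: List[int], wanted: str) -> List[int]:
    """Return row numbers whose code maps to `wanted`, comparing integers only."""
    if wanted not in dictionary:
        return []
    target = dictionary.index(wanted)
    return [row for row, code in enumerate(codes) if code == target]


severities = ["notice", "error", "notice", "error", "warn"]
dictionary, codes = dictionary_encode(severities)
print(matching_rows(dictionary, codes, "error"))  # [1, 3]
```

Only the rows returned by the filter would then need their message and timestamp columns decoded.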

3. Evaluation Metrics

Accuracy (vs manual annotations):

  • Precision: % of extracted fields that are correct
  • Recall: % of actual fields that were found
  • F1-Score: Harmonic mean
  • Target: >90% accuracy
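
The accuracy metrics above can be computed as follows. This is a sketch that assumes extracted and annotated fields are compared as sets of field names; the function name and the sample field names are hypothetical, and the project's evaluator may match fields differently.

```python
from typing import Set, Tuple


def field_accuracy(extracted: Set[str], ground_truth: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for a set of extracted field names."""
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


extracted = {"timestamp", "severity", "message", "pid"}
annotated = {"timestamp", "severity", "message", "component"}
precision, recall, f1 = field_accuracy(extracted, annotated)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```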

Compression Performance:

  • Compression ratio vs gzip baseline
  • Query latency overhead
  • Target: >10× compression, <2× query slowdown
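
A gzip baseline ratio can be measured with the standard library alone. The sketch below is an assumption about how such a baseline might be computed, not the project's evaluation script; the repeated sample line is synthetic and compresses far better than real logs would.

```python
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size; higher is better."""
    if not compressed:
        raise ValueError("compressed data must not be empty")
    return len(original) / len(compressed)


# Synthetic input: one Apache-style line repeated many times.
raw = b"[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP\n" * 1000
baseline = gzip.compress(raw)
print(f"gzip baseline ratio: {compression_ratio(raw, baseline):.1f}x")
```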

🧪 Testing

Run Complete Test Suite

# All tests with coverage
bash scripts/run-tests.sh

# View coverage report
firefox htmlcov/index.html

Pre-Production Validation

# Validate before deployment
bash scripts/run-pre-production-tests.sh

Test Status: ✅ 25/25 tests passing (100%)

  • Unit tests: 9 tests
  • Integration tests: 8 tests
  • E2E tests: 3 tests
  • Performance benchmarks: 5 tests

Performance Benchmarks

# Run benchmarks
python -m pytest logpress/tests/performance/ --benchmark-only

# Expected results:
# - Compression: >500 ops/sec
# - Template extraction: >900 ops/sec
# - Linear scalability: 100 → 10,000 logs

📚 Documentation

🎓 Research Context

Master's Thesis: Automatic Schema Extraction from Unstructured System Logs
Duration: 26 weeks (4 phases)
Target Venues: VLDB, SIGMOD, IEEE BigData
Novel Contribution: Semantic-aware compression adapting to log content types

Related Work

  • Log Parsing: Drain, Spell, LogPai
  • Schema Inference: Lakehouse formats (Parquet, ORC)
  • Compression: Generic (gzip, zstd) vs specialized (LogShrink)

Key Differentiators

  • ✅ No ML models (constraint-based approach)
  • ✅ Semantic awareness (field-type-specific compression)
  • ✅ Query preservation (columnar indexes)
  • ✅ Schema evolution tracking
  • ✅ Lossless compression (exact reconstruction)

🛠️ Development

Setup Development Environment

# Install test dependencies
pip install pytest pytest-cov pytest-benchmark pytest-mock

# Run tests on file changes (watch mode)
pip install pytest-watch
ptw logpress/tests/ -- -v

Contribution Workflow

  1. Create feature branch: git checkout -b feature/new-encoder
  2. Make changes and add tests
  3. Run validation: bash scripts/run-pre-production-tests.sh
  4. Submit PR (GitHub Actions runs full test suite)

Adding New Semantic Type Patterns

# logpress/context/classification/semantic_types.py

import re
from typing import Tuple

def recognize_custom_field(token: str) -> Tuple[str, float]:
    """
    Add pattern for new field type.
    
    Returns:
        (field_type, confidence_score)
    """
    # Three uppercase letters, a hyphen, four digits (e.g. "ABC-1234")
    if re.match(r'^[A-Z]{3}-\d{4}$', token):
        return ('ERROR_CODE', 0.95)  # High confidence
    return ('UNKNOWN', 0.0)

Adding New Compression Codecs

# logpress/context/encoding/custom_encoder.py

from typing import Any, List

from logpress.protocols import EncoderProtocol

class CustomEncoder(EncoderProtocol):
    def encode(self, values: List[Any]) -> bytes:
        # Your encoding logic
        raise NotImplementedError
    
    def decode(self, data: bytes) -> List[Any]:
        # Your decoding logic
        raise NotImplementedError
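
As one possible way to fill in the skeleton above, here is a run-length encoder sketch. To stay self-contained it defines a local stand-in for `EncoderProtocol` with the same two methods; the real interface in `logpress.protocols` may require more. The JSON payload is an illustrative choice, not the project's on-disk format, and the sketch assumes the values are strings.

```python
import json
from typing import Any, List, Protocol


class EncoderProtocol(Protocol):
    """Local stand-in mirroring the encode/decode pair shown in the skeleton."""

    def encode(self, values: List[Any]) -> bytes: ...

    def decode(self, data: bytes) -> List[Any]: ...


class RunLengthEncoder:
    """Collapse consecutive repeats into [value, count] pairs."""

    def encode(self, values: List[Any]) -> bytes:
        runs: List[list] = []
        for value in values:
            if runs and runs[-1][0] == value:
                runs[-1][1] += 1  # extend the current run
            else:
                runs.append([value, 1])  # start a new run
        return json.dumps(runs).encode("utf-8")

    def decode(self, data: bytes) -> List[Any]:
        runs = json.loads(data.decode("utf-8"))
        return [value for value, count in runs for _ in range(count)]


encoder: EncoderProtocol = RunLengthEncoder()
levels = ["INFO", "INFO", "INFO", "ERROR", "INFO", "INFO"]
assert encoder.decode(encoder.encode(levels)) == levels
```

Run-length encoding pays off only for columns with long runs of identical values, such as severity levels in quiet periods.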

📦 Dependencies

Core Libraries

msgpack>=1.0.0          # Serialization
zstandard>=0.21.0       # Compression baseline
python-dateutil>=2.8.0  # Timestamp parsing
regex>=2023.0.0         # Advanced pattern matching
rich>=13.0.0            # Terminal UI
click>=8.1.0            # CLI framework

Testing

pytest>=7.4.0           # Test framework
pytest-cov>=4.1.0       # Coverage reporting
pytest-benchmark>=4.0.0 # Performance testing
pytest-mock>=3.12.0     # Mocking utilities

Optional Tools

# Baseline comparison
gzip --version

# Command-line benchmarking
cargo install hyperfine

# Memory profiling
pip install memory-profiler

๐Ÿณ Docker Deployment

Build & Run

# Build all services
docker-compose -f deployment/docker-compose.yml build

# Run interactive CLI
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Run compression
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc

Environment Variables

# Set in docker-compose.yml
PYTHONUNBUFFERED=1      # Real-time output
TERM=xterm-256color     # Colored terminal
MIN_SUPPORT=3           # Template extraction threshold
ZSTD_LEVEL=15           # Compression level (1-22)
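
How LogPress reads these variables is not documented here. The sketch below shows one conventional way to parse the two tuning values in Python; the function name `load_settings`, the fallback defaults (copied from the values listed above), and the validation rules are assumptions.

```python
import os
from typing import Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Parse MIN_SUPPORT and ZSTD_LEVEL, falling back to the values listed above."""
    min_support = int(env.get("MIN_SUPPORT", "3"))
    zstd_level = int(env.get("ZSTD_LEVEL", "15"))
    if min_support < 1:
        raise ValueError("MIN_SUPPORT must be at least 1")
    if not 1 <= zstd_level <= 22:
        raise ValueError("ZSTD_LEVEL must be between 1 and 22")
    return {"min_support": min_support, "zstd_level": zstd_level}


print(load_settings({"ZSTD_LEVEL": "19"}))  # {'min_support': 3, 'zstd_level': 19}
```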

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

  • Additional semantic type patterns
  • New compression codecs
  • Query optimization
  • Schema visualization
  • Performance improvements

📄 License

MIT License - see LICENSE file for details.

🔗 Links

📞 Contact


Status: ✅ Production Ready | 🧪 All Tests Passing (25/25) | 📊 Coverage: 42%

Built with ❤️ for research in log analysis and semantic compression.
