Semantic-aware log compression with automatic schema extraction and queryable storage

LogPress - Semantic Log Compression System

Master's Thesis Research Project: Automatic schema extraction from unstructured system logs using constraint-based parsing and semantic-aware compression.

🎯 Research Goals

Automatic Schema Discovery: Extract implicit log schemas without manual annotation
Semantic-Aware Compression: Achieve 8-30× compression while maintaining queryability
Real-World Validation: Tested on diverse log sources (2M+ entries)

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/adam-bouafia/LogPress.git
cd LogPress

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

Interactive Mode (Recommended)

# Beautiful terminal UI with dataset auto-discovery
python -m logpress.cli.interactive

Features:

🔍 Auto-discovers datasets in data/datasets/
📊 Real-time compression progress
🎨 Rich terminal UI with tables and progress bars
⚡ Query compressed logs interactively

Command-Line Usage

# Compress logs
python -m logpress compress \
  -i data/datasets/Apache/Apache_full.log \
  -o evaluation/compressed/apache.lsc \
  --min-support 3 \
  -m

# Query compressed logs
python -m logpress query \
  -c evaluation/compressed/apache.lsc \
  --severity ERROR \
  --limit 20

# Run full evaluation
python evaluation/run_full_evaluation.py

Docker Usage

# Interactive mode (Python rich UI)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Bash menu (alternative)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive-bash

# Run specific command
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc -m

📁 Project Structure (MCP Architecture)

LogPress/
├── logpress/                  # Core Python package (Model-Context-Protocol)
│   ├── models/             # Data structures (Token, LogTemplate, CompressedLog)
│   ├── protocols/          # Abstract interfaces (EncoderProtocol, CompressorProtocol)
│   ├── context/           # Business logic
│   │   ├── tokenization/  # Smart log tokenization (FSM-based)
│   │   ├── extraction/    # Template generation (log alignment algorithm)
│   │   ├── classification/# Semantic type recognition (pattern-based)
│   │   └── encoding/      # Compression codecs (delta, dictionary, varint)
│   ├── services/          # High-level orchestration
│   │   ├── compressor.py  # 6-stage compression pipeline
│   │   ├── query_engine.py# Queryable decompression
│   │   └── evaluator.py   # Accuracy metrics vs ground truth
│   ├── cli/              # User interfaces
│   │   ├── interactive.py # Rich terminal UI
│   │   └── commands.py    # Click-based CLI
│   └── tests/            # Test suite (25 tests, 100% passing)
│       ├── unit/         # Component testing
│       ├── integration/  # Workflow testing
│       ├── e2e/          # End-to-end testing
│       └── performance/  # Benchmarks
│
├── data/                  # Input data
│   ├── datasets/         # 5 real-world log sources (497K entries)
│   │   ├── Apache/       # Web server logs (52K lines)
│   │   ├── HealthApp/    # Android health tracking (212K lines)
│   │   ├── Zookeeper/    # Distributed coordination (74K lines)
│   │   ├── OpenStack/    # Cloud infrastructure (137K lines)
│   │   └── Proxifier/    # Network proxy (21K lines)
│   └── ground_truth/     # Manual annotations for validation
│
├── evaluation/           # Outputs & results
│   ├── compressed/       # .lsc compressed files
│   ├── results/          # Evaluation metrics (JSON/Markdown)
│   └── schema_versions/  # Schema evolution tracking
│
├── deployment/          # Infrastructure
│   ├── Dockerfile       # Container image
│   ├── docker-compose.yml# Service orchestration
│   └── Makefile         # Build automation
│
├── documentation/       # Project documentation
│   ├── README.md        # Documentation index
│   ├── TESTING.md       # Test strategy
│   ├── MCP_ARCHITECTURE.md # System design
│   └── API.md           # Python API reference
│
└── scripts/            # Automation scripts
    ├── logpress-interactive.sh  # Bash interactive menu
    ├── run-tests.sh           # Test suite runner
    └── run-pre-production-tests.sh # Validation

See individual README files in each directory for detailed information.

🔬 Research Methodology

1. Schema Extraction Pipeline

6-Stage Process:

Tokenization: FSM-based parser handles diverse log formats
Semantic Classification: Pattern-based field type detection (timestamp, IP, severity, etc.)
Field Grouping: Identify related fields (ip+port, user+action)
Template Generation: Log alignment algorithm extracts schemas
Schema Versioning: Track format evolution over time
Validation: Compare against manual ground truth (precision/recall)

Example:

Raw Logs:
  [Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP
  [Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable
  
Extracted Template:
  [TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
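
As a rough illustration of how the template above could be derived, the sketch below masks the two bracketed fields with a regular expression and keeps the longest token prefix shared by all message bodies. It is a simplified stand-in, not the project's log alignment algorithm; `LINE_RE` and `extract_template` are hypothetical names, and the line format is assumed to match the Apache example.

```python
import re
from typing import List

# Assumed line shape: "[timestamp] [severity] free-text body"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<sev>\w+)\] (?P<body>.*)$")


def extract_template(lines: List[str]) -> str:
    """Mask timestamp and severity, then keep the common token prefix of the bodies."""
    bodies = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unsupported line format: {line!r}")
        bodies.append(match.group("body").split())

    # Walk the bodies column by column until the tokens start to differ.
    prefix = []
    for column in zip(*bodies):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])

    # Anything after the shared prefix is treated as the variable message part.
    has_variable_tail = any(len(body) > len(prefix) for body in bodies)
    parts = ["[TIMESTAMP]", "[SEVERITY]"] + prefix
    if has_variable_tail:
        parts.append("[MESSAGE]")
    return " ".join(parts)


raw_logs = [
    "[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP",
    "[Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable",
]
print(extract_template(raw_logs))
```

For the two raw lines shown above, this prints the same template as the example.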

2. Semantic-Aware Compression

Category-Specific Codecs:

Timestamps: Delta encoding (8-10× compression)
Severity/Status: Dictionary encoding (5-7× compression)
Metrics: Gorilla time-series compression (3-5× compression)
Messages: Token pool with references (variable)
Stack traces: Reference tracking (store once, reuse pointer)
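
To make the timestamp codec concrete, here is a minimal sketch of delta encoding combined with zigzag and varint byte packing. It assumes timestamps have already been parsed to integer Unix seconds; the function names and the sample values are illustrative and are not taken from the LogPress code base.

```python
from typing import List, Tuple


def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small negative deltas stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z: int) -> int:
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def write_varint(value: int, out: bytearray) -> None:
    """Append value as a base-128 varint (7 payload bits per byte)."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (decoded value, position of the next varint)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_timestamps(timestamps: List[int]) -> bytes:
    """Store each timestamp as the difference from the previous one."""
    out = bytearray()
    previous = 0
    for ts in timestamps:
        write_varint(zigzag(ts - previous), out)
        previous = ts
    return bytes(out)


def decode_timestamps(data: bytes) -> List[int]:
    timestamps, pos, previous = [], 0, 0
    while pos < len(data):
        delta, pos = read_varint(data, pos)
        previous += unzigzag(delta)
        timestamps.append(previous)
    return timestamps


# Hypothetical Unix-second timestamps of consecutive log lines.
stamps = [1118297224, 1118297225, 1118297225, 1118297230]
encoded = encode_timestamps(stamps)
assert decode_timestamps(encoded) == stamps  # lossless round trip
print(len(encoded), "bytes, versus", 8 * len(stamps), "as fixed 64-bit integers")
```

Because consecutive log lines are usually close in time, most deltas fit in a single byte; only the first value pays the full cost.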

Queryable Index: Columnar storage enables filtering without full decompression.
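
The idea of filtering without full decompression can be sketched with a dictionary-encoded severity column: the query compares small integer codes and never touches the message text. This is an assumption-laden illustration, not the on-disk .lsc layout; `dictionary_encode` and `filter_rows` are hypothetical helpers.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each distinct string with a small integer code."""
    table: Dict[str, int] = {}
    codes = [table.setdefault(value, len(table)) for value in values]
    # dicts preserve insertion order, so position in the list equals the code
    return list(table), codes


def filter_rows(dictionary: List[str], codes: List[int], wanted: str) -> List[int]:
    """Return row numbers whose column value equals `wanted`, using codes only."""
    if wanted not in dictionary:
        return []
    target = dictionary.index(wanted)
    return [row for row, code in enumerate(codes) if code == target]


severity_column = ["notice", "error", "notice", "error", "warn"]
dictionary, codes = dictionary_encode(severity_column)
matching_rows = filter_rows(dictionary, codes, "error")
print(dictionary, codes, matching_rows)
```

Only the rows returned by the filter would then need their other columns decoded.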

3. Evaluation Metrics

Accuracy (vs manual annotations):

Precision: % of extracted fields that are correct
Recall: % of actual fields that were found
F1-Score: Harmonic mean of precision and recall
Target: >90% accuracy
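
The three accuracy metrics can be computed from two sets of fields. The sketch below assumes each field is represented as a (type, position) pair, which is an illustrative choice and not necessarily how the project's evaluator stores annotations.

```python
from typing import Set, Tuple

Field = Tuple[str, int]  # (semantic type, token position); assumed representation


def precision_recall_f1(extracted: Set[Field], truth: Set[Field]) -> Tuple[float, float, float]:
    """Score extracted fields against manually annotated ground truth."""
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotation: three of the four extracted fields are correct.
extracted = {("timestamp", 0), ("severity", 1), ("ip", 2), ("port", 3)}
truth = {("timestamp", 0), ("severity", 1), ("ip", 2), ("user", 3)}
precision, recall, f1 = precision_recall_f1(extracted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```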

Compression Performance:

Compression ratio vs gzip baseline
Query latency overhead
Target: >10× compression, <2× query slowdown
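
For reference, a gzip baseline and the two ratios named above can be measured with the standard library alone. This sketch assumes the compression ratio is original size divided by compressed size and that slowdown is query time divided by baseline time; the sample data is synthetic and far more repetitive than real logs, so its ratio says nothing about real-world results.

```python
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(original) / len(compressed)


def slowdown(query_seconds: float, baseline_seconds: float) -> float:
    """How many times slower a query is than the chosen baseline."""
    return query_seconds / baseline_seconds


# Synthetic input: one log line repeated, for illustration only.
raw = b"[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP\n" * 1000
gzip_baseline = gzip.compress(raw, compresslevel=9)
print(f"gzip baseline ratio: {compression_ratio(raw, gzip_baseline):.1f}x")
```

The same `compression_ratio` helper could be applied to the size of a .lsc file to compare it with the gzip figure for the same input.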

🧪 Testing

Run Complete Test Suite

# All tests with coverage
bash scripts/run-tests.sh

# View coverage report
firefox htmlcov/index.html

Pre-Production Validation

# Validate before deployment
bash scripts/run-pre-production-tests.sh

Test Status: ✅ 25/25 tests passing (100%)

Unit tests: 9 tests
Integration tests: 8 tests
E2E tests: 3 tests
Performance benchmarks: 5 tests

Performance Benchmarks

# Run benchmarks
python -m pytest logpress/tests/performance/ --benchmark-only

# Expected results:
# - Compression: >500 ops/sec
# - Template extraction: >900 ops/sec
# - Linear scalability: 100 → 10,000 logs

📚 Documentation

Documentation Index - Complete documentation overview
Testing Guide - Test strategy and commands
MCP Architecture - System design details
API Reference - Python API usage
Docker Guide - Container deployment

🎓 Research Context

Master's Thesis: Automatic Schema Extraction from Unstructured System Logs
Duration: 26 weeks (4 phases)
Target Venues: VLDB, SIGMOD, IEEE BigData
Novel Contribution: Semantic-aware compression adapting to log content types

Related Work

Log Parsing: Drain, Spell, LogPAI
Schema Inference: Lakehouse formats (Parquet, ORC)
Compression: Generic (gzip, zstd) vs specialized (LogShrink)

Key Differentiators

✅ No ML models (constraint-based approach)
✅ Semantic awareness (field-type-specific compression)
✅ Query preservation (columnar indexes)
✅ Schema evolution tracking
✅ Lossless compression (exact reconstruction)

🛠️ Development

Setup Development Environment

# Install test dependencies
pip install pytest pytest-cov pytest-benchmark pytest-mock

# Run tests on file changes (watch mode)
pip install pytest-watch
ptw logpress/tests/ -- -v

Contribution Workflow

Create feature branch: git checkout -b feature/new-encoder
Make changes and add tests
Run validation: bash scripts/run-pre-production-tests.sh
Submit PR (GitHub Actions runs full test suite)

Adding New Semantic Type Patterns

# logpress/context/classification/semantic_types.py

import re
from typing import Tuple

def recognize_custom_field(token: str) -> Tuple[str, float]:
    """
    Add pattern for new field type.
    
    Returns:
        (field_type, confidence_score)
    """
    # Matches three uppercase letters, a hyphen, and four digits, e.g. "ABC-1234"
    if re.match(r'^[A-Z]{3}-\d{4}$', token):
        return ('ERROR_CODE', 0.95)  # High confidence
    return ('UNKNOWN', 0.0)

Adding New Compression Codecs

# logpress/context/encoding/custom_encoder.py

from typing import Any, List

from logpress.protocols import EncoderProtocol

class CustomEncoder(EncoderProtocol):
    def encode(self, values: List[Any]) -> bytes:
        # Your encoding logic
        raise NotImplementedError
    
    def decode(self, data: bytes) -> List[Any]:
        # Your decoding logic
        raise NotImplementedError
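
As one possible way to fill in the two methods, the sketch below implements a run-length codec with the same encode/decode shape. To stay self-contained it does not import or subclass `EncoderProtocol`, and it serializes with JSON rather than any format the project uses; `RunLengthEncoder` is a hypothetical name, and values are assumed to be JSON-serializable and of a single type.

```python
import json
from typing import Any, List


class RunLengthEncoder:
    """Collapse consecutive repeated values into [value, count] pairs."""

    def encode(self, values: List[Any]) -> bytes:
        runs: List[List[Any]] = []
        for value in values:
            if runs and runs[-1][0] == value:
                runs[-1][1] += 1  # extend the current run
            else:
                runs.append([value, 1])  # start a new run
        return json.dumps(runs).encode("utf-8")

    def decode(self, data: bytes) -> List[Any]:
        values: List[Any] = []
        for value, count in json.loads(data.decode("utf-8")):
            values.extend([value] * count)
        return values


encoder = RunLengthEncoder()
column = ["INFO", "INFO", "INFO", "ERROR", "INFO"]
payload = encoder.encode(column)
assert encoder.decode(payload) == column  # lossless round trip
print(payload)
```

A codec like this pays off only on columns with long runs of identical values, such as severity levels in quiet periods.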

📦 Dependencies

Core Libraries

msgpack>=1.0.0          # Serialization
zstandard>=0.21.0       # Compression baseline
python-dateutil>=2.8.0  # Timestamp parsing
regex>=2023.0.0         # Advanced pattern matching
rich>=13.0.0            # Terminal UI
click>=8.1.0            # CLI framework

Testing

pytest>=7.4.0           # Test framework
pytest-cov>=4.1.0       # Coverage reporting
pytest-benchmark>=4.0.0 # Performance testing
pytest-mock>=3.12.0     # Mocking utilities

Optional Tools

# Baseline comparison
gzip --version

# Command-line benchmarking
cargo install hyperfine

# Memory profiling
pip install memory-profiler

🐳 Docker Deployment

Build & Run

# Build all services
docker-compose -f deployment/docker-compose.yml build

# Run interactive CLI
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive

# Run compression
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
  compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc

Environment Variables

# Set in docker-compose.yml
PYTHONUNBUFFERED=1      # Real-time output
TERM=xterm-256color     # Colored terminal
MIN_SUPPORT=3           # Template extraction threshold
ZSTD_LEVEL=15           # Compression level (1-22)

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

Additional semantic type patterns
New compression codecs
Query optimization
Schema visualization
Performance improvements

📄 License

MIT License - see LICENSE file for details.

📞 Contact

Author: Adam Bouafia
Repository: https://github.com/adam-bouafia/logpress
LinkedIn: https://www.linkedin.com/in/adam-bouafia

Status: ✅ Production Ready | 🧪 All Tests Passing (25/25) | 📊 Coverage: 42%

Built with ❤️ for research in log analysis and semantic compression.

Release history

2.0.1: Dec 1, 2025
2.0.0: Dec 1, 2025
1.0.7: Nov 30, 2025
1.0.6: Nov 29, 2025
1.0.5: Nov 29, 2025
1.0.4: Nov 29, 2025
1.0.3: Nov 29, 2025
0.1.0 (this version): Nov 28, 2025

Download files

Source Distribution: logpress-0.1.0.tar.gz (63.6 kB), uploaded Nov 28, 2025
Built Distribution: logpress-0.1.0-py3-none-any.whl (72.3 kB), uploaded Nov 28, 2025, Python 3

Both files were uploaded via twine/6.2.0 CPython/3.13.5, without Trusted Publishing.

Hashes for logpress-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9bfc0a1ceb6dab2bdb5211607089142c23bf9258a1072f92f54163f40c2a57e1`
MD5	`3f374de831084ad8c0d35c6e39dc3397`
BLAKE2b-256	`20e8f1b039bce6b59573294e3de78724dd47b2b704a6c539713dcc399f0962ad`

Hashes for logpress-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9bdc703c483309a79d56bd782d1078b8b76725ea2db697c28fc81710859e09b6`
MD5	`c7f4e384ea37e5b7aa31d132a9099b7e`
BLAKE2b-256	`368310945c8ce0c526f19fa4f663c2d76f183b4d36e899b3b32c4520033c44e8`
