Semantic-aware log compression with automatic schema extraction and queryable storage
LogPress - Semantic Log Compression System
Master's Thesis Research Project: Automatic schema extraction from unstructured system logs using constraint-based parsing and semantic-aware compression.
Research Goals
- Automatic Schema Discovery: Extract implicit log schemas without manual annotation
- Semantic-Aware Compression: Achieve 8-30× compression while maintaining queryability
- Real-World Validation: Tested on diverse log sources (2M+ entries)
Quick Start
Installation
# Clone repository
git clone https://github.com/adam-bouafia/LogPress.git
cd LogPress
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e .
Interactive Mode (Recommended)
# Beautiful terminal UI with dataset auto-discovery
python -m logpress.cli.interactive
Features:
- Auto-discovers datasets in data/datasets/
- Real-time compression progress
- Rich terminal UI with tables and progress bars
- Query compressed logs interactively
Command-Line Usage
# Compress logs
python -m logpress compress \
-i data/datasets/Apache/Apache_full.log \
-o evaluation/compressed/apache.lsc \
--min-support 3 \
-m
# Query compressed logs
python -m logpress query \
-c evaluation/compressed/apache.lsc \
--severity ERROR \
--limit 20
# Run full evaluation
python evaluation/run_full_evaluation.py
Docker Usage
# Interactive mode (Python rich UI)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive
# Bash menu (alternative)
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive-bash
# Run specific command
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc -m
Project Structure (MCP Architecture)
LogPress/
├── logpress/                  # Core Python package (Model-Context-Protocol)
│   ├── models/                # Data structures (Token, LogTemplate, CompressedLog)
│   ├── protocols/             # Abstract interfaces (EncoderProtocol, CompressorProtocol)
│   ├── context/               # Business logic
│   │   ├── tokenization/      # Smart log tokenization (FSM-based)
│   │   ├── extraction/        # Template generation (log alignment algorithm)
│   │   ├── classification/    # Semantic type recognition (pattern-based)
│   │   └── encoding/          # Compression codecs (delta, dictionary, varint)
│   ├── services/              # High-level orchestration
│   │   ├── compressor.py      # 6-stage compression pipeline
│   │   ├── query_engine.py    # Queryable decompression
│   │   └── evaluator.py       # Accuracy metrics vs ground truth
│   ├── cli/                   # User interfaces
│   │   ├── interactive.py     # Rich terminal UI
│   │   └── commands.py        # Click-based CLI
│   └── tests/                 # Test suite (25 tests, 100% passing)
│       ├── unit/              # Component testing
│       ├── integration/       # Workflow testing
│       ├── e2e/               # End-to-end testing
│       └── performance/       # Benchmarks
│
├── data/                      # Input data
│   ├── datasets/              # 5 real-world log sources (497K entries)
│   │   ├── Apache/            # Web server logs (52K lines)
│   │   ├── HealthApp/         # Android health tracking (212K lines)
│   │   ├── Zookeeper/         # Distributed coordination (74K lines)
│   │   ├── OpenStack/         # Cloud infrastructure (137K lines)
│   │   └── Proxifier/         # Network proxy (21K lines)
│   └── ground_truth/          # Manual annotations for validation
│
├── evaluation/                # Outputs & results
│   ├── compressed/            # .lsc compressed files
│   ├── results/               # Evaluation metrics (JSON/Markdown)
│   └── schema_versions/       # Schema evolution tracking
│
├── deployment/                # Infrastructure
│   ├── Dockerfile             # Container image
│   ├── docker-compose.yml     # Service orchestration
│   └── Makefile               # Build automation
│
├── documentation/             # Project documentation
│   ├── README.md              # Documentation index
│   ├── TESTING.md             # Test strategy
│   ├── MCP_ARCHITECTURE.md    # System design
│   └── API.md                 # Python API reference
│
└── scripts/                   # Automation scripts
    ├── logpress-interactive.sh        # Bash interactive menu
    ├── run-tests.sh                   # Test suite runner
    └── run-pre-production-tests.sh    # Validation
See individual README files in each directory for detailed information.
Research Methodology
1. Schema Extraction Pipeline
6-Stage Process:
- Tokenization: FSM-based parser handles diverse log formats
- Semantic Classification: Pattern-based field type detection (timestamp, IP, severity, etc.)
- Field Grouping: Identify related fields (ip+port, user+action)
- Template Generation: Log alignment algorithm extracts schemas
- Schema Versioning: Track format evolution over time
- Validation: Compare against manual ground truth (precision/recall)
Example:
Raw Logs:
[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP
[Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable
Extracted Template:
[TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
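The example above can be reproduced with a small sketch. This is not the project's log alignment algorithm: the tokenizer, the timestamp pattern, the severity set, and all function names below are assumptions made for illustration, and the alignment is simplified to a shared typed prefix followed by a [MESSAGE] tail.

```python
import re
from typing import List

# Hypothetical patterns for illustration; the real classifier may differ.
TIMESTAMP_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$"
)
SEVERITIES = {"debug", "info", "notice", "warn", "warning", "error", "crit"}


def tokenize(line: str) -> List[str]:
    """Split a line into bracketed groups and whitespace-separated words."""
    return re.findall(r"\[[^\]]*\]|\S+", line)


def classify(token: str) -> str:
    """Replace a bracketed token with its semantic type, if recognised."""
    if token.startswith("[") and token.endswith("]"):
        inner = token[1:-1]
        if TIMESTAMP_RE.match(inner):
            return "[TIMESTAMP]"
        if inner.lower() in SEVERITIES:
            return "[SEVERITY]"
    return token


def extract_template(lines: List[str]) -> str:
    """Keep the shared typed prefix; collapse the varying tail into [MESSAGE]."""
    rows = [[classify(t) for t in tokenize(line)] for line in lines]
    prefix: List[str] = []
    for column in zip(*rows):
        if len(set(column)) != 1:
            break  # first position where the lines disagree
        prefix.append(column[0])
    has_tail = any(len(row) > len(prefix) for row in rows)
    return " ".join(prefix + (["[MESSAGE]"] if has_tail else []))


logs = [
    "[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP",
    "[Thu Jun 09 06:07:05 2005] [notice] LDAP: SSL support unavailable",
]
print(extract_template(logs))  # [TIMESTAMP] [SEVERITY] LDAP: [MESSAGE]
```

A production extractor would also need to handle variable fields in the middle of a line and a minimum support threshold, which this sketch leaves out.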
2. Semantic-Aware Compression
Category-Specific Codecs:
- Timestamps: Delta encoding (8-10× compression)
- Severity/Status: Dictionary encoding (5-7× compression)
- Metrics: Gorilla time-series compression (3-5× compression)
- Messages: Token pool with references (variable)
- Stack traces: Reference tracking (store once, reuse pointer)
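To illustrate why delta encoding suits timestamps, here is a minimal sketch that stores the first value and then only the gap to each following one, with each number written as a variable-length integer. It assumes non-decreasing integer epoch seconds; the function names are hypothetical and this is not LogPress's actual on-disk format.

```python
from typing import List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low group first."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def delta_encode(timestamps: List[int]) -> bytes:
    """Write the first timestamp, then the difference to each next one."""
    out = bytearray()
    previous = 0
    for ts in timestamps:
        out += encode_varint(ts - previous)
        previous = ts
    return bytes(out)


def delta_decode(data: bytes) -> List[int]:
    """Rebuild the timestamps by summing the decoded differences."""
    timestamps: List[int] = []
    current = 0  # running sum of deltas
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            timestamps.append(current)
            value, shift = 0, 0
    return timestamps


stamps = [1118297224, 1118297225, 1118297225, 1118297290]
packed = delta_encode(stamps)
assert delta_decode(packed) == stamps
```

Because consecutive log lines are usually close in time, most deltas fit in a single byte, whereas each raw timestamp would need several.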
Queryable Index: Columnar storage enables filtering without full decompression.
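A rough sketch of how a columnar layout can answer a severity filter without touching the other columns: the severity column is dictionary-encoded, the filter compares small integer codes, and only the matching rows of the message column are read afterwards. The function names and the in-memory layout are assumptions for illustration, not the .lsc format.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes); each distinct value is stored once."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def filter_rows(dictionary: List[str], codes: List[int], wanted: str) -> List[int]:
    """Return row numbers whose value equals `wanted`, comparing codes only."""
    if wanted not in dictionary:
        return []
    target = dictionary.index(wanted)
    return [row for row, code in enumerate(codes) if code == target]


# Two columns of the same five log entries (made-up sample data).
severities = ["INFO", "ERROR", "INFO", "WARN", "ERROR"]
messages = ["started", "disk full", "heartbeat", "slow query", "timeout"]

dictionary, codes = dictionary_encode(severities)
error_rows = filter_rows(dictionary, codes, "ERROR")
print([messages[row] for row in error_rows])  # ['disk full', 'timeout']
```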
3. Evaluation Metrics
Accuracy (vs manual annotations):
- Precision: % of extracted fields that are correct
- Recall: % of actual fields that were found
- F1-Score: Harmonic mean of precision and recall
- Target: >90% accuracy
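The accuracy metrics above can be computed as follows, assuming extracted and annotated fields are compared as sets of field names. The function name and the sample field names are hypothetical; the project's evaluator may match fields differently.

```python
from typing import Set, Tuple


def score_fields(extracted: Set[str], ground_truth: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one log source."""
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


found = {"timestamp", "severity", "message", "pid"}
annotated = {"timestamp", "severity", "message", "component"}
precision, recall, f1 = score_fields(found, annotated)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here three of the four extracted fields are correct and three of the four annotated fields were found, so all three scores come out at 0.75.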
Compression Performance:
- Compression ratio vs gzip baseline
- Query latency overhead
- Target: >10× compression, <2× query slowdown
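A compression ratio is simply original size divided by compressed size, and the gzip baseline can be measured with the standard library. The sketch below uses hypothetical helper names and a synthetic, highly repetitive sample, so the number it prints says nothing about LogPress's results on real datasets.

```python
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    if not compressed:
        raise ValueError("compressed payload is empty")
    return len(original) / len(compressed)


def gzip_baseline_ratio(original: bytes, level: int = 9) -> float:
    """Ratio achieved by plain gzip, used as the comparison baseline."""
    return compression_ratio(original, gzip.compress(original, compresslevel=level))


# Synthetic sample: one log line repeated many times.
sample = b"[notice] LDAP: Built with OpenLDAP\n" * 1000
print(f"gzip baseline: {gzip_baseline_ratio(sample):.1f}x")
```

The same `compression_ratio` helper could be applied to the size of a .lsc file to compare it against this baseline on identical input.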
Testing
Run Complete Test Suite
# All tests with coverage
bash scripts/run-tests.sh
# View coverage report
firefox htmlcov/index.html
Pre-Production Validation
# Validate before deployment
bash scripts/run-pre-production-tests.sh
Test Status: 25/25 tests passing (100%)
- Unit tests: 9 tests
- Integration tests: 8 tests
- E2E tests: 3 tests
- Performance benchmarks: 5 tests
Performance Benchmarks
# Run benchmarks
python -m pytest logpress/tests/performance/ --benchmark-only
# Expected results:
# - Compression: >500 ops/sec
# - Template extraction: >900 ops/sec
# - Linear scalability: 100 to 10,000 logs
Documentation
- Documentation Index - Complete documentation overview
- Testing Guide - Test strategy and commands
- MCP Architecture - System design details
- API Reference - Python API usage
- Docker Guide - Container deployment
Research Context
Master's Thesis: Automatic Schema Extraction from Unstructured System Logs
Duration: 26 weeks (4 phases)
Target Venues: VLDB, SIGMOD, IEEE BigData
Novel Contribution: Semantic-aware compression adapting to log content types
Related Work
- Log Parsing: Drain, Spell, LogPAI
- Schema Inference: Lakehouse formats (Parquet, ORC)
- Compression: Generic (gzip, zstd) vs specialized (LogShrink)
Key Differentiators
- No ML models (constraint-based approach)
- Semantic awareness (field-type-specific compression)
- Query preservation (columnar indexes)
- Schema evolution tracking
- Lossless compression (exact reconstruction)
Development
Setup Development Environment
# Install test dependencies
pip install pytest pytest-cov pytest-benchmark pytest-mock
# Run tests on file changes (watch mode)
pip install pytest-watch
ptw logpress/tests/ -- -v
Contribution Workflow
- Create feature branch: git checkout -b feature/new-encoder
- Make changes and add tests
- Run validation: bash scripts/run-pre-production-tests.sh
- Submit PR (GitHub Actions runs full test suite)
Adding New Semantic Type Patterns
# logpress/context/classification/semantic_types.py
import re
from typing import Tuple


def recognize_custom_field(token: str) -> Tuple[str, float]:
    """
    Add pattern for new field type.

    Returns:
        (field_type, confidence_score)
    """
    # Matches codes such as ABC-1234: three capitals, a hyphen, four digits
    if re.match(r'^[A-Z]{3}-\d{4}$', token):
        return ('ERROR_CODE', 0.95)  # High confidence
    return ('UNKNOWN', 0.0)
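One way several such recognizers could be combined is to run each one and keep the highest-confidence answer. The sketch below is an assumption about how that might look, not the project's classifier: `recognize_ipv4`, `classify_token`, and the 0.90 confidence value are hypothetical.

```python
import re
from typing import Callable, List, Tuple

Recognizer = Callable[[str], Tuple[str, float]]


def recognize_error_code(token: str) -> Tuple[str, float]:
    """Pattern taken from the example above: ABC-1234 style codes."""
    if re.match(r"^[A-Z]{3}-\d{4}$", token):
        return ("ERROR_CODE", 0.95)
    return ("UNKNOWN", 0.0)


def recognize_ipv4(token: str) -> Tuple[str, float]:
    """Hypothetical second recognizer; does not validate octet ranges."""
    if re.match(r"^(\d{1,3}\.){3}\d{1,3}$", token):
        return ("IP_ADDRESS", 0.90)
    return ("UNKNOWN", 0.0)


def classify_token(token: str, recognizers: List[Recognizer]) -> Tuple[str, float]:
    """Return the (field_type, confidence) pair with the highest confidence."""
    best = ("UNKNOWN", 0.0)
    for recognize in recognizers:
        candidate = recognize(token)
        if candidate[1] > best[1]:
            best = candidate
    return best


recognizers = [recognize_error_code, recognize_ipv4]
for token in ["ABC-1234", "10.0.0.1", "hello"]:
    print(token, classify_token(token, recognizers))
```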
Adding New Compression Codecs
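# logpress/context/encoding/custom_encoder.py
from typing import Any, List

from logpress.protocols import EncoderProtocol


class CustomEncoder(EncoderProtocol):
    def encode(self, values: List[Any]) -> bytes:
        # Your encoding logic: turn a column of values into bytes
        raise NotImplementedError

    def decode(self, data: bytes) -> List[Any]:
        # Your decoding logic: must return exactly the values passed to encode
        raise NotImplementedError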
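As a concrete illustration of the encode/decode pair, here is a sketch of a run-length encoder for string columns. To stay self-contained it defines a local stand-in for `EncoderProtocol` whose shape is assumed from the skeleton above; `RunLengthEncoder` and its byte layout (run count, value length, value bytes) are hypothetical and not part of LogPress.

```python
from typing import List, Protocol


class EncoderProtocol(Protocol):
    """Local stand-in for logpress.protocols.EncoderProtocol (assumed shape)."""

    def encode(self, values: List[str]) -> bytes: ...
    def decode(self, data: bytes) -> List[str]: ...


class RunLengthEncoder:
    """Collapse consecutive repeats into (count, length, value) records."""

    def encode(self, values: List[str]) -> bytes:
        out = bytearray()
        i = 0
        while i < len(values):
            run = 1
            # Extend the run while the next value repeats (max 255 per record).
            while i + run < len(values) and values[i + run] == values[i] and run < 255:
                run += 1
            raw = values[i].encode("utf-8")
            if len(raw) > 255:
                raise ValueError("value longer than 255 bytes")
            out += bytes([run, len(raw)]) + raw
            i += run
        return bytes(out)

    def decode(self, data: bytes) -> List[str]:
        values: List[str] = []
        pos = 0
        while pos < len(data):
            run, length = data[pos], data[pos + 1]
            value = data[pos + 2 : pos + 2 + length].decode("utf-8")
            values.extend([value] * run)
            pos += 2 + length
        return values


encoder: EncoderProtocol = RunLengthEncoder()
column = ["INFO"] * 4 + ["ERROR"] + ["INFO"] * 2
packed = encoder.encode(column)
assert encoder.decode(packed) == column  # lossless round trip
```

Whatever the codec, a round-trip check like the final assertion is a useful first test, since the project's goal is exact reconstruction.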
Dependencies
Core Libraries
msgpack>=1.0.0 # Serialization
zstandard>=0.21.0 # Compression baseline
python-dateutil>=2.8.0 # Timestamp parsing
regex>=2023.0.0 # Advanced pattern matching
rich>=13.0.0 # Terminal UI
click>=8.1.0 # CLI framework
Testing
pytest>=7.4.0 # Test framework
pytest-cov>=4.1.0 # Coverage reporting
pytest-benchmark>=4.0.0 # Performance testing
pytest-mock>=3.12.0 # Mocking utilities
Optional Tools
# Baseline comparison
gzip --version
# Command-line benchmarking
cargo install hyperfine
# Memory profiling
pip install memory-profiler
Docker Deployment
Build & Run
# Build all services
docker-compose -f deployment/docker-compose.yml build
# Run interactive CLI
docker-compose -f deployment/docker-compose.yml run --rm logpress-interactive
# Run compression
docker-compose -f deployment/docker-compose.yml run --rm logpress-cli \
compress -i /app/data/datasets/Apache/Apache_full.log -o /app/evaluation/compressed/apache.lsc
Environment Variables
# Set in docker-compose.yml
PYTHONUNBUFFERED=1 # Real-time output
TERM=xterm-256color # Colored terminal
MIN_SUPPORT=3 # Template extraction threshold
ZSTD_LEVEL=15 # Compression level (1-22)
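For reference, a script could read and validate the two tuning variables as shown in this sketch. `Settings` and `load_settings` are hypothetical names, and falling back to 3 and 15 when a variable is unset is an assumption based on the values listed above, not confirmed behaviour of the LogPress CLI.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    min_support: int  # template extraction threshold
    zstd_level: int   # compression level, 1-22


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Parse MIN_SUPPORT and ZSTD_LEVEL, rejecting out-of-range values."""
    min_support = int(env.get("MIN_SUPPORT", "3"))
    zstd_level = int(env.get("ZSTD_LEVEL", "15"))
    if min_support < 1:
        raise ValueError("MIN_SUPPORT must be at least 1")
    if not 1 <= zstd_level <= 22:
        raise ValueError("ZSTD_LEVEL must be between 1 and 22")
    return Settings(min_support, zstd_level)


print(load_settings({"MIN_SUPPORT": "5", "ZSTD_LEVEL": "19"}))
```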
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Areas for Contribution
- Additional semantic type patterns
- New compression codecs
- Query optimization
- Schema visualization
- Performance improvements
License
MIT License - see LICENSE file for details.
Contact
- Author: Adam Bouafia
- Repository: https://github.com/adam-bouafia/logpress
- LinkedIn: https://www.linkedin.com/in/adam-bouafia
Status: Production Ready | All Tests Passing (25/25) | Coverage: 42%
Built with ❤️ for research in log analysis and semantic compression.