Production-grade logging for Spark data platforms (Fabric & Databricks)
LakeTrace Logger
Production-grade logging for Spark data platforms (Microsoft Fabric & Databricks)
LakeTrace is a cross-platform Python logging module designed specifically for Spark data platforms. It provides safe, performant logging with structured output, local file rotation, and optional lakehouse storage integration.
Installation
pip install laketrace
✅ Why LakeTrace (vs SparkLogger)
LakeTrace is built for Spark data platforms and removes the common failure modes of basic Spark logging:
- Driver-safe: no executor logging, no distributed file writes.
- No remote appends: all logging stays local with rotation and retention.
- Structured JSON: consistent records with runtime metadata and bound context.
- Fabric + Databricks aware: platform detection built in.
- Crash‑safe logging: the optional catch setting prevents formatter errors from breaking jobs.
- Scalable I/O: enqueue mode for high-throughput workloads.
✨ Key Features
- Cross‑platform: Fabric notebooks, Fabric Spark jobs, Databricks notebooks/jobs
- Structured JSON with context binding and runtime metadata
- Local rotation, retention, and compression
- Stdout emission for job logs
- Optional end‑of‑run lakehouse upload
- Thread‑safe and notebook re‑execution safe
🚀 Quick Start
from laketrace import get_logger
logger = get_logger("my_job")
logger.info("Starting data processing")
stage = logger.bind(stage="extract", dataset="sales")
stage.info("Extracting sales data")
logger.upload_log_to_lakehouse("Files/logs/my_job.log")
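To make the idea of context binding concrete, here is a minimal sketch built only on the Python standard library. It is not LakeTrace's implementation; the names JsonFormatter, BoundLogger, and make_logger are hypothetical and exist only for this illustration. It shows how a bound logger can carry key/value pairs into every JSON record while the parent logger stays unchanged.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Bound context arrives on the record as the `context` attribute.
            **getattr(record, "context", {}),
        }
        return json.dumps(payload)


class BoundLogger(logging.LoggerAdapter):
    """Hypothetical adapter that merges bound key/value pairs into records."""

    def bind(self, **context) -> "BoundLogger":
        # Return a new adapter so the parent keeps its own context.
        return BoundLogger(self.logger, {**self.extra, **context})

    def process(self, msg, kwargs):
        kwargs["extra"] = {"context": dict(self.extra)}
        return msg, kwargs


def make_logger(name: str, stream) -> BoundLogger:
    base = logging.getLogger(name)
    base.setLevel(logging.INFO)
    base.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    # Replace handlers so re-running a notebook cell does not duplicate output.
    base.handlers = [handler]
    return BoundLogger(base, {})


buffer = io.StringIO()
logger = make_logger("bind_sketch", buffer)
stage = logger.bind(stage="extract", dataset="sales")
stage.info("Extracting sales data")
```

Returning a new adapter from bind() rather than mutating the existing one keeps stage-specific fields from leaking into later, unrelated log lines.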
⚙️ Configuration Highlights
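from laketrace import get_logger

logger = get_logger(
    "my_job",
    config={
        "log_dir": "/tmp/laketrace_logs",  # local directory for log files
        "rotation": "500 MB",              # rotate when the file reaches this size
        "retention": "7 days",             # clean up rotated files older than this
        "compression": "gz",               # compress rotated files
        "level": "INFO",
        "json": True,
        "stdout": True,                    # also emit to stdout for job logs
        "serialize": True,
        "enqueue": False,                  # True enables background-thread writing
        "filter": None,
        "formatter": None,
        "catch": True,                     # keep formatter errors from breaking jobs
    },
)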
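For readers who want to see what size-based rotation with gzip compression and count-based retention look like mechanically, the following is a rough standard-library sketch. It is an assumption-laden illustration, not how LakeTrace implements these options: gzip_rotator and build_rotating_handler are hypothetical names, and the tiny 200-byte limit is chosen only so the demo rotates quickly.

```python
import gzip
import logging
import logging.handlers
import os
import shutil
import tempfile


def gzip_rotator(source: str, dest: str) -> None:
    """Compress the rotated file into `dest` and remove the plain copy."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def build_rotating_handler(path: str, max_bytes: int, backups: int) -> logging.Handler:
    """Size-based rotation; `backups` acts as a count-based retention limit."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.namer = lambda name: name + ".gz"  # job.log.1 -> job.log.1.gz
    handler.rotator = gzip_rotator
    return handler


# Demo: write enough lines into a temporary directory to force a rollover.
log_dir = tempfile.mkdtemp(prefix="rotation_sketch_")
log_path = os.path.join(log_dir, "job.log")
demo = logging.getLogger("rotation_sketch")
demo.setLevel(logging.INFO)
demo.propagate = False
handler = build_rotating_handler(log_path, max_bytes=200, backups=2)
demo.addHandler(handler)
for i in range(20):
    demo.info("Processing batch %d", i)
demo.removeHandler(handler)
handler.close()
rotated = sorted(os.listdir(log_dir))
```

Everything here stays on the local filesystem, which matches the "no remote appends" principle described earlier.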
Supported Features
LakeTrace provides comprehensive logging capabilities organized by feature category:
Core Features (Proven & Stable)
- Rotation: Size-based (MB), time-based (hourly/daily/weekly/monthly), interval-based, and callable rotation strategies
- Retention: File count-based and time-based cleanup policies
- Compression: Gzip, bzip2, and ZIP archive support for rotated logs
- Handler Management: Track and manage multiple log file handlers with unique IDs
- Async I/O: Enqueue mode for high-throughput workloads with background thread writing
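The enqueue idea, where the caller only puts records on a queue and a background thread does the actual writing, can be sketched with the standard library's QueueHandler and QueueListener. This is an illustration of the general technique under that assumption, not LakeTrace's internals; the logger name and the in-memory sink are placeholders.

```python
import io
import logging
import logging.handlers
import queue

log_queue: queue.Queue = queue.Queue(-1)  # unbounded queue

# The slow sink (an in-memory stream here) is owned by the listener thread.
sink = io.StringIO()
target = logging.StreamHandler(sink)
target.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

listener = logging.handlers.QueueListener(log_queue, target)
listener.start()

fast = logging.getLogger("enqueue_sketch")
fast.setLevel(logging.INFO)
fast.propagate = False
# The calling thread only enqueues; it never touches the sink directly.
fast.addHandler(logging.handlers.QueueHandler(log_queue))

for i in range(1000):
    fast.info("Processing %d", i)

listener.stop()  # processes the remaining records, then joins the worker thread
```

The trade-off is that records are written slightly later than they are emitted, so the listener must be stopped (or otherwise flushed) before the process exits.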
Advanced Features
- Custom Formatters: Apply custom message formatting rules
- Custom Filters: Control which records get logged
- Callbacks: Hook into log lifecycle events
- Multiprocessing Safety: Thread-safe operations across distributed Spark environments
- Error Catching: Optional exception handler prevents formatter errors from breaking jobs
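As an illustration of the error-catching idea, the sketch below wraps a standard-library formatter so that a formatting failure produces a fallback line instead of an exception. CatchingFormatter is a hypothetical name and this is not LakeTrace's catch implementation; it only demonstrates the pattern.

```python
import io
import logging


class CatchingFormatter(logging.Formatter):
    """Hypothetical wrapper: fall back to a safe line if formatting fails."""

    def __init__(self, inner: logging.Formatter) -> None:
        super().__init__()
        self.inner = inner

    def format(self, record: logging.LogRecord) -> str:
        try:
            return self.inner.format(record)
        except Exception as exc:
            # Keep the raw template so the event is not lost entirely.
            return f"[format error: {type(exc).__name__}] {record.msg!r}"


sink = io.StringIO()
handler = logging.StreamHandler(sink)
handler.setFormatter(CatchingFormatter(logging.Formatter("%(levelname)s %(message)s")))

log = logging.getLogger("catch_sketch")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("rows=%d", "not-a-number")  # bad argument type for %d
log.info("rows=%d", 42)              # formats normally
```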
Performance Features
- Throughput Optimization: Handle high-volume logging without performance degradation
- Memory Efficiency: Minimal overhead in memory usage during execution
- Concurrency Support: Safe operation with concurrent logging from multiple threads
Security Features
- Message Sanitization: Remove or mask sensitive data from logs
- PII Masking: Automatic detection and redaction of personally identifiable information
- Format String Escaping: Prevent format string vulnerabilities
- Newline Escaping: Sanitize log content to prevent log injection attacks
- Secure Permissions: Control file access in shared environments
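To show what masking and newline escaping can look like in practice, here is a small regex-based sketch. The sanitize function and the email pattern are assumptions made for illustration; LakeTrace's actual detection rules are not documented here, and real PII detection typically needs more than one pattern.

```python
import re

# Deliberately simple email pattern, used for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize(message: str) -> str:
    """Mask email addresses and escape line breaks in a log message."""
    masked = EMAIL_RE.sub("<email>", message)
    # Escaping CR/LF keeps attacker-controlled input from forging a new log line.
    return masked.replace("\r", "\\r").replace("\n", "\\n")


cleaned = sanitize("login failed for alice@example.com\nINFO forged entry")
```

After sanitizing, the forged "INFO" text stays on the same physical line as the original message, so it cannot pass as a separate record.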
🔄 Migration Guides
From SparkLogger
SparkLogger often leads to executor logging overhead and cross-partition serialization issues. LakeTrace moves all logging to the driver:
Before (SparkLogger):
from pyspark.taskcontext import TaskContext
from delta.tables import DeltaTable
# Problem: Executors attempt to log, causing distributed serialization
for partition in range(num_partitions):
    df.filter(...).collect()  # Executor logs serialized back to driver
After (LakeTrace):
from laketrace import get_logger
logger = get_logger("my_job") # Driver only
# Log from driver, use print() in executors
df = spark.read.parquet(path)
logger.info(f"Loaded {df.count()} rows") # Clean, structured, driver-safe
From notebookutils.fs.append
Using notebookutils.fs.append() for logging causes performance degradation and can hang Spark jobs due to repeated remote I/O per log line. LakeTrace uses local rotation instead:
Before (notebookutils.fs.append):
from notebookutils.mssparkutils import fs
# Problem: Each log line triggers remote I/O → job hangs
for i in range(1000):
    fs.append("/mnt/logs/job.log", f"Processing {i}\n")  # Remote write per line
    process_data(i)
After (LakeTrace):
from laketrace import get_logger
logger = get_logger("my_job")
# Local rotation, zero remote I/O during execution
for i in range(1000):
    logger.info(f"Processing {i}")  # Fast local write, no hangs
    process_data(i)
# Upload once at the end
logger.upload_log_to_lakehouse("/Files/logs/job.log")
🧪 Testing
Run the unified test suite:
python tests/run_tests.py # Full suite (~30 seconds)
python tests/run_tests.py --quick # Quick feedback (~2 seconds)
For the complete testing guide, see docs/TESTING.md.
🔐 Security
LakeTrace provides built-in security features to prevent data leaks:
- Field Whitelisting - Control what fields get logged
- PII Masking - Auto-detect and mask sensitive data
- Data Leak Detection - Monitor for suspicious patterns
- Log Integrity - Verify logs haven't been tampered with
- Secure Permissions - Control file access in shared environments
For the complete security guide, see docs/SECURITY.md.
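One common way to make log tampering detectable is to chain a keyed hash across lines, so that changing any earlier line invalidates every later signature. The sketch below illustrates that general technique with the standard library; sign_lines and verify_lines are hypothetical helpers, and nothing here is confirmed to be how LakeTrace verifies log integrity.

```python
import hashlib
import hmac
from typing import Iterable, List, Tuple


def sign_lines(lines: Iterable[str], key: bytes) -> List[Tuple[str, str]]:
    """Attach an HMAC to each line, chained to the previous line's digest."""
    prev = b""
    signed = []
    for line in lines:
        digest = hmac.new(key, prev + line.encode("utf-8"), hashlib.sha256).hexdigest()
        signed.append((line, digest))
        prev = digest.encode("ascii")
    return signed


def verify_lines(signed: Iterable[Tuple[str, str]], key: bytes) -> bool:
    """Recompute the chain and compare each digest in constant time."""
    prev = b""
    for line, digest in signed:
        expected = hmac.new(key, prev + line.encode("utf-8"), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, digest):
            return False
        prev = digest.encode("ascii")
    return True


key = b"demo-key-for-illustration-only"
signed = sign_lines(["job started", "loaded 10 rows", "job finished"], key)
tampered = [(signed[0][0], signed[0][1]), ("loaded 0 rows", signed[1][1]), signed[2]]
```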
✅ Safety Notes
- Driver only: use print() in executors.
- Single upload: call upload_log_to_lakehouse() once at the end.
- No remote append: avoids Spark job hangs and retries.
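The "log locally, upload once" pattern from these notes can be sketched as a context manager that guarantees a single upload call even when the job raises. This uses only the standard library; local_log_then_upload is a hypothetical helper, and the upload callable stands in for whatever remote copy step is used (in this demo it just records the path).

```python
import contextlib
import logging
import os
import tempfile


@contextlib.contextmanager
def local_log_then_upload(name, upload):
    """Log to a local file, then call `upload(local_path)` exactly once."""
    local_path = os.path.join(tempfile.mkdtemp(prefix="job_logs_"), f"{name}.log")
    handler = logging.FileHandler(local_path)
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.propagate = False
    log.addHandler(handler)
    try:
        yield log
    finally:
        # Flush and close before uploading so the remote copy is complete.
        log.removeHandler(handler)
        handler.close()
        upload(local_path)  # the only remote operation, even if the job failed


uploads = []  # stand-in for a remote store: records each uploaded path
with local_log_then_upload("single_upload_sketch", uploads.append) as log:
    for i in range(1000):
        log.info("Processing %d", i)
```

Placing the upload in the finally block means a failed run still ships its log, which is usually when the log matters most.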
📝 License
MIT License - see LICENSE file for details.