Production-grade logging for Spark data platforms (Fabric & Databricks)

LakeTrace Logger

Production-grade logging for Spark data platforms (Microsoft Fabric & Databricks)

LakeTrace is a cross-platform Python logging module designed specifically for Spark data platforms. It provides safe, performant logging with structured output, local file rotation, and optional lakehouse storage integration.

Installation

pip install laketrace

✅ Why LakeTrace (vs SparkLogger)

LakeTrace is built for Spark data platforms and removes the common failure modes of basic Spark logging:

  • Driver-safe: no executor logging, no distributed file writes.
  • No remote appends: all logging stays local with rotation and retention.
  • Structured JSON: consistent records with runtime metadata and bound context.
  • Fabric + Databricks aware: platform detection built in.
  • Crash‑safe logging: optional catch prevents formatter errors from breaking jobs.
  • Scalable I/O: enqueue mode for high-throughput workloads.

✨ Key Features

  • Cross‑platform: Fabric notebooks, Fabric Spark jobs, Databricks notebooks/jobs
  • Structured JSON with context binding and runtime metadata
  • Local rotation, retention, and compression
  • Stdout emission for job logs
  • Optional end‑of‑run lakehouse upload
  • Thread‑safe and notebook re‑execution safe
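
To make "structured JSON with context binding" concrete, here is a minimal sketch built only on the standard library's logging module. It is not LakeTrace's implementation; the names JsonFormatter, BoundLogger, and the "context" field are hypothetical and chosen only for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fields attached by BoundLogger below (empty if none were bound).
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


class BoundLogger(logging.LoggerAdapter):
    """Hypothetical adapter that carries key/value context into every record."""

    def process(self, msg, kwargs):
        kwargs.setdefault("extra", {})["context"] = dict(self.extra)
        return msg, kwargs

    def bind(self, **fields):
        # Return a new adapter; the parent keeps its own context untouched.
        return BoundLogger(self.logger, {**self.extra, **fields})


def demo() -> list:
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    base = logging.getLogger("bind_sketch")
    base.handlers = [handler]
    base.setLevel(logging.INFO)
    base.propagate = False

    logger = BoundLogger(base, {})
    logger.info("Starting data processing")
    stage = logger.bind(stage="extract", dataset="sales")
    stage.info("Extracting sales data")
    return [json.loads(line) for line in stream.getvalue().splitlines()]
```

In this sketch, bind() returns a new object instead of mutating the parent, so a stage-specific logger can be discarded without affecting the job-level one.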

🚀 Quick Start

from laketrace import get_logger

logger = get_logger("my_job")
logger.info("Starting data processing")

stage = logger.bind(stage="extract", dataset="sales")
stage.info("Extracting sales data")

logger.upload_log_to_lakehouse("Files/logs/my_job.log")

⚙️ Configuration Highlights

logger = get_logger(
    "my_job",
    config={
        "log_dir": "/tmp/laketrace_logs",
        "rotation": "500 MB",
        "retention": "7 days",
        "compression": "gz",
        "level": "INFO",
        "json": True,
        "stdout": True,
        "serialize": True,
        "enqueue": False,
        "filter": None,
        "formatter": None,
        "catch": True,
    }
)
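
For readers who want to see what size-based rotation with gzip compression involves, the following is a rough standard-library sketch, not LakeTrace's own code. It assumes count-based retention (the backups argument) in place of the "7 days" policy shown above, and the function names are hypothetical.

```python
import gzip
import logging
import logging.handlers
import os
import shutil
import tempfile


def gzip_rotator(source: str, dest: str) -> None:
    """Compress the rotated file into dest, then remove the uncompressed copy."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def make_rotating_logger(log_dir: str, max_bytes: int, backups: int) -> logging.Logger:
    handler = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, "job.log"), maxBytes=max_bytes, backupCount=backups
    )
    handler.namer = lambda name: name + ".gz"  # job.log.1 -> job.log.1.gz
    handler.rotator = gzip_rotator
    logger = logging.getLogger("rotation_sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> list:
    with tempfile.TemporaryDirectory() as log_dir:
        # Tiny size limit so the demo rotates several times.
        logger = make_rotating_logger(log_dir, max_bytes=200, backups=2)
        for i in range(50):
            logger.info("Processing batch %d", i)
        for handler in logger.handlers:
            handler.close()
        return sorted(os.listdir(log_dir))
```

The handler keeps the active file plus at most two compressed backups; older backups are overwritten as new rotations occur.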

Supported Features

LakeTrace provides comprehensive logging capabilities organized by feature category:

Core Features (Proven & Stable)

  • Rotation: Size-based (MB), time-based (hourly/daily/weekly/monthly), interval-based, and callable rotation strategies
  • Retention: File count-based and time-based cleanup policies
  • Compression: Gzip, bzip2, and ZIP archive support for rotated logs
  • Handler Management: Track and manage multiple log file handlers with unique IDs
  • Async I/O: Enqueue mode for high-throughput workloads with background thread writing

Advanced Features

  • Custom Formatters: Apply custom message formatting rules
  • Custom Filters: Control which records get logged
  • Callbacks: Hook into log lifecycle events
  • Multiprocessing Safety: Thread-safe operations across distributed Spark environments
  • Error Catching: Optional exception handler prevents formatter errors from breaking jobs
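
As a rough illustration of custom filters and error catching, the sketch below uses the standard logging module. SafeFormatter and drop_heartbeats are hypothetical names, and the fallback text is an assumption about what a caught formatting error could look like, not LakeTrace's actual output.

```python
import io
import logging


class SafeFormatter(logging.Formatter):
    """Fall back to a placeholder line if formatting a record raises."""

    def format(self, record: logging.LogRecord) -> str:
        try:
            return super().format(record)
        except Exception as exc:
            return f"[unformattable record: {type(exc).__name__}] {record.msg!r}"


def drop_heartbeats(record: logging.LogRecord) -> bool:
    """Filter: returning False discards the record."""
    return "heartbeat" not in str(record.msg)


def demo() -> list:
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(SafeFormatter("%(message)s"))
    handler.addFilter(drop_heartbeats)

    logger = logging.getLogger("catch_sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False

    logger.info("Loaded %d rows", "not-a-number")  # bad argument for %d
    logger.info("heartbeat")                       # removed by the filter
    logger.info("Finished")
    return stream.getvalue().splitlines()
```

The malformed call still produces a log line, so the job continues and the problem stays visible in the output.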

Performance Features

  • Throughput Optimization: Handle high-volume logging without performance degradation
  • Memory Efficiency: Minimal overhead in memory usage during execution
  • Concurrency Support: Safe operation with concurrent logging from multiple threads

Security Features

  • Message Sanitization: Remove or mask sensitive data from logs
  • PII Masking: Automatic detection and redaction of personally identifiable information
  • Format String Escaping: Prevent format string vulnerabilities
  • Newline Escaping: Sanitize log content to prevent log injection attacks
  • Secure Permissions: Control file access in shared environments
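
Newline escaping and masking can be sketched in a few lines. The example below is a simplified illustration, assuming email addresses as the only PII pattern; the regex, the "<email>" placeholder, and the sanitize() function are hypothetical and do not reflect LakeTrace's detection rules.

```python
import re

# Deliberately simple email pattern; real PII detection needs broader rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize(message: str) -> str:
    """Mask email addresses and escape line breaks so one call yields one log line."""
    masked = EMAIL_RE.sub("<email>", message)
    return (
        masked.replace("\\", "\\\\")  # escape backslashes first to stay unambiguous
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )
```

Without the escaping step, a value containing a line break could make a single message look like two separate log entries, which is the basis of log injection.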

🔄 Migration Guides

From SparkLogger

SparkLogger often leads to executor logging overhead and cross-partition serialization issues. LakeTrace moves all logging to the driver:

Before (SparkLogger):

from pyspark.taskcontext import TaskContext
from delta.tables import DeltaTable

# Problem: Executors attempt to log, causing distributed serialization
for partition in range(num_partitions):
    df.filter(...).collect()  # Executor logs serialized back to driver

After (LakeTrace):

from laketrace import get_logger

logger = get_logger("my_job")  # Driver only

# Log from driver, use print() in executors
df = spark.read.parquet(path)
logger.info(f"Loaded {df.count()} rows")  # Clean, structured, driver-safe

From notebookutils.fs.append

Using notebookutils.fs.append() for logging causes performance degradation and can hang Spark jobs due to repeated remote I/O per log line. LakeTrace uses local rotation instead:

Before (notebookutils.fs.append):

from notebookutils.mssparkutils import fs

# Problem: Each log line triggers remote I/O → job hangs
for i in range(1000):
    fs.append("/mnt/logs/job.log", f"Processing {i}\n")  # Remote write per line
    process_data(i)

After (LakeTrace):

from laketrace import get_logger

logger = get_logger("my_job")

# Local rotation, zero remote I/O during execution
for i in range(1000):
    logger.info(f"Processing {i}")  # Fast local write, no hangs
    process_data(i)

# Upload once at the end
logger.upload_log_to_lakehouse("/Files/logs/job.log")

🧪 Testing

Run the unified test suite:

python tests/run_tests.py          # Full suite (~30 seconds)
python tests/run_tests.py --quick  # Quick feedback (~2 seconds)

For the complete testing guide, see docs/TESTING.md.

🔐 Security

LakeTrace provides built-in security features to prevent data leaks:

  • Field Whitelisting - Control what fields get logged
  • PII Masking - Auto-detect and mask sensitive data
  • Data Leak Detection - Monitor for suspicious patterns
  • Log Integrity - Verify logs haven't been tampered with
  • Secure Permissions - Control file access in shared environments
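
One common way to make tampering detectable is to chain a keyed hash through the log lines. The sketch below shows that general technique with the standard hmac module; it is an assumption about how integrity checking could work, not LakeTrace's mechanism, and sign_lines and verify_lines are hypothetical helpers.

```python
import hashlib
import hmac


def _digest(key: bytes, previous: bytes, line: str) -> str:
    return hmac.new(key, previous + line.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_lines(lines: list, key: bytes) -> list:
    """Append to each line an HMAC that also covers the previous line's HMAC."""
    signed = []
    previous = b""
    for line in lines:
        digest = _digest(key, previous, line)
        signed.append(f"{line}\t{digest}")
        previous = digest.encode("ascii")
    return signed


def verify_lines(signed: list, key: bytes) -> bool:
    """Return False if any line was edited, reordered, or removed mid-file."""
    previous = b""
    for entry in signed:
        line, _, digest = entry.rpartition("\t")
        expected = _digest(key, previous, line)
        if not hmac.compare_digest(expected.encode("ascii"), digest.encode("utf-8")):
            return False
        previous = expected.encode("ascii")
    return True
```

This simple chain does not detect lines cut from the end of the file; catching that would require recording the final digest somewhere separate.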

For the complete security guide, see docs/SECURITY.md.

✅ Safety Notes

  • Driver only: use print() in executors.
  • Single upload: call upload_log_to_lakehouse() once at the end.
  • No remote append: avoids Spark job hangs and retries.
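
The "write locally, upload once" pattern from these notes can be sketched with a context manager that runs the upload in a finally block, so it happens exactly once even if the job fails. This is a standard-library illustration only: local_log_then_upload is a hypothetical helper, and the upload callable stands in for whatever remote copy is used, here a plain file copy.

```python
import logging
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def local_log_then_upload(local_path, upload):
    """Log to a local file; call upload(local_path) once when the block exits."""
    handler = logging.FileHandler(local_path)
    logger = logging.getLogger("single_upload_sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    try:
        yield logger
    finally:
        handler.close()  # flush everything to disk before copying
        logger.removeHandler(handler)
        upload(local_path)


def demo():
    calls = []
    with tempfile.TemporaryDirectory() as tmp:
        local_path = os.path.join(tmp, "job.log")
        remote_path = os.path.join(tmp, "lakehouse_job.log")  # stand-in for remote storage

        def fake_upload(path):
            calls.append(path)
            shutil.copyfile(path, remote_path)

        with local_log_then_upload(local_path, fake_upload) as logger:
            for i in range(1000):
                logger.info("Processing %d", i)  # local write only

        with open(remote_path, encoding="utf-8") as fh:
            uploaded = fh.read()
    return len(calls), uploaded
```

In the demo, 1000 log calls result in a single call to the upload function, compared with 1000 remote writes in a per-line append approach.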

📝 License

MIT License - see LICENSE file for details.
