
Production-grade logging for Spark data platforms (Fabric & Databricks)


LakeTrace Logger

Production-grade logging for Spark data platforms (Microsoft Fabric & Databricks)

LakeTrace is a cross-platform Python logging module designed specifically for Spark data platforms. It provides safe, performant logging with structured output, local file rotation, and optional lakehouse storage integration.

Installation

pip install laketrace

✅ Why LakeTrace (vs SparkLogger)

LakeTrace is built for Spark data platforms and removes the common failure modes of basic Spark logging:

  • Driver-safe: no executor logging, no distributed file writes.
  • No remote appends: all logging stays local with rotation and retention.
  • Structured JSON: consistent records with runtime metadata and bound context.
  • Fabric + Databricks aware: platform detection built in.
  • Crash‑safe logging: the optional catch setting keeps formatter errors from breaking jobs.
  • Scalable I/O: enqueue mode for high-throughput workloads.

✨ Key Features

  • Cross‑platform: Fabric notebooks, Fabric Spark jobs, Databricks notebooks/jobs
  • Structured JSON with context binding and runtime metadata
  • Local rotation, retention, and compression
  • Stdout emission for job logs
  • Optional end‑of‑run lakehouse upload
  • Thread‑safe and notebook re‑execution safe

🚀 Quick Start

from laketrace import get_logger

logger = get_logger("my_job")
logger.info("Starting data processing")

stage = logger.bind(stage="extract", dataset="sales")
stage.info("Extracting sales data")

logger.upload_log_to_lakehouse("Files/logs/my_job.log")
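
To make the idea of context binding concrete, here is a minimal sketch built only on the standard library. It is not LakeTrace's implementation: `JsonFormatter` and `BoundLogger` are hypothetical names, and the JSON field names are assumptions chosen for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields attached by BoundLogger arrive on the record as `context`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, default=str)


class BoundLogger(logging.LoggerAdapter):
    """Hypothetical stand-in for a bound logger: carries key/value context."""

    def bind(self, **fields) -> "BoundLogger":
        # Return a new adapter so the parent's context stays untouched.
        return BoundLogger(self.logger, {**self.extra, **fields})

    def process(self, msg, kwargs):
        kwargs["extra"] = {"context": dict(self.extra)}
        return msg, kwargs


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

base = logging.getLogger("bind_sketch")
base.setLevel(logging.INFO)
base.addHandler(handler)
base.propagate = False

logger = BoundLogger(base, {})
stage = logger.bind(stage="extract", dataset="sales")
stage.info("Extracting sales data")
logger.info("Starting data processing")

# One parsed JSON object per emitted line.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Only records emitted through `stage` carry the `stage` and `dataset` fields; the parent logger's records stay unchanged, which is the property that makes binding safe to use per pipeline step.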

⚙️ Configuration Highlights

logger = get_logger(
    "my_job",
    config={
        "log_dir": "/tmp/laketrace_logs",
        "rotation": "500 MB",
        "retention": "7 days",
        "compression": "gz",
        "level": "INFO",
        "json": True,
        "stdout": True,
        "serialize": True,
        "enqueue": False,
        "filter": None,
        "formatter": None,
        "catch": True,
    }
)
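
For readers who want to see what size-based rotation, count-based retention, and gzip compression amount to, the following sketch reproduces the three behaviors with the standard library's `RotatingFileHandler`. It is an illustration under assumptions, not LakeTrace's internals: the 1 KB limit, the backup count of 3, and the `gzip_rotator` helper are made up for the demo.

```python
import gzip
import logging
import logging.handlers
import os
import shutil
import tempfile


def gzip_rotator(source: str, dest: str) -> None:
    """Compress the rotated file into `dest` and drop the uncompressed copy."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


log_dir = tempfile.mkdtemp(prefix="rotation_sketch_")
log_path = os.path.join(log_dir, "my_job.log")

handler = logging.handlers.RotatingFileHandler(
    log_path,
    maxBytes=1024,   # rotate at about 1 KB so the effect shows up quickly
    backupCount=3,   # count-based retention: keep at most 3 rotated files
)
handler.namer = lambda name: name + ".gz"  # rotated files get a .gz suffix
handler.rotator = gzip_rotator

logger = logging.getLogger("rotation_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

for i in range(200):
    logger.info("Processing batch %d with some padding text", i)

handler.close()
rotated = sorted(f for f in os.listdir(log_dir) if f.endswith(".gz"))
```

After the loop, `rotated` lists the compressed backups that survived retention; older ones were deleted as new rotations pushed them past the backup count.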

Supported Features

LakeTrace provides comprehensive logging capabilities organized by feature category:

Core Features (Proven & Stable)

  • Rotation: Size-based (MB), time-based (hourly/daily/weekly/monthly), interval-based, and callable rotation strategies
  • Retention: File count-based and time-based cleanup policies
  • Compression: Gzip, bzip2, and ZIP archive support for rotated logs
  • Handler Management: Track and manage multiple log file handlers with unique IDs
  • Async I/O: Enqueue mode for high-throughput workloads with background thread writing
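
The enqueue idea in the last bullet can be sketched with the standard library's `QueueHandler` and `QueueListener`: the calling thread only puts records on a queue, and a background thread does the slow write. This is an assumption about the general technique, not a description of how LakeTrace implements it; `ListHandler` is a hypothetical in-memory sink standing in for a file.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Collects formatted messages in memory; stands in for a slow file sink."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


sink = ListHandler()
log_queue: queue.Queue = queue.Queue(-1)  # unbounded queue

# The listener owns a background thread that drains the queue into the sink.
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()

logger = logging.getLogger("enqueue_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.propagate = False

for i in range(1000):
    logger.info("Processing %d", i)  # only enqueues; no I/O on this thread

# stop() waits for the background thread to finish the queued records.
listener.stop()
```

The trade-off is that records are written slightly after they are emitted, so the listener has to be stopped (flushed) before the job exits or before any end-of-run upload.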

Advanced Features

  • Custom Formatters: Apply custom message formatting rules
  • Custom Filters: Control which records get logged
  • Callbacks: Hook into log lifecycle events
  • Multiprocessing Safety: Thread-safe operations across distributed Spark environments
  • Error Catching: Optional exception handler prevents formatter errors from breaking jobs
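
As an illustration of the error-catching bullet, the sketch below wraps a formatter so that an exception inside it produces a fallback line instead of propagating. It uses only the standard library; `SafeFormatter` and `BrokenFormatter` are hypothetical names, and the fallback text is an assumption rather than LakeTrace's actual output.

```python
import io
import logging


class SafeFormatter(logging.Formatter):
    """Wrap another formatter and fall back to a plain line if it raises."""

    def __init__(self, inner: logging.Formatter) -> None:
        super().__init__()
        self.inner = inner

    def format(self, record: logging.LogRecord) -> str:
        try:
            return self.inner.format(record)
        except Exception as exc:  # deliberately broad: logging must not kill the job
            return f"[format error: {type(exc).__name__}] {record.msg!r}"


class BrokenFormatter(logging.Formatter):
    """Simulates a buggy custom formatter."""

    def format(self, record: logging.LogRecord) -> str:
        raise ValueError("formatter bug")


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(SafeFormatter(BrokenFormatter()))

logger = logging.getLogger("catch_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("Loaded rows")  # the broken formatter's error is contained
output = stream.getvalue()
```

The record is still written, with a marker naming the exception type, so the formatter bug remains visible in the log without interrupting the job.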

Performance Features

  • Throughput Optimization: Enqueue mode hands writes to a background thread, so high-volume logging does not block the job
  • Memory Efficiency: Minimal overhead in memory usage during execution
  • Concurrency Support: Safe operation with concurrent logging from multiple threads

Security Features

  • Message Sanitization: Remove or mask sensitive data from logs
  • PII Masking: Automatic detection and redaction of personally identifiable information
  • Format String Escaping: Prevent format string vulnerabilities
  • Newline Escaping: Sanitize log content to prevent log injection attacks
  • Secure Permissions: Control file access in shared environments
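
Masking and newline escaping from the list above can be sketched as a standard-library logging filter. The e-mail pattern, the `<redacted-email>` placeholder, and the names `sanitize` and `SanitizingFilter` are all assumptions made for this example; LakeTrace's own detection rules may differ.

```python
import io
import logging
import re

# Simplified e-mail pattern for the demo; real PII detection needs more rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize(text: str) -> str:
    """Mask e-mail addresses and escape line breaks in a log message."""
    masked = EMAIL_RE.sub("<redacted-email>", text)
    # Escape CR/LF so untrusted input cannot forge extra log lines.
    return masked.replace("\r", "\\r").replace("\n", "\\n")


class SanitizingFilter(logging.Filter):
    """Rewrites each record's message in place before handlers see it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = sanitize(record.getMessage())
        record.args = None  # already formatted, so '%' is not re-interpreted
        return True


stream = io.StringIO()
logger = logging.getLogger("sanitize_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))
logger.addFilter(SanitizingFilter())
logger.propagate = False

logger.info("Signup from %s", "jane@example.com\nERROR forged line")
output = stream.getvalue()
```

The forged `ERROR` text stays on the same physical line as the original message, and the address never reaches the log file.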

🔄 Migration Guides

From SparkLogger

SparkLogger often leads to executor logging overhead and cross-partition serialization issues. LakeTrace moves all logging to the driver:

Before (SparkLogger):

from pyspark.taskcontext import TaskContext
from delta.tables import DeltaTable

# Problem: Executors attempt to log, causing distributed serialization
for partition in range(num_partitions):
    df.filter(...).collect()  # Executor logs serialized back to driver

After (LakeTrace):

from laketrace import get_logger

logger = get_logger("my_job")  # Driver only

# Log from the driver only; use print() inside executors
df = spark.read.parquet(path)
logger.info(f"Loaded {df.count()} rows")  # Clean, structured, driver-safe

From notebookutils.fs.append

Using notebookutils.fs.append() for logging causes performance degradation and can hang Spark jobs due to repeated remote I/O per log line. LakeTrace uses local rotation instead:

Before (notebookutils.fs.append):

from notebookutils.mssparkutils import fs

# Problem: each log line triggers remote I/O, which slows the job and can hang it
for i in range(1000):
    fs.append("/mnt/logs/job.log", f"Processing {i}\n")  # Remote write per line
    process_data(i)

After (LakeTrace):

from laketrace import get_logger

logger = get_logger("my_job")

# Local rotation, zero remote I/O during execution
for i in range(1000):
    logger.info(f"Processing {i}")  # Fast local write, no hangs
    process_data(i)

# Upload once at the end
logger.upload_log_to_lakehouse("/Files/logs/job.log")

🧪 Workload Test Runner

Run the consolidated workload tests:

python tests/run_workloads.py

Workload groups live under:

✅ Safety Notes

  • Driver only: use print() in executors.
  • Single upload: call upload_log_to_lakehouse() once at the end.
  • No remote append: avoids Spark job hangs and retries.
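
One way to honor the single-upload rule is to wrap the job in a context manager that runs the upload exactly once on exit, even when the job fails. The sketch below is generic Python, and `upload_once_at_exit` is a hypothetical helper, not part of LakeTrace; a list's `append` stands in for the upload call so the example runs anywhere.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def upload_once_at_exit(upload: Callable[[str], None], target: str) -> Iterator[None]:
    """Run the job body, then call `upload(target)` exactly once, even on failure."""
    try:
        yield
    finally:
        upload(target)


# Stand-in for the real upload call so the sketch has no external dependencies.
uploads: list[str] = []

with upload_once_at_exit(uploads.append, "Files/logs/my_job.log"):
    for i in range(3):
        pass  # job work and local logging happen here
```

In a real job, the `upload` argument would presumably be `logger.upload_log_to_lakehouse`, so the log file still reaches the lakehouse when the job raises partway through.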

📝 License

MIT License - see LICENSE file for details.


