
delta-lake-utils

Production-grade utilities for Delta Lake table management and optimization on Databricks.

What Does This Package Do?

Automates Delta Lake table optimization, health monitoring, and pipeline generation for Databricks data engineers.

Main Features:

  1. Smart OPTIMIZE - Automatically consolidates small files and improves query performance

    • Detects when your Delta table has too many small files
    • Intelligently chooses which columns to Z-ORDER by
    • Reduces query time by up to 10x
  2. Health Checker - Diagnoses table problems before they impact production

    • Identifies small file problems
    • Detects data skew across partitions
    • Finds configuration issues
  3. Performance Profiler - Measures how fast your Delta operations run

    • Track read/write speeds
    • Identify bottlenecks
    • Compare before/after optimization
  4. Medallion Generator - Auto-creates Bronze/Silver/Gold pipeline code

    • Generates production-ready notebooks
    • Follows best practices
    • Saves hours of boilerplate coding
  5. Unity Catalog Auditor - Manages permissions and access control

    • Audits table permissions
    • Generates permission scripts
    • Ensures security compliance
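The small-file check behind Smart OPTIMIZE can be pictured as a simple ratio test. The sketch below is illustrative only: the function name, the 128 MB target size, and the 50% threshold are assumptions, not this package's actual implementation.

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed per-file target, a common Delta default

def needs_optimize(file_sizes_bytes, small_ratio_threshold=0.5):
    """Return True when more than `small_ratio_threshold` of the
    table's files fall below the target file size."""
    if not file_sizes_bytes:
        return False
    small = sum(1 for s in file_sizes_bytes if s < TARGET_FILE_BYTES)
    return small / len(file_sizes_bytes) > small_ratio_threshold

# A table that is mostly 1 MB files should trigger optimization:
sizes = [1 * 1024 * 1024] * 900 + [128 * 1024 * 1024] * 100
print(needs_optimize(sizes))  # True
```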

Installation

pip install delta-lake-utils

Quick Start

from pyspark.sql import SparkSession
from delta_utils import DeltaOptimizer

spark = SparkSession.builder.getOrCreate()
optimizer = DeltaOptimizer(spark)

# Optimize a table - reduces files, improves performance
result = optimizer.auto_optimize('/mnt/delta/my_table')
print(f"Optimized! Removed {result.files_removed} files")

Use Cases

Use Case 1: Your queries are slow

Problem: Delta table has 5000 small files, queries take 10 minutes
Solution: Run optimizer, consolidates to 50 files, queries now take 1 minute
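The file counts in this use case follow directly from a target file size: after consolidation, the file count is roughly total data divided by the per-file target. This back-of-the-envelope sketch assumes a 128 MB target (a common Delta default, not a documented setting of this package):

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target size per file

def target_file_count(total_table_bytes):
    """Rough file count after OPTIMIZE: total data / target file size."""
    return max(1, math.ceil(total_table_bytes / TARGET_FILE_BYTES))

# 5000 small files holding ~6.25 GB in total consolidate to about 50 files.
total_bytes = int(6.25 * 1024**3)
print(target_file_count(total_bytes))  # 50
```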

Use Case 2: Starting a new data pipeline

Problem: Need to build Bronze/Silver/Gold architecture from scratch
Solution: Use medallion generator, get complete pipeline in 30 seconds

Use Case 3: Data quality issues

Problem: Not sure if table is healthy, production keeps failing
Solution: Run health checker, get specific recommendations to fix issues
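One diagnostic the Health Checker performs, partition-skew detection, can be illustrated with a simple ratio of the largest partition to the mean. The metric, names, and sample numbers below are assumptions for illustration, not the package's internal logic:

```python
def skew_ratio(partition_row_counts):
    """Largest partition divided by the mean partition size;
    values well above 1 indicate skew."""
    if not partition_row_counts:
        return 0.0
    mean = sum(partition_row_counts) / len(partition_row_counts)
    return max(partition_row_counts) / mean

counts = [100, 110, 95, 4000]  # one hot partition dominates
print(round(skew_ratio(counts), 2))  # 3.72
```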

Use Case 4: Permission audit required

Problem: Need to verify all tables have correct access controls
Solution: Use catalog auditor to check and fix permissions
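The Medallion Generator's Bronze/Silver/Gold output can be pictured as template expansion over the three layers, with each layer reading from the previous one. The template, paths, and function name here are invented purely to illustrate the idea:

```python
# Illustrative sketch of medallion boilerplate generation; not the
# package's actual template or API.
TEMPLATE = """\
# {layer} layer for {table}
df = spark.read.format("delta").load("{source}")
df.write.format("delta").mode("overwrite").save("/mnt/{layer}/{table}")
"""

def generate_medallion(table, source):
    cells = []
    prev = source
    for layer in ["bronze", "silver", "gold"]:
        cells.append(TEMPLATE.format(layer=layer, table=table, source=prev))
        prev = f"/mnt/{layer}/{table}"  # next layer reads this layer's output
    return cells

cells = generate_medallion("orders", "/mnt/raw/orders")
print(len(cells))  # 3
```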

Documentation

Requirements

  • Python 3.8+
  • PySpark 3.2+
  • Delta Lake 2.0+
  • Databricks Runtime 11.0+ recommended

Author

Nalini Panwar (GitHub: @panwarnalini-hub)

License

MIT License - see LICENSE file for details
