dbx-marker

Easily manage incremental progress using watermarks in your Databricks data pipelines.

Overview

dbx-marker is a Python library that helps you manage watermarks in your Databricks data pipelines using Delta tables.

It provides a simple interface to track and manage pipeline progress, making it easier to implement incremental processing and resume operations.
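As a rough illustration of the watermark idea itself, independent of this library, the sketch below filters a batch down to rows newer than a stored watermark and computes the value to store next. The function name `select_new_rows` and the `event_date` field are hypothetical, and the sketch assumes ISO-8601 date strings, which compare correctly as plain strings.

```python
def select_new_rows(rows, watermark):
    """Return (rows newer than the watermark, watermark to store next)."""
    # A watermark of None means nothing has been processed yet.
    new_rows = [
        row for row in rows
        if watermark is None or row["event_date"] > watermark
    ]
    # Keep the old watermark when there is nothing new to process.
    next_watermark = max(
        (row["event_date"] for row in new_rows), default=watermark
    )
    return new_rows, next_watermark


rows = [
    {"id": 1, "event_date": "2024-01-20"},
    {"id": 2, "event_date": "2024-01-21"},
    {"id": 3, "event_date": "2024-01-22"},
]

# Only rows after 2024-01-20 are selected.
new_rows, next_watermark = select_new_rows(rows, "2024-01-20")
```

Storing `next_watermark` after each run is what lets the next run skip rows that were already handled.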

Features

  • Simple API for managing pipeline watermarks
  • Persistent storage using Delta tables
  • Thread-safe operations
  • Comprehensive error handling
  • Built for Databricks environments

Installation

Install using pip:

pip install dbx-marker

Quick Start

from dbx_marker import DbxMarker
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a marker manager
manager = DbxMarker(
    delta_table_path="/path/to/markers",
    spark=spark
)

# Update a marker (will upsert if it doesn't exist)
manager.update_marker("my_pipeline", "2024-01-21")

# Get the current marker
current_marker = manager.get_marker("my_pipeline")

# Delete a marker when needed
manager.delete_marker("my_pipeline")

Usage

Initialization

Create a DbxMarker instance by specifying the Delta table path where markers will be stored:

from dbx_marker import DbxMarker
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

manager = DbxMarker(
    delta_table_path="/path/to/markers",
    spark=spark  # Optional: will create new session if not provided
)

Managing Markers

Update a Marker

manager.update_marker("pipeline_name", "marker_value")

Get Current Marker

current_value = manager.get_marker("pipeline_name")

Delete a Marker

manager.delete_marker("pipeline_name")
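The three calls above can be combined into a resume-friendly processing step. The following is a minimal sketch, not the library's own recipe: `InMemoryMarker` is a hypothetical stand-in that mimics the three methods so the example runs without Spark, `process_increment` and the `event_date` field are made-up names, and it assumes `get_marker` returns the stored value in a form that can be compared with the row values.

```python
class InMemoryMarker:
    """Hypothetical stand-in exposing the same three methods as the manager."""

    def __init__(self):
        self._values = {}

    def update_marker(self, name, value):
        self._values[name] = value

    def get_marker(self, name):
        return self._values[name]

    def delete_marker(self, name):
        del self._values[name]


def process_increment(manager, pipeline_name, rows):
    """Process rows newer than the stored marker, then advance the marker."""
    last_value = manager.get_marker(pipeline_name)
    batch = [row for row in rows if row["event_date"] > last_value]
    if batch:
        # Write the batch to its destination here. The marker is advanced
        # only afterwards, so a failed write leaves the old marker in place.
        manager.update_marker(
            pipeline_name, max(row["event_date"] for row in batch)
        )
    return batch


manager = InMemoryMarker()
manager.update_marker("my_pipeline", "2024-01-20")  # seed the marker

rows = [
    {"id": 1, "event_date": "2024-01-20"},
    {"id": 2, "event_date": "2024-01-21"},
    {"id": 3, "event_date": "2024-01-22"},
]
first_run = process_increment(manager, "my_pipeline", rows)
second_run = process_increment(manager, "my_pipeline", rows)
```

In a real pipeline, the `manager` argument would be a `DbxMarker` instance and the list comprehension would typically be a DataFrame filter. The ordering is the design choice that matters here: the marker is updated last.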

Error Handling

The library provides specific exceptions for different scenarios:

  • MarkerExistsError: When trying to create a duplicate marker
  • MarkerNotFoundError: When a requested marker doesn't exist
  • MarkerUpdateError: When marker update fails
  • MarkerDeleteError: When marker deletion fails
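One place these exceptions matter is the first run of a pipeline, when no marker exists yet. The sketch below falls back to a default value in that case. The README does not show where the exceptions are imported from, so `MarkerNotFoundError` is defined locally as a stand-in; `StubManager` and `get_marker_or_default` are hypothetical names used only for illustration.

```python
class MarkerNotFoundError(Exception):
    """Local stand-in for the library's exception of the same name."""


class StubManager:
    """Hypothetical manager that raises when a marker is missing."""

    def __init__(self, values):
        self._values = dict(values)

    def get_marker(self, name):
        if name not in self._values:
            raise MarkerNotFoundError(name)
        return self._values[name]


def get_marker_or_default(manager, pipeline_name, default,
                          not_found_error=MarkerNotFoundError):
    """Return the stored marker, or `default` if none has been saved yet."""
    try:
        return manager.get_marker(pipeline_name)
    except not_found_error:
        # First run: no marker exists, so start from the caller's default.
        return default


manager = StubManager({"orders": "2024-01-21"})
existing = get_marker_or_default(manager, "orders", "1970-01-01")
missing = get_marker_or_default(manager, "customers", "1970-01-01")
```

With the real library, pass the exception class it exports as `not_found_error` instead of the local stand-in.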

Requirements

  • Python >= 3.13
  • PySpark >= 3.5.4
  • Delta-Spark >= 3.3.0
  • Loguru >= 0.7.3

Development

  1. Clone the repository
  2. Install development dependencies:
     pdm install -G dev
  3. Run tests:
     pdm run test
  4. Run all checks:
     pdm run all-checks

License

This project is licensed under the MIT License - see the LICENSE file for details.
