dbx-marker

Easily manage incremental progress using watermarks in your Databricks data pipelines.

Overview

dbx-marker is a Python library that helps you manage watermarks in your Databricks data pipelines using Delta tables.

It provides a simple interface to track and manage pipeline progress, making it easier to implement incremental processing and resume operations.
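To make the idea of a watermark concrete, here is a minimal sketch of the filtering step an incremental pipeline typically performs. It does not use dbx-marker itself: `select_new_rows`, the `event_date` field, and the sample rows are all hypothetical names chosen for illustration, and the watermark is assumed to be an ISO 8601 date string like the one in the Quick Start.

```python
from typing import Iterable


def select_new_rows(rows: Iterable[dict], watermark: str) -> tuple[list[dict], str]:
    """Return the rows newer than the watermark, plus the advanced watermark.

    Dates are ISO 8601 strings (YYYY-MM-DD), which sort correctly as plain text.
    """
    new_rows = [row for row in rows if row["event_date"] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((row["event_date"] for row in new_rows), default=watermark)
    return new_rows, new_watermark


# Hypothetical sample data; a real pipeline would read from a table instead.
rows = [
    {"id": 1, "event_date": "2024-01-20"},
    {"id": 2, "event_date": "2024-01-21"},
    {"id": 3, "event_date": "2024-01-22"},
]
new_rows, new_watermark = select_new_rows(rows, "2024-01-21")
```

The returned watermark is the value a pipeline would then persist, so that the next run starts where this one stopped.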

Features

  • Simple API for managing pipeline watermarks
  • Persistent storage using Delta tables
  • Thread-safe operations
  • Comprehensive error handling
  • Built for Databricks environments

Installation

Install using pip:

pip install dbx-marker

Quick Start

from dbx_marker import DbxMarker
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a marker manager
manager = DbxMarker(
    delta_table_path="/path/to/markers",
    spark=spark
)

# Update a marker (creates it if it doesn't exist yet)
manager.update_marker("my_pipeline", "2024-01-21")

# Get the current marker
current_marker = manager.get_marker("my_pipeline")

# Delete a marker when needed
manager.delete_marker("my_pipeline")

Usage

Initialization

Create a DbxMarker instance by specifying the Delta table path where markers will be stored:

from dbx_marker import DbxMarker
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

manager = DbxMarker(
    delta_table_path="/path/to/markers",
    spark=spark  # Optional: will create new session if not provided
)

Managing Markers

Update a Marker

manager.update_marker("pipeline_name", "marker_value")

Get Current Marker

current_value = manager.get_marker("pipeline_name")

Delete a Marker

manager.delete_marker("pipeline_name")
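One way to combine these calls for resumable processing is to read the marker, do the work, and write the new marker only after the work has succeeded. The sketch below illustrates that ordering under stated assumptions: `InMemoryMarker` is a hypothetical dict-backed stand-in that only mirrors the `get_marker` and `update_marker` method names so the example runs without Spark, `run_step` is a helper invented here, and `get_marker` is assumed to return the value that was stored.

```python
from typing import Callable


class InMemoryMarker:
    """Hypothetical dict-backed stand-in exposing the two method names used below."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}

    def update_marker(self, pipeline_name: str, value: str) -> None:
        self._values[pipeline_name] = value

    def get_marker(self, pipeline_name: str) -> str:
        return self._values[pipeline_name]


def run_step(manager, pipeline_name: str, step: Callable[[str], str]) -> str:
    """Run one incremental step and persist its new watermark on success."""
    last_marker = manager.get_marker(pipeline_name)
    # If step() raises, the update below is never reached, so the stored
    # marker still points at the last successfully processed position.
    new_marker = step(last_marker)
    manager.update_marker(pipeline_name, new_marker)
    return new_marker


def failing_step(last_marker: str) -> str:
    raise RuntimeError("simulated failure while processing")


manager = InMemoryMarker()
manager.update_marker("my_pipeline", "2024-01-21")  # seed the starting point

run_step(manager, "my_pipeline", lambda last: "2024-01-22")
after_success = manager.get_marker("my_pipeline")

try:
    run_step(manager, "my_pipeline", failing_step)
except RuntimeError:
    pass
after_failure = manager.get_marker("my_pipeline")
```

Because the marker is written last, a failed run leaves it untouched and the next run retries from the same position. With the real library, the same helper should work if a DbxMarker instance is passed in place of the stand-in.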

Error Handling

The library provides specific exceptions for different scenarios:

  • MarkerExistsError: Raised when trying to create a duplicate marker
  • MarkerNotFoundError: Raised when a requested marker doesn't exist
  • MarkerUpdateError: Raised when a marker update fails
  • MarkerDeleteError: Raised when a marker deletion fails
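A common place to handle these exceptions is the first run of a pipeline, when no marker has been written yet. The sketch below shows one possible fallback helper. The import location of the real exception classes is not stated here, so `get_marker_or_default` (a hypothetical helper) takes the exception class as an argument; the `MarkerNotFoundError` and `EmptyMarkerStore` classes defined in the block are local stand-ins that let the example run without the library.

```python
class MarkerNotFoundError(Exception):
    """Local stand-in so the sketch runs without the library installed."""


class EmptyMarkerStore:
    """Hypothetical manager whose get_marker always reports a missing marker."""

    def get_marker(self, pipeline_name: str) -> str:
        raise MarkerNotFoundError(pipeline_name)


def get_marker_or_default(manager, pipeline_name, default, not_found_error):
    """Return the stored marker, or `default` when none has been written yet."""
    try:
        return manager.get_marker(pipeline_name)
    except not_found_error:
        return default


# First run: nothing stored yet, so the helper falls back to the default.
start = get_marker_or_default(
    EmptyMarkerStore(), "my_pipeline", "1970-01-01", MarkerNotFoundError
)
```

In real code, the library's own MarkerNotFoundError would be passed as the last argument, and the default would be whatever starting point makes sense for a full initial load.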

Requirements

  • Python >= 3.13
  • PySpark >= 3.5.4
  • Delta-Spark >= 3.3.0
  • Loguru >= 0.7.3

Development

  1. Clone the repository
  2. Install development dependencies:
     pdm install -G dev
  3. Run tests:
     pdm run test
  4. Run all checks:
     pdm run all-checks

License

This project is licensed under the MIT License - see the LICENSE file for details.
