Skip to main content

A package to detect changes in Delta Lake tables.

Project description

Delta Table Change Detection

This Python script is designed to detect changes in Delta Lake tables by examining the Delta log files. It provides detailed information about changes to specific columns in the table, including before/after values as well as the operations that caused said changes.

Features

  • Parses Delta Lake log files to extract metadata about changes.
  • Detects changes to specified column(s) based on a given identifier.
  • Returns detailed records of changes, including the version, operation, and timestamps.
  • Provides detailed information about original and modified records, including the file paths and modes.

Requirements

  • Python 3.x
  • deltalake package
  • pyarrow package

Installation

To use this script, ensure you have Python installed and the required packages. You can install the packages using pip:

Module:

pip install delta-change-detector

Dependencies:

pip install deltalake pyarrow

Usage

The main function provided by this script is detect_changes, which analyzes changes to a specific column for a given ID in a Delta Lake table.

Function: detect_changes

Parameters:

  • delta_path (str): Path to Delta table.
  • id_column (str): Column used as an identifier to match records.
  • column_names (list): Column(s) whose changes you want to track.
  • id_value (str or int): Value of identifier to search for in table.

Returns:

  • A dictionary containing error or information messages.
  • A list of records detailing changes if any were detected, including:
    • id_column: Identifier column.
    • original_record: Boolean indicating if record is original.
    • modified_record: Boolean indicating if record is modified.
    • old_value: Previous value of column.
    • new_value: New value of column.
    • parquet_file_path: Path to Parquet file containing record.
    • delta_log_path: Path to Delta log file.
    • operation: Operation that caused the change.
    • mode: Mode of operation.
    • timestamp: Timestamp of change.
    • version: Version number of Delta table.

Example:

delta_path = "/path/to/delta/table"
id_column = "user_id"
column_names = ["first_name", "last_name"]
id_value = 12345

changes = detect_changes(delta_path, id_column, column_names, id_value)
print(changes)

Example Output:

2024-08-05 00:40:08,784 - INFO - Attempting to open Delta table at: /path/to/delta/table
2024-08-05 00:40:08,822 - INFO - Successfully opened Delta table. History length: 40
2024-08-05 09:23:30,419 - INFO - Found matching record: {'first_name': 'Nicholas', 'last_name': 'Piesco'}
Changes detected:
Version: 14
Operation: WRITE
Mode: Append
ID Column: id
Old Values: {'first_name': 'Nick', 'last_name': 'Piesco'}
New Values: {}
Timestamp: 2024-08-02 16:30:44
Parquet File Path: 14-a06fd51b-2253-4436-aa03-7caf3311115d-0.parquet
Delta Log Path: /path/to/delta/table/_delta_log/00000000000000000014.json
Original Record: True
Modified Record: False
---
Version: 15
Operation: WRITE
Mode: Overwrite
ID Column: id
Old Values: {'first_name': 'Nick', 'last_name': 'Piesco'}
New Values: {'first_name': 'Nicholas', 'last_name': 'Piesco'}
Timestamp: 2024-08-02 16:30:44
Parquet File Path: 15-2365a517-edd4-4a3c-b070-8c44bd24944d-0.parquet
Delta Log Path: /path/to/delta/table/_delta_log/00000000000000000015.json
Original Record: False
Modified Record: True
---
None

Functions

parse_delta_log(delta_log_path)

  • Parses Delta Lake log file and returns log entries.

extract_log_info(log_entries)

  • Extracts added paths, removed paths, and mode from log entries.

detect_changes(delta_path, id_column, column_name, id_value)

  • Analyzes Delta table's history to detect changes for specific column(s) and identifier.

Error Handling

The script includes error handling to manage file reading issues and log parsing errors. It will provide error messages when issues are encountered during execution.

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please submit a pull request or open an issue for suggestions or bug reports.

Author

Nicholas G. Piesco

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

delta_change_detector-0.2.5-py3-none-any.whl (6.1 kB view details)

Uploaded Python 3

File details

Details for the file delta_change_detector-0.2.5-py3-none-any.whl.

File metadata

File hashes

Hashes for delta_change_detector-0.2.5-py3-none-any.whl
Algorithm Hash digest
SHA256 bcb7f2fe1fa5911fa520a29c84a01b24b1dbf013efda7854a82605c5e372b1de
MD5 9c39b3f104dce1c0da268e96032e9945
BLAKE2b-256 71c07e7b07d088238b6cf6ef07aba7599340e41b6b99d299c903f84dea58ec44

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page