
A set of convenient, easy-to-use tools for Delta Lake tables.

Deltalake Tools

Introduction

A set of easy-to-use commands for Delta Lake, exposed through a command line interface.

You probably don't need this, especially if you're already using delta-rs (deltalake), just as you don't need awscli when you already have a boto3 client. It is useful as a standalone Delta CLI, e.g. when run from a cron job.

A Rust implementation, with Python bindings built using PyO3/maturin, is also in progress.

Delta Table Commands currently supported

  • compact
  • vacuum
  • create-checkpoint
  • table-version
  • delete-table
  • create-test-table
  • parquet-to-delta
  • ...

Storage services currently supported

  • local storage
  • AWS S3 (virtual and path addressing style)
  • IBM COS (virtual and path addressing style)
  • Azure blob storage
  • Google cloud storage
  • ...

Platforms supported

          arm64   amd64
linux     x       x
mac       x       x
windows

Minimal Python version

3.10

Getting started

Install

pip install deltalake-tools

Alternatively, check out Astral's rye, uv and ruff projects.

(uv is a fast drop-in replacement for pip.)

uv pip install deltalake-tools

If you prefer rye:

rye add deltalake-tools

Usage

help

$ deltalake-tools -h
Usage: deltalake-tools [OPTIONS] COMMAND [ARGS]...

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  compact
  create-checkpoint
  table-version
  vacuum
$

Consider the following test table:

/tmp/test_delta_table
├── 0-ed28f60b-7569-47fc-90fa-9cbaad8ccd27-0.parquet
├── 1-dde22814-9070-4df6-be8c-5ec564c8cfd3-0.parquet
├── 10-eafcb45c-6fdd-467b-b35d-d8bf127ae243-0.parquet
├── 2-7bcb5e5d-d2e6-4975-a2d6-a399413b2883-0.parquet
├── 3-7412d097-1b4c-4d45-9ac1-7bc312f20f62-0.parquet
├── 4-9a5bd960-08fb-4b8f-8e94-8f81574799e5-0.parquet
├── 5-a2182ca3-c334-43be-8935-210fc839ff77-0.parquet
├── 6-f9f4597c-9709-4029-a0ee-a11d757072bf-0.parquet
├── 7-a3c8fbef-27eb-4ee1-ae2a-2a30c9964fbc-0.parquet
├── 8-8221beb7-9a14-4ce5-a205-c5b862cbce0d-0.parquet
├── 9-63ec9854-d58c-4d48-b3cf-9c18c43e2d17-0.parquet
└── _delta_log
    ├── 00000000000000000000.json
    ├── 00000000000000000001.json
    ├── 00000000000000000002.json
    ├── 00000000000000000003.json
    ├── 00000000000000000004.json
    ├── 00000000000000000005.json
    ├── 00000000000000000006.json
    ├── 00000000000000000007.json
    ├── 00000000000000000008.json
    ├── 00000000000000000009.json
    └── 00000000000000000010.json

table-version

$ deltalake-tools table-version /tmp/test_delta_table
10
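
For intuition: a Delta table's version is the number of its most recent commit, and each commit is stored as a zero-padded JSON file in `_delta_log`, as the listing above shows. The following is a minimal standard-library sketch of that idea. `latest_version` is a hypothetical helper written for illustration; it is not part of deltalake-tools and says nothing about how the tool is implemented.

```python
import re
from pathlib import Path

# Commit files are named as a 20-digit, zero-padded version plus ".json".
COMMIT_RE = re.compile(r"^(\d{20})\.json$")


def latest_version(table_path: str) -> int:
    """Return the highest commit number found in <table_path>/_delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    versions = [
        int(match.group(1))
        for entry in log_dir.iterdir()
        if (match := COMMIT_RE.match(entry.name))
    ]
    if not versions:
        raise ValueError(f"no commit files found in {log_dir}")
    return max(versions)
```

For the test table above, whose log holds commits 0 through 10, `latest_version("/tmp/test_delta_table")` should agree with the `table-version` output.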

compact

  • increments version
  • rewrites the data into 1 file. Here, 11 files are replaced by 1, which considerably improves read performance.

$ deltalake-tools table-version /tmp/test_delta_table
10
$ deltalake-tools compact /tmp/test_delta_table
{'numFilesAdded': 1, 'numFilesRemoved': 11, 'filesAdded': '{"avg":1034.0,"max":1034,"min":1034,"totalFiles":1,"totalSize":1034}', 'filesRemoved': '{"avg":909.3636363636364,"max":913,"min":873,"totalFiles":11,"totalSize":10003}', 'partitionsOptimized': 1, 'numBatches': 11, 'totalConsideredFiles': 11, 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}
$ deltalake-tools table-version /tmp/test_delta_table
11
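
The metrics printed by `compact` look like a Python dict whose `filesAdded` and `filesRemoved` values are JSON strings. If you want to log or monitor them, a sketch along these lines may help. It assumes the output keeps the shape shown above, which may not hold across versions, and `parse_compact_metrics` is a hypothetical helper, not part of deltalake-tools.

```python
import ast
import json


def parse_compact_metrics(line: str) -> dict:
    """Parse the dict printed by `compact` and decode its nested JSON strings."""
    metrics = ast.literal_eval(line)
    for key in ("filesAdded", "filesRemoved"):
        if isinstance(metrics.get(key), str):
            metrics[key] = json.loads(metrics[key])
    return metrics


# Output copied from the compact run above (wrapped for readability).
output = """{'numFilesAdded': 1, 'numFilesRemoved': 11,
 'filesAdded': '{"avg":1034.0,"max":1034,"min":1034,"totalFiles":1,"totalSize":1034}',
 'filesRemoved': '{"avg":909.3636363636364,"max":913,"min":873,"totalFiles":11,"totalSize":10003}',
 'partitionsOptimized': 1, 'numBatches': 11, 'totalConsideredFiles': 11,
 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}"""

metrics = parse_compact_metrics(output)
before = metrics["filesRemoved"]["totalSize"]
after = metrics["filesAdded"]["totalSize"]
print(
    f"{metrics['numFilesRemoved']} files ({before} bytes) -> "
    f"{metrics['numFilesAdded']} file(s) ({after} bytes)"
)
# prints: 11 files (10003 bytes) -> 1 file(s) (1034 bytes)
```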

vacuum

  • increments version
  • by default, the minimum retention period is 168 hours (1 week). This can be overridden, but read the docs first.

arguments:

  • --retention-hours: how long to keep old files available for time travel. Default: 168 (1 week)
  • --disable-retention-duration: disable the safety check on the minimum retention period
  • --force: by default, vacuum is a dry run. Pass this flag to actually perform the vacuum.

These safety features are implemented intentionally. Read the docs for more information.

$ deltalake-tools table-version /tmp/test_delta_table
11
$ deltalake-tools vacuum /tmp/test_delta_table --retention-hours 0 --disable-retention-duration --force
['3-7412d097-1b4c-4d45-9ac1-7bc312f20f62-0.parquet', '6-f9f4597c-9709-4029-a0ee-a11d757072bf-0.parquet', '5-a2182ca3-c334-43be-8935-210fc839ff77-0.parquet', '2-7bcb5e5d-d2e6-4975-a2d6-a399413b2883-0.parquet', '1-dde22814-9070-4df6-be8c-5ec564c8cfd3-0.parquet', '8-8221beb7-9a14-4ce5-a205-c5b862cbce0d-0.parquet', '9-63ec9854-d58c-4d48-b3cf-9c18c43e2d17-0.parquet', '7-a3c8fbef-27eb-4ee1-ae2a-2a30c9964fbc-0.parquet', '0-ed28f60b-7569-47fc-90fa-9cbaad8ccd27-0.parquet', '10-eafcb45c-6fdd-467b-b35d-d8bf127ae243-0.parquet', '4-9a5bd960-08fb-4b8f-8e94-8f81574799e5-0.parquet']
$ deltalake-tools table-version /tmp/test_delta_table
12
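
When vacuum runs from a scheduled job, it can help to assemble the command in one place so the safety flags are always explicit. The sketch below does that with the three flags documented above. `build_vacuum_command` and `run_vacuum` are hypothetical wrapper names, and `run_vacuum` assumes that `deltalake-tools` is on the `PATH` and exits non-zero on failure.

```python
import subprocess


def build_vacuum_command(
    table_path: str,
    retention_hours: int = 168,
    disable_retention_duration: bool = False,
    force: bool = False,
) -> list[str]:
    """Build the argument list for `deltalake-tools vacuum`."""
    cmd = [
        "deltalake-tools", "vacuum", table_path,
        "--retention-hours", str(retention_hours),
    ]
    if disable_retention_duration:
        cmd.append("--disable-retention-duration")
    if force:
        # Without --force the command is only a dry run.
        cmd.append("--force")
    return cmd


def run_vacuum(table_path: str, **options) -> str:
    """Run vacuum and return whatever the CLI printed to stdout."""
    result = subprocess.run(
        build_vacuum_command(table_path, **options),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `run_vacuum("/tmp/test_delta_table")` with no options keeps the defaults, so it remains a dry run with a one-week retention period.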

create-checkpoint

  • does not increment version
  • use this if your Delta Lake clients require the _last_checkpoint file, which is not written by default, neither by Spark nor by delta-rs.

$ deltalake-tools table-version /tmp/test_delta_table
13
$ deltalake-tools create-checkpoint /tmp/test_delta_table
Checkpoint created successfully.
$ deltalake-tools table-version /tmp/test_delta_table
13

Contribute

Test

  • pytest
  • moto[server]: the server must be started before any clients (boto3, delta) are initialized

Running tests:

rye run pytest tests

management

changelog
