
A set of convenient, easy-to-use tools for deltalake tables.

Deltalake Tools

Introduction

A set of easy-to-use commands for deltalake, exposed through a command-line interface.

You probably don't need this, especially if you're already using delta-rs (deltalake), just as you don't need awscli when you already have a boto3 client. It is useful as a standalone Delta CLI, e.g. in a cron job.

A Rust implementation with Python bindings (using PyO3/maturin) is also in the works.

Delta Table Commands currently supported:

  • compact
  • vacuum
  • create-checkpoint
  • table-version
  • delete-table
  • create-test-table
  • convert-to-table
  • ...

Storage services currently supported:

  • local storage
  • AWS S3 (virtual and path addressing style)
  • IBM COS (virtual and path addressing style)
  • Azure blob storage
  • Google cloud storage
  • ...

Platforms supported

          arm64  amd64
linux     x      x
mac       x      x
windows

Minimal Python version

3.10

Getting started

Install

pip install deltalake-tools

Alternatively, check out Astral's rye, uv and ruff projects. uv is a blazingly fast drop-in replacement for pip:

$ uv pip install deltalake-tools

If you prefer rye:

$ rye add deltalake-tools

Usage

help

$ deltalake-tools -h
Usage: deltalake-tools [OPTIONS] COMMAND [ARGS]...

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  compact
  create-checkpoint
  table-version
  vacuum
$

Consider the following test table:

/tmp/test_delta_table
├── 0-ed28f60b-7569-47fc-90fa-9cbaad8ccd27-0.parquet
├── 1-dde22814-9070-4df6-be8c-5ec564c8cfd3-0.parquet
├── 10-eafcb45c-6fdd-467b-b35d-d8bf127ae243-0.parquet
├── 2-7bcb5e5d-d2e6-4975-a2d6-a399413b2883-0.parquet
├── 3-7412d097-1b4c-4d45-9ac1-7bc312f20f62-0.parquet
├── 4-9a5bd960-08fb-4b8f-8e94-8f81574799e5-0.parquet
├── 5-a2182ca3-c334-43be-8935-210fc839ff77-0.parquet
├── 6-f9f4597c-9709-4029-a0ee-a11d757072bf-0.parquet
├── 7-a3c8fbef-27eb-4ee1-ae2a-2a30c9964fbc-0.parquet
├── 8-8221beb7-9a14-4ce5-a205-c5b862cbce0d-0.parquet
├── 9-63ec9854-d58c-4d48-b3cf-9c18c43e2d17-0.parquet
└── _delta_log
    ├── 00000000000000000000.json
    ├── 00000000000000000001.json
    ├── 00000000000000000002.json
    ├── 00000000000000000003.json
    ├── 00000000000000000004.json
    ├── 00000000000000000005.json
    ├── 00000000000000000006.json
    ├── 00000000000000000007.json
    ├── 00000000000000000008.json
    ├── 00000000000000000009.json
    └── 00000000000000000010.json

table-version

$ deltalake-tools table-version /tmp/test_delta_table       
10
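For intuition, the version reported here matches the highest-numbered commit file in _delta_log. The snippet below is a minimal Python sketch of that idea, not how deltalake-tools itself is implemented. The function name latest_commit_version is made up for illustration, and the sketch ignores checkpoints and log cleanup.

```python
import re
import tempfile
from pathlib import Path

# Commit files use a zero-padded 20-digit version, e.g. 00000000000000000010.json
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def latest_commit_version(table_path: str) -> int:
    """Return the highest commit number found in <table_path>/_delta_log.

    Hypothetical helper: it only looks at file names and does not read the log.
    """
    log_dir = Path(table_path) / "_delta_log"
    versions = [
        int(match.group(1))
        for entry in log_dir.iterdir()
        if (match := COMMIT_FILE.match(entry.name))
    ]
    if not versions:
        raise FileNotFoundError(f"no commit files found in {log_dir}")
    return max(versions)


# Recreate the log layout of the test table above (commits 0 through 10).
with tempfile.TemporaryDirectory() as table:
    log_dir = Path(table) / "_delta_log"
    log_dir.mkdir()
    for version in range(11):
        (log_dir / f"{version:020d}.json").write_text("{}")
    print(latest_commit_version(table))  # prints 10
```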

compact

  • increments version
  • rewrites the data into 1 file. Here, 11 files are replaced by 1, which considerably improves read performance.
$ deltalake-tools table-version /tmp/test_delta_table
10
$ deltalake-tools compact /tmp/test_delta_table
{'numFilesAdded': 1, 'numFilesRemoved': 11, 'filesAdded': '{"avg":1034.0,"max":1034,"min":1034,"totalFiles":1,"totalSize":1034}', 'filesRemoved': '{"avg":909.3636363636364,"max":913,"min":873,"totalFiles":11,"totalSize":10003}', 'partitionsOptimized': 1, 'numBatches': 11, 'totalConsideredFiles': 11, 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}
$ deltalake-tools table-version /tmp/test_delta_table
11
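The metrics line printed by compact looks like a Python dict whose filesAdded and filesRemoved values are JSON strings. If you want to post-process it (for logging from a cron job, say), a sketch along these lines may help. It assumes the output format stays as shown above; parse_compact_metrics is a hypothetical helper, not part of deltalake-tools.

```python
import ast
import json

# Output copied from the compact run above (a Python dict repr, not JSON).
RAW_OUTPUT = """{'numFilesAdded': 1, 'numFilesRemoved': 11,
 'filesAdded': '{"avg":1034.0,"max":1034,"min":1034,"totalFiles":1,"totalSize":1034}',
 'filesRemoved': '{"avg":909.3636363636364,"max":913,"min":873,"totalFiles":11,"totalSize":10003}',
 'partitionsOptimized': 1, 'numBatches': 11, 'totalConsideredFiles': 11,
 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}"""


def parse_compact_metrics(raw: str) -> dict:
    """Parse the printed metrics and decode the nested JSON strings."""
    metrics = ast.literal_eval(raw.strip())  # safe: evaluates literals only
    for key in ("filesAdded", "filesRemoved"):
        if isinstance(metrics.get(key), str):
            metrics[key] = json.loads(metrics[key])
    return metrics


metrics = parse_compact_metrics(RAW_OUTPUT)
removed_bytes = metrics["filesRemoved"]["totalSize"]
added_bytes = metrics["filesAdded"]["totalSize"]
print(
    f"{metrics['numFilesRemoved']} files -> {metrics['numFilesAdded']} file, "
    f"{removed_bytes} -> {added_bytes} bytes"
)
# prints: 11 files -> 1 file, 10003 -> 1034 bytes
```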

vacuum

  • increments version
  • by default, the minimum retention period is 168 hours. This can be overridden, but read the docs first.

arguments:

  • --retention-hours: how many hours of history to keep for time travel. Default: 168 (1 week)
  • --disable-retention-duration: disable the safety check
  • --force: by default, vacuum is a dry run. Use this flag to actually delete the files.

These safety features are intentional. Read the docs for more information.

$ deltalake-tools table-version /tmp/test_delta_table
11
$ deltalake-tools vacuum /tmp/test_delta_table --retention-hours 0 --disable-retention-duration --force
['3-7412d097-1b4c-4d45-9ac1-7bc312f20f62-0.parquet', '6-f9f4597c-9709-4029-a0ee-a11d757072bf-0.parquet', '5-a2182ca3-c334-43be-8935-210fc839ff77-0.parquet', '2-7bcb5e5d-d2e6-4975-a2d6-a399413b2883-0.parquet', '1-dde22814-9070-4df6-be8c-5ec564c8cfd3-0.parquet', '8-8221beb7-9a14-4ce5-a205-c5b862cbce0d-0.parquet', '9-63ec9854-d58c-4d48-b3cf-9c18c43e2d17-0.parquet', '7-a3c8fbef-27eb-4ee1-ae2a-2a30c9964fbc-0.parquet', '0-ed28f60b-7569-47fc-90fa-9cbaad8ccd27-0.parquet', '10-eafcb45c-6fdd-467b-b35d-d8bf127ae243-0.parquet', '4-9a5bd960-08fb-4b8f-8e94-8f81574799e5-0.parquet']
$ deltalake-tools table-version /tmp/test_delta_table                                                  
12
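When scripting vacuum, it can help to assemble the command from the three flags described above so that a dry run is always the default. The following is only a sketch: build_vacuum_command is a hypothetical helper, and the rule "add --disable-retention-duration whenever the retention is below 168 hours" is an assumption based on the default mentioned above.

```python
import shlex
from typing import List

DEFAULT_RETENTION_HOURS = 168  # 1 week, the default stated above


def build_vacuum_command(
    table_path: str,
    retention_hours: int = DEFAULT_RETENTION_HOURS,
    force: bool = False,
) -> List[str]:
    """Build the argv for `deltalake-tools vacuum` (dry run unless force=True)."""
    cmd = [
        "deltalake-tools",
        "vacuum",
        table_path,
        "--retention-hours",
        str(retention_hours),
    ]
    # Assumption: going below the default requires disabling the safety check.
    if retention_hours < DEFAULT_RETENTION_HOURS:
        cmd.append("--disable-retention-duration")
    if force:
        cmd.append("--force")
    return cmd


# Dry run first, then the destructive run once the file list looks right.
print(shlex.join(build_vacuum_command("/tmp/test_delta_table", 0)))
print(shlex.join(build_vacuum_command("/tmp/test_delta_table", 0, force=True)))
```

The resulting list can be passed to subprocess.run(cmd, check=True) if you want to execute it from Python.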

create-checkpoint

  • does not increment version
  • use this if your deltalake clients require the _last_checkpoint file, which is not written by default, neither by Spark nor by delta-rs.
$ deltalake-tools table-version /tmp/test_delta_table
13
$ deltalake-tools create-checkpoint /tmp/test_delta_table
Checkpoint created successfully.
$ deltalake-tools table-version /tmp/test_delta_table    
13
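To check whether a table already has a checkpoint pointer, you can look for _delta_log/_last_checkpoint, which is a small JSON file. The sketch below assumes the layout from the Delta protocol, where the file carries at least a version field. read_last_checkpoint is a hypothetical helper, and the size value in the demo is made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_last_checkpoint(table_path: str) -> Optional[dict]:
    """Return the parsed _last_checkpoint pointer, or None if it is missing."""
    pointer = Path(table_path) / "_delta_log" / "_last_checkpoint"
    if not pointer.exists():
        return None
    return json.loads(pointer.read_text())


# Demo on a throwaway directory; the pointer content is illustrative only.
with tempfile.TemporaryDirectory() as table:
    print(read_last_checkpoint(table))  # prints None: no checkpoint yet

    log_dir = Path(table) / "_delta_log"
    log_dir.mkdir()
    (log_dir / "_last_checkpoint").write_text('{"version": 13, "size": 4}')

    pointer = read_last_checkpoint(table)
    print(pointer["version"])  # prints 13
```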

Contribute

Test

  • pytest
  • moto[server]: needs to be started before initializing any clients (boto3, delta)

Running tests:

$ rye run pytest tests

management

changelog
