Skip to main content

A set of easy to use convenient tools for deltalake tables.

Project description

deltalake-tools logo

deltalake-tools license deltalake-tools pipelines deltalake-tools pipelines deltalake-tools Documentation Status

Deltalake Tools

Introduction

A set of easy to use commands for deltalake, with a command line interface.

You probably don't need this, especially if you're already using delta-rs (deltalake). Just like you don't need awscli when you already have a boto3 client. It's useful as a delta cli ie. as a cron job.

Also working on a Rust implementation, with python bindings using PYO3/maturin.

Delta Table Commands currently supported

  • compact
  • vacuum
  • create-checkpoint
  • table-version
  • delete-table
  • create-test-table
  • parquet-to-delta 1
  • ...

1: currently limited to inplace and local storage

Storage services currently supported

  • local storage
  • AWS S3 (virtual and path addressing style)
  • IBM COS (virtual and path addressing style)
  • Azure blob storage
  • Google cloud storage
  • ...

Platforms supported

arm64 amd64
linux x x
mac x x
windows

Minimal Python version

3.10

Getting started

Install

pip install deltalake-tools

check out astral's rye, uv and ruff projects

(uv is a blazingly fast drop-in replacement for pip.)

uv install deltalake-tools

If you prefer rye:

rye add deltalake-tools

Usage

help

$ deltalake-tools -h
Usage: deltalake-tools [OPTIONS] COMMAND [ARGS]...

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  compact
  create-checkpoint
  table-version
  vacuum
$

Consider the following test table:

/tmp/test_delta_table
├── 0-ed28f60b-7569-47fc-90fa-9cbaad8ccd27-0.parquet
├── 1-dde22814-9070-4df6-be8c-5ec564c8cfd3-0.parquet
├── 10-eafcb45c-6fdd-467b-b35d-d8bf127ae243-0.parquet
├── 2-7bcb5e5d-d2e6-4975-a2d6-a399413b2883-0.parquet
├── 3-7412d097-1b4c-4d45-9ac1-7bc312f20f62-0.parquet
├── 4-9a5bd960-08fb-4b8f-8e94-8f81574799e5-0.parquet
├── 5-a2182ca3-c334-43be-8935-210fc839ff77-0.parquet
├── 6-f9f4597c-9709-4029-a0ee-a11d757072bf-0.parquet
├── 7-a3c8fbef-27eb-4ee1-ae2a-2a30c9964fbc-0.parquet
├── 8-8221beb7-9a14-4ce5-a205-c5b862cbce0d-0.parquet
├── 9-63ec9854-d58c-4d48-b3cf-9c18c43e2d17-0.parquet
└── _delta_log
    ├── 00000000000000000000.json
    ├── 00000000000000000001.json
    ├── 00000000000000000002.json
    ├── 00000000000000000003.json
    ├── 00000000000000000004.json
    ├── 00000000000000000005.json
    ├── 00000000000000000006.json
    ├── 00000000000000000007.json
    ├── 00000000000000000008.json
    ├── 00000000000000000009.json
    └── 00000000000000000010.json

table-version

$ deltalake-tools table-version /tmp/test_delta_table
10

compact

  • increments version
  • rewrites the date into 1 file. Here, 11 files are replaced by 1. This has a considerable beneficial effect on read performance.
$ deltalake-tools table-version /tmp/test_delta_table
10
$ deltalake-tools compact /tmp/test_delta_table
{'numFilesAdded': 1, 'numFilesRemoved': 11, 'filesAdded': '{"avg":1034.0,"max":1034,"min":1034,"totalFiles":1,"totalSize":1034}', 'filesRemoved': '{"avg":909.3636363636364,"max":913,"min":873,"totalFiles":11,"totalSize":10003}', 'partitionsOptimized': 1, 'numBatches': 11, 'totalConsideredFiles': 11, 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}
$ deltalake-tools table-version /tmp/test_delta_table
11

vacuum

  • increments version
  • by default, the minimal retention hours is 168. This can be overridden, but read the docs first.

arguments:

  • --retention-hours: how long do you want to keep, for time travelling. Default: 168 (1 week)
  • --disable-retention-duration: disable the safety check
  • --force: by default, this is a dry-run operation. Use this to actually perform the vacuum command.

These safety features are implemented intentionally. Read the docs for more information.

$ deltalake-tools table-version /tmp/test_delta_table
11
$deltalake-tools vacuum /tmp/test_delta_table --retention-hours 0 --disable-retention-duration --force
['3-7412d097-1b4c-4d45-9ac1-7bc312f20f62-0.parquet', '6-f9f4597c-9709-4029-a0ee-a11d757072bf-0.parquet', '5-a2182ca3-c334-43be-8935-210fc839ff77-0.parquet', '2-7bcb5e5d-d2e6-4975-a2d6-a399413b2883-0.parquet', '1-dde22814-9070-4df6-be8c-5ec564c8cfd3-0.parquet', '8-8221beb7-9a14-4ce5-a205-c5b862cbce0d-0.parquet', '9-63ec9854-d58c-4d48-b3cf-9c18c43e2d17-0.parquet', '7-a3c8fbef-27eb-4ee1-ae2a-2a30c9964fbc-0.parquet', '0-ed28f60b-7569-47fc-90fa-9cbaad8ccd27-0.parquet', '10-eafcb45c-6fdd-467b-b35d-d8bf127ae243-0.parquet', '4-9a5bd960-08fb-4b8f-8e94-8f81574799e5-0.parquet']
$ deltalake-tools table-version /tmp/test_delta_table
12

create-checkpoint

  • does not increment version
  • use this if your deltalake clients require this _last_checkpoint file, as it is not written by default. Not by spark, not by delta-rs.
$ deltalake-tools table-version /tmp/test_delta_table
13
$ deltalake-tools create-checkpoint /tmp/test_delta_table
Checkpoint created successfully.
$ deltalake-tools table-version /tmp/test_delta_table
13

Contribute

Test

  • pytest
  • moto[server]: needs to be started before initializing any clients (boto3, delta)

Running tests:

rye run pytest tests

management

changelog

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deltalake_tools-0.1.22.tar.gz (366.9 kB view details)

Uploaded Source

Built Distribution

deltalake_tools-0.1.22-py3-none-any.whl (13.6 kB view details)

Uploaded Python 3

File details

Details for the file deltalake_tools-0.1.22.tar.gz.

File metadata

  • Download URL: deltalake_tools-0.1.22.tar.gz
  • Upload date:
  • Size: 366.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for deltalake_tools-0.1.22.tar.gz
Algorithm Hash digest
SHA256 3cfdf7a0c02a6e7f4bb02378c407937337baa8439248a1423f870d2de35431a3
MD5 37ec2b281a0aa9aa489bbc8ab2587b11
BLAKE2b-256 b8e9a51ac3773466d0deb6edeea2b847746a337e7501ba9892c1fa63cbb3c09a

See more details on using hashes here.

File details

Details for the file deltalake_tools-0.1.22-py3-none-any.whl.

File metadata

File hashes

Hashes for deltalake_tools-0.1.22-py3-none-any.whl
Algorithm Hash digest
SHA256 4d4b9042967fe11df201f05bd5e08f83d4ac25e22c2c176990324d693660acc5
MD5 b7a2b0f8679edb55271ef6c433fe08f3
BLAKE2b-256 b9c2b119f38d6406310250a208886a7579370ee72708281930f10b4534805927

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page