Lakekeeper
Safe compaction of Hive external tables on on-premises Kerberized Hadoop clusters.
The problem
On Hadoop clusters using Hive external tables with PySpark, data pipelines accumulate thousands of small files over time (e.g. 65,000 files for 3 GB of data). This pattern degrades read performance, overloads the HDFS NameNode, and slows down all downstream queries.
The root cause is that Spark writes one file per partition task by default,
and incremental pipelines append rather than rewrite. Common workarounds such as
INSERT OVERWRITE or saveAsTable do remove the small files, but they destroy
metadata cataloging properties (Apache Atlas lineage, the table's location in the
Hive Metastore), making them unsuitable for production use on managed clusters.
Lakekeeper solves this without touching the table's Metastore location.
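As a minimal sketch of how the pattern arises (the job and table names below are hypothetical, not part of Lakekeeper), consider an incremental PySpark job that appends to an external table several times a day; every run adds one file per write task and nothing ever rewrites the old ones:
# incremental_append.py : hypothetical hourly job, shown only to illustrate the problem
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Each run appends one file per task to the table's HDFS directory.
# After months of hourly runs the directory holds tens of thousands of tiny files.
new_events = spark.table("mydb.staging_events").where("load_hour = '2024-03-01 02'")
new_events.write.mode("append").insertInto("mydb.events")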
Solution
Lakekeeper compacts Hive external tables safely:
- No saveAsTable — the table's Metastore location never changes, preserving lineage and catalog properties (Apache Atlas and compatible systems)
- Zero-copy backups — a DDL clone (SHOW CREATE TABLE) pointing to the original location, with no data duplication
- External tables only — MANAGED tables are detected and skipped automatically
- Per-partition compaction — only partitions that exceed the small-file threshold are compacted; untouched partitions are skipped
- Dynamic target file count — computed from the actual data size and the configured HDFS block size (see the sketch after this list)
- Row count verification — aborts and rolls back automatically if counts do not match after compaction
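A minimal sketch of the decision logic behind the last two bullets, assuming it follows the documented defaults; the function names are illustrative and are not Lakekeeper's API:
import math

def needs_compaction(total_bytes, file_count, block_size_mb=128, ratio=10.0):
    # Compact when the average file is smaller than block_size / ratio
    # (12.8 MB with the defaults).
    avg_mb = total_bytes / file_count / (1024 * 1024)
    return avg_mb < block_size_mb / ratio

def target_file_count(total_bytes, block_size_mb=128):
    # Aim for files of roughly one HDFS block each, never fewer than one.
    return max(1, math.ceil(total_bytes / (block_size_mb * 1024 * 1024)))

# The 3 GB / 65,000-file table from "The problem": average file ~48 KB,
# far below 12.8 MB, so it is compacted down to 24 target files.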
Requirements
- Python >= 3.9
- Apache Spark (PySpark) accessible on the cluster
- Hive Metastore with external table support
- HDFS as the underlying storage
Tested on Cloudera CDP 7.1.9. Compatible with any on-premises Hadoop distribution (Hortonworks HDP, Ambari-managed clusters, vanilla Hadoop) that exposes a standard Hive Metastore and HDFS filesystem.
Installation
pip install lakekeeper
For development:
git clone https://github.com/ab2dridi/Lakekeeper.git
cd Lakekeeper
pip install ".[dev]"
End-to-end usage
Scenario 1 — Local cluster (no Kerberos)
Suitable for development environments or clusters without Kerberos authentication.
# 1. Install
pip install lakekeeper
# 2. Analyze — see which tables need compaction (no writes, safe to run anytime)
lakekeeper analyze --database mydb
# 3. Compact a specific table
lakekeeper compact --table mydb.events
# 4. If something went wrong, rollback to the original state
lakekeeper rollback --table mydb.events
# 5. Once you're confident, remove the backup to free up disk space
lakekeeper cleanup --table mydb.events
Scenario 2 — On-premises Kerberized cluster (YAML config)
On a Kerberized cluster, configure spark_submit in a YAML file. The
lakekeeper CLI automatically builds and executes the spark-submit command —
no need to write it manually.
Step 1 — Create the Python environment to ship to the cluster
conda create -n lakekeeper_env python=3.9 -y
conda activate lakekeeper_env
pip install lakekeeper
conda-pack -o lakekeeper_env.tar.gz
Step 2 — Write a config file
# lakekeeper.yaml
block_size_mb: 128
compaction_ratio_threshold: 10.0
log_level: INFO
spark_submit:
  enabled: true
  master: yarn
  deploy_mode: client
  principal: myuser@MY.REALM.COM
  keytab: /etc/security/keytabs/myuser.keytab
  queue: data-engineering
  archives: /opt/lakekeeper_env.tar.gz#lakekeeper_env
  python_env: ./lakekeeper_env/bin/python
  executor_memory: 4g
  num_executors: 10
  executor_cores: 2
  driver_memory: 2g
  script_path: /opt/lakekeeper/run_lakekeeper.py
  extra_conf:
    spark.yarn.kerberos.relogin.period: 1h
Step 3 — Run
# Analyze (dry-run, no writes)
lakekeeper --config-file lakekeeper.yaml analyze --database mydb
# Compact a single table
lakekeeper --config-file lakekeeper.yaml compact --table mydb.events
# Compact multiple tables
lakekeeper --config-file lakekeeper.yaml compact --tables mydb.events,mydb.users
# Compact an entire database
lakekeeper --config-file lakekeeper.yaml compact --database mydb
# Rollback if needed
lakekeeper --config-file lakekeeper.yaml rollback --table mydb.events
# Cleanup backups older than 7 days
lakekeeper --config-file lakekeeper.yaml cleanup --database mydb --older-than 7d
Under the hood, Lakekeeper builds and executes:
spark-submit --master yarn --deploy-mode client \
--principal myuser@MY.REALM.COM \
--keytab /etc/security/keytabs/myuser.keytab \
--conf spark.yarn.queue=data-engineering \
--archives /opt/lakekeeper_env.tar.gz#lakekeeper_env \
--conf spark.pyspark.python=./lakekeeper_env/bin/python \
--executor-memory 4g --num-executors 10 \
/opt/lakekeeper/run_lakekeeper.py compact --table mydb.events
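The following is a hedged sketch of how such a command line can be assembled from the spark_submit section of the YAML file; it mirrors the example above, but the helper is illustrative and not Lakekeeper's internal code:
import subprocess
import yaml

def build_spark_submit(cfg, lakekeeper_args):
    ss = cfg["spark_submit"]
    cmd = [
        "spark-submit",
        "--master", ss.get("master", "yarn"),
        "--deploy-mode", ss.get("deploy_mode", "client"),
        "--principal", ss["principal"],
        "--keytab", ss["keytab"],
        "--conf", f"spark.yarn.queue={ss['queue']}",
        "--archives", ss["archives"],
        "--conf", f"spark.pyspark.python={ss['python_env']}",
        "--executor-memory", ss["executor_memory"],
        "--num-executors", str(ss["num_executors"]),
    ]
    # Any extra_conf entries become additional --conf key=value flags.
    for key, value in ss.get("extra_conf", {}).items():
        cmd += ["--conf", f"{key}={value}"]
    return cmd + [ss["script_path"]] + lakekeeper_args

with open("lakekeeper.yaml") as f:
    config = yaml.safe_load(f)
subprocess.run(build_spark_submit(config, ["compact", "--table", "mydb.events"]), check=True)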
Scenario 3 — Manual spark-submit
For one-off runs or when Lakekeeper is not installed on the edge node.
spark-submit \
--master yarn \
--deploy-mode client \
--principal myuser@MY.REALM.COM \
--keytab /etc/security/keytabs/myuser.keytab \
--conf spark.yarn.queue=my-queue \
--archives lakekeeper_env.tar.gz#lakekeeper_env \
--conf spark.pyspark.python=./lakekeeper_env/bin/python \
run_lakekeeper.py compact --database mydb --block-size 128
CLI reference
lakekeeper [OPTIONS] COMMAND [ARGS]...
Options:
-c, --config-file PATH YAML configuration file.
--version Show version and exit.
--help Show help and exit.
Commands:
analyze Analyze tables and report compaction needs (dry-run, no writes).
compact Compact Hive external tables.
rollback Rollback a table to its pre-compaction state.
cleanup Remove backup tables and reclaim HDFS space.
--config-file placement: pass it before the subcommand name:
lakekeeper --config-file lakekeeper.yaml compact --table mydb.events
It can also be placed after the subcommand for backward compatibility.
analyze
lakekeeper analyze --database mydb
lakekeeper analyze --table mydb.events
lakekeeper analyze --tables mydb.events,mydb.users
lakekeeper analyze --table mydb.events --block-size 256 --ratio-threshold 5
compact
lakekeeper compact --database mydb
lakekeeper compact --table mydb.events
lakekeeper compact --tables mydb.events,mydb.users
lakekeeper compact --database mydb --block-size 256 --ratio-threshold 5
lakekeeper compact --database mydb --dry-run # analyze only, no writes
rollback
lakekeeper rollback --table mydb.events
cleanup
lakekeeper cleanup --table mydb.events # remove all backups for a table
lakekeeper cleanup --database mydb --older-than 7d # remove backups older than 7 days
Configuration reference
Lakekeeper parameters
| Parameter | Default | CLI flag | Description |
|---|---|---|---|
| block_size_mb | 128 | --block-size | Target HDFS block size in MB |
| compaction_ratio_threshold | 10.0 | --ratio-threshold | Compact if avg file size < block_size / ratio |
| backup_prefix | __bkp | — | Prefix for backup table names |
| dry_run | false | --dry-run | Analyze only, no writes |
| log_level | INFO | --log-level | DEBUG, INFO, WARNING, ERROR |
spark_submit parameters
| Parameter | Default | Description |
|---|---|---|
| enabled | false | Enable automatic spark-submit launch |
| master | yarn | Spark master URL |
| deploy_mode | client | client or cluster |
| principal | — | Kerberos principal (e.g. user@REALM.COM) |
| keytab | — | Path to the Kerberos keytab file |
| queue | — | YARN queue name (spark.yarn.queue) |
| archives | — | --archives for the conda-packed Python env |
| python_env | — | Python path inside the archive (spark.pyspark.python) |
| executor_memory | — | --executor-memory (e.g. 4g) |
| num_executors | — | --num-executors |
| executor_cores | — | --executor-cores |
| driver_memory | — | --driver-memory |
| script_path | run_lakekeeper.py | Path to the entry-point script passed to spark-submit |
| extra_conf | {} | Additional --conf key=value pairs |
| extra_files | [] | Files distributed to executors via --files (e.g. hive-site.xml) |
How it works
Compaction strategy — HDFS rename swap
Lakekeeper uses HDFS directory renames rather than ALTER TABLE SET LOCATION
to swap data. The table's Metastore location never changes — only the contents
of the HDFS directory are replaced in place. Lineage and cataloging properties
(Apache Atlas and compatible systems) are fully preserved.
Non-partitioned table
Given a table mydb.events at hdfs:///warehouse/mydb/events/:
Step 1 — Backup
Metastore: mydb.__bkp_events_20240301_020000 → hdfs:///warehouse/mydb/events/
(external.table.purge=false)
HDFS: events/ (original files, untouched)
Step 2 — Write compacted data to a temp sibling directory
HDFS: events/ ← original, still live
events__compact_tmp_1709257200/ ← Spark writes here
Step 3 — Verify row count
Counts differ → delete events__compact_tmp_1709257200/ and abort.
Original data at events/ is never touched.
Step 4 — Atomic HDFS rename swap
rename events/ → events__old_1709257200/
rename events__compact_tmp_1709257200/ → events/
Final state:
events/ ← compacted files (table still points here)
events__old_1709257200/ ← original files (kept for rollback)
__bkp_events_20240301_020000 ← backup table in Metastore
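A hedged sketch of steps 3 and 4 using the Hadoop FileSystem API exposed through PySpark's JVM gateway; the paths follow the example above, the data format is assumed to be Parquet for the count check, and this is an illustration of the technique rather than Lakekeeper's internal code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

table_path = "hdfs:///warehouse/mydb/events"
tmp_path = table_path + "__compact_tmp_1709257200"
old_path = table_path + "__old_1709257200"

# Step 3: verify row counts before touching the live directory
original_rows = spark.table("mydb.events").count()
compacted_rows = spark.read.parquet(tmp_path).count()   # file format is an assumption
if original_rows != compacted_rows:
    raise RuntimeError("Row count mismatch: aborting, original data left untouched")

# Step 4: swap directories with two HDFS renames (metadata-only operations)
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
Path = jvm.org.apache.hadoop.fs.Path
fs.rename(Path(table_path), Path(old_path))      # events/  -> events__old_TS/
fs.rename(Path(tmp_path), Path(table_path))      # tmp dir  -> events/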
Partitioned table
The same rename swap is applied partition by partition, only for partitions that exceed the compaction threshold:
Before:
events/year=2024/month=01/ 10 000 files, 1 GB ← needs compaction
events/year=2024/month=02/ 3 files, 300 MB ← skipped
After:
events/year=2024/month=01/ ← 8 compacted files
events/year=2024/month=01__old_TS/ ← original (kept for rollback)
events/year=2024/month=02/ ← untouched
Readers of already-compacted partitions see the new files immediately while readers of not-yet-processed partitions still see the original data. All reads remain consistent throughout the operation.
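As a hedged sketch of the per-partition selection (illustrative only; needs_compaction is the hypothetical helper sketched in the Solution section), partitions can be listed through the Metastore and measured directly on HDFS:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
Path = jvm.org.apache.hadoop.fs.Path

base = "hdfs:///warehouse/mydb/events"
for row in spark.sql("SHOW PARTITIONS mydb.events").collect():
    part_path = f"{base}/{row[0]}"                    # e.g. year=2024/month=01
    summary = fs.getContentSummary(Path(part_path))   # total bytes and file count
    if needs_compaction(summary.getLength(), summary.getFileCount()):
        print(f"{part_path}: {summary.getFileCount()} files, compaction needed")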
Rollback
lakekeeper rollback --table mydb.events
- Finds the most recent backup table (__bkp_events_*)
- Reads its Metastore location → events__old_TS/ (the original data)
- Deletes events/ (the compacted data)
- Renames events__old_TS/ back to events/
- Drops the backup table
The table is restored to exactly its pre-compaction state.
Cleanup
lakekeeper cleanup --table mydb.events
- Finds all __bkp_events_* backup tables
- For each: deletes the __old_* HDFS directory it points to, then drops the backup table
Cleanup is irreversible. Once run, rollback is no longer possible for the cleaned backups.
Important considerations
⚠ Run during a maintenance window
Lakekeeper reads the table twice (once to count rows, once to write). Any rows written by an active pipeline between those two reads will not appear in the compacted output and will be lost after the rename swap.
Always run Lakekeeper while source pipelines are stopped, or schedule it in a maintenance window.
⚠ 2× disk space required
During compaction, both the original and compacted data exist on HDFS simultaneously:
- events/ — original files (until the rename swap)
- events__compact_tmp_TS/ — compacted files being written
Ensure the HDFS parent directory quota allows at least 2× the table size before starting.
⚠ Do not delete __old_* directories manually
After a successful compaction, events__old_TS/ is the rollback safety net.
Deleting it manually makes rollback impossible. Use lakekeeper cleanup instead.
⚠ Do not drop backup tables manually
Backup tables are created with TBLPROPERTIES ('external.table.purge'='false')
to prevent the Hive Metastore setting external.table.purge=true from deleting
the underlying HDFS data on DROP TABLE. Dropping a backup table manually
removes the Metastore pointer to events__old_TS/ and prevents rollback.
Cloudera CDP note: CDP clusters commonly set external.table.purge=true globally. The purge=false property on backup tables overrides this default.
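For illustration, a zero-copy backup of the example table could be declared as below. The DDL Lakekeeper actually generates comes from SHOW CREATE TABLE and may differ; CREATE TABLE ... LIKE with a LOCATION clause is used here only as a simplified stand-in.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Declare a new table over the existing data directory: no bytes are copied.
spark.sql("""
    CREATE TABLE mydb.__bkp_events_20240301_020000
    LIKE mydb.events
    LOCATION 'hdfs:///warehouse/mydb/events'
""")

# Make sure dropping the backup table can never delete the underlying HDFS data.
spark.sql("""
    ALTER TABLE mydb.__bkp_events_20240301_020000
    SET TBLPROPERTIES ('external.table.purge' = 'false')
""")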
⚠ Leftover staging directories block the next run
If a previous compaction crashed, it may have left an events__compact_tmp_TS/
or events__old_TS/ directory behind. Lakekeeper refuses to start if either
path already exists. Resolve manually before retrying:
- Inspect the leftover directory contents.
- If it contains valid compacted data, check whether the rename swap completed and restore accordingly.
- If it is stale or incomplete, delete it: hdfs dfs -rm -r <path>.
Development
git clone https://github.com/ab2dridi/Lakekeeper.git
cd Lakekeeper
pip install ".[dev]"
# Lint
ruff check src/ tests/
ruff format --check src/ tests/
# Tests with coverage
pytest tests/ -v --cov=lakekeeper --cov-report=term-missing
License
MIT — see LICENSE for details.