Lakekeeper
Safe compaction of Hive external tables on on-premises Kerberized Hadoop clusters.
The problem
On Hadoop clusters using Hive external tables with PySpark, data pipelines accumulate thousands of small files over time (e.g. 65,000 files for 3 GB of data). This pattern degrades read performance, overloads the HDFS NameNode, and slows down all downstream queries.
The root cause is that Spark writes one file per partition task by default,
and incremental pipelines append rather than rewrite. Common workarounds such as
INSERT OVERWRITE or saveAsTable do remove the small files, but they destroy
metadata cataloging properties (Apache Atlas lineage, the table's location in the
Hive Metastore), making them unsuitable for production use on managed clusters.
Lakekeeper solves this without touching the table's Metastore location.
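As a minimal sketch of how the pattern arises (the job and table names below are hypothetical, not part of Lakekeeper), consider an incremental PySpark job that appends to an external table several times a day; every run adds one file per write task and nothing ever rewrites the old ones:
# incremental_append.py : hypothetical hourly job, shown only to illustrate the problem
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Each run appends one file per task to the table's HDFS directory.
# After months of hourly runs the directory holds tens of thousands of tiny files.
new_events = spark.table("mydb.staging_events").where("load_hour = '2024-03-01 02'")
new_events.write.mode("append").insertInto("mydb.events")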
Solution
Lakekeeper compacts Hive external tables safely:
- No saveAsTable — the table's Metastore location never changes, preserving lineage and catalog properties (Apache Atlas and compatible systems)
- Zero-copy backups — a DDL clone (SHOW CREATE TABLE) pointing to the original location, with no data duplication
- External tables only — MANAGED tables are detected and skipped automatically
- Per-partition compaction — only partitions that exceed the small-file threshold are compacted; untouched partitions are skipped
- Dynamic target file count — computed from the actual data size and the configured HDFS block size (see the sketch after this list)
- Row count verification — aborts and rolls back automatically if counts do not match after compaction
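A minimal sketch of the decision logic behind the last two bullets, assuming it follows the documented defaults; the function names are illustrative and are not Lakekeeper's API:
import math

def needs_compaction(total_bytes, file_count, block_size_mb=128, ratio=10.0):
    # Compact when the average file is smaller than block_size / ratio
    # (12.8 MB with the defaults).
    avg_mb = total_bytes / file_count / (1024 * 1024)
    return avg_mb < block_size_mb / ratio

def target_file_count(total_bytes, block_size_mb=128):
    # Aim for files of roughly one HDFS block each, never fewer than one.
    return max(1, math.ceil(total_bytes / (block_size_mb * 1024 * 1024)))

# The 3 GB / 65,000-file table from "The problem": average file ~48 KB,
# far below 12.8 MB, so it is compacted down to 24 target files.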
Requirements
- Python >= 3.9
- Apache Spark (PySpark) accessible on the cluster
- Hive Metastore with external table support
- HDFS as the underlying storage
Tested on Cloudera CDP 7.1.9. Compatible with any on-premises Hadoop distribution (Hortonworks HDP, Ambari-managed clusters, vanilla Hadoop) that exposes a standard Hive Metastore and HDFS filesystem.
Installation
pip install lakekeeper
For development:
git clone https://github.com/ab2dridi/Lakekeeper.git
cd Lakekeeper
pip install ".[dev]"
End-to-end usage
Scenario 1 — Local cluster (no Kerberos)
Suitable for development environments or clusters without Kerberos authentication.
# 1. Install
pip install lakekeeper
# 2. Analyze — see which tables need compaction (no writes, safe to run anytime)
lakekeeper analyze --database mydb
# 3. Compact a specific table
lakekeeper compact --table mydb.events
# 4. If something went wrong, rollback to the original state
lakekeeper rollback --table mydb.events
# 5. Once you're confident, remove the backup to free up disk space
lakekeeper cleanup --table mydb.events
Scenario 2 — On-premises Kerberized cluster (YAML config)
On a Kerberized cluster, configure spark_submit in a YAML file. The
lakekeeper CLI automatically builds and executes the spark-submit command —
no need to write it manually.
Step 1 — Create the Python environment to ship to the cluster
conda create -n lakekeeper_env python=3.9 -y
conda activate lakekeeper_env
pip install lakekeeper
conda-pack -o lakekeeper_env.tar.gz
Step 2 — Write a config file
# lakekeeper.yaml
block_size_mb: 128
compaction_ratio_threshold: 10.0
log_level: INFO
spark_submit:
  enabled: true
  master: yarn
  deploy_mode: client
  principal: myuser@MY.REALM.COM
  keytab: /etc/security/keytabs/myuser.keytab
  queue: data-engineering
  archives: /opt/lakekeeper_env.tar.gz#lakekeeper_env
  python_env: ./lakekeeper_env/bin/python
  executor_memory: 4g
  num_executors: 10
  executor_cores: 2
  driver_memory: 2g
  script_path: /opt/lakekeeper/run_lakekeeper.py
  extra_conf:
    spark.yarn.kerberos.relogin.period: 1h
Step 3 — Run
# Analyze (dry-run, no writes)
lakekeeper --config-file lakekeeper.yaml analyze --database mydb
# Compact a single table
lakekeeper --config-file lakekeeper.yaml compact --table mydb.events
# Compact multiple tables
lakekeeper --config-file lakekeeper.yaml compact --tables mydb.events,mydb.users
# Compact an entire database
lakekeeper --config-file lakekeeper.yaml compact --database mydb
# Rollback if needed
lakekeeper --config-file lakekeeper.yaml rollback --table mydb.events
# Cleanup backups older than 7 days
lakekeeper --config-file lakekeeper.yaml cleanup --database mydb --older-than 7d
Under the hood, Lakekeeper builds and executes:
spark-submit --master yarn --deploy-mode client \
--principal myuser@MY.REALM.COM \
--keytab /etc/security/keytabs/myuser.keytab \
--conf spark.yarn.queue=data-engineering \
--archives /opt/lakekeeper_env.tar.gz#lakekeeper_env \
--conf spark.pyspark.python=./lakekeeper_env/bin/python \
--executor-memory 4g --num-executors 10 \
/opt/lakekeeper/run_lakekeeper.py compact --table mydb.events
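The following is a hedged sketch of how such a command line can be assembled from the spark_submit section of the YAML file; it mirrors the example above, but the helper is illustrative and not Lakekeeper's internal code:
import subprocess
import yaml

def build_spark_submit(cfg, lakekeeper_args):
    ss = cfg["spark_submit"]
    cmd = [
        "spark-submit",
        "--master", ss.get("master", "yarn"),
        "--deploy-mode", ss.get("deploy_mode", "client"),
        "--principal", ss["principal"],
        "--keytab", ss["keytab"],
        "--conf", f"spark.yarn.queue={ss['queue']}",
        "--archives", ss["archives"],
        "--conf", f"spark.pyspark.python={ss['python_env']}",
        "--executor-memory", ss["executor_memory"],
        "--num-executors", str(ss["num_executors"]),
    ]
    # Any extra_conf entries become additional --conf key=value flags.
    for key, value in ss.get("extra_conf", {}).items():
        cmd += ["--conf", f"{key}={value}"]
    return cmd + [ss["script_path"]] + lakekeeper_args

with open("lakekeeper.yaml") as f:
    config = yaml.safe_load(f)
subprocess.run(build_spark_submit(config, ["compact", "--table", "mydb.events"]), check=True)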
Scenario 3 — Manual spark-submit
For one-off runs or when Lakekeeper is not installed on the edge node.
spark-submit \
--master yarn \
--deploy-mode client \
--principal myuser@MY.REALM.COM \
--keytab /etc/security/keytabs/myuser.keytab \
--conf spark.yarn.queue=my-queue \
--archives lakekeeper_env.tar.gz#lakekeeper_env \
--conf spark.pyspark.python=./lakekeeper_env/bin/python \
run_lakekeeper.py compact --database mydb --block-size 128
CLI reference
lakekeeper [OPTIONS] COMMAND [ARGS]...
Options:
-c, --config-file PATH YAML configuration file.
--version Show version and exit.
--help Show help and exit.
Commands:
analyze Analyze tables and report compaction needs (dry-run, no writes).
compact Compact Hive external tables.
rollback Rollback a table to its pre-compaction state.
cleanup Remove backup tables and reclaim HDFS space.
--config-file placement: pass it before the subcommand name:
lakekeeper --config-file lakekeeper.yaml compact --table mydb.events
It can also be placed after the subcommand for backward compatibility.
analyze
lakekeeper analyze --database mydb
lakekeeper analyze --table mydb.events
lakekeeper analyze --tables mydb.events,mydb.users
lakekeeper analyze --table mydb.events --block-size 256 --ratio-threshold 5
compact
lakekeeper compact --database mydb
lakekeeper compact --table mydb.events
lakekeeper compact --tables mydb.events,mydb.users
lakekeeper compact --database mydb --block-size 256 --ratio-threshold 5
lakekeeper compact --database mydb --dry-run # analyze only, no writes
rollback
lakekeeper rollback --table mydb.events
cleanup
lakekeeper cleanup --table mydb.events # remove all backups for a table
lakekeeper cleanup --database mydb --older-than 7d # remove backups older than 7 days
Configuration reference
Lakekeeper parameters
| Parameter | Default | CLI flag | Description |
|---|---|---|---|
| block_size_mb | 128 | --block-size | Target HDFS block size in MB |
| compaction_ratio_threshold | 10.0 | --ratio-threshold | Compact if avg file size < block_size / ratio |
| backup_prefix | __bkp | — | Prefix for backup table names |
| dry_run | false | --dry-run | Analyze only, no writes |
| log_level | INFO | --log-level | DEBUG, INFO, WARNING, ERROR |
spark_submit parameters
| Parameter | Default | Description |
|---|---|---|
| enabled | false | Enable automatic spark-submit launch |
| master | yarn | Spark master URL |
| deploy_mode | client | client or cluster |
| principal | — | Kerberos principal (e.g. user@REALM.COM) |
| keytab | — | Path to the Kerberos keytab file |
| queue | — | YARN queue name (spark.yarn.queue) |
| archives | — | --archives for the conda-packed Python env |
| python_env | — | Python path inside the archive (spark.pyspark.python) |
| executor_memory | — | --executor-memory (e.g. 4g) |
| num_executors | — | --num-executors |
| executor_cores | — | --executor-cores |
| driver_memory | — | --driver-memory |
| script_path | run_lakekeeper.py | Path to the entry-point script passed to spark-submit |
| extra_conf | {} | Additional --conf key=value pairs |
| extra_files | [] | Files distributed to executors via --files (e.g. hive-site.xml) |
How it works
Compaction strategy — HDFS rename swap
Lakekeeper uses HDFS directory renames rather than ALTER TABLE SET LOCATION
to swap data. The table's Metastore location never changes — only the contents
of the HDFS directory are replaced in place. Lineage and cataloging properties
(Apache Atlas and compatible systems) are fully preserved.
Non-partitioned table
Given a table mydb.events at hdfs:///warehouse/mydb/events/:
Step 1 — Backup
Metastore: mydb.__bkp_events_20240301_020000 → hdfs:///warehouse/mydb/events/
(external.table.purge=false)
HDFS: events/ (original files, untouched)
Step 2 — Write compacted data to a temp sibling directory
HDFS: events/ ← original, still live
events__compact_tmp_1709257200/ ← Spark writes here
Step 3 — Verify row count
Counts differ → delete events__compact_tmp_1709257200/ and abort.
Original data at events/ is never touched.
Step 4 — Atomic HDFS rename swap
rename events/ → events__old_1709257200/
rename events__compact_tmp_1709257200/ → events/
Final state:
events/ ← compacted files (table still points here)
events__old_1709257200/ ← original files (kept for rollback)
__bkp_events_20240301_020000 ← backup table in Metastore
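A hedged sketch of steps 3 and 4 using the Hadoop FileSystem API exposed through PySpark's JVM gateway; the paths follow the example above, the data format is assumed to be Parquet for the count check, and this is an illustration of the technique rather than Lakekeeper's internal code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

table_path = "hdfs:///warehouse/mydb/events"
tmp_path = table_path + "__compact_tmp_1709257200"
old_path = table_path + "__old_1709257200"

# Step 3: verify row counts before touching the live directory
original_rows = spark.table("mydb.events").count()
compacted_rows = spark.read.parquet(tmp_path).count()   # file format is an assumption
if original_rows != compacted_rows:
    raise RuntimeError("Row count mismatch: aborting, original data left untouched")

# Step 4: swap directories with two HDFS renames (metadata-only operations)
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
Path = jvm.org.apache.hadoop.fs.Path
fs.rename(Path(table_path), Path(old_path))      # events/  -> events__old_TS/
fs.rename(Path(tmp_path), Path(table_path))      # tmp dir  -> events/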
Partitioned table
The same rename swap is applied partition by partition, only for partitions that exceed the compaction threshold:
Before:
events/year=2024/month=01/ 10 000 files, 1 GB ← needs compaction
events/year=2024/month=02/ 3 files, 300 MB ← skipped
After:
events/year=2024/month=01/ ← 8 compacted files
events/year=2024/month=01__old_TS/ ← original (kept for rollback)
events/year=2024/month=02/ ← untouched
Readers of already-compacted partitions see the new files immediately while readers of not-yet-processed partitions still see the original data. All reads remain consistent throughout the operation.
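As a hedged sketch of the per-partition selection (illustrative only; needs_compaction is the hypothetical helper sketched in the Solution section), partitions can be listed through the Metastore and measured directly on HDFS:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
Path = jvm.org.apache.hadoop.fs.Path

base = "hdfs:///warehouse/mydb/events"
for row in spark.sql("SHOW PARTITIONS mydb.events").collect():
    part_path = f"{base}/{row[0]}"                    # e.g. year=2024/month=01
    summary = fs.getContentSummary(Path(part_path))   # total bytes and file count
    if needs_compaction(summary.getLength(), summary.getFileCount()):
        print(f"{part_path}: {summary.getFileCount()} files, compaction needed")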
Rollback
lakekeeper rollback --table mydb.events
- Finds the most recent backup table (__bkp_events_*)
- Reads its Metastore location → events__old_TS/ (the original data)
- Deletes events/ (the compacted data)
- Renames events__old_TS/ back to events/
- Drops the backup table
The table is restored to exactly its pre-compaction state.
Cleanup
lakekeeper cleanup --table mydb.events
- Finds all __bkp_events_* backup tables
- For each: deletes the __old_* HDFS directory it points to, then drops the backup table
Cleanup is irreversible. Once run, rollback is no longer possible for the cleaned backups.
Important considerations
⚠ Run during a maintenance window
Lakekeeper reads the table twice (once to count rows, once to write). Any rows written by an active pipeline between those two reads will not appear in the compacted output and will be lost after the rename swap.
Always run Lakekeeper while source pipelines are stopped, or schedule it in a maintenance window.
⚠ 2× disk space required
During compaction, both the original and compacted data exist on HDFS simultaneously:
- events/ — original files (until the rename swap)
- events__compact_tmp_TS/ — compacted files being written
Ensure the HDFS parent directory quota allows at least 2× the table size before starting.
⚠ Do not delete __old_* directories manually
After a successful compaction, events__old_TS/ is the rollback safety net.
Deleting it manually makes rollback impossible. Use lakekeeper cleanup instead.
⚠ Do not drop backup tables manually
Backup tables are created with TBLPROPERTIES ('external.table.purge'='false')
to prevent the Hive Metastore setting external.table.purge=true from deleting
the underlying HDFS data on DROP TABLE. Dropping a backup table manually
removes the Metastore pointer to events__old_TS/ and prevents rollback.
Cloudera CDP note: CDP clusters commonly set external.table.purge=true globally. The purge=false property on backup tables overrides this default.
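For illustration, a zero-copy backup of the example table could be declared as below. The DDL Lakekeeper actually generates comes from SHOW CREATE TABLE and may differ; CREATE TABLE ... LIKE with a LOCATION clause is used here only as a simplified stand-in.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Declare a new table over the existing data directory: no bytes are copied.
spark.sql("""
    CREATE TABLE mydb.__bkp_events_20240301_020000
    LIKE mydb.events
    LOCATION 'hdfs:///warehouse/mydb/events'
""")

# Make sure dropping the backup table can never delete the underlying HDFS data.
spark.sql("""
    ALTER TABLE mydb.__bkp_events_20240301_020000
    SET TBLPROPERTIES ('external.table.purge' = 'false')
""")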
⚠ Leftover staging directories block the next run
If a previous compaction crashed, it may have left an events__compact_tmp_TS/
or events__old_TS/ directory behind. Lakekeeper refuses to start if either
path already exists. Resolve manually before retrying:
- Inspect the leftover directory contents.
- If it contains valid compacted data, check whether the rename swap completed and restore accordingly.
- If it is stale or incomplete, delete it: hdfs dfs -rm -r <path>.
Development
git clone https://github.com/ab2dridi/Lakekeeper.git
cd Lakekeeper
pip install ".[dev]"
# Lint
ruff check src/ tests/
ruff format --check src/ tests/
# Tests with coverage
pytest tests/ -v --cov=lakekeeper --cov-report=term-missing
License
MIT — see LICENSE for details.