Deploy and benchmark lakehouse stacks on Kubernetes
Lakebench
CLI tool for deploying and benchmarking lakehouse architectures on Kubernetes.
Note: This package is published as lakebench-k8s on PyPI. Install with pip install lakebench-k8s. The CLI command is lakebench.
Choosing between Hive and Polaris, Iceberg and Delta, or sizing Spark for 100 GB vs 10 TB shouldn't require weeks of manual setup. Lakebench deploys a complete lakehouse stack from a single YAML file, generates realistic data at any scale, runs the pipeline, benchmarks query performance, and tears everything down, so you can focus on comparing architectures, not plumbing.
Installation
pip install lakebench-k8s
Or with pipx: pipx install lakebench-k8s
Pre-built binaries (no Python required) are available on GitHub Releases.
Prerequisites
- Python 3.10+
- kubectl and helm on PATH
- Kubernetes cluster (1.26+, minimum 8 CPU / 32 GB RAM for scale 1)
- S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
- Kubeflow Spark Operator 2.4.0+ (or set spark.operator.install: true in config to auto-install)
- Stackable Hive Operator if using a Hive recipe (the default). Not needed for Polaris recipes.
See Getting Started for detailed install instructions.
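The local prerequisites above (Python version, kubectl and helm on PATH) can be checked before installing anything. The following is a minimal sketch, not part of Lakebench: the function names `missing_tools`, `python_is_supported`, and `preflight_report` are hypothetical, and the check covers only what can be verified on the local machine, not the cluster, object storage, or operators.

```python
import shutil
import sys

# Tools the prerequisites list names; checked only for presence on PATH.
REQUIRED_TOOLS = ("kubectl", "helm")
MIN_PYTHON = (3, 10)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if `version` (default: the running interpreter) meets `minimum`."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= minimum


def preflight_report():
    """Collect human-readable problems; an empty list means the local checks passed."""
    problems = []
    if not python_is_supported():
        problems.append("Python 3.10 or newer is required")
    for tool in missing_tools():
        problems.append(f"{tool} was not found on PATH")
    return problems
```

Calling `preflight_report()` returns an empty list when the interpreter and both tools are in place. Cluster-side checks are what `lakebench validate` is for.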
Quick Start
A recipe selects the catalog + table format + query engine combination
(e.g. hive-iceberg-trino). The scale factor controls data volume:
1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB.
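The three data points above fit a linear rule of roughly 10 GB per scale unit. As a rough sketch under that assumption (the helper names are hypothetical, and values between or beyond the documented points are extrapolated, not confirmed by Lakebench), the mapping can be inverted to pick a scale factor for a target volume:

```python
import math

# Assumption: volume grows linearly with the scale factor,
# anchored on the documented point "1 = ~10 GB".
GB_PER_SCALE_UNIT = 10


def approx_volume_gb(scale):
    """Estimate the generated data volume in GB for a scale factor."""
    if scale <= 0:
        raise ValueError("scale factor must be positive")
    return scale * GB_PER_SCALE_UNIT


def scale_for_volume_gb(target_gb):
    """Smallest whole scale factor whose estimated volume reaches target_gb."""
    if target_gb <= 0:
        raise ValueError("target volume must be positive")
    return max(1, math.ceil(target_gb / GB_PER_SCALE_UNIT))
```

Under this estimate, the 100 GB and 10 TB cases mentioned in the introduction correspond to scale factors of about 10 and 1000 (taking 1 TB as 1000 GB).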
# 1. Generate config (interactive prompts for S3 details)
lakebench init --interactive
# 2. Validate config and cluster connectivity
lakebench validate lakebench.yaml
# 3. Deploy infrastructure
lakebench deploy lakebench.yaml
# 4. Generate test data
lakebench generate lakebench.yaml --wait
# 5. Run the pipeline + benchmark
lakebench run lakebench.yaml
# 6. View results
lakebench report
# 7. Tear down
lakebench destroy lakebench.yaml
Deploy and generate will prompt for confirmation. Add --yes to skip (e.g. lakebench deploy lakebench.yaml --yes).
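For unattended runs, steps 2 through 7 can be driven from a small script. This is a sketch only: `quickstart_commands` and `run_steps` are hypothetical helpers, the commands and the --yes and --wait flags are the ones shown above, and step 1 is left out because it is interactive. It assumes that only deploy and generate prompt for confirmation, as stated above.

```python
import shlex
import subprocess


def quickstart_commands(config="lakebench.yaml", assume_yes=False):
    """Return Quick Start steps 2-7 as argv lists, in order."""
    # --yes is added only to the two commands documented as prompting.
    confirm = ["--yes"] if assume_yes else []
    return [
        ["lakebench", "validate", config],
        ["lakebench", "deploy", config, *confirm],
        ["lakebench", "generate", config, "--wait", *confirm],
        ["lakebench", "run", config],
        ["lakebench", "report"],
        ["lakebench", "destroy", config],
    ]


def run_steps(commands, dry_run=True):
    """Print each command; also execute it when dry_run is False.

    check=True raises on the first failing step, so later steps
    (including destroy) do not run after a failure.
    """
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

`run_steps(quickstart_commands(assume_yes=True))` prints the sequence without executing it; pass `dry_run=False` to run it. Because a failure stops the sequence before destroy, a failed run leaves resources in place for inspection and needs a manual `lakebench destroy`.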
Commands
| Command | Description |
|---|---|
| lakebench init | Generate a starter configuration file |
| lakebench validate <config> | Validate config and test connectivity |
| lakebench info <config> | Show configuration summary |
| lakebench recommend | Recommend cluster sizing for a scale factor |
| lakebench deploy <config> | Deploy all infrastructure |
| lakebench generate <config> | Generate synthetic data to bronze bucket |
| lakebench run <config> | Execute the medallion pipeline with metrics |
| lakebench benchmark <config> | Run 8-query benchmark against the active engine |
| lakebench query <config> | Execute SQL queries against the active engine |
| lakebench status [config] | Show deployment status |
| lakebench logs <component> [config] | Stream logs from a component |
| lakebench report | Generate HTML benchmark report |
| lakebench destroy <config> | Tear down all resources |
How It Works
Lakebench deploys a three-layer stack on Kubernetes:
- Platform -- Kubernetes namespace, S3 secrets, PostgreSQL (metadata store)
- Data architecture -- catalog (Hive or Polaris), table format (Iceberg or Delta), query engine (Trino, Spark Thrift, or DuckDB), all wired together via recipes
- Observability -- optional Prometheus + Grafana stack for platform metrics
Once deployed, the pipeline runs three Spark jobs in sequence:
Raw Parquet (S3) --> Bronze (validate, deduplicate)
--> Silver (normalize, enrich -- Iceberg table)
--> Gold (aggregate -- Iceberg table)
--> Benchmark (8 analytical queries via query engine)
The benchmark produces an HTML report with query latencies, throughput scores, and optional platform metrics (CPU, memory, S3 I/O per pod). See the Architecture doc for the full picture.
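To make the bronze, silver, and gold roles concrete, here is a toy in-memory illustration of the same three-stage shape. It is not Lakebench's Spark code: the record fields (`id`, `region`, `amount`) and the specific validation, enrichment, and aggregation rules are invented for the example.

```python
from collections import defaultdict


def bronze(raw_rows):
    """Validate and deduplicate: drop rows missing required fields, keep the first row per id."""
    seen, out = set(), []
    for row in raw_rows:
        if row.get("id") is None or row.get("amount") is None:
            continue  # fails validation
        if row["id"] in seen:
            continue  # duplicate of an earlier row
        seen.add(row["id"])
        out.append(row)
    return out


def silver(bronze_rows):
    """Normalize and enrich: canonical region casing plus a derived flag."""
    return [
        {
            **row,
            "region": (row.get("region") or "").strip().lower(),
            "is_large": row["amount"] >= 100,
        }
        for row in bronze_rows
    ]


def gold(silver_rows):
    """Aggregate: total amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


raw = [
    {"id": 1, "region": " EU ", "amount": 120.0},
    {"id": 1, "region": " EU ", "amount": 120.0},  # duplicate id
    {"id": 2, "region": "us", "amount": None},     # missing amount
    {"id": 3, "region": "US", "amount": 30.0},
]
summary = gold(silver(bronze(raw)))  # {"eu": 120.0, "us": 30.0}
```

Each stage consumes only the previous stage's output, which is why the jobs run in sequence and why the benchmark queries the final tables.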
Component Versions
| Component | Version |
|---|---|
| Apache Spark | 3.5.4 |
| Spark Operator | 2.4.0 (Kubeflow) |
| Apache Iceberg | 1.10.1 |
| Delta Lake | 3.0.0 |
| Hive Metastore | 3.1.3 (Stackable 25.7.0) |
| Apache Polaris | 1.3.0-incubating |
| Trino | 479 |
| DuckDB | bundled (Python 3.11) |
| PostgreSQL | 17 |
All versions are configurable. See Supported Components for the full matrix of components, recipes, and override options.
Documentation
Full documentation is in the docs/ directory:
- Getting Started -- prerequisites, install, first deployment
- Configuration -- full YAML reference
- CLI Reference -- all commands and flags
- Recipes -- supported component combinations
- Supported Components -- versions, images, and recipe matrix
- Deployment -- deploy lifecycle and status checks
- Data Generation -- scale factors, parallelism, and monitoring
- Running Pipelines -- batch and streaming modes
- Benchmarking -- query suite and scoring
- Polaris Quick Start -- use Apache Polaris instead of Hive
- Architecture -- system design and component layers
- Troubleshooting -- common errors and fixes
License
Apache 2.0
File details
Details for the file lakebench_k8s-1.0.3.tar.gz.
File metadata
- Download URL: lakebench_k8s-1.0.3.tar.gz
- Upload date:
- Size: 276.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 57b3c852fc28033da5da9276cd600bf1c47a52ae131fc5e1f8d2f2524b498acc |
| MD5 | 40e3a6e7d28892df0174f94a07d92c93 |
| BLAKE2b-256 | 5e91cdcca39396c3111183c3bb48c2dccaa2c2dbc174f60bfd70f65406973809 |
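A downloaded file can be checked against the published SHA256 digest with the standard library. This is a generic sketch, not a Lakebench feature; `sha256_of_file` and `matches_published_digest` are hypothetical helper names.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For the source distribution, that would mean calling `matches_published_digest("lakebench_k8s-1.0.3.tar.gz", "57b3c852fc28033da5da9276cd600bf1c47a52ae131fc5e1f8d2f2524b498acc")` on the downloaded file.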
Provenance
The following attestation bundles were made for lakebench_k8s-1.0.3.tar.gz:
Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakebench_k8s-1.0.3.tar.gz
- Subject digest: 57b3c852fc28033da5da9276cd600bf1c47a52ae131fc5e1f8d2f2524b498acc
- Sigstore transparency entry: 962188757
- Sigstore integration time:
- Permalink: PureStorage-OpenConnect/lakebench-k8s@d9d11d3ddb9f6287894df3deeac843a79b1cc0a6
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/PureStorage-OpenConnect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d9d11d3ddb9f6287894df3deeac843a79b1cc0a6
- Trigger Event: push
File details
Details for the file lakebench_k8s-1.0.3-py3-none-any.whl.
File metadata
- Download URL: lakebench_k8s-1.0.3-py3-none-any.whl
- Upload date:
- Size: 231.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f19f537080680aa14b6e72e3cfa44e217837cb625e03a3d50e9787ec2573fc1d |
| MD5 | 8f78cb36744745818a514f77c6f02b62 |
| BLAKE2b-256 | 0282cfc0582558fb91d23e6abc779d9670b114d77809672e6985b7e51a3f7cb1 |
Provenance
The following attestation bundles were made for lakebench_k8s-1.0.3-py3-none-any.whl:
Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakebench_k8s-1.0.3-py3-none-any.whl
- Subject digest: f19f537080680aa14b6e72e3cfa44e217837cb625e03a3d50e9787ec2573fc1d
- Sigstore transparency entry: 962188758
- Sigstore integration time:
- Permalink: PureStorage-OpenConnect/lakebench-k8s@d9d11d3ddb9f6287894df3deeac843a79b1cc0a6
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/PureStorage-OpenConnect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d9d11d3ddb9f6287894df3deeac843a79b1cc0a6
- Trigger Event: push