Deploy and benchmark lakehouse stacks on Kubernetes
Lakebench
A/B testing for lakehouse architectures on Kubernetes.
Deploy a complete lakehouse stack from a single YAML, run a medallion pipeline at any scale, and get a scorecard you can compare across configurations.
Why Lakebench?
- Compare stacks. Swap catalogs (Hive, Polaris), query engines (Trino, Spark Thrift, DuckDB), and table formats -- same data, same queries, different architecture. Side-by-side scorecard comparison.
- Test at scale. Run the same workload at 10 GB, 100 GB, and 1 TB to find where throughput plateaus or resources saturate on your hardware.
- Measure freshness. Continuous mode streams data through the pipeline and benchmarks query performance under sustained ingest load.
Quick Start
pip install lakebench-k8s
Pre-built binaries (no Python required) are available on GitHub Releases.
lakebench init --interactive # generate config with S3 prompts
lakebench validate lakebench.yaml # check config + cluster connectivity
lakebench deploy lakebench.yaml # deploy the stack
lakebench run lakebench.yaml --generate # generate data + run pipeline + benchmark
lakebench report # view HTML scorecard
lakebench destroy lakebench.yaml # tear down everything
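For unattended runs, the same sequence can be scripted. The sketch below is one possible wrapper, not part of Lakebench itself: `lifecycle_commands` and `run_lifecycle` are hypothetical helper names, and only the subcommands and the `--generate` flag shown above are used. It assumes `lakebench` is on PATH and that a config already exists (for example from `lakebench init`).

```python
import subprocess
from typing import Callable, List, Sequence


def lifecycle_commands(config: str) -> List[List[str]]:
    """Return the Quick Start steps that run between validate and destroy."""
    return [
        ["lakebench", "deploy", config],
        ["lakebench", "run", config, "--generate"],
        ["lakebench", "report"],
    ]


def _run_checked(cmd: Sequence[str]) -> None:
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(list(cmd), check=True)


def run_lifecycle(
    config: str,
    runner: Callable[[Sequence[str]], None] = _run_checked,
) -> None:
    """Validate, then deploy -> run -> report, and always tear down."""
    # Nothing is deployed yet, so a validation failure needs no cleanup.
    runner(["lakebench", "validate", config])
    try:
        for cmd in lifecycle_commands(config):
            runner(cmd)
    finally:
        # Teardown is attempted even if deploy, run, or report raised.
        runner(["lakebench", "destroy", config])
```

Calling `run_lifecycle("lakebench.yaml")` executes the steps in order. The `runner` parameter exists so the sequence can be dry-run or logged without touching a cluster.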
The recipe field selects your architecture in one line. The scale field
controls data volume.
# lakebench.yaml (minimal)
deployment_name: my-test
recipe: hive-iceberg-spark-trino # or polaris-iceberg-spark-duckdb, etc.
scale: 10 # 1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB
s3:
  endpoint: http://s3.example.com:80
  access_key: ...
  secret_key: ...
Eight recipes are available -- see Recipes for the full list.
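Because the config fully describes a stack, an A/B comparison amounts to writing configs that differ in a single field. The sketch below renders the minimal config for several recipes and scales. `render_config` and `variants` are hypothetical helpers, the `ab-` naming scheme is an arbitrary choice, and `approx_data_gb` only encodes the approximate rule from the comment above (scale 1 is roughly 10 GB). Credentials are left as placeholders.

```python
from itertools import product
from typing import Dict, Iterable

# Mirrors the minimal config; no fields beyond those are assumed.
TEMPLATE = """\
deployment_name: {name}
recipe: {recipe}
scale: {scale}
s3:
  endpoint: {endpoint}
  access_key: {access_key}
  secret_key: {secret_key}
"""


def approx_data_gb(scale: int) -> int:
    """Approximate data volume: scale 1 is ~10 GB, scale 100 is ~1 TB."""
    return scale * 10


def render_config(name: str, recipe: str, scale: int, endpoint: str,
                  access_key: str = "...", secret_key: str = "...") -> str:
    """Fill the template; credentials default to placeholders."""
    return TEMPLATE.format(name=name, recipe=recipe, scale=scale,
                           endpoint=endpoint, access_key=access_key,
                           secret_key=secret_key)


def variants(recipes: Iterable[str], scales: Iterable[int],
             endpoint: str) -> Dict[str, str]:
    """Map an output filename to config text for every recipe/scale pair."""
    configs = {}
    for recipe, scale in product(recipes, scales):
        name = f"ab-{recipe}-s{scale}"  # hypothetical naming scheme
        configs[f"{name}.yaml"] = render_config(name, recipe, scale, endpoint)
    return configs
```

Writing each entry to disk and running the Quick Start commands against it yields one scorecard per variant.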
What You Get
After lakebench run completes, the terminal prints a scorecard:
─ Pipeline Complete ──────────────────────────────
bronze-verify 142.0 s
silver-build 891.0 s
gold-finalize 234.0 s
benchmark 87.0 s
Scores
Time to Value: 1354.0 s
Throughput: 0.782 GB/s
Efficiency: 3.41 GB/core-hr
Scale: 100.0% verified
QpH: 2847.3
Full report: lakebench report
──────────────────────────────────────────────────
lakebench report generates an HTML report with per-query latencies,
bottleneck analysis, and optional platform metrics (CPU, memory, S3 I/O per
pod).
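To compare runs programmatically, the terminal scorecard can be parsed into numbers. The sketch below is a hypothetical helper, not a Lakebench API, and it assumes the exact text layout of the sample above: stage lines ending in `s`, and score lines of the form `Label: value`. In that sample, Time to Value equals the sum of the four stage durations; the final check relies on that observation, which may not hold as a general definition.

```python
import re
from typing import Dict, Tuple

# Stage lines look like "silver-build 891.0 s".
STAGE_RE = re.compile(r"^\s*([a-z][a-z-]*)\s+([\d.]+)\s*s\s*$")
# Score lines look like "Throughput: 0.782 GB/s"; units are ignored.
SCORE_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?):\s*([\d.]+)")

SAMPLE = """\
bronze-verify 142.0 s
silver-build 891.0 s
gold-finalize 234.0 s
benchmark 87.0 s
Scores
Time to Value: 1354.0 s
Throughput: 0.782 GB/s
Efficiency: 3.41 GB/core-hr
Scale: 100.0% verified
QpH: 2847.3
Full report: lakebench report
"""


def parse_scorecard(text: str) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (stage durations in seconds, named scores) from scorecard text."""
    stages: Dict[str, float] = {}
    scores: Dict[str, float] = {}
    for line in text.splitlines():
        stage = STAGE_RE.match(line)
        if stage:
            stages[stage.group(1)] = float(stage.group(2))
            continue
        score = SCORE_RE.match(line)
        if score:
            scores[score.group(1).strip()] = float(score.group(2))
    return stages, scores


stages, scores = parse_scorecard(SAMPLE)
# Holds for the sample scorecard; treat it as an observation, not a spec.
assert sum(stages.values()) == scores["Time to Value"]
```

Lines without a numeric value, such as the `Full report` hint, are skipped, so the result contains only metrics.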
How It Works
                   ┌──────────────────────────────────┐
                   │          lakebench.yaml          │
                   └────────────┬─────────────────────┘
                                │
                   ┌────────────▼─────────────────────┐
                   │ deploy (Kubernetes namespace,    │
                   │ S3 secrets, PostgreSQL, catalog, │
                   │ query engine, observability)     │
                   └────────────┬─────────────────────┘
                                │
Raw Parquet ──► Bronze (validate) ──► Silver (enrich) ──► Gold (aggregate)
    S3                Spark                Spark               Spark
                                                                 │
                                                     ┌───────────▼──────────┐
                                                     │ 8-query benchmark    │
                                                     │ (Trino / DuckDB /    │
                                                     │ Spark Thrift)        │
                                                     └──────────────────────┘
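The Bronze, Silver, and Gold stages follow the usual medallion pattern: validate raw records, enrich them, then aggregate. The toy below illustrates that flow on in-memory Python dictionaries. It is not Lakebench's Spark code, and the record fields (`customer`, `amount`) and the `tier` enrichment rule are invented purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Record = Dict[str, object]


def bronze(raw: Iterable[Record]) -> List[Record]:
    """Validate: keep records with a customer and a non-negative numeric amount."""
    return [
        r for r in raw
        if r.get("customer")
        and isinstance(r.get("amount"), (int, float))
        and r["amount"] >= 0
    ]


def silver(rows: Iterable[Record]) -> List[Record]:
    """Enrich: add a derived column without altering validated fields."""
    return [
        {**r, "tier": "large" if r["amount"] >= 100 else "small"}
        for r in rows
    ]


def gold(rows: Iterable[Record]) -> Dict[str, float]:
    """Aggregate: total amount per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


raw = [
    {"customer": "a", "amount": 50},
    {"customer": "a", "amount": 150},
    {"customer": "", "amount": 10},    # rejected: no customer
    {"customer": "b", "amount": "x"},  # rejected: non-numeric amount
    {"customer": "b", "amount": 20},
]
summary = gold(silver(bronze(raw)))
```

Each stage consumes only the previous stage's output, which is what lets the same workload run unchanged when the catalog or query engine underneath is swapped.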
Prerequisites
- kubectl and helm on PATH
- Kubernetes 1.26+ (minimum 8 CPU / 32 GB RAM for scale 1)
- S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
- Kubeflow Spark Operator 2.4.0+ (or set spark.operator.install: true)
- Stackable Hive Operator for Hive recipes (not needed for Polaris)
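The local items on this list can be checked before touching the cluster. The sketch below is a hypothetical preflight, separate from `lakebench validate`: `missing_tools` looks for the two binaries on PATH using the standard library, and `meets_scale1_minimum` encodes only the scale 1 minimum stated above. How the minimum grows with scale is not stated here, so larger scales are out of scope for this check.

```python
import shutil
from typing import Callable, Iterable, List, Optional

REQUIRED_TOOLS = ("kubectl", "helm")

# Minimum stated for scale 1 only.
SCALE1_MIN_CPUS = 8
SCALE1_MIN_MEMORY_GB = 32


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def meets_scale1_minimum(cpus: int, memory_gb: float) -> bool:
    """Check cluster totals against the scale 1 minimum (8 CPU / 32 GB RAM)."""
    return cpus >= SCALE1_MIN_CPUS and memory_gb >= SCALE1_MIN_MEMORY_GB
```

An empty list from `missing_tools()` means both binaries were found. Operator and storage prerequisites still need to be verified against the cluster itself.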
Commands
| Command | Description |
|---|---|
| init | Generate a starter config file |
| validate | Check config and cluster connectivity |
| info | Show deployment configuration summary |
| deploy | Deploy all infrastructure components |
| generate | Generate synthetic data at the configured scale |
| run | Execute the medallion pipeline and benchmark |
| benchmark | Run the 8-query benchmark standalone |
| query | Execute ad-hoc SQL against the active engine |
| status | Show deployment status |
| report | Generate HTML scorecard report |
| recommend | Recommend cluster sizing for a scale factor |
| destroy | Tear down all deployed resources |
See CLI Reference for flags and options.
Component Versions
| Component | Version |
|---|---|
| Apache Spark | 3.5.4 |
| Spark Operator | 2.4.0 (Kubeflow) |
| Apache Iceberg | 1.10.1 |
| Hive Metastore | 3.1.3 (Stackable 25.7.0) |
| Apache Polaris | 1.3.0-incubating |
| Trino | 479 |
| DuckDB | bundled (Python 3.11) |
| PostgreSQL | 17 |
All versions are overridable in the YAML config. See Supported Components.
Documentation
- Getting Started -- prerequisites, install, first run
- Configuration -- full YAML reference
- Recipes -- catalog + format + engine combinations
- Running Pipelines -- batch and continuous modes
- Benchmarking -- scorecard and query benchmark
- Architecture -- system design
- Troubleshooting -- common errors and fixes
License
Apache 2.0