Deploy and benchmark lakehouse stacks on Kubernetes

Lakebench

Requires Python 3.10+. Licensed under Apache 2.0.

A/B testing for lakehouse architectures on Kubernetes.

Deploy a complete lakehouse stack from a single YAML, run a medallion pipeline at any scale, and get a scorecard you can compare across configurations.

Why Lakebench?

  • Compare stacks. Swap catalogs (Hive, Polaris), query engines (Trino, Spark Thrift, DuckDB), and table formats while keeping the data and queries the same, then compare the resulting scorecards side by side.
  • Test at scale. Run the same workload at 10 GB, 100 GB, and 1 TB to find where throughput plateaus or resources saturate on your hardware.
  • Measure freshness. Sustained mode streams data through the pipeline and benchmarks query performance while ingest is still running.

Quick Start

pip install lakebench-k8s

Pre-built binaries (no Python required) are available on GitHub Releases.

lakebench init --interactive             # generate config with S3 prompts
lakebench validate lakebench.yaml        # check config + cluster connectivity
lakebench deploy lakebench.yaml          # deploy the stack
lakebench run lakebench.yaml --generate  # generate data + run pipeline + benchmark
lakebench report                         # view HTML scorecard
lakebench destroy lakebench.yaml         # tear down everything
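
For repeated runs, the non-interactive steps above can be wrapped in a small script. The following is a minimal sketch, not part of Lakebench: it uses only the commands and the one flag shown in the quick start, the helper names are hypothetical, and it defaults to a dry run that prints the commands without executing them. It leaves out init (interactive) and destroy (destructive) on purpose.

```python
import shlex
import subprocess


def quickstart_commands(config: str = "lakebench.yaml") -> list[list[str]]:
    """Return the non-interactive quick-start steps, in order.

    Hypothetical helper: only commands and flags shown in the quick start
    are used. `init` and `destroy` are intentionally excluded.
    """
    return [
        ["lakebench", "validate", config],
        ["lakebench", "deploy", config],
        ["lakebench", "run", config, "--generate"],
        ["lakebench", "report"],
    ]


def run_quickstart(config: str = "lakebench.yaml", dry_run: bool = True) -> None:
    """Print each step; execute it only when dry_run is False."""
    for cmd in quickstart_commands(config):
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, so a failed step
            # stops the sequence instead of continuing to the next one.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_quickstart(dry_run=True)
```

Stopping at the first failed step avoids running the pipeline against a stack that did not validate or deploy cleanly.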

The recipe field selects your architecture in one line. The scale field controls data volume.

# lakebench.yaml (minimal)
deployment_name: my-test
recipe: hive-iceberg-spark-trino   # or polaris-iceberg-spark-duckdb, etc.
scale: 10                          # 1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB
s3:
  endpoint: http://s3.example.com:80
  access_key: ...
  secret_key: ...

Eleven recipes are available. See Recipes for the full list, browse examples/, or run lakebench init --interactive. v1.2 adds Delta Lake support via three new Hive+Delta recipes.
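
The scale comment in the minimal config gives three reference points (1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB). The sketch below assumes the relationship is linear at roughly 10 GB per scale unit and that scale is a positive integer. Both are assumptions drawn from those three points rather than documented behavior, and the function names are illustrative.

```python
def approx_data_volume_gb(scale: int) -> int:
    """Approximate data volume in GB, assuming ~10 GB per scale unit."""
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    return scale * 10


def format_volume(gb: int) -> str:
    """Render a volume as '~N GB', or '~N TB' from 1000 GB upward."""
    if gb >= 1000:
        return f"~{gb / 1000:g} TB"
    return f"~{gb} GB"


# Reproduce the three reference points from the config comment.
for scale in (1, 10, 100):
    print(f"scale {scale}: {format_volume(approx_data_volume_gb(scale))}")
```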

What You Get

After lakebench run completes, the terminal prints a scorecard:

 ─ Pipeline Complete ──────────────────────────────
  bronze-verify         142.0 s
  silver-build          891.0 s
  gold-finalize         234.0 s
  benchmark              87.0 s

  Scores
    Time to Value:        1354.0 s
    Throughput:           0.782 GB/s
    Efficiency:           3.41 GB/core-hr
    Scale:                100.0% verified
    QpH:                  2847.3

  Full report: lakebench report
 ──────────────────────────────────────────────────

lakebench report generates an HTML report with per-query latencies, bottleneck analysis, and optional platform metrics (CPU, memory, S3 I/O per pod).
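
In the sample scorecard above, Time to Value equals the sum of the four stage durations (142.0 + 891.0 + 234.0 + 87.0 = 1354.0 s). Whether Lakebench always computes it this way is an assumption; the sketch below only checks the arithmetic for the sample and shows one way to pull stage timings out of captured terminal output. The regular expression and function name are illustrative, not part of the tool.

```python
import re

# Stage lines copied from the sample scorecard.
SAMPLE = """\
  bronze-verify         142.0 s
  silver-build          891.0 s
  gold-finalize         234.0 s
  benchmark              87.0 s
"""

# A lowercase stage name, whitespace, a number, then the unit "s".
STAGE_LINE = re.compile(r"^\s*([a-z][a-z-]*)\s+([\d.]+) s\s*$")


def stage_durations(text: str) -> dict[str, float]:
    """Map stage name to duration in seconds for each matching line."""
    durations = {}
    for line in text.splitlines():
        match = STAGE_LINE.match(line)
        if match:
            durations[match.group(1)] = float(match.group(2))
    return durations


durations = stage_durations(SAMPLE)
total = sum(durations.values())
slowest = max(durations, key=durations.get)

print(f"sum of stages: {total:.1f} s")
print(f"slowest stage: {slowest}")
```

In the sample, silver-build accounts for most of the elapsed time, so it is the first stage to look at when comparing configurations.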

How It Works

                    ┌────────────────────────────────────┐
                    │           lakebench.yaml           │
                    └────────────┬───────────────────────┘
                                 │
                    ┌────────────▼───────────────────────┐
                    │   deploy (Kubernetes namespace,    │
                    │   S3 secrets, PostgreSQL, catalog, │
                    │   query engine, observability)     │
                    └────────────┬───────────────────────┘
                                 │
     Raw Parquet ──► Bronze (validate) ──► Silver (enrich) ──► Gold (aggregate)
         S3              Spark                Spark               Spark
                                                                    │
                                                        ┌───────────▼──────────┐
                                                        │  8-query benchmark   │
                                                        │  (Trino / DuckDB /   │
                                                        │   Spark Thrift)      │
                                                        └──────────────────────┘

Prerequisites

  • kubectl and helm on PATH
  • Kubernetes 1.26+ (minimum 8 CPU / 32 GB RAM for scale 1)
  • S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
  • Kubeflow Spark Operator 2.4.0+ (or set spark.operator.install: true)
  • Stackable Hive Operator for Hive recipes (not needed for Polaris)
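
The first prerequisite can be checked locally before deploying. This is a minimal sketch using the standard library's shutil.which; the function name is hypothetical, and it checks only that the two binaries are on PATH, not the cluster version, resources, or operators. The source does not say whether lakebench validate performs this check itself.

```python
import shutil

# Command-line tools the prerequisites list requires on PATH.
REQUIRED_TOOLS = ("kubectl", "helm")


def missing_tools(tools=REQUIRED_TOOLS) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("kubectl and helm found on PATH")
```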

Commands

Command     Description
init        Generate a starter config file
validate    Check config and cluster connectivity
info        Show deployment configuration summary
deploy      Deploy all infrastructure components
generate    Generate synthetic data at the configured scale
run         Execute the medallion pipeline and benchmark
benchmark   Run the 8-query benchmark standalone
query       Execute ad-hoc SQL against the active engine
status      Show deployment status
report      Generate HTML scorecard report
recommend   Recommend cluster sizing for a scale factor
destroy     Tear down all deployed resources

See CLI Reference for flags and options.

Component Versions

Component        Version
Apache Spark     3.5.4, 4.0.2
Spark Operator   2.4.0 (Kubeflow)
Apache Iceberg   1.10.1
Delta Lake       4.0.0
Hive Metastore   3.1.3 (Stackable 25.7.0)
Apache Polaris   1.3.0-incubating
Trino            479
DuckDB           bundled (Python 3.11)
PostgreSQL       16, 17, 18

All versions are overridable in the YAML config. See Supported Components.

License

Apache 2.0

