Deploy and benchmark lakehouse stacks on Kubernetes

Lakebench

CLI tool for deploying and benchmarking lakehouse architectures on Kubernetes.

Note: This package is published as lakebench-k8s on PyPI. Install with pip install lakebench-k8s. The CLI command is lakebench.

Choosing between Hive and Polaris, sizing Spark for 100 GB vs 10 TB, or comparing batch and continuous pipelines shouldn't require weeks of manual setup. Lakebench deploys a complete lakehouse stack from a single YAML file, generates realistic data at any scale, runs the pipeline, benchmarks query performance, and tears everything down, so you can focus on comparing architectures, not plumbing.

Installation

pip install lakebench-k8s

Or with pipx: pipx install lakebench-k8s

Pre-built binaries (no Python required) are available on GitHub Releases.

Prerequisites

  • Python 3.10+
  • kubectl and helm on PATH
  • Kubernetes cluster (1.26+, minimum 8 CPU / 32 GB RAM for scale 1)
  • S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
  • Kubeflow Spark Operator 2.4.0+ (or set spark.operator.install: true in config to auto-install)
  • Stackable Hive Operator if using a Hive recipe (the default). Not needed for Polaris recipes.
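
The local requirements in the list above (Python version, kubectl and helm on PATH) can be checked before installing. The following is a minimal sketch and not part of Lakebench: the function names missing_tools and python_ok are hypothetical, and the check covers only the interpreter version and the two binaries, not the cluster, object storage, or operators.

```python
import shutil
import sys


def missing_tools(tools=("kubectl", "helm")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def python_ok(version=None, minimum=(3, 10)):
    """Return True if `version` (default: the running interpreter) meets `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor) so patch releases do not matter.
    return tuple(version[:2]) >= minimum
```

An empty list from missing_tools() together with python_ok() returning True means the local prerequisites are likely in place; lakebench validate remains the authoritative check for config and cluster connectivity.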

See Getting Started for detailed install instructions.

Quick Start

A quick-recipe selects the catalog + table format + query engine combination in one line (e.g. recipe: hive-iceberg-spark-trino). The scale factor controls data volume: 1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB. Every recipe default can be overridden individually -- see the Configuration Reference for advanced configuration options.
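
The three scale points quoted above (1, 10, 100) all work out to roughly 10 GB per unit of scale. The sketch below assumes that this relationship stays linear for other scale factors, which the text above does not state explicitly; the function names are hypothetical and the numbers are estimates only.

```python
def approx_data_volume_gb(scale):
    """Rough data volume in GB, ASSUMING ~10 GB per unit of scale factor."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return 10.0 * scale


def format_volume(gb):
    """Render a GB figure the way the README does, e.g. '~100 GB' or '~1 TB'."""
    if gb >= 1000:
        return f"~{gb / 1000:g} TB"
    return f"~{gb:g} GB"
```

For example, format_volume(approx_data_volume_gb(100)) gives "~1 TB", matching the figure quoted for scale 100. Use lakebench recommend for actual cluster sizing.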

# 1. Generate config (interactive prompts for S3 details)
lakebench init --interactive

# 2. Validate config and cluster connectivity
lakebench validate lakebench.yaml

# 3. Deploy infrastructure
lakebench deploy lakebench.yaml

# 4. Generate test data
lakebench generate lakebench.yaml --wait

# 5. Run the pipeline + benchmark
lakebench run lakebench.yaml

# 6. View results
lakebench report

# 7. Tear down
lakebench destroy lakebench.yaml

Deploy and generate will prompt for confirmation. Add --yes to skip (e.g. lakebench deploy lakebench.yaml --yes).
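
For unattended runs (CI, for instance), the Quick Start steps can be chained with --yes on the two commands that prompt. The following is a sketch, not something shipped with Lakebench: build_workflow and run_workflow are hypothetical helpers, the config file is assumed to exist already (so lakebench init is skipped), and only flags shown in this README are used.

```python
import subprocess


def build_workflow(config="lakebench.yaml", teardown=False):
    """Return the Quick Start commands as argv lists, in execution order."""
    steps = [
        ["lakebench", "validate", config],
        ["lakebench", "deploy", config, "--yes"],
        ["lakebench", "generate", config, "--wait", "--yes"],
        ["lakebench", "run", config],
        ["lakebench", "report"],
    ]
    if teardown:
        # The README does not say whether destroy prompts, so no --yes is added here.
        steps.append(["lakebench", "destroy", config])
    return steps


def run_workflow(config="lakebench.yaml", teardown=False):
    """Run each step in order; check=True stops at the first non-zero exit code."""
    for argv in build_workflow(config, teardown):
        subprocess.run(argv, check=True)
```

Calling run_workflow("lakebench.yaml") requires the lakebench CLI on PATH. Keeping the command list separate from execution makes it easy to print or review the steps first.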

Commands

Command                                Description
lakebench init                         Generate a starter configuration file
lakebench validate <config>            Validate config and test connectivity
lakebench info <config>                Show configuration summary
lakebench recommend                    Recommend cluster sizing for a scale factor
lakebench deploy <config>              Deploy all infrastructure
lakebench generate <config>            Generate synthetic data to bronze bucket
lakebench run <config>                 Execute the medallion pipeline with metrics
lakebench benchmark <config>           Run 8-query benchmark against the active engine
lakebench query <config>               Execute SQL queries against the active engine
lakebench status [config]              Show deployment status
lakebench logs <component> [config]    Stream logs from a component
lakebench report                       Generate HTML benchmark report
lakebench destroy <config>             Tear down all resources

How It Works

Lakebench deploys a three-layer stack on Kubernetes:

  1. Platform -- Kubernetes namespace, S3 secrets, PostgreSQL (metadata store)
  2. Data architecture -- catalog (Hive or Polaris), table format (Iceberg), query engine (Trino, Spark Thrift, or DuckDB), all wired together via recipes
  3. Observability -- optional Prometheus + Grafana stack for platform metrics

Once deployed, the pipeline runs three Spark jobs in sequence:

Raw Parquet (S3)  -->  Bronze (validate, deduplicate)
                  -->  Silver (normalize, enrich -- Iceberg table)
                  -->  Gold (aggregate -- Iceberg table)
                  -->  Benchmark (8 analytical queries via query engine)

The benchmark produces an HTML report with query latencies, throughput scores, and optional platform metrics (CPU, memory, S3 I/O per pod). See the Architecture doc for the full picture.

Component Versions

Component         Version
Apache Spark      3.5.4
Spark Operator    2.4.0 (Kubeflow)
Apache Iceberg    1.10.1
Hive Metastore    3.1.3 (Stackable 25.7.0)
Apache Polaris    1.3.0-incubating
Trino             479
DuckDB            bundled (Python 3.11)
PostgreSQL        17

All versions are configurable. See Supported Components for the full matrix of components, recipes, and override options.

Documentation

Full documentation is in the docs/ directory.

License

Apache 2.0

Download files

Source Distribution

lakebench_k8s-1.0.5.tar.gz (304.4 kB)

Uploaded Source

Built Distribution

lakebench_k8s-1.0.5-py3-none-any.whl (255.8 kB)

Uploaded Python 3

File details

Details for the file lakebench_k8s-1.0.5.tar.gz.

File metadata

  • Download URL: lakebench_k8s-1.0.5.tar.gz
  • Upload date:
  • Size: 304.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakebench_k8s-1.0.5.tar.gz

Algorithm    Hash digest
SHA256       9047f440155804d7c0f7ccb5e57c292fb26d89588868c1c8550f5d91ca067779
MD5          389293e2a2fd6f4de0cac207e182cc13
BLAKE2b-256  2ceaa1f15f5415b2856c4c1b13b491959341c495a56791885e883f9ef1045e77
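
A downloaded file can be compared against the SHA256 digest listed above using Python's standard hashlib module. This is a minimal sketch: the helper names sha256_of and matches are hypothetical, and the file is assumed to have been downloaded to a local path of your choosing.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 equals `expected_hex` (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Call matches() with the local path of the sdist and the SHA256 value from the table; a False result means the file differs from the published one and should not be installed.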

Provenance

The following attestation bundles were made for lakebench_k8s-1.0.5.tar.gz:

Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lakebench_k8s-1.0.5-py3-none-any.whl.

File metadata

  • Download URL: lakebench_k8s-1.0.5-py3-none-any.whl
  • Upload date:
  • Size: 255.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakebench_k8s-1.0.5-py3-none-any.whl

Algorithm    Hash digest
SHA256       922ad40da6e3d08fd7d4ca094fd4a41d2ddc4a93dfae113b261e5e69ee194d33
MD5          b3dce0f6fd246818a31ad8ee89056fb4
BLAKE2b-256  f9f64464e7a84d1b15db8dfb5517cb795fbecaf26b9b7f71fb31a426ab59717a

Provenance

The following attestation bundles were made for lakebench_k8s-1.0.5-py3-none-any.whl:

Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
