Deploy and benchmark lakehouse stacks on Kubernetes

These details have been verified by PyPI

Maintainers

sillidata

These details have not been verified by PyPI

Project description

Lakebench

CLI tool for deploying and benchmarking lakehouse architectures on Kubernetes.

Note: This package is published as lakebench-k8s on PyPI. Install with pip install lakebench-k8s. The CLI command is lakebench.

Choosing between Hive and Polaris, sizing Spark for 100 GB vs 10 TB, or comparing batch and continuous pipelines shouldn't require weeks of manual setup. Lakebench deploys a complete lakehouse stack from a single YAML file, generates realistic data at any scale, runs the pipeline, benchmarks query performance, and tears everything down -- so you can focus on comparing architectures, not plumbing.

Installation

pip install lakebench-k8s

Or with pipx: pipx install lakebench-k8s

Pre-built binaries (no Python required) are available on GitHub Releases.

Prerequisites

Python 3.10+
kubectl and helm on PATH
Kubernetes cluster (1.26+, minimum 8 CPU / 32 GB RAM for scale 1)
S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
Kubeflow Spark Operator 2.4.0+ (or set spark.operator.install: true in config to auto-install)
Stackable Hive Operator if using a Hive recipe (the default). Not needed for Polaris recipes.

See Getting Started for detailed install instructions.
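
The local prerequisites listed above can be checked before a first deployment. The following is a minimal sketch and not part of Lakebench: the function name missing_prerequisites is hypothetical, and it only covers the Python version and whether kubectl and helm resolve on PATH. It does not check the cluster version, object storage, or the operators.

```python
import shutil
import sys

# Tools the prerequisites list says must be on PATH.
REQUIRED_TOOLS = ("kubectl", "helm")
# Minimum Python version from the prerequisites list.
MIN_PYTHON = (3, 10)


def missing_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a list of problems found; an empty list means the local checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    issues = missing_prerequisites()
    for issue in issues:
        print(f"MISSING: {issue}")
    print("local prerequisites OK" if not issues else f"{len(issues)} problem(s) found")
```

An empty result only means the local tooling is present; lakebench validate remains the documented way to test the config and cluster connectivity.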

Quick Start

A quick-recipe selects the catalog + table format + query engine combination in one line (e.g. recipe: hive-iceberg-spark-trino). The scale factor controls data volume: 1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB. Every recipe default can be overridden individually -- see the Configuration Reference for advanced configuration options.
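
The three documented points (1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB) suggest roughly 10 GB per unit of scale factor. The sketch below assumes that relationship stays linear outside those points, which the text above does not state, and uses decimal units (1 TB = 1000 GB). The function names are hypothetical and are not part of the Lakebench CLI.

```python
import math

# Approximate data volume per unit of scale factor, read off the
# documented points. Linear scaling beyond them is an assumption.
GB_PER_SCALE_UNIT = 10


def approx_volume_gb(scale):
    """Estimate generated data volume in GB for a given scale factor."""
    if scale <= 0:
        raise ValueError("scale factor must be positive")
    return scale * GB_PER_SCALE_UNIT


def scale_for_volume_gb(target_gb):
    """Smallest whole scale factor whose estimated volume reaches target_gb."""
    if target_gb <= 0:
        raise ValueError("target volume must be positive")
    return max(1, math.ceil(target_gb / GB_PER_SCALE_UNIT))
```

Under this assumption, a 10 TB target corresponds to a scale factor of about 1000. The lakebench recommend command is the documented way to size a cluster for a given scale factor.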

# 1. Generate config (interactive prompts for S3 details)
lakebench init --interactive

# 2. Validate config and cluster connectivity
lakebench validate lakebench.yaml

# 3. Deploy infrastructure
lakebench deploy lakebench.yaml

# 4. Generate test data
lakebench generate lakebench.yaml --wait

# 5. Run the pipeline + benchmark
lakebench run lakebench.yaml

# 6. View results
lakebench report

# 7. Tear down
lakebench destroy lakebench.yaml

The deploy and generate commands prompt for confirmation. Add --yes to skip the prompt (e.g. lakebench deploy lakebench.yaml --yes).
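
For unattended runs, steps 2 through 7 can be driven from a small script. This is a sketch under stated assumptions rather than a supported wrapper: it uses only the commands and flags shown above, assumes that --wait and --yes can be combined on generate, and assumes that lakebench exits with a non-zero status on failure. Whether destroy asks for confirmation is not stated here, so no flag is added to it. The helper names are hypothetical.

```python
import subprocess


def quickstart_commands(config="lakebench.yaml", assume_yes=True):
    """Build Quick Start steps 2-7 as argv lists, in order."""
    yes = ["--yes"] if assume_yes else []
    return [
        ["lakebench", "validate", config],
        ["lakebench", "deploy", config, *yes],
        # Assumption: --wait and --yes may be passed together.
        ["lakebench", "generate", config, "--wait", *yes],
        ["lakebench", "run", config],
        ["lakebench", "report"],
        ["lakebench", "destroy", config],
    ]


def run_steps(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for argv in commands:
        print("$ " + " ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit,
            # which stops the sequence at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default: prints the commands without touching a cluster.
    run_steps(quickstart_commands(), dry_run=True)
```

Because the sequence stops at the first failure, a failed run leaves the deployment in place; run lakebench destroy separately if the stack should be removed afterwards.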

Commands

Command	Description
`lakebench init`	Generate a starter configuration file
`lakebench validate <config>`	Validate config and test connectivity
`lakebench info <config>`	Show configuration summary
`lakebench recommend`	Recommend cluster sizing for a scale factor
`lakebench deploy <config>`	Deploy all infrastructure
`lakebench generate <config>`	Generate synthetic data to bronze bucket
`lakebench run <config>`	Execute the medallion pipeline with metrics
`lakebench benchmark <config>`	Run 8-query benchmark against the active engine
`lakebench query <config>`	Execute SQL queries against the active engine
`lakebench status [config]`	Show deployment status
`lakebench logs <component> [config]`	Stream logs from a component
`lakebench report`	Generate HTML benchmark report
`lakebench destroy <config>`	Tear down all resources

How It Works

Lakebench deploys a three-layer stack on Kubernetes:

Platform -- Kubernetes namespace, S3 secrets, PostgreSQL (metadata store)
Data architecture -- catalog (Hive or Polaris), table format (Iceberg), query engine (Trino, Spark Thrift, or DuckDB), all wired together via recipes
Observability -- optional Prometheus + Grafana stack for platform metrics

Once deployed, the pipeline runs three Spark jobs in sequence:

Raw Parquet (S3)  -->  Bronze (validate, deduplicate)
                  -->  Silver (normalize, enrich -- Iceberg table)
                  -->  Gold (aggregate -- Iceberg table)
                  -->  Benchmark (8 analytical queries via query engine)

The benchmark produces an HTML report with query latencies, throughput scores, and optional platform metrics (CPU, memory, S3 I/O per pod). See the Architecture doc for the full picture.
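
The stage ordering described above can be written down as a small data model, which is handy when labelling per-stage metrics in your own tooling. This is an illustrative sketch only: the Stage class and helper functions are hypothetical and do not reflect how Lakebench represents the pipeline internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    work: str


SOURCE = "Raw Parquet (S3)"
# The three Spark jobs, in execution order.
SPARK_JOBS = (
    Stage("Bronze", "validate, deduplicate"),
    Stage("Silver", "normalize, enrich"),
    Stage("Gold", "aggregate"),
)
FINAL_STEP = "Benchmark"


def describe_flow():
    """Return the end-to-end flow as a single arrow-separated string."""
    names = [SOURCE] + [stage.name for stage in SPARK_JOBS] + [FINAL_STEP]
    return " --> ".join(names)


def upstream_of(stage_name):
    """Return the name of the layer that a given Spark job reads from."""
    names = [SOURCE] + [stage.name for stage in SPARK_JOBS]
    index = names.index(stage_name)  # raises ValueError for unknown names
    if index == 0:
        raise ValueError(f"{stage_name} has no upstream layer")
    return names[index - 1]
```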

Component Versions

Component	Version
Apache Spark	3.5.4
Spark Operator	2.4.0 (Kubeflow)
Apache Iceberg	1.10.1
Hive Metastore	3.1.3 (Stackable 25.7.0)
Apache Polaris	1.3.0-incubating
Trino	479
DuckDB	bundled (Python 3.11)
PostgreSQL	17

All versions are configurable. See Supported Components for the full matrix of components, recipes, and override options.

Documentation

Full documentation is in the docs/ directory:

Getting Started -- prerequisites, install, first deployment
Configuration -- full YAML reference
CLI Reference -- all commands and flags
Recipes -- supported component combinations
Supported Components -- versions, images, and recipe matrix
Deployment -- deploy lifecycle and status checks
Data Generation -- scale factors, parallelism, and monitoring
Running Pipelines -- batch and streaming modes
Scoring and Benchmarking -- pipeline scorecard and query engine benchmark
Polaris Quick Start -- use Apache Polaris instead of Hive
Architecture -- system design and component layers
Troubleshooting -- common errors and fixes

License

Apache 2.0

Release history

Version	Released
1.3.1	Apr 5, 2026
1.3.0	Apr 1, 2026
1.2.0	Mar 27, 2026
1.1.0	Mar 9, 2026
1.0.12	Mar 5, 2026
1.0.11	Mar 3, 2026
1.0.10	Mar 3, 2026
1.0.9	Feb 23, 2026
1.0.8	Feb 22, 2026
1.0.7	Feb 21, 2026
1.0.6	Feb 21, 2026
1.0.5 (this version)	Feb 21, 2026
1.0.3	Feb 18, 2026
1.0.2	Feb 16, 2026
1.0.1	Feb 13, 2026
1.0.0	Feb 13, 2026

Download files

Source Distribution

lakebench_k8s-1.0.5.tar.gz (304.4 kB)

Uploaded Feb 21, 2026 Source

Built Distribution

lakebench_k8s-1.0.5-py3-none-any.whl (255.8 kB)

Uploaded Feb 21, 2026 Python 3

File details

Details for the file lakebench_k8s-1.0.5.tar.gz.

File metadata

Download URL: lakebench_k8s-1.0.5.tar.gz
Upload date: Feb 21, 2026
Size: 304.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakebench_k8s-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`9047f440155804d7c0f7ccb5e57c292fb26d89588868c1c8550f5d91ca067779`
MD5	`389293e2a2fd6f4de0cac207e182cc13`
BLAKE2b-256	`2ceaa1f15f5415b2856c4c1b13b491959341c495a56791885e883f9ef1045e77`
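
A downloaded file can be checked against the SHA256 digest in the table above. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest listed above for lakebench_k8s-1.0.5.tar.gz.
EXPECTED_SHA256 = "9047f440155804d7c0f7ccb5e57c292fb26d89588868c1c8550f5d91ca067779"


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (assumes the sdist was downloaded into the current directory):
#   matches_sha256("lakebench_k8s-1.0.5.tar.gz", EXPECTED_SHA256)
```

pip can enforce the same check at install time through its hash-checking mode, using a requirements file that pins the digest.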

Provenance

The following attestation bundles were made for lakebench_k8s-1.0.5.tar.gz:

Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakebench_k8s-1.0.5.tar.gz
- Subject digest: 9047f440155804d7c0f7ccb5e57c292fb26d89588868c1c8550f5d91ca067779
- Sigstore transparency entry: 975610108
- Sigstore integration time: Feb 21, 2026
Source repository:
- Permalink: PureStorage-OpenConnect/lakebench-k8s@a892027d14e93a42211c2e704b9b6acdf02df1f7
- Branch / Tag: refs/tags/v1.0.4
- Owner: https://github.com/PureStorage-OpenConnect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a892027d14e93a42211c2e704b9b6acdf02df1f7
- Trigger Event: push

File details

Details for the file lakebench_k8s-1.0.5-py3-none-any.whl.

File metadata

Download URL: lakebench_k8s-1.0.5-py3-none-any.whl
Upload date: Feb 21, 2026
Size: 255.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakebench_k8s-1.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`922ad40da6e3d08fd7d4ca094fd4a41d2ddc4a93dfae113b261e5e69ee194d33`
MD5	`b3dce0f6fd246818a31ad8ee89056fb4`
BLAKE2b-256	`f9f64464e7a84d1b15db8dfb5517cb795fbecaf26b9b7f71fb31a426ab59717a`

Provenance

The following attestation bundles were made for lakebench_k8s-1.0.5-py3-none-any.whl:

Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakebench_k8s-1.0.5-py3-none-any.whl
- Subject digest: 922ad40da6e3d08fd7d4ca094fd4a41d2ddc4a93dfae113b261e5e69ee194d33
- Sigstore transparency entry: 975610111
- Sigstore integration time: Feb 21, 2026
Source repository:
- Permalink: PureStorage-OpenConnect/lakebench-k8s@a892027d14e93a42211c2e704b9b6acdf02df1f7
- Branch / Tag: refs/tags/v1.0.4
- Owner: https://github.com/PureStorage-OpenConnect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a892027d14e93a42211c2e704b9b6acdf02df1f7
- Trigger Event: push
