Deploy and benchmark lakehouse stacks on Kubernetes

Maintainers

sillidata

Project description

Lakebench

A/B testing for lakehouse architectures on Kubernetes.

Deploy a complete lakehouse stack from a single YAML, run a medallion pipeline at any scale, and get a scorecard you can compare across configurations.

Why Lakebench?

Compare stacks. Swap catalogs (Hive, Polaris), query engines (Trino, Spark Thrift, DuckDB), and table formats (Iceberg, Delta Lake) -- same data, same queries, different architecture -- then compare the scorecards side by side.
Test at scale. Run the same workload at 10 GB, 100 GB, and 1 TB to find where throughput plateaus or resources saturate on your hardware.
Measure freshness. Sustained mode streams data through the pipeline and benchmarks query performance while ingest is running.

Quick Start

pip install lakebench-k8s

Pre-built binaries (no Python required) are available on GitHub Releases.

lakebench init --interactive             # generate config with S3 prompts
lakebench validate lakebench.yaml        # check config + cluster connectivity
lakebench deploy lakebench.yaml          # deploy the stack
lakebench run lakebench.yaml --generate  # generate data + run pipeline + benchmark
lakebench report                         # view HTML scorecard
lakebench destroy lakebench.yaml         # tear down everything
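The same sequence can be scripted for unattended runs. The following is a minimal sketch, not part of Lakebench: it shells out to the commands listed above in the same order, and the names `WORKFLOW` and `run_workflow` are hypothetical. `init --interactive` is left out because it prompts for input, and `destroy` is left out so that a finished run is not torn down by accident.

```python
import shlex
import subprocess

CONFIG = "lakebench.yaml"

# Commands copied from the Quick Start listing, in the order shown there.
WORKFLOW = [
    ["lakebench", "validate", CONFIG],
    ["lakebench", "deploy", CONFIG],
    ["lakebench", "run", CONFIG, "--generate"],
    ["lakebench", "report"],
]


def run_workflow(steps=WORKFLOW, dry_run=True):
    """Run each step in order and stop at the first failure.

    With dry_run=True nothing is executed; the commands are only printed.
    Returns the list of commands that were printed or executed.
    """
    done = []
    for cmd in steps:
        print("+", shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a failed validate or deploy stops the sequence.
            subprocess.run(cmd, check=True)
        done.append(cmd)
    return done


if __name__ == "__main__":
    run_workflow(dry_run=True)
```

Starting with `dry_run=True` shows exactly what would be executed before anything touches the cluster.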

The recipe field selects your architecture in one line. The scale field controls data volume.

# lakebench.yaml (minimal)
deployment_name: my-test
recipe: hive-iceberg-spark-trino   # or polaris-iceberg-spark-duckdb, etc.
scale: 10                          # 1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB
s3:
  endpoint: http://s3.example.com:80
  access_key: ...
  secret_key: ...

Eleven recipes are available; v1.2 adds Delta Lake support via three new Hive+Delta recipes. For the full list, see Recipes and examples/, or run lakebench init --interactive.
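Because only recipe and scale change between runs, a comparison matrix can be generated from one template. The sketch below is an illustration under stated assumptions: it uses only the fields from the minimal config above, the two recipe names mentioned there, and a hypothetical `<recipe>-s<scale>` naming scheme for deployment_name. The helper names `approx_gb` and `write_matrix` are not part of Lakebench.

```python
from itertools import product
from pathlib import Path

# Field names mirror the minimal lakebench.yaml; nothing else is assumed.
TEMPLATE = """\
deployment_name: {name}
recipe: {recipe}
scale: {scale}
s3:
  endpoint: {endpoint}
  access_key: {access_key}
  secret_key: {secret_key}
"""


def approx_gb(scale):
    """Rough data volume in GB, following the config comment (1 = ~10 GB)."""
    return 10 * scale


def write_matrix(out_dir, recipes, scales, s3):
    """Write one config per (recipe, scale) pair and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for recipe, scale in product(recipes, scales):
        name = f"{recipe}-s{scale}"  # hypothetical naming scheme
        path = out_dir / f"{name}.yaml"
        path.write_text(
            TEMPLATE.format(name=name, recipe=recipe, scale=scale, **s3)
        )
        paths.append(path)
    return paths
```

Calling `write_matrix("configs", ["hive-iceberg-spark-trino", "polaris-iceberg-spark-duckdb"], [1, 10], s3)` with an `s3` dict holding `endpoint`, `access_key`, and `secret_key` would produce four files, each of which can be passed to lakebench deploy and lakebench run in turn.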

What You Get

After lakebench run completes, the terminal prints a scorecard:

 ─ Pipeline Complete ──────────────────────────────
  bronze-verify         142.0 s
  silver-build          891.0 s
  gold-finalize         234.0 s
  benchmark              87.0 s

  Scores
    Time to Value:        1354.0 s
    Throughput:           0.782 GB/s
    Efficiency:           3.41 GB/core-hr
    Scale:                100.0% verified
    QpH:                  2847.3

  Full report: lakebench report
 ──────────────────────────────────────────────────

lakebench report generates an HTML report with per-query latencies, bottleneck analysis, and optional platform metrics (CPU, memory, S3 I/O per pod).
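For quick comparisons outside the HTML report, the terminal scorecard can be parsed into numbers. The sketch below is hypothetical: it assumes the text layout of the sample output above, which may change between versions, and `parse_scorecard` and `relative_change` are illustrative names rather than Lakebench functions. In the sample, Time to Value equals the sum of the four stage durations (142 + 891 + 234 + 87 = 1354); whether that holds for every run is an assumption.

```python
import re

SAMPLE = """\
  bronze-verify         142.0 s
  silver-build          891.0 s
  gold-finalize         234.0 s
  benchmark              87.0 s

  Scores
    Time to Value:        1354.0 s
    Throughput:           0.782 GB/s
    Efficiency:           3.41 GB/core-hr
    Scale:                100.0% verified
    QpH:                  2847.3
"""

# Stage lines look like "name  <seconds> s"; score lines look like "Label:  <number> ...".
STAGE_RE = re.compile(r"^\s*([a-z][a-z-]*)\s+([\d.]+) s\s*$")
SCORE_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*):\s+([\d.]+)")


def parse_scorecard(text):
    """Return (stages, scores) as dicts of name -> float."""
    stages, scores = {}, {}
    for line in text.splitlines():
        m = SCORE_RE.match(line)
        if m:
            scores[m.group(1)] = float(m.group(2))
            continue
        m = STAGE_RE.match(line)
        if m:
            stages[m.group(1)] = float(m.group(2))
    return stages, scores


def relative_change(baseline, candidate):
    """Fractional change per shared score: (candidate - baseline) / baseline."""
    shared = baseline.keys() & candidate.keys()
    return {k: (candidate[k] - baseline[k]) / baseline[k] for k in shared if baseline[k]}
```

Parsing the scorecards of two runs and passing their score dicts to `relative_change` gives a rough per-metric delta between configurations; it is a convenience, not a substitute for the generated report.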

How It Works

                    ┌──────────────────────────────────┐
                    │         lakebench.yaml           │
                    └────────────┬─────────────────────┘
                                 │
                    ┌────────────▼─────────────────────┐
                    │ deploy (Kubernetes namespace,    │
                    │ S3 secrets, PostgreSQL, catalog, │
                    │ query engine, observability)     │
                    └────────────┬─────────────────────┘
                                 │
     Raw Parquet ──► Bronze (validate) ──► Silver (enrich) ──► Gold (aggregate)
         S3              Spark                Spark               Spark
                                                                    │
                                                        ┌───────────▼──────────┐
                                                        │  8-query benchmark   │
                                                        │  (Trino / DuckDB /   │
                                                        │   Spark Thrift)      │
                                                        └──────────────────────┘

Prerequisites

kubectl and helm on PATH
Kubernetes 1.26+ (minimum 8 CPU / 32 GB RAM for scale 1)
S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
Kubeflow Spark Operator 2.4.0+ (or set spark.operator.install: true)
Stackable Hive Operator for Hive recipes (not needed for Polaris)
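The first prerequisite can be checked locally before running any Lakebench command. This is a minimal sketch using only the Python standard library; `missing_tools` is a hypothetical helper. It covers only the PATH requirement, while lakebench validate checks the config and cluster connectivity.

```python
import shutil

# Tools the prerequisites list requires on PATH.
REQUIRED_TOOLS = ("kubectl", "helm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("kubectl and helm found")
```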

Commands

Command	Description
`init`	Generate a starter config file
`validate`	Check config and cluster connectivity
`info`	Show deployment configuration summary
`deploy`	Deploy all infrastructure components
`generate`	Generate synthetic data at the configured scale
`run`	Execute the medallion pipeline and benchmark
`benchmark`	Run the 8-query benchmark standalone
`query`	Execute ad-hoc SQL against the active engine
`status`	Show deployment status
`report`	Generate HTML scorecard report
`recommend`	Recommend cluster sizing for a scale factor
`destroy`	Tear down all deployed resources

See CLI Reference for flags and options.

Component Versions

Component	Version
Apache Spark	3.5.4, 4.0.2
Spark Operator	2.4.0 (Kubeflow)
Apache Iceberg	1.10.1
Delta Lake	4.0.0
Hive Metastore	3.1.3 (Stackable 25.7.0)
Apache Polaris	1.3.0-incubating
Trino	479
DuckDB	bundled (Python 3.11)
PostgreSQL	16, 17, 18

All versions are overridable in the YAML config. See Supported Components.

Documentation

Getting Started -- prerequisites, install, first run
Configuration -- full YAML reference
Recipes -- catalog + format + engine combinations
Compatibility Matrix -- Spark, Iceberg, and Delta version support
Running Pipelines -- batch and sustained modes
Benchmarking -- scorecard and query benchmark
Architecture -- system design
Troubleshooting -- common errors and fixes

License

Apache 2.0

Release history

Version	Released
1.3.1	Apr 5, 2026
1.3.0	Apr 1, 2026
1.2.0 (this version)	Mar 27, 2026
1.1.0	Mar 9, 2026
1.0.12	Mar 5, 2026
1.0.11	Mar 3, 2026
1.0.10	Mar 3, 2026
1.0.9	Feb 23, 2026
1.0.8	Feb 22, 2026
1.0.7	Feb 21, 2026
1.0.6	Feb 21, 2026
1.0.5	Feb 21, 2026
1.0.3	Feb 18, 2026
1.0.2	Feb 16, 2026
1.0.1	Feb 13, 2026
1.0.0	Feb 13, 2026

Download files

Source Distribution

lakebench_k8s-1.2.0.tar.gz (368.6 kB)

Uploaded Mar 27, 2026 Source

Built Distribution

lakebench_k8s-1.2.0-py3-none-any.whl (315.9 kB)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file lakebench_k8s-1.2.0.tar.gz.

File metadata

Download URL: lakebench_k8s-1.2.0.tar.gz
Upload date: Mar 27, 2026
Size: 368.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakebench_k8s-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`0143ce0b30caf5fa6fab1c8b491aecf7d3a5c61de9b4c19dbb077046349c3555`
MD5	`562f9c5d33421973fc29f917a28bf47c`
BLAKE2b-256	`ad594685c277102ae07b1b4df0c2fe4cf29780991d9c770e8445594331dfd64e`
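A downloaded file can be checked against the published digest before installing. The sketch below uses Python's standard hashlib and the SHA256 value from the table above; `sha256_of` and `matches` are hypothetical helper names, and the local file path is whatever the archive was saved as.

```python
import hashlib
import hmac

# SHA256 digest published above for lakebench_k8s-1.2.0.tar.gz.
EXPECTED_SHA256 = "0143ce0b30caf5fa6fab1c8b491aecf7d3a5c61de9b4c19dbb077046349c3555"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected.lower())
```

For example, `matches("lakebench_k8s-1.2.0.tar.gz")` should return True for an intact download of the source distribution. The same helper works for the wheel if its own digest is passed as `expected`. pip's hash-checking mode (`--require-hashes` with a requirements file) is an alternative that enforces the digest at install time.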

Provenance

The following attestation bundles were made for lakebench_k8s-1.2.0.tar.gz:

Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakebench_k8s-1.2.0.tar.gz
- Subject digest: 0143ce0b30caf5fa6fab1c8b491aecf7d3a5c61de9b4c19dbb077046349c3555
- Sigstore transparency entry: 1189447136
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: PureStorage-OpenConnect/lakebench-k8s@a87d9b5efa31b50c15bcf2541ab72e1f26886f3b
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/PureStorage-OpenConnect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a87d9b5efa31b50c15bcf2541ab72e1f26886f3b
- Trigger Event: push

File details

Details for the file lakebench_k8s-1.2.0-py3-none-any.whl.

File metadata

Download URL: lakebench_k8s-1.2.0-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 315.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lakebench_k8s-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`033a863a61618933a08f9a88751f94c7daac85400987b9d2e85976f25fcdbca2`
MD5	`d67ebb214223286008535e48dbea748b`
BLAKE2b-256	`ae5590ff8aaad05e44be8b5a4b1da311c0e92e806637577b9506ac1a03b4a838`

Provenance

The following attestation bundles were made for lakebench_k8s-1.2.0-py3-none-any.whl:

Publisher: release.yml on PureStorage-OpenConnect/lakebench-k8s

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lakebench_k8s-1.2.0-py3-none-any.whl
- Subject digest: 033a863a61618933a08f9a88751f94c7daac85400987b9d2e85976f25fcdbca2
- Sigstore transparency entry: 1189447141
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: PureStorage-OpenConnect/lakebench-k8s@a87d9b5efa31b50c15bcf2541ab72e1f26886f3b
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/PureStorage-OpenConnect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a87d9b5efa31b50c15bcf2541ab72e1f26886f3b
- Trigger Event: push
