
LakeBench

🌊 LakeBench is the first Python-based, multi-modal benchmarking framework designed to evaluate performance across multiple lakehouse compute engines and ELT scenarios. Supporting a variety of engines and both industry-standard and novel benchmarks, LakeBench enables comprehensive, apples-to-apples comparisons in a single, extensible Python library.

Most existing benchmarks (like TPC-DS and TPC-H) are too query-heavy and miss the reality that data engineers build complex ELT pipelines — not just run analytic queries. While these traditional benchmarks are helpful for testing bulk loading and complex SQL execution, they do not reflect the broader data lifecycle that lakehouse systems must support.

LakeBench bridges this gap by introducing novel benchmarks that aim to capture the growing spectrum of ELT workflows. In addition to supporting industry standards like TPC-DS and TPC-H, LakeBench includes scenarios that measure not only query performance, but also data loading, transformation, incremental processing, and maintenance operations. This holistic approach enables you to benchmark engines on the real-world tasks that matter most for modern data engineering.

LakeBench proposes that the entire end-to-end data lifecycle managed by data engineers is relevant: data loading, bulk and incremental transformations, maintenance jobs, and ad-hoc analytical queries. By benchmarking these stages, LakeBench delivers actionable insights into engine efficiency, performance, and operational trade-offs across the full data pipeline.


🧱 Key Features

  • Modular engine support (Spark, DuckDB, Polars, Daft)
  • Benchmark scenarios that reflect real-world ELT workflows
  • Atomic units of work that benchmark discrete lifecycle stages
  • Dataset Generation for all benchmarks
  • COMING SOON: Custom result logging and metrics capture (e.g. SparkMeasure)

🔍 Benchmark Scenarios

LakeBench currently supports three benchmarks with more to come:

  • ELTBench: A benchmark with multiple modes (light, full) that simulates typical ELT workloads:
    • Raw data load (Parquet → Delta)
    • Fact table generation
    • Incremental merge processing
    • Table maintenance (e.g. OPTIMIZE/VACUUM)
    • Ad-hoc analytical queries
  • TPC-DS: An industry-standard benchmark for complex analytical queries, featuring 24 source tables and 99 queries. Designed to simulate decision support systems and analytics workloads.
  • TPC-H: Focuses on ad-hoc decision support with 8 tables and 22 queries, evaluating performance on business-oriented analytical workloads.
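As a sketch of what benchmarking discrete lifecycle stages looks like, the following times each stage independently with wall-clock measurements. The stage names and workloads are illustrative stand-ins, not LakeBench's actual API:

```python
import time

def run_stages(stages):
    """Run each (name, fn) pair and record its wall-clock duration in seconds."""
    results = {}
    for name, fn in stages:
        start = time.perf_counter()
        fn()
        results[name] = time.perf_counter() - start
    return results

# Illustrative stand-ins for the real ELT stages
timings = run_stages([
    ("raw_load", lambda: sum(range(10_000))),
    ("fact_generation", lambda: [i * 2 for i in range(10_000)]),
    ("incremental_merge", lambda: {i: i for i in range(10_000)}),
])
print(timings)
```

Reporting each stage separately is what allows per-operation comparison rather than a single cumulative runtime.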

Coming Soon

  • AtomicELT: A derivative of ELTBench that focuses on the performance of individual ELT operations. Each operation type is executed only once, allowing for granular comparison of engine performance on specific tasks. Results should be interpreted per operation, not as a cumulative runtime.

🛠️ Engine Support Matrix

LakeBench supports multiple lakehouse compute engines. Each benchmark scenario declares which engines it supports via <BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY.

| Engine         | ELTBench | AtomicELT | TPC-DS | TPC-H |
|----------------|----------|-----------|--------|-------|
| Spark (Fabric) | ✅       | 🔜        | ✅     | ✅    |
| DuckDB         | ✅       | 🔜        | ✅     | ✅    |
| Polars         | ✅       | 🔜        | ✅     | ✅    |
| Daft           | ✅       | 🔜        | ✅     | ✅    |

Legend:
✅ = Supported
🔜 = Coming Soon
(Blank) = Not currently supported

LakeBench is designed to be extensible: new engines can be added by subclassing an existing engine class, and benchmarks can register support for additional engines as they are implemented.
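The registry-based support declaration can be sketched as a plain class-to-class mapping. Everything here except the BENCHMARK_IMPL_REGISTRY attribute name is hypothetical and for illustration only:

```python
# Hypothetical engine and implementation classes for illustration only
class Spark: ...
class DuckDB: ...
class SparkELTImpl: ...
class DuckDBELTImpl: ...

class ELTBench:
    # Each benchmark declares which engines it supports by mapping
    # an engine class to the implementation that drives it
    BENCHMARK_IMPL_REGISTRY = {Spark: SparkELTImpl, DuckDB: DuckDBELTImpl}

def supported_engines(benchmark_cls):
    """List the engine class names a benchmark declares support for."""
    return sorted(e.__name__ for e in benchmark_cls.BENCHMARK_IMPL_REGISTRY)

print(supported_engines(ELTBench))  # ['DuckDB', 'Spark']
```

Keeping the mapping on the benchmark class means a benchmark's engine coverage can be inspected without instantiating anything.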


📦 Installation

Install from PyPI:

pip install lakebench[duckdb,polars,daft]

Note: in this initial beta version, all engines have only been tested inside Microsoft Fabric Python and Spark Notebooks.

Example Usage

To run any LakeBench benchmark, first perform a one-time generation of the data required for the benchmark and scale of interest. LakeBench provides datagen classes to quickly generate the Parquet datasets the benchmarks require.

Data Generation

Data generation is provided via the DuckDB TPC-DS and TPC-H extensions. The LakeBench wrapper around DuckDB adds support for writing Parquet files with a configurable target row-group size, since the files DuckDB generates natively are atypically small (roughly 10 MB) and suitable only for ultra-small-scale scenarios. LakeBench targets 128 MB row groups by default; this can be configured via the target_row_group_size_mb parameter of both the TPC-H and TPC-DS DataGenerator classes.
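The effect of the row-group target is easy to see with back-of-envelope arithmetic. The average encoded row width below is an assumed figure for illustration, not a measured one:

```python
def rows_per_row_group(target_mb, avg_row_bytes):
    """Approximate how many rows fit in a row group of the target size."""
    return (target_mb * 1024 * 1024) // avg_row_bytes

# With the 128 MB default and an assumed ~100-byte average encoded row:
print(rows_per_row_group(128, 100))  # 1342177
# DuckDB's native ~10 MB files hold far fewer rows per group:
print(rows_per_row_group(10, 100))   # 104857
```

Larger row groups generally mean fewer, bigger reads per scan, which is why the small native output is a poor fit beyond toy scales.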

Generating scale factor 1 data takes about 1 minute on a 2vCore VM.

TPC-H Data Generation

from lakebench.datagen.tpch import TPCHDataGenerator

datagen = TPCHDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpch_sf1'
)
datagen.run()

TPC-DS Data Generation

from lakebench.datagen.tpcds import TPCDSDataGenerator

datagen = TPCDSDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpcds_sf1'
)
datagen.run()

Fabric Spark

from lakebench.engines.fabric_spark import FabricSpark
from lakebench.benchmarks.elt_bench import ELTBench

engine = FabricSpark(
    lakehouse_workspace_name="workspace",
    lakehouse_name="lakehouse",
    lakehouse_schema_name="schema"
)

benchmark = ELTBench(
    engine=engine,
    scenario_name="sf10",
    mode="light",
    tpcds_parquet_abfss_path="abfss://...",
    save_results=True,
    result_abfss_path="abfss://..."
)

benchmark.run()

Polars

from lakebench.engines.polars import Polars
from lakebench.benchmarks.elt_bench import ELTBench

engine = Polars(
    delta_abfss_schema_path='abfss://...'
)

benchmark = ELTBench(
    engine=engine,
    scenario_name="sf10",
    mode="light",
    tpcds_parquet_abfss_path="abfss://...",
    save_results=True,
    result_abfss_path="abfss://..."
)

benchmark.run()

🔌 Extensibility by Design

LakeBench is built to be plug-and-play for both benchmark types and compute engines:

  • You can register new engines without modifying core benchmark logic.
  • You can add new benchmarks that reuse existing engines and shared engine methods.
  • LakeBench extension libraries can be created to extend core LakeBench capabilities with additional custom benchmarks and engines (e.g. MyCustomSynapseSpark(Spark), MyOrgsELT(BaseBenchmark)).

This architecture encourages experimentation, benchmarking innovation, and easy adaptation to your needs.

Example:

# Automatically maps benchmark implementation to your custom engine class
from lakebench.engines.spark import Spark

class MyCustomSynapseSpark(Spark):
    ...

benchmark = AtomicELT(engine=MyCustomSynapseSpark(...))

All you need to do is subclass the relevant base class, and it will auto-register, provided the referenced benchmark supports that base class. No changes to the framework internals are required.
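One way this kind of auto-registration can work is by walking a class's method resolution order until a registered base is found, so any subclass of a supported engine resolves to the same implementation. This is a sketch of the mechanism, not LakeBench's internals; all names except MyCustomSynapseSpark are hypothetical:

```python
# Hypothetical classes illustrating how a subclass can "auto-register"
class Spark: ...
class MyCustomSynapseSpark(Spark): ...

class SparkELTImpl:
    def run(self):
        return "ran on a Spark-family engine"

BENCHMARK_IMPL_REGISTRY = {Spark: SparkELTImpl}

def resolve_impl(engine):
    """Walk the engine's MRO so any subclass of a registered base resolves."""
    for base in type(engine).__mro__:
        if base in BENCHMARK_IMPL_REGISTRY:
            return BENCHMARK_IMPL_REGISTRY[base]()
    raise TypeError(f"{type(engine).__name__} is not supported")

print(resolve_impl(MyCustomSynapseSpark()).run())  # ran on a Spark-family engine
```

Because the lookup keys off base classes rather than exact types, the custom subclass needs no explicit registration step.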

🔍 Philosophy

LakeBench is designed to host a suite of benchmarks that cover E2E data engineering and consumption workloads:

  • Loading data from raw storage
  • Transforming and enriching data
  • Applying incremental module building logic
  • Maintaining and optimizing datasets
  • Running complex analytical queries

The core aim is to provide transparency into engine efficiency, performance, and cost across the data lifecycle.

📬 Feedback / Contributions

Got ideas? Found a bug? Want to contribute a benchmark or engine wrapper? PRs and issues are welcome!

Acknowledgement of Other LakeBench Projects

The LakeBench name is also used by two unrelated academic and research efforts:

  • RLGen/LAKEBENCH: A benchmark designed for evaluating vision-language models on multimodal tasks.
  • LakeBench: Benchmarks for Data Discovery over Lakes (paper link): A benchmark suite focused on improving data discovery and exploration over large data lakes.

While these projects target very different problem domains — such as machine learning and data discovery — they coincidentally share the same name. This project, focused on ELT benchmarking across lakehouse engines, is not affiliated with or derived from either.
