A multi-modal Python library for benchmarking Azure lakehouse engines and ELT scenarios, supporting both industry-standard and novel benchmarks.

LakeBench

🌊 LakeBench is the first Python-based, multi-modal benchmarking framework designed to evaluate performance across multiple lakehouse compute engines and ELT scenarios. Supporting a variety of engines and both industry-standard and novel benchmarks, LakeBench enables comprehensive, apples-to-apples comparisons in a single, extensible Python library.

🚀 The Mission of LakeBench

LakeBench exists to bring clarity, trust, accessibility, and relevance to engine benchmarking by focusing on four core pillars:

  1. End-to-End ELT Workflows Matter

    Most benchmarks focus solely on analytic queries. But in practice, data engineers manage full data pipelines — loading data, transforming it (in batch, incrementally, or even streaming), maintaining tables, and then querying.

    LakeBench proposes that the entire end-to-end data lifecycle managed by data engineers is relevant, not just queries.

  2. Variety in Benchmarks Is Essential

    Real-world pipelines deal with different data shapes, sizes, and patterns. One-size-fits-all benchmarks miss this nuance.

    LakeBench covers a variety of benchmarks that represent diverse workloads — from bulk loads to incremental merges to maintenance jobs to ad-hoc queries — providing a richer picture of engine behavior under different conditions.

  3. Consistency Enables Trustworthy Comparisons

    Somehow, every engine claims to be the fastest at the same benchmark, at the same time. Without a standardized framework that supports many engines, comparisons are hard to trust and even more difficult to reproduce.

    LakeBench ensures consistent methodology across engines, reducing the likelihood of implementation bias and enabling repeatable, trustworthy results. Engine subject matter experts are encouraged to submit PRs to tune code as needed so that their preferred engine is best represented.

  4. Accessibility Starts with pip install

    Most benchmarking toolkits are inaccessible to the beginner data engineer, requiring the user to build the package from source or install it via a JAR, without Python bindings.

    LakeBench is intentionally built as a Python-native library, installable via pip from PyPI, so it's easy for any engineer to get started—no JVM or compilation required. It's so lightweight and approachable, you could even use it just for generating high-quality sample data.

✅ Why LakeBench?

  • Multi-Engine: Benchmark Spark, DuckDB, Polars, Daft, and more planned, side-by-side
  • Lifecycle Coverage: Ingest, transform, maintain, and query—just like real workloads
  • Diverse Workloads: Test performance across varied data shapes and operations
  • Consistent Execution: One framework, many engines
  • Extensible by Design: Add engines or additional benchmarks with minimal friction
  • Dataset Generation: Out-of-the-box dataset generation for all benchmarks
  • Rich Logs: Automatically logged engine version, compute size, duration, estimated execution cost, etc.

LakeBench empowers data teams to make informed engine decisions based on real workloads, not just marketing claims.

💪 Benchmarks

LakeBench currently supports four benchmarks with more to come:

  • ELTBench: A benchmark with multiple modes (light, full) that simulates typical ELT workloads:
    • Raw data load (Parquet → Delta)
    • Fact table generation
    • Incremental merge processing
    • Table maintenance (e.g. OPTIMIZE/VACUUM)
    • Ad-hoc analytical queries
  • TPC-DS: An industry-standard benchmark for complex analytical queries, featuring 24 source tables and 99 queries. Designed to simulate decision support systems and analytics workloads.
  • TPC-H: Focuses on ad-hoc decision support with 8 tables and 22 queries, evaluating performance on business-oriented analytical workloads.
  • ClickBench: A benchmark that simulates ad-hoc analytical and real-time queries on clickstream, traffic analysis, web analytics, machine-generated data, structured logs, and events data. The load phase (single flat table) is followed by 43 queries.

Planned

  • TPC-DI: An industry-standard benchmark for data integration workloads, evaluating end-to-end ETL/ELT performance across heterogeneous sources—including data ingestion, transformation, and loading processes.

⚙️ Engine Support Matrix

LakeBench supports multiple lakehouse compute engines. Each benchmark scenario declares which engines it supports via <BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY.

Engine           ELTBench   TPC-DS   TPC-H   ClickBench
Spark (Fabric)   ✅         ✅       ✅      ✅
DuckDB           ✅         ⚠️       ✅      🔜
Polars           ✅         ⚠️       ⚠️      🔜
Daft             ✅         ⚠️       ⚠️      🔜

Legend:
✅ = Supported
⚠️ = Some queries fail due to syntax issues (e.g. Polars doesn't support SQL non-equi joins), fixes coming soon!
🔜 = Coming Soon
(Blank) = Not currently supported
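
To see at runtime which engines a benchmark supports, you can inspect this registry directly. A minimal sketch, assuming BENCHMARK_IMPL_REGISTRY is a mapping from engine classes to optional engine-specific implementation classes (None meaning the generic engine methods are used, per the register_engine description below):

from lakebench.benchmarks import TPCDS

# Assumption: BENCHMARK_IMPL_REGISTRY maps engine classes to an optional
# engine-specific implementation class (None = generic engine methods).
for engine_cls, impl_cls in TPCDS.BENCHMARK_IMPL_REGISTRY.items():
    impl_name = impl_cls.__name__ if impl_cls else "generic"
    print(f"{engine_cls.__name__}: {impl_name}")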

🔌 Extensibility by Design

LakeBench is designed to be extensible, both for additional engines and benchmarks.

  • You can register new engines without modifying core benchmark logic.
  • You can add new benchmarks that reuse existing engines and shared engine methods.
  • LakeBench extension libraries can be created to extend core LakeBench capabilities with additional custom benchmarks and engines (e.g. MyCustomSynapseSpark(Spark), MyOrgsELT(BaseBenchmark)).

New engines can be added by subclassing an existing engine class. Existing benchmarks can then register support for additional engines as shown below:

from lakebench.benchmarks import TPCDS
TPCDS.register_engine(MyNewEngine, None)

register_engine is a class method that updates <BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY. It takes two arguments: the engine class being registered, and an engine-specific benchmark implementation class if one is required (pass None to use the methods on the generic engine class).

This architecture encourages experimentation, benchmarking innovation, and easy adaptation.

Example:

from lakebench.engines import BaseEngine

class MyCustomEngine(BaseEngine):
    ...

from lakebench.benchmarks.elt_bench import ELTBench
# registering the engine is only required if you aren't subclassing an existing registered engine
ELTBench.register_engine(MyCustomEngine, None)

benchmark = ELTBench(engine=MyCustomEngine(...))
benchmark.run()

Using LakeBench

📦 Installation

Install from PyPI:

pip install lakebench[duckdb,polars,daft,tpcds_datagen,tpch_datagen,sparkmeasure]
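
The extras pull in optional engine and data-generation dependencies. If you only need a subset of engines, installing just the matching extras should suffice, e.g. for DuckDB alone:

pip install lakebench[duckdb]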

Note: in this initial beta version, all engines have only been tested inside Microsoft Fabric Python and Spark Notebooks.

Example Usage

To run any LakeBench benchmark, first do a one-time generation of the data required for the benchmark and scale of interest. LakeBench provides datagen classes to quickly generate the parquet datasets required by the benchmarks.

Data Generation

Data generation is provided via the DuckDB TPC-DS and TPC-H extensions. The LakeBench wrapper around DuckDB adds support for writing parquet files at a target row-group size, since the files DuckDB generates natively are atypically small (e.g. 10MB) and suitable only for ultra-small-scale scenarios. LakeBench targets 128MB row groups by default; this can be configured via the target_row_group_size_mb parameter of both the TPC-H and TPC-DS DataGenerator classes.

Generating scale factor 1 data takes about 1 minute on a 2-vCore VM.

TPC-H Data Generation

from lakebench.datagen import TPCHDataGenerator

datagen = TPCHDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpch_sf1'
)
datagen.run()
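
To override the default 128MB row-group target described above, pass target_row_group_size_mb (a minimal sketch; the parameter comes from the description above, and the value here is purely illustrative):

from lakebench.datagen import TPCHDataGenerator

datagen = TPCHDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpch_sf1',
    target_row_group_size_mb=64  # default is 128
)
datagen.run()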

TPC-DS Data Generation

from lakebench.datagen import TPCDSDataGenerator

datagen = TPCDSDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpcds_sf1'
)
datagen.run()

Notes:

  • TPC-H data can be generated up to SF100; however, I hit OOM issues when generating SF1000 on a 64-vCore machine.
  • TPC-DS data up to SF1000 can be generated on a 32-vCore machine.
  • TPC-H and TPC-DS datasets up to SF10 will complete in minutes on a 2-vCore machine.
  • The ClickBench dataset (only one size) downloads as partitioned files in about 1 minute, or as a single file in about 6 minutes.

Fabric Spark

from lakebench.engines import FabricSpark
from lakebench.benchmarks import ELTBench

engine = FabricSpark(
    lakehouse_workspace_name="workspace",
    lakehouse_name="lakehouse",
    lakehouse_schema_name="schema",
    spark_measure_telemetry=True
)

benchmark = ELTBench(
    engine=engine,
    scenario_name="sf10",
    mode="light",
    tpcds_parquet_abfss_path="abfss://...",
    save_results=True,
    result_abfss_path="abfss://..."
)

benchmark.run()

Note: The spark_measure_telemetry flag can be enabled to capture stage metrics in the results. The sparkmeasure install option must be used when spark_measure_telemetry is enabled (%pip install lakebench[sparkmeasure]). Additionally, the Spark-Measure JAR must be installed from Maven: https://mvnrepository.com/artifact/ch.cern.sparkmeasure/spark-measure_2.13/0.24

Polars

from lakebench.engines import Polars
from lakebench.benchmarks import ELTBench

engine = Polars(
    delta_abfss_schema_path='abfss://...'
)

benchmark = ELTBench(
    engine=engine,
    scenario_name="sf10",
    mode="light",
    tpcds_parquet_abfss_path="abfss://...",
    save_results=True,
    result_abfss_path="abfss://..."
)

benchmark.run()
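
Running one of the query benchmarks follows the same pattern. The sketch below is hypothetical: it assumes the TPCDS benchmark class accepts constructor parameters mirroring the ELTBench examples above, which may differ from the actual signature:

from lakebench.engines import FabricSpark
from lakebench.benchmarks import TPCDS

engine = FabricSpark(
    lakehouse_workspace_name="workspace",
    lakehouse_name="lakehouse",
    lakehouse_schema_name="schema"
)

# Assumed parameters, mirroring the ELTBench examples above
benchmark = TPCDS(
    engine=engine,
    scenario_name="sf10",
    tpcds_parquet_abfss_path="abfss://...",
    save_results=True,
    result_abfss_path="abfss://..."
)

benchmark.run()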

📬 Feedback / Contributions

Got ideas? Found a bug? Want to contribute a benchmark or engine wrapper? PRs and issues are welcome!

Acknowledgement of Other LakeBench Projects

The LakeBench name is also used by two unrelated academic and research efforts:

  • RLGen/LAKEBENCH: A benchmark designed for evaluating vision-language models on multimodal tasks.
  • LakeBench: Benchmarks for Data Discovery over Lakes (paper link): A benchmark suite focused on improving data discovery and exploration over large data lakes.

While these projects target very different problem domains — such as machine learning and data discovery — they coincidentally share the same name. This project, focused on ELT benchmarking across lakehouse engines, is not affiliated with or derived from either.
