Run the TPC-DS benchmark on Databricks (Delta Lake).
Running TPC-DS on Databricks
This document describes how to run the TPC-DS benchmark on Databricks. TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. It provides a representative evaluation of performance as a general-purpose decision support system. The benchmark is defined and maintained by the Transaction Processing Performance Council (TPC); "DS" stands for decision support.
Prerequisites
- Databricks workspace
- Databricks metastore attached to the workspace
- Databricks cluster (jobs or all-purpose)
Install from PyPI
Install the package directly in a Databricks notebook:
%pip install databricks-tpcds
The package provides the DatabricksTPCDS library. You drive it from an entrypoint script like
the Delta Lake example below.
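To confirm that the install took effect in the notebook's Python environment, you can look up the distribution's version with the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, not part of the package, and it assumes the distribution is registered under the same name used in the `%pip install` command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("databricks-tpcds")
    print(found if found is not None else "databricks-tpcds is not installed")
```

If the helper reports that the package is missing right after a `%pip install`, restarting the Python process in the notebook is usually the first thing to try.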
Delta Lake entrypoint example
Replace the placeholder values of catalog_name, bucket_name, prefix, and schema_name with your own,
then run the script on your Databricks cluster.
from pyspark.sql import SparkSession
from databricks_tpcds.databricks_tpcds import DatabricksTPCDS

def main():
    catalog_name = 'my_catalog'
    bucket_name = 'my-bucket'
    prefix = 'path/to/tpcds-datasets/1TB'
    schema_name = 'my_schema'
    # Initialize Spark session
    spark = SparkSession.builder.appName("TPCDS Query Runner").getOrCreate()
    # Disable the Databricks disk cache (set to "true" to enable it)
    spark.conf.set("spark.databricks.io.cache.enabled", "false")
    databricks_tpcds = DatabricksTPCDS(spark, schema_name=schema_name, catalog_name=catalog_name)
    # Create catalog
    databricks_tpcds.create_catalog()
    # Create schema
    databricks_tpcds.create_schema()
    # Create a single table by name
    # databricks_tpcds.create_table(bucket_name, prefix, "call_center")
    # Create several tables from a list of table names
    # databricks_tpcds.create_tables(bucket_name, prefix, ["call_center", "catalog_page"])
    # Create all tables from the data under the given bucket and prefix
    databricks_tpcds.create_all_tables(bucket_name, prefix)
    # Run all queries three times and print the timings of each run as CSV
    for _ in range(3):
        time_taken_by_queries = databricks_tpcds.run_all_queries(should_warmup=False)
        print("QUERY_NUMBER,TIME_TAKEN")
        for query_no, time_taken in time_taken_by_queries.items():
            print(f"{query_no},{time_taken}")

if __name__ == "__main__":
    main()
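The entrypoint prints one CSV block per run, so comparing runs is left to the reader. One possible way to condense them is to collect each run's result and report the median time per query. The sketch below is an assumption-laden illustration: `summarize_runs` is a hypothetical helper, not part of the package, and it assumes each `run_all_queries` result is a dict mapping a query number to a numeric elapsed time (the entrypoint only shows that the result supports `.items()`). The timings in the usage section are made up.

```python
from statistics import median
from typing import Dict, Hashable, List


def summarize_runs(runs: List[Dict[Hashable, float]]) -> Dict[Hashable, float]:
    """Return the median time per query across several benchmark runs."""
    timings: Dict[Hashable, List[float]] = {}
    for run in runs:
        for query_no, time_taken in run.items():
            # Group every observed time under its query number.
            timings.setdefault(query_no, []).append(time_taken)
    return {query_no: median(values) for query_no, values in timings.items()}


if __name__ == "__main__":
    # Made-up timings standing in for three run_all_queries() results.
    example_runs = [
        {1: 4.0, 2: 9.5},
        {1: 3.0, 2: 10.5},
        {1: 5.0, 2: 10.0},
    ]
    print("QUERY_NUMBER,MEDIAN_TIME_TAKEN")
    for query_no, value in summarize_runs(example_runs).items():
        print(f"{query_no},{value}")
```

In the entrypoint this would mean appending each `time_taken_by_queries` to a list inside the loop and calling the helper once after it. The median is used here because it is less sensitive than the mean to a single slow run.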
Developing locally
- Modify the code if necessary in src/databricks_tpcds/databricks_tpcds.py
- Take a look at or modify the queries in src/resources/queries/
- Build the package:
cd tpcds/databricks
python3.10 -m build
- Upload the built .whl to your Databricks workspace and install it in a notebook:
%pip install path/to/databricks_tpcds-0.1.0-py3-none-any.whl --force-reinstall
- Run the benchmark using the Delta Lake entrypoint example above.
File details
Details for the file databricks_tpcds-0.1.0.tar.gz.
File metadata
- Download URL: databricks_tpcds-0.1.0.tar.gz
- Upload date:
- Size: 38.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a212fa5bdbe477b13f2441d4c7928a318532a6435c20f97a7cb4d2db20fb080e |
| MD5 | 93da24907ca04e375dab5f75b1610efc |
| BLAKE2b-256 | 171227dba993dfcad3fdb8de0d33067dd777c4cd86ad157a715edc4c1bb068c1 |
File details
Details for the file databricks_tpcds-0.1.0-py3-none-any.whl.
File metadata
- Download URL: databricks_tpcds-0.1.0-py3-none-any.whl
- Upload date:
- Size: 69.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 03cc90f0fb09ae02fdda23690a4bfdc22bd448f52c3506615cd7de50631776df |
| MD5 | a7772d56b5972f9f1d15042f58611524 |
| BLAKE2b-256 | eb6f3e080258a20cc29590ca1666b31919df4ec95c3d0ce2393eb729ba03b468 |
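The SHA256 digests listed above can be used to check a downloaded file before uploading it to a workspace. The following is a minimal sketch using only the standard library; `sha256_of` and `matches_published` are hypothetical helper names, and the local file path in the usage section is an assumption about where the wheel was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path; the digest is the SHA256 listed for the wheel.
    wheel = Path("databricks_tpcds-0.1.0-py3-none-any.whl")
    expected = "03cc90f0fb09ae02fdda23690a4bfdc22bd448f52c3506615cd7de50631776df"
    if wheel.exists():
        print("digest matches" if matches_published(wheel, expected) else "digest MISMATCH")
    else:
        print(f"{wheel} not found")
```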