Database drift benchmarking for researchers, DB vendors, and new users: generate, validate, and run data/workload drift with CLI or MCP.

DriftBench

DriftBench is a toolkit for generating and replaying data drift and workload drift with DriftSpec.

Who uses DriftBench:

  • Researcher — design reproducible drift experiments and ablations.
  • Database Vendor / Performance Team — run drift regression checks across targets before release.
  • New User — start from validated examples and get first outputs quickly.

Version history: CHANGELOG · Production site: driftbench.com


Install

pip install -U driftbench-db

Or from source:

git clone https://github.com/Liuguanli/DriftBench.git
cd DriftBench
pip install -e .

Verify:

driftbench --help

Benchmark Adapters (driftbench.data)

Nine adapters generate real data files and SQL query workloads with no external dependencies (TPC-H mode="generate" auto-downloads and builds dbgen on first use).

Adapter    Workload type            Data format             Tables          Queries
tpch       OLAP                     .tbl (pipe-delimited)   8               22 SQL via qgen
tpcds      OLAP / Decision support  .dat (pipe-delimited)   5 synthetic     99 query IDs
tpcc       OLTP                     .csv                    9               5 transaction types
tpcc_skew  OLTP + hotspot           .csv + weight manifest  9               5 transaction types
job        OLAP / join-order        .csv                    11 (IMDB-like)  20 SQL templates
ycsb       Key-value                .csv                    1               6 workload mixes (A–F)
dsb        Decision support         .csv                    3 star-schema   3 SQL templates
pgbench    TPC-B (OLTP)             .csv                    4               3 workloads
benchbase  Multi-benchmark          XML + shell script      via live DB     10 benchmarks

Generate data and queries

from pathlib import Path
from driftbench.data.tpch import data as tpch_data, queries as tpch_queries
from driftbench.data.tpcds import data as tpcds_data, queries as tpcds_queries
from driftbench.data.tpcc import data as tpcc_data, queries as tpcc_queries
from driftbench.data.tpcc_skew import data as tpcc_skew_data, queries as tpcc_skew_queries
from driftbench.data.job import data as job_data, queries as job_queries
from driftbench.data.ycsb import data as ycsb_data, queries as ycsb_queries
from driftbench.data.dsb import data as dsb_data, queries as dsb_queries
from driftbench.data.pgbench import data as pgbench_data, queries as pgbench_queries
from driftbench.data.benchbase import data as bb_data, queries as bb_queries

out = Path("./artifacts")

# TPC-H — auto-builds dbgen on first use; converts .tbl to .csv with .as_csv()
tpch_data(scale_factor=1, mode="generate").generate(output_dir=out)
tpch_queries(query_ids=[1, 3, 5], queries_per_template=2).generate(output_dir=out)

# TPC-DS — synthetic .dat files; converts to .csv with .as_csv()
tpcds_data(scale_factor=10).generate(output_dir=out)
tpcds_queries().generate(output_dir=out)

# TPC-C — scale_factor = number of warehouses
tpcc_data(scale_factor=4).generate(output_dir=out)
tpcc_queries().generate(output_dir=out)

# TPC-C Skew — Zipf hot-warehouse access distribution
tpcc_skew_data(scale_factor=10, hot_warehouse_fraction=0.2, skew_factor=0.99).generate(output_dir=out)
tpcc_skew_queries(scale_factor=10, hot_warehouse_fraction=0.2).generate(output_dir=out)

# JOB, YCSB, DSB, pgbench
job_data(scale_factor=1).generate(output_dir=out)
ycsb_data(scale_factor=1).generate(output_dir=out)
ycsb_queries(workload="B").generate(output_dir=out)
dsb_data(scale_factor=10).generate(output_dir=out)
pgbench_data(scale_factor=1).generate(output_dir=out)
pgbench_queries(workload="tpcb").generate(output_dir=out)

# BenchBase — generates XML configs + shell scripts for a live database
bb_data(benchmark="tpcc", scale_factor=10).generate(output_dir=out)
bb_queries(benchmark="tpcc", terminals=8, duration=120).generate(output_dir=out)
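
The TPC-C Skew parameters above point to a Zipf-style access distribution over warehouses. As a rough illustration only, and not DriftBench's actual weighting, the sketch below computes normalized Zipf weights and the share of accesses that land on the hottest warehouses. The function name zipf_weights and the reading of skew_factor as the Zipf exponent are assumptions.

```python
def zipf_weights(n_warehouses: int, skew_factor: float) -> list[float]:
    """Return normalized Zipf weights: rank r gets weight proportional to 1 / r**skew_factor."""
    if n_warehouses < 1:
        raise ValueError("n_warehouses must be >= 1")
    raw = [1.0 / (rank ** skew_factor) for rank in range(1, n_warehouses + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# Assumed mapping: scale_factor -> number of warehouses, skew_factor -> Zipf exponent.
n_warehouses = 10
weights = zipf_weights(n_warehouses, skew_factor=0.99)

# Treat the top 20% of warehouses (by rank) as the "hot" set.
hot_count = max(1, round(n_warehouses * 0.2))
hot_share = sum(weights[:hot_count])
print(f"top {hot_count} warehouses receive {hot_share:.1%} of accesses")
```

With an exponent near 1 the hot set receives far more than its proportional share, which is the kind of hotspot behavior a skewed OLTP workload is meant to exercise.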

Output layout

artifacts/
  tpch/data/sf_1/tables/   tpch/queries/
  tpcds/data/              tpcds/queries/
  tpcc/data/               tpcc/queries/
  tpcc_skew/data/          tpcc_skew/queries/
  job/data/                job/queries/
  ycsb/data/               ycsb/queries/
  dsb/data/                dsb/queries/
  pgbench/data/            pgbench/queries/
  benchbase/tpcc/data/     benchbase/tpcc/queries/

Each folder contains a *_manifest.json listing the generated files.
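
To inspect what was generated, the manifests can be collected with the standard library. The sketch below assumes only that manifests are JSON files matching *_manifest.json somewhere under the output directory. The helper name load_manifests is hypothetical, and since the keys inside a manifest are not specified here, the demo uses a placeholder payload in a temporary directory rather than real DriftBench output.

```python
import json
import tempfile
from pathlib import Path


def load_manifests(root: Path) -> dict[Path, object]:
    """Map each *_manifest.json found under root to its parsed JSON content."""
    manifests = {}
    for path in sorted(root.rglob("*_manifest.json")):
        with path.open(encoding="utf-8") as fh:
            manifests[path] = json.load(fh)
    return manifests


# Demo against a throwaway directory standing in for ./artifacts.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "tpch" / "data"
    demo_dir.mkdir(parents=True)
    # Placeholder content: the real manifest schema is not documented here.
    placeholder = {"files": ["lineitem.tbl"]}
    (demo_dir / "data_manifest.json").write_text(json.dumps(placeholder), encoding="utf-8")

    found = load_manifests(Path(tmp))
    found_names = [p.name for p in found]
    found_payloads = list(found.values())
    print(found_names)
```

Pointing load_manifests at Path("./artifacts") after a generation run would list the real manifests in the same way.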

GenerationResult

generate() returns a GenerationResult:

result = tpch_data(scale_factor=1, mode="generate").generate(output_dir=out)
result.files      # list of generated file paths
result.metadata   # path to the manifest JSON

# Convert pipe-delimited .tbl / .dat to standard CSV (both kept on disk).
# Known TPC-H (8 tables) and TPC-DS (5 synthetic tables) get a proper
# header row, so the CSV is self-describing and usable directly by .drift().
csv_result = result.as_csv()

A second call with the same arguments reuses the existing files automatically. Pass force=True to regenerate.
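
For readers curious what a pipe-delimited to CSV conversion involves, here is a minimal standard-library sketch. It is not DriftBench's as_csv() implementation; the function name tbl_to_csv is illustrative and the sample rows are made up in the shape of the TPC-H region table. It relies on the dbgen convention that every .tbl row ends with a trailing pipe.

```python
import csv
import io


def tbl_to_csv(tbl_text: str, header: list[str]) -> str:
    """Convert pipe-delimited rows (with a trailing '|') to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for line in tbl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("|")
        if fields and fields[-1] == "":
            fields.pop()  # drop the empty field produced by the trailing pipe
        writer.writerow(fields)
    return out.getvalue()


# Made-up rows; the comma in the first comment forces CSV quoting.
sample = "0|AFRICA|special deposits, carefully|\n2|ASIA|ironic requests|\n"
csv_text = tbl_to_csv(sample, ["r_regionkey", "r_name", "r_comment"])
print(csv_text)
```

Using csv.writer rather than a plain string replace matters because free-text columns may contain commas, which must be quoted to keep the CSV well-formed.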

Applying drift to benchmark data

GenerationResult exposes .drift() and .drift_multi() to apply data drift directly — no manual schema extraction or generator setup needed.

Single-table drift:

from driftbench.data.tpch import TPCHData

result = TPCHData(scale_factor=1, source_dir="path/to/tbls").generate().as_csv()

# Inject outliers into lineitem.l_quantity
drifted = result.drift("lineitem", "outlier_injection", column="l_quantity", n=500)

# Skew the price/discount distribution
drifted = result.drift("lineitem", "value_skew",
                       columns=["l_extendedprice", "l_discount"], skewness=2)

drift() writes the drifted CSV to <output_dir>/<table>_<drift_type>.csv by default. Pass output_path= to override. Returns a new GenerationResult pointing at the drifted file.

Every .drift() call also emits a reproducible DriftSpec YAML (<output_stem>.driftspec.yaml) next to the CSV — kept out of result.files but recorded under the manifest's driftspec key. Running that YAML through driftbench.spec.core.run_all regenerates byte-identical output, so a Python-generated drift can be shared or automated as a spec without rework. The function-call path (fast, imperative) and the spec path (declarative, version-controllable, reproducible) are the same engine and produce identical results for the same seed and parameters.
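
The reproducibility described above rests on seeding. The sketch below is a conceptual illustration of seeded outlier injection, not DriftBench's implementation: the function inject_outliers, its scale parameter, and the rule used to create an outlier (multiplying by a constant) are all assumptions. It shows that the same seed and parameters give identical output.

```python
import random


def inject_outliers(values: list[float], n: int, seed: int, scale: float = 10.0) -> list[float]:
    """Return a copy of values with n randomly chosen entries multiplied by scale."""
    rng = random.Random(seed)  # private RNG, so the result depends only on the seed
    n = min(n, len(values))
    drifted = list(values)  # leave the input untouched
    for idx in rng.sample(range(len(values)), n):
        drifted[idx] = drifted[idx] * scale
    return drifted


quantities = [float(q) for q in range(1, 51)]
run_a = inject_outliers(quantities, n=5, seed=42)
run_b = inject_outliers(quantities, n=5, seed=42)

# Same seed and parameters: identical drifted data, with exactly n values changed.
changed = sum(1 for before, after in zip(quantities, run_a) if before != after)
print(run_a == run_b, changed)
```

Using a dedicated random.Random instance instead of the module-level functions keeps the result independent of any other random calls in the same process.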

Multi-table drift:

# FK relationships for tpch / job are wired automatically
drifted = result.drift_multi([
    {"op": "skew_column", "target": "lineitem", "column": "l_quantity",
     "fraction": 0.2, "skewness": 2},
    {"op": "delete_keys", "target": "orders", "key_column": "o_orderkey",
     "fraction": 0.05,
     "propagate": [{"relationship": "lineitem_orders", "policy": "drop"}]},
])

Pass relationships=[] or a custom list to override the built-in FK maps. Supported benchmarks with auto-wiring: tpch, job. tpcc and tpcc_skew require explicit relationship definitions because their joins use composite keys.
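
To make the propagate policy concrete, the following sketch shows one plausible reading of delete_keys with a "drop" policy: a fraction of parent keys is removed, and child rows that reference them are dropped too, so no orphaned foreign keys remain. This is an interpretation for illustration, not DriftBench's code; the function name and the in-memory row format are assumptions.

```python
import random


def delete_keys_with_drop(
    parents: list[dict],
    children: list[dict],
    key: str,
    fk: str,
    fraction: float,
    seed: int = 0,
) -> tuple[list[dict], list[dict]]:
    """Delete a fraction of parent rows and drop child rows that reference them."""
    rng = random.Random(seed)
    keys = [row[key] for row in parents]
    n_delete = round(len(keys) * fraction)
    doomed = set(rng.sample(keys, n_delete))
    kept_parents = [row for row in parents if row[key] not in doomed]
    kept_children = [row for row in children if row[fk] not in doomed]
    return kept_parents, kept_children


# Toy data: 100 orders, each with two line items.
orders = [{"o_orderkey": k} for k in range(1, 101)]
lineitem = [{"l_orderkey": k, "l_linenumber": n} for k in range(1, 101) for n in (1, 2)]

new_orders, new_lineitem = delete_keys_with_drop(
    orders, lineitem, key="o_orderkey", fk="l_orderkey", fraction=0.05, seed=7
)

# Referential integrity check: every remaining line item still has its order.
remaining = {row["o_orderkey"] for row in new_orders}
orphans = [row for row in new_lineitem if row["l_orderkey"] not in remaining]
print(len(new_orders), len(new_lineitem), len(orphans))
```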

DriftSpec YAMLs: ready-to-run example specs for five of the adapters are in driftspec/examples/:

  • tpch_lineitem_drift.yaml
  • tpcc_drift.yaml
  • job_drift.yaml
  • ycsb_drift.yaml
  • pgbench_drift.yaml

CLI Quickstart

# Validate a DriftSpec
python -m driftbench.cli validate-spec driftspec/examples/demo_data_single.yaml --json

# Dry-run (preview execution plan)
python -m driftbench.cli dry-run driftspec/examples/demo_data_single.yaml --json

# Execute
python -m driftbench.cli run-yaml driftspec/examples/demo_data_single.yaml
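
When scripting these steps, for example in CI, the same commands can be driven from Python. The sketch below only builds and runs the three commands shown above. The helper names are hypothetical, and because the shape of the --json output is not documented here, run_driftbench returns raw stdout instead of parsing specific fields.

```python
import subprocess
import sys

# --json appears above only for these two subcommands.
JSON_CAPABLE = {"validate-spec", "dry-run"}
ALLOWED = JSON_CAPABLE | {"run-yaml"}


def driftbench_command(subcommand: str, spec_path: str, json_output: bool = False) -> list[str]:
    """Build the argv for one of the CLI subcommands shown above."""
    if subcommand not in ALLOWED:
        raise ValueError(f"unsupported subcommand: {subcommand}")
    if json_output and subcommand not in JSON_CAPABLE:
        raise ValueError(f"--json is not shown for {subcommand}")
    argv = [sys.executable, "-m", "driftbench.cli", subcommand, spec_path]
    if json_output:
        argv.append("--json")
    return argv


def run_driftbench(subcommand: str, spec_path: str, json_output: bool = False) -> str:
    """Run the command and return stdout; raises CalledProcessError on a non-zero exit."""
    argv = driftbench_command(subcommand, spec_path, json_output)
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


spec = "driftspec/examples/demo_data_single.yaml"
validate_argv = driftbench_command("validate-spec", spec, json_output=True)
run_argv = driftbench_command("run-yaml", spec)
print(" ".join(validate_argv[1:]))
```

A typical sequence would call run_driftbench for validate-spec, then dry-run, and only then run-yaml, stopping at the first CalledProcessError.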

Python API

from driftbench import run_spec, trace_to_spec

run_spec("driftspec/examples/demo_data_single.yaml")
trace_to_spec("driftspec/trace_inputs/trace_data_mock.csv", "driftspec/generated/from_trace.yaml")

MCP Server

python3 -m driftbench_mcp.server

Core workflow via MCP: trace_to_spec → validate_spec → run_spec → list_outputs


Testing

python -m unittest discover -s test -p 'test_*.py' -v

License

MIT — see LICENSE.
