
Database drift benchmarking for researchers, DB vendors, and new users: generate, validate, and run data/workload drift with CLI or MCP.



DriftBench

DriftBench is a toolkit for generating and replaying data drift and workload drift with DriftSpec.

This README is intentionally focused on how to use the latest DriftBench.


Who typically uses DriftBench:

  • Researcher: design reproducible drift experiments and ablations.
  • Database Vendor / Performance Team: run drift regression checks across targets before release.
  • New User: start from validated examples and get first outputs quickly.

Web Frontend

  • Production site: driftbench.com
  • Frontend source repo: driftbench-web
  • Release branch note: pushes to release/** with user-facing DriftBench changes auto-dispatch a docs update event to driftbench-web.
  • Dispatch verification note (2026-05-10): this README line is used to validate cross-repo release notifications.
  • Dispatch verification note (retry): confirms the receiver workflow on driftbench-web is active after workflow fix.

Release Reproducibility

  • Workflow: .github/workflows/reproducible-drift-runs.yml
  • Trigger manually from GitHub Actions (workflow_dispatch) or call from other workflows (workflow_call).
  • Default run executes and validates:
    • driftspec/examples/demo_data_single.yaml
    • driftspec/examples/workload_census.yaml
  • Artifacts are uploaded as driftbench-reproducible-run-artifacts.

Install (Latest)

From PyPI (recommended)

python3 -m pip install -U driftbench-db

From source (latest main)

git clone https://github.com/Liuguanli/DriftBench.git
cd DriftBench
python3 -m pip install -e .

Verify installation

driftbench --help
driftbench-service --help
driftbench-mcp --help

CLI Quickstart

Most users can follow this flow:

# 1) Validate a DriftSpec
python -m driftbench.cli validate-spec driftspec/examples/demo_data_single.yaml --json

# 2) Preview execution plan
python -m driftbench.cli dry-run driftspec/examples/demo_data_single.yaml --json

# 3) Execute
python -m driftbench.cli run-yaml driftspec/examples/demo_data_single.yaml

# 4) Inspect outputs
python -m driftbench.cli list-outputs --root output --glob "**/*" --limit 30 --json
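
The same four steps can be driven from a script. The sketch below only assembles the argument lists shown above and prints them; the helper names quickstart_commands and run_in_order are illustrative and not part of DriftBench. Running the commands assumes driftbench is installed for the current interpreter and that the CLI signals failure through a non-zero exit code.

```python
import shlex
import subprocess
import sys


def quickstart_commands(spec_path, output_root="output"):
    """Return the four quickstart invocations as argv lists (hypothetical helper)."""
    cli = [sys.executable, "-m", "driftbench.cli"]
    return [
        cli + ["validate-spec", spec_path, "--json"],
        cli + ["dry-run", spec_path, "--json"],
        cli + ["run-yaml", spec_path],
        cli + ["list-outputs", "--root", output_root,
               "--glob", "**/*", "--limit", "30", "--json"],
    ]


def run_in_order(commands):
    """Run each command in turn; check=True raises on the first non-zero exit."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = quickstart_commands("driftspec/examples/demo_data_single.yaml")
for argv in commands:
    # Print only. Call run_in_order(commands) to actually execute the steps.
    print(shlex.join(argv))
```

Keeping the commands as lists avoids shell quoting problems with the "**/*" glob, which must reach the CLI unexpanded.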

Trace to DriftSpec

python -m driftbench.cli trace-to-spec \
  driftspec/trace_inputs/trace_data_mock.csv \
  driftspec/generated/from_trace.yaml \
  --trace-type data

Orchestrate Across Benchmark Targets (MVP)

Use one DriftSpec across multiple benchmark targets defined in a targets file (benchmark_targets_mvp.yaml in the examples below).

python -m driftbench.cli orchestrate \
  --spec driftspec/examples/demo_data_single.yaml \
  --targets driftspec/examples/adapters/benchmark_targets_mvp.yaml \
  --manifest-out output/orchestrate_manifest.json \
  --json

Add --execute to also run the setup/run commands for each target:

python -m driftbench.cli orchestrate \
  --spec driftspec/examples/demo_data_single.yaml \
  --targets driftspec/examples/adapters/benchmark_targets_mvp.yaml \
  --manifest-out output/orchestrate_manifest.json \
  --execute \
  --json

Bootstrap Dataset (download/copy + checksum + schema extract)

Bootstrap from a preset name, a local path, or a URL:

python -m driftbench.cli bootstrap dataset \
  --source census_original \
  --output-dir output/bootstrap/datasets \
  --json

With checksum verification:

python -m driftbench.cli bootstrap dataset \
  --source /path/to/my_dataset.csv \
  --output-dir output/bootstrap/datasets \
  --checksum sha256:<hex> \
  --json
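
The value passed to --checksum can be computed with Python's standard hashlib module. This is a minimal sketch: the helper name sha256_checksum_arg and the sample file contents are made up for illustration, and the "sha256:<hex>" form is taken from the command above.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_checksum_arg(path, chunk_size=1 << 20):
    """Return 'sha256:<hex>' for a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


# Demo on a throwaway file; replace it with the path to your own dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"id,amount\n1,10\n")
    checksum_arg = sha256_checksum_arg(sample)

print(checksum_arg)
```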

MCP Quickstart

Start MCP server (stdio):

python3 -m driftbench_mcp.server

Client config template:

  • docs/mcp_config_example.json
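
The contents of docs/mcp_config_example.json are not reproduced here. As a rough sketch, the snippet below prints a client entry in the "mcpServers" layout that several MCP clients accept; that layout and the server name "driftbench" are assumptions, so check the template file and your client's documentation before relying on it.

```python
import json

# Assumed layout: the "mcpServers" map used by several MCP clients.
# The key "driftbench" is a placeholder name chosen for this sketch.
config = {
    "mcpServers": {
        "driftbench": {
            # Same stdio launch command as shown above.
            "command": "python3",
            "args": ["-m", "driftbench_mcp.server"],
        }
    }
}

print(json.dumps(config, indent=2))
```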

Minimal MCP guide:

  • docs/p0_mcp_server_minimal.md

Core MCP workflow:

  1. trace_to_spec
  2. validate_spec
  3. run_spec
  4. list_outputs

Spec sharing tools:

  • save_spec
  • list_public_specs
  • import_spec_and_run

MCP Chat Demo (Codex / Claude Code)

After MCP is configured, the best pattern is to give your assistant a case type plus what change you want to simulate.

Case A: Data Drift (data changes)

Use when you care about data size/distribution changes (scaling, skew, outliers, updates).

[Prompt: Data Drift]
Read docs/p0_integration_quickstart.md.
I want a DATA drift case on <my dataset path>.
Goal: <e.g., scale 2x + stronger skew on column amount>.
Please use MCP tools to:
1) build a DriftSpec (or trace_to_spec if needed),
2) validate it,
3) run it,
4) list outputs.
Then summarize what data files were generated and what changed.

Case B: Workload Drift (query changes)

Use when you care about query behavior changes (predicate distribution, selectivity, structure, payload).

[Prompt: Workload Drift]
I want a WORKLOAD drift case.
Query goal: <e.g., predicates shift from uniform to city-focused, selectivity from 10% to 60%>.
Please create/run a spec via MCP and report:
- generated workload files,
- how query distribution/selectivity changed,
- suggested next workload variant.
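
To make "selectivity from 10% to 60%" concrete: selectivity is the fraction of rows a predicate matches. The sketch below picks a threshold for a predicate of the form amount <= t that reaches a target selectivity on a sample of values. It illustrates the concept only and is not how DriftBench generates workloads; the column name amount and both helper names are hypothetical.

```python
import math


def threshold_for_selectivity(values, target):
    """Smallest value t such that at least `target` of the rows satisfy value <= t."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    ordered = sorted(values)
    # Round before ceil so float noise (e.g. 7.000000000000001) does not add a row.
    k = math.ceil(round(target * len(ordered), 9))
    return ordered[k - 1]


def selectivity(values, threshold):
    """Fraction of rows matched by the predicate value <= threshold."""
    return sum(v <= threshold for v in values) / len(values)


amounts = list(range(1, 101))  # 100 synthetic rows with amounts 1..100
before = threshold_for_selectivity(amounts, 0.10)
after = threshold_for_selectivity(amounts, 0.60)

print(f"WHERE amount <= {before}: selectivity {selectivity(amounts, before):.2f}")
print(f"WHERE amount <= {after}: selectivity {selectivity(amounts, after):.2f}")
```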

Temporal Overlay (applied on top of Case A or B)

Temporal drift is usually an overlay, not a standalone base case. Use it to add time evolution (uniform / periodic / trend / long-tail) on top of data drift or workload drift.

[Prompt: Temporal Overlay]
Take my <DATA or WORKLOAD> drift case and add TEMPORAL pattern <uniform|periodic|trend|long_tail>.
Please run the MCP workflow and summarize:
1) generated spec path,
2) output artifacts,
3) expected temporal behavior in plain language,
4) how temporal behavior changes the base (data/workload) case.
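
One way to picture the four pattern names is as a distribution of drift intensity over time steps. The sketch below is one plausible reading, not DriftBench's definition: the function temporal_weights, its period and alpha parameters, and the specific shapes are all assumptions made for illustration.

```python
import math


def temporal_weights(pattern, steps, period=4, alpha=1.5):
    """Return per-step intensities summing to 1 for an illustrative pattern."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    if pattern == "uniform":
        raw = [1.0] * steps
    elif pattern == "periodic":
        # Sine wave shifted up so every weight stays positive.
        raw = [1.5 + math.sin(2 * math.pi * t / period) for t in range(steps)]
    elif pattern == "trend":
        # Intensity grows linearly with the step index.
        raw = [float(t + 1) for t in range(steps)]
    elif pattern == "long_tail":
        # Power-law decay: most intensity early, with a slowly fading tail.
        raw = [(t + 1) ** -alpha for t in range(steps)]
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    total = sum(raw)
    return [value / total for value in raw]


for name in ("uniform", "periodic", "trend", "long_tail"):
    print(name, [round(w, 3) for w in temporal_weights(name, steps=8)])
```

Multiplying a base drift magnitude by these weights gives a per-step schedule, which is the sense in which a temporal pattern overlays a data or workload case.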

What users should expect

  1. The assistant executes MCP tools in order (trace_to_spec/build_spec -> validate_spec -> run_spec -> list_outputs).
  2. You get concrete artifact paths (generated YAML + output files).
  3. You get a short interpretation of what changed for your selected case (data/query), plus temporal overlay effects when requested.
  4. You usually get one or two suggested next iterations for deeper benchmarking.

Python API (Stable Entry Points)

Use top-level APIs instead of internal modules:

from driftbench import run_spec, trace_to_spec, get_schema_extractor

run_spec("driftspec/examples/demo_data_single.yaml")
trace_to_spec("driftspec/trace_inputs/trace_data_mock.csv", "driftspec/generated/from_trace.yaml")

Benchmark Objects (driftbench.data.xxx)

Use benchmark-specific objects to generate artifacts into a user-chosen directory. Seven benchmark adapters are available; see docs/benchmark_reference.md for full data and query details.

Adapter   | Type                    | Tables                     | Queries
----------+-------------------------+----------------------------+---------------------
tpch      | OLAP                    | 8                          | 22 templates
tpcds     | OLAP / Decision support | 4 (synth) / 26 (full)      | Templates
tpcc      | OLTP                    | 9                          | 5 transaction types
tpcc_skew | OLTP + access skew      | 9 + weight manifest        | 5 transaction types
job       | OLAP / join-order       | 8 (synth) / 21 (full IMDB) | 20 representative
ycsb      | Key-value               | 1                          | 6 workload mixes
dsb       | Decision support        | Configurable               | Templates

1) Choose an output directory

output_dir is required. DriftBench will write files only under this directory.

2) Generate data and queries

from pathlib import Path
from driftbench.data.tpch import data as tpch_data, queries as tpch_queries
from driftbench.data.tpcds import data as tpcds_data, queries as tpcds_queries
from driftbench.data.tpcc import data as tpcc_data, queries as tpcc_queries
from driftbench.data.tpcc_skew import data as tpcc_skew_data, queries as tpcc_skew_queries
from driftbench.data.job import data as job_data, queries as job_queries
from driftbench.data.ycsb import data as ycsb_data, queries as ycsb_queries
from driftbench.data.dsb import data as dsb_data, queries as dsb_queries

out = Path("./artifacts")

# TPC-H (OLAP)
tpch_data(scale_factor=1).generate(output_dir=out)
tpch_queries(query_ids=[1, 3, 5], queries_per_template=2, mode="qgen").generate(output_dir=out)

# TPC-C (OLTP, scale_factor == number of warehouses)
tpcc_data(scale_factor=4).generate(output_dir=out)
tpcc_queries().generate(output_dir=out)

# TPC-C Skew (OLTP with Zipf hot-warehouse drift)
tpcc_skew_data(scale_factor=10, hot_warehouse_fraction=0.2, skew_factor=0.99).generate(output_dir=out)
tpcc_skew_queries(scale_factor=10, hot_warehouse_fraction=0.2).generate(output_dir=out)

# JOB — Join Order Benchmark (IMDB, join-order sensitivity)
job_data(scale_factor=1).generate(output_dir=out)
job_queries().generate(output_dir=out)

# YCSB (key-value workloads A–F)
ycsb_data(scale_factor=1).generate(output_dir=out)
ycsb_queries(workload="B").generate(output_dir=out)

# TPC-DS and DSB
tpcds_data(scale_factor=10).generate(output_dir=out)
tpcds_queries().generate(output_dir=out)
dsb_data(scale_factor=10).generate(output_dir=out)
dsb_queries().generate(output_dir=out)

3) Find generated files

Artifacts are written to:

<output_dir>/
  tpch/data/    tpch/queries/
  tpcds/data/   tpcds/queries/
  tpcc/data/    tpcc/queries/
  tpcc_skew/data/  tpcc_skew/queries/
  job/data/     job/queries/
  ycsb/data/    ycsb/queries/
  dsb/data/     dsb/queries/

Each generation creates a manifest (*_manifest.json) in its folder.
Use the manifest files field to see exactly which files were generated.
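
A short script can collect the manifests and list what each one reports. This sketch assumes the manifest is a JSON object whose top-level "files" key holds a list; the manifest file name and entries in the demo are stand-ins, not real DriftBench output.

```python
import json
import tempfile
from pathlib import Path


def list_manifest_files(output_dir):
    """Map each *_manifest.json under output_dir to the entries of its "files" field."""
    found = {}
    for manifest in sorted(Path(output_dir).rglob("*_manifest.json")):
        payload = json.loads(manifest.read_text(encoding="utf-8"))
        # Assumption: "files" is a top-level list; use an empty list if absent.
        key = manifest.relative_to(output_dir).as_posix()
        found[key] = list(payload.get("files", []))
    return found


# Demo against a stand-in manifest written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "tpch" / "data"
    folder.mkdir(parents=True)
    (folder / "data_manifest.json").write_text(
        json.dumps({"files": ["lineitem.csv", "orders.csv"]}), encoding="utf-8"
    )
    listing = list_manifest_files(tmp)

print(listing)
```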

4) Programmatic path retrieval

generate() returns a GenerationResult with:

  • result.files: generated file paths
  • result.metadata: manifest path

This is the recommended way to chain into downstream benchmarking scripts.
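
As a sketch of such chaining, the helper below groups the paths in result.files by file extension so a downstream script can hand data files to a loader and query files to a runner. It assumes result.files is an iterable of path-like values; the SimpleNamespace object is a stand-in for a real GenerationResult, and its paths are invented for the demo.

```python
from collections import defaultdict
from pathlib import Path
from types import SimpleNamespace


def group_by_suffix(result):
    """Group the paths in result.files by file extension (hypothetical helper)."""
    grouped = defaultdict(list)
    for path in result.files:
        grouped[Path(path).suffix].append(Path(path))
    return dict(grouped)


# Stand-in for the object returned by generate(); only the two documented
# attributes (files, metadata) are modelled, with made-up paths.
fake_result = SimpleNamespace(
    files=[
        "artifacts/tpch/queries/q1.sql",
        "artifacts/tpch/queries/q3.sql",
        "artifacts/tpch/data/orders.csv",
    ],
    metadata="artifacts/tpch/queries/queries_manifest.json",
)

grouped = group_by_suffix(fake_result)
for suffix, paths in sorted(grouped.items()):
    print(suffix, [p.name for p in paths])
```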


Where to find examples

  • Example specs: driftspec/examples/
  • Trace inputs: driftspec/trace_inputs/
  • Integration tests with runnable fixtures: test/fixtures/specs/

Core docs

  • API boundary: docs/p0_api_boundary_freeze.md
  • CLI/MCP command matrix: docs/p0_mcp_command_matrix.md
  • Integration quickstart: docs/p0_integration_quickstart.md
  • MCP examples script: docs/p0_mcp_examples.sh
  • Release branch/tag policy: docs/release_branch_policy.md

Testing

Run all tests:

python3 -m unittest discover -s test -p 'test_*.py' -v

License

MIT (see LICENSE).
