Drift-aware database benchmarking — generate, share, and replay data and workload drift via DriftSpec.
Project description
DriftBench
DriftBench is a benchmarking toolkit that quantifies how data drift and workload drift influence database behavior. It powers the experiments in the DriftBench paper with reproducible drift generation, query synthesis, and downstream evaluation utilities.
Highlights
- Unified handling of data and workload drift with shared abstractions.
- Declarative experiment definitions through DriftSpec YAML files.
- Support for CSV, Parquet, and PostgreSQL sources plus downstream workloads.
- End-to-end assets (schemas, templates, workloads, plots) for paper reproduction.
Web Frontend
The deployable frontend (Home, Get Started, Drift Lab, Generator playbook, Case Studies) lives in a sibling repo: driftbench-web. Clone that repo to build / run the UI.
DriftSpec at a Glance
DriftSpec is the YAML contract that tells DriftBench which drift scenario to build. A single file captures the drift family, data source, variables, and optional workload generation hooks, making experiments portable and versionable.
Key Elements
type: declares whether the pattern targets data or workloads (single-table or multi-table).data_source: describes how to access the base data and optionally extract a schema.variables: parameterizes the drift operators, output paths, and workload knobs.- Optional sections cover temporal stamps, query generation, and downstream processors.
Minimal Example
pattern_id: census-cardinality-demo
seed: 42
type:
family: data
category: drift
subtype: single_table
data_source:
kind: csv
path: ./data/census_original.csv
variables:
base_table: census_original
drifts:
- name: vary_cardinality_scale_1
drift_type: vary_cardinality
scale: 1.0
output_path: ./output/data/cardinality/scale/census_original_cardinality_1.csv
Run any specification with:
python -m driftbench.cli run-yaml <path-to-yaml>
Additional operational CLI commands:
# Validate spec structure and handler availability
python -m driftbench.cli validate-spec driftspec/examples/demo_data_single.yaml --json
# Preview what a spec would execute (without running handlers)
python -m driftbench.cli dry-run driftspec/examples/demo_data_single.yaml --json
# Generate a spec from trace summary
python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_data_mock.csv driftspec/generated/trace_data_mock.yaml
# List generated outputs for inspection/automation
python -m driftbench.cli list-outputs --root output --glob "**/*.csv" --limit 20 --json
Python Integration API (P0)
For downstream project integration, prefer the stable top-level API:
from driftbench import run_spec, trace_to_spec, get_schema_extractor
run_spec("driftspec/examples/demo_data_single.yaml")
trace_to_spec("driftspec/trace_inputs/trace_data_mock.csv", "driftspec/generated/from_trace.yaml")
Public API details and boundary rules:
docs/p0_api_boundary_freeze.mddocs/p0_mcp_command_matrix.md
MCP runnable example script:
docs/p0_mcp_examples.sh
Minimal MCP server runtime (stdio):
driftbench_mcp/server.pyscripts/run_driftbench_mcp.shdocs/p0_mcp_server_minimal.mddocs/mcp_config_example.json
Spec sharing MCP tools are included:
save_speclist_public_specsimport_spec_and_run
Testing (P0 Foundation)
Run the full test suite:
python3 -m unittest discover -s test -p 'test_*.py' -v
Run focused P0 suites:
python3 -m unittest -v \
test.test_cli_commands \
test.test_spec_core_unit \
test.test_spec_execution_integration \
test.test_smoke_pipeline
Clean-environment bootstrap and verification:
./scripts/bootstrap_p0_env.sh
./scripts/verify_p0_clean_env.sh
Custom Deletion Filters (registry + DriftSpec)
DriftSpec cannot serialize Python callables, so use the filter registry to reference a filter by name.
- Register a filter in code:
# my_project/filters.py
from driftbench.core.data.filter_registry import register_filter
@register_filter("age_gt_60")
def age_gt_60(series, config):
return series > 60
- Import the module and reference it in YAML:
filter_registry_modules:
- my_project.filters
variables:
base_table: census_original
drifts:
- name: delete_age_gt_60
drift_type: selective_deletion
n: 5000
filter:
column: age
func_name: age_gt_60
output_path: ./output/data/cardinality/update/census_original_deletion_age_gt_60.csv
You can also use simple declarative filters without registration:
filter:
column: timestamp
min: "2025-07-02T00:00:00"
max: "2025-07-03T00:00:00"
Trace to DriftSpec (mock flow)
If you already parsed a real database trace into a compact CSV/JSON summary, you can generate a DriftSpec YAML directly:
python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_data_mock.csv driftspec/generated/trace_data_mock.yaml
python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_workload_mock.json driftspec/generated/trace_workload_mock.yaml
The mock inputs live in driftspec/trace_inputs/ and show the minimal fields the generator expects.
Explore complete templates in driftspec/examples/, including:
- Data drift patterns:
demo_data_single.yaml - PostgreSQL single-table:
demo_postgres.yaml - PostgreSQL multi-table:
demo_postgres_multi.yaml - Workload drift:
workload_census.yaml
Census Temporal Demos (Data)
These specs create timestamped census data and time growth scenarios under output/data/time_demo/.
demo_data_census_timestamp.yaml: add atimestampcolumn with uniform arrivals.demo_data_census_time_growth.yaml: combine a base day with an age-skewed day to show a distribution shift.demo_data_census_time_growth_3x.yaml: 3-day time growth (uniform + periodic + bursty).demo_data_census_time_growth_4x.yaml: 4-day time growth (adds a long-tail day).
Run them with:
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_timestamp.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth_3x.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth_4x.yaml
Behind the scenes, the runner loads type handlers registered in driftbench/spec/types/ and executes them through driftbench/spec/core.py.
Paper Artifacts
Everything required to reproduce the results in the DriftBench paper lives in this repository.
- Input datasets
data/census_original.csv: baseline census table.data/census_outliers.csv: injected outliers for case studies.data/PG_info.json: PostgreSQL connection metadata.
- Schemas and templates
output/intermediate/census_original_schema.json: extracted single-table schema.output/intermediate/census_original_templates.json: workload templates for the census data.output/intermediate/tpcds_schema.json: inferred multi-table schema from TPC-DS.
- Data drift outputs
output/data/cardinality/scale/: scaled datasets (e.g., ×0.1, ×1).output/data/cardinality/update/: selective deletion scenarios.output/data/distributional/column/: column distribution shifts.output/data/distributional/outlier/: rare value injections.
- Workload drift outputs
output/workload/parametric/distribution/: predicate distribution changes.output/workload/parametric/selectivity/: workloads with varying selectivity.output/workload/tpcds_sqls_default.csv: multi-table workload derived from TPC-DS.
Case Study Gallery
All visuals were generated from notebooks in driftbench/notebooks/ using the assets listed above.
-
Varying Cardinality
-
Selective Deletions
-
Column Distribution Shifts
-
Outlier Injection
-
Workload Drift
-
Q-Error Benchmarks
-
Join-Aware Drift Templates
Click to expand a generated multi-table template
{ "template_id": "T000", "cardinality": 1441548, "tables": { "base": "public.catalog_sales", "joins": [ { "type": "FULL JOIN", "table": "public.store_sales", "condition": "public.catalog_sales.cs_net_profit = public.store_sales.ss_net_profit" } ] }, "predicate": [ { "column": "public.catalog_sales.cs_warehouse_sk", "operator": "<=", "type": "numeric", "value": "", "range": { "min": 1, "max": 5 }, "selectivity": 0.1 } ], "payload": { "columns": [ "public.catalog_sales.cs_order_number" ], "aggregation": null, "order_by": "public.catalog_sales.cs_order_number", "limit": 100 } }
Installation
DriftBench requires Python 3.10 / 3.11 / 3.12 (3.13 not yet supported).
# from PyPI (after a release tag is published)
pip install driftbench-db
# from source (current state of main)
pip install git+https://github.com/Liuguanli/DriftBench.git
# editable / development
git clone https://github.com/Liuguanli/DriftBench.git
cd DriftBench
pip install -e .
The PyPI distribution is named
driftbench-dbto disambiguate from existing PyPI projects. The Python import name is stilldriftbench— same package, different convention, likescikit-learn/sklearn.
A single pip install brings in:
- The
driftbenchengine and its public Python API. - The
driftbenchCLI (entry point:driftbench). - The
driftbench-serviceHTTP server (entry point:driftbench-service). - The
driftbench-mcpMCP server (entry point:driftbench-mcp).
Optional: a PostgreSQL instance if you plan to run the postgres-backed examples (psycopg2-binary is already pulled in).
Quickstart Workflow
After install, the bundled CLI exposes the full P0 workflow:
# 1) Validate a spec without executing it
driftbench validate-spec driftspec/examples/demo_data_single.yaml --json
# 2) Preview the planned stages
driftbench dry-run driftspec/examples/demo_data_single.yaml --json
# 3) Execute the spec end-to-end
driftbench run-yaml driftspec/examples/demo_data_single.yaml
# 4) Convert a real trace into a runnable spec
driftbench trace-to-spec \
driftspec/trace_inputs/trace_data_mock.csv \
driftspec/generated/from_trace.yaml \
--trace-type data
# 5) List generated outputs for inspection / automation
driftbench list-outputs --root output --glob "**/*.csv" --limit 20 --json
Bring the HTTP service up with driftbench-service --port 8000 and the MCP server with driftbench-mcp (or via scripts/run_driftbench_mcp.sh for Cursor / Claude Code MCP clients — see docs/mcp_config_example.json).
Substitute any of the YAML files in driftspec/examples/ to explore alternative drift scenarios.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file driftbench_db-0.1.0b2.tar.gz.
File metadata
- Download URL: driftbench_db-0.1.0b2.tar.gz
- Upload date:
- Size: 143.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0cdda26c1e74e5cf3b3ed6ecf07f309c3dd2b0a7d4329a1c1f270da419d3863d
|
|
| MD5 |
9c41afca397a4cbb78851a8aa19c9b94
|
|
| BLAKE2b-256 |
87e86572337a6d1bd7073bec54e45b1e16c2e31eb466148fe4eedab5e45290e0
|
Provenance
The following attestation bundles were made for driftbench_db-0.1.0b2.tar.gz:
Publisher:
publish.yml on Liuguanli/DriftBench
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
driftbench_db-0.1.0b2.tar.gz -
Subject digest:
0cdda26c1e74e5cf3b3ed6ecf07f309c3dd2b0a7d4329a1c1f270da419d3863d - Sigstore transparency entry: 1485370098
- Sigstore integration time:
-
Permalink:
Liuguanli/DriftBench@d5e8e25832ca73dcf317563b15dbb91d743d13d9 -
Branch / Tag:
refs/tags/v0.1.0b2 - Owner: https://github.com/Liuguanli
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@d5e8e25832ca73dcf317563b15dbb91d743d13d9 -
Trigger Event:
push
-
Statement type:
File details
Details for the file driftbench_db-0.1.0b2-py3-none-any.whl.
File metadata
- Download URL: driftbench_db-0.1.0b2-py3-none-any.whl
- Upload date:
- Size: 143.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5b84804bd6fb155b1a53b1a1ce275f6f813f622898299183da3dc795851a3f54
|
|
| MD5 |
dcfca3a81f2636cfa0f9d023c5b69763
|
|
| BLAKE2b-256 |
2fb9463324f78c5c723494af3fe5d4790f215dce9ffff49409f760104344170c
|
Provenance
The following attestation bundles were made for driftbench_db-0.1.0b2-py3-none-any.whl:
Publisher:
publish.yml on Liuguanli/DriftBench
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
driftbench_db-0.1.0b2-py3-none-any.whl -
Subject digest:
5b84804bd6fb155b1a53b1a1ce275f6f813f622898299183da3dc795851a3f54 - Sigstore transparency entry: 1485370111
- Sigstore integration time:
-
Permalink:
Liuguanli/DriftBench@d5e8e25832ca73dcf317563b15dbb91d743d13d9 -
Branch / Tag:
refs/tags/v0.1.0b2 - Owner: https://github.com/Liuguanli
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@d5e8e25832ca73dcf317563b15dbb91d743d13d9 -
Trigger Event:
push
-
Statement type: