Drift-aware database benchmarking — generate, share, and replay data and workload drift via DriftSpec.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

guanli

These details have not been verified by PyPI

Project description

DriftBench logo

DriftBench

DriftBench is a benchmarking toolkit that quantifies how data drift and workload drift influence database behavior. It powers the experiments in the DriftBench paper with reproducible drift generation, query synthesis, and downstream evaluation utilities.

Highlights

Unified handling of data and workload drift with shared abstractions.
Declarative experiment definitions through DriftSpec YAML files.
Support for CSV, Parquet, and PostgreSQL sources plus downstream workloads.
End-to-end assets (schemas, templates, workloads, plots) for paper reproduction.

Web Frontend

The deployable frontend (Home, Get Started, Drift Lab, Generator playbook, Case Studies) lives in a sibling repo: driftbench-web. Clone that repo to build / run the UI.

DriftSpec at a Glance

DriftSpec is the YAML contract that tells DriftBench which drift scenario to build. A single file captures the drift family, data source, variables, and optional workload generation hooks, making experiments portable and versionable.

Key Elements

type: declares whether the pattern targets data or workloads (single-table or multi-table).
data_source: describes how to access the base data and optionally extract a schema.
variables: parameterizes the drift operators, output paths, and workload knobs.
Optional sections cover temporal stamps, query generation, and downstream processors.

Minimal Example

pattern_id: census-cardinality-demo
seed: 42

type:
  family: data
  category: drift
  subtype: single_table

data_source:
  kind: csv
  path: ./data/census_original.csv

variables:
  base_table: census_original
  drifts:
    - name: vary_cardinality_scale_1
      drift_type: vary_cardinality
      scale: 1.0
      output_path: ./output/data/cardinality/scale/census_original_cardinality_1.csv

Run any specification with:

python -m driftbench.cli run-yaml <path-to-yaml>

Additional operational CLI commands:

# Validate spec structure and handler availability
python -m driftbench.cli validate-spec driftspec/examples/demo_data_single.yaml --json

# Preview what a spec would execute (without running handlers)
python -m driftbench.cli dry-run driftspec/examples/demo_data_single.yaml --json

# Generate a spec from trace summary
python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_data_mock.csv driftspec/generated/trace_data_mock.yaml

# List generated outputs for inspection/automation
python -m driftbench.cli list-outputs --root output --glob "**/*.csv" --limit 20 --json

Python Integration API (P0)

For downstream project integration, prefer the stable top-level API:

from driftbench import run_spec, trace_to_spec, get_schema_extractor

run_spec("driftspec/examples/demo_data_single.yaml")
trace_to_spec("driftspec/trace_inputs/trace_data_mock.csv", "driftspec/generated/from_trace.yaml")

Public API details and boundary rules:

docs/p0_api_boundary_freeze.md
docs/p0_mcp_command_matrix.md
docs/release_branch_policy.md (release branch + tag policy for PyPI)

MCP runnable example script:

docs/p0_mcp_examples.sh

Minimal MCP server runtime (stdio):

driftbench_mcp/server.py
scripts/run_driftbench_mcp.sh
docs/p0_mcp_server_minimal.md
docs/mcp_config_example.json

Spec sharing MCP tools are included:

save_spec
list_public_specs
import_spec_and_run

Testing (P0 Foundation)

Run the full test suite:

python3 -m unittest discover -s test -p 'test_*.py' -v

Run focused P0 suites:

python3 -m unittest -v \
  test.test_cli_commands \
  test.test_spec_core_unit \
  test.test_spec_execution_integration \
  test.test_smoke_pipeline

Clean-environment bootstrap and verification:

./scripts/bootstrap_p0_env.sh
./scripts/verify_p0_clean_env.sh

Custom Deletion Filters (registry + DriftSpec)

DriftSpec cannot serialize Python callables, so use the filter registry to reference a filter by name.

# my_project/filters.py
from driftbench.core.data.filter_registry import register_filter

@register_filter("age_gt_60")
def age_gt_60(series, config):
    return series > 60

Import the module and reference it in YAML:

filter_registry_modules:
  - my_project.filters

variables:
  base_table: census_original
  drifts:
    - name: delete_age_gt_60
      drift_type: selective_deletion
      n: 5000
      filter:
        column: age
        func_name: age_gt_60
      output_path: ./output/data/cardinality/update/census_original_deletion_age_gt_60.csv

You can also use simple declarative filters without registration:

filter:
  column: timestamp
  min: "2025-07-02T00:00:00"
  max: "2025-07-03T00:00:00"

Trace to DriftSpec (mock flow)

If you already parsed a real database trace into a compact CSV/JSON summary, you can generate a DriftSpec YAML directly:

python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_data_mock.csv driftspec/generated/trace_data_mock.yaml
python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_workload_mock.json driftspec/generated/trace_workload_mock.yaml

The mock inputs live in driftspec/trace_inputs/ and show the minimal fields the generator expects.

Explore complete templates in driftspec/examples/, including:

Data drift patterns: demo_data_single.yaml
PostgreSQL single-table: demo_postgres.yaml
PostgreSQL multi-table: demo_postgres_multi.yaml
Workload drift: workload_census.yaml

Census Temporal Demos (Data)

These specs create timestamped census data and time growth scenarios under output/data/time_demo/.

demo_data_census_timestamp.yaml: add a timestamp column with uniform arrivals.
demo_data_census_time_growth.yaml: combine a base day with an age-skewed day to show a distribution shift.
demo_data_census_time_growth_3x.yaml: 3-day time growth (uniform + periodic + bursty).
demo_data_census_time_growth_4x.yaml: 4-day time growth (adds a long-tail day).

Run them with:

python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_timestamp.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth_3x.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth_4x.yaml

Behind the scenes, the runner loads type handlers registered in driftbench/spec/types/ and executes them through driftbench/spec/core.py.

Paper Artifacts

Everything required to reproduce the results in the DriftBench paper lives in this repository.

Input datasets
- data/census_original.csv: baseline census table.
- data/census_outliers.csv: injected outliers for case studies.
- data/PG_info.json: PostgreSQL connection metadata.
Schemas and templates
- output/intermediate/census_original_schema.json: extracted single-table schema.
- output/intermediate/census_original_templates.json: workload templates for the census data.
- output/intermediate/tpcds_schema.json: inferred multi-table schema from TPC-DS.
Data drift outputs
- output/data/cardinality/scale/: scaled datasets (e.g., ×0.1, ×1).
- output/data/cardinality/update/: selective deletion scenarios.
- output/data/distributional/column/: column distribution shifts.
- output/data/distributional/outlier/: rare value injections.
Workload drift outputs
- output/workload/parametric/distribution/: predicate distribution changes.
- output/workload/parametric/selectivity/: workloads with varying selectivity.
- output/workload/tpcds_sqls_default.csv: multi-table workload derived from TPC-DS.

Case Study Gallery

All visuals were generated from notebooks in driftbench/notebooks/ using the assets listed above.

Varying Cardinality
Selective Deletions
Column Distribution Shifts
Outlier Injection
Workload Drift
Q-Error Benchmarks

Join-Aware Drift Templates

Click to expand a generated multi-table template

{
  "template_id": "T000",
  "cardinality": 1441548,
  "tables": {
    "base": "public.catalog_sales",
    "joins": [
      {
        "type": "FULL JOIN",
        "table": "public.store_sales",
        "condition": "public.catalog_sales.cs_net_profit = public.store_sales.ss_net_profit"
      }
    ]
  },
  "predicate": [
    {
      "column": "public.catalog_sales.cs_warehouse_sk",
      "operator": "<=",
      "type": "numeric",
      "value": "",
      "range": {
        "min": 1,
        "max": 5
      },
      "selectivity": 0.1
    }
  ],
  "payload": {
    "columns": [
      "public.catalog_sales.cs_order_number"
    ],
    "aggregation": null,
    "order_by": "public.catalog_sales.cs_order_number",
    "limit": 100
  }
}

Installation

DriftBench requires Python 3.10 / 3.11 / 3.12 (3.13 not yet supported).

# from PyPI (after a release tag is published)
pip install driftbench-db

# from source (current state of main)
pip install git+https://github.com/Liuguanli/DriftBench.git

# editable / development
git clone https://github.com/Liuguanli/DriftBench.git
cd DriftBench
pip install -e .

The PyPI distribution is named driftbench-db to disambiguate from existing PyPI projects. The Python import name is still driftbench — same package, different convention, like scikit-learn / sklearn.

A single pip install brings in:

The driftbench engine and its public Python API.
The driftbench CLI (entry point: driftbench).
The driftbench-service HTTP server (entry point: driftbench-service).
The driftbench-mcp MCP server (entry point: driftbench-mcp).

Optional: a PostgreSQL instance if you plan to run the postgres-backed examples (psycopg2-binary is already pulled in).

Quickstart Workflow

After install, the bundled CLI exposes the full P0 workflow:

# 1) Validate a spec without executing it
driftbench validate-spec driftspec/examples/demo_data_single.yaml --json

# 2) Preview the planned stages
driftbench dry-run driftspec/examples/demo_data_single.yaml --json

# 3) Execute the spec end-to-end
driftbench run-yaml driftspec/examples/demo_data_single.yaml

# 4) Convert a real trace into a runnable spec
driftbench trace-to-spec \
  driftspec/trace_inputs/trace_data_mock.csv \
  driftspec/generated/from_trace.yaml \
  --trace-type data

# 5) List generated outputs for inspection / automation
driftbench list-outputs --root output --glob "**/*.csv" --limit 20 --json

Bring the HTTP service up with driftbench-service --port 8000 and the MCP server with driftbench-mcp. See docs/mcp_config_example.json for an MCP client configuration template.

Substitute any of the YAML files in driftspec/examples/ to explore alternative drift scenarios.

Tutorials

Three end-to-end walkthroughs you can copy-paste verbatim. Each assumes pip install driftbench-db is already done (Python 3.10 / 3.11 / 3.12).

Tutorial 1 — Drift your own CSV in 60 seconds (no clone, no agent)

The wheel does not bundle the driftspec/examples/ YAML fixtures, so the self-contained way is to write a tiny spec inline against a CSV you already have.

mkdir -p drift_demo && cd drift_demo

# 1) The data we want to drift
cat > sales.csv <<'CSV'
sale_id,store_id,amount,quantity,sale_date
1,3,49.99,2,2024-01-15
2,1,12.50,1,2024-01-16
3,2,89.00,3,2024-01-17
4,1,25.75,1,2024-01-18
5,3,150.00,5,2024-01-19
CSV

# 2) A minimal DriftSpec — two stages: scale 2x + skew the amount column
cat > sales_drift.yaml <<'YAML'
pattern_id: tutorial-sales-drift
seed: 42
type:
  family: data
  category: drift
  subtype: single_table
data_source:
  kind: csv
  path: ./sales.csv
  schema_extractor:
    source_type: csv
    sample_size: 5
    schema_output_path: ./sales_schema.json
variables:
  base_table: sales
  drifts:
    - name: scale_2x
      drift_type: vary_cardinality
      output_path: ./out/sales_2x.csv
      scale: 2
    - name: skew_amount
      drift_type: value_skew
      output_path: ./out/sales_skewed.csv
      columns: ["amount"]
      portion: 1.0
      skewness: 2
YAML

# 3) Validate, dry-run, then execute
driftbench validate-spec sales_drift.yaml --json
driftbench dry-run sales_drift.yaml --json
driftbench run-yaml sales_drift.yaml

# 4) Inspect the drifted artifacts
ls out/
wc -l sales.csv out/sales_2x.csv      # 5 -> 10 rows after 2x scaling

Tutorial 2 — Vibe coding with Cursor / Claude Code (MCP)

Skip writing YAML by hand: an MCP-aware agent can use extract_schema, build_spec, validate_spec, and run_spec to author and execute a spec from a natural-language prompt.

Configure the MCP client. In Cursor or Claude Code, add to your mcp.json:

{
  "mcpServers": {
    "driftbench": {
      "command": "driftbench-mcp"
    }
  }
}

If the client cannot find driftbench-mcp on its PATH (Cursor and Claude Desktop sometimes launch with a stripped environment), use the absolute path inside your venv. Get it with which driftbench-mcp:

{
  "mcpServers": {
    "driftbench": {
      "command": "/abs/path/to/your/venv/bin/driftbench-mcp"
    }
  }
}

Drive it in natural language. Example prompts that exercise the full chain:

Inspect ~/Documents/sales.csv with driftbench (extract_schema), then propose a single-table drift spec that scales cardinality 2x and skews the amount column. Build, validate, and run it.

Run extract_schema on data/census_original.csv, then build_spec a multi-stage spec covering cardinality + value-skew + outlier injection. Save under driftspec/generated/ and dry-run.

Convert ~/traces/redbench_slice.csv into a workload-drift spec via trace_to_spec, then run it and report row counts per stage.

The agent will call the MCP tools in sequence, write the YAML to disk, and report run summaries (paths, row counts) inline. Since 0.1.0b2 the MCP server accepts any absolute path, so your data does not need to live inside the DriftBench install directory.

The full tool inventory is: driftbench_health, trace_to_spec, validate_spec, dry_run_spec, run_spec, list_outputs, extract_schema, build_spec, save_spec, list_public_specs, import_spec_and_run.

Tutorial 3 — Run a curated example spec

The repository ships several mature DriftSpecs under driftspec/examples/. For pip-installed users who don't want to clone the whole repo, fetch them from the raw URL:

mkdir -p data
curl --create-dirs -o data/census_original.csv \
  https://raw.githubusercontent.com/Liuguanli/DriftBench/main/data/census_original.csv
curl --create-dirs -o data/census_outliers.csv \
  https://raw.githubusercontent.com/Liuguanli/DriftBench/main/data/census_outliers.csv
curl -O https://raw.githubusercontent.com/Liuguanli/DriftBench/main/driftspec/examples/demo_data_single.yaml

driftbench validate-spec demo_data_single.yaml --json
driftbench run-yaml demo_data_single.yaml
ls output/data/                 # cardinality / distributional outputs

Other curated specs you can substitute for demo_data_single.yaml:

Spec	What it demonstrates
`demo_data_census_time_patterns.yaml`	Uniform / periodic / trend / long-tail temporal arrival patterns
`workload_census.yaml`	Workload templates × time × selectivity (full SQL workload)
`demo_postgres.yaml`	Same workflow shape but from a live PostgreSQL source
`demo_template_mix_drift.yaml`	Distribution + structural + selectivity drift composed in one workload

Tutorial 4 — Embed the Python API in your own project

If you'd rather call DriftBench from your existing pipeline, the public API is intentionally small:

from driftbench import (
    load_and_validate_spec,
    run_spec,
    run_spec_and_return_summary,
    trace_to_spec,
    get_schema_extractor,
)

# Validate without executing
spec, type_info = load_and_validate_spec("sales_drift.yaml")
print(type_info)   # {'family': 'data', 'category': 'drift', 'subtype': 'single_table'}

# Execute end-to-end (returns a structured summary)
summary = run_spec_and_return_summary("sales_drift.yaml")

# Derive a spec from a trace summary
trace_to_spec(
    "/path/to/trace.csv",
    "driftspec/generated/from_trace.yaml",
    trace_type="data",
)

# Use a schema extractor directly
extractor = get_schema_extractor("csv", csv_path="sales.csv", sample_size=100)
schema = extractor.extract_schema()

Stick to imports from driftbench or driftbench.api. Internal modules under driftbench.core.* are not part of the supported surface and may move between releases.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

guanli

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.0b7.post1 pre-release

May 15, 2026

0.1.0b7 pre-release

May 15, 2026

0.1.0b6 pre-release

May 14, 2026

0.1.0b5 pre-release

May 11, 2026

0.1.0b4 pre-release

May 10, 2026

This version

0.1.0b3 pre-release

May 9, 2026

0.1.0b2 pre-release

May 9, 2026

0.1.0b1 pre-release

May 9, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

driftbench_db-0.1.0b3.tar.gz (145.4 kB view details)

Uploaded May 9, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

driftbench_db-0.1.0b3-py3-none-any.whl (145.5 kB view details)

Uploaded May 9, 2026 Python 3

File details

Details for the file driftbench_db-0.1.0b3.tar.gz.

File metadata

Download URL: driftbench_db-0.1.0b3.tar.gz
Upload date: May 9, 2026
Size: 145.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for driftbench_db-0.1.0b3.tar.gz
Algorithm	Hash digest
SHA256	`87932369130d531eca13c7c2e69503d3d5a672db565509ce39332e22a4349c98`
MD5	`8bef51a7aedfad40b7b153ecdc527021`
BLAKE2b-256	`6e1bb651590bb0bc295a6b10243a461abf7db1e66227595cdaacd874e8834f91`

See more details on using hashes here.

Provenance

The following attestation bundles were made for driftbench_db-0.1.0b3.tar.gz:

Publisher: publish.yml on Liuguanli/DriftBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: driftbench_db-0.1.0b3.tar.gz
- Subject digest: 87932369130d531eca13c7c2e69503d3d5a672db565509ce39332e22a4349c98
- Sigstore transparency entry: 1485437597
- Sigstore integration time: May 9, 2026
Source repository:
- Permalink: Liuguanli/DriftBench@84ec2ec5091d0dc55af5c93546d968ab054aa9bf
- Branch / Tag: refs/tags/v0.1.0b3
- Owner: https://github.com/Liuguanli
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@84ec2ec5091d0dc55af5c93546d968ab054aa9bf
- Trigger Event: push

File details

Details for the file driftbench_db-0.1.0b3-py3-none-any.whl.

File metadata

Download URL: driftbench_db-0.1.0b3-py3-none-any.whl
Upload date: May 9, 2026
Size: 145.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for driftbench_db-0.1.0b3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c805888329af16f35c8a023803b18bc9c7f0acb2ababb81fa6ad938965ba8ff9`
MD5	`b46af32b3ac4a5c8c87a6f92567f7708`
BLAKE2b-256	`99122f51fd1f5cbeb56d0755b498a9f3e9fc1d6ff32b9ce9a928f203af73e0e9`

See more details on using hashes here.

Provenance

The following attestation bundles were made for driftbench_db-0.1.0b3-py3-none-any.whl:

Publisher: publish.yml on Liuguanli/DriftBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: driftbench_db-0.1.0b3-py3-none-any.whl
- Subject digest: c805888329af16f35c8a023803b18bc9c7f0acb2ababb81fa6ad938965ba8ff9
- Sigstore transparency entry: 1485437673
- Sigstore integration time: May 9, 2026
Source repository:
- Permalink: Liuguanli/DriftBench@84ec2ec5091d0dc55af5c93546d968ab054aa9bf
- Branch / Tag: refs/tags/v0.1.0b3
- Owner: https://github.com/Liuguanli
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@84ec2ec5091d0dc55af5c93546d968ab054aa9bf
- Trigger Event: push

driftbench-db 0.1.0b3

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Meta

Unverified details

Meta

Classifiers

Project description

DriftBench

Highlights

Web Frontend

DriftSpec at a Glance

Key Elements

Minimal Example

Python Integration API (P0)

Testing (P0 Foundation)

Custom Deletion Filters (registry + DriftSpec)

Trace to DriftSpec (mock flow)

Census Temporal Demos (Data)

Paper Artifacts

Case Study Gallery

Installation

Quickstart Workflow

Tutorials

Tutorial 1 — Drift your own CSV in 60 seconds (no clone, no agent)

Tutorial 2 — Vibe coding with Cursor / Claude Code (MCP)

Tutorial 3 — Run a curated example spec

Tutorial 4 — Embed the Python API in your own project

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Meta

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance