GhostDQ SDK: compute data-quality metrics locally and ship them to GhostDQ.

ghostdq — Python SDK

The GhostDQ SDK computes data-quality metrics locally and ships only aggregated numbers to the GhostDQ Ingest API. Your raw data never leaves your infrastructure.

The main integration point is compute_metrics(df, rules): pass a pandas DataFrame, or use one of the file-based backends (streaming CSV, Arrow, Polars, DuckDB) when reading from disk. File I/O is an optional convenience.

Install

pip install ghostdq

Optional extras:

pip install "ghostdq[polars]"   # Polars lazy-scan backend
pip install "ghostdq[duckdb]"   # DuckDB SQL backend
pip install "ghostdq[fast]"     # both Polars and DuckDB
pip install "ghostdq[dev]"      # pytest, ruff, mypy, stubs

Core dependencies: pandas, pyarrow, fastavro, pyyaml. The HTTP client uses stdlib urllib only.

Quick start

from ghostdq import (
    GhostDQClient,
    compute_metrics,
    parse_contract,
    read_file,
)

# 1. Parse the contract (or fetch it from the API; see "Exporting metrics")
with open("sales_contract.yaml", encoding="utf-8") as f:
    contract = parse_contract(f.read())

# 2. Load data (optional if you already have a DataFrame)
df = read_file("sales_2024.parquet", columns=contract.required_columns())

# 3. Compute metrics locally
metrics = compute_metrics(df, contract.rules)
# → {"row_count": 120000, "null_rate:country": 0.02, ...}

# 4. Ship metrics to GhostDQ
client = GhostDQClient(api_key="ghd_your_key")
result = client.create_run(dataset_id="<dataset-uuid>", metrics=metrics)
print(result.run_id, result.status)  # ⇒ <uuid>  pending

If your pipeline already produces a pandas DataFrame (or a result you can convert to one, for example from Spark, SQL, or Polars), skip read_file and call compute_metrics directly.

Package layout

ghostdq/
├── contract/       Contract, RuleSpec, ContractParser, parse_contract
├── reading/        PandasFileReader, read_file (CSV / Parquet / Avro)
├── metrics/        MetricsEngine and performance backends
├── evaluation/     RuleEvaluator — local pass/fail without network
├── export/         GhostDQClient — POST metrics to the Ingest API
└── cli/            ghostdq run

Backward-compatible shims remain at ghostdq.client, ghostdq.io_pandas, and ghostdq.evaluate.

Main classes

Class	Purpose
`ContractParser`	Parse YAML contracts
`PandasFileReader`	Read files into a DataFrame
`MetricsEngine`	Compute metrics from a pandas DataFrame
`ArrowMetricsEngine`	Compute metrics from a PyArrow table (no pandas)
`StreamingCsvMetricsEngine`	Chunked CSV scan (constant memory)
`PolarsMetricsEngine`	Lazy Polars scans (optional extra)
`DuckDBMetricsEngine`	SQL over files (optional extra)
`RuleEvaluator`	Evaluate rules locally
`GhostDQClient`	Submit metrics to the Ingest API

Functional shortcuts (compute_metrics, parse_contract, read_file, …) delegate to default instances of the classes above.

Computing metrics

Pandas DataFrame (default)

The primary API. Only columns referenced by the contract are scanned; extra columns in a wide DataFrame are ignored.

from ghostdq import compute_metrics

# Optional: pre-narrow before compute
cols = contract.required_columns()
metrics = compute_metrics(df[cols], contract.rules)

# Or pass the full frame — compute_metrics narrows internally
metrics = compute_metrics(df, contract.rules)

MetricsEngine batches work per column: min and max share a single to_numeric pass, and duplicate count and duplicate rate share a single duplicate scan when both are needed.
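
The idea of one scan feeding two duplicate metrics can be pictured with a small standard-library sketch. This is not ghostdq's implementation: the helper name duplicate_stats is hypothetical, and the definition of a duplicate (every occurrence of a value after its first) is an assumption made for illustration.

```python
from collections import Counter


def duplicate_stats(values):
    """Return (duplicate_count, duplicate_rate) from one pass over values.

    Assumption: a duplicate is any occurrence of a value after its first.
    """
    counts = Counter(values)  # the single scan of the column
    total = sum(counts.values())
    duplicate_count = total - len(counts)
    duplicate_rate = duplicate_count / total if total else 0.0
    return duplicate_count, duplicate_rate


order_ids = ["ORD-1", "ORD-2", "ORD-2", "ORD-3", "ORD-3", "ORD-3"]
count, rate = duplicate_stats(order_ids)
print(count, rate)
```

Both numbers come from the same Counter, so the column is walked once no matter which of the two metrics a contract asks for.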

From a file (auto backend)

compute_metrics_file picks a backend based on format when engine="auto":

Format	Auto backend	Why
`.csv`	`streaming`	Chunked read, constant memory
`.parquet`	`arrow`	Native PyArrow, no pandas conversion
other	`pandas`	Avro and fallback

from ghostdq import compute_metrics_file

metrics = compute_metrics_file("huge.csv", contract.rules)
metrics = compute_metrics_file("wide.parquet", contract.rules, engine="arrow")
metrics = compute_metrics_file("data.csv", contract.rules, engine="pandas")
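
The format-to-backend mapping in the table can be sketched as a plain lookup. The helper pick_backend below is hypothetical and only mirrors the documented table; matching the extension case-insensitively is an assumption, and the real selection logic inside compute_metrics_file may differ.

```python
from pathlib import Path

# Extension -> backend, as listed in the table; anything else uses pandas.
AUTO_BACKENDS = {".csv": "streaming", ".parquet": "arrow"}


def pick_backend(path, engine="auto"):
    """Hypothetical helper: resolve the backend name for a data file."""
    if engine != "auto":
        return engine  # an explicit engine is used as given
    return AUTO_BACKENDS.get(Path(path).suffix.lower(), "pandas")


print(pick_backend("huge.csv"))
print(pick_backend("wide.parquet"))
print(pick_backend("events.avro"))
print(pick_backend("data.csv", engine="pandas"))
```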

Streaming CSV

For very large CSV files, compute metrics chunk by chunk without loading the whole file into memory:

from ghostdq import compute_csv_streaming

metrics = compute_csv_streaming(
    "huge.csv",
    contract.rules,
    chunksize=50_000,
    columns=contract.required_columns(),
)
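
Why chunked reading keeps memory flat is easiest to see in a toy version. The sketch below uses only the standard library and is not how compute_csv_streaming is implemented; the function name streaming_null_rate is hypothetical, and treating an empty string as null is an assumption.

```python
import csv
import io


def streaming_null_rate(lines, column, chunksize=2):
    """Illustrative only: accumulate counts one chunk at a time.

    Assumption: an empty string counts as null.
    """
    reader = csv.DictReader(lines)
    rows = nulls = 0
    chunk = []
    for record in reader:
        chunk.append(record)
        if len(chunk) == chunksize:
            rows += len(chunk)
            nulls += sum(1 for r in chunk if r[column] == "")
            chunk.clear()  # only one chunk is held in memory at a time
    # Flush the final, possibly shorter, chunk.
    rows += len(chunk)
    nulls += sum(1 for r in chunk if r[column] == "")
    rate = nulls / rows if rows else 0.0
    return {"row_count": rows, f"null_rate:{column}": rate}


sample = io.StringIO("order_id,country\n1,ES\n2,\n3,US\n4,\n5,MX\n")
result = streaming_null_rate(sample, "country")
print(result)
```

Only running totals survive between chunks, which is why metrics such as row counts and null rates suit this style of scan.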

Arrow (Parquet)

Skip pandas when the source is already Arrow or Parquet:

import pyarrow.parquet as pq
from ghostdq import ArrowMetricsEngine, compute_arrow_metrics

table = pq.read_table("data.parquet", columns=contract.required_columns())
metrics = compute_arrow_metrics(table, contract.rules)

# Or read + compute in one step:
metrics = ArrowMetricsEngine().compute_parquet("data.parquet", contract.rules)

Polars (optional)

pip install "ghostdq[polars]"

import polars as pl
from ghostdq.metrics import PolarsMetricsEngine

engine = PolarsMetricsEngine()
metrics = engine.compute_parquet("data.parquet", contract.rules)
metrics = engine.compute(pl.scan_csv("data.csv"), contract.rules)

DuckDB (optional)

pip install "ghostdq[duckdb]"

import duckdb
from ghostdq.metrics import DuckDBMetricsEngine

conn = duckdb.connect()
metrics = DuckDBMetricsEngine().compute_path(conn, "data.parquet", contract.rules)

Contracts

Contracts are YAML files that define dataset rules. Parse them with parse_contract or ContractParser:

from ghostdq import ContractParser, parse_contract, required_columns

contract = parse_contract(yaml_text)
contract = ContractParser().parse(yaml_text)

contract.required_columns()   # columns referenced by rules
contract.all_metric_keys()    # metric keys the server expects
required_columns(contract.rules)

Example contract:

dataset: sales
version: 1
rules:
  - row_count: {min: 1, max: 1000000}
  - null_rate: {column: country, max: 0.05}
  - unique: {column: order_id}
  - duplicate_rate: {column: order_id, max: 0.01}
  - value_range: {column: amount, min: 0, max: 10000}
  - allowed_values: {column: country, values: [ES, US, MX]}
  - out_of_range_rate: {column: amount, min: 0, max: 10000, max_rate: 0.001}
  - regex_match: {column: order_id, pattern: '^ORD-[0-9]+$', min_rate: 1.0}

See also the ready-to-run samples in examples/.
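
To show what "columns referenced by rules" means for a contract like the one above, here is a sketch over plain Python data. It does not use ghostdq: the rules are written as dictionaries by hand, columns_referenced is a hypothetical stand-in for required_columns, and returning columns in first-seen order is an assumption.

```python
# Part of the example contract, written as plain Python data.
rules = [
    {"row_count": {"min": 1, "max": 1000000}},
    {"null_rate": {"column": "country", "max": 0.05}},
    {"unique": {"column": "order_id"}},
    {"value_range": {"column": "amount", "min": 0, "max": 10000}},
    {"allowed_values": {"column": "country", "values": ["ES", "US", "MX"]}},
]


def columns_referenced(rules):
    """Hypothetical helper: unique columns named by the rules, in order."""
    seen = []
    for rule in rules:
        for params in rule.values():
            column = params.get("column")
            if column is not None and column not in seen:
                seen.append(column)
    return seen


print(columns_referenced(rules))
```

Rules without a column, such as row_count, contribute nothing, so a wide file only needs the three listed columns to be read.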

Supported rule types

Rule	Metric key(s)
`row_count`	`row_count`
`null_rate`	`null_rate:{column}`
`unique`	`duplicate_count:{column}`
`duplicate_rate`	`duplicate_rate:{column}`
`value_range`	`value_min:{column}`, `value_max:{column}`
`allowed_values`	`disallowed_count:{column}`
`out_of_range_rate`	`out_of_range_rate:{column}`
`regex_match`	`regex_match_rate:{column}`
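
The key templates in the table can be expanded mechanically. The sketch below copies the table into a dictionary; the helper metric_keys_for is hypothetical and is not part of the ghostdq API.

```python
# Rule type -> metric key templates, copied from the table.
METRIC_KEYS = {
    "row_count": ["row_count"],
    "null_rate": ["null_rate:{column}"],
    "unique": ["duplicate_count:{column}"],
    "duplicate_rate": ["duplicate_rate:{column}"],
    "value_range": ["value_min:{column}", "value_max:{column}"],
    "allowed_values": ["disallowed_count:{column}"],
    "out_of_range_rate": ["out_of_range_rate:{column}"],
    "regex_match": ["regex_match_rate:{column}"],
}


def metric_keys_for(rule_type, column=None):
    """Hypothetical helper: expand the key templates for one rule."""
    return [t.format(column=column) for t in METRIC_KEYS[rule_type]]


print(metric_keys_for("row_count"))
print(metric_keys_for("value_range", "amount"))
print(metric_keys_for("unique", "order_id"))
```

Note that unique reports a duplicate_count key and value_range reports two keys, so the number of metrics can differ from the number of rules.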

Rule examples

out_of_range_rate: row-level bounds (like Great Expectations expect_column_values_to_be_between). Fails when the share of rows that are null, non-numeric, below min, or above max exceeds max_rate:

- out_of_range_rate: {column: amount, min: 0, max: 10000, max_rate: 0}

value_range: dataset-level bounds. Checks only that the column’s observed min and max fall within the limits, so an unusual row does not fail the rule as long as the aggregate min and max stay in range:

- value_range: {column: amount, min: 0, max: 10000}
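
To make the difference between out_of_range_rate and value_range concrete, here is a small standard-library sketch. It is not ghostdq's code: the helper names are hypothetical, and it assumes null and non-numeric values are skipped when computing min/max but counted as out of range for the rate.

```python
def to_number(value):
    """Return value as a float, or None if it is null or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def range_metrics(values, lo, hi):
    """Compute aggregate min/max and the row-level out-of-range rate."""
    numbers = [to_number(v) for v in values]
    numeric = [n for n in numbers if n is not None]
    # Rows that are null, non-numeric, below lo, or above hi.
    bad = sum(1 for n in numbers if n is None or n < lo or n > hi)
    return {
        "value_min": min(numeric) if numeric else None,
        "value_max": max(numeric) if numeric else None,
        "out_of_range_rate": bad / len(values) if values else 0.0,
    }


amounts = [5, 50, None, "abc", 9000]
metrics = range_metrics(amounts, lo=0, hi=10000)
print(metrics)
```

With this sample the observed min and max (5.0 and 9000.0) sit inside 0 to 10000, so a value_range check would pass under these assumptions, while 2 of 5 rows are null or non-numeric, giving an out_of_range_rate of 0.4 that fails max_rate: 0.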

regex_match: whole-string regex match (like expect_column_values_to_match_regex). Nulls count as mismatches. Single-quoted YAML strings keep backslashes literally, so a doubled backslash reaches the regex engine as two characters; prefer character classes such as [0-9] and [.] over \d and \. to avoid escaping surprises:

- regex_match: {column: order_id, pattern: '^ORD-[0-9]+$', min_rate: 1.0}
- regex_match: {column: email, pattern: '^[^@]+@[^@]+[.][^@]+$', min_rate: 0.99}
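
How a whole-string match rate with nulls as mismatches works out can be sketched with the standard re module. The function name regex_match_rate is hypothetical, and returning 0.0 for an empty column is an assumption; the library's own engine may handle edge cases differently.

```python
import re


def regex_match_rate(values, pattern):
    """Share of values that fully match pattern; None (null) never matches."""
    compiled = re.compile(pattern)
    matches = sum(
        1 for v in values if v is not None and compiled.fullmatch(str(v))
    )
    return matches / len(values) if values else 0.0


order_ids = ["ORD-1", "ORD-22", "ord-3", None]
rate = regex_match_rate(order_ids, r"^ORD-[0-9]+$")
print(rate)
```

Here two of four values match, so the rate is 0.5 and a rule with min_rate: 1.0 would fail.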

Local evaluation

Check pass/fail without calling the API:

from ghostdq import RuleEvaluator, evaluate_rules

results = evaluate_rules(contract.rules, metrics)
for r in results:
    print("✓" if r.passed else "✗", r.rule_type, r.value_display)

evaluator = RuleEvaluator()
results = evaluator.evaluate(contract.rules, metrics)
print(evaluator.format_line(results[0]))
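
Conceptually, local evaluation compares each metric with the thresholds in its rule. The sketch below covers two rule types only and is not the RuleEvaluator implementation: check_rule is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```python
def check_rule(rule_type, params, metrics):
    """Hypothetical evaluator for two rule types; returns True on pass."""
    if rule_type == "row_count":
        value = metrics["row_count"]
        low = params.get("min", 0)
        high = params.get("max", float("inf"))
        return low <= value <= high
    if rule_type == "null_rate":
        value = metrics[f"null_rate:{params['column']}"]
        return value <= params["max"]
    raise ValueError(f"unsupported rule type: {rule_type}")


metrics = {"row_count": 120000, "null_rate:country": 0.02}
print(check_rule("row_count", {"min": 1, "max": 1000000}, metrics))
print(check_rule("null_rate", {"column": "country", "max": 0.05}, metrics))
print(check_rule("null_rate", {"column": "country", "max": 0.01}, metrics))
```

Because the inputs are just the metrics dictionary and the rule parameters, the check needs neither the original data nor a network call.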

Reading files

File reading is optional — use it when you don't already have a DataFrame.

from ghostdq import PandasFileReader, read_file

df = read_file("data.parquet", columns=contract.required_columns())
df = PandasFileReader().read_csv("data.csv", columns=["id", "amount"])

Format	Extension	Reader
CSV	`.csv`	pandas
Parquet	`.parquet`	pyarrow (column pruning supported)
Avro	`.avro`	fastavro

CLI

# Local validation (no API key)
ghostdq run --contract contract.yaml --file sales.csv

# Remote run: fetch contract + submit metrics
ghostdq run --dataset-id <uuid> --file sales.parquet --api-key ghd_xxx

# Pick a metrics backend
ghostdq run --contract contract.yaml --file huge.csv --engine streaming --chunk-size 50000
ghostdq run --contract contract.yaml --file data.parquet --engine arrow

Flag	Description
`--contract`	Local contract YAML (required for offline runs)
`--dataset-id`	Dataset UUID (enables remote contract fetch + submit)
`--file`	Data file (`.csv`, `.parquet`, `.avro`)
`--api-key`	API key (`GHOSTDQ_API_KEY`)
`--ingest-url`	Ingest API base URL (`GHOSTDQ_INGEST_URL`, default `https://ghostdq.com/ingest`)
`--engine`	`auto`, `pandas`, `arrow`, `streaming`, `polars`, `duckdb`
`--chunk-size`	CSV chunk size for streaming engine (default `100000`)

Environment shortcuts:

export GHOSTDQ_API_KEY=ghd_xxx
export GHOSTDQ_DATASET_ID=<uuid>
ghostdq run --file sales.csv --contract contract.yaml
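
A common way to wire flags to environment variables is to use the variable as the flag's default. The argparse sketch below illustrates that pattern for the two variables named in the flag table; it is not the actual ghostdq CLI code, build_parser is hypothetical, and letting an explicit flag override the environment is an assumption.

```python
import argparse
import os


def build_parser(env=os.environ):
    """Hypothetical parser: flags default to the documented env variables."""
    parser = argparse.ArgumentParser(prog="ghostdq-sketch")
    parser.add_argument("--api-key", default=env.get("GHOSTDQ_API_KEY"))
    parser.add_argument(
        "--ingest-url",
        default=env.get("GHOSTDQ_INGEST_URL", "https://ghostdq.com/ingest"),
    )
    return parser


env = {"GHOSTDQ_API_KEY": "ghd_from_env"}
from_env = build_parser(env).parse_args([])
print(from_env.api_key, from_env.ingest_url)

from_flag = build_parser(env).parse_args(["--api-key", "ghd_from_flag"])
print(from_flag.api_key)
```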

Exporting metrics

from ghostdq import GhostDQClient

client = GhostDQClient(api_key="ghd_xxx", ingest_url="https://ghostdq.com/ingest")

# By dataset UUID (dashboard)
result = client.create_run(dataset_id="<uuid>", metrics=metrics)

# By dataset name (contract YAML)
result = client.create_run(dataset="sales", metrics=metrics)

# Fetch contract from the API
yaml_text = client.get_contract_yaml("<uuid>")

Choosing a backend

Situation	Recommendation
You already have a pandas DataFrame	`compute_metrics(df, rules)`
Large CSV on disk	`compute_csv_streaming` or `compute_metrics_file(..., engine="streaming")`
Large Parquet on disk	`ArrowMetricsEngine` or `compute_metrics_file(..., engine="arrow")`
Polars pipeline	`PolarsMetricsEngine`
SQL / analytics stack with DuckDB	`DuckDBMetricsEngine`
Avro files	`read_file` + `compute_metrics` (pandas path)
CLI one-shot	`ghostdq run` with `--engine auto` (default)

All backends produce the same metric key format expected by the GhostDQ Ingest API.

Local development

Requires Python 3.10+. From the repo root:

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,fast]"
pytest tests
ruff check src tests
mypy src tests --ignore-missing-imports

Test layout mirrors the package:

tests/
├── contract/
├── reading/
├── metrics/
├── evaluation/
├── export/
└── cli/

License & disclaimer

Licensed under the Apache License 2.0.

This software is provided “as is”, without warranty of any kind. You are responsible for evaluating whether it fits your use case and for any outcomes from using it. See the LICENSE file for the full terms, including limitations of liability.

Release history

Version	Released
0.2.0b2 (pre-release, this version)	Jul 3, 2026
0.2.0b1 (pre-release)	Jun 24, 2026
0.1.4	Jun 24, 2026
0.1.4b1 (pre-release)	Jun 24, 2026
0.1.3	Jun 24, 2026
0.1.2	Jun 24, 2026
0.1.1	Jun 24, 2026

Download files

Source Distribution

ghostdq-0.2.0b2.tar.gz (36.3 kB)

Uploaded Jul 3, 2026 Source

Built Distribution

ghostdq-0.2.0b2-py3-none-any.whl (39.4 kB)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file ghostdq-0.2.0b2.tar.gz.

File metadata

Download URL: ghostdq-0.2.0b2.tar.gz
Upload date: Jul 3, 2026
Size: 36.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ghostdq-0.2.0b2.tar.gz
Algorithm	Hash digest
SHA256	`87338c8f62451471fd97eaff0d14ec76c23d03484a9fe6c6132517c0076b2984`
MD5	`157c76e2b0aaa2bf02df9bd4cf32e155`
BLAKE2b-256	`e6e9be858ce70dfc57ccb74e621a372153ca57fa7c4fb1cf4ac31c04d273c38e`

File details

Details for the file ghostdq-0.2.0b2-py3-none-any.whl.

File metadata

Download URL: ghostdq-0.2.0b2-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 39.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ghostdq-0.2.0b2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa2b08b647aaa6237116829886a3b1f1142cae72befb98060ba00a997b498533`
MD5	`b92b3ee8ef48b0897ebc6d6cd67c121d`
BLAKE2b-256	`76f6b81f1e3a29ed6e5ae08f8c3297f8d49c12438ddc9c150c45365c46231424`
