Skip to main content

GhostDQ SDK: compute data-quality metrics locally and ship them to GhostDQ.

Project description

ghostdq — Python SDK

PyPI License: Apache-2.0

The GhostDQ SDK computes data-quality metrics locally and ships only aggregated numbers to the GhostDQ Ingest API. Your raw data never leaves your infrastructure.

The main integration point is compute_metrics(df, rules) — pass a pandas DataFrame (or use a high-performance backend when reading from disk). File I/O is optional convenience.


Install

pip install ghostdq

Optional extras:

pip install "ghostdq[polars]"   # Polars lazy-scan backend
pip install "ghostdq[duckdb]"   # DuckDB SQL backend
pip install "ghostdq[fast]"     # both Polars and DuckDB
pip install "ghostdq[dev]"      # pytest, ruff, mypy, stubs

Core dependencies: pandas, pyarrow, fastavro, pyyaml. The HTTP client uses stdlib urllib only.


Quick start

from ghostdq import (
    GhostDQClient,
    compute_metrics,
    parse_contract,
    read_file,
)

# 1. Parse the contract (or fetch from the API — see below)
contract = parse_contract(open("sales_contract.yaml").read())

# 2. Load data (optional if you already have a DataFrame)
df = read_file("sales_2024.parquet", columns=contract.required_columns())

# 3. Compute metrics locally
metrics = compute_metrics(df, contract.rules)
# → {"row_count": 120000, "null_rate:country": 0.02, ...}

# 4. Ship metrics to GhostDQ
client = GhostDQClient(api_key="ghd_your_key")
result = client.create_run(dataset_id="<dataset-uuid>", metrics=metrics)
print(result.run_id, result.status)  # ⇒ <uuid>  pending

If your pipeline already produces a DataFrame (Spark, SQL, Polars, etc.), skip read_file and call compute_metrics directly.


Package layout

ghostdq/
├── contract/       Contract, RuleSpec, ContractParser, parse_contract
├── reading/        PandasFileReader, read_file (CSV / Parquet / Avro)
├── metrics/        MetricsEngine and performance backends
├── evaluation/     RuleEvaluator — local pass/fail without network
├── export/         GhostDQClient — POST metrics to the Ingest API
└── cli/            ghostdq run

Backward-compatible shims remain at ghostdq.client, ghostdq.io_pandas, and ghostdq.evaluate.

Main classes

Class Purpose
ContractParser Parse YAML contracts
PandasFileReader Read files into a DataFrame
MetricsEngine Compute metrics from a pandas DataFrame
ArrowMetricsEngine Compute metrics from a PyArrow table (no pandas)
StreamingCsvMetricsEngine Chunked CSV scan (constant memory)
PolarsMetricsEngine Lazy Polars scans (optional extra)
DuckDBMetricsEngine SQL over files (optional extra)
RuleEvaluator Evaluate rules locally
GhostDQClient Submit metrics to the Ingest API

Functional shortcuts (compute_metrics, parse_contract, read_file, …) delegate to default instances of the classes above.


Computing metrics

Pandas DataFrame (default)

The primary API. Only columns referenced by the contract are scanned; extra columns in a wide DataFrame are ignored.

from ghostdq import compute_metrics, required_columns

# Optional: pre-narrow before compute
cols = contract.required_columns()
metrics = compute_metrics(df[cols], contract.rules)

# Or pass the full frame — compute_metrics narrows internally
metrics = compute_metrics(df, contract.rules)

MetricsEngine batches work per column (single to_numeric pass for min/max, single duplicate scan when both count and rate are needed).

From a file (auto backend)

compute_metrics_file picks a backend based on format when engine="auto":

Format Auto backend Why
.csv streaming Chunked read, constant memory
.parquet arrow Native PyArrow, no pandas conversion
other pandas Avro and fallback
from ghostdq import compute_metrics_file

metrics = compute_metrics_file("huge.csv", contract.rules)
metrics = compute_metrics_file("wide.parquet", contract.rules, engine="arrow")
metrics = compute_metrics_file("data.csv", contract.rules, engine="pandas")

Streaming CSV

For very large CSV files without loading into memory:

from ghostdq import compute_csv_streaming

metrics = compute_csv_streaming(
    "huge.csv",
    contract.rules,
    chunksize=50_000,
    columns=contract.required_columns(),
)

Arrow (Parquet)

Skip pandas when the source is already Arrow or Parquet:

import pyarrow.parquet as pq
from ghostdq import ArrowMetricsEngine, compute_arrow_metrics

table = pq.read_table("data.parquet", columns=contract.required_columns())
metrics = compute_arrow_metrics(table, contract.rules)

# Or read + compute in one step:
metrics = ArrowMetricsEngine().compute_parquet("data.parquet", contract.rules)

Polars (optional)

pip install "ghostdq[polars]"
import polars as pl
from ghostdq.metrics import PolarsMetricsEngine

engine = PolarsMetricsEngine()
metrics = engine.compute_parquet("data.parquet", contract.rules)
metrics = engine.compute(pl.scan_csv("data.csv"), contract.rules)

DuckDB (optional)

pip install "ghostdq[duckdb]"
import duckdb
from ghostdq.metrics import DuckDBMetricsEngine

conn = duckdb.connect()
metrics = DuckDBMetricsEngine().compute_path(conn, "data.parquet", contract.rules)

Contracts

Contracts are YAML files that define dataset rules. Parse them with parse_contract or ContractParser:

from ghostdq import ContractParser, parse_contract, required_columns

contract = parse_contract(yaml_text)
contract = ContractParser().parse(yaml_text)

contract.required_columns()   # columns referenced by rules
contract.all_metric_keys()    # metric keys the server expects
required_columns(contract.rules)

Example contract:

dataset: sales
version: 1
rules:
  - row_count: {min: 1, max: 1000000}
  - null_rate: {column: country, max: 0.05}
  - unique: {column: order_id}
  - duplicate_rate: {column: order_id, max: 0.01}
  - value_range: {column: amount, min: 0, max: 10000}
  - allowed_values: {column: country, values: [ES, US, MX]}

Supported rule types

Rule Metric key(s)
row_count row_count
null_rate null_rate:{column}
unique duplicate_count:{column}
duplicate_rate duplicate_rate:{column}
value_range value_min:{column}, value_max:{column}
allowed_values disallowed_count:{column}

Local evaluation

Check pass/fail without calling the API:

from ghostdq import RuleEvaluator, evaluate_rules

results = evaluate_rules(contract.rules, metrics)
for r in results:
    print("✓" if r.passed else "✗", r.rule_type, r.value_display)

evaluator = RuleEvaluator()
results = evaluator.evaluate(contract.rules, metrics)
print(evaluator.format_line(results[0]))

Reading files

File reading is optional — use it when you don't already have a DataFrame.

from ghostdq import PandasFileReader, read_file

df = read_file("data.parquet", columns=contract.required_columns())
df = PandasFileReader().read_csv("data.csv", columns=["id", "amount"])
Format Extension Reader
CSV .csv pandas
Parquet .parquet pyarrow (column pruning supported)
Avro .avro fastavro

CLI

# Local validation (no API key)
ghostdq run --contract contract.yaml --file sales.csv

# Remote run: fetch contract + submit metrics
ghostdq run --dataset-id <uuid> --file sales.parquet --api-key ghd_xxx

# Pick a metrics backend
ghostdq run --contract contract.yaml --file huge.csv --engine streaming --chunk-size 50000
ghostdq run --contract contract.yaml --file data.parquet --engine arrow
Flag Description
--contract Local contract YAML (required for offline runs)
--dataset-id Dataset UUID (enables remote contract fetch + submit)
--file Data file (.csv, .parquet, .avro)
--api-key API key (GHOSTDQ_API_KEY)
--ingest-url Ingest API base URL (GHOSTDQ_INGEST_URL, default https://ghostdq.com/ingest)
--engine auto, pandas, arrow, streaming, polars, duckdb
--chunk-size CSV chunk size for streaming engine (default 100000)

Environment shortcuts:

export GHOSTDQ_API_KEY=ghd_xxx
export GHOSTDQ_DATASET_ID=<uuid>
ghostdq run --file sales.csv --contract contract.yaml

Exporting metrics

from ghostdq import GhostDQClient

client = GhostDQClient(api_key="ghd_xxx", ingest_url="https://ghostdq.com/ingest")

# By dataset UUID (dashboard)
result = client.create_run(dataset_id="<uuid>", metrics=metrics)

# By dataset name (contract YAML)
result = client.create_run(dataset="sales", metrics=metrics)

# Fetch contract from the API
yaml_text = client.get_contract_yaml("<uuid>")

Choosing a backend

Situation Recommendation
You already have a pandas DataFrame compute_metrics(df, rules)
Large CSV on disk compute_csv_streaming or compute_metrics_file(..., engine="streaming")
Large Parquet on disk ArrowMetricsEngine or compute_metrics_file(..., engine="arrow")
Polars pipeline PolarsMetricsEngine
SQL / analytics stack with DuckDB DuckDBMetricsEngine
Avro files read_file + compute_metrics (pandas path)
CLI one-shot ghostdq run with --engine auto (default)

All backends produce the same metric key format expected by the GhostDQ Ingest API.


Local development

Requires Python 3.10+. From the repo root:

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,fast]"
pytest tests
ruff check src tests
mypy src tests --ignore-missing-imports

Test layout mirrors the package:

tests/
├── contract/
├── reading/
├── metrics/
├── evaluation/
├── export/
└── cli/

License & disclaimer

Licensed under Apache License 2.0.

This software is provided “as is”, without warranty of any kind. You are responsible for evaluating whether it fits your use case and for any outcomes from using it. See the LICENSE for the full terms, including limitations of liability.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ghostdq-0.2.0b1.tar.gz (30.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ghostdq-0.2.0b1-py3-none-any.whl (33.0 kB view details)

Uploaded Python 3

File details

Details for the file ghostdq-0.2.0b1.tar.gz.

File metadata

  • Download URL: ghostdq-0.2.0b1.tar.gz
  • Upload date:
  • Size: 30.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ghostdq-0.2.0b1.tar.gz
Algorithm Hash digest
SHA256 8fa3356fd39052cf2580a02eb5adbe710480e1f475e20329e796f6f8c30fa7a8
MD5 52c822e4e6f0e223dd2d7395c40e435e
BLAKE2b-256 42b5c526368a1c9dc554b20b83148c0c75ab8b3404193ef4d36ae1ede6c5d615

See more details on using hashes here.

File details

Details for the file ghostdq-0.2.0b1-py3-none-any.whl.

File metadata

  • Download URL: ghostdq-0.2.0b1-py3-none-any.whl
  • Upload date:
  • Size: 33.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ghostdq-0.2.0b1-py3-none-any.whl
Algorithm Hash digest
SHA256 0cff412e03053251072797bc56b15c2439cb12e4b739e8973da948032a50e438
MD5 44658ec4ca0b61c3e3544690278b0424
BLAKE2b-256 9756abe9b01818ea9240c4070003a7a335fb456b89a08f11f0c5673c4957763f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page