GhostDQ SDK: compute data-quality metrics locally and ship them to GhostDQ.

ghostdq — Python SDK

The GhostDQ SDK lets you compute data-quality metrics locally and ship only the aggregated numbers to the GhostDQ cloud — your raw data never leaves your infrastructure.


Install

pip install ghostdq

Avro support requires fastavro and Parquet support requires pyarrow; both are included in the core install. The optional dev extra adds development tooling:

pip install "ghostdq[dev]"   # adds pytest, ruff, mypy, stubs

Quick start

from ghostdq import read_file, parse_contract, compute_metrics, GhostDQClient

# 1. Load your data
df = read_file("sales_2024.parquet")   # .csv / .parquet / .avro

# 2. Parse the contract (or fetch it from the API — see below)
with open("sales_contract.yaml") as f:   # close the file once it has been read
    contract = parse_contract(f.read())

# 3. Compute metrics *locally* — no raw data leaves your machine
metrics = compute_metrics(df, contract.rules)
# → {"row_count": 120000, "null_rate:country": 0.02, ...}

# 4. Ship the metrics to GhostDQ
client = GhostDQClient(api_key="ghd_your_key")
result = client.create_run(dataset_id="<dataset-uuid>", metrics=metrics)
print(result.run_id, result.status)  # ⇒ <uuid>  pending

CLI

# Validate a file against a local contract
ghostdq run \
  --dataset-id <uuid> \
  --file sales.csv \
  --contract contract.yaml \
  --api-key ghd_xxx

# Fetch the contract automatically from the API
ghostdq run \
  --dataset-id <uuid> \
  --file sales.parquet \
  --api-key ghd_xxx

Environment variable shortcuts:

export GHOSTDQ_API_KEY=ghd_xxx
ghostdq run --dataset-id <uuid> --file sales.csv

The Ingest API defaults to https://ghostdq.com/ingest. Override with --ingest-url or GHOSTDQ_INGEST_URL (e.g. http://localhost:8000 for local dev).
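
The override order described above can be sketched in Python as follows. This is a minimal illustration, not the SDK's code: the helper name `resolve_ingest_url` is hypothetical, and the assumption that an explicit `--ingest-url` value takes priority over `GHOSTDQ_INGEST_URL` follows common CLI convention rather than anything this README states. Only the default URL and the variable name come from the text.

```python
import os

# Default endpoint as stated in this README.
DEFAULT_INGEST_URL = "https://ghostdq.com/ingest"


def resolve_ingest_url(flag_value=None, env=None):
    """Hypothetical helper: choose the ingest URL from flag, env var, or default."""
    env = os.environ if env is None else env
    if flag_value:
        # Assumed precedence: an explicit --ingest-url value wins.
        return flag_value
    # Fall back to the environment variable, then to the documented default.
    return env.get("GHOSTDQ_INGEST_URL") or DEFAULT_INGEST_URL
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.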


Supported file formats

| Format  | Extension | Engine   |
|---------|-----------|----------|
| CSV     | .csv      | pandas   |
| Parquet | .parquet  | pyarrow  |
| Avro    | .avro     | fastavro |
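
One way to picture the table above is as a lookup keyed on the file extension. The sketch below is illustrative only: `engine_for` is a hypothetical helper, and whether `read_file` matches extensions case-insensitively or raises `ValueError` for unknown formats is an assumption, not documented behaviour.

```python
from pathlib import Path

# Extension-to-engine mapping copied from the table above.
ENGINES = {".csv": "pandas", ".parquet": "pyarrow", ".avro": "fastavro"}


def engine_for(path):
    """Hypothetical helper: return the engine name used for a given file path."""
    suffix = Path(path).suffix.lower()  # assumed: matching ignores case
    try:
        return ENGINES[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None
```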

Supported rule types

| Rule           | Metric key(s)                          |
|----------------|----------------------------------------|
| row_count      | row_count                              |
| null_rate      | null_rate:{column}                     |
| unique         | duplicate_count:{column}               |
| value_range    | value_min:{column}, value_max:{column} |
| allowed_values | disallowed_count:{column}              |
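
To make the metric keys concrete, here is a rough pure-Python sketch of how such aggregates could be derived from rows. It is not the SDK's `compute_metrics`: the function name `sketch_metrics`, the rule dictionaries (`type`, `column`, `allowed`), and the exact metric definitions (for example, duplicates counted as non-null values minus distinct values) are all assumptions made for illustration. Only the metric key names come from the table above.

```python
def sketch_metrics(rows, rules):
    """Illustrative only: compute aggregate metrics from a list of dict rows.

    The rule shape, e.g. {"type": "null_rate", "column": "country"},
    is an assumption and not the GhostDQ contract schema.
    """
    metrics = {}
    for rule in rules:
        kind, col = rule["type"], rule.get("column")
        values = [r.get(col) for r in rows] if col else []
        present = [v for v in values if v is not None]  # non-null values only
        if kind == "row_count":
            metrics["row_count"] = len(rows)
        elif kind == "null_rate":
            nulls = len(values) - len(present)
            metrics[f"null_rate:{col}"] = nulls / len(values) if values else 0.0
        elif kind == "unique":
            # Assumed definition: non-null values beyond the first occurrence.
            metrics[f"duplicate_count:{col}"] = len(present) - len(set(present))
        elif kind == "value_range":
            if present:
                metrics[f"value_min:{col}"] = min(present)
                metrics[f"value_max:{col}"] = max(present)
        elif kind == "allowed_values":
            allowed = set(rule["allowed"])
            metrics[f"disallowed_count:{col}"] = sum(v not in allowed for v in present)
        else:
            raise ValueError(f"unknown rule type: {kind!r}")
    return metrics
```

The result contains only aggregated numbers keyed by metric name, which is the kind of payload the quick start ships to GhostDQ; no row-level values appear in it.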

Local development

Requires Python 3.10+ (3.13 recommended). From the repo root:

python3.13 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests
ruff check src tests
mypy src tests --ignore-missing-imports

License & disclaimer

Licensed under Apache License 2.0.

This software is provided “as is”, without warranty of any kind. You are responsible for evaluating whether it fits your use case and for any outcomes from using it. See the LICENSE for the full terms, including limitations of liability.
