ghostdq — Python SDK

PyPI License: Apache-2.0

The GhostDQ SDK lets you compute data-quality metrics locally and ship only the aggregated numbers to the GhostDQ cloud — your raw data never leaves your infrastructure.


Install

pip install ghostdq

Avro support relies on fastavro and Parquet support on pyarrow; both are included in the core install, so no extra is needed for either format. The optional dev extra adds the development tooling:

pip install "ghostdq[dev]"   # adds pytest, ruff, mypy, stubs

Quick start

from ghostdq import read_file, parse_contract, compute_metrics, GhostDQClient

# 1. Load your data
df = read_file("sales_2024.parquet")   # .csv / .parquet / .avro

# 2. Parse the contract (or fetch it from the API — see below)
with open("sales_contract.yaml") as f:   # context manager closes the file
    contract = parse_contract(f.read())

# 3. Compute metrics *locally* — no raw data leaves your machine
metrics = compute_metrics(df, contract.rules)
# → {"row_count": 120000, "null_rate:country": 0.02, ...}

# 4. Ship the metrics to GhostDQ
client = GhostDQClient(
    api_key="ghd_your_key",
    ingest_url="https://ingest.ghostdq.io",
)
result = client.create_run(dataset_id="<dataset-uuid>", metrics=metrics)
print(result.run_id, result.status)  # ⇒ <uuid>  pending

CLI

# Validate a file against a local contract
ghostdq run \
  --dataset-id <uuid> \
  --file sales.csv \
  --contract contract.yaml \
  --api-key ghd_xxx \
  --ingest-url https://ingest.ghostdq.io

# Omit --contract to fetch the contract automatically from the API
ghostdq run \
  --dataset-id <uuid> \
  --file sales.parquet \
  --api-key ghd_xxx \
  --ingest-url https://ingest.ghostdq.io

Environment variable shortcuts:

export GHOSTDQ_API_KEY=ghd_xxx
export GHOSTDQ_INGEST_URL=https://ingest.ghostdq.io
ghostdq run --dataset-id <uuid> --file sales.csv
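
The same two variables can also be read from Python before constructing the client. The sketch below is a minimal illustration, not part of the SDK: the helper name `client_kwargs_from_env` is hypothetical, and it assumes only the `api_key` and `ingest_url` constructor arguments shown in the Quick start. Whether `GhostDQClient` reads these variables on its own is not stated here, so the values are passed explicitly.

```python
import os


def client_kwargs_from_env(environ=None):
    """Build GhostDQClient keyword arguments from environment variables.

    Hypothetical helper for illustration; it is not provided by ghostdq.
    """
    env = os.environ if environ is None else environ
    required = ("GHOSTDQ_API_KEY", "GHOSTDQ_INGEST_URL")
    # Fail early with a clear message instead of sending an empty key.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["GHOSTDQ_API_KEY"],
        "ingest_url": env["GHOSTDQ_INGEST_URL"],
    }


# A plain dict stands in for os.environ so the example runs anywhere.
kwargs = client_kwargs_from_env(
    {
        "GHOSTDQ_API_KEY": "ghd_xxx",
        "GHOSTDQ_INGEST_URL": "https://ingest.ghostdq.io",
    }
)
# client = GhostDQClient(**kwargs)  # constructor arguments as in the Quick start
```

Calling `client_kwargs_from_env()` with no argument reads the real process environment.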

Supported file formats

Format    Extension   Engine
CSV       .csv        pandas
Parquet   .parquet    pyarrow
Avro      .avro       fastavro
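
The table above implies a lookup from file extension to reader engine. The sketch below shows one way such a lookup could work; it is an assumption-laden illustration, not the code behind `read_file`. The names `ENGINES` and `engine_for` are hypothetical, and treating extensions case-insensitively is a guess.

```python
from pathlib import Path

# Mirrors the table above: file extension -> engine used to read it.
ENGINES = {
    ".csv": "pandas",
    ".parquet": "pyarrow",
    ".avro": "fastavro",
}


def engine_for(path):
    """Return the engine name for a file path (hypothetical helper)."""
    # Assumption: extensions are matched case-insensitively.
    suffix = Path(path).suffix.lower()
    if suffix not in ENGINES:
        raise ValueError(f"Unsupported file extension {suffix!r} for {path}")
    return ENGINES[suffix]


engine = engine_for("sales_2024.parquet")
```

Checking the extension before reading makes an unsupported file fail with a clear message rather than a parser error.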

Supported rule types

Rule             Metric key(s)
row_count        row_count
null_rate        null_rate:{column}
unique           duplicate_count:{column}
value_range      value_min:{column}, value_max:{column}
allowed_values   disallowed_count:{column}
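
To make the metric keys above concrete, here is a minimal pure-Python sketch of how such aggregates could be derived from one column. It is an illustration only, not the SDK's implementation. The function `sketch_metrics` and its parameters are hypothetical, and the exact definitions are assumptions (for example, `duplicate_count` is taken to mean occurrences beyond the first of each value, with nulls excluded).

```python
from collections import Counter


def sketch_metrics(rows, column, allowed=None):
    """Compute illustrative aggregates for one column of a list of dict rows.

    Hypothetical helper: key names follow the table above, but the
    definitions are assumptions, not GhostDQ's actual implementation.
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]

    metrics = {"row_count": len(rows)}
    # Share of rows where the column is missing (None).
    metrics[f"null_rate:{column}"] = (
        (len(values) - len(present)) / len(values) if values else 0.0
    )
    # Assumed definition: occurrences beyond the first of each value.
    counts = Counter(present)
    metrics[f"duplicate_count:{column}"] = sum(n - 1 for n in counts.values())
    if present:
        metrics[f"value_min:{column}"] = min(present)
        metrics[f"value_max:{column}"] = max(present)
    if allowed is not None:
        allowed_set = set(allowed)
        metrics[f"disallowed_count:{column}"] = sum(
            1 for v in present if v not in allowed_set
        )
    return metrics


rows = [
    {"country": "DE"},
    {"country": "FR"},
    {"country": "DE"},
    {"country": None},
    {"country": "XX"},
]
result = sketch_metrics(rows, "country", allowed=["DE", "FR"])
```

The returned dictionary holds only aggregate numbers and boundary values, which is the kind of payload that gets shipped in place of the raw rows.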

Local development

Requires Python 3.10+ (3.13 recommended). From the repo root:

python3.13 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests
ruff check src tests
mypy src tests --ignore-missing-imports

License & disclaimer

Licensed under Apache License 2.0.

This software is provided “as is”, without warranty of any kind. You are responsible for evaluating whether it fits your use case and for any outcomes from using it. See the LICENSE for the full terms, including limitations of liability.
