Python wrapper for the anofox-tabular DuckDB extension — data quality, PII, and validation primitives

Project description

anofox-tabular Python package

Python wrapper for the anofox-tabular DuckDB extension — data quality, PII detection, email/phone validation, anomaly detection, diffing, money, and VAT primitives.

Installation

pip install anofox-tabular

Optional extras for DataFrame support:

pip install "anofox-tabular[pandas]"   # adds pandas
pip install "anofox-tabular[polars]"   # adds polars
pip install "anofox-tabular[pandas,polars]"
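
To see which of the optional DataFrame libraries are importable in the current environment, a small check like the following can help. It is a minimal sketch using only the standard library; the helper name available_dataframe_backends is hypothetical and not part of the anofox package.

```python
from importlib.util import find_spec


def available_dataframe_backends() -> dict[str, bool]:
    """Report which optional DataFrame libraries can be imported.

    Hypothetical helper for illustration; not part of anofox.
    """
    # find_spec returns None when a top-level module cannot be located.
    return {name: find_spec(name) is not None for name in ("pandas", "polars")}


print(available_dataframe_backends())
```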

Quick start

import anofox

# In-memory database (extension downloaded automatically)
with anofox.connect() as conn:
    # Email validation
    print(conn.execute("SELECT anofox_tab_email_is_valid('hi@example.com', 'regex')").fetchone())

# Or use a locally built extension
conn = anofox.connect(
    extension_path="/path/to/anofox_tabular.duckdb_extension"
)

Python-native API

import anofox
from anofox import validate, quality, pii, diff

conn = anofox.connect()

# ── Email validation ──────────────────────────────────────────────────
validate.email_is_valid(conn, "hi@example.com")                # True
validate.email_is_valid(conn, "hi@example.com", mode="dns")    # True (DNS checked)

import pandas as pd
df = pd.DataFrame({"email": ["a@b.com", "bad-email", "c@d.org"]})
result_df = validate.email_is_valid(conn, df, column="email")
# Returns DataFrame with added 'email_is_valid' column

# ── Phone validation ──────────────────────────────────────────────────
validate.phone_is_valid(conn, "+14155552671", region="US")     # True
validate.phone_format(conn, "+14155552671", "US", "INTERNATIONAL")

# ── Data quality ──────────────────────────────────────────────────────
conn.execute("CREATE TABLE orders AS SELECT * FROM read_parquet('orders.parquet')")

quality.volume(conn, "orders", min_rows=100)
# {"status": "pass", "min_rows": 100, ...}

quality.null_rate(conn, "orders", "amount", max_null_rate=0.05)
quality.distinct_count(conn, "orders", "status", min_distinct=2, max_distinct=10)
quality.schema_check(conn, "orders", ["id", "amount", "created_at"])

# ── High-level profile ────────────────────────────────────────────────
summary = conn.profile(df)   # returns pd.DataFrame with per-column metrics

# ── PII detection ─────────────────────────────────────────────────────
pii.pii_contains(conn, "Call me at +1-415-555-2671")  # True
pii.pii_detect(conn, "Email: test@example.com")        # [{"type": "EMAIL", ...}]
pii.pii_mask(conn, "test@example.com", strategy="redact")

scan_result = pii.pii_scan_table(conn, "orders")  # pd.DataFrame

# ── Diff ──────────────────────────────────────────────────────────────
# Accepts table names or DataFrames (df_before / df_after stand in for your own DataFrames)
changes = diff.joindiff(conn, "orders_v1", "orders_v2", primary_keys="id")
changes = diff.joindiff(conn, df_before, df_after, primary_keys="id")
# Returns pd.DataFrame with diff_type: 'added', 'removed', 'changed', 'unchanged'

# ── Schema validation ─────────────────────────────────────────────────
from anofox.validate import EmailRule, PhoneRule

result = conn.validate(df, schema={
    "email": EmailRule(mode="dns"),
    "phone": PhoneRule(region="DE"),
})
print(result.passed)      # True / False
print(result.failures)    # pd.DataFrame of failed rows
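
The four diff_type values reported by joindiff above can be illustrated with a plain-Python sketch. This is not the extension's implementation; the function join_diff and the sample rows are assumptions made only to show how rows are classified by primary key.

```python
def join_diff(before, after, key):
    """Classify rows by primary key as added, removed, changed, or unchanged.

    Illustrative only; rows are plain dicts and `key` names the primary key.
    """
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    result = {}
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            result[k] = "added"      # present only in the new version
        elif k not in new:
            result[k] = "removed"    # present only in the old version
        elif old[k] != new[k]:
            result[k] = "changed"    # same key, different column values
        else:
            result[k] = "unchanged"
    return result


before = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
after = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}, {"id": 4, "amount": 40}]
print(join_diff(before, after, key="id"))
# {1: 'removed', 2: 'changed', 3: 'unchanged', 4: 'added'}
```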

Module overview

Module           Functions
anofox.validate  email_is_valid, email_validate, phone_is_valid, phone_parse, phone_format, phone_region
anofox.quality   volume, null_rate, distinct_count, freshness, zscore, iqr, schema_check
anofox.anomaly   isolation_forest, isolation_forest_mv, dbscan, dbscan_mv, outlier_tree
anofox.pii       pii_detect, pii_mask, pii_contains, pii_scan_table, pii_audit_table
anofox.diff      joindiff, hashdiff
anofox.money     make_money, money_from_cents, is_valid_currency, currency_symbol, money_add, etc.
anofox.vat       make_vat, vat_is_valid, vat_is_eu_member, vat_country_name, etc.

CLI

# Profile any file (colored table output)
anofox profile data.parquet
anofox profile data.csv --format json

# Quality checks (exit 0 = pass, exit 1 = fail)
anofox quality data.parquet --volume-min 1000
anofox quality data.csv --null-max 0.05 --column email

Supported formats: .parquet, .csv, .tsv, .json, .ndjson
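
Because the quality command reports its result through the exit code (0 = pass, 1 = fail), it can gate a script or CI step. The sketch below shows one way to interpret that exit code from Python; check_passed is a hypothetical helper, and stand-in commands are used so the example runs without the CLI installed.

```python
import subprocess
import sys


def check_passed(command: list[str]) -> bool:
    """Run a command and treat exit code 0 as pass, anything else as fail."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Stand-in commands that only set an exit code. In practice the list would be
# something like ["anofox", "quality", "data.parquet", "--volume-min", "1000"].
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(check_passed(passing), check_passed(failing))  # True False
```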

pytest plugin

# Run with: pytest --anofox-check
import pytest

@pytest.mark.anofox_quality("orders", volume_min=100)
def test_orders_table_has_data(anofox_conn):
    ...

The session-scoped anofox_conn fixture is provided automatically. Tests are skipped if the extension is unavailable.

Extension resolution

The package resolves the extension binary in this order:

  1. ANOFOX_EXT_PATH environment variable (path to local binary)
  2. extension_path argument to connect()
  3. Cached binary in ~/.anofox/extensions/
  4. Download from the community registry, then the S3 mirror (https://get.erpl.io)
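
The lookup order above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the package's actual loader: the name resolve_extension, the injectable env and cache_dir parameters, and the *.duckdb_extension glob used for cached files are all hypothetical. Returning None stands for "nothing found locally, fall through to the download step".

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_extension(
    extension_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cache_dir: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first local candidate in the documented order, else None."""
    env = os.environ if env is None else env
    if cache_dir is None:
        cache_dir = Path.home() / ".anofox" / "extensions"

    # 1. Environment variable takes precedence.
    env_path = env.get("ANOFOX_EXT_PATH")
    if env_path:
        return Path(env_path)

    # 2. Explicit argument, as passed to connect().
    if extension_path:
        return Path(extension_path)

    # 3. Cached binary (the file-name pattern is an assumption).
    if cache_dir.is_dir():
        cached = sorted(cache_dir.glob("*.duckdb_extension"))
        if cached:
            return cached[0]

    # 4. Nothing local: the caller would download the binary.
    return None


# The environment variable wins over the explicit argument.
print(resolve_extension("/path/b", env={"ANOFOX_EXT_PATH": "/path/a"}))
```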

Development

# Build the extension first
make release

# Install package in dev mode
cd python
pip install -e ".[dev]"

# Run tests
ANOFOX_EXT_PATH=../build/release/extension/anofox_tabular/anofox_tabular.duckdb_extension \
  pytest tests/ -v

# Loader/utils tests run without the extension (no env var needed)
pytest tests/test_loader.py tests/test_utils.py -v

License

MIT

Download files

Source Distribution

anofox_tabular-0.3.0.tar.gz (85.2 kB)

Uploaded Source

Built Distribution

anofox_tabular-0.3.0-py3-none-any.whl (27.3 kB)

Uploaded Python 3

File details

Details for the file anofox_tabular-0.3.0.tar.gz.

File metadata

  • Download URL: anofox_tabular-0.3.0.tar.gz
  • Size: 85.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for anofox_tabular-0.3.0.tar.gz
Algorithm Hash digest
SHA256 1e0bd5b746d983d7e187d5b01e12deba67f4750ca12b0904521b3538307939d0
MD5 efbca5b402754bc08f11f487615f7d12
BLAKE2b-256 0ce980141a2fbbe27dbcbe1f435a7b9852ffd80ccd757276aba2af697d26ebed
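
A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch; the helper sha256_of is hypothetical, and the local file location is an assumption, so the comparison only runs if the file is present in the working directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above for the source distribution.
EXPECTED = "1e0bd5b746d983d7e187d5b01e12deba67f4750ca12b0904521b3538307939d0"
archive = Path("anofox_tabular-0.3.0.tar.gz")  # assumed download location

if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED else "MISMATCH")
```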

Provenance

The following attestation bundles were made for anofox_tabular-0.3.0.tar.gz:

Publisher: publish_python.yml on DataZooDE/anofox-tabular

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file anofox_tabular-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: anofox_tabular-0.3.0-py3-none-any.whl
  • Size: 27.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for anofox_tabular-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 389d33692e51542a572a4c4cb46297ec4cd50919a1095736573139b7ef25e773
MD5 b67ba6478a9c48a2ce9f08a8b355b6cf
BLAKE2b-256 7ee8adc92abc52e7a2b0fb2d21425378056e02f046e39827df3ffe880f66a0b6

Provenance

The following attestation bundles were made for anofox_tabular-0.3.0-py3-none-any.whl:

Publisher: publish_python.yml on DataZooDE/anofox-tabular

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
