Python wrapper for the anofox-tabular DuckDB extension — data quality, PII, and validation primitives

anofox-tabular Python package

Python wrapper for the anofox-tabular DuckDB extension — data quality, PII detection, email/phone validation, anomaly detection, diffing, money, and VAT primitives.

Installation

pip install anofox-tabular

Optional extras for DataFrame support:

pip install "anofox-tabular[pandas]"   # adds pandas
pip install "anofox-tabular[polars]"   # adds polars
pip install "anofox-tabular[pandas,polars]"

Quick start

import anofox

# In-memory database (extension downloaded automatically)
with anofox.connect() as conn:
    # Email validation
    print(conn.execute("SELECT anofox_tab_email_is_valid('hi@example.com', 'regex')").fetchone())

# Or use a locally built extension
conn = anofox.connect(
    extension_path="/path/to/anofox_tabular.duckdb_extension"
)

Python-native API

import anofox
from anofox import validate, quality, pii, diff

conn = anofox.connect()

# ── Email validation ──────────────────────────────────────────────────
validate.email_is_valid(conn, "hi@example.com")                # True
validate.email_is_valid(conn, "hi@example.com", mode="dns")    # True (DNS checked)

import pandas as pd
df = pd.DataFrame({"email": ["a@b.com", "bad-email", "c@d.org"]})
result_df = validate.email_is_valid(conn, df, column="email")
# Returns DataFrame with added 'email_is_valid' column

# ── Phone validation ──────────────────────────────────────────────────
validate.phone_is_valid(conn, "+14155552671", region="US")     # True
validate.phone_format(conn, "+14155552671", "US", "INTERNATIONAL")

# ── Data quality ──────────────────────────────────────────────────────
conn.execute("CREATE TABLE orders AS SELECT * FROM read_parquet('orders.parquet')")

quality.volume(conn, "orders", min_rows=100)
# {"status": "pass", "min_rows": 100, ...}

quality.null_rate(conn, "orders", "amount", max_null_rate=0.05)
quality.distinct_count(conn, "orders", "status", min_distinct=2, max_distinct=10)
quality.schema_check(conn, "orders", ["id", "amount", "created_at"])

# ── High-level profile ────────────────────────────────────────────────
summary = conn.profile(df)   # returns pd.DataFrame with per-column metrics

# ── PII detection ─────────────────────────────────────────────────────
pii.pii_contains(conn, "Call me at +1-415-555-2671")  # True
pii.pii_detect(conn, "Email: test@example.com")        # [{"type": "EMAIL", ...}]
pii.pii_mask(conn, "test@example.com", strategy="redact")

scan_result = pii.pii_scan_table(conn, "orders")  # pd.DataFrame

# ── Diff ──────────────────────────────────────────────────────────────
# Table names or DataFrames both work (df_before / df_after are existing pandas DataFrames)
changes = diff.joindiff(conn, "orders_v1", "orders_v2", primary_keys="id")
changes = diff.joindiff(conn, df_before, df_after, primary_keys="id")
# Returns pd.DataFrame with diff_type: 'added', 'removed', 'changed', 'unchanged'

# ── Schema validation ─────────────────────────────────────────────────
from anofox.validate import EmailRule, PhoneRule

result = conn.validate(df, schema={
    "email": EmailRule(mode="dns"),
    "phone": PhoneRule(region="DE"),
})
print(result.passed)      # True / False
print(result.failures)    # pd.DataFrame of failed rows
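
The four diff_type labels returned by joindiff can be illustrated without the extension. The following is a minimal sketch in plain Python, assuming rows are matched on a single primary key and compared on all remaining columns. The classify_rows helper and the sample rows are hypothetical and are not part of the anofox API, and the extension's actual comparison rules may differ.

```python
from typing import Any, Dict, Hashable, List

Row = Dict[str, Any]


def classify_rows(before: List[Row], after: List[Row], key: str) -> Dict[Hashable, str]:
    """Label each primary-key value as added, removed, changed, or unchanged."""
    # Index both sides by primary key (assumes keys are unique per side).
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    labels: Dict[Hashable, str] = {}
    for k in before_by_key.keys() | after_by_key.keys():
        if k not in before_by_key:
            labels[k] = "added"        # present only in the new version
        elif k not in after_by_key:
            labels[k] = "removed"      # present only in the old version
        elif before_by_key[k] != after_by_key[k]:
            labels[k] = "changed"      # same key, different column values
        else:
            labels[k] = "unchanged"
    return labels


# Hypothetical sample data standing in for orders_v1 / orders_v2.
orders_v1 = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 30.0},
]
orders_v2 = [
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 35.0},
    {"id": 4, "amount": 40.0},
]

labels = classify_rows(orders_v1, orders_v2, key="id")
for k in sorted(labels):
    print(k, labels[k])
```

In this sample, id 1 is labelled removed, id 2 unchanged, id 3 changed, and id 4 added.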

Module overview

Module           Functions
anofox.validate  email_is_valid, email_validate, phone_is_valid, phone_parse, phone_format, phone_region
anofox.quality   volume, null_rate, distinct_count, freshness, zscore, iqr, schema_check
anofox.anomaly   isolation_forest, isolation_forest_mv, dbscan, dbscan_mv, outlier_tree
anofox.pii       pii_detect, pii_mask, pii_contains, pii_scan_table, pii_audit_table
anofox.diff      joindiff, hashdiff
anofox.money     make_money, money_from_cents, is_valid_currency, currency_symbol, money_add, etc.
anofox.vat       make_vat, vat_is_valid, vat_is_eu_member, vat_country_name, etc.

CLI

# Profile any file (colored table output)
anofox profile data.parquet
anofox profile data.csv --format json

# Quality checks (exit 0 = pass, exit 1 = fail)
anofox quality data.parquet --volume-min 1000
anofox quality data.csv --null-max 0.05 --column email

Supported formats: .parquet, .csv, .tsv, .json, .ndjson
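
Because the quality command signals its result through the exit code (0 = pass, 1 = fail), it can gate a pipeline step from Python. The sketch below assumes nothing beyond that exit-code convention. The check_passed helper is hypothetical, and the stand-in commands exist only so the example runs without the CLI installed.

```python
import subprocess
import sys
from typing import Sequence


def check_passed(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


# Intended usage with the CLI invocation shown above (requires anofox on PATH):
#   check_passed(["anofox", "quality", "data.parquet", "--volume-min", "1000"])

# Stand-in commands that mimic a passing and a failing check.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(check_passed(passing), check_passed(failing))
```

Note that subprocess.run raises FileNotFoundError when the executable is missing, so a caller may want to handle that case separately from a failed check.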

pytest plugin

# Run with: pytest --anofox-check
import pytest

@pytest.mark.anofox_quality("orders", volume_min=100)
def test_orders_table_has_data(anofox_conn):
    ...

The session-scoped anofox_conn fixture is provided automatically by the plugin. Tests are skipped if the extension is unavailable.

Extension resolution

The package resolves the extension binary in this order:

  1. ANOFOX_EXT_PATH environment variable (path to local binary)
  2. extension_path argument to connect()
  3. Cached binary in ~/.anofox/extensions/
  4. Download from the community registry, falling back to the S3 mirror (https://get.erpl.io)
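
The lookup order above can be sketched as a first-match search. This is an illustration only, not the package's loader: the resolve_extension function, its parameters, and the *.duckdb_extension file pattern used for the cache are assumptions, and the download step is represented by returning None.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_extension(
    extension_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cache_dir: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first local candidate in the documented order, or None."""
    env = os.environ if env is None else env
    if cache_dir is None:
        cache_dir = Path.home() / ".anofox" / "extensions"

    # 1. ANOFOX_EXT_PATH environment variable.
    env_path = env.get("ANOFOX_EXT_PATH")
    if env_path:
        return Path(env_path)

    # 2. extension_path argument, as passed to connect().
    if extension_path:
        return Path(extension_path)

    # 3. Cached binary (the file name pattern is an assumption).
    if cache_dir.is_dir():
        cached = sorted(cache_dir.glob("*.duckdb_extension"))
        if cached:
            return cached[0]

    # 4. Nothing local was found; a real loader would download here.
    return None


# An explicit env mapping keeps the demonstration independent of the real environment.
empty_cache = Path("/nonexistent-anofox-cache")
print(resolve_extension("/opt/ext.duckdb_extension", env={}, cache_dir=empty_cache))
```

Following the list as written, the environment variable takes precedence over the extension_path argument in this sketch.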

Development

# Build the extension first
make release

# Install package in dev mode
cd python
pip install -e ".[dev]"

# Run tests
ANOFOX_EXT_PATH=../build/release/extension/anofox_tabular/anofox_tabular.duckdb_extension \
  pytest tests/ -v

# Loader/utils tests run without the extension (no env var needed)
pytest tests/test_loader.py tests/test_utils.py -v

License

MIT
