Deterministic dataset validation, drift detection, and lineage audit — Python wrapper for the cjc-locke Rust crate

cjc-locke (Python)

Thin PyO3 wrapper over the Rust cjc-locke crate — deterministic dataset validation, drift detection, and lineage audit for Python users.

Why

You want Locke's contract (byte-identical reports across runs, machines, and process boundaries) without leaving Python. This wrapper gives you exactly that: every call delegates to one Rust function. The Python side has no business logic and no caching, and it allocates nothing beyond one copy per column at the FFI boundary.

Install

pip install maturin
cd python/
maturin develop --release          # local dev install
# or
maturin build --release            # produce a wheel in target/wheels/
pip install target/wheels/cjc_locke-*.whl

The build needs Rust 1.63+ (this project itself uses Rust 1.91) and Python 3.9+. The wheel is tagged abi3-py39, so a single artifact per platform works on every CPython release from 3.9 onward.

Performance and determinism guarantees

  • Determinism: byte-identical to a native Rust call. Python dicts preserve insertion order (a language guarantee since Python 3.7), and cjc_data::DataFrame canonicalises everything downstream via BTreeMap. The Rust side does not multithread by default and the wrapper introduces no threading.
  • Memory: one heap allocation per column at the FFI boundary (Vec<f64> / Vec<i64> / Vec<String> / Vec<bool>), then zero copies through to the report. For numpy f64/i64/bool columns the FFI read is zero-copy via the buffer protocol; only the to_vec() into Rust-owned memory counts.
  • Power / thermal: identical to native — the Rust side does the same work it would do from a cjcl CLI run. No background threads, no asyncio, no extra cores.
  • Float / integer fidelity: numpy f64/i64 values are passed through bit-exact. f32/i32 widen to f64/i64 via direct cast (lossless). A Python int is extracted as i64 and raises an error on overflow (PyO3 default).
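
The fidelity bullet can be checked without the wrapper. The sketch below uses only the standard library; round_trip_f32 and fits_i64 are hypothetical helper names, not part of cjc_locke. It illustrates what "lossless" means here: widening an f32 to f64 keeps the f32 value exactly (narrowing again reproduces the same bits), even though that value may differ from the f64 written with the same decimal digits. It also shows which Python ints fall outside the i64 range.

```python
import struct

I64_MIN = -(2**63)
I64_MAX = 2**63 - 1


def round_trip_f32(value: float) -> float:
    """Round value to the nearest f32, then widen it back to a Python float (f64)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def fits_i64(n: int) -> bool:
    """True if n can be stored in a signed 64-bit integer."""
    return I64_MIN <= n <= I64_MAX


widened = round_trip_f32(0.1)
# Widening is lossless: narrowing again yields the identical f32 bit pattern.
same_bits = struct.pack("<f", widened) == struct.pack("<f", 0.1)
# The widened value is the f32 nearest to 0.1, which is not the f64 literal 0.1.
differs_from_f64_literal = widened != 0.1
```

A value such as 0.5, which is exactly representable in f32, survives the round trip unchanged.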

Quick start

import numpy as np
import cjc_locke

# 1. Dict-of-arrays / lists in.
data = {
    "age": np.array([20.0, 30.0, np.nan, 40.0, 99.0]),
    "city": ["NY", "LA", "NY", "SF", "NY"],
    "is_active": [True, True, False, True, True],
}

report = cjc_locke.validate(data, label="users")
print(report)                       # <LockeReport n_findings=2 ...>
print(report.severity_counts)       # {'info': 0, 'notice': 0, 'warning': 1, 'error': 1}
print(report.finding_codes())       # ['E9001', 'E9041']

# 2. Full per-finding evidence via JSON round-trip (cheap).
detail = report.to_dict()
for f in detail["findings"]:
    print(f["code"], f["severity"], f["message"])

# 3. Canonical bytes for downstream audit chains.
canonical = report.to_json()        # byte-identical to a native Rust call
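
Because the canonical output is byte-identical across runs, one way to use it in an audit chain is to store a digest and compare it later. This is a minimal sketch, assuming report.to_json() returns a str (if it returns bytes, drop the encode step). report_digest is a hypothetical helper, and the JSON strings are stand-ins rather than real cjc_locke output.

```python
import hashlib


def report_digest(canonical_json: str) -> str:
    """Return the SHA-256 hex digest of a canonical report string."""
    return hashlib.sha256(canonical_json.encode("utf-8")).hexdigest()


# Stand-in strings; in practice these would come from report.to_json().
run_a = '{"findings":[],"label":"users"}'
run_b = '{"findings":[],"label":"users"}'
run_c = '{"findings":[],"label":"users-v2"}'

digest_a = report_digest(run_a)
digest_b = report_digest(run_b)
digest_c = report_digest(run_c)
```

Equal digests (digest_a and digest_b) mean the two reports are the same byte for byte; any difference in the report, however small, changes the digest (digest_c).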

Drift detection

import numpy as np
import cjc_locke

# Two samples of the same column; the test range is shifted by 500.
train = {"x": np.arange(1000.0)}
test = {"x": np.arange(500.0, 1500.0)}

report = cjc_locke.compare_drift(train=train, test=test)
print(report.finding_codes())       # ['E9030', 'E9039', ...]

# Or all in one go:
val, drift, belief = cjc_locke.validate_and_compare(train, test, label="my-data")
print(belief.score_dict())
# {'overall': 0.78, 'schema': 1.0, 'drift': 0.42, ...}

Streaming (out-of-core)

sv = cjc_locke.StreamingValidator(label="stream", config={"sample_cap": 100_000})
for chunk in iter_chunks_somehow():
    sv.ingest_chunk(chunk)          # chunk is a dict like above
final = sv.into_report()
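
The iter_chunks_somehow() call above is a placeholder. One possible way to produce chunks from an in-memory dict of columns is sketched below; iter_chunks is a hypothetical helper, not part of cjc_locke, and it relies only on slicing, so it works with lists and numpy arrays alike.

```python
from typing import Any, Dict, Iterator


def iter_chunks(data: Dict[str, Any], chunk_rows: int) -> Iterator[Dict[str, Any]]:
    """Yield dicts holding at most chunk_rows consecutive rows of every column."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    lengths = {len(col) for col in data.values()}
    if len(lengths) > 1:
        raise ValueError("columns differ in length")
    n_rows = lengths.pop() if lengths else 0
    for start in range(0, n_rows, chunk_rows):
        # Slice every column over the same row window, keeping column order.
        yield {name: col[start:start + chunk_rows] for name, col in data.items()}


chunks = list(iter_chunks({"x": [1, 2, 3, 4, 5], "y": list("abcde")}, chunk_rows=2))
```

Each yielded dict has the same shape as the input to validate, so it could be passed to sv.ingest_chunk(chunk) directly. For data that does not fit in memory, the same pattern would read each window from disk instead of slicing.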

Policy gates

result = cjc_locke.apply_policy(report, policy={
    "suppressions": [
        {"code": "E9001", "column": "phone", "reason": "PII expected"},
    ],
    "owners": [
        {"team": "data-quality", "code": "E9072"},
    ],
    "requirements": [
        {"code": "E9001", "operator": "eq_zero", "threshold": 0},
    ],
})

if result.gate_fails():
    raise SystemExit("policy gate failed")
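
A policy dict is easy to mistype, so a shape check before calling apply_policy can catch mistakes early. The sketch below is an assumption-laden illustration: the section names and keys are taken solely from the example above, the real schema may accept more, and policy_shape_problems is a hypothetical helper rather than a cjc_locke function.

```python
from typing import Any, Dict, List

# Keys observed in the example policy above; the real schema may allow more.
EXPECTED_KEYS = {
    "suppressions": {"code", "column", "reason"},
    "owners": {"team", "code"},
    "requirements": {"code", "operator", "threshold"},
}


def policy_shape_problems(policy: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """List human-readable problems; an empty list means the shape matches."""
    problems = []
    for section, entries in policy.items():
        if section not in EXPECTED_KEYS:
            problems.append(f"unknown section: {section}")
            continue
        for i, entry in enumerate(entries):
            missing = EXPECTED_KEYS[section] - set(entry)
            if missing:
                problems.append(f"{section}[{i}] missing {sorted(missing)}")
    return problems


good = {"suppressions": [{"code": "E9001", "column": "phone", "reason": "PII expected"}]}
bad = {"suppressions": [{"code": "E9001"}], "owner": []}
```

Here policy_shape_problems(good) is empty, while bad is flagged both for the incomplete suppression and for the misspelled "owner" section.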

Lineage audit

b = cjc_locke.LineageBuilder("daily-pipeline")
src = b.add_impression("train.csv", kind="dataset", n_rows=10_000,
                       columns=["x", "y"])
node = b.add_idea(name="filter_active",
                  op_id="filter",
                  parents=[src],
                  params={"expr": "is_active == True"})
graph = b.finish()

assert graph.is_acyclic()
graph.validate_audit_monotonic()    # raises ValueError if violated
print(graph.emit_mermaid())          # graph rendering for docs
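
For readers unfamiliar with what an acyclicity check involves, the idea can be sketched in pure Python with Kahn's algorithm. This is an illustration only and not the crate's implementation; the node -> parents mapping and the function below are hypothetical stand-ins for the graph built above.

```python
from collections import deque
from typing import Dict, List


def is_acyclic(parents: Dict[str, List[str]]) -> bool:
    """Kahn's algorithm over a node -> parents mapping.

    Every parent must also appear as a key of the mapping.
    """
    # Count unresolved parents per node and index children by parent.
    pending = {node: len(ps) for node, ps in parents.items()}
    children: Dict[str, List[str]] = {node: [] for node in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)
    # Start from nodes without parents (sorted for a deterministic visit order).
    ready = deque(sorted(n for n, count in pending.items() if count == 0))
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in children[node]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    # Nodes on a cycle never reach zero pending parents, so they are never visited.
    return visited == len(parents)


pipeline = {"train.csv": [], "filter_active": ["train.csv"]}
cyclic = {"a": ["b"], "b": ["a"]}
```

The two-node pipeline mirrors the example above and passes the check; the cyclic mapping does not.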

Pandas / polars

The Python facade auto-detects pandas or polars DataFrames by duck-typing — both libraries remain optional dependencies. Internally we call df[col].values (pandas) or df[col].to_numpy() (polars) per column, then route through the same numpy zero-copy path as a raw dict. No extra business logic.

import pandas as pd
import cjc_locke

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
report = cjc_locke.validate(df)        # works
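
The duck-typing described above might look roughly like the following. This is a sketch under assumptions, not the wrapper's actual code: to_column_dict is a hypothetical name, and the check prefers to_numpy() where a column offers it and falls back to .values otherwise, which is one plausible way to cover both libraries without importing either.

```python
from typing import Any, Dict


def to_column_dict(obj: Any) -> Dict[str, Any]:
    """Normalise a dict or a DataFrame-like object to a dict of columns."""
    if isinstance(obj, dict):
        return dict(obj)
    if hasattr(obj, "columns"):
        out = {}
        for name in obj.columns:
            series = obj[name]
            # Use to_numpy() when the column provides it, else fall back to .values.
            if hasattr(series, "to_numpy"):
                out[name] = series.to_numpy()
            else:
                out[name] = series.values
        return out
    raise TypeError(f"unsupported input type: {type(obj).__name__}")
```

Because nothing is imported from pandas or polars, both can stay optional: the function only inspects attributes on whatever object it receives.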

What's exposed

Everything from the cjc-locke crate's public surface that translates cleanly to Python:

  • validate, compare_drift, validate_and_compare
  • belief_report, apply_policy
  • causal_guardrail, detect_temporal_issues
  • emit_report_json, parse_report_json
  • make_audit_event
  • Classes: LockeReport, InductionRiskReport, BeliefReport, CausalGuardrailReport, StreamingValidator, LineageBuilder, LineageGraph, AuditEvent, PolicyResult

The Rust-side TracedDataFrame (which uses lifetime-borrowed references to a builder) does not map to Python; use LineageBuilder.add_idea(...) directly to build the same provenance graph.

License

MIT — same as the rest of the workspace.

Source Distributions

No source distribution files are available for this release.

Built Distribution

cjc_locke-0.1.0-cp39-abi3-win_arm64.whl (601.9 kB)

Uploaded for CPython 3.9+, Windows ARM64

File details

Details for the file cjc_locke-0.1.0-cp39-abi3-win_arm64.whl.

File metadata

  • Download URL: cjc_locke-0.1.0-cp39-abi3-win_arm64.whl
  • Upload date:
  • Size: 601.9 kB
  • Tags: CPython 3.9+, Windows ARM64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.17

File hashes

Hashes for cjc_locke-0.1.0-cp39-abi3-win_arm64.whl

  • SHA256: 49f01ed3bda1c4d0e657234e75d8eb71870351d07af2d56a5411ad26886eb6c9
  • MD5: 23b63c8efcdf4f70f0ac2daf7122998a
  • BLAKE2b-256: 2ddc942fcbacdb1cada3128fbc538036a2221c692fddd7a57afe803c261b2986
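
A downloaded wheel can be checked against the SHA256 digest listed above with the standard library alone. The sketch below is one way to do it; sha256_of_file and wheel_matches are hypothetical helper names, and the file path is whatever location the wheel was saved to.

```python
import hashlib

# SHA256 digest listed above for cjc_locke-0.1.0-cp39-abi3-win_arm64.whl.
EXPECTED_SHA256 = "49f01ed3bda1c4d0e657234e75d8eb71870351d07af2d56a5411ad26886eb6c9"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def wheel_matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file at path has the expected SHA256 digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling wheel_matches("cjc_locke-0.1.0-cp39-abi3-win_arm64.whl") before installing returns False if the file was truncated or altered in transit.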
