Semantic redaction for financial AI agents: strip identity, keep the signal, prove it with a signed certificate.

Maintainers

vje013

Darwin Proxy

Destroy the identity. Keep the signal. Prove it.

Darwin Proxy is a semantic redaction engine for financial AI agents. It strips identity out of a dataset while preserving the analytical signal, then issues a signed certificate attesting to what it did and that the re-identification risk of the result falls below a stated threshold.

The problem

What's Next (full product roadmap)

Signed Ed25519 attestation certificate proving exactly how data was abstracted (built on Darwin Agentic Cloud)
K-anonymity re-identification gate validating that no replacement is too rare to be safe
Chroma vector-based semantic classifier replacing heuristic matching with embedding-space neighborhoods
Open-core: engine free (Apache-2.0), policy packs and certification paid

BUILD UPDATE 6/7/2026

Verdict: right now it is a single flat-table, in-memory tool that is not yet schema-flexible. The column policy and the re-id quasi-identifiers are hardcoded to specific English header names.

What it handles

Dimension	Current capability
Input format	One flat CSV (utf-8-sig) via CLI; CSV text or a JSON list of flat records via the API
Structure	Flat rows of string fields. No nested JSON, no multi-table/relational, no Excel/Parquet
Row count (validated)	500 rows real, 2,000 rows logic-only benchmark
Throughput	~765 rows/sec logic-only (blank scanner). Per-record ~1.1 ms, gate ~0.15 ms/row
Structured PII (by column)	First/Last name, Email, Business Name, Phone, City, State, Country
Inline PII (free text)	SSN, ABA routing, CUSIP, ISIN, EIN, account (checksum/context validated), plus PERSON/ORG/LOCATION via spaCy NER on prose of 3+ tokens
Re-id gate	k-anonymity over State, Shares Owned, Acquisition Date, with optimal minimal-loss generalization
Output	Abstracted CSV/rows plus an Ed25519-signed certificate
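Checksum validation is what keeps a pattern recognizer from flagging every nine-digit number as a routing number. As an illustration only, the sketch below applies the standard ABA routing-number checksum (weights 3, 7, 1 repeated across the nine digits). The function name is_valid_aba is hypothetical and is not taken from the Darwin Proxy codebase.

```python
def is_valid_aba(candidate: str) -> bool:
    """Return True if candidate passes the ABA routing-number checksum."""
    # A routing number is exactly nine ASCII digits.
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    weights = [3, 7, 1, 3, 7, 1, 3, 7, 1]
    total = sum(int(ch) * w for ch, w in zip(candidate, weights))
    # The weighted sum of a valid number is a multiple of 10.
    return total % 10 == 0


print(is_valid_aba("021000021"))  # a well-known valid routing number
print(is_valid_aba("123456789"))  # fails the checksum
```

A nine-digit string that fails this check can be left alone as signal, which is the point of validating by checksum rather than by shape.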

Hard limits right now

Column names are hardcoded. Semantic replacement only fires on exactly these eight headers: First Name, Last Name, Email, Business Name, Phone Number, City, State, Country. A column called fname or client_first is treated as signal and kept. Because single-token cells skip NER (the 3-token gate), a fname column full of first names passes through largely unredacted. There is no CLI or API way to supply a custom policy yet, even though the engine supports one internally.

The gate only protects data with its three QI columns. If a dataset has none of State, Shares Owned, Acquisition Date, the gate finds no quasi-identifiers, puts every record in one class, reports k equal to the row count, and passes with zero generalization. That is a trivial pass with no real re-identification protection, and the certificate will still say passed. This is the most important footgun: the gate is schema-specific, and on the wrong schema it is a no-op that looks like success.
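The trivial pass is easy to reproduce outside the engine. The sketch below is not the gate's code; it is a minimal k-anonymity computation (achieved_k is a hypothetical name) that shows why an empty quasi-identifier set yields k equal to the row count.

```python
from collections import Counter


def achieved_k(rows, qi_columns):
    """Smallest equivalence-class size over the QI columns present in the data."""
    if not rows:
        return 0
    present = [c for c in qi_columns if c in rows[0]]
    # With no QI column present, every row gets the same empty key ().
    classes = Counter(tuple(row[c] for c in present) for row in rows)
    return min(classes.values())


rows = [
    {"client": "A", "balance": "10"},
    {"client": "B", "balance": "20"},
    {"client": "C", "balance": "30"},
]
qi = ["State", "Shares Owned", "Acquisition Date"]
k = achieved_k(rows, qi)
print(k)  # 3: one class containing every row, so any k <= 3 "passes"
```

Until the QI set is configurable, a caller can guard against this by checking that at least one of the three QI headers exists in the input before trusting a passed result.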

Everything is in-memory, single-threaded. abstract_csv reads the whole file, abstracts every row, runs the gate over the full set, and writes. No streaming, no chunking, no parallelism. Practical ceiling is low hundreds of thousands of rows before memory and single-thread time bite. The gate is roughly linear in rows but runs many passes during lattice search plus rollback, so it grows with row count.

Throughput on the real model is unmeasured and will be lower. The 765 rows/sec figure is logic-only with a blank scanner. With the real spaCy model, every signal string cell goes through analyzer.analyze, which is much heavier, and free-text prose adds NER cost. Data with several signal columns or any free-text column will run materially slower. I have not benchmarked the real-model path because the model will not download in my sandbox.

Entity and locale coverage is narrow. No credit cards, IBAN, IP, street address, DOB, passport, driver license, or any non-US identifiers. Names and org NER are English-centric. The sector corpus is US large-cap only, so funds, LPs, and non-US entities classify to the nearest of seven sectors.

Input hygiene is thin. utf-8-sig only. Ragged rows, missing values, or unexpected types are not hardened against; a None in a name field could throw. No size guard, no timeout, no auth, no rate limit on the service.
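Until the engine hardens its input path, a caller can normalize rows before handing them over. The sketch below is one possible pre-processing step using only the standard csv module; read_clean_rows is a hypothetical helper, not part of Darwin Proxy. It turns missing cells into empty strings and rejects rows that have more cells than the header.

```python
import csv
import io


def read_clean_rows(text: str):
    """Yield dict rows with missing cells normalized to '' (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # DictReader stores cells beyond the header under the None key.
        extra = row.pop(None, None)
        if extra:
            raise ValueError(
                f"line {reader.line_num}: {len(extra)} unexpected extra cell(s)"
            )
        # Short rows are padded with None by DictReader; replace with ''.
        yield {key: (value or "") for key, value in row.items()}


sample = "First Name,State\nAda\nBob,NY\n"
print(list(read_clean_rows(sample)))
```

Decoding the file as utf-8-sig before calling this keeps the byte-order mark out of the first header name.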

The honest one-paragraph summary

It reliably abstracts a clean, flat, English-headered CSV that uses the expected column names, in the low thousands of rows, on a box with the spaCy and Chroma models present, and proves it with a signed certificate.

The moment the schema drifts from that shape (the expected column names, the three QI fields, the eight known headers), it quietly does less than it appears to, because unrecognized columns fall through to signal and the gate degrades to a trivial pass.

The two changes that would most widen its real range are exposing a configurable policy and a configurable QI set through the CLI and API, so it adapts to a customer's actual schema instead of the stockholders schema.

To be useful, a financial AI workflow has to send client data to third-party models. The moment a client's name, holdings, and account details leave the box, that is PII egressing to a third party, with the regulatory exposure (GLBA, Reg S-P, CCPA) landing on the operator. Darwin Proxy strips the identity before the data leaves, so the model still gets the signal and the real PII never escapes.

What it does

A dataset flows through four stages:

Detect which columns carry identity by their content, not their header names, using a Presidio analyzer with the full predefined recognizer set plus checksum-validated finance recognizers (SSN, ABA routing, CUSIP, ISIN, EIN, account). Renamed or gibberish headers do not fool it.
Transform each identifier. The default is keyed, signal-preserving substitution: a value maps to the same realistic fake everywhere (a custom Presidio operator), so joins and shape survive. An opaque AES-encrypt mode is available when nothing analyzable should leave. Geography and dates are kept for the gate rather than substituted.
Gate the result on k-anonymity, generalizing quasi-identifiers (region, holdings band, acquisition window) until every record shares its combination with at least k-1 others. Quasi-identifiers are inferred from the detected entities, and when none are identified the gate refuses to claim k-anonymity rather than silently passing.
Certify with an Ed25519 signature over the manifest, binding the detection mapping, operators, locale, reversibility mode, the gate result (including whether re-identification risk was actually assessed), and the before/after hashes.
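The gate stage can be pictured as a loop that coarsens quasi-identifiers until the smallest group reaches k. The sketch below is a simplified greedy version, not the engine's optimal minimal-loss search. The LADDERS and REGION tables, the column names, and the gate function are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical lookup used by the second rung of the "state" ladder.
REGION = {"NY": "Northeast", "MA": "Northeast", "CA": "West", "WA": "West"}

# Hypothetical generalization ladders: index 0 is the raw value, later rungs are coarser.
LADDERS = {
    "state": [lambda v: v, lambda v: REGION.get(v, "Other"), lambda v: "*"],
    "acquired": [lambda v: v, lambda v: v[:7], lambda v: v[:4], lambda v: "*"],
}


def min_class_size(rows, levels):
    """Size of the smallest group of rows sharing one generalized QI combination."""
    keys = Counter(
        tuple(LADDERS[col][lvl](row[col]) for col, lvl in levels.items())
        for row in rows
    )
    return min(keys.values())


def gate(rows, qi_columns, k):
    """Return the generalization level per column that reaches k, or None if none does."""
    if not rows or not qi_columns:
        # Mirrors the stated behavior: refuse rather than pass trivially.
        raise ValueError("no quasi-identifiers or no rows; k-anonymity not assessed")
    levels = {col: 0 for col in qi_columns}
    while min_class_size(rows, levels) < k:
        candidates = [c for c in qi_columns if levels[c] < len(LADDERS[c]) - 1]
        if not candidates:
            return None  # every ladder is exhausted and k is still not met
        # Coarsen the least-generalized column first.
        col = min(candidates, key=lambda c: levels[c])
        levels[col] += 1
    return levels


rows = [
    {"state": "NY", "acquired": "2021-03-14"},
    {"state": "MA", "acquired": "2021-03-20"},
    {"state": "CA", "acquired": "2021-07-01"},
    {"state": "WA", "acquired": "2021-07-09"},
]
print(gate(rows, ["state", "acquired"], k=2))  # {'state': 1, 'acquired': 1}
```

Here k=2 is reached once states become regions and dates become year-month, so each record shares its combination with one other record.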

Reversibility has two modes: a keyed map (signal-preserving and reversible only via an encrypted, expiring map) and AES encrypt (opaque and reversible by key alone). Image inputs are supported optionally via OCR when tesseract is present.

Quickstart

pip install darwin-proxy
python -m spacy download en_core_web_lg     # or en_core_web_sm for a lighter box

# abstract a CSV, write output + a signed manifest sidecar next to it
proxy abstract data.csv -o abstracted.csv --k 5

# re-check the certificate against the output (recomputes hash and k)
proxy verify abstracted.csv.manifest.json --output abstracted.csv

# run as a service
proxy serve --port 8000

Stable pseudonyms across runs require a persistent key:

export PROXY_PSEUDONYM_KEY=$(python -c "import os;print(os.urandom(32).hex())")
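Why the key matters: a keyed substitution is only stable if the key is. The sketch below shows the general idea with HMAC-SHA256 from the standard library. This is an assumption for illustration; the derivation Darwin Proxy actually uses is not described here, and the pseudonym function and the name list are hypothetical.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names.
FAKE_FIRST_NAMES = ["Avery", "Blake", "Casey", "Devon", "Emery", "Finley", "Harper", "Jordan"]


def pseudonym(value: str, key: bytes) -> str:
    """Deterministically map value to a fake name under key (illustrative only)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_FIRST_NAMES)
    # A short hex suffix keeps distinct inputs from collapsing onto one name.
    return f"{FAKE_FIRST_NAMES[index]}-{digest.hex()[:8]}"


key_a = bytes.fromhex("00" * 32)
key_b = bytes.fromhex("11" * 32)
print(pseudonym("Alice", key_a) == pseudonym("Alice", key_a))  # True: same key, same fake
print(pseudonym("Alice", key_a) == pseudonym("Alice", key_b))  # False: a new key breaks joins
```

With a fresh random key on every run, the same client maps to a different fake each time, so outputs from separate runs can no longer be joined.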

Reversible (map mode) abstraction persists an encrypted, expiring map; reverse restores the substituted identifiers across the whole table:

export PROXY_MAP_SECRET='a-high-entropy-secret'
proxy abstract data.csv -o out.csv --mode map --ttl 86400
proxy reverse out.csv -o restored.csv --manifest out.csv.manifest.json --map out.csv.map.enc

Opaque, key-only reversibility (no map) uses --mode encrypt.

Performance

Detection (spaCy NER plus the recognizers) dominates the cost; transform and the gate are negligible by comparison. Detection throughput, measured on the reference box:

configuration	rows/s	note
unbatched (old default)	~40	one document at a time
batched (current default)	~137	~3.4x, result-identical, no flag needed
`--model en_core_web_sm`	~158	lighter model, lower NER accuracy
`--sample-size 200`	~1500	types columns from a sample; may miss sparse PII
`--fast` (no NER)	~380	pattern-only; skips name/org/location detection

Guidance. Batching is on by default and changes nothing about the result. For structured financial data that does not need name/org/location detection, --fast runs pattern-only at several times the speed and records detection_mode: pattern-only in the certificate so the omission is on the record. For large, homogeneous tables, --sample-size N makes detection roughly independent of row count, at the cost of possibly missing PII that is sparse within a column; exhaustive (no sampling) is the default precisely because under-detection is the unsafe direction. --model en_core_web_sm trades NER accuracy for speed.
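To turn the table into a planning number, divide the row count by the measured rate. The sketch below does that arithmetic with the rows/s figures from the table; the function name is hypothetical, and the --sample-size row is left out because its cost is roughly independent of row count. Real timings depend on the box and on how much free text the data contains.

```python
# rows/s figures copied from the performance table above
THROUGHPUT = {"unbatched": 40, "batched": 137, "en_core_web_sm": 158, "fast": 380}


def detection_minutes(rows: int, config: str) -> float:
    """Estimated detection time in minutes at the measured rate for config."""
    return rows / THROUGHPUT[config] / 60


for name in THROUGHPUT:
    print(f"{name:>15}: {detection_minutes(100_000, name):6.1f} min")
```

For 100,000 rows this works out to roughly 42 minutes unbatched, 12 minutes on the batched default, and a little over 4 minutes with --fast.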

Trust boundary

The signed manifest is the certificate. There are two roots, one verifier.

Mode	Who holds the key	`verify` reports	Meaning
Self-signed	the operator's local key	Self-signed (OSS self-attestation)	the output is untampered; the signer is anonymous
Darwin-certified	Darwin / DAC authority key only	Darwin-certified (authority root)	a trusted third party vouches

The engine self-signs for free. Only a manifest whose signer equals the configured Darwin root verifies as authority-rooted, and only Darwin holds that private key, so the open-source engine can never forge the stamp. Set PROXY_DARWIN_ROOT to the authority public key to recognize Darwin-certified manifests.
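The two-roots rule reduces to one comparison once the signature itself has been checked. The sketch below assumes the signer's public key and PROXY_DARWIN_ROOT are both hex strings, which is an assumption about the encoding; trust_label is a hypothetical name, and signature verification is deliberately out of scope here.

```python
import os


def trust_label(signer_public_key_hex: str, darwin_root_hex=None) -> str:
    """Classify an already signature-verified manifest by who signed it."""
    if darwin_root_hex is None:
        # Fall back to the environment variable named in this README.
        darwin_root_hex = os.environ.get("PROXY_DARWIN_ROOT", "")
    signer = signer_public_key_hex.strip().lower()
    root = darwin_root_hex.strip().lower()
    # Only an exact match with the configured authority key counts as certified.
    if root and signer == root:
        return "Darwin-certified (authority root)"
    return "Self-signed (OSS self-attestation)"


print(trust_label("ab" * 32, darwin_root_hex="AB" * 32))
print(trust_label("cd" * 32, darwin_root_hex="ab" * 32))
```

With no root configured, every valid manifest reports as self-signed, which is the safe default.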

What is independently re-checkable versus what requires the authority:

Re-checkable by anyone holding the output: the signature, the hashes, and the k-anonymity claim (recompute the achieved k from the published rows; the /verify endpoint does this when you pass the rows back).
Judgment, which the authority root vouches for: whether the methodology and policy are adequate for a given regulatory regime. De-identification adequacy is a statistical argument, not a proof, which is exactly why a certification authority has value.
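The re-checkable half can be done with nothing but the output file and the manifest. The sketch below recomputes the output hash and the achieved k. The manifest keys output_sha256, quasi_identifiers, and k are hypothetical stand-ins, not the real manifest schema, and the Ed25519 signature check is omitted because it needs a third-party library.

```python
import csv
import hashlib
import io
from collections import Counter


def recheck(output_bytes: bytes, manifest: dict) -> dict:
    """Recompute the output hash and achieved k, then compare with the manifest."""
    digest = hashlib.sha256(output_bytes).hexdigest()
    rows = list(csv.DictReader(io.StringIO(output_bytes.decode("utf-8-sig"))))
    qi = manifest["quasi_identifiers"]
    classes = Counter(tuple(row[c] for c in qi) for row in rows)
    # No rows or no QIs means k was not assessed, which must not count as a pass.
    achieved = min(classes.values()) if classes and qi else None
    return {
        "hash_ok": digest == manifest["output_sha256"],
        "k_ok": achieved is not None and achieved >= manifest["k"],
        "achieved_k": achieved,
    }


data = b"region,band\nWest,1-100\nWest,1-100\nEast,1-100\nEast,1-100\n"
manifest = {
    "output_sha256": hashlib.sha256(data).hexdigest(),
    "quasi_identifiers": ["region", "band"],
    "k": 2,
}
print(recheck(data, manifest))
print(recheck(data + b"North,1-100\n", manifest))  # tampered: hash and k both fail
```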

What this is and is not

Darwin Proxy controls one axis: where identity goes when data leaves the box. It is one control, not a compliance program. It does not make an operator "compliant" wholesale. PII mishandling is a civil and regulatory matter, not a criminal one, and the precise scope of the control is the egress axis.

API

POST /abstract (oneway or encrypt mode), POST /verify (re-check a manifest against a supplied output), GET /healthz, GET /metrics. The service is stateless: map mode is not a server concern, since reversing requires a client-held encrypted map and its secret.

License

v2

Release history

2.1.1 (this version): Jun 8, 2026
2.1.0: Jun 8, 2026
0.1.0: Jun 8, 2026
