Semantic redaction for financial AI agents: strip identity, keep the signal, prove it with a signed certificate.

Darwin Proxy

Destroy the identity. Keep the signal. Prove it.

Darwin Proxy is a semantic redaction engine for financial AI agents. It strips identity out of a dataset while preserving the analytical signal, then issues a signed certificate attesting to what it did and that the result is re-identifiable below a stated threshold.

The problem

What's Next (full product roadmap)

  • Signed Ed25519 attestation certificate proving exactly how data was abstracted (built on Darwin Agentic Cloud)
  • K-anonymity re-identification gate validating that no replacement is too rare to be safe
  • Chroma vector-based semantic classifier replacing heuristic matching with embedding-space neighborhoods
  • Open-core: engine free (Apache-2.0), policy packs and certification paid

BUILD UPDATE 6/7/2026

Verdict: right now it is a single flat-table, in-memory tool that is not yet schema-flexible. The column policy and the re-id quasi-identifiers are hardcoded to specific English header names.

What it handles

| Dimension | Current capability |
| --- | --- |
| Input format | One flat CSV (utf-8-sig) via CLI; CSV text or a JSON list of flat records via the API |
| Structure | Flat rows of string fields. No nested JSON, no multi-table/relational, no Excel/Parquet |
| Row count (validated) | 500 rows real, 2,000 rows logic-only benchmark |
| Throughput | ~765 rows/sec logic-only (blank scanner). Per-record ~1.1 ms, gate ~0.15 ms/row |
| Structured PII (by column) | First/Last name, Email, Business Name, Phone, City, State, Country |
| Inline PII (free text) | SSN, ABA routing, CUSIP, ISIN, EIN, account (checksum/context validated), plus PERSON/ORG/LOCATION via spaCy NER on prose of 3+ tokens |
| Re-id gate | k-anonymity over State, Shares Owned, Acquisition Date, with optimal minimal-loss generalization |
| Output | Abstracted CSV/rows plus an Ed25519-signed certificate |

Hard limits right now

Column names are hardcoded. Semantic replacement only fires on exactly these eight headers: First Name, Last Name, Email, Business Name, Phone Number, City, State, Country. A column called fname or client_first is treated as signal and kept. Because single-token cells skip NER (the 3-token gate), a fname column full of first names passes through largely unredacted. There is no CLI or API way to supply a custom policy yet, even though the engine supports one internally.
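
The fall-through described above can be illustrated with a minimal sketch. It assumes the policy is an exact, case-sensitive match against the eight headers listed; the names KNOWN_HEADERS and classify_columns are hypothetical and are not the engine's actual code.

```python
# Hypothetical reconstruction of exact-header matching; not the engine's real policy code.
KNOWN_HEADERS = {
    "First Name", "Last Name", "Email", "Business Name",
    "Phone Number", "City", "State", "Country",
}

def classify_columns(headers):
    """Split headers into those that would be replaced and those kept as signal."""
    replaced = [h for h in headers if h in KNOWN_HEADERS]
    kept = [h for h in headers if h not in KNOWN_HEADERS]
    return replaced, kept

replaced, kept = classify_columns(["First Name", "fname", "client_first", "Shares Owned"])
print(replaced)  # ['First Name']
print(kept)      # ['fname', 'client_first', 'Shares Owned']
```

Under that assumption, fname and client_first land in the kept list alongside genuine signal columns, which is the failure mode the paragraph warns about.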

The gate only protects data with its three QI columns. If a dataset has none of State, Shares Owned, Acquisition Date, the gate finds no quasi-identifiers, puts every record in one class, reports k equal to the row count, and passes with zero generalization. That is a trivial pass with no real re-identification protection, and the certificate will still say passed. This is the most important footgun: the gate is schema-specific, and on the wrong schema it is a no-op that looks like success.
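
The trivial pass can be reproduced in a few lines. This is a sketch under assumptions, not the engine's gate: achieved_k and gate are hypothetical names, rows are plain dicts, and the "not_assessed" result is one possible way to close the gap rather than the behavior described above.

```python
from collections import Counter

# The three quasi-identifier columns named in the text above.
QI_COLUMNS = ("State", "Shares Owned", "Acquisition Date")

def achieved_k(rows, qi_columns=QI_COLUMNS):
    """Return (k, present): the smallest equivalence-class size and the QI columns found."""
    if not rows:
        return 0, ()
    present = tuple(c for c in qi_columns if c in rows[0])
    # With no QI columns every row maps to the empty tuple, i.e. one big class.
    classes = Counter(tuple(r[c] for c in present) for r in rows)
    return min(classes.values()), present

def gate(rows, k):
    """Hypothetical stricter gate: refuse to report a pass when no QI column is present."""
    achieved, present = achieved_k(rows)
    if not present:
        return "not_assessed"
    return "passed" if achieved >= k else "failed"

no_qi = [{"fname": "Ann"}, {"fname": "Bob"}, {"fname": "Cy"}]
with_qi = [{"State": "NY"}, {"State": "NY"}, {"State": "CA"}]

print(achieved_k(no_qi))   # (3, ())  k equals the row count
print(gate(no_qi, k=2))    # not_assessed
print(gate(with_qi, k=2))  # failed   the single CA row is a class of size 1
```

Reporting k alone hides the difference between the first and third cases; returning the list of QI columns actually used makes a vacuous pass visible.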

Everything is in-memory, single-threaded. abstract_csv reads the whole file, abstracts every row, runs the gate over the full set, and writes. No streaming, no chunking, no parallelism. Practical ceiling is low hundreds of thousands of rows before memory and single-thread time bite. The gate is roughly linear in rows but runs many passes during lattice search plus rollback, so it grows with row count.

Throughput on the real model is unmeasured and lower. The 765 rows/sec is logic-only with a blank scanner. With the real spaCy model, every signal string cell goes through analyzer.analyze, which is much heavier, and free-text prose adds NER cost. Data with several signal columns or any free-text column will run materially slower. I have not benchmarked the real-model path because the model will not download in my sandbox.

Entity and locale coverage is narrow. No credit cards, IBAN, IP, street address, DOB, passport, driver license, or any non-US identifiers. Names and org NER are English-centric. The sector corpus is US large-cap only, so funds, LPs, and non-US entities classify to the nearest of seven sectors.

Input hygiene is thin. utf-8-sig only. Ragged rows, missing values, or unexpected types are not hardened against; a None in a name field could throw. No size guard, no timeout, no auth, no rate limit on the service.

The honest one-paragraph summary

It reliably abstracts a clean, flat, English-headered CSV that uses the expected column names, in the low thousands of rows, on a box with the spaCy and Chroma models present, and proves it with a signed certificate.

The moment the schema drifts from that shape (different column names, none of the three QI fields, headers outside the eight known ones), it quietly does less than it appears to: unrecognized columns fall through as signal and the gate degrades to a trivial pass.

The two changes that would most widen its real range are exposing a configurable policy and a configurable QI set through the CLI and API, so it adapts to a customer's actual schema instead of the stockholders schema.

To be useful, a financial AI workflow has to send client data to third-party models. The moment a client's name, holdings, and account details leave the box, that is PII egressing to a third party, with the regulatory exposure (GLBA, Reg S-P, CCPA) landing on the operator. Darwin Proxy strips the identity before the data leaves, so the model still gets the signal and the real PII never escapes.

What it does

A dataset flows through four stages:

  1. Detect which columns carry identity by their content, not their header names, using a Presidio analyzer with the full predefined recognizer set plus checksum-validated finance recognizers (SSN, ABA routing, CUSIP, ISIN, EIN, account). Renamed or gibberish headers do not fool it.
  2. Transform each identifier. The default is keyed, signal-preserving substitution: a value maps to the same realistic fake everywhere (a custom Presidio operator), so joins and shape survive. An opaque AES-encrypt mode is available when nothing analyzable should leave. Geography and dates are kept for the gate rather than substituted.
  3. Gate the result on k-anonymity, generalizing quasi-identifiers (region, holdings band, acquisition window) until every record shares its combination with at least k others. Quasi-identifiers are inferred from the detected entities, and when none are identified the gate refuses to claim k-anonymity rather than silently passing.
  4. Certify with an Ed25519 signature over the manifest, binding the detection mapping, operators, locale, reversibility mode, the gate result (including whether re-identification risk was actually assessed), and the before/after hashes.
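
The hash-binding part of the Certify stage can be sketched with the standard library alone. This is an illustration under assumptions: the field names (input_sha256, output_sha256, gate) and the helper names are hypothetical, not the engine's manifest schema, and the Ed25519 signature itself is omitted because it needs a third-party library.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_manifest(input_bytes: bytes, output_bytes: bytes, gate_result: dict) -> dict:
    # Illustrative field names only; the real manifest binds more (mapping, operators, locale).
    return {
        "input_sha256": sha256_hex(input_bytes),
        "output_sha256": sha256_hex(output_bytes),
        "gate": gate_result,
    }

def canonical_bytes(manifest: dict) -> bytes:
    """Serialize deterministically so signer and verifier see identical bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")

original = b"name,state\nAnn,NY\n"
abstracted = b"name,state\nKim,NY\n"
manifest = build_manifest(original, abstracted, {"k": 5, "assessed": True})
payload = canonical_bytes(manifest)  # a signature would be computed over these bytes

# A verifier recomputes the output hash; any edit to the output breaks the match.
tampered = b"name,state\nAnn,NY\n"
print(sha256_hex(abstracted) == manifest["output_sha256"])  # True
print(sha256_hex(tampered) == manifest["output_sha256"])    # False
```

Sorting keys and fixing separators matters because a signature covers bytes, not a dict: two serializations of the same manifest must not differ.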

Reversibility has two modes: a keyed map (signal-preserving and reversible only via an encrypted, expiring map) and AES encrypt (opaque and reversible by key alone). Image inputs are supported optionally via OCR when tesseract is present.

Quickstart

pip install darwin-proxy
python -m spacy download en_core_web_lg     # or en_core_web_sm for a lighter box

# abstract a CSV, write output + a signed manifest sidecar next to it
proxy abstract data.csv -o abstracted.csv --k 5

# re-check the certificate against the output (recomputes hash and k)
proxy verify abstracted.csv.manifest.json --output abstracted.csv

# run as a service
proxy serve --port 8000

Stable pseudonyms across runs require a persistent key:

export PROXY_PSEUDONYM_KEY=$(python -c "import os;print(os.urandom(32).hex())")
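
Why the key has to persist can be shown with a minimal sketch of keyed substitution. It assumes an HMAC-SHA256 construction and a tiny fake-name pool; both are illustrative choices, and keyed_token, pseudonym, and FAKE_FIRST_NAMES are hypothetical names rather than the engine's operator.

```python
import hashlib
import hmac

# Tiny illustrative pool; a real table would need to be far larger to avoid collisions.
FAKE_FIRST_NAMES = ["Avery", "Blake", "Casey", "Devon", "Emery", "Finley", "Harper", "Jordan"]

def keyed_token(value: str, key: bytes) -> str:
    """Deterministic keyed digest of a value: same key and value give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def pseudonym(value: str, key: bytes) -> str:
    """Map a value to a fake name chosen by its keyed digest."""
    index = int(keyed_token(value, key), 16) % len(FAKE_FIRST_NAMES)
    return FAKE_FIRST_NAMES[index]

# Stand-ins for a persistent key and for a fresh key generated on a later run.
key_run_1 = bytes.fromhex("00" * 32)
key_run_2 = bytes.fromhex("11" * 32)

# Same key: the mapping is stable, so joins across files and runs survive.
assert pseudonym("Alice", key_run_1) == pseudonym("Alice", key_run_1)

# Different key: the underlying token changes, so earlier pseudonyms no longer line up.
assert keyed_token("Alice", key_run_1) != keyed_token("Alice", key_run_2)
```

Without the key, the mapping cannot be recomputed from the fake values alone, which is the property that lets the substituted data leave the box.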

Reversible (map mode) abstraction persists an encrypted, expiring map; reverse restores the substituted identifiers across the whole table:

export PROXY_MAP_SECRET='a-high-entropy-secret'
proxy abstract data.csv -o out.csv --mode map --ttl 86400
proxy reverse out.csv -o restored.csv --manifest out.csv.manifest.json --map out.csv.map.enc

Opaque, key-only reversibility (no map) uses --mode encrypt.

Performance

Detection (spaCy NER plus the recognizers) is the cost; transform and the gate are negligible by comparison. Measured on the reference box, detection throughput:

| configuration | rows/s | note |
| --- | --- | --- |
| unbatched (old default) | ~40 | one document at a time |
| batched (current default) | ~137 | ~3.4x, result-identical, no flag needed |
| --model en_core_web_sm | ~158 | lighter model, lower NER accuracy |
| --sample-size 200 | ~1500 | types columns from a sample; may miss sparse PII |
| --fast (no NER) | ~380 | pattern-only; skips name/org/location detection |

Guidance. Batching is on by default and changes nothing about the result. For structured financial data that does not need name/org/location detection, --fast runs pattern-only at roughly 2.8x the batched default (~380 versus ~137 rows/s in the table above) and records detection_mode: pattern-only in the certificate so the omission is on the record. For large, homogeneous tables, --sample-size N makes detection cost roughly independent of row count, at the risk of missing PII that is sparse within a column; exhaustive detection (no sampling) is the default precisely because under-detection is the unsafe direction. --model en_core_web_sm trades NER accuracy for speed.
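
The risk of sampling can be put in numbers. The sketch below assumes the sample is drawn uniformly without replacement, which the text does not state; miss_probability is a hypothetical helper, not part of the CLI.

```python
from math import comb

def miss_probability(total_rows: int, pii_rows: int, sample_size: int) -> float:
    """Chance that a uniform sample without replacement contains none of the PII rows."""
    if sample_size >= total_rows:
        return 0.0 if pii_rows > 0 else 1.0
    # Hypergeometric: ways to pick the sample from clean rows only, over all samples.
    return comb(total_rows - pii_rows, sample_size) / comb(total_rows, sample_size)

# One PII cell hidden in a 10,000-row column, typed from a 200-row sample.
p_single = miss_probability(10_000, 1, 200)
print(round(p_single, 4))  # 0.98

# The same column with 50 PII cells is missed less often, but not never.
p_fifty = miss_probability(10_000, 50, 200)
print(p_fifty < p_single)  # True
```

With a single PII row the miss probability reduces to (N - n) / N, so a 200-row sample of 10,000 rows misses it 98% of the time, which is why exhaustive detection is the safer default.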

Trust boundary

The signed manifest is the certificate. There are two roots, one verifier.

| Mode | Who holds the key | verify reports | Meaning |
| --- | --- | --- | --- |
| Self-signed | the operator's local key | Self-signed (OSS self-attestation) | the output is untampered; the signer is anonymous |
| Darwin-certified | Darwin / DAC authority key only | Darwin-certified (authority root) | a trusted third party vouches |

The engine self-signs for free. Only a manifest whose signer equals the configured Darwin root verifies as authority-rooted, and only Darwin holds that private key, so the open-source engine can never forge the stamp. Set PROXY_DARWIN_ROOT to the authority public key to recognize Darwin-certified manifests.

What is independently re-checkable versus what requires the authority:

  • Re-checkable by anyone holding the output: the signature, the hashes, and the k-anonymity claim (recompute the achieved k from the published rows; the /verify endpoint does this when you pass the rows back).
  • Judgment, which the authority root vouches for: whether the methodology and policy are adequate for a given regulatory regime. De-identification adequacy is a statistical argument, not a proof, which is exactly why a certification authority has value.

What this is and is not

Darwin Proxy controls one axis: where identity goes when data leaves the box. It is one control, not a compliance program. It does not make an operator "compliant" wholesale. PII mishandling is a civil and regulatory matter, not a criminal one, and the precise scope of the control is the egress axis.

API

POST /abstract (oneway or encrypt mode), POST /verify (re-check a manifest against a supplied output), GET /healthz, GET /metrics. The service is stateless: map mode is not a server concern, since reversing requires a client-held encrypted map and its secret.

License

Apache-2.0. Copyright 2026 Darwin Adaptive Systems LLC.

v2
