Privacy-preserving LLM wrapper with PII anonymization.

redacit

A local privacy layer that anonymizes sensitive data before it reaches a cloud LLM, then restores the original values in the response. Detected PII never leaves your machine in its original form. No Docker required.


How it works

Your prompt
    ↓
Anonymizer  →  detects PII spans (Presidio, in-process)
            →  replaces each span with a tagged placeholder  e.g. <PERSON_0>
            →  records a placeholder → original mapping
    ↓
Cloud LLM  (sees only anonymized text)
    ↓
Deanonymizer  →  replaces placeholders in the response with original values
    ↓
Your app  (receives the reply with real names / emails / etc. restored)
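The round trip above can be sketched in a few lines. This is a toy illustration only: the two regex patterns and the toy_anonymize / toy_deanonymize helpers are hypothetical stand-ins, not redacit's Presidio-based detection.

```python
import re

# Hypothetical regex-only detectors; redacit itself relies on Presidio recognizers.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def toy_anonymize(text):
    """Replace each match with <TYPE_n> and return (anonymized_text, mapping)."""
    mapping = {}  # placeholder -> original value
    for entity, pattern in PATTERNS.items():
        seen = {}  # original value -> placeholder, so repeated values share one tag

        def substitute(match, entity=entity, seen=seen):
            value = match.group(0)
            if value not in seen:
                seen[value] = f"<{entity}_{len(seen)}>"
                mapping[seen[value]] = value
            return seen[value]

        text = pattern.sub(substitute, text)
    return text, mapping

def toy_deanonymize(text, mapping):
    # The closing ">" keeps <X_1> from matching inside <X_10>.
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

original = "Email john@acme.com about SSN 346-12-5678"
masked, mapping = toy_anonymize(original)
restored = toy_deanonymize(masked, mapping)
print(masked)
print(restored)
```

Only the masked string would be sent to the LLM; the mapping stays local and is applied to the reply.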

Detected entity types

Entity              Example
PERSON              John Smith
EMAIL_ADDRESS       john@acme.com
PHONE_NUMBER        +1 (415) 555-0192
CREDIT_CARD         4532-0151-1283-0366
US_SSN              346-12-5678
IP_ADDRESS          203.0.113.42
LOCATION            Austin, TX
ORGANIZATION        Acme Holdings
DATE_TIME           2024-04-15
IBAN_CODE           GB29NWBK60161331926819
URL                 acme.com
US_PASSPORT         938475610
US_DRIVER_LICENSE
US_BANK_ACCOUNT     7823901645 (custom)
US_ROUTING_NUMBER   021000021 (custom)
EIN                 12-3456789 (custom)
API_KEY             sk-xK92mLp… (custom)

Setup

Requires Python 3.11+. uv is used below for the server extra, the demo, and the tests.

pip install redacit                  # base install — regex-only PII detection
python -m spacy download en_core_web_sm          # + person names, locations (11 MB)
python -m spacy download en_core_web_md          # + word vectors, recommended (43 MB)
# Or use the interactive wizard: redacit init

Copy .env.example to .env and add your API key for live LLM calls:

cp .env.example .env
# set OPENAI_API_KEY=sk-...

Model options

redacit auto-detects the best available spaCy model at startup. No configuration needed — it just uses whatever is installed.

Install command                            Model               Size    Detects
pip install redacit                        none (regex-only)   0 MB    emails, SSNs, credit cards, phones, IBANs, API keys, bank accounts, EINs, URLs, IPs
python -m spacy download en_core_web_sm    en_core_web_sm      11 MB   + person names, locations
python -m spacy download en_core_web_md    en_core_web_md      43 MB   + word vectors, recommended

For most use cases, en_core_web_md is the best balance of size and accuracy. Use en_core_web_sm for minimal footprint, or the base install for structured-PII-only use cases (financial data, API key scrubbing).

You can also select the model explicitly in code:

from redacit import Anonymizer

anon = Anonymizer()                          # auto-detect best available
anon = Anonymizer(model="en_core_web_sm")    # explicit small model
anon = Anonymizer(model=None)                # regex-only, no NLP model
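One way auto-detection like this could work is to probe for installed model packages, since spaCy models install as importable Python packages. The sketch below is an assumption about the approach, not redacit's actual code; pick_spacy_model and the preference order are hypothetical.

```python
from importlib.util import find_spec

# Preference order is an assumption based on the recommendation above.
PREFERRED_MODELS = ("en_core_web_md", "en_core_web_sm")

def pick_spacy_model(is_installed=lambda name: find_spec(name) is not None):
    """Return the first installed model name, or None for regex-only mode."""
    for name in PREFERRED_MODELS:
        if is_installed(name):
            return name
    return None

# Prints a model name if one is installed, otherwise None.
print(pick_spacy_model())
```

The is_installed parameter exists only so the selection logic can be exercised without installing any model.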

Usage

1. CLI — no code needed

redacit anonymize "Schedule a call with John Smith at john@acme.com"

# Anonymized:
# Schedule a call with <PERSON_0> at <EMAIL_ADDRESS_0>
#
# Mapping:
#   <PERSON_0>                       John Smith
#   <EMAIL_ADDRESS_0>                john@acme.com

Filter entity types or tune the confidence threshold:

redacit anonymize "John Smith, card 4111-1111-1111-1111" --entity PERSON
redacit anonymize "..." --threshold 0.6

Analyse an audit log:

redacit stats privacy_audit.jsonl --top 5

Start the REST API server (requires the server extra):

uv add 'redacit[server]'
redacit serve --host 0.0.0.0 --port 8000

2. Drop-in OpenAI replacement

The fastest path if you already have OpenAI code: swap the import and the client constructor.

# Before
from openai import OpenAI
client = OpenAI()

# After
from redacit import PrivacyOpenAI
client = PrivacyOpenAI()

# Everything else stays identical
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise Alice Jones's contract at alice@corp.com"}],
)
# Alice Jones and alice@corp.com are anonymized before the API call
# and restored in response.choices[0].message.content automatically

Tools, response_format, streaming, embeddings, and all other SDK call patterns work unchanged.


3. Simple chat client (OpenAI)

from redacit import OpenAIPrivacyClient

client = OpenAIPrivacyClient()    # reads OPENAI_API_KEY from env
reply  = client.chat("Draft a letter to John Smith at john@acme.com")
# PII stripped before the call, restored in the reply

Stream the response:

for chunk in client.stream("Summarise the following contract: ..."):
    print(chunk, end="", flush=True)

3b. Unified client — any SDK

from redacit import PrivacyClient
from openai import OpenAI              # or anthropic.Anthropic, google.genai.Client

client = PrivacyClient(OpenAI())
reply  = client.query("Draft a letter to John Smith at john@acme.com")
# Works identically with any supported SDK

4. Low-level anonymizer (manage the LLM call yourself)

from redacit import anonymize, deanonymize

result   = anonymize("SSN: 346-12-5678, card: 4111-1111-1111-1111")
raw      = your_llm_call(result.anonymized_text)
restored = deanonymize(raw, result.mapping)

Restrict which entity types are detected for a single call:

result = anonymize(text, entities=["PERSON", "EMAIL_ADDRESS"])

5. Multi-turn conversations

PrivacySession accumulates the placeholder-to-original mapping across turns so PII introduced in one message stays resolvable in later responses:

from redacit import OpenAIPrivacyClient, PrivacySession

session = PrivacySession()
client  = OpenAIPrivacyClient(session=session)

client.chat("My name is Alice Jones")       # <PERSON_0> → Alice Jones stored
client.chat("What did I just tell you?")    # placeholder resolved from session
session.clear()                             # start a new conversation
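The accumulation idea can be illustrated with a small stand-in class. ToySession and its methods are hypothetical and only mirror the behaviour described above; PrivacySession's real interface may differ.

```python
class ToySession:
    """Hypothetical stand-in for PrivacySession: one mapping shared across turns."""

    def __init__(self):
        self.mapping = {}    # placeholder -> original value
        self._by_value = {}  # (entity, value) -> placeholder
        self._counters = {}  # entity type -> next free index

    def placeholder_for(self, entity, value):
        # Reuse the tag if this value already appeared in an earlier turn.
        key = (entity, value)
        if key not in self._by_value:
            index = self._counters.get(entity, 0)
            self._counters[entity] = index + 1
            tag = f"<{entity}_{index}>"
            self._by_value[key] = tag
            self.mapping[tag] = value
        return self._by_value[key]

    def restore(self, text):
        for tag, value in self.mapping.items():
            text = text.replace(tag, value)
        return text

    def clear(self):
        self.mapping.clear()
        self._by_value.clear()
        self._counters.clear()

session = ToySession()
first = session.placeholder_for("PERSON", "Alice Jones")  # turn 1
second = session.placeholder_for("PERSON", "Bob Lee")     # turn 2
again = session.placeholder_for("PERSON", "Alice Jones")  # turn 3 reuses turn 1's tag
reply = session.restore("You said your name is <PERSON_0>.")
print(first, second, again, reply)
```

Keeping the counters in the session is what stops a later turn from issuing a second, conflicting <PERSON_0>.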

6. REST API

# Anonymize
curl -s -X POST http://localhost:8000/anonymize \
  -H "Content-Type: application/json" \
  -d '{"text": "Email alice@corp.com by Friday"}' | jq
# { "anonymized_text": "Email <EMAIL_ADDRESS_0> by Friday",
#   "mapping": {"<EMAIL_ADDRESS_0>": "alice@corp.com"} }

# Restore
curl -s -X POST http://localhost:8000/deanonymize \
  -H "Content-Type: application/json" \
  -d '{"text": "Email <EMAIL_ADDRESS_0> by Friday",
       "mapping": {"<EMAIL_ADDRESS_0>": "alice@corp.com"}}' | jq
# { "text": "Email alice@corp.com by Friday" }

# Chat proxy (requires OPENAI_API_KEY on the server)
curl -s -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarise the contract for John Smith"}' | jq

Full OpenAPI docs available at http://localhost:8000/docs when the server is running.
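The same /anonymize call can be made from Python with only the standard library. This sketch builds the request shown in the curl example; build_anonymize_request is a hypothetical helper, and the default base URL assumes the server was started as above.

```python
import json
import urllib.request

def build_anonymize_request(text, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /anonymize endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/anonymize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_anonymize_request("Email alice@corp.com by Friday")
print(request.get_method(), request.full_url)
```

With the server running, passing the request to urllib.request.urlopen and decoding the body with json.load should yield the anonymized_text and mapping fields shown above.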


7. Structured data — CSV and JSON files

from redacit import CsvAnonymizer, JsonAnonymizer

# CSV — one result per row
for row in CsvAnonymizer().anonymize_file("customers.csv"):
    print(row.anonymized)      # dict with PII replaced per column
    print(row.flat_mapping)    # combined placeholder map for this row

# JSON — one result per record
for rec in JsonAnonymizer().anonymize_file("records.json"):
    print(rec.anonymized)      # nested dict with PII replaced at leaf strings

Add a sidecar config file to control per-column or per-path rules. For customers.csv, place a customers.json alongside it:

{
  "fields": {
    "name":    { "entities": ["PERSON"] },
    "email":   { "entities": ["EMAIL_ADDRESS"] },
    "amount":  { "skip": true },
    "date":    { "skip": true }
  }
}

Field option            Effect
"entities": [...]       Only those PII types detected for this field
"skip": true            Field passed through unchanged
"score_threshold": N    Per-field confidence threshold
(no entry)              Full default entity list at default threshold
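The table's rules can be read as a per-column lookup. The sketch below resolves a sidecar into a plan for one row; plan_for_row and DEFAULT_THRESHOLD are hypothetical, and the default threshold value is an assumption since it is not stated here.

```python
import json

SIDECAR = json.loads("""
{
  "fields": {
    "name":   { "entities": ["PERSON"] },
    "email":  { "entities": ["EMAIL_ADDRESS"], "score_threshold": 0.6 },
    "amount": { "skip": true }
  }
}
""")

DEFAULT_THRESHOLD = 0.5  # assumed value for illustration only

def plan_for_row(row, sidecar):
    """Return {column: None if skipped, else (entities, threshold)}."""
    fields = sidecar.get("fields", {})
    plan = {}
    for column in row:
        rule = fields.get(column, {})
        if rule.get("skip"):
            plan[column] = None  # passed through unchanged
        else:
            # entities=None stands for "full default entity list"
            plan[column] = (
                rule.get("entities"),
                rule.get("score_threshold", DEFAULT_THRESHOLD),
            )
    return plan

row = {"name": "A. Jones", "email": "a@corp.com", "amount": "10.00", "notes": "call back"}
plan = plan_for_row(row, SIDECAR)
print(plan)
```

Columns without an entry, such as notes here, fall back to the full entity list at the default threshold.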

8. Audit logging

AuditLogger writes append-only JSONL. Raw text and mapping values are never stored — only metadata safe for compliance review:

from redacit import OpenAIPrivacyClient, AuditLogger

with AuditLogger("privacy_audit.jsonl") as log:
    client = OpenAIPrivacyClient(audit_logger=log)
    client.chat("Wire $50,000 to account 7823901645")

# Appended record:
# {
#   "ts": "2024-11-01T12:00:00+00:00",
#   "input_hash": "a3f9b2c1...",          ← SHA-256[:16] of the input
#   "entity_counts": {"US_BANK_ACCOUNT": 1},
#   "total_redacted": 1,
#   "provider": "openai",
#   "model": "gpt-4o-mini"
# }
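A record of this shape can be assembled from metadata alone. The sketch below is an illustration, not AuditLogger's implementation; build_audit_record is a hypothetical helper, and reading "SHA-256[:16]" as the first 16 hex characters of the digest is an assumption.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone

def build_audit_record(text, detected_entities, provider, model):
    """Build one JSONL record; neither the text nor the PII values are kept."""
    counts = Counter(detected_entities)
    return {
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        # Assumed reading of SHA-256[:16]: first 16 hex characters of the digest.
        "input_hash": hashlib.sha256(text.encode("utf-8")).hexdigest()[:16],
        "entity_counts": dict(counts),
        "total_redacted": sum(counts.values()),
        "provider": provider,
        "model": model,
    }

prompt = "Wire $50,000 to account 7823901645"
record = build_audit_record(
    prompt,
    ["US_BANK_ACCOUNT"],  # entity types found in the prompt, not their values
    provider="openai",
    model="gpt-4o-mini",
)
line = json.dumps(record)  # one line to append to the .jsonl file
print(line)
```

Because only entity type names and a truncated hash are written, the log can be shared for review without exposing the prompt.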

Analyse a log file from the CLI:

redacit stats privacy_audit.jsonl

# Audit log : privacy_audit.jsonl
# Records   : 142
# Total PII : 389
#
# Top 5 entity types:
#   PERSON                         98
#   EMAIL_ADDRESS                  71
#   US_BANK_ACCOUNT                54
#   CREDIT_CARD                    41
#   PHONE_NUMBER                   38
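A summary like this amounts to summing entity_counts across records. The sketch below shows one way to do that over JSONL lines; summarize is a hypothetical function, not the code behind redacit stats.

```python
import json
from collections import Counter

def summarize(lines, top=5):
    """Aggregate JSONL audit records into (records, total_pii, top entity types)."""
    records = 0
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        records += 1
        totals.update(entry.get("entity_counts", {}))
    return records, sum(totals.values()), totals.most_common(top)

sample = [
    '{"entity_counts": {"PERSON": 2, "EMAIL_ADDRESS": 1}, "total_redacted": 3}',
    '{"entity_counts": {"PERSON": 1}, "total_redacted": 1}',
]
records, total_pii, top_types = summarize(sample, top=1)
print(records, total_pii, top_types)
```

An open file handle can be passed in place of the sample list, since iterating a text file yields one line at a time.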

Demo

uv run python demo.py                        # run all demo datasets
uv run python demo.py general_pii            # plain text PII samples
uv run python demo.py financial              # financial prose samples
uv run python demo.py financial_transactions # CSV with per-column config
uv run python demo.py financial_records      # nested JSON with sidecar

Adding a demo dataset

Plain text — add a .py file to demo_data/:

# demo_data/my_dataset.py
TITLE = "My Dataset"
SAMPLES = [
    "Text with sensitive data here.",
    "Another sample with John Doe at john@example.com.",
]

CSV — drop a .csv into demo_data/ and optionally a .json sidecar with the same stem. demo.py auto-discovers both.


Tests

uv run pytest                        # full suite
uv run pytest tests/unit/            # recognizer unit tests only
uv run pytest tests/test_samples.py  # data-driven leakage and roundtrip tests

Project structure

redacit/
├── src/redacit/
│   ├── __init__.py             # public API — all exports live here
│   ├── anonymizer.py           # core PII detection and placeholder replacement
│   ├── _types.py               # FieldConfig, SidecarConfig, LLMClient protocol
│   ├── session.py              # PrivacySession — multi-turn mapping accumulator
│   ├── audit.py                # AuditLogger — append-only JSONL compliance log
│   ├── cli.py                  # redacit CLI (anonymize / serve / stats)
│   ├── server.py               # FastAPI server (optional — requires [server] extra)
│   ├── client/
│   │   ├── base.py             # BaseLLMClient — anonymize → call → deanonymize lifecycle
│   │   ├── privacy_client.py   # PrivacyClient — unified drop-in proxy for any SDK
│   │   ├── openai_client.py    # OpenAIPrivacyClient + PrivacyOpenAI
│   │   └── litellm_client.py   # LiteLLMPrivacyClient (optional — requires [litellm] extra)
│   ├── formats/
│   │   ├── csv.py              # CsvAnonymizer — row-by-row CSV processing
│   │   ├── json_format.py      # JsonAnonymizer — record-by-record JSON processing
│   │   └── _helpers.py         # flatten / unflatten / load_sidecar / anonymize_flat
│   └── recognizers/
│       ├── bank_account.py     # UsBankAccountRecognizer
│       ├── routing_number.py   # UsRoutingNumberRecognizer
│       ├── ein.py              # EinRecognizer
│       └── api_key.py          # ApiKeyRecognizer (sk-*, Bearer tokens, hex secrets)
├── demo_data/                  # sample datasets for demo.py
├── tests/
│   ├── fixtures/sample_prompts.py
│   ├── test_anonymizer.py
│   ├── test_samples.py
│   ├── test_cli.py
│   ├── test_server.py
│   └── unit/test_recognizers.py
├── demo.py
└── pyproject.toml

Optional extras

Extra               Installs            Enables
redacit[server]     fastapi, uvicorn    redacit serve, REST API
redacit[litellm]    litellm             LiteLLMPrivacyClient (Anthropic, Gemini, Ollama, …)

Known limitations

Limitation                     Detail
Non-US phone numbers           UK/EU mobile numbers may fall below the default confidence threshold without a country-specific recognizer
Numeric pattern collisions     Bank account and routing numbers can overlap with PHONE_NUMBER detections; overlap resolution keeps the higher-confidence span
Credit card Luhn validation    Card numbers must pass Luhn checksum validation; synthetic or mistyped numbers that fail the check are not caught
LLM response paraphrasing      If the LLM rewrites a placeholder (e.g. expands <PERSON_0> to Person Zero), deanonymization will not restore it
Streaming deanonymization      The streaming client buffers the full response before deanonymizing, since placeholders may span token boundaries
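For reference, the Luhn checksum mentioned in the table is easy to reproduce when building test data. This is a standard implementation of the algorithm, written here as an independent sketch; passes_luhn is not part of redacit's API.

```python
def passes_luhn(number):
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

valid = passes_luhn("4111-1111-1111-1111")    # the test card used in the CLI example
invalid = passes_luhn("4111-1111-1111-1112")  # last digit changed
print(valid, invalid)
```

Running sample card numbers through a check like this before adding them to fixtures avoids false leakage reports caused by numbers the recognizer is designed to ignore.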
