Make data safe before feeding it to AI
RedactAI
Strip PII from text, files, and pipelines before it reaches your AI.
Install
pip install redactai
python -m spacy download en_core_web_sm
Quick Start
Clean a file:
redactai clean data.csv -o data.clean.csv
Clean text in Python:
from redactai import clean
safe = clean("Call John Smith at 555-0123")
# "Call Marcia Wells at 555-8912" (faker replacements by default)
Scan for PII in CI:
redactai scan ./data --ci # exits 1 if PII detected
CLI Commands
| Command | Description |
|---|---|
| redactai clean [PATH] | Anonymize a file, folder, or stdin |
| redactai scan PATH | Detect PII and report findings |
| redactai analyze | Analyze text or file and return entity details |
| redactai decrypt | Decrypt previously encrypted output |
| redactai watch PATH | Watch a folder and clean files on change |
| redactai init | Generate a .redactai.yml config file |
| redactai entities | List supported PII entity types |
| redactai profiles | List built-in profiles |
| redactai profiles show NAME | Show profile details |
| redactai mcp | Start MCP tool server for AI agents |
| redactai server start | Start the local API daemon |
| redactai server stop | Stop the daemon |
| redactai server status | Show daemon status |
| redactai server restart | Restart the daemon |
| redactai login | Authenticate with a remote API |
| redactai logout | Remove stored credentials |
| redactai whoami | Show current auth status |
Python API
from redactai import clean, scan
clean(text, *, profile, threshold, language, entities, operators) -> str
# Use a built-in profile
clean("Email me at john@acme.com", profile="llm_guardrail")
# "Email me at <EMAIL_ADDRESS>"
# Override operator for a specific entity
clean("Call 555-0123", operators={"PHONE_NUMBER": {"type": "mask", "masking_char": "*", "chars_to_mask": 6}})
# "Call ***-****"
scan(text, *, threshold, language, entities) -> list[dict]
hits = scan("My SSN is 123-45-6789")
# [{"entity_type": "US_SSN", "start": 10, "end": 21, "score": 0.85, "text": "123-45-6789"}]
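Because each hit carries start and end character offsets, the result of scan can drive custom post-processing. The following is a minimal sketch, assuming the hit shape shown above; redact_spans is a hypothetical helper written for illustration and is not part of RedactAI. It does not handle overlapping hits.

```python
def redact_spans(text: str, hits: list[dict]) -> str:
    """Replace each detected span with a <ENTITY_TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for hit in sorted(hits, key=lambda h: h["start"], reverse=True):
        text = text[:hit["start"]] + f"<{hit['entity_type']}>" + text[hit["end"]:]
    return text


# Hit copied from the scan() example above (hard-coded, so no RedactAI import is needed).
hits = [{"entity_type": "US_SSN", "start": 10, "end": 21, "score": 0.85, "text": "123-45-6789"}]
redacted = redact_spans("My SSN is 123-45-6789", hits)
print(redacted)  # My SSN is <US_SSN>
```

In practice clean() already does this kind of substitution; a helper like this is only useful when the placeholder format needs to differ from what the built-in operators produce.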
Other exports
from redactai import entities, profiles, profile_detail
entities() # ["CREDIT_CARD", "EMAIL_ADDRESS", "PERSON", ...]
profiles() # [{"id": "llm_guardrail@1", "name": "llm_guardrail", ...}, ...]
profile_detail("llm_guardrail") # full config including operators
Profiles
| Profile | Description | Threshold |
|---|---|---|
llm_guardrail |
Redact all PII before sending to LLMs | 0.3 |
app_logs_safe |
Mask PII in logs, keep structure for debugging | 0.7 |
analytics_pseudonymized |
Replace PII with consistent fakes for analytics | 0.5 |
customer_support_shareable |
Redact sensitive PII, keep names/locations for context | 0.5 |
strict_compliance_export |
Maximum redaction for GDPR/HIPAA/CCPA compliance | 0.3 |
dev_demo_readable |
Replace PII with realistic Faker data for demos | 0.5 |
CI/CD
Exit code
redactai scan ./data --ci # exit 0 = clean, exit 1 = PII found
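The same exit-code convention can be reproduced in a custom Python gate built on scan results. This is a rough sketch only: ci_exit_code and its threshold comparison are assumptions made for illustration, and the hits are hard-coded in the shape documented for scan() rather than produced by RedactAI.

```python
def ci_exit_code(hits: list[dict], threshold: float = 0.5) -> int:
    """Return 1 if any hit scores at or above the threshold, else 0."""
    # Assumption: a hit counts as PII when its score meets the threshold.
    return 1 if any(hit["score"] >= threshold for hit in hits) else 0


# In a real gate these would come from redactai.scan(); hard-coded here.
sample_hits = [{"entity_type": "US_SSN", "start": 10, "end": 21, "score": 0.85}]
code = ci_exit_code(sample_hits)
print(code)  # 1
```

A wrapper script would pass the value to sys.exit() so the CI runner sees a failing step.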
GitHub Actions
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install redactai && python -m spacy download en_core_web_sm
- run: redactai scan ./data --ci
pre-commit
# .pre-commit-config.yaml
repos:
- repo: https://github.com/yourorg/redactai
rev: v0.1.0
hooks:
- id: redactai-scan
Config (.redactai.yml)
Generate a starter config:
redactai init
Minimal example:
profile: llm_guardrail
threshold: 0.4
entities:
- PERSON
- EMAIL_ADDRESS
- CREDIT_CARD
operators:
PERSON:
type: faker
locale: en_US
EMAIL_ADDRESS:
type: redact
allow_list:
- "Acme Corp"
files:
include:
- "**/*.csv"
- "**/*.txt"
exclude:
- "**/node_modules/**"
output_dir: ./clean
Hooks
Three hook layers (shell hooks, Python plugins, and webhooks) can fire on four events: pre_scan, on_pii_detected, post_clean, and on_error.
Shell hooks (in .redactai.yml)
hooks:
post_clean:
- shell: "echo 'Cleaned {{file}} -> {{output_file}}'"
on_pii_detected:
- shell: "notify-send 'PII found: {{entity_count}} entities in {{file}}'"
Python plugins
from redactai.hooks import on_pii_detected, HookEvent
@on_pii_detected
def alert(event: HookEvent):
print(f"Found {event.entity_count} entities in {event.file}")
Webhooks
hooks:
on_pii_detected:
- url: "https://hooks.slack.com/services/..."
MCP Server
Expose RedactAI as tools for Claude Desktop or other AI agents:
pip install "redactai[mcp]" # quoted so shells such as zsh do not expand the brackets
redactai mcp # starts stdio transport
Add to Claude Desktop config (claude_desktop_config.json):
{
"mcpServers": {
"redactai": {
"command": "redactai",
"args": ["mcp"]
}
}
}
API Server
# Start as background daemon (auto-starts on first CLI call)
redactai server start
# Or run in foreground
redactai server start --foreground
# Manage
redactai server status
redactai server restart
redactai server stop
The daemon exposes a REST API at http://localhost:8000 with endpoints:
- POST /analyze: detect PII entities
- POST /anonymize: anonymize text
- POST /upload: process files (multipart)
- GET /health: health check
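A client can reach these endpoints with nothing more than the standard library. The sketch below builds a POST /analyze request without sending it. The JSON body with a single "text" field is an assumption, since the request schema is not documented here; check the running daemon for the actual shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address stated above


def build_analyze_request(text: str) -> urllib.request.Request:
    """Build (but do not send) a POST /analyze request."""
    body = json.dumps({"text": text}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        f"{BASE_URL}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_analyze_request("Call John Smith at 555-0123")
print(req.get_method(), req.full_url)
# To send it against a running daemon:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```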
Persistence
Tokens and the audit log are stored in Postgres when REDACTAI_DATABASE_URL is set.
Without it, both fall back to in-process memory (convenient for local dev and tests,
but lost on restart and not safe across multiple instances).
# Supabase pooled connection (recommended for FastAPI)
export REDACTAI_DATABASE_URL="postgresql://postgres.PROJECT:PASSWORD@REGION.pooler.supabase.com:6543/postgres"
Schema is applied automatically on startup via CREATE TABLE IF NOT EXISTS.
Raw tokens are never stored; only a SHA-256 hash and the token's first 12 characters
(its prefix) are kept. Revoked tokens are retained with a revoked_at timestamp for audit purposes.
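The hash-plus-prefix scheme can be illustrated with the standard library. This is a sketch of the general technique, not RedactAI's actual storage code: token_record, matches, and the field names are hypothetical.

```python
import hashlib
import secrets


def token_record(raw_token: str) -> dict:
    """Derive the fields that are safe to persist for a token."""
    return {
        "prefix": raw_token[:12],  # short, non-secret handle for the token
        "sha256": hashlib.sha256(raw_token.encode("utf-8")).hexdigest(),
    }


def matches(raw_token: str, record: dict) -> bool:
    """Check a presented token against a stored record in constant time."""
    digest = hashlib.sha256(raw_token.encode("utf-8")).hexdigest()
    return secrets.compare_digest(digest, record["sha256"])


raw = secrets.token_urlsafe(32)  # shown to the user once, never stored
record = token_record(raw)
print(record["prefix"], matches(raw, record))
```

Storing only the digest means a leaked table does not reveal usable tokens, while the prefix still lets a listing show which token is which.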
Testing
The default test suite runs entirely in-memory (no Postgres required).
Live-database smoke tests are marked @pytest.mark.postgres and auto-skip
unless REDACTAI_DATABASE_URL is set:
# Default — in-memory only
pytest
# Include Postgres smoke tests (requires live DB)
set -a; source .env; set +a
pytest -m postgres
File Types
Supported: .txt, .csv, .pdf, .docx, .png, .jpg, .jpeg, .bmp, .tiff, .json
Image Redaction
Redact PII from images using OCR:
# Single image
redactai redact-image screenshot.png -o screenshot.redacted.png
# Batch directory
redactai redact-image ./screenshots -o ./screenshots.redacted
# Custom fill color
redactai redact-image photo.jpg --fill "255,192,203"
Structured Data
Anonymize PII in CSV and JSON files with column-aware detection:
# CSV
redactai structured data.csv -o data.clean.csv
# JSON
redactai structured data.json -o data.clean.json
# Custom strategy
redactai structured data.csv --strategy highest_confidence
Pseudonymization
Consistent fake↔real mappings across files and sessions: the same input always produces the same output.
# Pseudonymize with deterministic seed
redactai pseudonymize data.txt --seed "project-alpha" --store mappings.json
# Restore originals
redactai pseudonymize data.pseudonymized.txt --restore --store mappings.json
# Show mapping stats
redactai pseudonymize data.txt --show-mapping --seed "project-alpha"
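One common way to get "same input, same output" is to key a hash on the seed and use it to pick a replacement. The sketch below shows that idea with HMAC-SHA256. It is an assumption about the general technique, not a description of how RedactAI derives its fakes; pseudonym and the toy name pool are hypothetical.

```python
import hashlib
import hmac

FAKE_NAMES = ["Marcia Wells", "Omar Haddad", "Lena Fischer", "Priya Nair"]  # toy pool


def pseudonym(value: str, seed: str) -> str:
    """Map a real value to a fake one, deterministically for a given seed."""
    digest = hmac.new(seed.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).digest()
    return FAKE_NAMES[int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)]


mapping = {}  # fake -> real, the role a --store file would play for restoring
for real in ["John Smith", "John Smith", "Jane Doe"]:
    mapping[pseudonym(real, "project-alpha")] = real

print(pseudonym("John Smith", "project-alpha") == pseudonym("John Smith", "project-alpha"))  # True
```

With a pool this small two real values can collide on the same fake, which would make restoring ambiguous; a real implementation needs a much larger space or explicit collision handling.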
Multi-Language Support
20+ languages with dedicated spaCy models:
# List all supported languages
redactai languages
# Use a specific language
redactai clean document.txt --language de # German
redactai clean document.txt --language ja # Japanese
redactai clean document.txt --language zh # Chinese
Custom NER Recognizers
Plug in Transformers, GLiNER, or Flair models for domain-specific detection:
# Add a Transformers recognizer
redactai add-recognizer transformers --model obi/deid_roberta_i2b2 --threshold 0.5
# Add GLiNER zero-shot recognizer
redactai add-recognizer gliner --model urchade/gliner_medium-v2.1 --labels "PERSON,EMAIL,PHONE"
# Add Flair recognizer
redactai add-recognizer flair --model ner-multi
PDF Annotation
Highlight PII in PDFs without destroying the original. Perfect for legal review and audit trails.
# Annotate with highlights
redactai annotate-pdf document.pdf -o document.annotated.pdf
# Use underline instead of highlight
redactai annotate-pdf document.pdf --type underline --color "0.0,0.0,1.0"
# Generate a PII report (JSON, CSV, or text)
redactai annotate-pdf document.pdf --report --report-format json
Evaluation
Benchmark detection quality against ground truth labels. Critical for audit evidence.
# Run evaluation against ground truth
redactai evaluate ground_truth.json -o report --format both
# Custom threshold and entities
redactai evaluate ground_truth.json --threshold 0.5 --entities PERSON,EMAIL_ADDRESS
Ground truth format (ground_truth.json):
[
{
"text": "My name is John Smith and email is john@example.com",
"entities": [
{"entity_type": "PERSON", "start": 11, "end": 21},
{"entity_type": "EMAIL_ADDRESS", "start": 35, "end": 51}
]
}
]
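To show what a benchmark over this format measures, here is a minimal precision and recall calculation over entity spans. It assumes exact-match scoring, where a prediction counts only if its type, start, and end all agree with a label; RedactAI's own evaluator may score differently, for example by partial overlap. The function names and sample offsets are illustrative.

```python
def span_set(entities: list[dict]) -> set[tuple]:
    """Reduce entity dicts to comparable (type, start, end) tuples."""
    return {(e["entity_type"], e["start"], e["end"]) for e in entities}


def precision_recall(truth: list[dict], predicted: list[dict]) -> tuple[float, float]:
    """Exact-match precision and recall for one document."""
    t, p = span_set(truth), span_set(predicted)
    true_positives = len(t & p)
    precision = true_positives / len(p) if p else 0.0
    recall = true_positives / len(t) if t else 0.0
    return precision, recall


# Illustrative labels: two true entities, of which the detector found one.
truth = [
    {"entity_type": "PERSON", "start": 11, "end": 21},
    {"entity_type": "LOCATION", "start": 30, "end": 36},
]
predicted = [{"entity_type": "PERSON", "start": 11, "end": 21}]
print(precision_recall(truth, predicted))  # (1.0, 0.5)
```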
License
Apache 2.0 — see LICENSE.
Decision Trace
Explain exactly why each PII entity was detected — perfect for audit compliance and debugging.
# Show detailed decision trace
redactai trace --text "My name is John Smith and email is john@example.com"
# Trace from file
redactai trace --file document.txt --format json -o trace.json
# Show as markdown
redactai trace --file document.txt --format markdown
DICOM Medical Redaction
HIPAA-compliant de-identification of medical images. Redacts both pixel text (OCR) and metadata tags.
# Single DICOM file
redactai redact-dicom scan.dcm -o scan.redacted.dcm
# Batch directory
redactai redact-dicom ./dicom_folder -o ./dicom_redacted
# Clean pixels only, keep metadata
redactai redact-dicom scan.dcm --no-clean-metadata
K-Anonymity
Statistical anonymization guarantees — each record is indistinguishable from at least k-1 others.
# Apply k-anonymity (k=5)
redactai k-anonymity data.csv age,zip,gender -k 5 -o data.anonymous.csv
# With l-diversity check
redactai k-anonymity data.csv age,zip -k 5 --sensitive disease --l 3
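The k-anonymity property itself is easy to verify independently of the tool: group rows by their quasi-identifier values and confirm every group has at least k members. A small sketch, with is_k_anonymous as a hypothetical helper and made-up rows:

```python
from collections import Counter


def is_k_anonymous(rows: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


rows = [
    {"age": "30-39", "zip": "941**", "disease": "flu"},
    {"age": "30-39", "zip": "941**", "disease": "asthma"},
    {"age": "40-49", "zip": "100**", "disease": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], k=2))  # False: the 40-49 group has one row
```

A check like this can run on the tool's output as a sanity test before a dataset is shared.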
Streaming Processing
Real-time PII masking for logs, telemetry, and data pipelines.
# Process log file
redactai stream app.log -o app.masked.log
# Process stdin (pipe)
tail -f /var/log/app.log | redactai stream
# Custom entities
redactai stream app.log --entities EMAIL_ADDRESS,IP_ADDRESS
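The streaming model amounts to masking one line at a time without buffering the whole input. The sketch below illustrates that pattern with a generator. The email regex is a deliberately simple stand-in for a real detector, and mask_stream is a hypothetical name, not a RedactAI function.

```python
import re
from typing import Iterable, Iterator

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # simple stand-in pattern


def mask_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with email addresses masked, one line at a time."""
    for line in lines:
        yield EMAIL_RE.sub("<EMAIL_ADDRESS>", line)


log = ["login ok user=jane@example.com\n", "heartbeat\n"]
masked = list(mask_stream(log))
print("".join(masked), end="")
```

Because mask_stream accepts any iterable of lines, an open file or sys.stdin can be passed in directly, which mirrors the pipe usage shown above.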
Synthetic Data Generation
Generate realistic but fake datasets that preserve statistical patterns without exposing real PII.
# Generate synthetic CSV from real data
redactai synthetic real_data.csv -o synthetic.csv --num 1000
# Generate synthetic text
redactai synthetic "My name is John Smith, email john@test.com" -o synthetic.json
# Reproducible with seed
redactai synthetic data.csv --seed 42 --locale en_GB