Make data safe before feeding it to AI

RedactAI

Strip PII from text, files, and pipelines before it reaches your AI.

Install

pip install redactai
python -m spacy download en_core_web_sm

Quick Start

Clean a file:

redactai clean data.csv -o data.clean.csv

Clean text in Python:

from redactai import clean

safe = clean("Call John Smith at 555-0123")
# e.g. "Call Marcia Wells at 555-8912"  (Faker replacements by default, so the exact output varies)

Scan for PII in CI:

redactai scan ./data --ci  # exits 1 if PII detected

CLI Commands

Command                       Description
redactai clean [PATH]         Anonymize a file, folder, or stdin
redactai scan PATH            Detect PII and report findings
redactai analyze              Analyze text or file and return entity details
redactai decrypt              Decrypt previously encrypted output
redactai watch PATH           Watch a folder and clean files on change
redactai init                 Generate a .redactai.yml config file
redactai entities             List supported PII entity types
redactai profiles             List built-in profiles
redactai profiles show NAME   Show profile details
redactai mcp                  Start MCP tool server for AI agents
redactai server start         Start the local API daemon
redactai server stop          Stop the daemon
redactai server status        Show daemon status
redactai server restart       Restart the daemon
redactai login                Authenticate with a remote API
redactai logout               Remove stored credentials
redactai whoami               Show current auth status

Python API

from redactai import clean, scan

clean(text, *, profile, threshold, language, entities, operators) -> str

# Use a built-in profile
clean("Email me at john@acme.com", profile="llm_guardrail")
# "Email me at <EMAIL_ADDRESS>"

# Override operator for a specific entity
clean("Call 555-0123", operators={"PHONE_NUMBER": {"type": "mask", "masking_char": "*", "chars_to_mask": 6}})
# "Call ***-****"

scan(text, *, threshold, language, entities) -> list[dict]

hits = scan("My SSN is 123-45-6789")
# [{"entity_type": "US_SSN", "start": 10, "end": 21, "score": 0.85, "text": "123-45-6789"}]
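
The start and end values in each hit read as end-exclusive character offsets into the input string. As a minimal sketch of how such hits could be consumed, the helper below (redact_spans is a hypothetical name, not part of redactai) replaces each span with a placeholder. The hit is hand-written in the shape shown above, so the example runs without redactai installed.

```python
def redact_spans(text: str, hits: list[dict]) -> str:
    """Replace each detected span with a <ENTITY_TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for hit in sorted(hits, key=lambda h: h["start"], reverse=True):
        placeholder = f"<{hit['entity_type']}>"
        text = text[:hit["start"]] + placeholder + text[hit["end"]:]
    return text


text = "My SSN is 123-45-6789"
# Hand-written hit mirroring the scan() output shape; not produced by redactai here.
hits = [{"entity_type": "US_SSN", "start": 10, "end": 21, "score": 0.85, "text": "123-45-6789"}]

# The slice text[start:end] should equal the reported "text" field.
assert text[10:21] == hits[0]["text"]
print(redact_spans(text, hits))  # My SSN is <US_SSN>
```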

Other exports

from redactai import entities, profiles, profile_detail

entities()          # ["CREDIT_CARD", "EMAIL_ADDRESS", "PERSON", ...]
profiles()          # [{"id": "llm_guardrail@1", "name": "llm_guardrail", ...}, ...]
profile_detail("llm_guardrail")  # full config including operators

Profiles

Profile                     Description                                             Threshold
llm_guardrail               Redact all PII before sending to LLMs                   0.3
app_logs_safe               Mask PII in logs, keep structure for debugging          0.7
analytics_pseudonymized     Replace PII with consistent fakes for analytics         0.5
customer_support_shareable  Redact sensitive PII, keep names/locations for context  0.5
strict_compliance_export    Maximum redaction for GDPR/HIPAA/CCPA compliance        0.3
dev_demo_readable           Replace PII with realistic Faker data for demos         0.5
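
The source does not define the threshold column. Assuming it is the minimum confidence score a detection needs before it is acted on, a lower value catches more (and produces more false positives), which would explain why the strictest profiles use 0.3. The sketch below illustrates that assumption with hand-written hits; keep_hits is a hypothetical helper, not a redactai function.

```python
def keep_hits(hits: list[dict], threshold: float) -> list[dict]:
    """Keep detections whose score meets the threshold (assumed semantics)."""
    return [h for h in hits if h["score"] >= threshold]


# Hand-written example scores, not real detector output.
hits = [
    {"entity_type": "US_SSN", "score": 0.85},
    {"entity_type": "PERSON", "score": 0.6},
    {"entity_type": "LOCATION", "score": 0.4},
]

print([h["entity_type"] for h in keep_hits(hits, 0.3)])  # ['US_SSN', 'PERSON', 'LOCATION']
print([h["entity_type"] for h in keep_hits(hits, 0.7)])  # ['US_SSN']
```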

CI/CD

Exit code

redactai scan ./data --ci  # exit 0 = clean, exit 1 = PII found
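
If the scan is driven from a Python build script rather than a shell step, the exit code can be interpreted as in the sketch below. The helper name pii_detected is hypothetical, and treating any code other than 0 or 1 as a tool failure is an assumption. Stand-in commands are used so the example runs without redactai installed.

```python
import subprocess
import sys


def pii_detected(cmd: list[str]) -> bool:
    """Run a scan command and map its exit code: 0 = clean, 1 = PII found."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    # Assumption: anything else means the tool itself failed.
    raise RuntimeError(
        f"scan failed with exit code {result.returncode}: {result.stderr.strip()}"
    )


# Intended use: pii_detected(["redactai", "scan", "./data", "--ci"])
# Stand-in commands that simply exit with a chosen code:
print(pii_detected([sys.executable, "-c", "raise SystemExit(0)"]))  # False
print(pii_detected([sys.executable, "-c", "raise SystemExit(1)"]))  # True
```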

GitHub Actions

- uses: actions/setup-python@v5
  with:
    python-version: "3.12"
- run: pip install redactai && python -m spacy download en_core_web_sm
- run: redactai scan ./data --ci

pre-commit

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/yourorg/redactai
    rev: v0.1.0
    hooks:
      - id: redactai-scan

Config (.redactai.yml)

Generate a starter config:

redactai init

Minimal example:

profile: llm_guardrail
threshold: 0.4
entities:
  - PERSON
  - EMAIL_ADDRESS
  - CREDIT_CARD

operators:
  PERSON:
    type: faker
    locale: en_US
  EMAIL_ADDRESS:
    type: redact

allow_list:
  - "Acme Corp"

files:
  include:
    - "**/*.csv"
    - "**/*.txt"
  exclude:
    - "**/node_modules/**"
  output_dir: ./clean

Hooks

Three hook layers (shell hooks, Python plugins, and webhooks) can fire on four events: pre_scan, on_pii_detected, post_clean, and on_error.

Shell hooks (in .redactai.yml)

hooks:
  post_clean:
    - shell: "echo 'Cleaned {{file}} -> {{output_file}}'"
  on_pii_detected:
    - shell: "notify-send 'PII found: {{entity_count}} entities in {{file}}'"
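
The {{file}} and {{entity_count}} placeholders are filled in before the command runs. How RedactAI does this internally is not documented here; the sketch below shows one plausible way such a template could be rendered, with render as a hypothetical helper. It performs no shell escaping, so values containing quotes would need extra care in practice.

```python
import re


def render(template: str, context: dict[str, object]) -> str:
    """Fill {{name}} placeholders; unknown names are left untouched."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(context[key]) if key in context else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


cmd = render(
    "echo 'Cleaned {{file}} -> {{output_file}}'",
    {"file": "data.csv", "output_file": "clean/data.csv"},
)
print(cmd)  # echo 'Cleaned data.csv -> clean/data.csv'
```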

Python plugins

from redactai.hooks import on_pii_detected, HookEvent

@on_pii_detected
def alert(event: HookEvent):
    print(f"Found {event.entity_count} entities in {event.file}")

Webhooks

hooks:
  on_pii_detected:
    - url: "https://hooks.slack.com/services/..."

MCP Server

Expose RedactAI as tools for Claude Desktop or other AI agents:

pip install redactai[mcp]
redactai mcp  # starts stdio transport

Add to Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "redactai": {
      "command": "redactai",
      "args": ["mcp"]
    }
  }
}

API Server

# Start as background daemon (auto-starts on first CLI call)
redactai server start

# Or run in foreground
redactai server start --foreground

# Manage
redactai server status
redactai server restart
redactai server stop

The daemon exposes a REST API at http://localhost:8000 with endpoints:

  • POST /analyze -- detect PII entities
  • POST /anonymize -- anonymize text
  • POST /upload -- process files (multipart)
  • GET /health -- health check
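
A minimal client sketch for the /analyze endpoint, using only the standard library. The JSON body shape ({"text": ...}) is an assumption, since the request schema is not documented here, and build_analyze_request is a hypothetical helper. The example only builds the request, so it runs without the daemon.

```python
import json
import urllib.request


def build_analyze_request(
    text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST request for /analyze (payload field name is assumed)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_analyze_request("My SSN is 123-45-6789")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/analyze

# With the daemon running, the request could be sent like this:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp))
```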

Persistence

Tokens and the audit log are stored in Postgres when REDACTAI_DATABASE_URL is set. Without it, both fall back to in-process memory (convenient for local dev and tests, but lost on restart and not safe across multiple instances).

# Supabase pooled connection (recommended for FastAPI)
export REDACTAI_DATABASE_URL="postgresql://postgres.PROJECT:PASSWORD@REGION.pooler.supabase.com:6543/postgres"

The schema is applied automatically on startup via CREATE TABLE IF NOT EXISTS. Raw tokens are never stored: only a SHA-256 hash and the first 12 characters of each token are kept. Revoked tokens are retained with a revoked_at timestamp for audit purposes.
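
The hash-plus-prefix scheme can be sketched with the standard library as follows. This illustrates the idea only and is not RedactAI's actual storage code; token_record and verify are hypothetical names, and the record layout is an assumption.

```python
import hashlib
import hmac
import secrets


def token_record(raw_token: str) -> dict:
    """Return what would be persisted: a short prefix and a SHA-256 digest."""
    return {
        "prefix": raw_token[:12],  # enough to identify a token in listings
        "sha256": hashlib.sha256(raw_token.encode("utf-8")).hexdigest(),
    }


def verify(raw_token: str, record: dict) -> bool:
    """Hash the presented token and compare in constant time."""
    candidate = hashlib.sha256(raw_token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, record["sha256"])


token = secrets.token_urlsafe(32)
record = token_record(token)
print(verify(token, record))        # True
print(verify(token + "x", record))  # False
```

Because only the digest is stored, a leaked table does not reveal usable tokens, while the prefix still lets an operator tell tokens apart.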

Testing

The default test suite runs entirely in-memory (no Postgres required). Live-database smoke tests are marked @pytest.mark.postgres and auto-skip unless REDACTAI_DATABASE_URL is set:

# Default — in-memory only
pytest

# Include Postgres smoke tests (requires live DB)
set -a; source .env; set +a
pytest -m postgres

File Types

Supported: .txt, .csv, .pdf, .docx, .png, .jpg, .jpeg, .bmp, .tiff, .json

Image Redaction

Redact PII from images using OCR:

# Single image
redactai redact-image screenshot.png -o screenshot.redacted.png

# Batch directory
redactai redact-image ./screenshots -o ./screenshots.redacted

# Custom fill color
redactai redact-image photo.jpg --fill "255,192,203"

Structured Data

Anonymize PII in CSV and JSON files with column-aware detection:

# CSV
redactai structured data.csv -o data.clean.csv

# JSON
redactai structured data.json -o data.clean.json

# Custom strategy
redactai structured data.csv --strategy highest_confidence

Pseudonymization

Keeps fake↔real mappings consistent across files and sessions: the same input always produces the same output.

# Pseudonymize with deterministic seed
redactai pseudonymize data.txt --seed "project-alpha" --store mappings.json

# Restore originals
redactai pseudonymize data.pseudonymized.txt --restore --store mappings.json

# Show mapping stats
redactai pseudonymize data.txt --show-mapping --seed "project-alpha"
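
One common way to get "same input, same output" is to derive the replacement from a keyed hash of the original value. The sketch below uses HMAC-SHA256 with the seed as the key and keeps a fake-to-real store for restoration. This is a swapped-in illustration, not RedactAI's algorithm; the name list, the suffix format, and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names.
FAKE_NAMES = ["Marcia Wells", "Tom Baker", "Ines Duarte", "Ravi Menon"]


def pseudonym(value: str, seed: str) -> str:
    """Derive a stable fake value from the seed and the original value."""
    digest = hmac.new(seed.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).digest()
    # The hex suffix makes collisions between distinct inputs unlikely.
    return f"{FAKE_NAMES[digest[0] % len(FAKE_NAMES)]} #{digest.hex()[:8]}"


def pseudonymize(values: list[str], seed: str, store: dict[str, str]) -> list[str]:
    """Replace each value and record fake -> real for later restoration."""
    out = []
    for value in values:
        fake = pseudonym(value, seed)
        store[fake] = value
        out.append(fake)
    return out


store: dict[str, str] = {}
fakes = pseudonymize(["John Smith", "Jane Doe", "John Smith"], "project-alpha", store)
print(fakes[0] == fakes[2])       # True
print([store[f] for f in fakes])  # ['John Smith', 'Jane Doe', 'John Smith']
```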

Multi-Language Support

20+ languages with dedicated spaCy models:

# List all supported languages
redactai languages

# Use a specific language
redactai clean document.txt --language de   # German
redactai clean document.txt --language ja   # Japanese
redactai clean document.txt --language zh   # Chinese

Custom NER Recognizers

Plug in Transformers, GLiNER, or Flair models for domain-specific detection:

# Add a Transformers recognizer
redactai add-recognizer transformers --model obi/deid_roberta_i2b2 --threshold 0.5

# Add GLiNER zero-shot recognizer
redactai add-recognizer gliner --model urchade/gliner_medium-v2.1 --labels "PERSON,EMAIL,PHONE"

# Add Flair recognizer
redactai add-recognizer flair --model ner-multi

PDF Annotation

Highlight PII in PDFs without destroying the original. Perfect for legal review and audit trails.

# Annotate with highlights
redactai annotate-pdf document.pdf -o document.annotated.pdf

# Use underline instead of highlight
redactai annotate-pdf document.pdf --type underline --color "0.0,0.0,1.0"

# Generate a PII report (JSON, CSV, or text)
redactai annotate-pdf document.pdf --report --report-format json

Evaluation

Benchmark detection quality against ground truth labels. Critical for audit evidence.

# Run evaluation against ground truth
redactai evaluate ground_truth.json -o report --format both

# Custom threshold and entities
redactai evaluate ground_truth.json --threshold 0.5 --entities PERSON,EMAIL_ADDRESS

Ground truth format (ground_truth.json):

[
  {
    "text": "My name is John Smith and email is john@example.com",
    "entities": [
      {"entity_type": "PERSON", "start": 11, "end": 21},
      {"entity_type": "EMAIL_ADDRESS", "start": 35, "end": 51}
    ]
  }
]
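
Hand-counted offsets are easy to get wrong, so it can help to slice the text with each span before running an evaluation. The sketch below checks a record in the format above and computes exact-match precision and recall. The helper names and the sample record are hypothetical, and exact matching on (entity_type, start, end) is an assumption about scoring, not a description of what redactai evaluate does.

```python
def span_texts(record: dict) -> list[tuple[str, str]]:
    """Return (entity_type, covered text) pairs so offsets can be checked by eye."""
    text = record["text"]
    out = []
    for ent in record["entities"]:
        if not 0 <= ent["start"] < ent["end"] <= len(text):
            raise ValueError(f"span out of range: {ent}")
        out.append((ent["entity_type"], text[ent["start"]:ent["end"]]))
    return out


def _key(ent: dict) -> tuple[str, int, int]:
    return (ent["entity_type"], ent["start"], ent["end"])


def precision_recall(truth: list[dict], predicted: list[dict]) -> tuple[float, float]:
    """Exact-match scoring on (entity_type, start, end) triples."""
    t = {_key(e) for e in truth}
    p = {_key(e) for e in predicted}
    matched = len(t & p)
    return (matched / len(p) if p else 0.0, matched / len(t) if t else 0.0)


record = {
    "text": "Contact Ann Lee at ann@example.com",
    "entities": [
        {"entity_type": "PERSON", "start": 8, "end": 15},
        {"entity_type": "EMAIL_ADDRESS", "start": 19, "end": 34},
    ],
}
print(span_texts(record))  # [('PERSON', 'Ann Lee'), ('EMAIL_ADDRESS', 'ann@example.com')]

# A detector that found only the name: perfect precision, half the recall.
predicted = [{"entity_type": "PERSON", "start": 8, "end": 15}]
print(precision_recall(record["entities"], predicted))  # (1.0, 0.5)
```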

License

Apache 2.0 — see LICENSE.

Decision Trace

Explain exactly why each PII entity was detected — perfect for audit compliance and debugging.

# Show detailed decision trace
redactai trace --text "My name is John Smith and email is john@example.com"

# Trace from file
redactai trace --file document.txt --format json -o trace.json

# Show as markdown
redactai trace --file document.txt --format markdown

DICOM Medical Redaction

HIPAA-compliant de-identification of medical images. Redacts both pixel text (OCR) and metadata tags.

# Single DICOM file
redactai redact-dicom scan.dcm -o scan.redacted.dcm

# Batch directory
redactai redact-dicom ./dicom_folder -o ./dicom_redacted

# Clean pixels only, keep metadata
redactai redact-dicom scan.dcm --no-clean-metadata

K-Anonymity

Statistical anonymization guarantees: each record is indistinguishable from at least k-1 others on the chosen quasi-identifier columns.

# Apply k-anonymity (k=5)
redactai k-anonymity data.csv age,zip,gender -k 5 -o data.anonymous.csv

# With l-diversity check
redactai k-anonymity data.csv age,zip -k 5 --sensitive disease --l 3
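
The k-anonymity property itself is simple to verify: group rows by their quasi-identifier values and check that the smallest group has at least k members. The sketch below does that check on already-generalized rows. It does not perform the generalization step, and the function names and sample rows are hypothetical.

```python
from collections import Counter


def smallest_group(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def is_k_anonymous(rows: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    return smallest_group(rows, quasi_identifiers) >= k


# Made-up, already-generalized rows (age bands, truncated zip codes).
rows = [
    {"age": "30-39", "zip": "941**", "disease": "flu"},
    {"age": "30-39", "zip": "941**", "disease": "asthma"},
    {"age": "40-49", "zip": "100**", "disease": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], 2))      # False: the 40-49 group has one row
print(is_k_anonymous(rows[:2], ["age", "zip"], 2))  # True
```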

Streaming Processing

Real-time PII masking for logs, telemetry, and data pipelines.

# Process log file
redactai stream app.log -o app.masked.log

# Process stdin (pipe)
tail -f /var/log/app.log | redactai stream

# Custom entities
redactai stream app.log --entities EMAIL_ADDRESS,IP_ADDRESS
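
Streaming masking works line by line, so memory use stays flat no matter how large the log is. The sketch below shows the pattern with a generator. The two regular expressions are deliberately simplistic stand-ins for real detectors, and mask_stream and PATTERNS are hypothetical names.

```python
import re
from typing import Iterable, Iterator

# Simplistic stand-in patterns; a real detector would be far more careful.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_stream(lines: Iterable[str], entities: list[str]) -> Iterator[str]:
    """Yield each line with the selected entity types replaced by placeholders."""
    for line in lines:
        for name in entities:
            line = PATTERNS[name].sub(f"<{name}>", line)
        yield line


log = ["login ok user=ann@example.com ip=10.0.0.7\n", "heartbeat\n"]
for masked in mask_stream(log, ["EMAIL_ADDRESS", "IP_ADDRESS"]):
    print(masked, end="")
# login ok user=<EMAIL_ADDRESS> ip=<IP_ADDRESS>
# heartbeat
```

Passing sys.stdin as lines and writing each result to sys.stdout gives the same behavior for piped input.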

Synthetic Data Generation

Generate realistic but fake datasets that preserve statistical patterns without exposing real PII.

# Generate synthetic CSV from real data
redactai synthetic real_data.csv -o synthetic.csv --num 1000

# Generate synthetic text
redactai synthetic "My name is John Smith, email john@test.com" -o synthetic.json

# Reproducible with seed
redactai synthetic data.csv --seed 42 --locale en_GB
