kloak

Kloak your data before it touches AI.

Requires Python 3.11+.

Local-first PII + secrets redaction for Python. Built on Microsoft Presidio. pip install and go — no API keys, no cloud, no data leaving your machine.

import kloak

result = kloak.redact("Email me at ahmad@mail.com, my key is sk-proj-abc123xyz")
print(result.text)
# → 'Email me at <EMAIL_ADDRESS>, my key is <OPENAI_API_KEY>'

Why kloak?

The moment you send text to an LLM API, you've lost control of it. Kloak strips PII and secrets locally, before the API call — not after.
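As a rough illustration of that pre-flight ordering, the sketch below wraps an LLM call so that redaction always runs before anything leaves the process. `guarded_call`, `fake_redact`, and `fake_llm` are hypothetical stand-ins and not part of kloak; in practice the redactor could be something like `lambda t: kloak.redact(t).text`.

```python
import re
from typing import Callable, List


def guarded_call(text: str,
                 redact: Callable[[str], str],
                 send: Callable[[str], str]) -> str:
    """Redact locally first, then hand only the cleaned text to the sender."""
    cleaned = redact(text)
    return send(cleaned)


# Hypothetical stand-ins so the sketch runs without kloak or a network.
def fake_redact(text: str) -> str:
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "<EMAIL_ADDRESS>", text)


seen_by_llm: List[str] = []


def fake_llm(prompt: str) -> str:
    seen_by_llm.append(prompt)  # record what actually left the process
    return f"echo: {prompt}"


reply = guarded_call("Email me at ahmad@mail.com", fake_redact, fake_llm)
```

Passing the redactor and sender in as arguments keeps the ordering in one place, so a caller cannot reach the sender without going through redaction.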

kloak LLM Guard LangChain Presidio AWS Comprehend
Simple redact()<ENTITY> ✅ core API via scanner config open issue #14328
Secrets (API keys, tokens) ✅ GitLeaks rules basic partial
Regional PII (MyKad, PDPA…) ✅ extras system
Zero-dep regex-only mode pip install kloak N/A
Local-first, zero network mostly mostly ❌ sends to cloud
Air-gap compatible
pip install → works in <10s ❌ (~500 MB deps) ❌ (needs spaCy) N/A

Install

pip install kloak                            # Core: regex-only, zero NLP deps, works anywhere
pip install "kloak[nlp]"                     # + spaCy NER (names, orgs, locations)
pip install "kloak[malaysian]"               # + Malaysian PII (MyKad, phones, bank accounts, SSM)
pip install "kloak[gitleaks]"                # + GitLeaks secrets detection (API keys, tokens)
pip install "kloak[nlp,malaysian,gitleaks]"  # Full stack (quotes keep zsh from globbing the brackets)

The core install is ~50 MB, no NLP models, no compilation. Catches emails, phones, credit cards, IPs, IBANs, URLs, and more via regex.
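To give a feel for what regex-only detection looks like, here is a minimal stdlib sketch. The two patterns are illustrative assumptions and are much simpler than real-world recognizers; this is not kloak's implementation.

```python
import re

# Illustrative patterns only; kloak's real recognizers are not shown here.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def regex_redact(text: str) -> str:
    """Replace every match of each pattern with a <TYPE> placeholder."""
    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity_type}>", text)
    return text


print(regex_redact("Ping 10.0.0.7 or mail ops@example.org"))
# → Ping <IP_ADDRESS> or mail <EMAIL_ADDRESS>
```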


Usage

Redact (irreversible)

import kloak

result = kloak.redact("My IC is 880101-01-1234 and email is ahmad@mail.com")
print(result.text)
# → 'My IC is <MY_IC> and email is <EMAIL_ADDRESS>'

# Inspect what was detected
for e in result.entities:
    print(e.type, e.start, e.end, e.score)
# → MY_IC 9 23
# → EMAIL_ADDRESS 37 51
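The start and end offsets are enough to rebuild the redacted string yourself, for example to use a custom placeholder. The sketch below assumes end offsets are exclusive, as in Python slicing; `Span` and `apply_spans` are hypothetical helpers, and the offsets are counted by hand from the sample string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:  # hypothetical stand-in for one detected entity
    type: str
    start: int
    end: int  # exclusive, as in Python slicing


def apply_spans(text: str, spans: List[Span]) -> str:
    """Replace each span with <TYPE>, working right to left so earlier
    offsets stay valid while the string changes length."""
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"<{s.type}>" + text[s.end:]
    return text


text = "My IC is 880101-01-1234 and email is ahmad@mail.com"
spans = [Span("MY_IC", 9, 23), Span("EMAIL_ADDRESS", 37, 51)]
print(apply_spans(text, spans))
# → My IC is <MY_IC> and email is <EMAIL_ADDRESS>
```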

Filter what gets redacted

import kloak

text = "Ahmad's IC is 880101-01-1234 and his email is ahmad@mail.com"

# Only redact specific types
result = kloak.redact(text, include=["EMAIL_ADDRESS", "MY_IC"])

# Skip specific types
result = kloak.redact(text, exclude=["PERSON"])

# include takes priority over exclude if both are passed
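One way to read the priority rule is that `exclude` is simply ignored whenever `include` is given. The sketch below encodes that reading; `select_types` is a hypothetical helper, and kloak's actual handling may differ.

```python
from typing import Iterable, List, Optional


def select_types(detected: Iterable[str],
                 include: Optional[Iterable[str]] = None,
                 exclude: Optional[Iterable[str]] = None) -> List[str]:
    """Return the entity types to redact.

    Assumed semantics: when both arguments are given, include wins
    and exclude is ignored.
    """
    if include is not None:
        wanted = set(include)
        return [t for t in detected if t in wanted]
    if exclude is not None:
        skipped = set(exclude)
        return [t for t in detected if t not in skipped]
    return list(detected)


found = ["EMAIL_ADDRESS", "PERSON", "MY_IC"]
print(select_types(found, include=["EMAIL_ADDRESS", "MY_IC"], exclude=["EMAIL_ADDRESS"]))
# → ['EMAIL_ADDRESS', 'MY_IC']
```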

Check which backend is active

print(kloak.backend)
# → "spacy:en_core_web_sm"   (if kloak[nlp] installed)
# → "regex-only"             (core install, no spaCy)
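A plausible way to arrive at such a label is to check whether spaCy is importable and fall back to regex otherwise. The sketch below is an assumption about that logic rather than kloak's code; `pick_backend` is a hypothetical name, and the mode values mirror the `auto | spacy | regex` options of `KLOAK_NLP_BACKEND`.

```python
import importlib.util


def pick_backend(mode: str = "auto", model: str = "en_core_web_sm") -> str:
    """Resolve a backend label from a mode of 'auto', 'spacy', or 'regex'."""
    if mode not in ("auto", "spacy", "regex"):
        raise ValueError(f"unknown backend mode: {mode!r}")
    # find_spec returns None when the package cannot be imported.
    has_spacy = importlib.util.find_spec("spacy") is not None
    if mode == "regex" or (mode == "auto" and not has_spacy):
        return "regex-only"
    if not has_spacy:
        raise RuntimeError("spaCy backend requested but spaCy is not installed")
    return f"spacy:{model}"


print(pick_backend("regex"))
# → regex-only
```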

KloakEngine for repeated calls

from kloak import KloakEngine

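# Build the engine once, then reuse it for every text
engine = KloakEngine(language="en", score_threshold=0.6)

my_texts = ["Email me at ahmad@mail.com", "My IC is 880101-01-1234"]
for text in my_texts:
    result = engine.redact(text)
    print(result.text)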
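The point of a reusable engine is that setup cost is paid once rather than per call. The sketch below shows the idea with a hypothetical `TinyEngine` that compiles its patterns in the constructor. As a simplification it applies the threshold per rule at construction time, whereas a real engine would more likely filter individual detections by score. Patterns and scores are illustrative.

```python
import re


class TinyEngine:
    """Hypothetical stand-in: compile patterns once, reuse across calls."""

    # (entity type, regex, confidence score); values are illustrative.
    RULES = [
        ("EMAIL_ADDRESS", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", 1.0),
        ("PHONE_NUMBER", r"\b\d{3}-\d{7,8}\b", 0.4),
    ]

    def __init__(self, score_threshold: float = 0.6):
        # Rules scoring below the threshold are dropped here, once.
        self._compiled = [
            (name, re.compile(rx))
            for name, rx, score in self.RULES
            if score >= score_threshold
        ]

    def redact(self, text: str) -> str:
        for name, pattern in self._compiled:
            text = pattern.sub(f"<{name}>", text)
        return text


engine = TinyEngine(score_threshold=0.6)
cleaned = [engine.redact(t) for t in ["a@b.co", "call 012-3456789"]]
```

With the threshold at 0.6 the low-confidence phone rule never fires, so the second string comes back unchanged; lowering the threshold to 0.3 would redact it.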

Extras

Malaysian PII (kloak[malaysian])

Zero extra dependencies — just regex patterns for Malaysian-specific PII:

result = kloak.redact("IC: 880101-01-1234, phone: 012-3456789, SSM: 123456-A")
# → 'IC: <MY_IC>, phone: <MY_PHONE_NUMBER>, SSM: <MY_SSM>'

Entities: MY_IC, MY_PHONE_NUMBER, MY_LANDLINE, MY_SSM, MY_BANK_ACCOUNT
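For a sense of what a MyKad pattern involves, here is a stdlib sketch that checks the YYMMDD-PB-###G shape and rejects impossible birth dates. The regex and the `looks_like_mykad` helper are assumptions for illustration, not the patterns shipped in kloak[malaysian].

```python
import re
from datetime import datetime

# Illustrative pattern for the YYMMDD-PB-###G layout; not kloak's own regex.
MYKAD = re.compile(r"(\d{6})-(\d{2})-(\d{4})")


def looks_like_mykad(candidate: str) -> bool:
    """True if the string has the MyKad shape and a plausible birth date."""
    m = MYKAD.fullmatch(candidate)
    if m is None:
        return False
    try:
        # The first six digits encode the date of birth as YYMMDD.
        datetime.strptime(m.group(1), "%y%m%d")
    except ValueError:
        return False
    return True


print(looks_like_mykad("880101-01-1234"))
# → True
print(looks_like_mykad("881340-01-1234"))
# → False
```

Validating the date portion is one way to cut false positives from order numbers and other strings that happen to share the digit layout.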

Secrets detection (kloak[gitleaks])

Dynamically loads GitLeaks rules on first use, caches locally at ~/.kloak/gitleaks_rules.toml, refreshes weekly. Works offline after first fetch.

result = kloak.redact("stripe key: sk_live_abc123, github: ghp_xyz456")
# → 'stripe key: <STRIPE_ACCESS_TOKEN>, github: <GITHUB_PAT>'

Covers: Stripe, OpenAI, GitHub, GitLab, AWS, GCP, Shopify, Twilio, PEM private keys, and 250+ more rules.
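The cache behaviour described above (fetch on first use, refresh when stale, fall back to the cached copy on failure) can be sketched with the stdlib alone. `needs_refresh`, `load_rules`, and the `fetch` callable are hypothetical; this is not kloak's loader, and the 168-hour default simply mirrors the weekly refresh.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional


def needs_refresh(cache_path: Path, ttl_hours: float = 168,
                  now: Optional[float] = None) -> bool:
    """True if the cached rules file is missing or older than the TTL."""
    if not cache_path.exists():
        return True
    now = time.time() if now is None else now
    age_hours = (now - cache_path.stat().st_mtime) / 3600
    return age_hours > ttl_hours


def load_rules(cache_path: Path, fetch: Callable[[], str]) -> str:
    """Refresh when stale; on fetch failure fall back to the cached copy."""
    if needs_refresh(cache_path):
        try:
            cache_path.parent.mkdir(parents=True, exist_ok=True)
            cache_path.write_text(fetch())
        except OSError:
            pass  # offline or unwritable: keep whatever is cached
    return cache_path.read_text() if cache_path.exists() else ""


def offline() -> str:
    raise OSError("no network")  # simulates a failed download


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "gitleaks_rules.toml"
    first = load_rules(cache, lambda: "rules-v1")   # cache miss, so fetch runs
    stale = time.time() - 200 * 3600
    os.utime(cache, (stale, stale))                 # pretend the file is 200 hours old
    second = load_rules(cache, offline)             # fetch fails, cached copy is used
```

Because `fetch()` is evaluated before `write_text` touches the file, a failed download leaves the existing cache intact.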

NLP / NER (kloak[nlp])

Adds spaCy en_core_web_sm for name, organisation, and location detection:

# Without [nlp]: names are missed
kloak.redact("Ahmad called the office").text
# → 'Ahmad called the office'

# With [nlp]: NER catches the name
kloak.redact("Ahmad called the office").text
# → '<PERSON> called the office'

Local-first guarantees

  • Zero network calls in core. Emails, phones, ICs, credit cards — all processed in-memory on your machine.
  • [gitleaks] fetches one file on first use, then works fully offline. Fetch failure falls back to cached rules — never crashes.
  • No telemetry, no phone-home, no usage tracking. Ever.
  • Input text never touches disk. Processing is in-memory; nothing is logged by default.
  • Air-gap compatible. Install kloak from a pre-downloaded wheel or an internal package mirror, and it runs with zero internet. Pre-cache the GitLeaks TOML if needed and point KLOAK_GITLEAKS_CACHE_PATH at it.

Configuration

KLOAK_NLP_BACKEND=auto                    # auto | spacy | regex
KLOAK_SPACY_MODEL=en_core_web_sm          # override spaCy model
KLOAK_SECRETS_REFRESH_HOURS=168           # gitleaks cache TTL (default: weekly)
KLOAK_GITLEAKS_CACHE_PATH=~/.kloak/gitleaks_rules.toml
KLOAK_LOG_LEVEL=INFO
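As a sketch of how these variables might be consumed, the snippet below reads them into a typed settings object. It assumes the values listed above are the defaults; `Settings` and `load_settings` are hypothetical names and not part of kloak's API.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    nlp_backend: str
    spacy_model: str
    secrets_refresh_hours: float
    gitleaks_cache_path: Path
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the KLOAK_* variables, falling back to the assumed defaults."""
    backend = env.get("KLOAK_NLP_BACKEND", "auto")
    if backend not in {"auto", "spacy", "regex"}:
        raise ValueError(
            f"KLOAK_NLP_BACKEND must be auto, spacy, or regex; got {backend!r}"
        )
    return Settings(
        nlp_backend=backend,
        spacy_model=env.get("KLOAK_SPACY_MODEL", "en_core_web_sm"),
        secrets_refresh_hours=float(env.get("KLOAK_SECRETS_REFRESH_HOURS", "168")),
        gitleaks_cache_path=Path(
            env.get("KLOAK_GITLEAKS_CACHE_PATH", "~/.kloak/gitleaks_rules.toml")
        ).expanduser(),
        log_level=env.get("KLOAK_LOG_LEVEL", "INFO"),
    )


settings = load_settings({
    "KLOAK_NLP_BACKEND": "regex",
    "KLOAK_SECRETS_REFRESH_HOURS": "24",
    "KLOAK_GITLEAKS_CACHE_PATH": "/tmp/gitleaks_rules.toml",
})
```

Taking the environment as a mapping argument makes the parsing easy to test without mutating `os.environ`.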

What kloak does NOT do

  • No prompt injection detection — use LLM Guard or NeMo Guardrails
  • No output scanning — kloak runs pre-flight, before text reaches the LLM
  • No token map storage: redact() strips permanently; if you need reversible tokenisation, that's a future extra
  • No business-context sensitivity — kloak detects PII patterns, not confidential business information
  • No streaming / audio — text only, processed in batch
  • No encryption — kloak redacts (removes/replaces), not encrypts

Kloak's job: text in → PII and secrets stripped → clean text out.


Contributing

Kloak is built on a modular extras system — there are many ways to contribute beyond the core:

  • Add your country's PII — regional extras are self-contained regex patterns + tests. Copy kloak/extras/malaysian/ as a starting point. Singapore, Indonesia, Thailand, EU — each country is a standalone PR.
  • Add a new recognizer to an existing extra — e.g. Malaysian passport numbers, additional bank formats.
  • Add a new extra — messaging platforms (Telegram, Signal), compliance logging, new secret sources. Each extra is an independent module under kloak/extras/.
  • Improve detection accuracy — better regex patterns, context word tuning, edge case handling.
  • Core improvements — performance, new API features, better NLP backend support.

Every extra follows the same structure:

kloak/extras/<name>/
├── __init__.py
├── recognizers.py          # PatternRecognizer objects
└── test_fixtures.json      # Sample inputs + expected entities

Every recognizer is a Presidio PatternRecognizer:

from presidio_analyzer import Pattern, PatternRecognizer

PatternRecognizer(
    supported_entity="SG_NRIC",
    patterns=[Pattern("sg_nric", r"[STFGM]\d{7}[A-Z]", score=0.85)],
    context=["nric", "ic", "identity"],
)

Add tests under tests/extras/<name>/ and open a PR. See CLAUDE.md for the full checklist.
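Before wiring a pattern into a recognizer, it can help to run it against a handful of positive and negative samples. The stdlib sketch below does that for the SG_NRIC regex shown above. The fixture layout is hypothetical (the real test_fixtures.json schema may differ), and the word boundaries are an addition for this sketch.

```python
import json
import re

# Same regex as the SG_NRIC example, wrapped in word boundaries for this sketch.
SG_NRIC = re.compile(r"\b[STFGM]\d{7}[A-Z]\b")

# Hypothetical fixture layout; the real test_fixtures.json schema may differ.
FIXTURES = json.loads("""
[
  {"text": "NRIC: S1234567D", "expected": ["S1234567D"]},
  {"text": "order ref X1234567D", "expected": []},
  {"text": "ids T7654321Z and G0000001A", "expected": ["T7654321Z", "G0000001A"]}
]
""")


def run_fixtures(pattern, fixtures):
    """Return the fixtures whose matches differ from the expected list."""
    return [f for f in fixtures if pattern.findall(f["text"]) != f["expected"]]


failures = run_fixtures(SG_NRIC, FIXTURES)
print(failures)
# → []
```

Including a near-miss such as `X1234567D` in the fixtures is a cheap guard against patterns that are too permissive.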


License

Apache 2.0 — use it anywhere, including commercial projects.

