Lightweight, pluggable markdown redaction library for LLM pipelines
markdown-redactor
A lightweight, pluggable Python library that redacts sensitive information from Markdown before content is sent to LLMs.
It is designed for teams that need practical safety controls without adding heavy dependencies or complex infrastructure.
First 60 seconds
If you want a fast smoke test:
pip install markdown-redactor
printf "Contact me at jane@example.com\n" | markdown-redactor -
Expected output:
Contact me at [REDACTED]
From this point, move to the Quickstart (5 minutes) for API and CLI examples.
Table of contents
- First 60 seconds
- Who is this for
- Key features
- Quickstart (5 minutes)
- Python API guide
- CLI guide
- Makefile shortcuts
- Copy/paste recipes
- How redaction works
- Built-in redaction rules
- Writing custom rules (plugin model)
- Performance and Big-O
- Security and compliance notes
- Troubleshooting
- Additional resources
- Development and contribution
- Release process
Who is this for
- Teams feeding Markdown documents into LLMs (RAG, agents, chat pipelines)
- Security-conscious teams that need deterministic redaction before inference
- Developers who want a small codebase with extensible rules
Key features
- Pluggable architecture: register custom redaction rules without touching the core engine
- Markdown-aware behavior: by default, skips fenced code blocks and inline code spans
- Lightweight runtime: zero runtime dependencies
- Typed API: strict typing-friendly design
- Operational visibility: per-rule match counters and timing stats
Quickstart (5 minutes)
1) Install
Install from package index:
pip install markdown-redactor
Or install from source:
pip install -e .
2) Redact text in Python
from markdown_redactor import create_default_engine
engine = create_default_engine()
markdown = """
Contact: jane@example.com
Server IP: 10.0.0.1
Token: ghp_ABCDEF1234567890
"""
result = engine.redact(markdown)
print(result.content)
print(result.stats.total_matches)
print(result.stats.rule_matches)
3) Redact from CLI
markdown-redactor input.md -o output.md --stats
Python API guide
Create the default engine
from markdown_redactor import create_default_engine
engine = create_default_engine()
Basic redaction
result = engine.redact("Email me at jane@example.com")
print(result.content)
Configure masking and markdown behavior
from markdown_redactor import RedactionConfig
config = RedactionConfig(
    mask="<redacted>",
    replacement_mode="full",
    skip_fenced_code_blocks=True,
    skip_inline_code=True,
)
result = engine.redact(content, config=config)
Replacement modes
Available modes:
- full: replace the whole match with mask
- preserve_last4: keep the last 4 alphanumeric characters
- preserve_format: keep separators like -, ., (, ) while masking characters
config = RedactionConfig(replacement_mode="preserve_last4")
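To make the three modes concrete, here is a minimal standalone sketch of how a single matched value might be rendered under each one. It does not call the library and is not its implementation: the function name render_replacement, the `*` mask character, and the choice to keep separators in preserve_last4 are all assumptions made for illustration.

```python
def render_replacement(
    value: str,
    mode: str,
    mask: str = "[REDACTED]",
    mask_char: str = "*",  # assumed masking character, not confirmed by the library
) -> str:
    """Illustrative rendering of one matched value (hypothetical helper)."""
    if mode == "full":
        # Replace the whole match with the mask string.
        return mask
    if mode == "preserve_format":
        # Mask letters and digits, keep separators such as - . ( ) and spaces.
        return "".join(mask_char if ch.isalnum() else ch for ch in value)
    if mode == "preserve_last4":
        # Keep the last four alphanumeric characters, mask the earlier ones.
        # Assumption: separators are left in place.
        keep_from = sum(ch.isalnum() for ch in value) - 4
        seen = 0
        out = []
        for ch in value:
            if ch.isalnum():
                out.append(ch if seen >= keep_from else mask_char)
                seen += 1
            else:
                out.append(ch)
        return "".join(out)
    raise ValueError(f"unknown mode: {mode}")


print(render_replacement("4111-1111-1111-1111", "preserve_last4"))
print(render_replacement("(555) 123-4567", "preserve_format"))
```

Check the real output of each mode on your own data before relying on it, since the library may differ on details such as the mask character.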
File helpers
You can redact files directly from the Python API.
from markdown_redactor import create_default_engine
engine = create_default_engine()
result = engine.redact_file("input.md")
result = engine.redact_to_file("input.md", "output.md")
Allowlist specific values
Use allowlist when a value looks sensitive but should remain visible.
config = RedactionConfig(
    allowlist=("jane@example.com", "10.0.0.1"),
)
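Conceptually, an allowlist is an exact-value check made before a match is replaced. The sketch below shows that idea with a standalone email pattern. The function redact_with_allowlist and the simplified EMAIL regex are hypothetical and only illustrate the technique; they are not the library's code.

```python
import re

# Simplified email pattern for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_with_allowlist(
    text: str,
    allowlist: tuple[str, ...],
    mask: str = "[REDACTED]",
) -> tuple[str, int]:
    """Mask every email except exact values found in the allowlist."""
    allowed = frozenset(allowlist)
    count = 0

    def replace(match: re.Match[str]) -> str:
        nonlocal count
        value = match.group(0)
        if value in allowed:
            return value  # exact allowlisted value stays visible
        count += 1
        return mask

    return EMAIL.sub(replace, text), count


print(redact_with_allowlist(
    "a: jane@example.com, b: bob@example.com",
    ("jane@example.com",),
))
```

Because the comparison is exact, a value that differs only in case or surrounding whitespace would still be redacted in this sketch.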
Enable or disable specific rules
Only enable chosen rules:
config = RedactionConfig(enabled_rule_names=("email", "jwt"))
Disable specific rules:
config = RedactionConfig(disabled_rule_names=("phone", "swift_bic"))
Add context metadata (optional)
from markdown_redactor import RuleContext
context = RuleContext(file_path="docs/customer.md", metadata={"source": "crm"})
result = engine.redact(content, context=context)
Understand returned stats
result.stats includes:
- total_matches: total number of replacements
- rule_matches: replacements grouped by rule name
- elapsed_ms: execution time for this call
- source_bytes and output_bytes: input/output size in bytes
CLI guide
Input and output
Redact a file to stdout:
markdown-redactor input.md
Read from stdin and write to stdout:
cat input.md | markdown-redactor -
Write to a file:
markdown-redactor input.md -o output.md
Useful flags
- --mask "<secret>": custom replacement value
- --replacement-mode preserve_last4: control redaction rendering
- --allowlist jane@example.com: preserve exact values
- --enable-rule email,jwt: only run selected rules
- --disable-rule phone,swift_bic: skip selected rules
- --redact-inline-code: redact inside inline code spans (default is skip)
- --redact-fenced-code-blocks: redact inside fenced blocks (default is skip)
- --stats: print stats as JSON to stderr
Example:
markdown-redactor input.md -o output.md --mask "<secret>" --stats
Examples with CLI filtering:
markdown-redactor input.md --allowlist jane@example.com --disable-rule phone
markdown-redactor input.md --enable-rule email,jwt
Makefile shortcuts
This repository includes convenient local commands:
- make lint
- make type
- make test
- make check (runs lint + type + test)
- make redact FILE=input.md OUT=output.md
Redact with additional CLI flags:
make redact FILE=input.md OUT=output.md REDACT_FLAGS="--redact-inline-code --redact-fenced-code-blocks"
Redact from stdin:
cat input.md | make redact FILE=- OUT=-
Copy/paste recipes
Use these examples as starting points for common LLM workflows.
1) RAG ingest preprocessor (single file)
Redact first, then pass clean text to your embedding/indexing pipeline.
from pathlib import Path
from markdown_redactor import create_default_engine
engine = create_default_engine()
source_path = Path("docs/customer-notes.md")
clean_path = Path("docs/customer-notes.redacted.md")
source_text = source_path.read_text(encoding="utf-8")
result = engine.redact(source_text)
clean_path.write_text(result.content, encoding="utf-8")
print(result.stats.rule_matches)
2) Chat app pre-send filter
Apply redaction before sending user-provided markdown to an LLM.
from markdown_redactor import create_default_engine
engine = create_default_engine()
def prepare_prompt(user_markdown: str) -> str:
    result = engine.redact(user_markdown)
    return result.content
3) Keep code examples unchanged (default behavior)
By default, fenced code blocks and inline code are skipped.
from markdown_redactor import create_default_engine
engine = create_default_engine()
result = engine.redact("""
My email is jane@example.com
```python
API_KEY = \"ghp_ABCDEF1234567890\"
```
Inline token: `ghp_ABCDEF1234567890`
""")
4) Strict mode for high-risk exports
If required by policy, redact inside inline and fenced code too.
markdown-redactor input.md -o output.md --redact-inline-code --redact-fenced-code-blocks
5) Batch process a folder with shell
Redact every markdown file into a sibling output folder.
mkdir -p redacted
for file in docs/*.md; do
  markdown-redactor "$file" -o "redacted/$(basename "$file")"
done
6) Batch process with Python
Useful when you need richer reporting or custom naming.
from pathlib import Path
from markdown_redactor import create_default_engine
engine = create_default_engine()
input_dir = Path("docs")
output_dir = Path("redacted")
output_dir.mkdir(exist_ok=True)
for path in input_dir.glob("*.md"):
    content = path.read_text(encoding="utf-8")
    result = engine.redact(content)
    destination = output_dir / path.name
    destination.write_text(result.content, encoding="utf-8")
    print(path.name, result.stats.total_matches)
7) Custom company identifier rule
Add a simple plugin for org-specific IDs.
import re
from dataclasses import dataclass
from markdown_redactor import RedactionConfig, RedactionEngine, RuleContext, RuleRegistry
@dataclass(frozen=True, slots=True)
class TicketRule:
    name: str = "ticket_id"
    pattern: re.Pattern[str] = re.compile(r"\bTICKET-\d{6}\b")
    def redact(
        self,
        content: str,
        config: RedactionConfig,
        context: RuleContext,
    ) -> tuple[str, int]:
        updated, count = self.pattern.subn(config.mask, content)
        return updated, count
registry = RuleRegistry()
registry.register(TicketRule())
engine = RedactionEngine(registry=registry)
8) CI check to prevent raw secrets in generated artifacts
Example step to redact docs before publishing snapshots.
make redact FILE=README.md OUT=/tmp/README.redacted.md
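A CI job can also check the redacted artifact for anything that slipped through. The sketch below is a standalone scanner and is not part of markdown-redactor: the function find_residuals and both patterns are hypothetical, deliberately narrow examples that you would replace with patterns relevant to your own data.

```python
import re

# Illustrative patterns only; real pipelines would maintain their own list.
RESIDUAL_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ghp_token": re.compile(r"\bghp_[A-Za-z0-9]{16,}\b"),
}


def find_residuals(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_label) for every line that still matches."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in RESIDUAL_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings


redacted = "Contact [REDACTED]\nToken: ghp_ABCDEF1234567890"
for line_number, label in find_residuals(redacted):
    print(f"line {line_number}: possible {label}")
```

In a CI step, a non-empty result would typically cause the job to exit with a non-zero status so the artifact is not published.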
How redaction works
- Markdown text is segmented.
- Based on config, non-redactable segments (like fenced code) can be preserved.
- Each redactable segment is processed by registered rules in order.
- Output and stats are returned.
This makes behavior explicit and easy to extend.
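The segmentation step can be pictured with a small standalone sketch. It assumes a regex-based split into code and non-code segments, which fits the statement that no Markdown AST parser is used, but the pattern and the function redact_outside_code are hypothetical and not taken from the library.

```python
import re
from typing import Callable

# Matches a fenced block (three backticks ... three backticks) or an inline code span.
CODE = re.compile(r"`{3}.*?`{3}|`[^`\n]+`", re.DOTALL)


def redact_outside_code(markdown: str, redact: Callable[[str], str]) -> str:
    """Apply `redact` to text segments and pass code segments through unchanged."""
    parts = []
    cursor = 0
    for match in CODE.finditer(markdown):
        parts.append(redact(markdown[cursor:match.start()]))  # redactable text
        parts.append(match.group(0))                          # preserved code
        cursor = match.end()
    parts.append(redact(markdown[cursor:]))                   # trailing text
    return "".join(parts)


FENCE = "`" * 3
doc = f"Mail jane@example.com\n{FENCE}\nkey = 'jane@example.com'\n{FENCE}\nInline `jane@example.com` stays."
print(redact_outside_code(doc, lambda s: s.replace("jane@example.com", "[REDACTED]")))
```

Only the first address is masked here; the copies inside the fenced block and the inline span are left alone, mirroring the default skip behavior described above.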
Built-in redaction rules
Default engine includes:
- email
- us_ssn
- us_ein
- uk_nino
- in_pan
- in_aadhaar
- in_gstin
- br_cpf
- br_cnpj
- iban
- swift_bic
- eu_vat
- labeled_sensitive_id (tax ID, driver license, passport, national ID labels)
- secret_assignment (password/api_key/token style assignments)
- credential_uri (connection-string credentials)
- phone
- ipv4
- ipv6
- aws_access_key
- generic_token
- google_api_key
- jwt
- private_key
- credit_card (Luhn-validated to reduce false positives)
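For readers unfamiliar with Luhn validation, the check mentioned for credit_card can be sketched as follows. This is the standard Luhn algorithm written as a standalone function; the name luhn_valid and the 13 to 19 digit length bound are assumptions for illustration and may not match how the built-in rule is written.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    # Assumed length bound for card numbers; adjust to your needs.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # well-known test number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A candidate that looks like a card number but fails the checksum, such as an order number of the same length, can then be left unredacted.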
Writing custom rules (plugin model)
Rules implement a simple contract:
- name: string identifier
- redact(content, config, context) -> (updated_content, match_count)
Example custom rule:
from dataclasses import dataclass
from markdown_redactor import RedactionConfig, RedactionEngine, RuleContext, RuleRegistry
@dataclass(frozen=True, slots=True)
class EmployeeIdRule:
    name: str = "employee_id"
    def redact(
        self,
        content: str,
        config: RedactionConfig,
        context: RuleContext,
    ) -> tuple[str, int]:
        updated = content.replace("EMP-", config.mask + "-")
        count = content.count("EMP-")
        return updated, count
registry = RuleRegistry()
registry.register(EmployeeIdRule())
engine = RedactionEngine(registry=registry)
result = engine.redact("Employee: EMP-001")
Rule design tips
- Keep rules deterministic and side-effect free
- Precompile regex at module load time
- Return accurate match counts for observability
- Avoid very broad patterns that over-redact business content
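The tips above can be combined in a rule that is easy to unit test without the engine. In this sketch, StubConfig is a hypothetical stand-in for RedactionConfig that carries only a mask, and the rule is assumed to read nothing else from the config or context. The pattern masks the whole identifier, digits included.

```python
import re
from dataclasses import dataclass

# Compiled once at module load time, narrow enough to avoid over-redaction.
_EMPLOYEE_ID = re.compile(r"\bEMP-\d+\b")


@dataclass(frozen=True)
class StubConfig:
    """Hypothetical stand-in for RedactionConfig, used only for testing."""
    mask: str = "[REDACTED]"


@dataclass(frozen=True)
class EmployeeIdRule:
    name: str = "employee_id"

    def redact(self, content: str, config, context) -> tuple[str, int]:
        # subn returns (new_text, number_of_replacements), so the count is
        # accurate by construction. A callable replacement keeps any
        # backslashes in the mask from being read as regex escapes.
        return _EMPLOYEE_ID.subn(lambda _match: config.mask, content)


rule = EmployeeIdRule()
print(rule.redact("Employee: EMP-001 and EMP-42", StubConfig(), None))
```

Calling the rule twice on the same input returns the same result, which is the determinism the first tip asks for.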
Tenant-specific layering (recommended)
For enterprise deployments, keep the global baseline and layer tenant rules on top.
from dataclasses import dataclass
from markdown_redactor import (
    RedactionConfig,
    RuleContext,
    create_tenant_engine,
)
@dataclass(frozen=True, slots=True)
class CustomerTicketRule:
    name: str = "customer_ticket"
    def redact(
        self,
        content: str,
        config: RedactionConfig,
        context: RuleContext,
    ) -> tuple[str, int]:
        updated = content.replace("TICKET-", f"{config.mask}-")
        count = content.count("TICKET-")
        return updated, count
engine = create_tenant_engine(
    [CustomerTicketRule()],
    include_default_rules=True,
)
You can disable default rules for tenant-only behavior:
engine = create_tenant_engine([CustomerTicketRule()], include_default_rules=False)
Performance and Big-O
Let:
- $n$ = input length
- $r$ = number of active rules
Complexity:
- Time: $O(n \cdot r)$
- Memory: $O(n)$
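A short justification of the time bound, under the assumption that each active rule scans its input in time linear in the input length (patterns with heavy backtracking can exceed this). Here $c_i$ is a hypothetical per-character cost for rule $i$:

$$
T(n, r) = \sum_{i=1}^{r} c_i \, n \le \left( \max_{1 \le i \le r} c_i \right) r \, n = O(n \cdot r)
$$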
Why this stays lightweight:
- Precompiled regex patterns in built-in rules
- No Markdown AST parsing dependency
- No network I/O, no external services, no heavy runtime libs
Security and compliance notes
- This is best-effort pattern redaction, not formal DLP certification
- Always validate on your real data and threat model
- Combine with downstream controls (access controls, logging, policy engines)
- Add organization-specific rules for identifiers, ticket IDs, or internal secrets
Troubleshooting
Nothing is being redacted
- Verify you are using create_default_engine() or registering custom rules
- Check whether content is inside fenced/inline code that is skipped by default
Too much is being redacted
- Tighten custom regex patterns
- Keep --redact-inline-code/--redact-fenced-code-blocks disabled unless required
CLI command not found
- Ensure the package is installed in the active environment
- Try module mode:
python -m markdown_redactor.cli input.md
Additional resources
- Architecture guide: docs/ARCHITECTURE.md
- FAQ: docs/FAQ.md
- Support process: SUPPORT.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
- Releasing guide: docs/RELEASING.md
- Guided onboarding docs: docs/README.md
- Runnable examples:
Development and contribution
See CONTRIBUTING.md for setup and quality checks.
Primary local quality command:
PYTHONPATH=src .venv/bin/python -m ruff check src tests && \
PYTHONPATH=src .venv/bin/python -m mypy src && \
PYTHONPATH=src .venv/bin/python -m pytest
Release process
Maintainers can follow docs/RELEASING.md.
Publishing is automated via .github/workflows/release.yml on tags matching v*.
GitHub Release notes and signed provenance attestations are generated via .github/workflows/github-release.yml.