Lightweight, pluggable markdown redaction library for LLM pipelines

markdown-redactor

A lightweight, pluggable Python library that redacts sensitive information from Markdown before content is sent to LLMs.

It is designed for teams that need practical safety controls without adding heavy dependencies or complex infrastructure.

First 60 seconds

If you want a fast smoke test:

pip install markdown-redactor
printf "Contact me at jane@example.com\n" | markdown-redactor -

Expected output:

Contact me at [REDACTED]

From this point, move to the Quickstart (5 minutes) for API and CLI examples.

Table of contents

First 60 seconds
Who is this for
Key features
Quickstart (5 minutes)
Python API guide
CLI guide
Makefile shortcuts
How redaction works
Built-in redaction rules
Writing custom rules (plugin model)
Performance and Big-O
Security and compliance notes
Troubleshooting
Additional resources
Development and contribution
Release process

Who is this for

Teams feeding Markdown documents into LLMs (RAG, agents, chat pipelines)
Security-conscious teams that need deterministic redaction before inference
Developers who want a small codebase with extensible rules

Key features

Pluggable architecture: register custom redaction rules without touching the core engine
Markdown-aware behavior: by default, skips fenced code blocks and inline code spans
Lightweight runtime: zero runtime dependencies
Typed API: strict typing-friendly design
Operational visibility: per-rule match counters and timing stats

Quickstart (5 minutes)

1) Install

Install from package index:

pip install markdown-redactor

Or install from source:

pip install -e .

2) Redact text in Python

from markdown_redactor import create_default_engine

engine = create_default_engine()

markdown = """
Contact: jane@example.com
Server IP: 10.0.0.1
Token: ghp_ABCDEF1234567890
"""

result = engine.redact(markdown)

print(result.content)
print(result.stats.total_matches)
print(result.stats.rule_matches)

3) Redact from CLI

markdown-redactor input.md -o output.md --stats

Python API guide

Create the default engine

from markdown_redactor import create_default_engine

engine = create_default_engine()

Basic redaction

result = engine.redact("Email me at jane@example.com")
print(result.content)

Configure masking and markdown behavior

from markdown_redactor import RedactionConfig, create_default_engine

engine = create_default_engine()

config = RedactionConfig(
    mask="<redacted>",
    replacement_mode="full",
    skip_fenced_code_blocks=True,
    skip_inline_code=True,
)

content = "Email me at jane@example.com"
result = engine.redact(content, config=config)

Replacement modes

Available modes:

full: replace the whole match with mask
preserve_last4: keep the last 4 alphanumeric characters
preserve_format: keep separators like -, ., (, ) while masking characters

config = RedactionConfig(replacement_mode="preserve_last4")
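
To make the three modes concrete, here is a minimal standalone sketch of what each one could produce for a single matched value. It does not call markdown-redactor: the helper names (`mask_full`, `mask_preserve_last4`, `mask_preserve_format`) and the `*` mask character are assumptions for illustration, and the library's exact output may differ.

```python
def mask_full(match: str, mask: str = "[REDACTED]") -> str:
    """Replace the whole match with the mask."""
    return mask


def mask_preserve_last4(match: str, mask_char: str = "*") -> str:
    """Mask every alphanumeric character except the last four.

    Assumption: separators are left in place so the shape stays readable.
    """
    alnum_positions = [i for i, ch in enumerate(match) if ch.isalnum()]
    keep = set(alnum_positions[-4:])
    return "".join(
        ch if (i in keep or not ch.isalnum()) else mask_char
        for i, ch in enumerate(match)
    )


def mask_preserve_format(match: str, mask_char: str = "*") -> str:
    """Mask alphanumeric characters but keep separators such as - . ( )."""
    return "".join(mask_char if ch.isalnum() else ch for ch in match)


card = "4111-1111-1111-1234"
print(mask_full(card))             # [REDACTED]
print(mask_preserve_last4(card))   # ****-****-****-1234
print(mask_preserve_format(card))  # ****-****-****-****
```

Run your own sample values through the real engine to confirm how each mode renders them before relying on a particular shape downstream.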

File helpers

You can redact files directly from the Python API.

from markdown_redactor import create_default_engine

engine = create_default_engine()

result = engine.redact_file("input.md")
result = engine.redact_to_file("input.md", "output.md")

Allowlist specific values

Use allowlist when a value looks sensitive but should remain visible.

config = RedactionConfig(
    allowlist=("jane@example.com", "10.0.0.1"),
)
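
The idea behind an allowlist can be sketched without the library: a replacement callback checks each match against the allowed values and leaves exact hits untouched. This is an illustrative stand-in, not the library's implementation; the `EMAIL` pattern and the `redact_emails` helper are assumptions.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(
    text: str,
    allowlist: tuple[str, ...] = (),
    mask: str = "[REDACTED]",
) -> tuple[str, int]:
    """Mask emails unless the exact value is allowlisted; return text and count."""
    allowed = set(allowlist)
    count = 0

    def replace(match: re.Match[str]) -> str:
        nonlocal count
        value = match.group(0)
        if value in allowed:
            return value  # exact match on the allowlist: keep visible
        count += 1
        return mask

    return EMAIL.sub(replace, text), count


print(redact_emails(
    "Ask jane@example.com or bob@example.com",
    allowlist=("jane@example.com",),
))
```

Only replaced values are counted in this sketch, so allowlisted hits do not inflate the match total.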

Enable or disable specific rules

Only enable chosen rules:

config = RedactionConfig(enabled_rule_names=("email", "jwt"))

Disable specific rules:

config = RedactionConfig(disabled_rule_names=("phone", "swift_bic"))
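
One plausible way to reason about the two options is as a filter over the registered rule names. The sketch below is a hypothetical model, not the library's code: `select_rules` is an invented helper, and the precedence it uses (apply the enable list first, then remove disabled names) is an assumption you should verify against the real engine.

```python
from collections.abc import Iterable, Sequence
from typing import Optional


def select_rules(
    available: Sequence[str],
    enabled: Optional[Iterable[str]] = None,
    disabled: Iterable[str] = (),
) -> list[str]:
    """Return the rule names that would run, preserving registration order."""
    enabled_set = None if enabled is None else set(enabled)
    disabled_set = set(disabled)
    return [
        name
        for name in available
        if (enabled_set is None or name in enabled_set)
        and name not in disabled_set
    ]


registered = ["email", "phone", "jwt", "swift_bic"]
print(select_rules(registered, enabled=("email", "jwt")))
print(select_rules(registered, disabled=("phone", "swift_bic")))
```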

Add context metadata (optional)

from markdown_redactor import RuleContext

context = RuleContext(file_path="docs/customer.md", metadata={"source": "crm"})
result = engine.redact(content, context=context)

Understand returned stats

result.stats includes:

total_matches: total number of replacements
rule_matches: replacements grouped by rule name
elapsed_ms: execution time for this call
source_bytes and output_bytes: input/output size in bytes
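
As a rough illustration of how such fields can be derived, the standalone sketch below fills a similar structure while applying two simplified patterns. `SketchStats`, `redact_with_stats`, and the patterns are hypothetical; only the field names come from the list above, and the library may compute them differently.

```python
import re
import time
from dataclasses import dataclass, field


@dataclass
class SketchStats:
    """Hypothetical stand-in mirroring the documented stats fields."""
    total_matches: int = 0
    rule_matches: dict[str, int] = field(default_factory=dict)
    elapsed_ms: float = 0.0
    source_bytes: int = 0
    output_bytes: int = 0


def redact_with_stats(
    text: str,
    rules: dict[str, re.Pattern[str]],
    mask: str = "[REDACTED]",
) -> tuple[str, SketchStats]:
    stats = SketchStats(source_bytes=len(text.encode("utf-8")))
    started = time.perf_counter()
    for name, pattern in rules.items():
        # subn returns the new text and the number of replacements made.
        text, count = pattern.subn(mask, text)
        if count:
            stats.rule_matches[name] = count
            stats.total_matches += count
    stats.elapsed_ms = (time.perf_counter() - started) * 1000
    stats.output_bytes = len(text.encode("utf-8"))
    return text, stats


rules = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}
output, stats = redact_with_stats("Mail jane@example.com from 10.0.0.1", rules)
print(output)
print(stats.rule_matches, stats.total_matches)
```

Byte sizes are measured on the UTF-8 encoding here, which is an assumption about what "bytes" means in the real stats.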

CLI guide

Input and output

Redact a file to stdout:

markdown-redactor input.md

Read from stdin and write to stdout:

cat input.md | markdown-redactor -

Write to a file:

markdown-redactor input.md -o output.md

Useful flags

--mask "<secret>": custom replacement value
--replacement-mode preserve_last4: control redaction rendering
--allowlist jane@example.com: preserve exact values
--enable-rule email,jwt: only run selected rules
--disable-rule phone,swift_bic: skip selected rules
--redact-inline-code: redact inside inline code spans (default is skip)
--redact-fenced-code-blocks: redact inside fenced blocks (default is skip)
--stats: print stats as JSON to stderr

Example:

markdown-redactor input.md -o output.md --mask "<secret>" --stats

Examples with CLI filtering:

markdown-redactor input.md --allowlist jane@example.com --disable-rule phone
markdown-redactor input.md --enable-rule email,jwt

Makefile shortcuts

This repository includes convenient local commands:

make lint
make type
make test
make check (runs lint + type + test)
make redact FILE=input.md OUT=output.md

Redact with additional CLI flags:

make redact FILE=input.md OUT=output.md REDACT_FLAGS="--redact-inline-code --redact-fenced-code-blocks"

Redact from stdin:

cat input.md | make redact FILE=- OUT=-

Copy/paste recipes

Use these examples as starting points for common LLM workflows.

1) RAG ingest preprocessor (single file)

Redact first, then pass clean text to your embedding/indexing pipeline.

from pathlib import Path

from markdown_redactor import create_default_engine

engine = create_default_engine()

source_path = Path("docs/customer-notes.md")
clean_path = Path("docs/customer-notes.redacted.md")

source_text = source_path.read_text(encoding="utf-8")
result = engine.redact(source_text)

clean_path.write_text(result.content, encoding="utf-8")
print(result.stats.rule_matches)

2) Chat app pre-send filter

Apply redaction before sending user-provided markdown to an LLM.

from markdown_redactor import create_default_engine

engine = create_default_engine()


def prepare_prompt(user_markdown: str) -> str:
    result = engine.redact(user_markdown)
    return result.content

3) Keep code examples unchanged (default behavior)

By default, fenced code blocks and inline code are skipped.

from markdown_redactor import create_default_engine

engine = create_default_engine()
result = engine.redact("""
My email is jane@example.com
```python
API_KEY = "ghp_ABCDEF1234567890"
```
Inline token: `ghp_ABCDEF1234567890`
""")

print(result.content)

4) Strict mode for high-risk exports

If required by policy, redact inside inline and fenced code too.

markdown-redactor input.md -o output.md --redact-inline-code --redact-fenced-code-blocks

5) Batch process a folder with shell

Redact every markdown file into a sibling output folder.

mkdir -p redacted
for file in docs/*.md; do
  markdown-redactor "$file" -o "redacted/$(basename "$file")"
done

6) Batch process with Python

Useful when you need richer reporting or custom naming.

from pathlib import Path

from markdown_redactor import create_default_engine

engine = create_default_engine()
input_dir = Path("docs")
output_dir = Path("redacted")
output_dir.mkdir(exist_ok=True)

for path in input_dir.glob("*.md"):
    content = path.read_text(encoding="utf-8")
    result = engine.redact(content)
    destination = output_dir / path.name
    destination.write_text(result.content, encoding="utf-8")
    print(path.name, result.stats.total_matches)

7) Custom company identifier rule

Add a simple plugin for org-specific IDs.

import re
from dataclasses import dataclass

from markdown_redactor import RedactionConfig, RedactionEngine, RuleContext, RuleRegistry


@dataclass(frozen=True, slots=True)
class TicketRule:
    name: str = "ticket_id"
    pattern: re.Pattern[str] = re.compile(r"\bTICKET-\d{6}\b")

    def redact(
        self,
        content: str,
        config: RedactionConfig,
        context: RuleContext,
    ) -> tuple[str, int]:
        updated, count = self.pattern.subn(config.mask, content)
        return updated, count


registry = RuleRegistry()
registry.register(TicketRule())
engine = RedactionEngine(registry=registry)

8) CI check to prevent raw secrets in generated artifacts

Example step to redact docs before publishing snapshots.

make redact FILE=README.md OUT=/tmp/README.redacted.md

How redaction works

Markdown text is segmented.
Based on config, non-redactable segments (like fenced code) can be preserved.
Each redactable segment is processed by registered rules in order.
Output and stats are returned.

This makes behavior explicit and easy to extend.
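
The segment-then-redact flow can be sketched in a few lines of standard-library Python. This is a simplified model under stated assumptions, not the library's segmenter: it uses one regex to find fenced blocks and inline code spans, applies a single illustrative email pattern to the prose in between, and ignores edge cases such as nested or unterminated fences. `FENCE`, `CODE_SEGMENT`, and `redact_outside_code` are invented names.

```python
import re

FENCE = "`" * 3  # triple backtick, built programmatically for readability
# A fenced block (non-greedy, spanning lines) or a single-line inline span.
CODE_SEGMENT = re.compile(rf"{FENCE}.*?{FENCE}|`[^`\n]+`", re.DOTALL)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_outside_code(markdown: str, mask: str = "[REDACTED]") -> tuple[str, int]:
    """Mask emails in prose while copying code segments through verbatim."""
    parts: list[str] = []
    total = 0
    cursor = 0
    for code in CODE_SEGMENT.finditer(markdown):
        # Redact the prose that precedes this code segment.
        prose, count = EMAIL.subn(mask, markdown[cursor:code.start()])
        parts.append(prose)
        total += count
        # Preserve the code segment unchanged.
        parts.append(code.group(0))
        cursor = code.end()
    # Redact whatever prose follows the last code segment.
    tail, count = EMAIL.subn(mask, markdown[cursor:])
    parts.append(tail)
    return "".join(parts), total + count


doc = (
    "Mail jane@example.com\n"
    f"{FENCE}\nowner = 'bob@example.com'\n{FENCE}\n"
    "Also `amy@example.com` and joe@example.com\n"
)
print(redact_outside_code(doc))
```

In this sketch only the two addresses in prose are masked; the ones inside the fenced block and the inline span pass through untouched.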

Built-in redaction rules

Default engine includes:

email
us_ssn
us_ein
uk_nino
in_pan
in_aadhaar
in_gstin
br_cpf
br_cnpj
iban
swift_bic
eu_vat
labeled_sensitive_id (tax ID, driver license, passport, national ID labels)
secret_assignment (password/api_key/token style assignments)
credential_uri (connection-string credentials)
phone
ipv4
ipv6
aws_access_key
generic_token
google_api_key
jwt
private_key
credit_card (Luhn-validated to reduce false positives)
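
For readers unfamiliar with the Luhn check mentioned for credit_card, here is a minimal sketch of the standard algorithm. It illustrates the checksum itself; how the library wires it into the rule is not shown here, and `luhn_valid` is an illustrative name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    checksum = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        checksum += digit
    return checksum % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A 16-digit order number that fails this check would be left alone by a Luhn-gated rule, which is how the validation reduces false positives.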

Writing custom rules (plugin model)

Rules implement a simple contract:

name: string identifier
redact(content, config, context) -> (updated_content, match_count)
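
If you want a type checker to enforce this contract in your own code, one option is a structural type. The sketch below is an assumption-laden illustration: `SketchConfig` and `SketchContext` are hypothetical stand-ins for `RedactionConfig` and `RuleContext`, and the `Rule` protocol is defined here rather than imported from the library, which may or may not export an equivalent.

```python
import re
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for RedactionConfig (mask field only)."""
    mask: str = "[REDACTED]"


@dataclass(frozen=True)
class SketchContext:
    """Hypothetical stand-in for RuleContext."""
    file_path: str = ""


@runtime_checkable
class Rule(Protocol):
    """Structural type: any object with a name and a matching redact()."""
    name: str

    def redact(
        self, content: str, config: SketchConfig, context: SketchContext
    ) -> tuple[str, int]: ...


ORDER_ID = re.compile(r"\bORD-\d{4}\b")  # compiled once at import time


@dataclass(frozen=True)
class OrderIdRule:
    name: str = "order_id"

    def redact(
        self, content: str, config: SketchConfig, context: SketchContext
    ) -> tuple[str, int]:
        # subn returns the new text and the exact number of replacements.
        return ORDER_ID.subn(config.mask, content)


rule = OrderIdRule()
print(isinstance(rule, Rule))
print(rule.redact("Ref ORD-1234", SketchConfig(), SketchContext()))
```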

Example custom rule:

from dataclasses import dataclass

from markdown_redactor import RedactionConfig, RedactionEngine, RuleContext, RuleRegistry


@dataclass(frozen=True, slots=True)
class EmployeeIdRule:
    name: str = "employee_id"

    def redact(
        self,
        content: str,
        config: RedactionConfig,
        context: RuleContext,
    ) -> tuple[str, int]:
        updated = content.replace("EMP-", config.mask + "-")
        count = content.count("EMP-")
        return updated, count


registry = RuleRegistry()
registry.register(EmployeeIdRule())

engine = RedactionEngine(registry=registry)
result = engine.redact("Employee: EMP-001")

Rule design tips

Keep rules deterministic and side-effect free
Precompile regex at module load time
Return accurate match counts for observability
Avoid very broad patterns that over-redact business content
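
The last two tips are easiest to see side by side. This standalone sketch compares a deliberately broad pattern with a tighter one anchored by a prefix and word boundaries, using `re.subn` so the match count always agrees with the replacements made. The `EMP-` format and both pattern names are illustrative assumptions.

```python
import re

BROAD = re.compile(r"\d+")            # matches any run of digits
TIGHT = re.compile(r"\bEMP-\d{3}\b")  # matches only EMP- followed by 3 digits

text = "EMP-001 closed 42 tickets in 2024"

broad_result = BROAD.subn("[REDACTED]", text)
tight_result = TIGHT.subn("[REDACTED]", text)

print(broad_result)  # ('EMP-[REDACTED] closed [REDACTED] tickets in [REDACTED]', 3)
print(tight_result)  # ('[REDACTED] closed 42 tickets in 2024', 1)
```

The broad pattern destroys ordinary business numbers and still leaks the identifier prefix, while the tight one removes exactly the value of interest.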

Tenant-specific layering (recommended)

For enterprise deployments, keep the global baseline and layer tenant rules on top.

from dataclasses import dataclass

from markdown_redactor import (
    RedactionConfig,
    RuleContext,
    create_tenant_engine,
)


@dataclass(frozen=True, slots=True)
class CustomerTicketRule:
    name: str = "customer_ticket"

    def redact(
        self,
        content: str,
        config: RedactionConfig,
        context: RuleContext,
    ) -> tuple[str, int]:
        updated = content.replace("TICKET-", f"{config.mask}-")
        count = content.count("TICKET-")
        return updated, count


engine = create_tenant_engine(
    [CustomerTicketRule()],
    include_default_rules=True,
)

You can disable default rules for tenant-only behavior:

engine = create_tenant_engine([CustomerTicketRule()], include_default_rules=False)

Performance and Big-O

Let:

$n$ = input length
$r$ = number of active rules

Complexity:

Time: $O(n \cdot r)$
Memory: $O(n)$
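
A short derivation of the time bound, under the assumption that rule $i$ scans the input once at a bounded per-character cost $c_i$ (this holds for patterns without pathological backtracking; $c_i$ and $c_{\max}$ are symbols introduced here for illustration):

$$
T(n, r) = \sum_{i=1}^{r} c_i \, n \le r \, c_{\max} \, n = O(n \cdot r), \qquad c_{\max} = \max_{1 \le i \le r} c_i
$$

The memory bound follows if only the input and an output of comparable size are held at any one time, which is an assumption about the implementation rather than a documented guarantee.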

Why this stays lightweight:

Precompiled regex patterns in built-in rules
No Markdown AST parsing dependency
No network I/O, no external services, no heavy runtime libs

Security and compliance notes

This is best-effort pattern redaction, not formal DLP certification
Always validate on your real data and threat model
Combine with downstream controls (access controls, logging, policy engines)
Add organization-specific rules for identifiers, ticket IDs, or internal secrets

Troubleshooting

Nothing is being redacted

Verify you are using create_default_engine() or registering custom rules
Check whether content is inside fenced/inline code that is skipped by default

Too much is being redacted

Tighten custom regex patterns
Keep --redact-inline-code / --redact-fenced-code-blocks disabled unless required

CLI command not found

Ensure the package is installed in the active environment
Try module mode: python -m markdown_redactor.cli input.md

Additional resources

Architecture guide: docs/ARCHITECTURE.md
FAQ: docs/FAQ.md
Support process: SUPPORT.md
Security policy: SECURITY.md
Changelog: CHANGELOG.md
Releasing guide: docs/RELEASING.md
Guided onboarding docs: docs/README.md
Runnable examples:

Development and contribution

See CONTRIBUTING.md for setup and quality checks.

Primary local quality command:

PYTHONPATH=src .venv/bin/python -m ruff check src tests && \
PYTHONPATH=src .venv/bin/python -m mypy src && \
PYTHONPATH=src .venv/bin/python -m pytest

Release process

Maintainers can follow docs/RELEASING.md.

Publishing is automated via .github/workflows/release.yml on tags matching v*. GitHub Release notes and signed provenance attestations are generated via .github/workflows/github-release.yml.

Release history

0.1.5: Apr 20, 2026
0.1.4: Mar 19, 2026
0.1.3: Mar 11, 2026
0.1.2: Mar 9, 2026 (this version)
0.1.1: Feb 28, 2026
