
llm-safe-pl


Reversible PII anonymization for Polish documents, designed for LLM workflows.

Status: alpha (v0.2.0). Core regex + checksum detection, anonymization, deanonymization, and the CLI are implemented and tested (319 tests, ~99% coverage). v0.2.0 is a service-pack release: ~25× faster Shield.anonymize() on documents with thousands of PII items, plus a security-hardening pass (strict Mapping.from_dict validation, Shield(max_input_bytes=...), Shield.reset(), CLI --force / --max-bytes). The optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION is still scheduled for a later 0.x release. See CHANGELOG.md and Roadmap.


Why this exists

When you send a Polish document to an LLM (OpenAI, Anthropic, a local model), you're exposing PESEL numbers, NIPs, ID card numbers, addresses, and names to a third party. Existing PII tools either focus on English data, flag every 11-digit string as a PESEL (false positives), or provide one-way redaction that breaks when you need to post-process the LLM's response.

llm-safe-pl is built around the full round-trip:

  1. Anonymize — detect Polish PII, replace with stable tokens, return a reversible mapping.
  2. Call the LLM — the request contains no raw PII.
  3. Deanonymize — restore original values in the response using the saved mapping.

Checksum validation (PESEL, NIP, REGON, Luhn, IBAN mod-97) is first-class, so valid-looking-but-wrong numbers are rejected before they become false positives.
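To make the checksum-first approach concrete, here is a sketch of the published PESEL check-digit algorithm. This follows the public specification (fixed weights over the first ten digits), not the library's internal code:

```python
# Weights are fixed by the PESEL specification.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_checksum_ok(pesel: str) -> bool:
    """Return True if an 11-digit PESEL has a valid check digit."""
    if len(pesel) != 11 or not pesel.isdigit():
        return False
    weighted = sum(int(d) * w for d, w in zip(pesel[:10], PESEL_WEIGHTS))
    return (10 - weighted % 10) % 10 == int(pesel[10])
```

The valid example used throughout this README (44051401359) passes, while any single-digit corruption fails — which is exactly why an 11-digit string is not automatically a PESEL.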

Installation

Core install — stdlib + typer only:

pip install llm-safe-pl

Optional spaCy-based NER (Phase 6, not yet released):

pip install "llm-safe-pl[ner]"
python -m spacy download pl_core_news_lg

Requires Python 3.10+.

Quick example — Python API

from llm_safe_pl import Shield

shield = Shield()

result = shield.anonymize(
    "Jan Kowalski ma PESEL 44051401359, NIP 526-000-12-46, email jan@example.pl."
)
# result.text    -> "Jan Kowalski ma PESEL [PESEL_001], NIP [NIP_001], email [EMAIL_001]."
# result.mapping -> reversible Mapping object (JSON-serializable)
# result.matches -> tuple[Match, ...] for audit

# Safe to send to an LLM now.
# response = call_any_llm(result.text)

restored = shield.deanonymize(result.text)
# "Jan Kowalski ma PESEL 44051401359, NIP 526-000-12-46, email jan@example.pl."

The same value always maps to the same token within a Shield instance, including across multiple anonymize() calls. Formatted identifiers (e.g. 526-000-12-46) round-trip exactly — the dashes are preserved.
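The stable-token behavior can be illustrated with a deliberately tiny stdlib-only sketch (this is not llm_safe_pl's internals — it only handles a single email pattern and a simplified mapping):

```python
import re
from collections import defaultdict

# One illustrative pattern; the real library detects nine PII types.
PATTERNS = {"EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}")}

class TinyShield:
    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}   # token -> original value
        self.seen: dict[str, str] = {}      # original value -> token
        self.counters: dict[str, int] = defaultdict(int)

    def anonymize(self, text: str) -> str:
        def repl(m: re.Match, kind: str) -> str:
            value = m.group(0)
            if value not in self.seen:      # same value -> same token, always
                self.counters[kind] += 1
                token = f"[{kind}_{self.counters[kind]:03d}]"
                self.seen[value] = token
                self.mapping[token] = value
            return self.seen[value]
        for kind, pat in PATTERNS.items():
            text = pat.sub(lambda m: repl(m, kind), text)
        return text

    def deanonymize(self, text: str) -> str:
        for token, value in self.mapping.items():
            text = text.replace(token, value)
        return text
```

Because the mapping is accumulated per instance, tokens stay consistent across calls — the same property the real Shield guarantees:

```python
s = TinyShield()
s.anonymize("email jan@example.pl")          # -> "email [EMAIL_001]"
s.anonymize("again jan@example.pl")          # -> "again [EMAIL_001]"
```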

If you process unrelated documents (different users, different requests) through one Shield, call shield.reset() between them to drop the accumulated mapping and prevent cross-document token leakage. For pipelines that ingest untrusted text, pass Shield(max_input_bytes=...) to refuse oversized inputs at the boundary instead of letting them turn into an O(n) memory blowup.
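The boundary check behind that advice is simple to reason about. A minimal sketch of the idea (the limit value mirrors the CLI default quoted below; how Shield implements its check internally is not shown here):

```python
# Reject oversized inputs before any processing touches them.
MAX_INPUT_BYTES = 64 * 1024 * 1024  # 64 MiB, same as the CLI default

def check_input_size(text: str, limit: int = MAX_INPUT_BYTES) -> str:
    # Measure the encoded size: multi-byte UTF-8 characters count fully,
    # so len(text) alone would undercount Polish text.
    size = len(text.encode("utf-8"))
    if size > limit:
        raise ValueError(f"input is {size} bytes, limit is {limit}")
    return text
```

Measuring bytes rather than characters matters for Polish input: "żółć" is 4 characters but 8 UTF-8 bytes.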

PERSON detection (Jan Kowalski in the example) requires pip install "llm-safe-pl[ner]" and is scheduled for a later 0.x release. Until then, names remain visible, while structured identifiers (PESEL, NIP, IBAN, etc.) are still tokenized.

Try it live in Colab


No install needed — the notebook walks through a full anonymize → LLM → deanonymize round-trip in a Polish customer-service scenario.

Quick example — CLI

# Detect PII without modifying the file (JSON or tab-separated output)
llm-safe detect document.txt
llm-safe detect document.txt --format text

# Anonymize: writes rewritten text and a reversible mapping
llm-safe anonymize document.txt -o anon.txt -m mapping.json

# Re-running on the same outputs requires --force (otherwise the CLI refuses
# to overwrite, since v0.2.0)
llm-safe anonymize document.txt -o anon.txt -m mapping.json --force

# Restore original values (prints to stdout, or use -o FILE)
llm-safe deanonymize anon.txt -m mapping.json

The CLI reads UTF-8 (with or without BOM) and UTF-16 (when a BOM is present), so files produced by PowerShell's default > redirection work without manual conversion. Output is always canonical UTF-8. Each subcommand also supports --max-bytes (default 64 MiB) to refuse pathologically large inputs.
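One way such BOM-aware decoding can be implemented with the stdlib (a sketch of the behavior described above, not necessarily how the CLI does it):

```python
import codecs

def decode_text(raw: bytes) -> str:
    """Decode UTF-16 when a BOM is present, otherwise UTF-8 (BOM stripped)."""
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")    # the codec consumes the BOM itself
    return raw.decode("utf-8-sig")     # strips a leading UTF-8 BOM if present
```

This covers the PowerShell case: `>` redirection historically produces UTF-16 LE with a BOM, which the first branch handles.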

What's supported

PII type                          Format examples                                          Checksum validated
PESEL                             44051401359                                              ✅
NIP                               5260001246, 526-000-12-46                                ✅
REGON                             123456785 (9-digit), 12345678500001 (14-digit)           ✅
ID card (dowód)                   ABC123456                                                regex only
Passport                          AB1234567                                                regex only
Phone                             +48 600 123 456, 600-123-456                             regex only
Email                             user@example.pl                                          regex only
IBAN                              PL61109010140000071219812874 (bare or 4-digit-grouped)   ✅ (mod-97, ~80 countries)
Credit card                       4532 0151 1283 0366 (13-19 digits, various groupings)    ✅ (Luhn)
Person / Organization / Location  via optional [ner] extra (Phase 6)                       n/a
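For reference, the two non-Polish checksums in the table are textbook algorithms. The sketches below are written against the public specifications (Luhn and ISO 13616 mod-97), not copied from the library:

```python
def luhn_ok(number: str) -> bool:
    """Textbook Luhn check; spaces and dashes are stripped first."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9          # same as summing the two digits of d
        total += d
    return len(digits) >= 13 and total % 10 == 0

def iban_mod97_ok(iban: str) -> bool:
    """ISO 13616 check: move the first 4 chars to the end, map letters
    A-Z to 10-35, and require the resulting number to be 1 mod 97."""
    s = iban.replace(" ", "").upper()
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # 'A'->10 ... 'Z'->35
    return int(digits) % 97 == 1
```

Both example values from the table validate, and a one-digit corruption of either is rejected — the property that keeps "valid-looking-but-wrong" numbers out of the results.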

Public API

These are the only names exported from llm_safe_pl:

from llm_safe_pl import Shield, Match, Mapping, AnonymizeResult, PIIType

Anything else is an implementation detail and may change without a major version bump.

Key design choices

  • Minimal dependencies. Detection, anonymization, and mapping run on stdlib alone; typer (used by the CLI) is the only required install-time dep. Heavy features (spaCy, Faker, pdfplumber) are opt-in extras.
  • Checksums written from scratch. PESEL, NIP, REGON, Luhn, mod-97 IBAN — the library's core value, not outsourced.
  • Reversibility is a contract. Every anonymize() call returns a Mapping that enables perfect restoration, preserving source formatting (dashes, spaces).
  • Polish-first. Native handling of Polish identifiers and, via the [ner] extra, Polish names and addresses through pl_core_news_lg.

More examples and documentation

Development

git clone https://github.com/Tatarinho/llm-safe-pl.git
cd llm-safe-pl
python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate      # macOS / Linux
pip install -e ".[dev]"

CI runs these four gates — run them the same way locally:

ruff check .
ruff format --check .
mypy
pytest

The 80% coverage gate is enforced in pyproject.toml.

Roadmap

  • Phase 0 — Scaffolding: packaging, CI, locked public API surface, tests green. Done in v0.1.0.
  • Phase 1 — models.py: Match, Mapping, AnonymizeResult, PIIType. Done in v0.1.0.
  • Phase 2 — Checksum validators: PESEL, NIP, REGON, Luhn, mod-97 IBAN. Done in v0.1.0.
  • Phase 3 — Nine regex + checksum detectors. Done in v0.1.0.
  • Phase 4 — Anonymizer / Deanonymizer with consistent tokens. Done in v0.1.0.
  • Phase 5 — Shield facade + CLI subcommands. Done in v0.1.0.
  • v0.2.0 — Algorithmic perf fix (Shield.anonymize() ~25× faster on large docs), security-hardening pass (Mapping.from_dict strict validation, Shield.reset(), Shield(max_input_bytes=...), CLI --force / --max-bytes). Done. See CHANGELOG.md.
  • Next 0.x — Optional spaCy NER recognizer for PERSON / ORGANIZATION / LOCATION via pip install "llm-safe-pl[ner]".
  • Later — Faker-based fake substitution, PDF/DOCX parsing, broader IBAN detector scope.

Non-goals

  • Not a SaaS, browser extension, or GUI — this is a Python library.
  • Not a legal compliance product. The library is a technical tool; compliance is the user's responsibility. See docs/limitations.md.
  • Not optimized for non-Polish text.
  • Not reimplementing PDF parsing, HTTP servers, or GUI frameworks that belong in separate libraries.

License

MIT. See LICENSE.
