CLI tool to redact PII from text files using the openai/privacy-filter model

privacy-steward

A CLI tool that redacts PII from plain-text files using the openai/privacy-filter model via a native PyTorch implementation. All inference runs locally, so the text you process never leaves your machine (only the one-time model download needs network access). That makes it a natural fit for GDPR-regulated environments where personal data must not be transferred to external processors (Articles 25 and 44).

Why privacy-steward?

The official OpenAI opf CLI processes one file at a time, requires manual installation, and offers limited placeholder control. privacy-steward is a drop-in alternative built for practitioners who need to sanitise datasets at scale:

Feature	privacy-steward	opf
Zero-install one-liner (`uvx`)	✓	—
Directory batch processing	✓	—
~1.4× faster throughput (native PyTorch vs. bundled runtime)	✓	—
Progress bar with ETA	✓	—
Automatic per-file audit trail	✓	—
Typed placeholders (`<PRIVATE_PERSON>`, `<ACCOUNT_NUMBER>`, …)	✓	fixed format
`--model` flag for any HF token-classification model	✓	—
Offline inference	✓	✓

Installation

No installation required — run directly with uvx:

uvx privacy-steward notes.txt

Or install permanently to add privacy-steward to your PATH:

uv tool install -U privacy-steward

First run downloads the openai/privacy-filter model weights and caches them in ~/.cache/huggingface/hub/. Subsequent runs are fully offline.
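
To check whether the weights are already cached before going offline, something like the following sketch may help. It assumes the standard Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<revision>/`) under the default cache root; `hub_cache_dir` and `is_cached` are hypothetical helper names and are not part of privacy-steward. If the `HF_HOME` environment variable relocates the cache, pass the adjusted root explicitly.

```python
from pathlib import Path

# Default hub cache root, as named in the paragraph above.
DEFAULT_CACHE_ROOT = Path.home() / ".cache" / "huggingface" / "hub"


def hub_cache_dir(model_id: str, cache_root: Path = DEFAULT_CACHE_ROOT) -> Path:
    """Map a model ID such as 'openai/privacy-filter' to its cache folder.

    Assumption: the hub cache names model folders 'models--<org>--<name>'.
    """
    return cache_root / ("models--" + model_id.replace("/", "--"))


def is_cached(model_id: str, cache_root: Path = DEFAULT_CACHE_ROOT) -> bool:
    """Return True if at least one downloaded snapshot exists for the model."""
    snapshots = hub_cache_dir(model_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    print(hub_cache_dir("openai/privacy-filter"))
    print("cached:", is_cached("openai/privacy-filter"))
```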

Requirements: Python 3.12 or later.

Quick start

# Redact a single .txt file (output: notes.redacted.txt alongside the source)
privacy-steward notes.txt

# Write a single-file result into an existing output directory
privacy-steward notes.txt --output ./clean/

# Redact an entire directory of .txt files, write to a custom output location
privacy-steward ./corpus/ --output ./corpus_clean/

# Show detected entities as they are processed (-v)
privacy-steward notes.txt -v

# Preview without writing files
privacy-steward ./corpus/ --dry-run

# Write an aggregate JSON summary report
privacy-steward ./corpus/ --output ./corpus_clean/ --report

Default output format

By default each detected entity is replaced with a typed label that reflects what was found:

Hi, my name is <PRIVATE_PERSON> and I work at Acme Corp.
You can reach me at <PRIVATE_EMAIL> or call me at <PRIVATE_PHONE>.
My home address is <PRIVATE_ADDRESS>.
The meeting is on <PRIVATE_DATE>. Visit us at <PRIVATE_URL>.
Please send the invoice to account number <ACCOUNT_NUMBER>.
My API key is <SECRET>.

Pass --placeholder to override this format. The value can be any literal string, or a template containing {entity_type}, which is replaced with the detected label (e.g. --placeholder "[{entity_type}]" → [PRIVATE_PERSON]).
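
The interpolation rule can be illustrated with a minimal sketch. `render_placeholder` is a hypothetical helper written for this example, not the tool's actual code; it assumes that only the literal token `{entity_type}` is substituted and that everything else in the template is kept verbatim.

```python
def render_placeholder(template: str, entity_type: str) -> str:
    """Return the replacement string for one detected entity.

    Hypothetical helper: swaps the literal token {entity_type} for the
    detected label and leaves all other characters untouched.
    """
    return template.replace("{entity_type}", entity_type)


if __name__ == "__main__":
    print(render_placeholder("<{entity_type}>", "PRIVATE_PERSON"))  # default-style template
    print(render_placeholder("[{entity_type}]", "PRIVATE_EMAIL"))   # custom brackets
    print(render_placeholder("[REDACTED]", "SECRET"))               # literal string, no token
```

Using plain string replacement rather than `str.format` in this sketch means a template with other braces in it would not raise an error.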

Output layout

A single-file input must be a .txt file. For a directory input, redacted .txt files mirror the source tree, non-.txt files are skipped, and an .audit/ directory is always created in the output directory next to the redacted files:

corpus_clean/
├── chapter1.redacted.txt
├── chapter2.redacted.txt
├── subdir/
│   └── chapter3.redacted.txt
└── .audit/
    ├── chapter1.audit.json    ← offsets, labels, and scores for auditing
    ├── chapter2.audit.json
    └── subdir/
        └── chapter3.audit.json
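
The mapping from a source file to its two outputs in the tree above can be sketched with `pathlib`. This is an illustration of the naming scheme shown, not the tool's implementation; `output_paths` and its parameters are hypothetical names.

```python
from pathlib import Path


def output_paths(src_root: Path, src_file: Path, out_root: Path) -> tuple[Path, Path]:
    """Return (redacted_path, audit_path) for one source .txt file.

    Assumed scheme, read off the example tree: the relative path is kept,
    '.txt' becomes '.redacted.txt', and the audit file lives under
    '<out_root>/.audit/' with the suffix '.audit.json'.
    """
    rel = src_file.relative_to(src_root)
    redacted = out_root / rel.with_name(f"{rel.stem}.redacted.txt")
    audit = out_root / ".audit" / rel.with_name(f"{rel.stem}.audit.json")
    return redacted, audit


if __name__ == "__main__":
    redacted, audit = output_paths(
        Path("corpus"), Path("corpus/subdir/chapter3.txt"), Path("corpus_clean")
    )
    print(redacted)
    print(audit)
```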

Each audit JSON records the source path, destination path, and every detected span (character offsets, entity type, and confidence score). To avoid re-exposing the PII that was just redacted, audit records omit the original matched text by default. Pass --include-text-in-audit only when you intentionally need surface forms in the audit trail and can protect the .audit/ directory accordingly.
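
To show how character offsets relate to the redacted text, here is a small sketch that applies a list of spans to a string. The `Span` fields (`start`, `end`, `entity_type`, `score`) are assumed names chosen for this example; the actual audit JSON keys are not documented here and may differ. The sketch also assumes spans do not overlap and that `end` is exclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int        # character offset of the first redacted character
    end: int          # exclusive end offset (assumption)
    entity_type: str  # e.g. "PRIVATE_EMAIL"
    score: float      # model confidence


def redact(text: str, spans: list[Span], template: str = "<{entity_type}>") -> str:
    """Replace each span with its placeholder, assuming non-overlapping spans."""
    out = text
    # Work from the last span to the first so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        label = template.replace("{entity_type}", span.entity_type)
        out = out[: span.start] + label + out[span.end :]
    return out


text = "Mail jane@example.com or call Jane."
spans = [
    Span(5, 21, "PRIVATE_EMAIL", 0.99),
    Span(30, 34, "PRIVATE_PERSON", 0.97),
]

if __name__ == "__main__":
    print(redact(text, spans))
```

Replacing from the end of the text backwards avoids recomputing offsets after each substitution, since placeholders rarely have the same length as the text they replace.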

Options

Flag	Short	Default	Description
`--output`	`-o`	derived	Output file or directory
`--placeholder`	`-p`	`<{entity_type}>`	Replacement string; `{entity_type}` is interpolated
`--report`		off	Write `redaction_report.json` to output dir
`--dry-run`		off	Show what would be redacted without writing files
`--verbose`	`-v`	off	Print per-file entity details alongside the progress bar
`--include-text-in-audit`		off	Include original matched text in audit JSON files
`--model`		`openai/privacy-filter`	Hugging Face model ID
`--version`			Show version and exit

Benchmark vs. OpenAI privacy-filter CLI (`opf`)

Both tools process text files through the OpenAI privacy-filter model family on CPU. opf uses a custom bundled runtime; privacy-steward uses a native PyTorch implementation loaded directly from the model's safetensors weights.

Hardware: Apple MacBook Pro (2020), Apple M1, 8-core CPU (4 Performance + 4 Efficiency), 16 GB unified memory. No GPU acceleration — all inference on CPU.

Setup: 10 synthetic files across diverse document types (emails, chat logs, support tickets, contracts, invoices, etc.), 573–1,362 tokens per file, single process.

Corpus	Size	Tokens	privacy-steward (s)	privacy-steward (tok/s)	opf (s)	opf (tok/s)	Speedup
01_emails.txt	5 KB	990	22.48	44	37.34	26	1.66×
02_chat_logs.txt	6 KB	1,168	33.13	35	42.16	27	1.27×
03_support_tickets.txt	4 KB	657	22.13	29	29.55	22	1.34×
04_meeting_notes.txt	5 KB	969	25.13	38	33.39	29	1.33×
05_contracts.txt	5 KB	954	22.44	42	29.53	32	1.32×
06_invoices.txt	3 KB	573	19.02	30	27.60	20	1.45×
07_intake_forms.txt	4 KB	750	20.84	35	28.88	25	1.39×
08_travel_itineraries.txt	4 KB	604	19.44	31	27.82	21	1.43×
09_incident_reports.txt	5 KB	836	23.25	35	31.54	26	1.36×
10_crm_diary.txt	7 KB	1,362	30.45	44	40.73	33	1.34×
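
The derived columns can be recomputed from the raw timings. The sketch below copies the token counts and wall-clock seconds from the table and derives throughput (tokens divided by seconds) and speedup (opf seconds divided by privacy-steward seconds). It is a sanity check on the arithmetic, not the project's benchmark script.

```python
from statistics import mean

# file -> (tokens, privacy-steward seconds, opf seconds), copied from the table
rows = {
    "01_emails.txt": (990, 22.48, 37.34),
    "02_chat_logs.txt": (1168, 33.13, 42.16),
    "03_support_tickets.txt": (657, 22.13, 29.55),
    "04_meeting_notes.txt": (969, 25.13, 33.39),
    "05_contracts.txt": (954, 22.44, 29.53),
    "06_invoices.txt": (573, 19.02, 27.60),
    "07_intake_forms.txt": (750, 20.84, 28.88),
    "08_travel_itineraries.txt": (604, 19.44, 27.82),
    "09_incident_reports.txt": (836, 23.25, 31.54),
    "10_crm_diary.txt": (1362, 30.45, 40.73),
}

# Speedup is the ratio of wall-clock times for the same file.
speedups = {name: opf_s / ps_s for name, (_, ps_s, opf_s) in rows.items()}

if __name__ == "__main__":
    for name, (tokens, ps_s, opf_s) in rows.items():
        print(
            f"{name}: {tokens / ps_s:.1f} vs {tokens / opf_s:.1f} tok/s, "
            f"speedup {speedups[name]:.2f}x"
        )
    print(f"mean speedup: {mean(speedups.values()):.2f}x")
```

The recomputed speedups agree with the table to two decimals, and their mean is consistent with the ~1.4× figure quoted earlier. The tok/s columns in the table appear to be rounded down to whole numbers (for example, 657 tokens in 22.13 s is about 29.7 tok/s, listed as 29).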

Benchmarks are reproducible: uv run python benchmarks/benchmark_throughput.py (requires benchmarks/data/ to be present).

Development

uv sync                                  # install all deps
make lint                                # ruff + mypy
make test                                # fast unit tests only
pytest -m slow                           # integration tests (require model)
uv run python benchmarks/benchmark_throughput.py  # run benchmarks

Project details

These details have been verified by PyPI

Maintainers

NeuralNotwork

Release history

0.1.2: May 7, 2026
0.1.1: May 7, 2026
0.1.0 (this version): May 7, 2026

Download files

Source Distribution

privacy_steward-0.1.0.tar.gz (155.7 kB)

Uploaded May 7, 2026 Source

Built Distribution

privacy_steward-0.1.0-py3-none-any.whl (24.9 kB)

Uploaded May 7, 2026 Python 3

File details

Details for the file privacy_steward-0.1.0.tar.gz.

File metadata

Download URL: privacy_steward-0.1.0.tar.gz
Upload date: May 7, 2026
Size: 155.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for privacy_steward-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`329aa97f1bbdbd567b3832c2f9fd1ce90cdc79a427b36c47e21b12e28f2d8cf8`
MD5	`b74d946504bfe813b6cf96a9101ceadc`
BLAKE2b-256	`5d9034696d74e0427e7d9d223eb8e9b5a6e48faf95d26b0586931209dfa077ab`
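
A downloaded file can be checked against the published SHA256 digest with the Python standard library. The sketch below is one way to do it; `sha256_of` and `matches` are hypothetical helper names, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for privacy_steward-0.1.0.tar.gz
EXPECTED_SHA256 = "329aa97f1bbdbd567b3832c2f9fd1ce90cdc79a427b36c47e21b12e28f2d8cf8"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    sdist = Path("privacy_steward-0.1.0.tar.gz")  # assumed download location
    if sdist.exists():
        print("digest OK" if matches(sdist, EXPECTED_SHA256) else "digest MISMATCH")
    else:
        print(f"{sdist} not found; download it first")
```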

Provenance

The following attestation bundles were made for privacy_steward-0.1.0.tar.gz:

Publisher: ci.yml on AI-Colleagues/privacy-steward

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: privacy_steward-0.1.0.tar.gz
- Subject digest: 329aa97f1bbdbd567b3832c2f9fd1ce90cdc79a427b36c47e21b12e28f2d8cf8
- Sigstore transparency entry: 1458919479
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: AI-Colleagues/privacy-steward@de4b90b5134275a8c2bd9dfa7f17664b3e2a0ee7
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AI-Colleagues
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@de4b90b5134275a8c2bd9dfa7f17664b3e2a0ee7
- Trigger Event: push

File details

Details for the file privacy_steward-0.1.0-py3-none-any.whl.

File metadata

Download URL: privacy_steward-0.1.0-py3-none-any.whl
Upload date: May 7, 2026
Size: 24.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for privacy_steward-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6b2839d1af76e4083854bc39e00e66426e5895b9538bd275a7e2fe2cca74cfbf`
MD5	`e7fb21a0b35df59b8a5f5d88ab4a4ec3`
BLAKE2b-256	`bc84f8cfff3ad3a86734f0f46d2f6d0a8eec751f142cd7095906de18af6995de`

Provenance

The following attestation bundles were made for privacy_steward-0.1.0-py3-none-any.whl:

Publisher: ci.yml on AI-Colleagues/privacy-steward

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: privacy_steward-0.1.0-py3-none-any.whl
- Subject digest: 6b2839d1af76e4083854bc39e00e66426e5895b9538bd275a7e2fe2cca74cfbf
- Sigstore transparency entry: 1458919606
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: AI-Colleagues/privacy-steward@de4b90b5134275a8c2bd9dfa7f17664b3e2a0ee7
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AI-Colleagues
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@de4b90b5134275a8c2bd9dfa7f17664b3e2a0ee7
- Trigger Event: push
