CLI for a tool to anonymize PDF, Markdown, and plain text files using LLMs.

🦉🫥 PDF Anonymizer CLI

A command-line interface for anonymizing PDF, Markdown, and plain text files using LLMs.

  • High-Quality Anonymization: Leverages LLMs to identify and replace Personally Identifiable Information (PII) with high accuracy.
  • Large File Support: Consistently anonymizes large files (tested up to 1 GB).
  • Multi-Provider & Cost-Effective: Free to use with local Ollama models. It also supports major providers like OpenAI, Anthropic, Google, Hugging Face, and OpenRouter.
  • Reversible: Supports deanonymization to recover original data when needed.
  • Multi-Format: Works with PDF, Markdown, and plain text files.

Installation

Install the CLI with your favorite package manager. To use a specific LLM provider, you must install the corresponding extra.

  • Google: pip install "pdf-anonymizer-cli[google]"
  • Ollama: pip install "pdf-anonymizer-cli[ollama]"
  • Hugging Face: pip install "pdf-anonymizer-cli[huggingface]"
  • OpenRouter: pip install "pdf-anonymizer-cli[openrouter]"
  • OpenAI: pip install "pdf-anonymizer-cli[openai]"
  • Anthropic: pip install "pdf-anonymizer-cli[anthropic]"

You can also install multiple extras at once:

pip install "pdf-anonymizer-cli[google,openrouter]"

This installs the pdf-anonymizer executable.

Environment Variables

The CLI will automatically load a .env file from the current directory or any parent directory. For consistency, it's recommended to place a single .env file at the root of the repository.

  • GOOGLE_API_KEY: Required when using Google models.
  • HUGGING_FACE_TOKEN: Required when using Hugging Face models. You can create a token in your Hugging Face account settings.
  • OPENROUTER_API_KEY: Required when using OpenRouter models.
  • OPENAI_API_KEY: Required when using OpenAI models.
  • ANTHROPIC_API_KEY: Required when using Anthropic models.
  • OLLAMA_HOST: Optional when using Ollama models; defaults to http://localhost:11434.
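
The requirements above can be checked before starting a run. The following is a minimal sketch, not part of the CLI: the helper `missing_variable` is hypothetical, and the lowercase provider labels are borrowed from the install extras as an assumption.

```python
import os

# Provider -> required environment variable, taken from the list above.
# The lowercase provider labels mirror the install extras; that they are
# meaningful to the CLI itself is an assumption of this sketch.
REQUIRED_VARIABLES = {
    "google": "GOOGLE_API_KEY",
    "huggingface": "HUGGING_FACE_TOKEN",
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_variable(provider, environ=None):
    """Return the name of the unset variable for provider, or None."""
    environ = os.environ if environ is None else environ
    # Ollama is absent from the table: OLLAMA_HOST is optional.
    name = REQUIRED_VARIABLES.get(provider)
    if name is None or environ.get(name):
        return None
    return name
```

Calling `missing_variable("google")` against the real environment returns `"GOOGLE_API_KEY"` only when that variable is unset or empty.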

Example .env file:

GOOGLE_API_KEY="YOUR_API_KEY_HERE"
HUGGING_FACE_TOKEN="YOUR_HF_TOKEN_HERE"
OPENROUTER_API_KEY="YOUR_OPENROUTER_KEY"
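
To illustrate the "current directory or any parent directory" lookup described above, here is a minimal sketch of such a search. It is an assumption about the search order (nearest file wins), not the CLI's actual code, and `find_env_file` is a hypothetical name.

```python
from pathlib import Path


def find_env_file(start):
    """Return the nearest .env at or above start, or None if there is none."""
    start = Path(start).resolve()
    # Check the starting directory first, then each parent up to the root.
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None
```

With a single .env at the repository root, `find_env_file(Path.cwd())` resolves to the same file from any subdirectory, which is why one root-level file is enough.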

Usage

Anonymize

The run command anonymizes one or more files.

pdf-anonymizer run FILE_PATH [FILE_PATH ...] \
  [--characters-to-anonymize INTEGER] \
  [--prompt-name {simple|detailed}] \
  [--model-name TEXT] \
  [--anonymized-entities PATH]

Arguments:

  • FILE_PATH: Path to one or more PDF, Markdown, or plain text files to anonymize.

Options:

  • --characters-to-anonymize INTEGER: Number of characters to process in each chunk (default: 100000).
  • --prompt-name {simple|detailed}: The prompt template to use (default: detailed).
  • --model-name TEXT: The language model to use.
  • --anonymized-entities PATH: Path to a file with a list of entities to anonymize.
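
To make the effect of --characters-to-anonymize concrete, the sketch below splits text into fixed-size chunks. It is a naive illustration only: the CLI may choose chunk boundaries differently (for example, at paragraph breaks), and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(text, characters_to_anonymize=100_000):
    """Split text into consecutive chunks of at most the given length."""
    if characters_to_anonymize <= 0:
        raise ValueError("characters_to_anonymize must be positive")
    return [
        text[start:start + characters_to_anonymize]
        for start in range(0, len(text), characters_to_anonymize)
    ]
```

Under this scheme a 250,000-character document yields three chunks with the default setting; a smaller value means more, shorter requests to the model.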

Models: You can use any of the predefined models below, or specify a new model using the format "provider/model-name". For example: --model-name "google/gemini-flash-latest".

  • Google: gemini-2.5-pro, gemini-2.5-flash (default), gemini-2.5-flash-lite.
  • Ollama: gemma:7b, phi4-mini.
  • Hugging Face: openai/gpt-oss-20b, mistralai/Mistral-7B-Instruct-v0.1, HuggingFaceH4/zephyr-7b-beta.
  • OpenRouter: openai/gpt-4o, google/gemini-pro.
  • OpenAI: gpt-4o, gpt-5.
  • Anthropic: claude-4-sonet, claude-4.5-sonet.
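
The "provider/model-name" format for new models can be read as a split at the first slash. The sketch below shows that reading; it is an assumption, and `split_model_spec` is a hypothetical helper. Some predefined names (such as openai/gpt-4o, listed under OpenRouter) contain a slash themselves, so the CLI presumably resolves those through its own table, which this sketch does not model.

```python
def split_model_spec(spec):
    """Split "provider/model-name" at the first slash."""
    provider, separator, model = spec.partition("/")
    if not separator or not provider or not model:
        raise ValueError(f"expected 'provider/model-name', got {spec!r}")
    return provider, model


provider, model = split_model_spec("google/gemini-flash-latest")
# provider == "google", model == "gemini-flash-latest"
```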

Examples

Basic anonymization with the default model (Google):

pdf-anonymizer run document.pdf

Using a Google model that is not in the predefined list, with the simple prompt:

pdf-anonymizer run notes.md --model-name "google/gemini-flash-latest" --prompt-name simple

Using an OpenRouter model:

pdf-anonymizer run report.pdf --model-name "openai/gpt-4o"
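
When the CLI is driven from a script, the same invocations can be assembled programmatically. This is a minimal sketch that uses only the flags documented above; `build_run_command` is a hypothetical helper, not something the package provides.

```python
import shlex


def build_run_command(files, model_name=None, prompt_name=None,
                      characters_to_anonymize=None):
    """Build the argument list for `pdf-anonymizer run`."""
    if not files:
        raise ValueError("at least one FILE_PATH is required")
    command = ["pdf-anonymizer", "run", *map(str, files)]
    # Append only the options that were explicitly requested, so the
    # CLI's own defaults apply to everything else.
    if characters_to_anonymize is not None:
        command += ["--characters-to-anonymize", str(characters_to_anonymize)]
    if prompt_name is not None:
        command += ["--prompt-name", prompt_name]
    if model_name is not None:
        command += ["--model-name", model_name]
    return command


command = build_run_command(
    ["report.pdf", "notes.md"],
    model_name="google/gemini-flash-latest",
    prompt_name="simple",
)
print(shlex.join(command))  # shell-quoted form, for logging
```

The resulting list can be passed to `subprocess.run(command, check=True)`; passing a list rather than a single string avoids shell quoting problems with unusual file names.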

Deanonymize

The deanonymize command reverts anonymization using a mapping file.

pdf-anonymizer deanonymize ANONYMIZED_FILE MAPPING_FILE

Arguments:

  • ANONYMIZED_FILE: Path to the anonymized text file.
  • MAPPING_FILE: Path to the JSON mapping file.

Example:

pdf-anonymizer deanonymize \
    data/anonymized/document.anonymized.md \
    data/mappings/document.mapping.json

This will create a deanonymized version of the file at data/deanonymized/document.deanonymized.md.
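
Conceptually, deanonymization substitutes each placeholder with the value recorded in the mapping file. The sketch below illustrates that idea under a loud assumption: the mapping is taken to be a flat JSON object of placeholder to original value, and the placeholder names are invented. The layout the CLI actually writes may differ.

```python
import json
import re


def deanonymize_text(text, mapping):
    """Replace every placeholder in text with its original value."""
    if not mapping:
        return text
    # Longest placeholders first, so "PERSON_10" is not matched as "PERSON_1".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass: restored values are never rescanned for placeholders.
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Hypothetical mapping content; the real file layout is an assumption.
mapping = json.loads('{"PERSON_1": "Ada Lovelace", "PERSON_10": "Alan Turing"}')
print(deanonymize_text("PERSON_10 wrote to PERSON_1.", mapping))
# prints: Alan Turing wrote to Ada Lovelace.
```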
