
A core library to anonymize PDF, Markdown, and plain text files using LLMs.

🦉🫥 PDF Anonymizer Core

This package provides the core functionality for the PDF/Text anonymizer, including text extraction, LLM-driven anonymization, and deanonymization logic. It is used by pdf-anonymizer-cli.

Installation

Install the base package with your favorite package manager:

pip install pdf-anonymizer-core

To use a specific LLM provider, you must install the corresponding extra. This keeps the installation lightweight, because only the libraries you need are downloaded.

  • Google: pip install "pdf-anonymizer-core[google]"
  • Ollama: pip install "pdf-anonymizer-core[ollama]"
  • Hugging Face: pip install "pdf-anonymizer-core[huggingface]"
  • OpenRouter: pip install "pdf-anonymizer-core[openrouter]"
  • OpenAI: pip install "pdf-anonymizer-core[openai]"
  • Anthropic: pip install "pdf-anonymizer-core[anthropic]"

You can also install multiple extras at once:

pip install "pdf-anonymizer-core[google,ollama]"
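Because provider libraries are optional, an application may want to check that one is installed before selecting that provider. The following is a minimal sketch using only the standard library. The helper name `is_module_available` is hypothetical, and the import name `ollama` is an assumption, since this README does not list which packages each extra installs.

```python
from importlib.util import find_spec


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return find_spec(module_name) is not None


# "ollama" is an assumed import name for the package pulled in by the
# [ollama] extra; adjust it to whatever your provider extra installs.
if not is_module_available("ollama"):
    print('Ollama client not found; try: pip install "pdf-anonymizer-core[ollama]"')
```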

Environment Variables

The core library itself does not load .env files. Environment variables must be loaded by the application that uses this library (e.g., pdf-anonymizer-cli) or set in your shell.

  • GOOGLE_API_KEY: Required when using Google models.
  • HUGGING_FACE_TOKEN: Required when using Hugging Face models.
  • OPENROUTER_API_KEY: Required when using OpenRouter models.
  • OPENAI_API_KEY: Required when using OpenAI models.
  • ANTHROPIC_API_KEY: Required when using Anthropic models.
  • OLLAMA_HOST: Optional, defaults to http://localhost:11434 when using Ollama models.
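Since the library does not load .env files itself, the calling application can verify that the required variable is set before invoking it. The sketch below is one possible way to do that with the standard library. The helper `missing_env_var` and the provider keys (chosen to match the extras names) are hypothetical and not part of pdf-anonymizer-core.

```python
import os
from typing import Mapping, Optional

# Required variable per provider, taken from the list above.
# Ollama is omitted because OLLAMA_HOST is optional.
REQUIRED_ENV_VARS = {
    "google": "GOOGLE_API_KEY",
    "huggingface": "HUGGING_FACE_TOKEN",
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_env_var(provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset required variable, or None if nothing is missing."""
    name = REQUIRED_ENV_VARS.get(provider)
    # No requirement for this provider, or the variable is set and non-empty.
    if name is None or env.get(name):
        return None
    return name


missing = missing_env_var("google")
if missing:
    print(f"Set {missing} before using Google models.")
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.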

API Usage

anonymize_file()

Anonymizes a single file and returns the anonymized text and a mapping of original entities to their placeholders.

from pdf_anonymizer_core.core import anonymize_file
from pdf_anonymizer_core.prompts import detailed

# Example of programmatic usage
text, mapping = anonymize_file(
    file_path="/path/to/file.pdf",
    prompt_template=detailed.prompt_template,
    model_name="gemini-2.5-pro"  # A newer model can also be given, e.g. "google/gemini-flash-latest"
)

if text and mapping:
    print("Anonymized Text:", text)
    print("Mapping:", mapping)
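To deanonymize later, the returned text and mapping need to be stored somewhere. The sketch below writes both to disk. It assumes the mapping is a plain, JSON-serializable dictionary of strings. The helper name `save_results` and the output file names are hypothetical, and this README does not specify the mapping file format that deanonymize_file() expects, so verify that before relying on this layout.

```python
import json
from pathlib import Path


def save_results(text: str, mapping: dict, out_dir: str):
    """Write the anonymized text and its mapping into out_dir; return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    text_path = out / "anonymized.md"      # hypothetical file name
    mapping_path = out / "mapping.json"    # hypothetical file name

    text_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII names readable in the JSON file.
    mapping_path.write_text(
        json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return text_path, mapping_path
```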

deanonymize_file()

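Reverts anonymization by applying a mapping file to an anonymized file. Returns the deanonymized text together with statistics about the operation.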

from pdf_anonymizer_core.utils import deanonymize_file

# Assumes you have an anonymized file and a mapping file
deanonymized_text, stats = deanonymize_file(
    anonymized_file="path/to/anonymized.md",
    mapping_file="path/to/mapping.json"
)

if deanonymized_text:
    print("Deanonymized Text:", deanonymized_text)
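Conceptually, deanonymization is the reverse substitution: each placeholder is replaced by the original entity it stands for. The toy sketch below illustrates that idea only and is not the library's implementation. It assumes the mapping goes from original entity to placeholder, as described for anonymize_file(). The function name, the placeholder format, and the returned count are hypothetical.

```python
def revert_placeholders(text: str, mapping: dict) -> tuple:
    """Replace each placeholder with its original entity.

    Returns the restored text and the number of replacements made.
    """
    replaced = 0
    # Handle longer placeholders first so "PERSON_10" is not
    # partially rewritten by the shorter "PERSON_1".
    pairs = sorted(mapping.items(), key=lambda kv: len(kv[1]), reverse=True)
    for original, placeholder in pairs:
        replaced += text.count(placeholder)
        text = text.replace(placeholder, original)
    return text, replaced


restored, count = revert_placeholders(
    "PERSON_10 met PERSON_1.",
    {"Alice": "PERSON_1", "Bob": "PERSON_10"},
)
print(restored, count)  # prints: Bob met Alice. 2
```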

Configuration

You can import default configurations and available models from the conf module.

from pdf_anonymizer_core.conf import (
    DEFAULT_MODEL_NAME,
    ModelName,
    PromptEnum,
)

print(f"Default model: {DEFAULT_MODEL_NAME}")
print(f"Available Google models: {[m.value for m in ModelName if m.provider == 'google']}")
