A core library to anonymize PDF, Markdown, and plain text files using LLMs.
🦉🫥 PDF Anonymizer Core
This package provides the core functionality for the PDF/Text anonymizer, including text extraction, LLM-driven anonymization, and deanonymization logic. It is used by pdf-anonymizer-cli.
Installation
Install the base package with your favorite package manager:
pip install pdf-anonymizer-core
To use a specific LLM provider, you must install the corresponding extra. This keeps the installation lightweight by downloading only the libraries you need.
- Google:
  pip install "pdf-anonymizer-core[google]"
- Ollama:
  pip install "pdf-anonymizer-core[ollama]"
- Hugging Face:
  pip install "pdf-anonymizer-core[huggingface]"
- OpenRouter:
  pip install "pdf-anonymizer-core[openrouter]"
- OpenAI:
  pip install "pdf-anonymizer-core[openai]"
- Anthropic:
  pip install "pdf-anonymizer-core[anthropic]"
You can also install multiple extras at once:
pip install "pdf-anonymizer-core[google,ollama]"
Environment Variables
The core library itself does not load .env files. Environment variables must be loaded by the application that uses this library (e.g., pdf-anonymizer-cli) or set in your shell.
- GOOGLE_API_KEY: Required when using Google models.
- HUGGING_FACE_TOKEN: Required when using Hugging Face models.
- OPENROUTER_API_KEY: Required when using OpenRouter models.
- OPENAI_API_KEY: Required when using OpenAI models.
- ANTHROPIC_API_KEY: Required when using Anthropic models.
- OLLAMA_HOST: Optional, defaults to http://localhost:11434 when using Ollama models.
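Because the library does not load .env files itself, a calling application may want to check that the relevant variable is set before invoking it. The following is a minimal sketch of such a check, not part of pdf-anonymizer-core. The variable names come from the list above and the provider keys reuse the extras names from the installation section. The helper names missing_env_var and ollama_host are hypothetical.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the list above. The provider keys reuse the
# extras names from the installation section; they are labels for this sketch,
# not identifiers exposed by the library.
REQUIRED_ENV_VARS = {
    "google": "GOOGLE_API_KEY",
    "huggingface": "HUGGING_FACE_TOKEN",
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def missing_env_var(provider: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for `provider`, or None if nothing is missing."""
    env = os.environ if env is None else env
    name = REQUIRED_ENV_VARS.get(provider)
    if name is None:
        # Providers without a required variable (for example "ollama").
        return None
    return None if env.get(name) else name


def ollama_host(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OLLAMA_HOST if set, otherwise the documented default."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_HOST") or DEFAULT_OLLAMA_HOST


if __name__ == "__main__":
    missing = missing_env_var("google")
    if missing:
        print(f"Set {missing} before using Google models.")
```

Passing env explicitly keeps the check easy to test. In an application you would normally omit it so that os.environ is used.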
API Usage
anonymize_file()
Anonymizes a single file and returns the anonymized text and a mapping of original entities to their placeholders.
from pdf_anonymizer_core.core import anonymize_file
from pdf_anonymizer_core.prompts import detailed
# Example of programmatic usage
text, mapping = anonymize_file(
    file_path="/path/to/file.pdf",
    prompt_template=detailed.prompt_template,
    model_name="gemini-2.5-pro",  # Can also be a new model like "google/gemini-flash-latest"
)
if text and mapping:
    print("Anonymized Text:", text)
    print("Mapping:", mapping)
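To reverse the anonymization later, the returned text and mapping need to be stored somewhere. The following sketch writes both to disk. It assumes the mapping is a JSON-serializable dict. The helper save_anonymization and the output file names are hypothetical, and the exact mapping file layout that deanonymize_file() expects is not documented here, so treat the JSON written below as an illustration only.

```python
import json
from pathlib import Path


def save_anonymization(text, mapping, out_dir, stem="document"):
    """Write the anonymized text and its mapping next to each other.

    Assumes `mapping` is a plain dict of strings (original -> placeholder).
    Returns the two paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    text_path = out / f"{stem}.anonymized.md"
    mapping_path = out / f"{stem}.mapping.json"

    text_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII names readable in the JSON file.
    mapping_path.write_text(
        json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return text_path, mapping_path
```

The mapping file contains the original entities, so it should be stored as carefully as the source document.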
deanonymize_file()
Reverts anonymization using a mapping file.
from pdf_anonymizer_core.utils import deanonymize_file
# Assumes you have an anonymized file and a mapping file
deanonymized_text, stats = deanonymize_file(
    anonymized_file="path/to/anonymized.md",
    mapping_file="path/to/mapping.json",
)
if deanonymized_text:
    print("Deanonymized Text:", deanonymized_text)
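Conceptually, reverting an anonymization means replacing each placeholder with its original entity. The sketch below illustrates that idea on in-memory strings. It is not the library's implementation. The function revert_placeholders, the PERSON_1 placeholder style, and the replacement count it returns (a stand-in for stats) are assumptions made for illustration.

```python
import re


def revert_placeholders(anonymized_text, mapping):
    """Replace placeholders with their originals.

    `mapping` is assumed to map original entity -> placeholder, as described
    for anonymize_file(). Returns the restored text and the number of
    replacements made.
    """
    reverse = {placeholder: original for original, placeholder in mapping.items()}
    if not reverse:
        return anonymized_text, 0

    # Try longer placeholders first so that "PERSON_10" is not consumed
    # as "PERSON_1" followed by a stray "0".
    alternatives = sorted(reverse, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in alternatives))

    return pattern.subn(lambda m: reverse[m.group(0)], anonymized_text)


if __name__ == "__main__":
    demo_mapping = {"Alice Smith": "PERSON_1", "Bob Jones": "PERSON_10"}
    restored, count = revert_placeholders("PERSON_10 met PERSON_1.", demo_mapping)
    print(restored, count)
```

A single regular-expression pass avoids re-substituting text that an earlier replacement introduced, which repeated str.replace calls could do.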
Configuration
You can import default configurations and available models from the conf module.
from pdf_anonymizer_core.conf import (
    DEFAULT_MODEL_NAME,
    ModelName,
    PromptEnum,
)
print(f"Default model: {DEFAULT_MODEL_NAME}")
print(f"Available Google models: {[m.value for m in ModelName if m.provider == 'google']}")
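The filter above relies on each enum member exposing a provider attribute. The following stand-in shows one way such an enum could be structured so that the pattern can be tried without installing the package. DemoModelName, its members, and models_for are hypothetical. The real ModelName may be defined differently, and "ollama/example-model" is a placeholder value rather than a real model identifier.

```python
from enum import Enum


class DemoModelName(str, Enum):
    """Hypothetical stand-in for ModelName, used only to show the filter pattern."""

    GEMINI_PRO = "gemini-2.5-pro"
    GEMINI_FLASH = "google/gemini-flash-latest"
    LOCAL_EXAMPLE = "ollama/example-model"  # placeholder value, not a real model

    @property
    def provider(self) -> str:
        # Looked up at call time, after _PROVIDERS has been defined below.
        return _PROVIDERS[self]


# Provider labels for the demo members (an assumption of this sketch).
_PROVIDERS = {
    DemoModelName.GEMINI_PRO: "google",
    DemoModelName.GEMINI_FLASH: "google",
    DemoModelName.LOCAL_EXAMPLE: "ollama",
}


def models_for(provider: str) -> list:
    """Return the values of all demo models that belong to `provider`."""
    return [m.value for m in DemoModelName if m.provider == provider]


if __name__ == "__main__":
    print(models_for("google"))
```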