A core library to anonymize PDF, Markdown, and plain text files using LLMs.
🦉🫥 PDF Anonymizer Core
This package provides the core functionality for the PDF/Text anonymizer, including text extraction, LLM-driven anonymization, and deanonymization logic. It is used by pdf-anonymizer-cli.
Installation for Development
This project uses uv and is structured as a monorepo. To install the necessary dependencies for development, run the following command from the root of the repository:
# From the repository root
uv sync
This will install the pdf-anonymizer-core package in editable mode.
Environment Variables
The core library itself does not load .env files. Environment variables must be loaded by the application that uses this library (e.g., pdf-anonymizer-cli) or set in your shell.
- GOOGLE_API_KEY: Required when using Google's Gemini models.
- OLLAMA_HOST: Optional; defaults to http://localhost:11434 when using local Ollama models.
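Because the library leaves environment handling to the caller, an application may want to check these variables before invoking it. The following is a minimal sketch using only the standard library. The helper name `resolve_llm_env` and the rule that Gemini model names start with "gemini" are illustrative assumptions, not part of pdf-anonymizer-core.

```python
import os

DEFAULT_OLLAMA_HOST = "http://localhost:11434"  # default documented above


def resolve_llm_env(model_name, env=None):
    """Hypothetical helper: collect the settings a given model needs."""
    env = os.environ if env is None else env
    if model_name.startswith("gemini"):
        # Assumption: Gemini model names start with "gemini".
        api_key = env.get("GOOGLE_API_KEY")
        if not api_key:
            raise RuntimeError("GOOGLE_API_KEY must be set for Gemini models")
        return {"GOOGLE_API_KEY": api_key}
    # Otherwise assume a local Ollama model and fall back to the default host.
    return {"OLLAMA_HOST": env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST)}
```

Passing `env` explicitly keeps the check easy to test; in an application you would normally call `resolve_llm_env(model_name)` after loading any .env file yourself.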
API Usage
anonymize_file()
Anonymizes a single file and returns the anonymized text and a mapping of original entities to their placeholders.
from pdf_anonymizer_core.core import anonymize_file
from pdf_anonymizer_core.prompts import detailed

# Example of programmatic usage
text, mapping = anonymize_file(
    file_path="/path/to/file.pdf",
    prompt_template=detailed.prompt_template,
    model_name="gemini-2.5-flash",
)

if text and mapping:
    print("Anonymized Text:", text)
    print("Mapping:", mapping)
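Since deanonymize_file() reads its mapping from a file, an application that calls anonymize_file() directly may need to persist both return values. A minimal sketch, assuming the mapping is a JSON-serializable dict. The helper name `save_result` and the output file names are hypothetical, and this README does not confirm the exact file format deanonymize_file() expects.

```python
import json
from pathlib import Path


def save_result(text, mapping, out_dir):
    """Hypothetical helper: write anonymized text and its mapping to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text_path = out / "anonymized.md"
    mapping_path = out / "mapping.json"
    text_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII names readable in the JSON file.
    mapping_path.write_text(
        json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return text_path, mapping_path
```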
deanonymize_file()
Reverts anonymization using a mapping file.
from pdf_anonymizer_core.utils import deanonymize_file

# Assumes you have an anonymized file and a mapping file
deanonymized_text, stats = deanonymize_file(
    anonymized_file="path/to/anonymized.md",
    mapping_file="path/to/mapping.json",
)

if deanonymized_text:
    print("Deanonymized Text:", deanonymized_text)
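Conceptually, the reversal step can be pictured as a string substitution driven by the mapping. The sketch below is not the library's implementation. It assumes the mapping is a dict from original entity to placeholder, as described for anonymize_file(). The function name `restore_placeholders` and the placeholder format in the usage note are hypothetical.

```python
def restore_placeholders(anonymized_text, mapping):
    """Illustrative only: swap each placeholder back to its original entity."""
    restored = anonymized_text
    replaced = 0
    # Handle longer placeholders first so that replacing "PERSON_1"
    # cannot corrupt an occurrence of "PERSON_10".
    for original, placeholder in sorted(
        mapping.items(), key=lambda item: len(item[1]), reverse=True
    ):
        replaced += restored.count(placeholder)
        restored = restored.replace(placeholder, original)
    return restored, replaced
```

For example, with the mapping `{"Alice": "PERSON_1", "Bob": "PERSON_10"}`, the text `"PERSON_1 met PERSON_10."` is restored to `"Alice met Bob."` with two replacements counted.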
Configuration
You can import default configurations and available models from the conf module.
from pdf_anonymizer_core.conf import (
    DEFAULT_MODEL_NAME,
    ModelName,
    PromptEnum,
)

print(f"Default model: {DEFAULT_MODEL_NAME}")
print(f"Available Google models: {[m.value for m in ModelName if 'gemini' in m.value]}")