A core library to anonymize PDF, Markdown, and plain text files using LLMs.

🦉🫥 PDF Anonymizer Core

This package provides the core functionality for the PDF/Text anonymizer, including text extraction, LLM-driven anonymization, and deanonymization logic. It is used by pdf-anonymizer-cli.

Installation for Development

This project uses uv and is structured as a monorepo. To install the necessary dependencies for development, run the following command from the root of the repository:

# From the repository root
uv sync

This will install the pdf-anonymizer-core package in editable mode.

Environment Variables

The core library itself does not load .env files. Environment variables must be loaded by the application that uses this library (e.g., pdf-anonymizer-cli) or set in your shell.

  • GOOGLE_API_KEY: Required when using Google's Gemini models.
  • OLLAMA_HOST: Optional. Used with local Ollama models; defaults to http://localhost:11434.
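Because the library does not load .env files, the calling application may want to check the environment before it invokes the library. The following is a minimal sketch using only the standard library. The helper name check_environment is hypothetical, and treating any model whose name contains "gemini" as a Google model (and everything else as an Ollama model) is an assumption made for illustration.

```python
import os


def check_environment(model_name: str) -> dict:
    """Collect the environment settings relevant to the chosen model.

    Hypothetical helper, not part of pdf_anonymizer_core.
    Assumption: names containing "gemini" are Google models and all
    other names refer to local Ollama models.
    """
    settings = {}
    if "gemini" in model_name:
        api_key = os.environ.get("GOOGLE_API_KEY")
        if not api_key:
            raise RuntimeError("GOOGLE_API_KEY must be set to use Gemini models")
        settings["GOOGLE_API_KEY"] = api_key
    else:
        # Mirror the documented default when OLLAMA_HOST is not set.
        settings["OLLAMA_HOST"] = os.environ.get(
            "OLLAMA_HOST", "http://localhost:11434"
        )
    return settings
```

Calling check_environment("gemini-2.5-flash") at application startup fails fast with a clear message instead of surfacing a missing key in the middle of a run.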

API Usage

anonymize_file()

Anonymizes a single file and returns the anonymized text and a mapping of original entities to their placeholders.

from pdf_anonymizer_core.core import anonymize_file
from pdf_anonymizer_core.prompts import detailed

# Example of programmatic usage
text, mapping = anonymize_file(
    file_path="/path/to/file.pdf",
    prompt_template=detailed.prompt_template,
    model_name="gemini-2.5-flash"
)

if text and mapping:
    print("Anonymized Text:", text)
    print("Mapping:", mapping)
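To deanonymize later, the returned text and mapping need to be written to disk. The sketch below shows one way an application might do that. The helper save_outputs, the file naming scheme, and the JSON layout are all assumptions; this page does not document the mapping file format that deanonymize_file() expects, so check the library before relying on it. The sketch also assumes the mapping is a JSON-serializable dict.

```python
import json
from pathlib import Path


def save_outputs(text: str, mapping: dict, out_dir: str, stem: str):
    """Write anonymized text and its mapping next to each other.

    Hypothetical helper, not part of pdf_anonymizer_core.
    Returns the two paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    text_path = out / f"{stem}.anonymized.md"
    mapping_path = out / f"{stem}.mapping.json"

    text_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII names readable in the JSON file.
    mapping_path.write_text(
        json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return text_path, mapping_path
```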

deanonymize_file()

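Reverts anonymization by replacing the placeholders in an anonymized file with the original entities recorded in a mapping file. Returns the deanonymized text and a stats value.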

from pdf_anonymizer_core.utils import deanonymize_file

# Assumes you have an anonymized file and a mapping file
deanonymized_text, stats = deanonymize_file(
    anonymized_file="path/to/anonymized.md",
    mapping_file="path/to/mapping.json"
)

if deanonymized_text:
    print("Deanonymized Text:", deanonymized_text)
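Conceptually, deanonymization inverts the mapping and substitutes each placeholder with its original entity. The sketch below illustrates that idea only; it is not the library's implementation. The function name restore_placeholders, the PERSON_1 placeholder style, and the shape of the returned counts are assumptions.

```python
def restore_placeholders(text: str, mapping: dict):
    """Replace placeholders with their original entities.

    Illustrative sketch, not the code used by deanonymize_file().
    `mapping` maps original entity -> placeholder, as described above.
    Returns the restored text and a per-placeholder replacement count.
    """
    # Invert original -> placeholder into placeholder -> original.
    reverse = {placeholder: original for original, placeholder in mapping.items()}

    counts = {}
    # Handle longer placeholders first so that "PERSON_10" is not
    # partially replaced by the shorter "PERSON_1".
    for placeholder in sorted(reverse, key=len, reverse=True):
        occurrences = text.count(placeholder)
        if occurrences:
            text = text.replace(placeholder, reverse[placeholder])
        counts[placeholder] = occurrences
    return text, counts
```

Ordering by length matters whenever one placeholder is a prefix of another; without it, a naive loop could corrupt the longer placeholder before it is restored.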

Configuration

You can import default configurations and available models from the conf module.

from pdf_anonymizer_core.conf import (
    DEFAULT_MODEL_NAME,
    ModelName,
    PromptEnum,
)

print(f"Default model: {DEFAULT_MODEL_NAME}")
print(f"Available Google models: {[m.value for m in ModelName if 'gemini' in m.value]}")
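An application that accepts a model name from user input can validate it against an enum before calling the library. The sketch below uses a stand-in enum so that it runs without the package installed. ExampleModel, its members, EXAMPLE_DEFAULT, and resolve_model are all hypothetical; only the value "gemini-2.5-flash" appears in the examples above. It assumes ModelName behaves like a standard Python Enum, which the iteration and .value access above suggest.

```python
from enum import Enum
from typing import Optional


class ExampleModel(str, Enum):
    """Stand-in for pdf_anonymizer_core.conf.ModelName (hypothetical members)."""

    GEMINI_FLASH = "gemini-2.5-flash"
    LOCAL_EXAMPLE = "example-local-model"  # hypothetical, not a real model name


# Hypothetical stand-in for DEFAULT_MODEL_NAME.
EXAMPLE_DEFAULT = ExampleModel.GEMINI_FLASH


def resolve_model(name: Optional[str]) -> ExampleModel:
    """Return the enum member for `name`, or the default when name is None."""
    if name is None:
        return EXAMPLE_DEFAULT
    try:
        # Looking up an Enum by value raises ValueError for unknown values.
        return ExampleModel(name)
    except ValueError:
        valid = ", ".join(m.value for m in ExampleModel)
        raise ValueError(f"Unknown model {name!r}; choose one of: {valid}") from None
```

With the real package, the same pattern would presumably use ModelName and DEFAULT_MODEL_NAME in place of the stand-ins.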
