A core library to anonymize PDF, Markdown, and plain text files using LLMs.

PDF Anonymizer Core

This package provides the core functionality for the PDF/Text anonymizer, including text extraction, LLM-driven anonymization, and deanonymization logic. It is used by pdf-anonymizer-cli.

Installation for Development

This project uses uv and is structured as a monorepo. To install the necessary dependencies for development, run the following command from the root of the repository:

# From the repository root
uv sync

This will install the pdf-anonymizer-core package in editable mode.

Environment Variables

The core library itself does not load .env files. Environment variables must be loaded by the application that uses this library (e.g., pdf-anonymizer-cli) or set in your shell.

  • GOOGLE_API_KEY: Required when using Google's Gemini models.
  • OLLAMA_HOST: Optional. Base URL of the Ollama server used for local models; defaults to http://localhost:11434.
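
Because the library does not load .env files, the calling application may want to check these variables itself before invoking a model. The following is a minimal sketch using only the standard library; the helper names resolve_ollama_host and require_google_api_key are hypothetical and are not part of pdf-anonymizer-core.

```python
import os
from typing import Mapping, Optional

# Default taken from the variable list above.
DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def resolve_ollama_host(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OLLAMA_HOST if set and non-empty, else the documented default."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_HOST") or DEFAULT_OLLAMA_HOST


def require_google_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GOOGLE_API_KEY, or fail early with a clear message."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; export it in your shell or load it "
            "in the application before using Gemini models."
        )
    return key
```

Passing a plain dict instead of os.environ keeps the helpers easy to test; in an application they would typically be called once at startup.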

API Usage

anonymize_file()

Anonymizes a single file and returns the anonymized text and a mapping of original entities to their placeholders.

from pdf_anonymizer_core.core import anonymize_file
from pdf_anonymizer_core.prompts import detailed

# Example of programmatic usage
text, mapping = anonymize_file(
    file_path="/path/to/file.pdf",
    prompt_template=detailed.prompt_template,
    model_name="gemini-2.5-flash"
)

if text and mapping:
    print("Anonymized Text:", text)
    print("Mapping:", mapping)
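
To reuse the result later, the returned text and mapping can be written to disk. The sketch below assumes the mapping is a JSON-serializable dict of strings; the function name save_anonymization and the output file names are hypothetical, and this README does not specify the exact mapping-file layout that deanonymize_file() expects.

```python
import json
from pathlib import Path
from typing import Dict, Tuple


def save_anonymization(
    text: str, mapping: Dict[str, str], out_dir: str, stem: str
) -> Tuple[Path, Path]:
    """Write anonymized text and its mapping side by side; return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    text_path = out / f"{stem}.anonymized.md"    # hypothetical naming scheme
    mapping_path = out / f"{stem}.mapping.json"  # hypothetical naming scheme

    text_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII names readable in the JSON file.
    mapping_path.write_text(
        json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return text_path, mapping_path
```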

deanonymize_file()

Reverts anonymization using a mapping file.

from pdf_anonymizer_core.utils import deanonymize_file

# Assumes you have an anonymized file and a mapping file
deanonymized_text, stats = deanonymize_file(
    anonymized_file="path/to/anonymized.md",
    mapping_file="path/to/mapping.json"
)

if deanonymized_text:
    print("Deanonymized Text:", deanonymized_text)
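
Conceptually, deanonymization substitutes each placeholder with its original entity. The sketch below illustrates that idea only; it is not the library's implementation. It assumes the mapping goes from original entity to placeholder, as described for anonymize_file() above. The function name deanonymize_text and the shape of the returned counts are hypothetical.

```python
import re
from typing import Dict, Tuple


def deanonymize_text(
    anonymized: str, mapping: Dict[str, str]
) -> Tuple[str, Dict[str, int]]:
    """Replace placeholders with originals and count replacements per placeholder.

    `mapping` is assumed to map original entity -> placeholder.
    """
    # Invert to placeholder -> original for lookup during substitution.
    placeholder_to_original = {ph: orig for orig, ph in mapping.items()}
    if not placeholder_to_original:
        return anonymized, {}

    # Try longer placeholders first so "PERSON_10" is not read as "PERSON_1".
    ordered = sorted(placeholder_to_original, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(ph) for ph in ordered))

    counts: Dict[str, int] = {}

    def restore(match: re.Match) -> str:
        ph = match.group(0)
        counts[ph] = counts.get(ph, 0) + 1
        return placeholder_to_original[ph]

    return pattern.sub(restore, anonymized), counts
```

A single regex pass avoids re-scanning text that has already been restored, which matters if an original entity happens to contain something that looks like a placeholder.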

Configuration

You can import default configurations and available models from the conf module.

from pdf_anonymizer_core.conf import (
    DEFAULT_MODEL_NAME,
    ModelName,
    PromptEnum,
)

print(f"Default model: {DEFAULT_MODEL_NAME}")
print(f"Available Google models: {[m.value for m in ModelName if 'gemini' in m.value]}")

Download files

Source Distribution

pdf_anonymizer_core-0.3.0.tar.gz (10.5 kB)

Uploaded Source

Built Distribution

pdf_anonymizer_core-0.3.0-py3-none-any.whl (11.9 kB)

Uploaded Python 3

File details

Details for the file pdf_anonymizer_core-0.3.0.tar.gz.

File metadata

  • Download URL: pdf_anonymizer_core-0.3.0.tar.gz
  • Size: 10.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for pdf_anonymizer_core-0.3.0.tar.gz
Algorithm Hash digest
SHA256 d41f2aca1eb72174753486525728e0fd1936fa27b0ffe68e67293c0b9dbd363d
MD5 a3ff8c4dd084c28e5f1326733ac3b4c2
BLAKE2b-256 60b95c002e3cca3152384e38db6bdab5ecc150cf3058f9c38cb41a60b0837abc

File details

Details for the file pdf_anonymizer_core-0.3.0-py3-none-any.whl.

File hashes

Hashes for pdf_anonymizer_core-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 197939b3a2b5bf7e74f42f78a99acc928a0fdcfc70c26e53e23f9f8752b9cc13
MD5 103b99d6b67a01ea2d88ce91488cb72e
BLAKE2b-256 8b11eb5743b4175637971526508d89951b5e68799beaf06906d4958f56fecfb3
