Translate PDF documents using OCR and machine translation

TranslatePDF

Translate PDFs using OCR and machine translation. Both scanned documents and digital PDFs are supported.

This tool extracts text from PDF pages using Surya OCR (for scanned documents) or PyMuPDF (for digital PDFs), translates it via Google Translate, and renders the translated text back onto the original document while preserving layout, colors, and formatting.

Features

  • Dual-mode translation: Automatically detects PDF type and uses the optimal method
    • Digital PDFs: Fast text extraction and in-place replacement (no OCR needed)
    • Scanned PDFs: High-quality OCR powered by Surya (supports 90+ languages)
  • Automatic text color detection: chooses black or white text for the best contrast against the background
  • Batch processing for faster translation of multi-page documents
  • GPU acceleration on NVIDIA (CUDA) and Apple Silicon (MPS)
  • Page range selection to translate specific pages
  • Preserves formatting including bold and underlined text

Installation

pip install translatepdf

System Dependencies

You'll need poppler for PDF rendering:

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows (via conda)
conda install -c conda-forge poppler

Quick Start

Command Line

# Basic usage - auto-detects PDF type, translates English to Hindi
translatepdf document.pdf

# Specify languages
translatepdf document.pdf --source en --target hi

# Force digital mode (faster for PDFs with embedded text)
translatepdf document.pdf --mode digital

# Force OCR mode (for scanned documents)
translatepdf document.pdf --mode ocr

# Translate specific pages
translatepdf document.pdf --pages 1-5

# Custom output path
translatepdf document.pdf -o translated.pdf

# Higher quality OCR (slower)
translatepdf document.pdf --dpi 300

Python API

from pdf_translator import PDFTranslator, TranslationConfig

# Simple usage - auto-detects PDF type
translator = PDFTranslator()
translator.translate("document.pdf", target_lang="hi")

# Force digital mode for PDFs with embedded text
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="digital",  # "auto", "ocr", or "digital"
)

translator = PDFTranslator(config)
translator.translate("document.pdf")

# Force OCR mode with custom settings
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="ocr",
    device="mps",  # or "cuda", "cpu"
    dpi=200,
)

translator = PDFTranslator(config)
output_path = translator.translate(
    "document.pdf",
    output_path="translated.pdf",
    page_range="1-10",
)

Device-Specific Configs

from pdf_translator import TranslationConfig

# For Apple Silicon Macs
config = TranslationConfig.for_apple_silicon()

# For NVIDIA GPUs
config = TranslationConfig.for_nvidia_gpu()

# For CPU-only systems
config = TranslationConfig.for_cpu()
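
How the package resolves the auto device setting is not documented here. The sketch below shows one common way to pick a backend at runtime, assuming PyTorch is the underlying framework. pick_device is a hypothetical helper for illustration, not part of the pdf_translator API.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available.

    Hypothetical helper: not part of pdf_translator.
    """
    try:
        import torch  # optional dependency; fall back to CPU if missing
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The returned string matches the values accepted by the device option shown above, so it could be passed as TranslationConfig(device=pick_device()).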

CLI Options

Option        Short  Description                           Default
--output      -o     Output file path                      input_translated.pdf
--source      -s     Source language code                  en
--target      -t     Target language code                  hi
--mode        -m     Translation mode: auto, ocr, digital  auto
--pages       -p     Page range (e.g., "1-5", "1,3,5")     all
--dpi         -d     Rendering resolution (OCR mode only)  200
--batch-size  -b     Pages per OCR batch (OCR mode only)   4
--device             Compute device (OCR mode only)        auto
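
The --pages syntax combines comma-separated entries with inclusive ranges. The package's own parser (utils/page_range.py) is not shown in this README, so the following is an independent sketch of how such a string could be parsed. The name parse_page_range, the 1-based numbering, and the error behavior are all assumptions.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Parse a spec such as "1-5" or "1,3,5" into sorted 1-based page numbers.

    Hypothetical helper: the real parser in pdf_translator may differ.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_range("1-3,5", total_pages=10))  # [1, 2, 3, 5]
```

Collecting pages in a set means overlapping entries such as "1-3,2-4" do not produce duplicates.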

Translation Modes

Mode     Description                     Best For
auto     Auto-detect PDF type            Most PDFs (default)
digital  Extract embedded text directly  Digital/native PDFs, Word exports, LaTeX
ocr      Use Surya OCR on page images    Scanned documents, image-based PDFs

Digital mode is significantly faster and usually produces better results for PDFs with embedded text (e.g., documents exported from Word, LaTeX, or similar authoring tools).

OCR mode is required for scanned documents or image-based PDFs where text is not selectable.

Supported Languages

The source language can be any of the 90+ languages supported by Surya OCR. Target languages depend on Google Translate availability.

Common language codes:

  • en - English
  • hi - Hindi
  • es - Spanish
  • fr - French
  • de - German
  • zh - Chinese
  • ja - Japanese
  • ko - Korean
  • ar - Arabic
  • ru - Russian

How It Works

Auto Mode (Default)

The tool first checks if the PDF contains extractable text. If it does, it uses digital mode; otherwise, it falls back to OCR mode.
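
The check described above can be sketched as a small decision function. The actual heuristic is not documented here, so the character threshold and the name detect_mode are assumptions. The function takes already-extracted page strings to stay self-contained.

```python
def detect_mode(page_texts: list[str], min_chars: int = 20) -> str:
    """Return "digital" if the pages hold enough extractable text, else "ocr".

    Hypothetical helper; min_chars is an illustrative threshold, not the
    value used by pdf_translator.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return "digital" if total >= min_chars else "ocr"


# In practice the strings would come from a PDF library, for example
# PyMuPDF's page.get_text(); plain strings keep this sketch runnable.
print(detect_mode(["Chapter 1\nIntroduction to the topic", ""]))  # digital
print(detect_mode(["", "  \n"]))  # ocr
```

A threshold above zero avoids treating a scanned PDF with a few stray embedded characters (a page number, for instance) as a digital document.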

Digital Mode

  1. Text Extraction: PyMuPDF extracts text with position and formatting info
  2. Translation: Text is batch-translated via Google Translate
  3. In-Place Replacement: Original text is replaced with translations using redaction annotations
  4. Output: Native PDF with translated text

This mode preserves the original PDF structure and is much faster than OCR.
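
Step 2 sends text in batches rather than one string at a time. The batching strategy is not described in this README. One plausible approach is to group text spans under a per-request character budget, as in the sketch below. The name batch_by_chars and the 5000-character default are assumptions for illustration only.

```python
def batch_by_chars(texts: list[str], max_chars: int = 5000) -> list[list[str]]:
    """Group strings into batches whose combined length fits max_chars.

    Hypothetical helper. A single string longer than max_chars is not
    split; it ends up alone in its own batch.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for text in texts:
        # Start a new batch when adding this string would exceed the budget.
        if current and current_len + len(text) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        batches.append(current)
    return batches


print(batch_by_chars(["aaaa", "bbbb", "cc"], max_chars=8))
# [['aaaa', 'bbbb'], ['cc']]
```

Because order is preserved within and across batches, translated strings can be matched back to their original positions by index.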

OCR Mode

  1. PDF to Images: Each page is rendered as a high-resolution image
  2. OCR: Surya detects and recognizes text regions with their positions
  3. Translation: Text is batch-translated via Google Translate
  4. Rendering: Original text is erased and replaced with translations
  5. Output: Processed images are combined back into a PDF

The tool samples background colors around each text region to cleanly erase the original text, then automatically chooses black or white text for maximum contrast.
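
The color logic described above can be illustrated in a few lines. This is a sketch under stated assumptions, not the package's renderer: the luma weights (ITU-R BT.601), the midpoint threshold, and the names mean_color and text_color_for are chosen here for illustration.

```python
def mean_color(pixels: list[tuple[int, int, int]]) -> tuple[int, int, int]:
    """Average (R, G, B) samples taken around a text region."""
    n = len(pixels)
    r = sum(p[0] for p in pixels) // n
    g = sum(p[1] for p in pixels) // n
    b = sum(p[2] for p in pixels) // n
    return (r, g, b)


def text_color_for(background: tuple[int, int, int]) -> tuple[int, int, int]:
    """Pick black or white text, whichever contrasts more with the background."""
    r, g, b = background
    # Perceived brightness on a 0-255 scale (BT.601 luma weights; an assumption).
    luma = 0.299 * r + 0.587 * g + 0.114 * b
    return (0, 0, 0) if luma > 127.5 else (255, 255, 255)


light_bg = mean_color([(250, 250, 250), (240, 240, 240)])
print(light_bg, text_color_for(light_bg))   # (245, 245, 245) (0, 0, 0)
print(text_color_for((20, 30, 60)))         # (255, 255, 255)
```

The averaged background color serves two purposes in such a scheme: it is the fill used to erase the original text, and it is the input for choosing the text color.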

Performance Tips

  • Use a lower DPI (150-200) for faster processing, or a higher DPI (300+) for better OCR quality
  • Increase batch size if you have more GPU memory
  • Use page ranges to translate only what you need
  • GPU acceleration can provide a 5-10x speedup over CPU in OCR mode
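
To see why DPI matters for speed, it helps to work out the rendered image size. PDF dimensions are measured in points (72 per inch), so the pixel size is points × DPI / 72 and the pixel count grows with the square of the DPI. The helper name rendered_size below is hypothetical.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (72 points = 1 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
for dpi in (150, 200, 300):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} MP)")
# 150 DPI: 1275 x 1650 px (2.1 MP)
# 200 DPI: 1700 x 2200 px (3.7 MP)
# 300 DPI: 2550 x 3300 px (8.4 MP)
```

Going from the default 200 DPI to 300 DPI multiplies the pixel count per page by 2.25, which is why higher settings cost noticeably more time and memory.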

Development

Setup

# Clone the repo
git clone https://github.com/bhanurathore/pdf-translator.git
cd pdf-translator

# Create virtual environment
python -m venv .venv

# Activate it
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Set up pre-commit hooks
pre-commit install

Running Tests

# Run all tests
make test

# Run tests with coverage report
make test-cov

# Or using pytest directly
pytest tests/ -v

# Run specific test file
pytest tests/core/test_config.py -v

# Run with coverage
pytest tests/ --cov=pdf_translator --cov-report=term-missing

Linting and Formatting

# Check code style with ruff
make lint

# Auto-fix linting issues
make lint-fix

# Format code with black
make format

# Check formatting without changes
make format-check

# Run type checker
make typecheck

# Run all checks (format, lint, typecheck, test)
make check

Pre-commit Hooks

Pre-commit hooks run automatically on git commit. To run them manually:

# Run on all files
pre-commit run --all-files

# Run specific hook
pre-commit run black --all-files

Building and Publishing

# Build package
make build

# Publish to Test PyPI (for testing)
make publish-test

# Publish to PyPI
make publish

Project Structure

pdf-translator/
├── pdf_translator/              # Main package
│   ├── __init__.py              # Package exports
│   ├── cli.py                   # Command-line interface
│   ├── py.typed                 # PEP 561 type marker
│   ├── core/                    # Core functionality
│   │   ├── config.py            # Configuration management
│   │   ├── ocr.py               # Surya OCR wrapper
│   │   ├── pdf_extractor.py     # Digital PDF text extraction
│   │   ├── pdf_renderer.py      # Digital PDF text replacement
│   │   ├── renderer.py          # Image-based text rendering
│   │   ├── text_translator.py   # Google Translate wrapper
│   │   └── translator.py        # Main orchestrator
│   └── utils/                   # Utilities
│       ├── fonts.py             # Cross-platform font discovery
│       └── page_range.py        # Page range parsing
├── tests/                       # Test suite (mirrors package structure)
│   ├── conftest.py              # Shared fixtures
│   ├── test_cli.py
│   ├── core/                    # Tests for core modules
│   │   ├── test_config.py
│   │   ├── test_ocr.py
│   │   ├── test_pdf_extractor.py
│   │   ├── test_pdf_renderer.py
│   │   ├── test_renderer.py
│   │   └── test_text_translator.py
│   └── utils/                   # Tests for utilities
│       ├── test_fonts.py
│       └── test_page_range.py
├── docs/                        # Documentation
│   └── PROJECT_STRUCTURE.md     # Explains all project files
├── pyproject.toml               # Package configuration (main config!)
├── setup.py                     # Legacy compatibility
├── Makefile                     # Command shortcuts
├── MANIFEST.in                  # Package manifest
├── .pre-commit-config.yaml      # Pre-commit hooks
├── .gitignore                   # Git ignore rules
├── .editorconfig                # Editor settings
├── .python-version              # Python version (for pyenv)
└── LICENSE                      # MIT License

License

MIT License - see LICENSE for details.

Maintainer

Bhanu Pratap Singh Rathore
