Translate PDF documents using OCR and machine translation

Maintainers

BhanuJodha

Project description

TranslatePDF

Translate PDFs using OCR and machine translation. Both scanned documents and digital PDFs are supported.

This tool extracts text from PDF pages using Surya OCR (for scanned documents) or PyMuPDF (for digital PDFs), translates it via Google Translate, and renders the translated text back onto the original document, preserving layout, colors, and formatting.

Features

- Dual-mode translation: Automatically detects PDF type and uses the optimal method
  - Digital PDFs: Fast text extraction and in-place replacement (no OCR needed)
  - Scanned PDFs: High-quality OCR powered by Surya (supports 90+ languages)
- Automatic text color detection ensures readability on any background
- Batch processing for faster translation of multi-page documents
- GPU acceleration on NVIDIA (CUDA) and Apple Silicon (MPS)
- Page range selection to translate specific pages
- Preserves formatting including bold and underlined text

Installation

pip install translatepdf

System Dependencies

You'll need poppler for PDF rendering:

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows (via conda)
conda install -c conda-forge poppler

Quick Start

Command Line

# Basic usage - auto-detects PDF type, translates English to Hindi
translatepdf document.pdf

# Specify languages
translatepdf document.pdf --source en --target hi

# Force digital mode (faster for PDFs with embedded text)
translatepdf document.pdf --mode digital

# Force OCR mode (for scanned documents)
translatepdf document.pdf --mode ocr

# Translate specific pages
translatepdf document.pdf --pages 1-5

# Custom output path
translatepdf document.pdf -o translated.pdf

# Higher quality OCR (slower)
translatepdf document.pdf --dpi 300

Python API

from pdf_translator import PDFTranslator, TranslationConfig

# Simple usage - auto-detects PDF type
translator = PDFTranslator()
translator.translate("document.pdf", target_lang="hi")

# Force digital mode for PDFs with embedded text
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="digital",  # "auto", "ocr", or "digital"
)

translator = PDFTranslator(config)
translator.translate("document.pdf")

# Force OCR mode with custom settings
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="ocr",
    device="mps",  # or "cuda", "cpu"
    dpi=200,
)

translator = PDFTranslator(config)
output_path = translator.translate(
    "document.pdf",
    output_path="translated.pdf",
    page_range="1-10",
)

Device-Specific Configs

from pdf_translator import TranslationConfig

# For Apple Silicon Macs
config = TranslationConfig.for_apple_silicon()

# For NVIDIA GPUs
config = TranslationConfig.for_nvidia_gpu()

# For CPU-only systems
config = TranslationConfig.for_cpu()
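
The `--device` option defaults to `auto`, but how that value is resolved is not described here. The sketch below shows one plausible selection order (CUDA, then MPS, then CPU), assuming PyTorch is the backend. `resolve_device` is a hypothetical helper for illustration, not part of `pdf_translator`.

```python
def resolve_device(requested: str = "auto") -> str:
    """Return "cuda", "mps", or "cpu" for a requested device string.

    Hypothetical helper: an explicit request is returned unchanged, and
    "auto" probes PyTorch only if it is installed.
    """
    if requested != "auto":
        return requested
    try:
        import torch  # optional dependency; if absent, fall back to CPU
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(resolve_device("auto"))
```

If this matches the package's behavior, passing an explicit `device` remains the way to override the probe.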

CLI Options

Option	Short	Description	Default
`--output`	`-o`	Output file path	`input_translated.pdf`
`--source`	`-s`	Source language code	`en`
`--target`	`-t`	Target language code	`hi`
`--mode`	`-m`	Translation mode: `auto`, `ocr`, `digital`	`auto`
`--pages`	`-p`	Page range (e.g., "1-5", "1,3,5")	`all`
`--dpi`	`-d`	Rendering resolution (OCR mode only)	`200`
`--batch-size`	`-b`	Pages per OCR batch (OCR mode only)	`4`
`--device`		Compute device (OCR mode only)	`auto`
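
The `--pages` formats in the table ("1-5", "1,3,5") can be expanded into explicit page numbers roughly as follows. This is a minimal sketch under the assumption that pages are 1-based and ranges are inclusive. `parse_page_range` is a hypothetical name and is not claimed to match `utils/page_range.py`.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-5" or "1,3,5" into sorted 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, total_pages + 1))
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Assumed policy: reject anything outside the document or reversed.
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_range("1-5", total_pages=10))
print(parse_page_range("1,3,5", total_pages=10))
```

Collecting pages in a set means overlapping parts such as "2-4,4" do not produce duplicates.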

Translation Modes

Mode	Description	Best For
`auto`	Auto-detect PDF type	Most PDFs (default)
`digital`	Extract embedded text directly	Digital/native PDFs, Word exports, LaTeX
`ocr`	Use Surya OCR on page images	Scanned documents, image-based PDFs

Digital mode is significantly faster and produces better results for PDFs with embedded text (e.g., documents created in Word, LaTeX, or other text editors).

OCR mode is required for scanned documents or image-based PDFs where text is not selectable.

Supported Languages

The source language can be any of the 90+ languages supported by Surya OCR. Target languages depend on Google Translate availability.

Common language codes:

en - English
hi - Hindi
es - Spanish
fr - French
de - German
zh - Chinese
ja - Japanese
ko - Korean
ar - Arabic
ru - Russian

How It Works

Auto Mode (Default)

The tool first checks if the PDF contains extractable text. If it does, it uses digital mode; otherwise, it falls back to OCR mode.
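
The detection rule is not spelled out, so the following is only a sketch of one way such a check could work: count the pages that carry a meaningful amount of embedded text and pick a mode from the ratio. The function name and both thresholds are assumptions. The per-page strings could come from PyMuPDF's `page.get_text()`.

```python
def choose_mode(
    page_texts: list[str],
    min_chars_per_page: int = 20,
    min_text_ratio: float = 0.5,
) -> str:
    """Return "digital" when enough pages carry embedded text, else "ocr".

    page_texts holds the text extracted from each page (empty for scans).
    Both thresholds are illustrative assumptions, not the tool's values.
    """
    if not page_texts:
        return "ocr"
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    ratio = pages_with_text / len(page_texts)
    return "digital" if ratio >= min_text_ratio else "ocr"


print(choose_mode(["This page has plenty of embedded text.", ""]))
print(choose_mode(["", "   ", ""]))
```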

Digital Mode

Text Extraction: PyMuPDF extracts text with position and formatting info
Translation: Text is batch-translated via Google Translate
In-Place Replacement: Original text is replaced with translations using redaction annotations
Output: Native PDF with translated text
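
The batch translation step can be sketched as grouping extracted strings into size-limited batches and sending each batch to a translation callable. The names `chunk_texts` and `translate_all` and the 4000-character limit are assumptions made for illustration. The package's actual batching is not shown in this document.

```python
from typing import Callable, Iterator


def chunk_texts(texts: list[str], max_chars: int = 4000) -> Iterator[list[str]]:
    """Yield consecutive groups of texts whose combined length fits max_chars.

    A single text longer than max_chars is yielded in a batch of its own.
    """
    batch: list[str] = []
    size = 0
    for text in texts:
        if batch and size + len(text) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += len(text)
    if batch:
        yield batch


def translate_all(
    texts: list[str],
    translate_batch: Callable[[list[str]], list[str]],
    max_chars: int = 4000,
) -> list[str]:
    """Translate texts batch by batch, preserving the input order."""
    translated: list[str] = []
    for batch in chunk_texts(texts, max_chars):
        translated.extend(translate_batch(batch))
    return translated


# Stand-in translator so the sketch runs offline.
print(translate_all(["hello", "world"], lambda batch: [t.upper() for t in batch]))
```

With deep-translator installed and network access available, `GoogleTranslator(source="en", target="hi").translate_batch` should be usable as the `translate_batch` argument.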

This mode preserves the original PDF structure and is much faster than OCR.

OCR Mode

PDF to Images: Each page is rendered as a high-resolution image
OCR: Surya detects and recognizes text regions with their positions
Translation: Text is batch-translated via Google Translate
Rendering: Original text is erased and replaced with translations
Output: Processed images are combined back into a PDF

The tool samples background colors around each text region to cleanly erase the original text, then automatically chooses black or white text for maximum contrast.
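
A minimal sketch of the idea above, assuming the background is estimated as the per-channel median of pixels sampled around a text box and that contrast is measured with the WCAG relative-luminance formula. The tool's actual sampling and contrast rules are not documented here, and all function names are hypothetical.

```python
from statistics import median

RGB = tuple[int, int, int]


def estimate_background(border_pixels: list[RGB]) -> RGB:
    """Estimate the background as the per-channel median of sampled pixels."""
    reds, greens, blues = zip(*border_pixels)
    return (int(median(reds)), int(median(greens)), int(median(blues)))


def relative_luminance(color: RGB) -> float:
    """WCAG relative luminance of an sRGB color, from 0.0 to 1.0."""

    def linearize(channel: int) -> float:
        value = channel / 255
        if value <= 0.04045:
            return value / 12.92
        return ((value + 0.055) / 1.055) ** 2.4

    red, green, blue = (linearize(c) for c in color)
    return 0.2126 * red + 0.7152 * green + 0.0722 * blue


def pick_text_color(background: RGB) -> RGB:
    """Return black or white, whichever contrasts more with the background."""
    luminance = relative_luminance(background)
    contrast_with_white = 1.05 / (luminance + 0.05)
    contrast_with_black = (luminance + 0.05) / 0.05
    if contrast_with_black >= contrast_with_white:
        return (0, 0, 0)
    return (255, 255, 255)


samples = [(250, 250, 250), (255, 255, 255), (10, 10, 10)]
background = estimate_background(samples)
print(background, pick_text_color(background))
```

Using the median rather than the mean keeps a few stray dark pixels from the original glyphs from skewing the fill color.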

Performance Tips

Use a lower DPI (150-200) for faster processing, or a higher DPI (300+) for better OCR quality
Increase batch size if you have more GPU memory
Use page ranges to translate only what you need
GPU acceleration provides 5-10x speedup over CPU
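
The DPI trade-off comes down to pixel count: a PDF point is 1/72 inch, so the rendered image grows linearly with DPI in each dimension and quadratically in total pixels. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (1 pt = 1/72 inch)."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


def pixel_cost_ratio(dpi_a: int, dpi_b: int) -> float:
    """How many times more pixels dpi_a produces than dpi_b for the same page."""
    return (dpi_a / dpi_b) ** 2


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792, 200))
print(rendered_size(612, 792, 300))
print(pixel_cost_ratio(300, 200))
```

For a US Letter page this gives 1700 x 2200 pixels at 200 DPI and 2550 x 3300 at 300 DPI, which is 2.25 times as many pixels to render and run OCR on.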

Development

Setup

# Clone the repo
git clone https://github.com/BhanuJodha/PDF-Translator.git
cd PDF-Translator

# Create virtual environment
python -m venv .venv

# Activate it
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Set up pre-commit hooks
pre-commit install

Running Tests

# Run all tests
make test

# Run tests with coverage report
make test-cov

# Or using pytest directly
pytest tests/ -v

# Run specific test file
pytest tests/core/test_config.py -v

# Run with coverage
pytest tests/ --cov=pdf_translator --cov-report=term-missing

Linting and Formatting

# Check code style with ruff
make lint

# Auto-fix linting issues
make lint-fix

# Format code with black
make format

# Check formatting without changes
make format-check

# Run type checker
make typecheck

# Run all checks (format, lint, typecheck, test)
make check

Pre-commit Hooks

Pre-commit hooks run automatically on git commit. To run manually:

# Run on all files
pre-commit run --all-files

# Run specific hook
pre-commit run black --all-files

Building and Publishing

# Build package
make build

# Publish to Test PyPI (for testing)
make publish-test

# Publish to PyPI
make publish

Project Structure

pdf-translator/
├── pdf_translator/              # Main package
│   ├── __init__.py              # Package exports
│   ├── cli.py                   # Command-line interface
│   ├── py.typed                 # PEP 561 type marker
│   ├── core/                    # Core functionality
│   │   ├── config.py            # Configuration management
│   │   ├── ocr.py               # Surya OCR wrapper
│   │   ├── pdf_extractor.py     # Digital PDF text extraction
│   │   ├── pdf_renderer.py      # Digital PDF text replacement
│   │   ├── renderer.py          # Image-based text rendering
│   │   ├── text_translator.py   # Google Translate wrapper
│   │   └── translator.py        # Main orchestrator
│   └── utils/                   # Utilities
│       ├── fonts.py             # Cross-platform font discovery
│       └── page_range.py        # Page range parsing
├── tests/                       # Test suite (mirrors package structure)
│   ├── conftest.py              # Shared fixtures
│   ├── test_cli.py
│   ├── core/                    # Tests for core modules
│   │   ├── test_config.py
│   │   ├── test_ocr.py
│   │   ├── test_pdf_extractor.py
│   │   ├── test_pdf_renderer.py
│   │   ├── test_renderer.py
│   │   └── test_text_translator.py
│   └── utils/                   # Tests for utilities
│       ├── test_fonts.py
│       └── test_page_range.py
├── docs/                        # Documentation
│   └── PROJECT_STRUCTURE.md     # Explains all project files
├── pyproject.toml               # Package configuration (main config!)
├── setup.py                     # Legacy compatibility
├── Makefile                     # Command shortcuts
├── MANIFEST.in                  # Package manifest
├── .pre-commit-config.yaml      # Pre-commit hooks
├── .gitignore                   # Git ignore rules
├── .editorconfig                # Editor settings
├── .python-version              # Python version (for pyenv)
└── LICENSE                      # MIT License

License

MIT License - see LICENSE for details.

Maintainer

Bhanu Pratap Singh Rathore

Acknowledgments

Surya OCR for the excellent OCR engine
PyMuPDF for digital PDF text extraction and manipulation
deep-translator for the translation API wrapper
pdf2image for PDF rendering

Release history

- 1.0.0 (this version): Feb 7, 2026
- 0.1.0: Feb 7, 2026

Download files

Source Distribution

translatepdf-1.0.0.tar.gz (29.5 kB)

Uploaded Feb 7, 2026 Source

Built Distribution

translatepdf-1.0.0-py3-none-any.whl (28.2 kB)

Uploaded Feb 7, 2026 Python 3

File details

Details for the file translatepdf-1.0.0.tar.gz.

File metadata

Download URL: translatepdf-1.0.0.tar.gz
Upload date: Feb 7, 2026
Size: 29.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for translatepdf-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`ed80d7c5993139be018216e661aea11dfabd1b10bc2f43867eb143f2ee66215d`
MD5	`b41c17155c185c785eeecd79697d48d4`
BLAKE2b-256	`75f13f7fc671eca19e0a2c79e548930104bd44afee618474339da232f6ffbf36`
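
The SHA256 digest listed above can be checked against a downloaded copy of the source distribution with the standard library. This is a small sketch. The file path in the usage note is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for translatepdf-1.0.0.tar.gz
EXPECTED_SHA256 = "ed80d7c5993139be018216e661aea11dfabd1b10bc2f43867eb143f2ee66215d"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA-256 digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(Path("translatepdf-1.0.0.tar.gz"))` should return `True` for an unmodified download saved in the current directory.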

Provenance

The following attestation bundles were made for translatepdf-1.0.0.tar.gz:

Publisher: publish.yml on BhanuJodha/PDF-Translator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: translatepdf-1.0.0.tar.gz
- Subject digest: ed80d7c5993139be018216e661aea11dfabd1b10bc2f43867eb143f2ee66215d
- Sigstore transparency entry: 927195292
- Sigstore integration time: Feb 7, 2026
Source repository:
- Permalink: BhanuJodha/PDF-Translator@e39c601a9e58c8dc2a423011748ff4824bb9d51c
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/BhanuJodha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e39c601a9e58c8dc2a423011748ff4824bb9d51c
- Trigger Event: release

File details

Details for the file translatepdf-1.0.0-py3-none-any.whl.

File metadata

Download URL: translatepdf-1.0.0-py3-none-any.whl
Upload date: Feb 7, 2026
Size: 28.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for translatepdf-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fffd264c4eaeab7a94c31577bc202d27da001df74f11e95509dd23b63a6b5892`
MD5	`4a2dbf21b7f3e814c83e26f3f63309f9`
BLAKE2b-256	`e011725b593513441bcd206cda9876db026fa5b2df0cfa3e80710dcf001b4a83`

Provenance

The following attestation bundles were made for translatepdf-1.0.0-py3-none-any.whl:

Publisher: publish.yml on BhanuJodha/PDF-Translator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: translatepdf-1.0.0-py3-none-any.whl
- Subject digest: fffd264c4eaeab7a94c31577bc202d27da001df74f11e95509dd23b63a6b5892
- Sigstore transparency entry: 927195304
- Sigstore integration time: Feb 7, 2026
Source repository:
- Permalink: BhanuJodha/PDF-Translator@e39c601a9e58c8dc2a423011748ff4824bb9d51c
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/BhanuJodha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e39c601a9e58c8dc2a423011748ff4824bb9d51c
- Trigger Event: release
