Translate PDF documents using OCR and machine translation
TranslatePDF
Translate PDFs using OCR and machine translation—supports both scanned documents and digital PDFs.
This tool extracts text from PDF pages using Surya OCR (for scanned documents) or PyMuPDF (for digital PDFs), translates it via Google Translate, and renders the translated text back onto the original document—preserving layout, colors, and formatting.
Features
- Dual-mode translation: Automatically detects PDF type and uses the optimal method
- Digital PDFs: Fast text extraction and in-place replacement (no OCR needed)
- Scanned PDFs: High-quality OCR powered by Surya (supports 90+ languages)
- Automatic text color detection ensures readability on any background
- Batch processing for faster translation of multi-page documents
- GPU acceleration on NVIDIA (CUDA) and Apple Silicon (MPS)
- Page range selection to translate specific pages
- Preserves formatting including bold and underlined text
Installation
pip install translatepdf
System Dependencies
You'll need poppler for PDF rendering:
# macOS
brew install poppler
# Ubuntu/Debian
sudo apt-get install poppler-utils
# Windows (via conda)
conda install -c conda-forge poppler
Quick Start
Command Line
# Basic usage - auto-detects PDF type, translates English to Hindi
translatepdf document.pdf
# Specify languages
translatepdf document.pdf --source en --target hi
# Force digital mode (faster for PDFs with embedded text)
translatepdf document.pdf --mode digital
# Force OCR mode (for scanned documents)
translatepdf document.pdf --mode ocr
# Translate specific pages
translatepdf document.pdf --pages 1-5
# Custom output path
translatepdf document.pdf -o translated.pdf
# Higher quality OCR (slower)
translatepdf document.pdf --dpi 300
Python API
from pdf_translator import PDFTranslator, TranslationConfig
# Simple usage - auto-detects PDF type
translator = PDFTranslator()
translator.translate("document.pdf", target_lang="hi")
# Force digital mode for PDFs with embedded text
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="digital",  # "auto", "ocr", or "digital"
)
translator = PDFTranslator(config)
translator.translate("document.pdf")
# Force OCR mode with custom settings
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="ocr",
    device="mps",  # or "cuda", "cpu"
    dpi=200,
)
translator = PDFTranslator(config)
output_path = translator.translate(
    "document.pdf",
    output_path="translated.pdf",
    page_range="1-10",
)
Device-Specific Configs
from pdf_translator import TranslationConfig
# For Apple Silicon Macs
config = TranslationConfig.for_apple_silicon()
# For NVIDIA GPUs
config = TranslationConfig.for_nvidia_gpu()
# For CPU-only systems
config = TranslationConfig.for_cpu()
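One way an `auto` device setting could resolve to a concrete backend is sketched below. `resolve_device` is a hypothetical helper, not part of the package's API. It assumes PyTorch as the backend (Surya runs on PyTorch) and falls back to CPU when PyTorch is not installed.

```python
def resolve_device(requested: str = "auto") -> str:
    """Return "cuda", "mps", or "cpu" for a requested device string."""
    if requested != "auto":
        # An explicit choice is passed through unchanged.
        return requested
    try:
        import torch  # optional here; missing on minimal CPU-only installs
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(resolve_device("auto"))
```

The printed value depends on the machine, so the preset constructors above remain the simpler choice when the hardware is known in advance.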
CLI Options
| Option | Short | Description | Default |
|---|---|---|---|
| --output | -o | Output file path | input_translated.pdf |
| --source | -s | Source language code | en |
| --target | -t | Target language code | hi |
| --mode | -m | Translation mode: auto, ocr, digital | auto |
| --pages | -p | Page range (e.g., "1-5", "1,3,5") | all |
| --dpi | -d | Rendering resolution (OCR mode only) | 200 |
| --batch-size | -b | Pages per OCR batch (OCR mode only) | 4 |
| --device | | Compute device (OCR mode only) | auto |
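The `--pages` option accepts ranges such as "1-5" and lists such as "1,3,5". The sketch below shows one way such a spec could be expanded into page numbers. `parse_page_range` and its `total_pages` argument are hypothetical and are not taken from the package's `utils/page_range.py`.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-5" or "1,3,5" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_range("1-3,7", total_pages=10))
```

Collecting pages in a set means overlapping parts such as "2-3,3-4" yield each page only once.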
Translation Modes
| Mode | Description | Best For |
|---|---|---|
| auto | Auto-detect PDF type | Most PDFs (default) |
| digital | Extract embedded text directly | Digital/native PDFs, Word exports, LaTeX |
| ocr | Use Surya OCR on page images | Scanned documents, image-based PDFs |
Digital mode is significantly faster and produces better results for PDFs with embedded text (e.g., documents exported from Word, LaTeX, or similar authoring tools).
OCR mode is required for scanned documents or image-based PDFs where text is not selectable.
Supported Languages
The source language can be any of the 90+ languages supported by Surya OCR. Target languages depend on Google Translate availability.
Common language codes:
- en: English
- hi: Hindi
- es: Spanish
- fr: French
- de: German
- zh: Chinese
- ja: Japanese
- ko: Korean
- ar: Arabic
- ru: Russian
How It Works
Auto Mode (Default)
The tool first checks if the PDF contains extractable text. If it does, it uses digital mode; otherwise, it falls back to OCR mode.
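A minimal sketch of that decision, assuming the text of each page has already been extracted (for example with PyMuPDF). `choose_mode`, the 20-character threshold, and the majority rule are all illustrative assumptions, not the tool's actual heuristic.

```python
def choose_mode(page_texts: list[str], min_chars_per_page: int = 20) -> str:
    """Pick "digital" when most pages carry extractable text, else "ocr"."""
    if not page_texts:
        return "ocr"
    # Count pages whose extracted text is long enough to be real content
    # rather than a stray page number or watermark.
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    # Assumption: a simple majority of text-bearing pages means "digital".
    return "digital" if pages_with_text * 2 > len(page_texts) else "ocr"


print(choose_mode(["This page has embedded, selectable text.", ""]))
```

When the heuristic guesses wrong for a mixed document, the `--mode` option forces either path.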
Digital Mode
- Text Extraction: PyMuPDF extracts text with position and formatting info
- Translation: Text is batch-translated via Google Translate
- In-Place Replacement: Original text is replaced with translations using redaction annotations
- Output: Native PDF with translated text
This mode preserves the original PDF structure and is much faster than OCR.
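The batching in the translation step can be pictured as grouping text spans so that each request stays under a size limit. The sketch below is an assumption about how that might look: `batch_texts` is a hypothetical helper, and the 5000-character default is chosen only for illustration.

```python
def batch_texts(texts: list[str], max_chars: int = 5000) -> list[list[str]]:
    """Group strings, in order, into batches of at most max_chars characters."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for text in texts:
        # Start a new batch when adding this string would exceed the limit.
        # A single oversized string still gets a batch of its own.
        if current and current_len + len(text) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        batches.append(current)
    return batches


print(batch_texts(["aaa", "bbb", "ccc"], max_chars=6))
```

Because the order of strings is preserved, translated results can be mapped back to their original positions by index.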
OCR Mode
- PDF to Images: Each page is rendered as a high-resolution image
- OCR: Surya detects and recognizes text regions with their positions
- Translation: Text is batch-translated via Google Translate
- Rendering: Original text is erased and replaced with translations
- Output: Processed images are combined back into a PDF
The tool samples background colors around each text region to cleanly erase the original text, then automatically chooses black or white text for maximum contrast.
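The color logic described above can be sketched with the standard library alone. This is an illustration, not the package's renderer: the function names are hypothetical, the per-channel median is one plausible way to sample a background, and the contrast test uses the WCAG relative-luminance formula as an assumed criterion.

```python
from statistics import median

RGB = tuple[int, int, int]


def sample_background(border_pixels: list[RGB]) -> RGB:
    """Estimate the background as the per-channel median of border pixels."""
    return tuple(int(median(channel)) for channel in zip(*border_pixels))


def relative_luminance(color: RGB) -> float:
    """WCAG relative luminance of an sRGB color, in the range 0..1."""
    def linearize(value: int) -> float:
        c = value / 255
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in color)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_text_color(background: RGB) -> RGB:
    """Return black or white, whichever has the higher contrast ratio."""
    lum = relative_luminance(background)
    contrast_with_white = 1.05 / (lum + 0.05)
    contrast_with_black = (lum + 0.05) / 0.05
    if contrast_with_black >= contrast_with_white:
        return (0, 0, 0)
    return (255, 255, 255)


background = sample_background([(250, 250, 250), (255, 255, 255), (10, 10, 10)])
print(background, pick_text_color(background))
```

Using the median rather than the mean keeps a few dark pixels from leftover glyph edges from skewing the estimated background.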
Performance Tips
- Use a lower DPI (150-200) for faster processing, or a higher DPI (300+) for better quality
- Increase batch size if you have more GPU memory
- Use page ranges to translate only what you need
- GPU acceleration provides 5-10x speedup over CPU
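To see why DPI has such a large effect on speed, it helps to work out the size of the rendered image. The sketch below uses the standard 72 points per inch of PDF page units. `rendered_size` is a hypothetical helper, and US Letter is used only as an example page size.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (150, 200, 300):
    width, height = rendered_size(612, 792, dpi)
    megapixels = width * height / 1e6
    print(f"{dpi} DPI: {width} x {height} px, {megapixels:.1f} megapixels")
```

A Letter page is 1700 x 2200 pixels at 200 DPI and 2550 x 3300 pixels at 300 DPI, so raising the DPI from 200 to 300 multiplies the pixel count by 2.25.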
Development
Setup
# Clone the repo
git clone https://github.com/bhanurathore/pdf-translator.git
cd pdf-translator
# Create virtual environment
python -m venv .venv
# Activate it
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install in development mode with dev dependencies
pip install -e ".[dev]"
# Set up pre-commit hooks
pre-commit install
Running Tests
# Run all tests
make test
# Run tests with coverage report
make test-cov
# Or using pytest directly
pytest tests/ -v
# Run specific test file
pytest tests/core/test_config.py -v
# Run with coverage
pytest tests/ --cov=pdf_translator --cov-report=term-missing
Linting and Formatting
# Check code style with ruff
make lint
# Auto-fix linting issues
make lint-fix
# Format code with black
make format
# Check formatting without changes
make format-check
# Run type checker
make typecheck
# Run all checks (format, lint, typecheck, test)
make check
Pre-commit Hooks
Pre-commit hooks run automatically on git commit. To run manually:
# Run on all files
pre-commit run --all-files
# Run specific hook
pre-commit run black --all-files
Building and Publishing
# Build package
make build
# Publish to Test PyPI (for testing)
make publish-test
# Publish to PyPI
make publish
Project Structure
pdf-translator/
├── pdf_translator/              # Main package
│   ├── __init__.py              # Package exports
│   ├── cli.py                   # Command-line interface
│   ├── py.typed                 # PEP 561 type marker
│   ├── core/                    # Core functionality
│   │   ├── config.py            # Configuration management
│   │   ├── ocr.py               # Surya OCR wrapper
│   │   ├── pdf_extractor.py     # Digital PDF text extraction
│   │   ├── pdf_renderer.py      # Digital PDF text replacement
│   │   ├── renderer.py          # Image-based text rendering
│   │   ├── text_translator.py   # Google Translate wrapper
│   │   └── translator.py        # Main orchestrator
│   └── utils/                   # Utilities
│       ├── fonts.py             # Cross-platform font discovery
│       └── page_range.py        # Page range parsing
├── tests/                       # Test suite (mirrors package structure)
│   ├── conftest.py              # Shared fixtures
│   ├── test_cli.py
│   ├── core/                    # Tests for core modules
│   │   ├── test_config.py
│   │   ├── test_ocr.py
│   │   ├── test_pdf_extractor.py
│   │   ├── test_pdf_renderer.py
│   │   ├── test_renderer.py
│   │   └── test_text_translator.py
│   └── utils/                   # Tests for utilities
│       ├── test_fonts.py
│       └── test_page_range.py
├── docs/                        # Documentation
│   └── PROJECT_STRUCTURE.md     # Explains all project files
├── pyproject.toml               # Package configuration (main config!)
├── setup.py                     # Legacy compatibility
├── Makefile                     # Command shortcuts
├── MANIFEST.in                  # Package manifest
├── .pre-commit-config.yaml      # Pre-commit hooks
├── .gitignore                   # Git ignore rules
├── .editorconfig                # Editor settings
├── .python-version              # Python version (for pyenv)
└── LICENSE                      # MIT License
License
MIT License - see LICENSE for details.
Maintainer
Bhanu Pratap Singh Rathore
Acknowledgments
- Surya OCR for the excellent OCR engine
- PyMuPDF for digital PDF text extraction and manipulation
- deep-translator for the translation API wrapper
- pdf2image for PDF rendering