Translate .docx files while preserving all text formatting
Translate docx
A CLI tool and Python library for translating .docx files, with a focus on preserving all text formatting.
Key Features
- Lossless round-trip - Extract and rebuild preserves all formatting
- Citation preservation - Superscripts (references) stay in original language
- Bypass markers - Protect specific content from translation with custom markers
- Pluggable translators - Use any translation backend
- Section-based - Documents split by bold headers automatically
Installation
pip install translate-docx
Usage from Command Line
# Basic translation, e.g. from Spanish to English
translate-docx input.docx output.docx -s es -t en
# With options
translate-docx input.docx output.docx -s es -t en --delay 1.0 --verbose
# Show document info
translate-docx info document.docx
Usage as a Package
from translate_docx import (
    extract_document,
    translate_document,
    rebuild_document,
    GoogleTranslatorWrapper,
)
doc = extract_document("input.docx")
translator = GoogleTranslatorWrapper(delay_between_calls=0.5, max_retries=3)
translated = translate_document(doc, translator, "es", "en")
rebuild_document(translated, "output.docx", template_path="input.docx")
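The `delay_between_calls` and `max_retries` arguments suggest that calls to the backend are spaced out and retried on failure. The package's internals are not shown here, so the following is only a minimal sketch of how such behaviour might be implemented. `ThrottledTranslator` and the `translate_fn` callable are hypothetical names for illustration, not part of translate_docx.

```python
import time
from typing import Callable


class ThrottledTranslator:
    """Hypothetical wrapper: pauses between calls and retries failed ones."""

    def __init__(
        self,
        translate_fn: Callable[[str, str, str], str],  # assumed (text, source, target) -> text
        delay_between_calls: float = 0.5,
        max_retries: int = 3,
    ):
        self.translate_fn = translate_fn
        self.delay_between_calls = delay_between_calls
        self.max_retries = max_retries

    def translate(self, text: str, source: str, target: str) -> str:
        last_error = None
        for attempt in range(1, self.max_retries + 1):
            try:
                result = self.translate_fn(text, source, target)
                # Pause after a successful call so the next one is spaced out.
                time.sleep(self.delay_between_calls)
                return result
            except Exception as exc:
                last_error = exc
                # Wait a little longer after each failed attempt.
                time.sleep(self.delay_between_calls * attempt)
        raise RuntimeError(
            f"translation failed after {self.max_retries} attempts"
        ) from last_error
```

Any callable with the assumed `(text, source, target)` shape could be wrapped this way, which is one way to keep the throttling logic independent of the translation backend.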
Protecting Content with Bypass Markers
Sometimes you want to prevent specific content from being translated (like timestamps, references, or technical terms). You can use bypass markers to protect this content.
How It Works
Wrap content in your source document with [[ marker: content ]] syntax, where marker is any alphanumeric name you choose (e.g., tc, note, ref). Then configure your translator to recognize these markers.
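One common way to implement this kind of protection is to swap each marked span for an opaque placeholder before translation and swap it back afterwards. The sketch below illustrates that idea only; it is an assumption about the general technique, not the package's actual code. The `protect` and `restore` helpers and the `__BYPASS_n__` token format are hypothetical.

```python
import re


def protect(text, markers):
    """Replace each [[ marker: content ]] span with a token; return (text, saved spans)."""
    if not markers:
        return text, []
    names = "|".join(re.escape(m) for m in markers)
    pattern = re.compile(rf"\[\[\s*(?:{names})\s*:.*?\]\]", re.IGNORECASE | re.DOTALL)
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__BYPASS_{len(saved) - 1}__"

    return pattern.sub(stash, text), saved


def restore(text, saved):
    """Put the original marked spans back in place of their tokens."""
    for i, original in enumerate(saved):
        text = text.replace(f"__BYPASS_{i}__", original)
    return text


source = "Tekst met [[ tc: 00:06:01, 00:06:09 ]] tijdcodes."
masked, saved = protect(source, ["tc"])
# Stand-in for a real translation call, which would receive `masked`.
translated = masked.replace("Tekst met", "Text with").replace("tijdcodes", "timecodes")
print(restore(translated, saved))
```

A real backend may alter placeholder tokens, so a production implementation would need a token format the translator leaves untouched, or a check that every token survived.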
Command Line Usage
# Protect timecodes marked with [[ tc: ... ]]
translate-docx translate input.docx output.docx -s nl -t en --bypass-markers tc
# Protect multiple marker types
translate-docx translate input.docx output.docx -s nl -t en --bypass-markers tc,note,ref
Python API Usage
from translate_docx import GoogleTranslatorWrapper, extract_translate_rebuild
# Configure translator with bypass markers
translator = GoogleTranslatorWrapper(
    delay_between_calls=0.5,
    bypass_markers=['tc', 'note', 'ref'],  # Protect these marker types
)
extract_translate_rebuild('input.docx', 'output.docx', translator, 'nl', 'en')
Example Document Markup
In your Word document, mark content to protect:
Original text with [[ tc: 00:06:01, 00:06:09 ]] timestamps.
Important [[ note: Technical term - do not translate ]] for reference.
See citation [[ ref: Smith et al. 2020 ]] for details.
After translation, the markers and their content are preserved:
Translated English text with [[ tc: 00:06:01, 00:06:09 ]] timestamps.
Important [[ note: Technical term - do not translate ]] for reference.
See citation [[ ref: Smith et al. 2020 ]] for details.
Marker Rules
- Marker names must be alphanumeric only (letters and numbers, no special characters)
- Marker names are case-insensitive (tc, TC, and Tc are all the same)
- You can use any marker names that make sense for your use case
- Common examples: tc (timecodes), note, ref (references), term, cite
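The rules above can be sketched as a small validation step for a comma-separated value such as the one passed to `--bypass-markers`. `normalize_markers` is a hypothetical helper written for illustration, not a function exported by translate_docx, and restricting names to ASCII is an assumption.

```python
def normalize_markers(spec: str) -> list[str]:
    """Split a comma-separated marker list, lowercase it, and reject invalid names."""
    names = []
    for raw in spec.split(","):
        name = raw.strip().lower()  # case-insensitive: tc, TC and Tc collapse to "tc"
        if not name:
            continue  # ignore empty entries such as a trailing comma
        if not (name.isascii() and name.isalnum()):
            raise ValueError(f"invalid marker name: {raw.strip()!r}")
        if name not in names:
            names.append(name)
    return names


print(normalize_markers("tc,TC, Note,ref"))
```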
Use Cases
- Timestamps: [[ tc: 00:06:01 ]] for video/film timecodes
- Technical terms: [[ term: API endpoint ]] for specialized vocabulary
- References: [[ ref: Smith2020 ]] for citations
- Notes: [[ note: internal comment ]] for content that shouldn't be translated
- Code: [[ code: function_name() ]] for code snippets in documentation
Supported Language Codes
ar - Arabic
zh - Chinese (Simplified)
nl - Dutch
en - English
fr - French
de - German
it - Italian
ja - Japanese
ko - Korean
pl - Polish
pt - Portuguese
ru - Russian
es - Spanish
tr - Turkish
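When calling the library from a script, it can help to check language codes up front. The table below mirrors the list above. `check_language` is a hypothetical helper, and whether the package itself rejects other codes is not stated here.

```python
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "zh": "Chinese (Simplified)",
    "nl": "Dutch",
    "en": "English",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pl": "Polish",
    "pt": "Portuguese",
    "ru": "Russian",
    "es": "Spanish",
    "tr": "Turkish",
}


def check_language(code: str) -> str:
    """Return the normalized code, or raise if it is not in the list above."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized
```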
Known Limitations
- Tables and images not yet supported
- Headers/footers not yet supported
- Translated text may reflow (layout not guaranteed)
License
MIT License. This project is for personal use.
File details
Details for the file translate_docx-2026.1.12rc1.tar.gz.
File metadata
- Download URL: translate_docx-2026.1.12rc1.tar.gz
- Upload date:
- Size: 98.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa67ec6254e1de87a12ccfa81ce11f4b6f5c995ffa42641bc738c14a73b17ab8 |
| MD5 | f897e4cb668d54cbb6aefccef9d2752d |
| BLAKE2b-256 | 5d0adeb9b8a8f27c8a36da7ea8e8b686905756e39adcc6dc1a7d77c77bc92814 |
File details
Details for the file translate_docx-2026.1.12rc1-py3-none-any.whl.
File metadata
- Download URL: translate_docx-2026.1.12rc1-py3-none-any.whl
- Upload date:
- Size: 31.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eea0e8a378bf279b240e3441058502bd39d7ae546f901fff70ea014eb8afb0d4 |
| MD5 | 129f2777555dfc80d015ec88ced9fdea |
| BLAKE2b-256 | 52a1c848506157d5f6185fcb902cc5ccc99f3adedff24cb7cc0ac1365910f9e6 |