Translate .docx files while preserving all text formatting

Translate docx

A CLI tool and Python library for translating .docx files, with a focus on preserving all text formatting.

Key Features

  • Lossless round-trip - Extracting and rebuilding a document preserves all formatting
  • Citation preservation - Superscripts (references) stay in the original language
  • Bypass markers - Protect specific content from translation with custom markers
  • Pluggable translators - Use any translation backend
  • Section-based - Documents split by bold headers automatically

Installation

pip install translate-docx

Usage from Command Line

# Basic translation, e.g. from Spanish to English
translate-docx input.docx output.docx -s es -t en

# With options
translate-docx input.docx output.docx -s es -t en --delay 1.0 --verbose

# Show document info
translate-docx info document.docx

Usage as a Package

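from translate_docx import (
    extract_document,
    translate_document,
    rebuild_document,
    GoogleTranslatorWrapper,
)

# Extract the text and formatting from the source document
doc = extract_document("input.docx")
# Wait 0.5 s between calls and retry a failed call up to 3 times
translator = GoogleTranslatorWrapper(delay_between_calls=0.5, max_retries=3)
# Translate from Spanish ("es") to English ("en")
translated = translate_document(doc, translator, "es", "en")
# Write the result, using the original file as the formatting template
rebuild_document(translated, "output.docx", template_path="input.docx")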
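
The delay_between_calls and max_retries arguments suggest that the wrapper rate-limits and retries calls to the backend. The sketch below shows one way such behaviour could look. It is not the library's code: RetryingTranslator, translate_fn, the injected sleep argument and the linear backoff are all hypothetical names and choices made for illustration.

```python
import time


class RetryingTranslator:
    """Hypothetical wrapper that spaces out calls and retries failures."""

    def __init__(self, translate_fn, delay_between_calls=0.5, max_retries=3,
                 sleep=time.sleep):
        self.translate_fn = translate_fn          # callable(text, source, target) -> str
        self.delay_between_calls = delay_between_calls
        self.max_retries = max_retries
        self._sleep = sleep                       # injectable so tests need not wait

    def translate(self, text, source, target):
        last_error = None
        for attempt in range(self.max_retries):
            try:
                result = self.translate_fn(text, source, target)
            except Exception as error:
                last_error = error
                # Assumed policy: wait a little longer after each failure
                self._sleep(self.delay_between_calls * (attempt + 1))
                continue
            # Pause after a successful call to stay under rate limits
            self._sleep(self.delay_between_calls)
            return result
        raise RuntimeError(
            f"translation failed after {self.max_retries} attempts"
        ) from last_error
```

Any callable that takes (text, source, target) can be passed as translate_fn, which is roughly what a pluggable backend implies. The interface the package actually expects may differ.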

Protecting Content with Bypass Markers

Some content, such as timestamps, references, or technical terms, should not be translated. Bypass markers protect this content.

How It Works

Wrap content in your source document with [[ marker: content ]] syntax, where marker is any alphanumeric name you choose (e.g., tc, note, ref). Then configure your translator to recognize these markers.
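
To make the idea concrete, here is a minimal sketch of how marker protection can work in general: marked spans are swapped for opaque placeholders before translation and restored afterwards. It is an illustration built from the syntax described above, not the package's implementation. The regular expression, the __BYPASS_n__ placeholder format, and the protect/restore helpers are all assumptions.

```python
import re

# Assumed pattern for [[ marker: content ]] with an alphanumeric marker name
MARKER_RE = re.compile(r"\[\[\s*([A-Za-z0-9]+)\s*:\s*(.*?)\s*\]\]")
PLACEHOLDER_RE = re.compile(r"__BYPASS_(\d+)__")


def protect(text, bypass_markers):
    """Replace configured marker spans with placeholders; return (text, saved)."""
    wanted = {name.lower() for name in bypass_markers}
    saved = []

    def swap(match):
        if match.group(1).lower() not in wanted:
            return match.group(0)  # marker not configured: leave it in place
        saved.append(match.group(0))
        return f"__BYPASS_{len(saved) - 1}__"

    return MARKER_RE.sub(swap, text), saved


def restore(text, saved):
    """Put the original marker spans back after translation."""
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], text)


masked, saved = protect("Tekst met [[ tc: 00:06:01, 00:06:09 ]] tijdcodes.", ["tc"])
# Stand-in for a real translation backend
translated = masked.replace("Tekst met", "Text with").replace("tijdcodes", "timecodes")
print(restore(translated, saved))
# prints: Text with [[ tc: 00:06:01, 00:06:09 ]] timecodes.
```

A real backend might alter a placeholder, so a production implementation would need a token the translator reliably leaves untouched.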

Command Line Usage

# Protect timecodes marked with [[ tc: ... ]]
translate-docx translate input.docx output.docx -s nl -t en --bypass-markers tc

# Protect multiple marker types
translate-docx translate input.docx output.docx -s nl -t en --bypass-markers tc,note,ref

Python API Usage

from translate_docx import GoogleTranslatorWrapper, extract_translate_rebuild

# Configure translator with bypass markers
translator = GoogleTranslatorWrapper(
    delay_between_calls=0.5,
    bypass_markers=['tc', 'note', 'ref']  # Protect these marker types
)

extract_translate_rebuild('input.docx', 'output.docx', translator, 'nl', 'en')

Example Document Markup

In your Word document, mark content to protect:

Original text with [[ tc: 00:06:01, 00:06:09 ]] timestamps.

Important [[ note: Technical term - do not translate ]] for reference.

See citation [[ ref: Smith et al. 2020 ]] for details.

After translation, the markers and their content are preserved:

Translated English text with [[ tc: 00:06:01, 00:06:09 ]] timestamps.

Important [[ note: Technical term - do not translate ]] for reference.

See citation [[ ref: Smith et al. 2020 ]] for details.

Marker Rules

  • Marker names must be alphanumeric only (letters and numbers, no special characters)
  • Marker names are case-insensitive (tc, TC, and Tc are all the same)
  • You can use any marker names that make sense for your use case
  • Common examples: tc (timecodes), note, ref (references), term, cite
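
The rules above can be checked before a marker list is passed to --bypass-markers or bypass_markers. The helper below is a hypothetical pre-check, not part of the package. It assumes that "alphanumeric" means ASCII letters and digits, which the rules do not state explicitly.

```python
import re

# Assumption: only ASCII letters and digits count as alphanumeric
_NAME_RE = re.compile(r"[A-Za-z0-9]+")


def normalize_markers(spec):
    """Parse a list such as 'tc,Note,REF' into ['tc', 'note', 'ref']."""
    names = []
    for raw in spec.split(","):
        name = raw.strip()
        if not _NAME_RE.fullmatch(name):
            raise ValueError(f"invalid marker name: {raw!r}")
        lowered = name.lower()  # names are case-insensitive
        if lowered not in names:
            names.append(lowered)  # drop duplicates such as tc and TC
    return names
```

For example, normalize_markers("tc,Note,TC") returns ['tc', 'note'], while normalize_markers("time-code") raises ValueError because of the hyphen.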

Use Cases

  • Timestamps: [[ tc: 00:06:01 ]] for video/film timecodes
  • Technical terms: [[ term: API endpoint ]] for specialized vocabulary
  • References: [[ ref: Smith2020 ]] for citations
  • Notes: [[ note: internal comment ]] for content that shouldn't be translated
  • Code: [[ code: function_name() ]] for code snippets in documentation

Supported Language Codes

ar - Arabic
zh - Chinese (Simplified)
nl - Dutch
en - English
fr - French
de - German
it - Italian
ja - Japanese
ko - Korean
pl - Polish
pt - Portuguese
ru - Russian
es - Spanish
tr - Turkish
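
When you script translations, validating the -s and -t values up front catches typos early. The sketch below copies the table above into a dictionary. SUPPORTED_LANGUAGES and check_language are hypothetical names, and the translation backend may accept codes that are not listed here.

```python
# Codes copied from the list above; the backend may support more
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "zh": "Chinese (Simplified)",
    "nl": "Dutch",
    "en": "English",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pl": "Polish",
    "pt": "Portuguese",
    "ru": "Russian",
    "es": "Spanish",
    "tr": "Turkish",
}


def check_language(code):
    """Return the normalized code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        known = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"unknown language code {code!r}; expected one of: {known}")
    return normalized
```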

Known Limitations

  • Tables and images are not yet supported
  • Headers/footers are not yet supported
  • Translated text may reflow (layout not guaranteed)
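
Because a .docx file is a ZIP archive of XML parts, you can check whether a document contains unsupported content before translating it. The sketch below uses only the standard library. find_unsupported is a hypothetical helper, not part of the package, and the checks are heuristics: they assume the usual w: namespace prefix and the standard word/ part names that Word writes.

```python
import zipfile


def find_unsupported(docx_path):
    """Return the known limitations that appear to apply to a .docx file."""
    found = []
    with zipfile.ZipFile(docx_path) as archive:
        names = archive.namelist()
        body = archive.read("word/document.xml").decode("utf-8")
    # Tables are stored as w:tbl elements in the main document part
    if "<w:tbl" in body:
        found.append("tables")
    # Images appear as w:drawing elements, with binaries under word/media/
    if "<w:drawing" in body or any(n.startswith("word/media/") for n in names):
        found.append("images")
    # Headers and footers live in separate parts such as word/header1.xml
    if any(n.startswith(("word/header", "word/footer")) for n in names):
        found.append("headers/footers")
    return found
```

An empty list means none of the checks fired. A header or footer part can exist even when it is empty, so treat a hit as a prompt to inspect the document rather than as proof that content will be lost.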

License

MIT License. This project is for personal use.
