Translate .docx files while preserving all text formatting

Translate docx

A CLI tool and Python library for translating .docx files, with a focus on preserving all text formatting.

Key Features

  • Lossless round-trip - Extracting and rebuilding a document preserves all text formatting
  • Citation preservation - Superscripts (references) stay in the original language
  • Bypass markers - Protect specific content from translation with custom markers
  • Pluggable translators - Use any translation backend
  • Section-based - Documents are split into sections at bold headers automatically

Installation

pip install translate-docx

Usage from Command Line

# Basic translation, e.g. from Spanish to English
translate-docx input.docx output.docx -s es -t en

# With options
translate-docx input.docx output.docx -s es -t en --delay 1.0 --verbose

# Show document info
translate-docx info document.docx

Usage as a Package

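from translate_docx import (
    extract_document,
    translate_document,
    rebuild_document,
    GoogleTranslatorWrapper,
)

# Extract the text and its formatting from the source document
doc = extract_document("input.docx")
# Pause between backend calls and retry failed calls
translator = GoogleTranslatorWrapper(delay_between_calls=0.5, max_retries=3)
# Translate from Spanish ("es") to English ("en")
translated = translate_document(doc, translator, "es", "en")
# Write the result, using the original file as the formatting template
rebuild_document(translated, "output.docx", template_path="input.docx")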

Protecting Content with Bypass Markers

Sometimes you want to prevent specific content from being translated (like timestamps, references, or technical terms). You can use bypass markers to protect this content.

How It Works

Wrap content in your source document with [[ marker: content ]] syntax, where marker is any alphanumeric name you choose (e.g., tc, note, ref). Then configure your translator to recognize these markers.

Command Line Usage

# Protect timecodes marked with [[ tc: ... ]]
translate-docx translate input.docx output.docx -s nl -t en --bypass-markers tc

# Protect multiple marker types
translate-docx translate input.docx output.docx -s nl -t en --bypass-markers tc,note,ref
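
When translating many files from a script, it can help to build the command line in one place. The sketch below only assembles the argument list from the flags shown above (`-s`, `-t`, `--bypass-markers`); the helper name `build_translate_command` is hypothetical and not part of the package.

```python
def build_translate_command(src_path, dst_path, source_lang, target_lang,
                            bypass_markers=()):
    """Assemble an argument list for the `translate-docx translate` command.

    Hypothetical helper for illustration; it only uses flags that appear
    in the usage examples of this README.
    """
    command = [
        "translate-docx", "translate",
        src_path, dst_path,
        "-s", source_lang,
        "-t", target_lang,
    ]
    if bypass_markers:
        # The CLI examples pass marker names as one comma-separated value.
        command += ["--bypass-markers", ",".join(bypass_markers)]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with file names that contain spaces.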

Python API Usage

from translate_docx import GoogleTranslatorWrapper, extract_translate_rebuild

# Configure translator with bypass markers
translator = GoogleTranslatorWrapper(
    delay_between_calls=0.5,
    bypass_markers=['tc', 'note', 'ref']  # Protect these marker types
)

extract_translate_rebuild('input.docx', 'output.docx', translator, 'nl', 'en')

Example Document Markup

In your Word document, mark content to protect:

Original text with [[ tc: 00:06:01, 00:06:09 ]] timestamps.

Important [[ note: Technical term - do not translate ]] for reference.

See citation [[ ref: Smith et al. 2020 ]] for details.

After translation, the markers and their content are preserved:

Translated English text with [[ tc: 00:06:01, 00:06:09 ]] timestamps.

Important [[ note: Technical term - do not translate ]] for reference.

See citation [[ ref: Smith et al. 2020 ]] for details.
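
One common way to get this behaviour is to swap each marker for an opaque placeholder before translation and swap it back afterwards. The sketch below illustrates that general technique only; it is not the package's implementation, and the names `protect`, `restore`, and `translate_with_bypass` as well as the `__BYPASS_n__` placeholder format are assumptions.

```python
import re


def protect(text, marker_names):
    """Replace configured markers with placeholders; return (masked, saved)."""
    if not marker_names:
        return text, []
    names = "|".join(re.escape(name) for name in marker_names)
    # Hypothetical pattern: [[ name: content ]] with a case-insensitive name.
    pattern = re.compile(r"\[\[\s*(?:" + names + r")\s*:.*?\]\]", re.IGNORECASE)
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__BYPASS_{len(saved) - 1}__"

    return pattern.sub(stash, text), saved


def restore(text, saved):
    """Put the original markers back in place of their placeholders."""
    for index, original in enumerate(saved):
        text = text.replace(f"__BYPASS_{index}__", original)
    return text


def translate_with_bypass(text, translate, marker_names):
    """Translate `text` with any callable while keeping markers untouched."""
    masked, saved = protect(text, marker_names)
    return restore(translate(masked), saved)
```

A real translation backend may alter placeholders (for example by inserting spaces), so a production version would need a placeholder format the backend leaves alone.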

Marker Rules

  • Marker names must be alphanumeric only (letters and numbers, no special characters)
  • Marker names are case-insensitive (tc, TC, and Tc are all the same)
  • You can use any marker names that make sense for your use case
  • Common examples: tc (timecodes), note, ref (references), term, cite
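
The rules above can be checked before any translation starts. The following is a minimal sketch, assuming "alphanumeric" means ASCII letters and digits (the rules do not say whether other scripts are accepted); `normalize_marker_names` is a hypothetical helper, not a function from the package.

```python
import re

# Assumption: marker names are limited to ASCII letters and digits.
_NAME_RE = re.compile(r"[A-Za-z0-9]+")


def normalize_marker_names(raw):
    """Parse a comma-separated marker list such as "tc,Note,ref".

    Names are lowercased because marker names are case-insensitive,
    and duplicates are dropped while keeping the original order.
    """
    names = []
    for part in raw.split(","):
        name = part.strip()
        if not _NAME_RE.fullmatch(name):
            raise ValueError(f"invalid marker name: {name!r}")
        lowered = name.lower()
        if lowered not in names:
            names.append(lowered)
    return names
```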

Use Cases

  • Timestamps: [[ tc: 00:06:01 ]] for video/film timecodes
  • Technical terms: [[ term: API endpoint ]] for specialized vocabulary
  • References: [[ ref: Smith2020 ]] for citations
  • Notes: [[ note: internal comment ]] for content that shouldn't be translated
  • Code: [[ code: function_name() ]] for code snippets in documentation

Supported Language Codes

ar - Arabic
zh - Chinese (Simplified)
nl - Dutch
en - English
fr - French
de - German
it - Italian
ja - Japanese
ko - Korean
pl - Polish
pt - Portuguese
ru - Russian
es - Spanish
tr - Turkish
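
A script that calls the tool can validate language codes up front. This sketch copies the table above into a dictionary and assumes the list is exhaustive, which this README does not state; `check_language_pair` is a hypothetical helper.

```python
# Copied from the list above; assumed (not confirmed) to be exhaustive.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "zh": "Chinese (Simplified)",
    "nl": "Dutch",
    "en": "English",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pl": "Polish",
    "pt": "Portuguese",
    "ru": "Russian",
    "es": "Spanish",
    "tr": "Turkish",
}


def check_language_pair(source, target):
    """Return the language names for a pair of codes, or raise ValueError."""
    for code in (source, target):
        if code not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language code: {code!r}")
    if source == target:
        raise ValueError("source and target languages are the same")
    return SUPPORTED_LANGUAGES[source], SUPPORTED_LANGUAGES[target]
```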

Known Limitations

  • Tables and images not yet supported
  • Headers/footers not yet supported
  • Translated text may reflow (layout not guaranteed)
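
Because a .docx file is a ZIP archive, a rough pre-flight check for the unsupported content listed above needs only the standard library. The sketch below is a heuristic and not part of the package: it assumes the conventional `w:` namespace prefix in `word/document.xml`, treats anything under `word/media/` as an image, and flags header or footer parts even if they are empty. The function name `find_unsupported_parts` is hypothetical.

```python
import zipfile


def find_unsupported_parts(docx_file):
    """Return a list of content types in the document that may be skipped.

    `docx_file` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        names = archive.namelist()
        body = archive.read("word/document.xml").decode("utf-8")

    problems = []
    if "<w:tbl>" in body:
        problems.append("tables")
    if any(name.startswith("word/media/") for name in names):
        problems.append("images")
    if any(name.startswith(("word/header", "word/footer")) for name in names):
        problems.append("headers/footers")
    return problems
```

An empty list means none of the checked content types was detected; it does not guarantee that the layout survives translation.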

License

MIT License. This project is for personal use.
