
Open-source, production-grade LaTeX -> Microsoft Word (.docx) converter with native OMML math and live fields


tex2word

An open-source, cross-platform LaTeX → Microsoft Word (.docx) converter that produces genuinely editable Word: native paragraph styles, native OMML equations (editable in Word's equation editor, not images), and live, auto-renumbering fields for equation/figure/table numbers and cross-references.

Status: production-grade. Foundation, math core (direct LaTeX→OMML), the live cross-reference/field plumbing (the differentiator), image embedding, the BibTeX bibliography, and the robustness layer (math cascade, coverage report, OOXML validator, round-trip manifest) are all in. See CHANGELOG.md for the release history.

Why

Pandoc/texmath is the open-source reference, but it drops equation numbers, can dump raw LaTeX for labelled equations, and emits static cross-references. No open tool produces all three together: editable styles, native OMML, and live field-based numbering. That gap is the product.

Install & use

Requires Python 3.12+.

From PyPI:

pip install tex2word                 # core (PNG/JPEG figures)
pip install "tex2word[pdf]"          # + PDF figure rasterisation (pypdfium2, Apache-2.0)
pip install "tex2word[mathml]"       # + LaTeX->MathML->OMML for hard math (latex2mathml)
pip install "tex2word[csl]"          # + real CSL citation styles (citeproc-py)
pip install "tex2word[pdf,mathml,csl,mathimg]"   # everything

tex2word convert paper.tex -o paper.docx
tex2word convert paper.tex -o paper.docx --report report.json
tex2word convert paper.tex -o paper.docx --reference-doc journal.docx

Or, for a development checkout with uv:

uv sync --all-extras
uv run tex2word convert paper.tex -o paper.docx

Or from Python:

from tex2word import convert_source, convert_file

out_path, result = convert_file("paper.tex")
print(result.report.summary())   # math coverage + warnings
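For batch jobs, the CLI calls shown above can be scripted. The following is a minimal sketch, assuming tex2word is installed and on PATH. It only builds the argument lists, using the documented convert, -o and --report options; the helper names (build_convert_command, batch_commands) and the .report.json file naming are illustrative choices, not part of tex2word.

```python
import shlex
from pathlib import Path


def build_convert_command(tex_path: Path, with_report: bool = False) -> list[str]:
    """Return the argv for one `tex2word convert` call (hypothetical helper)."""
    cmd = ["tex2word", "convert", str(tex_path),
           "-o", str(tex_path.with_suffix(".docx"))]
    if with_report:
        # --report is the documented flag; the report file name is our own choice.
        cmd += ["--report", str(tex_path.with_suffix(".report.json"))]
    return cmd


def batch_commands(folder: Path, with_report: bool = False) -> list[list[str]]:
    """One command per .tex file directly inside `folder`, in sorted order."""
    return [build_convert_command(p, with_report)
            for p in sorted(folder.glob("*.tex"))]


if __name__ == "__main__":
    # Print the commands; hand each list to subprocess.run(cmd, check=True) to execute.
    for cmd in batch_commands(Path("."), with_report=True):
        print(shlex.join(cmd))
```

Keeping command construction separate from execution makes the batch easy to dry-run before converting anything.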

What works today

  • Reference Word templates ★: --reference-doc TEMPLATE.docx adopts a journal/corporate template's styles, theme and page geometry (size + margins), so the output matches the required look — while keeping the live fields below. Our custom styles are merged in so nothing renders unstyled.

  • Structure & styles: \title/\author/\date/abstract, \section through \subparagraph → Word Title/Heading 1–4 (visible in the Navigation pane), paragraphs, \textbf/\emph/\texttt/\underline/\textsc, quotes, code. Sections are auto-numbered (multilevel 1 / 1.1 / 1.1.1) like LaTeX, with \section* unnumbered; \ref to a section shows its live number. In book/report documents \chapter is the top level (sections nest under it) and \appendix switches to lettered headings (A, A.1).

  • Math (direct LaTeX→OMML): inline $…$, display \[…\], equation/align/gather; fractions, sub/superscripts, roots, \sum/\int with limits, accents, \left…\right delimiters, matrices/cases, Greek and hundreds of symbols, \mathbb/\mathcal/\mathbf, functions (\sin, \lim). align*/aligned line up at the & (a column-justified matrix); numbered align keeps a live number per line.

  • Live fields ★: numbered equations get SEQ Equation fields inside bookmarks; \ref/\eqref/\pageref become REF/PAGEREF fields; figure and table captions get SEQ Figure/SEQ Table. Numbers auto-renumber in Word on field refresh. --number-by-section switches to N.M per-section numbering (STYLEREF + SEQ \s), book/report style.

  • Table of contents ★: \tableofcontents → a live Word TOC field (rebuilds from heading styles on refresh); \listoffigures/\listoftables → caption-sequence lists. Schema-valid and round-tripping.

  • Lists, tables, figures: itemize/enumerate, tabular/longtable with booktabs, \multicolumn→column span, \multirow→vertical merge, and repeating header rows; captioned figure/table, \includegraphics (PNG/JPEG embedded directly; PDF figures rasterised to PNG when the optional tex2word[pdf] extra — pypdfium2 — is installed). An \includegraphics in running text (an icon/logo) is embedded inline.

  • Custom macros: \newcommand/\renewcommand/\def are expanded before parsing. Common mathtools/physics math (\abs, \norm, \dv, \ket, …) and siunitx (\SI{9.81}{\meter\per\second\squared} → 9.81 m/s², \num, \ang) work as built-ins when not user-defined. Acronyms (glossaries): \newacronym + \gls/\acrshort/\acrlong/\acrfull expand with the first-use "long (short)" rule.

  • Footnotes: \footnote → native Word footnotes (footnotes.xml), not inlined text; footnote bodies keep their formatting and math.

  • Inline verbatim & smart refs: \verb|...| → literal monospace; \cref/\Cref/\autoref add cleveref-style type prefixes ("fig. N" / "Figure N").

  • Theorem environments: theorem/lemma/proof/definition/… render with a bold numbered lead (live SEQ per kind), optional [title], and a QED mark for proofs; \ref to a theorem shows its number.

  • Algorithms: algorithm + algorithmic/algpseudocode/algorithm2e → numbered, indented pseudocode with bold keywords, inline OMML math, and a live SEQ Algorithm caption.

  • Graceful degradation: unknown constructs never abort; they pass through best-effort and are logged to the conversion report (math coverage telemetry included). The math decision-cascade (direct OMML → LaTeX→MathML→OMML secondary path → image fallback --math-image-fallback → raw) records which path each equation took.

  • Round-trip: the IR is embedded as a JSON manifest custom part, so the exact IR can be recovered from the .docx (tex2word.roundtrip.recover_ir) and converted back to LaTeX (tex2word to-latex out.docx); on the test corpus, a latex→docx→latex round trip keeps the same block structure. Reconcile (on by default) merges Word edits against the manifest, and Word Track Changes are accepted on read (insertions kept, deletions dropped).

  • Reports & validation: --report report.json|report.html writes a coverage report; tex2word.validate.validate_docx structurally validates output; tex2word benchmark <dir> reports a quantitative baseline (math-OMML %, validity, warnings, 0-abort) across a paper set (CI-gated on the corpus + UATs: currently 100% native-OMML math, 100% valid, 0 aborts).

  • Reproducible: set SOURCE_DATE_EPOCH and the same input yields byte-identical output (the .docx ZIP is built deterministically).

  • Live citations (opt-in --citations zotero): emit ADDIN ZOTERO_ITEM CSL_CITATION / CSL_BIBLIOGRAPHY fields so citations are editable by Zotero/Mendeley in Word (default is static formatted text).

  • Real CSL styles (opt-in --csl style.csl, needs tex2word[csl]): a genuine citeproc-py engine formats in-text citations and the reference list against any .csl style, with proper sorting; the built-in heuristic is the fallback. \nocite{key}/\nocite{*} are honoured.

  • Front-end choice: the default pure front-end (pylatexenc-based) is the validated engine — it converts the corpus and three real-paper UATs at 100% native-OMML math, 100% valid output, 0 aborts. --frontend latexml is experimental: it shells out to a real latexml install for genuine TeX expansion, but is not yet proven end-to-end (it silently falls back to pure on any failure; see the advisory real-tool CI lane).
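To make the "Live fields" item above concrete, here is a rough sketch of the WordprocessingML such fields correspond to: a SEQ Equation field wrapped in a bookmark, plus a REF field that points at that bookmark. It is not tex2word's actual output or code. It uses the standard library's ElementTree instead of lxml and the simple w:fldSimple form instead of complex fields, and the helper names and the eq_energy bookmark are invented for illustration. Word bookmark names cannot contain a colon, so a LaTeX label such as eq:energy would need sanitising first.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def simple_field(parent: ET.Element, instr: str, cached: str) -> ET.Element:
    """Append a w:fldSimple whose run holds the cached (last computed) result."""
    fld = ET.SubElement(parent, w("fldSimple"), {w("instr"): f" {instr} "})
    run = ET.SubElement(fld, w("r"))
    ET.SubElement(run, w("t")).text = cached
    return fld


def numbered_equation_tag(bookmark: str, bookmark_id: int, cached: str) -> ET.Element:
    """Paragraph holding a bookmarked SEQ Equation field (the equation number)."""
    p = ET.Element(w("p"))
    ET.SubElement(p, w("bookmarkStart"),
                  {w("id"): str(bookmark_id), w("name"): bookmark})
    simple_field(p, r"SEQ Equation \* ARABIC", cached)
    ET.SubElement(p, w("bookmarkEnd"), {w("id"): str(bookmark_id)})
    return p


def reference_to(bookmark: str, cached: str) -> ET.Element:
    """Paragraph holding a REF field that displays the bookmarked number."""
    p = ET.Element(w("p"))
    simple_field(p, rf"REF {bookmark} \h", cached)
    return p


if __name__ == "__main__":
    print(ET.tostring(numbered_equation_tag("eq_energy", 0, "1"), encoding="unicode"))
    print(ET.tostring(reference_to("eq_energy", "1"), encoding="unicode"))
```

The text inside each field is only a cached result. When fields are refreshed, Word recomputes the SEQ value and the REF follows it, which is why the numbers stay correct after equations are inserted or removed.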

Architecture

LaTeX ─▶ front-end (preprocess, macro-expand, pylatexenc walk) ─▶ IR
      ─▶ transforms (cross-reference resolution) ─▶ IR
      ─▶ back-end (raw OOXML via lxml: document/styles/numbering) ─▶ .docx

The IR (src/tex2word/ir.py) is the format-neutral seam, so a LaTeXML front-end can replace the static parser post-V1 without touching the back-end.

Development

uv run pytest          # tests
uv run ruff check src tests
uv run mypy src
uv run pre-commit install   # optional: run the lint/type gate on every commit

Releases: pushing a vX.Y.Z tag builds the wheel/sdist and publishes to PyPI (via the Release workflow, using PyPI Trusted Publishing). Notable changes are recorded in CHANGELOG.md.

License

MIT — see LICENSE.

Author

Yifan Yang yfyang.86@hotmail.com

