Open-source, production-grade LaTeX -> Microsoft Word (.docx) converter with native OMML math and live fields
tex2word
An open-source, cross-platform LaTeX → Microsoft Word (.docx) converter
that produces genuinely editable Word: native paragraph styles, native OMML
equations (editable in Word's equation editor, not images), and live,
auto-renumbering fields for equation/figure/table numbers and
cross-references.
Status: production-grade. The foundation, the math core (direct LaTeX→OMML), the live cross-reference/field plumbing (the differentiator), image embedding, the BibTeX bibliography, and the robustness layer (math cascade, coverage report, OOXML validator, round-trip manifest) are all implemented. See CHANGELOG.md for the release history.
Why
Pandoc/texmath is the open-source reference but drops equation numbers,
can dump raw LaTeX for labelled equations, and emits static cross-references.
No open tool produces editable styles, native OMML, and live
field-based numbering together. That gap is the product.
Install & use
Requires Python 3.12+.
From PyPI:
pip install tex2word # core (PNG/JPEG figures)
pip install "tex2word[pdf]" # + PDF figure rasterisation (pypdfium2, Apache-2.0)
pip install "tex2word[mathml]" # + LaTeX->MathML->OMML for hard math (latex2mathml)
pip install "tex2word[csl]" # + real CSL citation styles (citeproc-py)
pip install "tex2word[pdf,mathml,csl,mathimg]" # everything
tex2word convert paper.tex -o paper.docx
tex2word convert paper.tex -o paper.docx --report report.json
tex2word convert paper.tex -o paper.docx --reference-doc journal.docx
Or, for a development checkout with uv:
uv sync --all-extras
uv run tex2word convert paper.tex -o paper.docx
Or from Python:
from tex2word import convert_source, convert_file
out_path, result = convert_file("paper.tex")
print(result.report.summary()) # math coverage + warnings
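For batch jobs, the CLI invocations shown above can be assembled from a script. The following is a minimal sketch that uses only the flags documented here (-o, --report, --reference-doc); the helper names build_convert_command and run_convert are hypothetical and are not part of tex2word.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Union

PathLike = Union[str, Path]


def build_convert_command(
    source: PathLike,
    output: PathLike,
    report: Optional[PathLike] = None,
    reference_doc: Optional[PathLike] = None,
) -> List[str]:
    """Assemble a `tex2word convert` argv list (hypothetical helper)."""
    cmd = ["tex2word", "convert", str(source), "-o", str(output)]
    if report is not None:
        # Coverage report path, as in `--report report.json`.
        cmd += ["--report", str(report)]
    if reference_doc is not None:
        # Journal/corporate template, as in `--reference-doc journal.docx`.
        cmd += ["--reference-doc", str(reference_doc)]
    return cmd


def run_convert(source: PathLike, output: PathLike, **options: PathLike) -> None:
    """Run the conversion; assumes the tex2word CLI is installed and on PATH."""
    subprocess.run(build_convert_command(source, output, **options), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.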
What works today
- Reference Word templates ★: --reference-doc TEMPLATE.docx adopts a journal/corporate template's styles, theme and page geometry (size + margins), so the output matches the required look while keeping the live fields below. Our custom styles are merged in so nothing renders unstyled.
- Structure & styles: \title/\author/\date/abstract, \section…\subparagraph → Word Title/Heading 1–4 (visible in the Navigation pane), paragraphs, \textbf/\emph/\texttt/\underline/\textsc, quotes, code. Sections are auto-numbered (multilevel 1/1.1/1.1.1) like LaTeX, with \section* unnumbered; \ref to a section shows its live number. In book/report documents \chapter is the top level (sections nest under it) and \appendix switches to lettered headings (A, A.1).
- Math (direct LaTeX→OMML): inline $…$, display \[…\], equation/align/gather; fractions, sub/superscripts, roots, \sum/\int with limits, accents, \left…\right delimiters, matrices/cases, Greek and hundreds of symbols, \mathbb/\mathcal/\mathbf, functions (\sin, \lim). align*/aligned line up at the & (a column-justified matrix); numbered align keeps a live number per line.
- Live fields ★: numbered equations get SEQ Equation fields inside bookmarks; \ref/\eqref/\pageref become REF/PAGEREF fields; figure and table captions get SEQ Figure/SEQ Table. Numbers auto-renumber in Word on field refresh. --number-by-section switches to N.M per-section numbering (STYLEREF + SEQ \s), book/report style.
- Table of contents ★: \tableofcontents → a live Word TOC field (rebuilds from heading styles on refresh); \listoffigures/\listoftables → caption-sequence lists. Schema-valid and round-tripping.
- Lists, tables, figures: itemize/enumerate, tabular/longtable with booktabs, \multicolumn → column span, \multirow → vertical merge, and repeating header rows; captioned figure/table, \includegraphics (PNG/JPEG embedded directly; PDF figures rasterised to PNG when the optional tex2word[pdf] extra, pypdfium2, is installed). An \includegraphics in running text (an icon/logo) is embedded inline.
- Custom macros: \newcommand/\renewcommand/\def are expanded before parsing. Common mathtools/physics math (\abs, \norm, \dv, \ket, …) and siunitx (\SI{9.81}{\meter\per\second\squared} → 9.81 m/s², \num, \ang) work as built-ins when not user-defined. Acronyms (glossaries): \newacronym + \gls/\acrshort/\acrlong/\acrfull expand with the first-use "long (short)" rule.
- Footnotes: \footnote → native Word footnotes (footnotes.xml), not inlined text; footnote bodies keep their formatting and math.
- Inline verbatim & smart refs: \verb|...| → literal monospace; \cref/\Cref/\autoref add cleveref-style type prefixes ("fig. N" / "Figure N").
- Theorem environments: theorem/lemma/proof/definition/… render with a bold numbered lead (live SEQ per kind), optional [title], and a QED mark for proofs; \ref to a theorem shows its number.
- Algorithms: algorithm + algorithmic/algpseudocode/algorithm2e → numbered, indented pseudocode with bold keywords, inline OMML math, and a live SEQ Algorithm caption.
- Graceful degradation: unknown constructs never abort; they pass through best-effort and are logged to the conversion report (math coverage telemetry included). The math decision-cascade (direct OMML → LaTeX→MathML→OMML secondary path → image fallback --math-image-fallback → raw) records which path each equation took.
- Round-trip: the IR is embedded as a JSON manifest custom part, so the exact IR can be recovered from the .docx (tex2word.roundtrip.recover_ir) and converted back to LaTeX (tex2word to-latex out.docx); the corpus latex→docx→latex keeps the same block structure. Reconcile (on by default) merges Word edits against the manifest, and Word Track Changes are accepted on read (insertions kept, deletions dropped).
- Reports & validation: --report report.json|report.html writes a coverage report; tex2word.validate.validate_docx structurally validates output; tex2word benchmark <dir> reports a quantitative baseline (math-OMML %, validity, warnings, 0-abort) across a paper set (CI-gated on the corpus + UATs: currently 100% native-OMML math, 100% valid, 0 aborts).
- Reproducible: set SOURCE_DATE_EPOCH and the same input yields byte-identical output (the .docx ZIP is built deterministically).
- Live citations (opt-in --citations zotero): emit ADDIN ZOTERO_ITEM CSL_CITATION/CSL_BIBLIOGRAPHY fields so citations are editable by Zotero/Mendeley in Word (default is static formatted text).
- Real CSL styles (opt-in --csl style.csl, needs tex2word[csl]): a genuine citeproc-py engine formats in-text citations and the reference list against any .csl style, with proper sorting; the built-in heuristic is the fallback. \nocite{key}/\nocite{*} are honoured.
- Front-end choice: the default pure front-end (pylatexenc-based) is the validated engine; it converts the corpus and three real-paper UATs at 100% native-OMML math, 100% valid output, 0 aborts. --frontend latexml is experimental: it shells out to a real latexml install for genuine TeX expansion, but is not yet proven end-to-end (it silently falls back to pure on any failure; see the advisory real-tool CI lane).
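To make the live-fields item in the list above concrete, here is a minimal sketch of the WordprocessingML involved: a SEQ Equation field wrapped in a bookmark, and a REF field that points at that bookmark. It uses only the Python standard library. The helper names and the bookmark name eq_energy are illustrative assumptions, and whether tex2word emits simple fields (w:fldSimple, as here) or complex fields is not stated in this README.

```python
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML main namespace used by document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Return the namespace-qualified name for a w: element or attribute."""
    return f"{{{W}}}{tag}"


def simple_field(instruction: str, cached: str) -> ET.Element:
    """Build a w:fldSimple whose cached result is shown until Word refreshes it."""
    fld = ET.Element(w("fldSimple"), {w("instr"): f" {instruction} "})
    run = ET.SubElement(fld, w("r"))
    ET.SubElement(run, w("t")).text = cached
    return fld


def numbered_equation_label(bookmark: str, bookmark_id: int, cached: str) -> List[ET.Element]:
    """A SEQ Equation field enclosed by a bookmark (hypothetical helper)."""
    start = ET.Element(w("bookmarkStart"), {w("id"): str(bookmark_id), w("name"): bookmark})
    seq = simple_field(r"SEQ Equation \* ARABIC", cached)
    end = ET.Element(w("bookmarkEnd"), {w("id"): str(bookmark_id)})
    return [start, seq, end]


def cross_reference(bookmark: str, cached: str) -> ET.Element:
    """A REF field that displays the bookmarked number and hyperlinks to it."""
    return simple_field(rf"REF {bookmark} \h", cached)


# Assemble a paragraph: "(" + numbered label + ")" is omitted for brevity.
paragraph = ET.Element(w("p"))
paragraph.extend(numbered_equation_label("eq_energy", 0, "1"))
paragraph.append(cross_reference("eq_energy", "1"))
xml_text = ET.tostring(paragraph, encoding="unicode")
```

Because the number lives in a field rather than in literal text, inserting another equation earlier in the document changes the displayed value the next time Word updates its fields, and the REF field follows the bookmark.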
Architecture
LaTeX ─▶ front-end (preprocess, macro-expand, pylatexenc walk) ─▶ IR
─▶ transforms (cross-reference resolution) ─▶ IR
─▶ back-end (raw OOXML via lxml: document/styles/numbering) ─▶ .docx
The IR (src/tex2word/ir.py) is the format-neutral seam, so a LaTeXML front-end can replace the static parser post-V1 without touching the back-end.
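The role of the IR as a seam can be illustrated with a toy pipeline stage. This is a sketch under stated assumptions only: the node classes, the bookmark naming scheme, and resolve_cross_references are hypothetical and do not mirror the real src/tex2word/ir.py. It shows the general shape of a cross-reference transform that takes an IR, returns an IR, and knows nothing about either LaTeX parsing or OOXML.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Equation:
    latex: str
    label: Optional[str] = None


@dataclass
class Ref:
    target: str
    bookmark: Optional[str] = None  # filled in by the transform


@dataclass
class Paragraph:
    inlines: List[Union[str, Ref]]


@dataclass
class Document:
    blocks: List[Union[Equation, Paragraph]]
    warnings: List[str] = field(default_factory=list)


def resolve_cross_references(doc: Document) -> Document:
    """IR -> IR pass: map labels to bookmark names and attach them to refs."""
    bookmarks: Dict[str, str] = {}
    for block in doc.blocks:
        if isinstance(block, Equation) and block.label:
            # Hypothetical naming scheme: eq_1, eq_2, ... in document order.
            bookmarks[block.label] = f"eq_{len(bookmarks) + 1}"
    for block in doc.blocks:
        if not isinstance(block, Paragraph):
            continue
        for item in block.inlines:
            if isinstance(item, Ref):
                item.bookmark = bookmarks.get(item.target)
                if item.bookmark is None:
                    # Degrade gracefully: record the problem instead of aborting.
                    doc.warnings.append(f"unresolved reference: {item.target}")
    return doc
```

Since the pass reads and writes only IR nodes, a different front-end could feed it the same structures without any change to the stage that writes the .docx.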
Development
uv run pytest # tests
uv run ruff check src tests
uv run mypy src
uv run pre-commit install # optional: run the lint/type gate on every commit
Releases: pushing a vX.Y.Z tag builds the wheel/sdist and publishes to PyPI
(via the Release workflow, using PyPI Trusted Publishing). Notable changes are
recorded in CHANGELOG.md.
License
MIT — see LICENSE.
Author
Yifan Yang <yfyang.86@hotmail.com>