
Extract text and mathematical equations from .docx files

docxlatex is a lightweight Python package for extracting text and mathematical equations from .docx files.

It converts only the inserted equations to LaTeX; it does not turn the entire .docx file into a LaTeX source file.

Installation

Install docxlatex using pip:

$ pip install docxlatex

Usage

API

Usage is straightforward: in almost all cases, the Document class is all you need.

from docxlatex import Document

Create a Document object from the path to a .docx file, a path-like object, or a file-like object, then call the get_text() method:

doc = Document("path/to/your/document.docx")
text = doc.get_text()  # The document's text, with equations converted to LaTeX
equations = doc.equations  # A list of strings containing the LaTeX code of the equations
print(equations)
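
Because Document accepts a path, a path-like object, or a file-like object, all three forms can go through one small helper. The following is a minimal sketch rather than part of docxlatex: extract_text_and_equations is a hypothetical name, and the sketch assumes that doc.equations should be read after get_text() has been called, as in the snippet above.

```python
from pathlib import Path


def extract_text_and_equations(source):
    """Return (text, equations) for a .docx source.

    `source` may be a string path, a path-like object such as
    pathlib.Path, or a file-like object opened in binary mode.
    This helper is hypothetical; it is not provided by docxlatex.
    """
    # Imported lazily so the helper can be defined without docxlatex installed.
    from docxlatex import Document

    doc = Document(source)
    text = doc.get_text()
    # Assumption: read the equations only after get_text() has run.
    return text, doc.equations


# Possible calls (the file name is a placeholder):
#   extract_text_and_equations("report.docx")
#   extract_text_and_equations(Path("report.docx"))
#   with open("report.docx", "rb") as handle:
#       extract_text_and_equations(handle)
```

Opening the file yourself can be useful when the document comes from an upload or an in-memory buffer rather than from disk.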

CLI

docxlatex also provides a CLI for quick extraction of text and equations from .docx files. You can use it as follows:

$ docxlatex path/to/your/document.docx

It also provides some options to customize the output:

$ docxlatex --help
usage: docxlatex [-h] [--op OP] [--xml] [-l] ip

positional arguments:
  ip          An absolute or relative path to the input .docx file

options:
  -h, --help  show this help message and exit
  --op OP     An absolute or relative path to the output file (defaults to stdout)
  --xml       Dump the document's XML instead of converting to text
  -l          Specifies that the document has been converted to "Linear" format
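
If you want to drive the CLI from a script, the options listed above can be assembled programmatically. This is only a sketch under stated assumptions: build_docxlatex_command and run_docxlatex are hypothetical helper names, the flags are taken verbatim from the help output above, and the docxlatex executable is assumed to be on your PATH.

```python
import subprocess


def build_docxlatex_command(input_path, output_path=None, dump_xml=False, linear=False):
    """Build the argument list for the docxlatex CLI (hypothetical helper)."""
    command = ["docxlatex"]
    if output_path is not None:
        # --op: write to a file instead of stdout
        command += ["--op", str(output_path)]
    if dump_xml:
        # --xml: dump the document's XML instead of converting to text
        command.append("--xml")
    if linear:
        # -l: the document has been converted to "Linear" format
        command.append("-l")
    # The positional input path goes last.
    command.append(str(input_path))
    return command


def run_docxlatex(input_path, **options):
    """Run the CLI and return whatever it printed to stdout."""
    command = build_docxlatex_command(input_path, **options)
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.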

Examples

API

Here is a simple example of how an equation is extracted from a .docx file containing the Fourier series equation:

[Image: Fourier series equation in a .docx file]

Using the API as shown above:

from docxlatex import Document
doc = Document("./fourier_series.docx")
text = doc.get_text()
print(text)

This will output:

$ f\left( x \right)={a}_{0}+\sum_{n=1}^{}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $
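
In the output above, the equation appears between single dollar signs. If you need to tell prose and equations apart while keeping their order, a small post-processing step can do it. This is a sketch, not a docxlatex feature: split_text_and_math is a hypothetical name, and it assumes that every equation in the text is wrapped in "$ ... $" as shown and that the prose contains no literal dollar signs. When you only need the equations themselves, doc.equations is the simpler route.

```python
import re

# Assumption: equations are delimited by single dollar signs, as in the sample output.
_INLINE_MATH = re.compile(r"\$(.+?)\$", re.DOTALL)


def split_text_and_math(text):
    """Split text into ordered ("text" | "math", content) segments."""
    segments = []
    # With one capture group, re.split alternates: prose, math, prose, ...
    for index, piece in enumerate(_INLINE_MATH.split(text)):
        piece = piece.strip()
        if not piece:
            continue  # skip empty prose between or around equations
        kind = "math" if index % 2 == 1 else "text"
        segments.append((kind, piece))
    return segments
```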

CLI

Here is how you can use the CLI on the same file:

$ docxlatex ./fourier_series.docx

This will output the same LaTeX code to the console:

$ f\left( x \right)={a}_{0}+\sum_{n=1}^{}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $

Issues

docxlatex is not perfect. You may encounter issues with certain .docx files, especially those with complex formatting or non-standard elements, or those created with older versions of Word.

If you find a bug or have a feature request, please open an issue on the GitHub repository. All bug reports and feature requests are welcome and greatly appreciated!
