Extract text and mathematical equations from .docx files

docxlatex is a lightweight Python package for extracting text and mathematical equations from .docx files.

It does not convert the entire .docx file to a LaTeX source file, only the inserted equations.

Installation

Install docxlatex using pip:

$ pip install docxlatex

Usage

API

In almost all cases, you will only need the Document class.

from docxlatex import Document

Create a Document object from the path to a .docx file (a string or a path-like object) or from a file-like object, then call the get_text() method:

doc = Document("path/to/your/document.docx")
text = doc.get_text()
equations = doc.equations # A list of strings containing the LaTeX code of the equations
print(equations)
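Once the equations are available as a list of LaTeX strings, they can be collected into a standalone file for review. The following is a minimal sketch rather than part of docxlatex: equations_to_latex is a hypothetical helper, and it assumes each string may or may not carry surrounding $ delimiters, so it strips them defensively.

```python
def equations_to_latex(equations):
    """Wrap extracted equation strings in a minimal LaTeX document.

    Hypothetical helper, not part of docxlatex. Assumes `equations` is a
    list of LaTeX strings such as the one stored in `doc.equations`.
    """
    lines = [
        r"\documentclass{article}",
        r"\usepackage{amsmath}",
        r"\begin{document}",
    ]
    for eq in equations:
        # Drop whitespace and any surrounding dollar-sign delimiters.
        body = eq.strip().strip("$").strip()
        if body:  # skip empty entries
            lines.append(r"\[ " + body + r" \]")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```

Passing doc.equations to this function and writing the result to a .tex file gives one displayed equation per entry. Unicode symbols in the extracted LaTeX (such as π) may need a Unicode-aware engine or manual replacement before the file compiles.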

CLI

docxlatex also provides a CLI for quick extraction of text and equations from .docx files. You can use it as follows:

$ docxlatex path/to/your/document.docx

It also provides some options to customize the output:

$ docxlatex --help
usage: docxlatex [-h] [--op OP] [--xml] [-l] ip

positional arguments:
  ip          An absolute or relative path to the input .docx file

options:
  -h, --help  show this help message and exit
  --op OP     An absolute or relative path to the output file (defaults to stdout)
  --xml       Dump the document's XML instead of converting to text
  -l          Specifies that the document has been converted to "Linear" format
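The options above can also be driven from a script. The sketch below only assembles the argument list from the flags shown in the help text and hands it to subprocess; build_command and run_docxlatex are hypothetical names, and it assumes the docxlatex executable is on the PATH. Whether the flags are meaningful in combination is not stated in the help text, so treat the combinations as untested.

```python
import subprocess


def build_command(ip, op=None, xml=False, linear=False):
    """Build the docxlatex argument list from the documented flags."""
    cmd = ["docxlatex"]
    if op is not None:
        cmd += ["--op", str(op)]  # output file instead of stdout
    if xml:
        cmd.append("--xml")       # dump the document's XML
    if linear:
        cmd.append("-l")          # document was converted to "Linear" format
    cmd.append(str(ip))           # positional input path
    return cmd


def run_docxlatex(ip, **options):
    """Run the CLI and return whatever it prints to stdout."""
    result = subprocess.run(
        build_command(ip, **options),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, build_command("report.docx", op="report.txt") yields the same arguments as typing docxlatex --op report.txt report.docx in a shell.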

Examples

API

Here is a simple example of how an equation is extracted from a .docx file containing the Fourier series equation:

[Image: Fourier series equation in a .docx file]

Using the API as shown above:

from docxlatex import Document
doc = Document("./fourier_series.docx")
text = doc.get_text()
print(text)

This will output:

$ f\left( x \right)={a}_{0}+\sum_{n=1}^{}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $

CLI

Here is how you can use the CLI on the same file:

$ docxlatex ./fourier_series.docx

This will output the same LaTeX code to the console:

$ f\left( x \right)={a}_{0}+\sum_{n=1}^{}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $
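When only the text output is available (for example, a file written with --op), the equations can be recovered by searching for the dollar-sign delimiters seen in the output above. This is a rough sketch, assuming every equation is wrapped in single $ ... $ delimiters as in the sample and that the surrounding text contains no literal dollar signs; extract_inline_math is a hypothetical helper, not part of docxlatex.

```python
import re

# Matches "$ ... $" and captures the content without the padding spaces.
# Assumption: equations never contain a literal "$".
INLINE_MATH = re.compile(r"\$\s*(.+?)\s*\$", re.DOTALL)


def extract_inline_math(text):
    """Return the LaTeX bodies of all $...$ spans found in `text`."""
    return INLINE_MATH.findall(text)
```

If the document text itself contains dollar signs (prices, shell prompts), this pattern will mis-pair them; in that case the equations list provided by the Document object is the safer source.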

Issues

docxlatex is not perfect, and you may encounter issues with certain .docx files, especially those with complex formatting or non-standard elements, or those created with older versions of Word.
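When a file yields no equations, one quick diagnostic is to check whether it contains any Office Math elements at all. A .docx file is a ZIP archive whose main body lives in word/document.xml, and equations written with Word's built-in equation tool are stored there as m:oMath elements, whereas equations from the legacy Equation Editor are embedded objects. The sketch below counts those elements using only the standard library; count_omml_equations is a hypothetical helper, and the assumption that a count of zero explains a failed extraction is ours, not something the project states.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of Office Math Markup Language (OMML) elements.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def count_omml_equations(docx):
    """Count m:oMath elements in the main document part.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return sum(1 for _ in root.iter(f"{{{M_NS}}}oMath"))
```

A result of 0 for a document that visibly contains formulas suggests they are stored as images or embedded objects. Including that observation in a bug report may help narrow the problem down.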

If you find a bug or have a feature request, please open an issue on the GitHub repository. All bug reports and feature requests are welcome and greatly appreciated!
