Extract text and mathematical equations from .docx files

Project description

docxlatex is a lightweight Python package for extracting text and mathematical equations from .docx files.

It converts only the inserted equations to LaTeX; it does not convert the entire .docx file into a LaTeX source file.

Installation

Install docxlatex using pip:

$ pip install docxlatex

Usage

API

Usage is straightforward. In almost all cases you will only need the Document class.

from docxlatex import Document

Create a Document object by giving it the path to a .docx file, a path-like object, or a file-like object, and call the get_text() method:

doc = Document("path/to/your/document.docx")
text = doc.get_text()
equations = doc.equations # A list of strings containing the LaTeX code of the equations
print(equations)
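For more than one file, the same calls can be wrapped in a small batch helper. The following is a minimal sketch, not part of docxlatex: find_docx_files and extract_all are hypothetical names, and the sketch assumes that doc.equations is read after get_text(), as in the snippet above. Files whose names start with "~$" are skipped because Word creates such lock files next to open documents.

```python
from pathlib import Path


def find_docx_files(root):
    """Return all .docx paths under root, sorted, skipping Word lock files."""
    return sorted(
        path
        for path in Path(root).rglob("*.docx")
        if not path.name.startswith("~$")  # "~$name.docx" is a Word lock file
    )


def extract_all(root):
    """Map each .docx path under root to a (text, equations) tuple.

    Hypothetical helper: it only uses Document, get_text() and equations,
    the names shown in the usage snippet.
    """
    # Imported lazily so find_docx_files() works even without docxlatex installed.
    from docxlatex import Document

    results = {}
    for path in find_docx_files(root):
        doc = Document(path)  # a path-like object is accepted
        text = doc.get_text()
        results[path] = (text, doc.equations)
    return results
```

Calling extract_all("reports") would then return one entry per document found under the (hypothetical) reports directory.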

CLI

docxlatex also provides a CLI for quick extraction of text and equations from .docx files. You can use it as follows:

$ docxlatex path/to/your/document.docx

It also provides some options to customize the output:

$ docxlatex --help
usage: docxlatex [-h] [--op OP] [--xml] [-l] ip

positional arguments:
  ip          An absolute or relative path to the input .docx file

options:
  -h, --help  show this help message and exit
  --op OP     An absolute or relative path to the output file (defaults to stdout)
  --xml       Dump the document's XML instead of converting to text
  -l          Specifies that the document has been converted to "Linear" format
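When the CLI is called from a script, it can help to build the argument list in one place. The sketch below is only an illustration: build_command and run_docxlatex are hypothetical helper names, and the only flags used are the ones listed in the help text above (--op, --xml, -l).

```python
import subprocess


def build_command(input_path, output_path=None, dump_xml=False, linear=False):
    """Build the docxlatex argument list from the documented options."""
    cmd = ["docxlatex"]
    if output_path is not None:
        cmd += ["--op", str(output_path)]  # write to a file instead of stdout
    if dump_xml:
        cmd.append("--xml")  # dump the document's XML instead of text
    if linear:
        cmd.append("-l")  # document has been converted to "Linear" format
    cmd.append(str(input_path))  # positional argument: the input .docx file
    return cmd


def run_docxlatex(input_path, **options):
    """Run the CLI and return whatever it printed to stdout."""
    result = subprocess.run(
        build_command(input_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Keeping build_command separate from the subprocess call makes the argument handling easy to check without docxlatex installed.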

Examples

API

Here is a simple example of how an equation is extracted from a .docx file containing the Fourier series equation:

[Image: Fourier series equation in a .docx file]

Using the API as shown above:

from docxlatex import Document
doc = Document("./fourier_series.docx")
text = doc.get_text()
print(text)

This will output:

$ f\left( x \right)={a}_{0}+\sum_{n=1}^{}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $
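The output above keeps the Greek letter π as a Unicode character. Some LaTeX setups may not accept raw Unicode in math mode, so a post-processing step like the one sketched here can be useful. This is an assumption about downstream use, not a docxlatex feature: replace_unicode_greek and the small GREEK_TO_LATEX table are hypothetical, and the table is deliberately incomplete.

```python
# Partial mapping of Unicode Greek letters to LaTeX commands (extend as needed).
GREEK_TO_LATEX = {
    "α": r"\alpha",
    "β": r"\beta",
    "θ": r"\theta",
    "λ": r"\lambda",
    "μ": r"\mu",
    "π": r"\pi",
    "σ": r"\sigma",
    "ω": r"\omega",
}


def replace_unicode_greek(latex):
    """Replace Unicode Greek letters with braced LaTeX commands.

    The braces keep the command from merging with a following letter,
    so "nπx" becomes "n{\\pi}x" rather than the undefined "n\\pix".
    """
    for char, command in GREEK_TO_LATEX.items():
        latex = latex.replace(char, "{" + command + "}")
    return latex


print(replace_unicode_greek(r"\frac{nπx}{L}"))  # prints \frac{n{\pi}x}{L}
```

The function can be applied to the string returned by get_text() or to each extracted equation.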

CLI

Here is how you can use the CLI on the same file:

$ docxlatex ./fourier_series.docx

This will output the same LaTeX code to the console:

$ f\left( x \right)={a}_{0}+\sum_{n=1}^{}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $

Issues

docxlatex is not perfect. You may encounter issues with certain .docx files, especially those with complex formatting or non-standard elements, or those created with older versions of Word.

If you find a bug or have a feature request, please open an issue on the GitHub repository. All bug reports and feature requests are welcome and greatly appreciated!
