Extract text and mathematical equations from .docx files
docxlatex is a lightweight Python package for extracting text and mathematical equations from .docx files.
It does not convert the entire .docx file into a LaTeX source file; only the inserted equations are converted to LaTeX.
Installation
Install docxlatex using pip:
$ pip install docxlatex
Usage
API
Usage is straightforward. In almost all cases, you only need the Document class.
from docxlatex import Document
Create a Document object by giving it the path to a .docx file, a path-like object, or a file-like object, and call the get_text() method:
doc = Document("path/to/your/document.docx")
text = doc.get_text()
equations = doc.equations # A list of strings containing the LaTeX code of the equations
print(equations)
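Since doc.equations is described above as a list of LaTeX strings, one possible follow-up step is to wrap each string in display-math delimiters so the result can be pasted into a .tex file. The helper below is a minimal sketch, not part of docxlatex: the name equations_to_tex is hypothetical, and it assumes the strings may or may not carry surrounding "$" delimiters, so it strips them defensively.

```python
def equations_to_tex(equations):
    """Wrap each LaTeX equation string in \\[ ... \\] display-math delimiters."""
    blocks = []
    for eq in equations:
        # Assumption: the strings may carry surrounding whitespace or "$"
        # delimiters; strip both so the output is uniform either way.
        body = eq.strip().strip("$").strip()
        if body:  # skip entries that are empty after stripping
            blocks.append("\\[\n" + body + "\n\\]")
    return "\n\n".join(blocks)


# Hypothetical usage with docxlatex (not executed here):
#   from docxlatex import Document
#   doc = Document("path/to/your/document.docx")
#   doc.get_text()
#   print(equations_to_tex(doc.equations))

# Hand-written sample strings, used only to demonstrate the helper.
sample = [r"{a}^{2}+{b}^{2}={c}^{2}", r"$ E=m{c}^{2} $"]
print(equations_to_tex(sample))
```

The helper only touches the outer delimiters and leaves the LaTeX body of each equation unchanged.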
CLI
docxlatex also provides a CLI for quick extraction of text and equations from .docx files. You can use it as follows:
$ docxlatex path/to/your/document.docx
It also provides some options to customize the output:
$ docxlatex --help
usage: docxlatex [-h] [--op OP] [--xml] [-l] ip

positional arguments:
  ip          An absolute or relative path to the input .docx file

options:
  -h, --help  show this help message and exit
  --op OP     An absolute or relative path to the output file (defaults to stdout)
  --xml       Dump the document's XML instead of converting to text
  -l          Specifies that the document has been converted to "Linear" format
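If you want to drive the CLI from a Python script, the options in the help output above map naturally onto an argument list for subprocess. The following is a minimal sketch: build_command and run_docxlatex are hypothetical helper names, not part of docxlatex, and the only flags used are the ones listed in the help text (--op, --xml, -l).

```python
import subprocess


def build_command(ip, op=None, xml=False, linear=False):
    """Build the argument list for the docxlatex CLI from the documented options."""
    cmd = ["docxlatex"]
    if op is not None:
        cmd += ["--op", str(op)]  # write to a file instead of stdout
    if xml:
        cmd.append("--xml")       # dump the document's XML
    if linear:
        cmd.append("-l")          # document was converted to "Linear" format
    cmd.append(str(ip))           # positional input path goes last
    return cmd


def run_docxlatex(ip, **options):
    """Run the CLI and return its stdout (requires docxlatex on PATH)."""
    result = subprocess.run(
        build_command(ip, **options), capture_output=True, text=True, check=True
    )
    return result.stdout


# Building the command needs neither docxlatex nor a real file.
print(build_command("report.docx", op="report.txt", linear=True))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.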
Examples
API
Here is a simple example of how an equation is extracted from a .docx file that contains the Fourier series equation. Using the API as shown above:
from docxlatex import Document
doc = Document("./fourier_series.docx")
text = doc.get_text()
print(text)
This will output:
$ f\left( x \right)={a}_{0}+\sum_{n=1}^{∞}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $
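In the output above, the equation appears between "$ " and " $" delimiters. If that holds for your documents, the equations can be pulled back out of the extracted text with a regular expression. This is a rough sketch under that assumption: find_equations is a hypothetical helper, not a docxlatex function, and it will misfire on text that contains literal dollar signs (for example, prices).

```python
import re

# Non-greedy match between two "$" delimiters; DOTALL lets an equation span lines.
EQUATION_RE = re.compile(r"\$\s*(.+?)\s*\$", re.DOTALL)


def find_equations(text):
    """Return the LaTeX bodies of all "$ ... $" spans found in text."""
    return EQUATION_RE.findall(text)


# Hand-written sample standing in for the string returned by get_text().
sample = (
    "The constant term is "
    r"$ {a}_{0} $"
    " and the cosine coefficients are "
    r"$ {a}_{n} $"
    "."
)
for body in find_equations(sample):
    print(body)
```

When the doc.equations list is available, it is the more direct route; the regular expression is mainly useful when only the extracted text has been kept.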
CLI
Here is how you can use the CLI on the same file:
$ docxlatex ./fourier_series.docx
This will output the same LaTeX code to the console:
$ f\left( x \right)={a}_{0}+\sum_{n=1}^{∞}{\left( {a}_{n}\cos_{}^{}{\frac{nπx}{L}}+{b}_{n}\sin_{}^{}{\frac{nπx}{L}} \right)} $
Issues
docxlatex is not perfect, and you may encounter issues with certain .docx files, especially those with complex formatting or non-standard elements, or those created with older versions of Word.
If you find a bug or have a feature request, please open an issue on the GitHub repository. All bug reports and feature requests are welcome and greatly appreciated!
File details
Details for the file docxlatex-1.2.3.tar.gz.
File metadata
- Download URL: docxlatex-1.2.3.tar.gz
- Upload date:
- Size: 5.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3ab8e724f3a39c8597b07a20b2a2329886c2a7579eeacd81d578702be51f8e12 |
| MD5 | 6a8b9144eff47845d21a749ccfbcb3ac |
| BLAKE2b-256 | 2633df89a39a0397a7d6e27ed5c1a09e121eafbedfb827a6f3bc7984c2453676 |
File details
Details for the file docxlatex-1.2.3-py3-none-any.whl.
File metadata
- Download URL: docxlatex-1.2.3-py3-none-any.whl
- Upload date:
- Size: 6.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ee464749941d8325312ae0b7eba18e25cc71aedb8bc8e96ea182f4fe632c910 |
| MD5 | 4dbde758b4ce910d09552603f1110377 |
| BLAKE2b-256 | 0179bf20379c9c26a0482ae8edb185d7d7ee25da37bc6f92bc51bfee6c377899 |