Extract text from .docx and .odt files to strings in pure Python.
docx2txt2
My personal replacement for docx2txt.
It's intended to be very simple and provide some utilities to match the functionality of the original lib.
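For readers wondering what "pure python" extraction can look like, here is a minimal sketch using only the standard library. It is not docx2txt2's actual implementation: the helper name `sketch_extract_text` and the demo document are made up for illustration, and it assumes the body text lives in `word/document.xml` as `<w:t>` nodes, ignoring headers, footers, images and .odt files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def sketch_extract_text(source) -> str:
    """Hypothetical helper: join the text of every <w:t> node with single spaces."""
    # zipfile.ZipFile accepts a path string, a PathLike, or a seekable binary stream.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    pieces = [node.text for node in root.iter(W_NS + "t") if node.text]
    return " ".join(pieces)


# Build a tiny in-memory archive so the sketch runs without a real document.
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t>world</w:t></w:r></w:p></w:body>"
    "</w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DEMO_XML)

print(sketch_extract_text(buffer))
```

Joining every run with a space is a simplification; a real document can split a single word across several runs.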
Usage
Install with your favourite package manager; anything that pulls from PyPI will work (pip, poetry, pdm, etc.).
pip install docx2txt2
Use with a file path (a string or any PathLike object) or a binary IO stream.
import io
from pathlib import Path
import docx2txt2
# path
text = docx2txt2.extract_text("path/to/my.docx")
image_paths = docx2txt2.extract_images("path/to/my.docx", "path/to/images/out")
# actual Paths
docx_path = Path(__file__).parent / "my.docx"
image_out = Path(__file__).parent / "my" / "images"
image_out.mkdir(parents=True, exist_ok=True)  # don't fail if the directory already exists
text2 = docx2txt2.extract_text(docx_path)
image_paths2 = docx2txt2.extract_images(docx_path, image_out)
# bytestreams
docx_bytes = b"..."
bytes_io = io.BytesIO(docx_bytes)
text3 = docx2txt2.extract_text(bytes_io)
image_paths3 = docx2txt2.extract_images(bytes_io, "path/to/images/out")
Compatibility & Motivation
docx2txt2 returns a superset of the text returned by docx2txt, with some caveats (listed below), so the following holds:
import docx2txt
import docx2txt2
orig_content = docx2txt.process("my/file.docx").split()
new_content = docx2txt2.process("my/file.docx").split()
assert all(orig in new_content for orig in orig_content)
This is checked by the test test_extract_data.test_docx2txt_compatability.
Compatibility & Caveats
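If the assertion above ever fails on one of your own documents, it helps to see which tokens went missing rather than just a failed `all(...)`. A small sketch of such a check follows; `missing_tokens` is a hypothetical helper, not part of either library, and the sample strings are invented stand-ins for the two libraries' output.

```python
def missing_tokens(orig_text: str, new_text: str) -> list[str]:
    """Return whitespace-separated tokens of orig_text that are absent from new_text."""
    new_tokens = set(new_text.split())  # set gives constant-time membership checks
    return [token for token in orig_text.split() if token not in new_tokens]


# Invented sample strings, standing in for the output of the two libraries.
orig_sample = "Title\n\nBody\ttext"
new_sample = "Title Body text PAGE"

print(missing_tokens(orig_sample, new_sample))
```

An empty list means the superset property holds for that pair of strings.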
- Doesn't preserve whitespace or styling like the original; page breaks, tabs and the like are now just spaces.
- Headers and footers contain "PAGE" where a page number would be, unlike the original, which removed them.
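To illustrate the first caveat, this is roughly what "just spaces" means for a string containing tabs, newlines and a form feed. It's a sketch of the effect only, not the library's own code, and `collapse_whitespace` is a made-up name.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (tabs, newlines, form feeds) with one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


sample = "Heading\n\tFirst line\x0cSecond page"
print(collapse_whitespace(sample))
```

If you rely on paragraph or page boundaries downstream, you'd need to recover them some other way.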
Motivations for rewrite:
- Speed: I have lots of Word docs to process and I saw some efficiency gains over the original lib.
- Formatting: I didn't want to do whitespace removal for every run; this preformats output to only include spaces.
Benchmarks
Basic benchmarking using pytest-benchmark with a simple test document on my M1 MacBook and on GitHub Actions. From these tests, this lib appears to be just under 2x faster on average.
Macbook:
----------------------------------------------------------------------- benchmark: 2 tests -----------------------------------------------------------------------
Name (time in ms)         Min            Max             Mean           StdDev         Median         IQR            Outliers  OPS              Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2  1.1498 (1.0)   6.2305 (1.0)    1.1949 (1.0)   0.3096 (1.0)   1.1685 (1.0)   0.0142 (1.0)   3;74      836.9124 (1.0)   724     1
test_benchmark_docx2txt   2.1684 (1.89)  7.5298 (1.21)   2.2469 (1.88)  0.3941 (1.27)  2.2044 (1.89)  0.0231 (1.62)  2;41      445.0671 (0.53)  365     1
------------------------------------------------------------------------------------------------------------------------------------------------------------------
GitHub Actions, python 3.12:
----------------------------------------------------------------------- benchmark: 2 tests -----------------------------------------------------------------------
Name (time in ms)         Min            Max             Mean           StdDev         Median         IQR            Outliers  OPS              Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2  1.5368 (1.0)   8.6408 (1.0)    1.6104 (1.0)   0.4961 (1.0)   1.5697 (1.0)   0.0349 (1.0)   3;11      620.9509 (1.0)   565     1
test_benchmark_docx2txt   3.0235 (1.97)  10.1797 (1.18)  3.1365 (1.95)  0.5956 (1.20)  3.0822 (1.96)  0.0356 (1.02)  2;10      318.8220 (0.51)  279     1
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Disclaimer: More thorough benchmarking could be conducted. This is a faster lib in general but I haven't tested edge cases.
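The roughly 2x figure can be checked directly from the Mean column of the two tables above. A quick sketch of the arithmetic (the dictionary layout and labels are just for illustration; the numbers are copied from the tables):

```python
# Mean times in milliseconds, copied from the two benchmark tables above.
mean_ms = {
    "M1 MacBook": {"docx2txt2": 1.1949, "docx2txt": 2.2469},
    "GitHub Actions": {"docx2txt2": 1.6104, "docx2txt": 3.1365},
}

# Speedup = mean time of the original lib divided by mean time of this lib.
speedups = {
    env: times["docx2txt"] / times["docx2txt2"] for env, times in mean_ms.items()
}

for env, ratio in speedups.items():
    print(f"{env}: {ratio:.2f}x")
```

Both ratios land between 1.8 and 2.0, matching the relative values pytest-benchmark prints in parentheses.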
Also see:
- pptx2txt2 for pptx/odp conversion