
docx2txt2

Extract text from .docx and .odt files to strings in pure python.


My personal replacement for docx2txt.

It's intended to be very simple and provide some utilities to match the functionality of the original lib.

Usage

Install with your favourite package manager; anything that pulls from PyPI will work (pip, poetry, pdm, etc.):

pip install docx2txt2

Pass either a file path (a string or any PathLike object) or a binary IO stream:

import io
from pathlib import Path
import docx2txt2

# string paths
text = docx2txt2.extract_text("path/to/my.docx")
image_paths = docx2txt2.extract_images("path/to/my.docx", "path/to/images/out")

# pathlib.Path objects
docx_path = Path(__file__).parent / "my.docx"
image_out = Path(__file__).parent / "my" / "images"
image_out.mkdir(parents=True, exist_ok=True)  # don't fail if the directory already exists

text2 = docx2txt2.extract_text(docx_path)
image_paths2 = docx2txt2.extract_images(docx_path, image_out)

# in-memory byte streams
docx_bytes = b"..."
bytes_io = io.BytesIO(docx_bytes)
text3 = docx2txt2.extract_text(bytes_io)
image_paths3 = docx2txt2.extract_images(bytes_io, "path/to/images/out")
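
For readers wondering how pure-Python extraction can work at all: a .docx file is a zip archive whose main text lives in word/document.xml, inside `<w:t>` elements. The sketch below uses only the standard library and is not docx2txt2's implementation; `sketch_extract_text` and `make_minimal_docx` are hypothetical helpers, and the generated archive is a bare-bones stand-in rather than a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def sketch_extract_text(source) -> str:
    """Hypothetical helper: join the text of every <w:t> element with spaces."""
    # zipfile.ZipFile accepts a path string, a PathLike, or a binary file object.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    pieces = [node.text for node in root.iter(f"{W_NS}t") if node.text]
    return " ".join(pieces)


def make_minimal_docx(paragraphs) -> bytes:
    """Hypothetical helper: build an archive that holds only word/document.xml."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>' + body + "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


docx_like = io.BytesIO(make_minimal_docx(["Hello", "world"]))
print(sketch_extract_text(docx_like))  # → Hello world
```

A real document also has headers, footers, tables and embedded images spread across other archive members, which is where a dedicated library earns its keep.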

Compatibility & Motivation

docx2txt2 returns a superset of the data returned by docx2txt, with some caveats (listed below), so the following holds:

import docx2txt

import docx2txt2

orig_content = docx2txt.process("my/file.docx").split()
new_content = docx2txt2.process("my/file.docx").split()

assert all(orig in new_content for orig in orig_content)

This check is run as the test test_extract_data.test_docx2txt_compatability.
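
When the superset check fails on one of your own documents, it helps to see which tokens are missing rather than just a failed assert. A minimal sketch, assuming both outputs are plain strings; `missing_tokens` is a hypothetical helper, not part of either library, and the sample strings are made up for illustration.

```python
def missing_tokens(original_text: str, new_text: str) -> list[str]:
    """Return whitespace-separated tokens of original_text absent from new_text."""
    new_tokens = set(new_text.split())
    return [token for token in original_text.split() if token not in new_tokens]


# Stand-in strings; in practice these would be the outputs of the two libraries.
orig_output = "Title\n\nFirst\tparagraph"
new_output = "PAGE Title First paragraph"

print(missing_tokens(orig_output, new_output))  # → []
```

An empty list means every token from the original output also appears in the new output, which is the same condition the assertion above checks.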

Compatibility & Caveats

  • Unlike the original, whitespace and styling are not preserved; page breaks, tabs and the like become plain spaces.
  • Headers and footers contain "PAGE" where there would be a page number, unlike the original, which removed them.
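
To compare output from the original lib against this one by eye, it can help to flatten the original's whitespace the same way. A minimal sketch of that kind of normalisation, not the code docx2txt2 actually uses; `collapse_whitespace` is a hypothetical helper.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (newlines, tabs, form feeds) with one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


sample = "Heading\n\n\tIndented line\x0cNext page"
print(collapse_whitespace(sample))  # → Heading Indented line Next page
```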

Motivations for rewrite:

  • Speed: I have lots of Word docs to process and I saw some efficiency gains over the original lib.
  • Formatting: I didn't want to strip whitespace after every run; this lib preformats its output so the only whitespace is spaces.

Benchmarks

Basic benchmarking using pytest-benchmark with a basic test document on my M1 MacBook and on GitHub Actions. From these tests this lib appears to be a touch under 2x faster on average (about 1.88x on the MacBook and 1.95x on GitHub Actions, by mean time).

MacBook:

----------------------------------------------------------------------------------- benchmark: 2 tests ----------------------------------------------------------------------------------
Name (time in ms)               Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2     1.1498 (1.0)      6.2305 (1.0)      1.1949 (1.0)      0.3096 (1.0)      1.1685 (1.0)      0.0142 (1.0)          3;74  836.9124 (1.0)         724           1
test_benchmark_docx2txt      2.1684 (1.89)     7.5298 (1.21)     2.2469 (1.88)     0.3941 (1.27)     2.2044 (1.89)     0.0231 (1.62)         2;41  445.0671 (0.53)        365           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

GitHub Actions, python 3.12:

----------------------------------------------------------------------------------- benchmark: 2 tests -----------------------------------------------------------------------------------
Name (time in ms)               Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2     1.5368 (1.0)       8.6408 (1.0)      1.6104 (1.0)      0.4961 (1.0)      1.5697 (1.0)      0.0349 (1.0)          3;11  620.9509 (1.0)         565           1
test_benchmark_docx2txt      3.0235 (1.97)     10.1797 (1.18)     3.1365 (1.95)     0.5956 (1.20)     3.0822 (1.96)     0.0356 (1.02)         2;10  318.8220 (0.51)        279           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Disclaimer: More thorough benchmarking could be conducted. This is a faster lib in general but I haven't tested edge cases.
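
The speedup can be recomputed from the Mean column of the two tables above. The snippet below is plain arithmetic on those reported numbers; the `mean_ms` dictionary and the `speedup` helper are names made up for this illustration.

```python
# Mean times in milliseconds, copied from the benchmark tables above.
mean_ms = {
    "MacBook": {"docx2txt2": 1.1949, "docx2txt": 2.2469},
    "GitHub Actions": {"docx2txt2": 1.6104, "docx2txt": 3.1365},
}


def speedup(times: dict) -> float:
    """How many times faster docx2txt2 is than docx2txt, by mean time."""
    return times["docx2txt"] / times["docx2txt2"]


for environment, times in mean_ms.items():
    print(f"{environment}: {speedup(times):.2f}x")
# → MacBook: 1.88x
# → GitHub Actions: 1.95x
```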


