Extract text from .pptx and .odp files to strings in pure python.

pptx2txt2

My personal replacement for pptx2txt.

It's intended to be very simple and provide some utilities to extract content similar to the original lib.

Usage

Install with your favorite package manager (anything that pulls from PyPI will work: pip, poetry, pdm, etc.)

pip install pptx2txt2

Use with any PathLike object, like a filepath or IO stream.

There are three functions:

  • extract_text_per_slide returns a dict[int, str] holding the content and notes of each slide
  • extract_text is a utility that joins the content of all slides
  • extract_images copies the images over to another directory

import io
from pathlib import Path
import pptx2txt2

# path
text = pptx2txt2.extract_text("path/to/my.pptx")
text_per_slide = pptx2txt2.extract_text_per_slide("path/to/my.pptx")
image_paths = pptx2txt2.extract_images("path/to/my.pptx", "path/to/images/out")

# actual Paths
pptx_path = Path(__file__).parent / "my.pptx"
image_out = Path(__file__).parent / "my" / "images"
image_out.mkdir(parents=True, exist_ok=True)  # don't fail if the directory already exists

text2 = pptx2txt2.extract_text(pptx_path)
text_per_slide2 = pptx2txt2.extract_text_per_slide(pptx_path)
image_paths2 = pptx2txt2.extract_images(pptx_path, image_out)

# bytestreams
pptx_bytes = b"..."
bytes_io = io.BytesIO(pptx_bytes)
text3 = pptx2txt2.extract_text(bytes_io)
text_per_slide3 = pptx2txt2.extract_text_per_slide(bytes_io)
image_paths3 = pptx2txt2.extract_images(bytes_io, "path/to/images/out")
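
Because extract_text_per_slide returns a dict[int, str], its result can be post-processed with ordinary dictionary operations. The sketch below uses a hypothetical helper, format_slides, which is not part of pptx2txt2, and a hand-written sample dict standing in for real output. Whether slide numbers start at 0 or 1 is not stated here, so the helper simply sorts whatever keys it receives.

```python
def format_slides(text_per_slide: dict[int, str]) -> str:
    """Render a {slide_number: text} mapping as one labelled line per slide."""
    lines = []
    for number in sorted(text_per_slide):
        # Skip slides whose extracted text is empty or whitespace only.
        content = text_per_slide[number].strip()
        if content:
            lines.append(f"[slide {number}] {content}")
    return "\n".join(lines)


# Stand-in for the result of pptx2txt2.extract_text_per_slide("path/to/my.pptx")
sample = {2: "Agenda", 1: "Welcome", 3: "  "}
print(format_slides(sample))
```

With real output, the sample dict would be replaced by the return value of extract_text_per_slide.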

Considerations

  • Doesn't preserve whitespace or styling like the original; new slides, tabs and the like are now just spaces.
  • Headers and footers contain "<#>" or "" where there would be a number, unlike the original, which removed them.
  • .pptx files have a UUID in the text where images were.
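
If the placeholders described above get in the way, they can be stripped after extraction. This is a minimal sketch, not something pptx2txt2 provides: clean_extracted_text is a hypothetical helper, and it assumes the image placeholder is a UUID in the canonical 8-4-4-4-12 hexadecimal form, which should be checked against real output.

```python
import re

# Assumption: the image placeholder is a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def clean_extracted_text(text: str) -> str:
    """Remove UUID and "<#>" placeholders, then collapse runs of whitespace."""
    text = UUID_RE.sub(" ", text)
    text = text.replace("<#>", " ")
    return " ".join(text.split())


raw = "Intro 123e4567-e89b-12d3-a456-426614174000 chart   slide <#>"
print(clean_extracted_text(raw))  # Intro chart slide
```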

Benchmarks

Basic benchmarking using pytest-benchmark with a basic test document on my M1 MacBook and on GitHub Actions.

MacBook:

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)               Min     Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2     2.4470  7.1815  2.5762  0.4344  2.4987  0.1050       2;7  388.1666     122           1
-------------------------------------------------------------------------------------------------------------------

GitHub Actions, Python 3.12:

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)               Min      Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2     4.0548  11.4523  4.2387  0.8312  4.1343  0.0484      3;11  235.9197     217           1
--------------------------------------------------------------------------------------------------------------------
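
To relate the columns: the OPS figure is approximately the reciprocal of the mean time per call. The short check below recomputes it from the Mean values in the two tables; the function name is illustrative, and the small differences from the reported OPS come from the means being rounded to four decimal places.

```python
def ops_from_mean_ms(mean_ms: float) -> float:
    """Convert a mean time per call in milliseconds to calls per second."""
    return 1000.0 / mean_ms


macbook_mean_ms = 2.5762  # Mean column, MacBook run
actions_mean_ms = 4.2387  # Mean column, GitHub Actions run

print(round(ops_from_mean_ms(macbook_mean_ms), 1))   # 388.2
print(round(ops_from_mean_ms(actions_mean_ms), 1))   # 235.9
print(round(actions_mean_ms / macbook_mean_ms, 2))   # 1.65 (Actions mean relative to MacBook)
```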
