Extract text from .pptx and .odp files to strings in pure python.
pptx2txt2
My personal replacement for pptx2txt.
It's intended to be very simple and provide some utilities to extract content similar to the original lib.
Usage
Install with your fave package manager (anything that pulls from PyPI will work: pip, poetry, pdm, etc.)
pip install pptx2txt2
Use with any PathLike object, like a filepath or IO stream.
There are 3 methods:
- extract_text_per_slide returns a dict[int, str] of per slide content & notes
- extract_text utility to join all slide content
- extract_images copy images over to another dir
import io
from pathlib import Path
import pptx2txt2
# path
text = pptx2txt2.extract_text("path/to/my.pptx")
text_per_slide = pptx2txt2.extract_text_per_slide("path/to/my.pptx")
image_paths = pptx2txt2.extract_images("path/to/my.pptx", "path/to/images/out")
# actual Paths
pptx_path = Path(__file__).parent / "my.pptx"
image_out = Path(__file__).parent / "my" / "images"
image_out.mkdir(parents=True, exist_ok=True)  # don't fail if the dir already exists
text2 = pptx2txt2.extract_text(pptx_path)
text_per_slide2 = pptx2txt2.extract_text_per_slide(pptx_path)
image_paths2 = pptx2txt2.extract_images(pptx_path, image_out)
# bytestreams
pptx_bytes = b"..."
bytes_io = io.BytesIO(pptx_bytes)
text3 = pptx2txt2.extract_text(bytes_io)
text_per_slide3 = pptx2txt2.extract_text_per_slide(bytes_io)
image_paths3 = pptx2txt2.extract_images(bytes_io, "path/to/images/out")
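Since extract_text_per_slide returns a dict[int, str], the per-slide text can be post-processed with ordinary dict handling. The following is a minimal sketch, not part of the library: format_slides is a hypothetical helper, and the sample dict is stand-in data shaped like the documented return type (the int keys are assumed to identify slides).

```python
def format_slides(text_per_slide: dict[int, str]) -> str:
    """Join per-slide text into one string with a heading per slide."""
    sections = []
    for slide_number in sorted(text_per_slide):
        body = text_per_slide[slide_number].strip()
        if not body:
            continue  # skip slides with no extracted text
        sections.append(f"--- Slide {slide_number} ---\n{body}")
    return "\n\n".join(sections)


# Stand-in data; in practice this would come from
# pptx2txt2.extract_text_per_slide("path/to/my.pptx").
sample = {2: "Agenda and goals", 1: "Welcome ", 3: ""}
report = format_slides(sample)
print(report)
```

Sorting the keys keeps the output in slide order even if the dict was built out of order.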
Considerations
- Doesn't preserve whitespace or styling like the original; new slides, tabs and the like are now just spaces.
- Headers and footers contain "<#>" where there would be a number, unlike the original, which removed them.
- pptx files have a UUID in text where images were.
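If the placeholders and image markers get in the way, they can be stripped after extraction. This is a rough sketch rather than anything the library provides: clean_extracted_text is a hypothetical helper, and it assumes the image markers are canonical 8-4-4-4-12 hex UUIDs and that the number placeholder appears literally as "<#>".

```python
import re

# Assumption: image markers look like canonical 8-4-4-4-12 hex UUIDs.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
# Assumption: slide-number placeholders appear literally as "<#>".
SLIDE_NUMBER_PLACEHOLDER = "<#>"


def clean_extracted_text(text: str) -> str:
    """Remove UUID markers and "<#>" placeholders, then collapse whitespace."""
    without_uuids = UUID_RE.sub(" ", text)
    without_placeholders = without_uuids.replace(SLIDE_NUMBER_PLACEHOLDER, " ")
    return " ".join(without_placeholders.split())


raw = "Quarterly results 123e4567-e89b-12d3-a456-426614174000 Revenue grew <#>"
cleaned = clean_extracted_text(raw)
print(cleaned)
```

Because the extracted text already uses plain spaces as separators, collapsing whitespace at the end should not lose any structure that was still there.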
Benchmarks
Basic benchmarking using pytest-benchmark with a basic test document on my M1 MacBook and on GitHub Actions.
MacBook:
----------------------------------------------- benchmark: 1 tests ----------------------------------------------
Name (time in ms)            Min      Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2  2.4470   7.1815  2.5762  0.4344  2.4987  0.1050       2;7  388.1666     122           1
-----------------------------------------------------------------------------------------------------------------
GitHub Actions, python 3.12:
----------------------------------------------- benchmark: 1 tests ----------------------------------------------
Name (time in ms)            Min      Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2  4.0548  11.4523  4.2387  0.8312  4.1343  0.0484      3;11  235.9197     217           1
-----------------------------------------------------------------------------------------------------------------
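To get comparable numbers on other hardware without pytest-benchmark, a small standard-library harness is enough. This is a stand-in sketch, not the benchmark used above: time_call is a hypothetical helper, and the lambda is a placeholder workload that would be swapped for a call such as pptx2txt2.extract_text on a test document. The OPS figure is computed as 1000 divided by the mean in milliseconds, which is consistent with the tables above (1000 / 2.5762 ≈ 388.17).

```python
import statistics
import time
from typing import Callable


def time_call(func: Callable[[], object], rounds: int = 50) -> dict[str, float]:
    """Time func over several rounds and report statistics in milliseconds."""
    timings_ms = []
    for _ in range(rounds):
        start = time.perf_counter()
        func()
        timings_ms.append((time.perf_counter() - start) * 1000)
    mean_ms = statistics.mean(timings_ms)
    return {
        "min": min(timings_ms),
        "max": max(timings_ms),
        "mean": mean_ms,
        "median": statistics.median(timings_ms),
        "ops": 1000 / mean_ms,  # operations per second, from the mean in ms
    }


# Placeholder workload; replace with the extraction call to be measured.
stats = time_call(lambda: sum(range(10_000)), rounds=20)
print({name: round(value, 4) for name, value in stats.items()})
```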
Also See
- docx2txt2 for docx conversion