A Python library for extracting text from different types of files (PDF, DOCX, PPTX, XLSX, ODT, etc.).
PyxTxt
PyxTxt is a simple and powerful Python library to extract text from various file formats.
It supports PDF, DOCX, XLSX, PPTX, ODT, HTML, XML, TXT, legacy XLS files, and more.
✨ Features
- Extracts text from both file paths and in-memory buffers (io.BytesIO).
- Supports multiple formats: PDF, DOCX, PPTX, XLSX, ODT, HTML, XML, TXT, legacy Office files (.xls, .doc, .ppt).
- Automatically detects MIME type using python-magic.
- Compatible with modern and legacy formats.
- Can handle streamed content without saving to disk (with some limitations).
📦 Installation
pip install pyxtxt
⚠️ Note: You must have libmagic installed on your system (required by python-magic).
On Ubuntu/Debian:
sudo apt install libmagic1
On Mac (Homebrew):
brew install libmagic
On Windows:
Use python-magic-bin instead of python-magic for easier installation.
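Since a missing libmagic is the most likely installation problem, a quick check before first use can save some debugging. The sketch below is not part of pyxtxt; `libmagic_available` is a hypothetical helper name, and it relies on python-magic raising `ImportError` at import time when the native library cannot be found.

```python
def libmagic_available() -> bool:
    """Return True if python-magic and its native libmagic library can be loaded.

    Hypothetical helper, not part of pyxtxt.
    """
    try:
        # python-magic loads libmagic at import time and raises ImportError
        # when the shared library is missing; OSError covers loader failures.
        import magic
    except (ImportError, OSError):
        return False
    # from_buffer is the python-magic call used for in-memory detection.
    return hasattr(magic, "from_buffer")


if not libmagic_available():
    print("libmagic not found: install libmagic1 / libmagic, or python-magic-bin on Windows")
```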
🛠️ Dependencies
- PyMuPDF (fitz)
- beautifulsoup4
- python-docx
- python-pptx
- odfpy
- openpyxl
- lxml
- xlrd (<2.0.0)
- python-magic
These dependencies are declared in pyproject.toml, so pip installs them automatically.
📚 Usage Example
Extract text from a file path:
from pyxtxt import xtxt
text = xtxt("document.pdf")
print(text)
Extract text from a file-like buffer:
import io
from pyxtxt import xtxt

with open("document.docx", "rb") as f:
    buffer = io.BytesIO(f.read())

text = xtxt(buffer)
print(text)
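When several files need processing, the same call can be wrapped in a small loop. The following is a minimal sketch, not part of pyxtxt: `extract_many` is a hypothetical helper, and because this page does not document how `xtxt` reports failures, the sketch defensively treats any exception as "no text" and stores `None`.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_many(
    paths: Iterable[str],
    extractor: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Run an extractor over several paths and collect the results by path.

    Hypothetical helper: the extractor is passed in, so any callable that
    takes a path and returns text can be used.
    """
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = extractor(path)
        except Exception:
            # Assumption: failure behaviour is undocumented, so record None
            # instead of aborting the whole batch.
            results[path] = None
    return results
```

With pyxtxt installed, this would be called as `extract_many(["a.pdf", "b.docx"], xtxt)` after `from pyxtxt import xtxt`.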
⚠️ Known Limitations
When passing a raw stream (io.BytesIO) without a filename, legacy files (.doc, .xls, .ppt) may not be correctly detected.
This is a limitation of libmagic, not of pyxtxt.
If available, passing the original filename along with the buffer is highly recommended.
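To illustrate why the filename helps: legacy .doc, .xls and .ppt files all start with the same OLE2 compound-file signature, so the first bytes alone cannot tell them apart. The sketch below shows this with the standard library only. It is not pyxtxt's or libmagic's actual detection logic, and `sniff_container`, `resolve_legacy_type` and `LEGACY_TYPES` are hypothetical names introduced for illustration.

```python
import io
import os
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"                           # DOCX, XLSX, PPTX, ODT are ZIP archives
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # shared by legacy .doc, .xls, .ppt

# Extension-based fallback for the formats the signature cannot distinguish.
LEGACY_TYPES = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
}


def sniff_container(buffer: io.BytesIO) -> str:
    """Classify a buffer by its leading bytes, restoring the read position."""
    pos = buffer.tell()
    buffer.seek(0)
    header = buffer.read(8)
    buffer.seek(pos)
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header == OLE2_MAGIC:
        return "ole2"
    return "unknown"


def resolve_legacy_type(buffer: io.BytesIO, filename: Optional[str] = None) -> Optional[str]:
    """Use the original filename to disambiguate an OLE2 container.

    Returns None when the buffer is not OLE2 or no usable filename is given.
    """
    if sniff_container(buffer) != "ole2" or not filename:
        return None
    ext = os.path.splitext(filename)[1].lower()
    return LEGACY_TYPES.get(ext)
```

Without a filename, `resolve_legacy_type` can only report that the type is undetermined, which mirrors the limitation described above.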
🔒 License
Distributed under the MIT License.
The software is provided "as is" without any warranty of any kind.
Pull requests, issues, and feedback are warmly welcome! 🚀
File details
Details for the file pyxtxt-0.1.9.tar.gz.
File metadata
- Download URL: pyxtxt-0.1.9.tar.gz
- Upload date:
- Size: 9.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ccaf25f6b779f6452e11e6ebb1da8f536381dd75ead9da7e55e388f44e04976 |
| MD5 | bcb66f20dcbce2cf96ded95c98d6a0e3 |
| BLAKE2b-256 | 1cfebdd014f7a0d4bf1bd78c22c85f9064aadd3d83101c992807d58db03fe403 |
File details
Details for the file pyxtxt-0.1.9-py3-none-any.whl.
File metadata
- Download URL: pyxtxt-0.1.9-py3-none-any.whl
- Upload date:
- Size: 12.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a3d7f430a6827442db5d07fec644851a25370b012aa8af42a474cb89dea8d73d |
| MD5 | 7a28b1b6c2c4744dae6f4500a6652cda |
| BLAKE2b-256 | fde28c9c12d02b164bbaa2476d916f0529e18d202d85741f75babffbafe8ad47 |