A Python library for extracting text from different types of files (PDF, DOCX, PPTX, XLSX, ODT, etc.).
PyxTxt
PyxTxt is a simple and powerful Python library to extract text from various file formats.
It supports PDF, DOCX, XLSX, PPTX, ODT, HTML, XML, TXT, legacy Office files, and more.
NEW in v0.2.1+: Enhanced support for web content, byte streams, and requests integration!
✨ Features
- Multiple input types: File paths, io.BytesIO buffers, raw bytes objects, and requests.Response objects
- Wide format support: PDF, DOCX, PPTX, XLSX, ODT, HTML, XML, TXT, legacy Office files (.xls, .ppt, .doc)
- Automatic MIME detection: Uses python-magic for intelligent file type recognition
- Web-ready: Direct support for downloading and extracting text from URLs
- Memory efficient: Process files without saving to disk
- Modern Python: Full type hints and clean API design
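The input types listed above can all be reduced to an in-memory buffer before extraction. The following is a minimal sketch of that idea, not PyxTxt's actual internals: `to_buffer` is a hypothetical helper, and a response object is recognized only by duck typing on a `.content` attribute that holds bytes.

```python
import io
from pathlib import Path
from typing import Any


def to_buffer(source: Any) -> io.BytesIO:
    """Normalize a path, bytes, BytesIO, or response-like object to BytesIO.

    Hypothetical helper for illustration; PyxTxt may handle inputs differently.
    """
    if isinstance(source, io.BytesIO):
        return source
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(bytes(source))
    if isinstance(source, (str, Path)):
        # Treat strings and Path objects as file paths and read them fully
        return io.BytesIO(Path(source).read_bytes())
    # requests.Response exposes the body as bytes on `.content`
    content = getattr(source, "content", None)
    if isinstance(content, (bytes, bytearray)):
        return io.BytesIO(bytes(content))
    raise TypeError(f"Unsupported input type: {type(source).__name__}")
```

Normalizing early keeps the rest of a pipeline independent of where the bytes came from.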
📦 Installation
The library is modular so you can install all modules:
pip install pyxtxt[all]
or just the modules you need:
pip install pyxtxt[pdf,odf,docx,presentation,spreadsheet,html]
Because they share the same underlying libraries, installing the html module also enables SVG and XML support. The architecture is designed to grow with new modules for additional formats.
⚠️ Note: You must have libmagic installed on your system (required by python-magic).
The pyproject.toml file should select the correct version for your system. If you run into problems, you can install libmagic manually.
On Ubuntu/Debian:
sudo apt install libmagic1
On Mac (Homebrew):
brew install libmagic
On Windows:
Use python-magic-bin instead of python-magic for easier installation.
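To confirm that libmagic is usable after installation, a quick check along these lines may help. This is a sketch: `libmagic_available` is a hypothetical helper name, and the only library call it relies on is `magic.from_buffer(..., mime=True)` from python-magic.

```python
def libmagic_available() -> bool:
    """Return True if python-magic imports and can inspect a small buffer."""
    try:
        import magic  # provided by python-magic or python-magic-bin

        # from_buffer with mime=True returns a MIME type string
        magic.from_buffer(b"plain text", mime=True)
        return True
    except Exception:
        # Typically an ImportError when the package or the libmagic
        # shared library cannot be found
        return False


if __name__ == "__main__":
    print("libmagic OK" if libmagic_available() else "libmagic missing")
```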
🛠️ Dependencies
- PyMuPDF (fitz)
- beautifulsoup4
- python-docx
- python-pptx
- odfpy
- openpyxl
- lxml
- xlrd (<2.0.0)
- python-magic
Dependencies are automatically installed from pyproject.toml.
📚 Usage Examples
Basic Usage
from pyxtxt import xtxt
# Extract from file path
text = xtxt("document.pdf")
print(text)
# Extract from BytesIO buffer
import io
with open("document.docx", "rb") as f:
    buffer = io.BytesIO(f.read())
text = xtxt(buffer)
print(text)
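When processing several files, it can be convenient to wrap the call so that one bad file does not stop the batch. The sketch below takes the extractor as a parameter; `extract_many` is a hypothetical helper, and it assumes failures may surface as exceptions, which the source does not confirm for xtxt.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_many(
    paths: Iterable[str],
    extractor: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Run `extractor` on each path; store None when extraction raises."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = extractor(path)
        except Exception:
            # Assumption: a failed extraction may raise; record it and go on
            results[path] = None
    return results
```

With PyxTxt installed, this could be called as `extract_many(["document.pdf", "document.docx"], xtxt)`.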
NEW: Web Content Support
import requests
from pyxtxt import xtxt, xtxt_from_url
# Method 1: Direct from bytes
response = requests.get("https://example.com/document.pdf")
text = xtxt(response.content)
# Method 2: Direct from Response object
text = xtxt(response)
# Method 3: URL helper function
text = xtxt_from_url("https://example.com/document.pdf")
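For Method 1, it is usually worth setting a timeout and checking the HTTP status before handing the bytes to the extractor. A minimal sketch, assuming nothing about how `xtxt_from_url` behaves internally; `fetch_document_bytes` and its `get` parameter are hypothetical names introduced here so the helper can be exercised without network access.

```python
from typing import Any, Callable, Optional


def fetch_document_bytes(
    url: str,
    timeout: float = 10.0,
    get: Optional[Callable[..., Any]] = None,
) -> bytes:
    """Download `url` and return the body as bytes, raising on HTTP errors."""
    if get is None:
        import requests  # imported lazily so the helper can be tested offline

        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content
```

The returned bytes can then be passed to `xtxt(...)` as in Method 1.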
Show Available Formats
from pyxtxt import extxt_available_formats
# List supported MIME types
formats = extxt_available_formats()
print(formats)
# Pretty format names
formats = extxt_available_formats(pretty=True)
print(formats)
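The format list can be used to decide up front whether a file is worth sending to the extractor. The sketch below assumes that `extxt_available_formats()` returns an iterable of MIME type strings, as the comment above suggests; `is_supported` is a hypothetical helper.

```python
from typing import Iterable


def is_supported(mime_type: str, available: Iterable[str]) -> bool:
    """Case-insensitive check of a MIME type against the available formats."""
    # Drop parameters such as "; charset=utf-8" before comparing
    wanted = mime_type.split(";", 1)[0].strip().lower()
    return wanted in {m.strip().lower() for m in available}
```

For example, the Content-Type header of an HTTP response could be checked with `is_supported(response.headers.get("Content-Type", ""), extxt_available_formats())`.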
🌐 Common Web Use Cases
import requests
from pyxtxt import xtxt
# API responses
api_response = requests.post("https://api.example.com/generate-pdf")
text = xtxt(api_response.content)
# File uploads (Flask/Django)
uploaded_bytes = request.files['document'].read()
text = xtxt(uploaded_bytes)
# Email attachments
attachment_bytes = email_msg.get_payload(decode=True)
text = xtxt(attachment_bytes)
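For the email case, the attachment bytes can be collected with the standard library alone. The following sketch uses only the `email` package; `iter_attachments` and `build_sample_email` are hypothetical helper names, and the sample message exists purely for demonstration.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser
from typing import Iterator, Optional, Tuple


def iter_attachments(raw_email: bytes) -> Iterator[Tuple[Optional[str], bytes]]:
    """Yield (filename, payload bytes) for each attachment in a raw message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_email)
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True)
        if payload:  # skip parts without a decodable body
            yield part.get_filename(), payload


def build_sample_email() -> bytes:
    """Build a small message with one attachment, for demonstration only."""
    msg = EmailMessage()
    msg["Subject"] = "Report"
    msg["From"] = "a@example.com"
    msg["To"] = "b@example.com"
    msg.set_content("See attachment.")
    msg.add_attachment(
        b"%PDF-1.4 demo",
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return msg.as_bytes()
```

Each yielded payload can be passed to `xtxt(...)`, and the filename is available as a detection hint where needed.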
⚠️ Known Limitations
- Legacy file detection: When using raw streams without filenames, legacy files (.doc, .xls, .ppt) may not be correctly detected due to identical file signatures in libmagic
- Filename hints recommended: When available, providing original filenames improves detection accuracy
- MSWrite .doc files: Require antiword installation: sudo apt-get update && sudo apt-get install antiword
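The first two limitations follow from legacy Office files sharing one container signature. The sketch below illustrates why a filename helps; it is not PyxTxt's detection logic, and `guess_legacy_mime` and `LEGACY_MIME_BY_SUFFIX` are hypothetical names introduced for this example.

```python
from pathlib import PurePath
from typing import Optional

# Compound File Binary (OLE2) signature shared by legacy .doc, .xls, and .ppt
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical mapping used only for this illustration
LEGACY_MIME_BY_SUFFIX = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
}


def guess_legacy_mime(data: bytes, filename: Optional[str] = None) -> Optional[str]:
    """Return a legacy Office MIME type, or None if it cannot be determined."""
    if not data.startswith(OLE2_MAGIC):
        return None  # not an OLE2 container
    if filename is None:
        return None  # the signature alone cannot tell the three formats apart
    return LEGACY_MIME_BY_SUFFIX.get(PurePath(filename).suffix.lower())
```

Keeping the original filename next to the bytes, for example from an upload form or an email attachment, makes this kind of disambiguation possible.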
📖 Full Examples
See examples.py for comprehensive usage examples including:
- Local file processing
- Memory buffer handling
- Web content extraction
- Error handling patterns
- All supported formats demonstration
Accessing Examples After Installation
After installing PyxTxt from PyPI, you can access the examples file:
import pkg_resources
# Get path to examples file
examples_path = pkg_resources.resource_filename('pyxtxt', 'examples.py')
print(f"Examples file location: {examples_path}")
# Or read the content directly
examples_content = pkg_resources.resource_string('pyxtxt', 'examples.py').decode('utf-8')
print(examples_content)
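Since setuptools has deprecated `pkg_resources`, the same lookup can also be done with the standard library's `importlib.resources` (Python 3.9+). A sketch, assuming examples.py ships inside the installed package as described above; `read_package_file` is a hypothetical helper.

```python
from importlib import resources


def read_package_file(package: str, filename: str) -> str:
    """Return the text of `filename` shipped inside an installed package."""
    # files() returns a Traversable rooted at the package directory
    return resources.files(package).joinpath(filename).read_text(encoding="utf-8")
```

With PyxTxt installed, `print(read_package_file("pyxtxt", "examples.py"))` would display the examples.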
🔒 License
Distributed under the MIT License. See LICENSE file for details.
The software is provided "as is" without any warranty of any kind.
🤝 Contributing
Pull requests, issues, and feedback are warmly welcome! 🚀
- Bug reports: Please include file samples and error details
- Feature requests: Describe your use case and expected behavior
- Code contributions: Follow existing patterns and add tests
📊 Changelog
v0.1.24+
- ✅ Added support for bytes objects
- ✅ Added support for requests.Response objects
- ✅ Added xtxt_from_url() helper function
- ✅ Improved type hints and error handling
- ✅ Enhanced web content processing capabilities