TextXtract
A robust, extensible Python package for synchronous and asynchronous text extraction from PDF, DOCX, DOC, TXT, ZIP, Markdown, RTF, HTML, CSV, JSON, XML, and more.
Features
- Synchronous and asynchronous extraction APIs
- Modular file type handlers (PDF, DOCX, DOC, TXT, ZIP, Markdown, RTF, HTML, CSV, JSON, XML, and more)
- Abstract base classes for extensibility
- Custom exception handling and logging
- Configurable encoding, logging, and timeouts
- Easy to add new file type handlers
- Comprehensive unit tests with pytest
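To illustrate what "abstract base classes" and "easy to add new file type handlers" can look like in practice, here is a minimal, self-contained sketch. It does not import textxtract, and every name in it (`FileTypeHandler`, `TxtHandler`, `CsvHandler`, `HANDLERS`, `extract_text`) is hypothetical; the package's actual base class and registration mechanism may differ, so check ARCHITECTURE_PLAN.md for the real layout.

```python
import csv
import io
from abc import ABC, abstractmethod
from pathlib import Path


class FileTypeHandler(ABC):
    """Hypothetical base class; the real textxtract interface may differ."""

    @abstractmethod
    def extract(self, data: bytes, encoding: str = "utf-8") -> str:
        """Return the text contained in the raw file bytes."""


class TxtHandler(FileTypeHandler):
    def extract(self, data: bytes, encoding: str = "utf-8") -> str:
        # Plain text only needs decoding.
        return data.decode(encoding)


class CsvHandler(FileTypeHandler):
    def extract(self, data: bytes, encoding: str = "utf-8") -> str:
        # Join the cells of each row with spaces, one row per line.
        rows = csv.reader(io.StringIO(data.decode(encoding)))
        return "\n".join(" ".join(row) for row in rows)


# Hypothetical registry mapping lowercase file suffixes to handlers.
# Supporting a new format means adding one subclass and one entry here.
HANDLERS: dict[str, FileTypeHandler] = {
    ".txt": TxtHandler(),
    ".csv": CsvHandler(),
}


def extract_text(data: bytes, filename: str) -> str:
    """Pick a handler from the filename suffix and run it."""
    suffix = Path(filename).suffix.lower()
    try:
        handler = HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return handler.extract(data)
```

Dispatching on the filename suffix keeps each handler independent of the others, which is one common way to get the extensibility the feature list describes.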
Installation
pip install .
Usage Example
import asyncio
from textxtract.sync.extractor import SyncTextExtractor
from textxtract.aio.extractor import AsyncTextExtractor
# Read the input file as bytes ("document.pdf" is a placeholder path)
filename = "document.pdf"
with open(filename, "rb") as f:
    file_bytes = f.read()
# Synchronous extraction
extractor = SyncTextExtractor()
text = extractor.extract(file_bytes, filename)
# Asynchronous extraction
async_extractor = AsyncTextExtractor()
text = asyncio.run(async_extractor.extract_async(file_bytes, filename))
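The asynchronous API is most useful when several files are processed at once. The sketch below shows one way to do that with `asyncio.gather`, assuming only the `extract_async(file_bytes, filename)` call shown above. `StubAsyncExtractor` and `extract_many` are hypothetical names: the stub stands in for `AsyncTextExtractor` so the snippet runs without the package installed, and the `timeout` argument is a caller-side limit applied with `asyncio.wait_for`, not the package's own timeout setting.

```python
import asyncio


class StubAsyncExtractor:
    """Stand-in for AsyncTextExtractor so this sketch runs on its own."""

    async def extract_async(self, file_bytes: bytes, filename: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as real work would
        return file_bytes.decode("utf-8")


async def extract_many(extractor, files, timeout=5.0):
    """Extract all (file_bytes, filename) pairs concurrently.

    Returns a dict mapping each filename to its extracted text, or to the
    exception raised for that file, so one failure does not cancel the rest.
    """
    tasks = [
        asyncio.wait_for(extractor.extract_async(data, name), timeout)
        for data, name in files
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return {name: result for (_, name), result in zip(files, results)}


files = [(b"first file", "a.txt"), (b"second file", "b.txt")]
texts = asyncio.run(extract_many(StubAsyncExtractor(), files))
```

With the real package, passing an `AsyncTextExtractor()` instance in place of the stub should work the same way, provided its `extract_async` method accepts the arguments used in the usage example.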
API Reference
See ARCHITECTURE_PLAN.md for detailed architecture and module layout.
Running Tests
pytest
Contributing
- Fork the repository.
- Create a new branch.
- Add your feature or fix.
- Write tests.
- Submit a pull request.
License
MIT License