A robust, extensible Python package for synchronous and asynchronous text extraction from PDF, DOCX, DOC, TXT, ZIP, MD, RTF, HTML, and more.
TextXtract
A professional, extensible Python package for extracting text from multiple file formats with both synchronous and asynchronous support.
🚀 Features
- 🔄 Dual Input Support: Works with file paths or raw bytes
- ⚡ Sync & Async APIs: Choose the right approach for your use case
- 📁 Multiple Formats: PDF, DOCX, DOC, TXT, ZIP, Markdown, RTF, HTML, CSV, JSON, XML
- 🎯 Optional Dependencies: Install only what you need
- 🛡️ Robust Error Handling: Comprehensive exception hierarchy
- 📊 Professional Logging: Detailed debug and info level logging
- 🔒 Thread-Safe: Async operations use thread pools for I/O-bound tasks
- 🧹 Context Manager Support: Automatic resource cleanup
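The thread-pool point above can be illustrated with the standard library alone. The sketch below is not TextXtract's own implementation. It only shows the general pattern of pushing a blocking read onto a worker thread with `asyncio.to_thread` so the event loop stays free. `read_text_blocking`, `read_text_async`, and `demo_thread_offload` are hypothetical names chosen for this example.

```python
import asyncio
import tempfile
from pathlib import Path


def read_text_blocking(path: str) -> str:
    """Blocking stand-in for an extractor: read a UTF-8 text file."""
    return Path(path).read_text(encoding="utf-8")


async def read_text_async(path: str) -> str:
    """Run the blocking reader in a worker thread (requires Python 3.9+)."""
    return await asyncio.to_thread(read_text_blocking, path)


def demo_thread_offload() -> str:
    # Write a small temporary file, then read it back through the async wrapper.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_text("hello from a worker thread", encoding="utf-8")
        return asyncio.run(read_text_async(str(sample)))
```

Offloading this way suits I/O-bound work such as file reads. CPU-heavy parsing would still contend for the interpreter lock.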
Documentation
For complete documentation, including installation instructions, usage examples, and API reference, please visit our documentation site.
📦 Installation
Basic Installation
pip install textxtract
Install with File Type Support
# Install support for specific formats
pip install textxtract[pdf] # PDF support
pip install textxtract[docx] # Word documents
pip install textxtract[all] # All supported formats
# Multiple formats
pip install textxtract[pdf,docx,html]
🏃 Quick Start
Synchronous Extraction
from textxtract import SyncTextExtractor
extractor = SyncTextExtractor()
# Extract from file path
text = extractor.extract("document.pdf")
print(text)
# Extract from bytes (filename required for type detection)
with open("document.pdf", "rb") as f:
    file_bytes = f.read()
text = extractor.extract(file_bytes, "document.pdf")
print(text)
Asynchronous Extraction
from textxtract import AsyncTextExtractor
import asyncio
async def extract_text():
    extractor = AsyncTextExtractor()
    # Extract from file path
    text = await extractor.extract("document.pdf")
    return text
# Run async extraction
text = asyncio.run(extract_text())
print(text)
Context Manager Usage
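from textxtract import AsyncTextExtractor, SyncTextExtractor
# Automatic resource cleanup
with SyncTextExtractor() as extractor:
    text = extractor.extract("document.pdf")
# Async context manager; `async with` is only valid inside a coroutine
async def extract_with_cleanup():
    async with AsyncTextExtractor() as extractor:
        return await extractor.extract("document.pdf")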
📋 Supported File Types
| Format | Extensions | Dependencies | Installation |
|---|---|---|---|
| Text | .txt, .text | Built-in | pip install textxtract |
| Markdown | .md | Optional | pip install textxtract[md] |
| PDF | .pdf | Optional | pip install textxtract[pdf] |
| Word | .docx | Optional | pip install textxtract[docx] |
| Word Legacy | .doc | Optional | pip install textxtract[doc] |
| Rich Text | .rtf | Optional | pip install textxtract[rtf] |
| HTML | .html, .htm | Optional | pip install textxtract[html] |
| CSV | .csv | Built-in | pip install textxtract |
| JSON | .json | Built-in | pip install textxtract |
| XML | .xml | Optional | pip install textxtract[xml] |
| ZIP | .zip | Built-in | pip install textxtract |
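The table above can also be expressed as a lookup, which is handy when a script needs to tell a user which extra to install for a given file. This is a minimal sketch that transcribes the table by hand. `EXTRAS` and `install_command` are hypothetical names and are not part of the package.

```python
from pathlib import PurePath

# Extension -> optional extra, transcribed from the table above.
# None means the format is built in and needs no extra.
EXTRAS = {
    ".txt": None, ".text": None, ".csv": None, ".json": None, ".zip": None,
    ".md": "md", ".pdf": "pdf", ".docx": "docx", ".doc": "doc",
    ".rtf": "rtf", ".html": "html", ".htm": "html", ".xml": "xml",
}


def install_command(filename: str) -> str:
    """Return the pip command the table lists for this file's extension."""
    ext = PurePath(filename).suffix.lower()
    if ext not in EXTRAS:
        raise ValueError(f"Extension {ext!r} is not listed in the table")
    extra = EXTRAS[ext]
    if extra is None:
        return "pip install textxtract"
    return f"pip install textxtract[{extra}]"
```

For example, `install_command("report.PDF")` yields the PDF extra because the extension is lowercased before the lookup.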
🔧 Advanced Usage
Error Handling
from textxtract import SyncTextExtractor
from textxtract.exceptions import (
    FileTypeNotSupportedError,
    InvalidFileError,
    ExtractionError
)
extractor = SyncTextExtractor()
try:
    text = extractor.extract("document.pdf")
    print(text)
except FileTypeNotSupportedError:
    print("❌ File type not supported")
except InvalidFileError:
    print("❌ File is invalid or corrupted")
except ExtractionError:
    print("❌ Extraction failed")
Custom Configuration
from textxtract import SyncTextExtractor
from textxtract import ExtractorConfig
# Custom configuration
config = ExtractorConfig(
    encoding="utf-8",
    max_file_size=50 * 1024 * 1024,  # 50MB limit
    logging_level="DEBUG"
)
extractor = SyncTextExtractor(config)
text = extractor.extract("document.pdf")
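How the library enforces `max_file_size` is not described here. A caller can apply the same limit up front and skip oversized files before any extraction is attempted. The sketch below assumes nothing about TextXtract internals. `within_size_limit` and `demo_size_check` are hypothetical helpers built on `os.path.getsize`.

```python
import os
import tempfile

MAX_FILE_SIZE = 50 * 1024 * 1024  # same 50 MB figure as the config example


def within_size_limit(path: str, limit: int = MAX_FILE_SIZE) -> bool:
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


def demo_size_check() -> tuple:
    # Create a 10-byte file and test it against a generous and a tiny limit.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"0123456789")
        name = f.name
    try:
        return within_size_limit(name), within_size_limit(name, limit=5)
    finally:
        os.remove(name)
```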
Batch Processing
import asyncio
from pathlib import Path
from textxtract import AsyncTextExtractor
async def process_files(file_paths):
    async with AsyncTextExtractor() as extractor:
        # Process files concurrently
        tasks = [extractor.extract(path) for path in file_paths]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results
# Process multiple files
files = [Path("doc1.pdf"), Path("doc2.docx"), Path("doc3.txt")]
results = asyncio.run(process_files(files))
for file, result in zip(files, results):
    if isinstance(result, Exception):
        print(f"❌ {file}: {result}")
    else:
        print(f"✅ {file}: {len(result)} characters extracted")
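The batch example above starts every extraction at once, which may be more than you want for a very large file list. One possible variation caps the number of extractions in flight with `asyncio.Semaphore`. In this sketch `gather_bounded` is a hypothetical helper, and `fake_extract` is a stand-in coroutine so the code runs without any documents on disk.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def gather_bounded(
    extract: Callable[[str], Awaitable[str]],
    paths: Iterable[str],
    limit: int = 4,
) -> List[object]:
    """Call `extract` on every path with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(path: str) -> str:
        async with semaphore:
            return await extract(path)

    # return_exceptions=True returns failures as values instead of raising
    return await asyncio.gather(
        *(run_one(p) for p in paths), return_exceptions=True
    )


async def fake_extract(path: str) -> str:
    """Stand-in for an extractor coroutine; fails on names ending in '.bad'."""
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError(f"cannot read {path}")
    return f"text of {path}"


results = asyncio.run(
    gather_bounded(fake_extract, ["a.txt", "b.bad", "c.txt"], limit=2)
)
```

With a real extractor you would pass `extractor.extract` in place of `fake_extract`. Results come back in input order, as with plain `asyncio.gather`.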
Logging Configuration
import logging
from textxtract import SyncTextExtractor
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
extractor = SyncTextExtractor()
text = extractor.extract("document.pdf") # Will show detailed logs
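Setting the root logger to DEBUG also turns on debug output from every other library in the process. A narrower option is to raise verbosity for one named logger only. The sketch below uses only the standard `logging` module. The logger name `"textxtract"` is an assumption based on the common `logging.getLogger(__name__)` convention, and `enable_debug_for` is a hypothetical helper.

```python
import logging


def enable_debug_for(logger_name: str) -> logging.Logger:
    """Raise one named logger to DEBUG without changing the root level."""
    # basicConfig is a no-op if logging has already been configured.
    logging.basicConfig(level=logging.WARNING)
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    return logger


# "textxtract" is an assumed logger name; check the package to confirm it.
package_logger = enable_debug_for("textxtract")
```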
🧪 Testing
# Install test dependencies
pip install textxtract[all] pytest pytest-asyncio
# Run tests
pytest
# Run with coverage
pytest --cov=textxtract
📚 Documentation
- 📖 Complete Documentation
- 🚀 Installation Guide
- 📘 Usage Examples
- 🔍 API Reference
- 🧪 Testing Guide
- 🤝 Contributing Guide
🎯 Use Cases
Document Processing
from textxtract import SyncTextExtractor
def process_document(file_path):
    extractor = SyncTextExtractor()
    text = extractor.extract(file_path)
    # Process extracted text
    word_count = len(text.split())
    return {
        "file": file_path,
        "text": text,
        "word_count": word_count
    }
Content Analysis
import asyncio
from textxtract import AsyncTextExtractor
async def analyze_content(files):
    async with AsyncTextExtractor() as extractor:
        results = []
        for file in files:
            try:
                text = await extractor.extract(file)
                # Perform analysis
                analysis = {
                    "file": file,
                    "length": len(text),
                    "words": len(text.split()),
                    "contains_email": "@" in text
                }
                results.append(analysis)
            except Exception as e:
                results.append({"file": file, "error": str(e)})
        return results
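The `"@" in text` check above flags any at-sign, including ones that are not part of an address. A slightly stricter variation uses a regular expression. This sketch works on already-extracted text, so it needs no extractor. `EMAIL_RE` and `summarize` are hypothetical names, and the pattern is deliberately simple rather than a full address validator.

```python
import re

# Simple pattern: local-part@domain.tld. Not a complete RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def summarize(text: str) -> dict:
    """Return basic statistics for a block of extracted text."""
    return {
        "length": len(text),
        "words": len(text.split()),
        "contains_email": EMAIL_RE.search(text) is not None,
    }
```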
Data Pipeline Integration
from datetime import datetime
from textxtract import SyncTextExtractor
def extract_and_store(file_path, database):
    extractor = SyncTextExtractor()
    try:
        text = extractor.extract(file_path)
        # Store in database
        database.store({
            "file_path": str(file_path),
            "content": text,
            "extracted_at": datetime.now(),
            "status": "success"
        })
    except Exception as e:
        database.store({
            "file_path": str(file_path),
            "error": str(e),
            "extracted_at": datetime.now(),
            "status": "failed"
        })
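The `database` argument above only needs a `store` method that accepts a dictionary. For local testing, an in-memory stand-in is enough. The class below is a hypothetical illustration and is not something the package provides. `InMemoryStore` and `by_status` are names invented for this sketch.

```python
from datetime import datetime
from typing import Any, Dict, List


class InMemoryStore:
    """Hypothetical stand-in for `database`: keeps records in a list."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def store(self, record: Dict[str, Any]) -> None:
        """Append one record, mirroring the `store` call used above."""
        self.records.append(record)

    def by_status(self, status: str) -> List[Dict[str, Any]]:
        """Return all records whose 'status' field equals `status`."""
        return [r for r in self.records if r.get("status") == status]


db = InMemoryStore()
db.store({"file_path": "a.pdf", "content": "hello",
          "extracted_at": datetime.now(), "status": "success"})
db.store({"file_path": "b.pdf", "error": "corrupted",
          "extracted_at": datetime.now(), "status": "failed"})
```

Filtering with `db.by_status("failed")` then gives a quick list of files to retry.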
🔧 Requirements
- Python 3.9+
- Optional dependencies for specific file types
- See Installation Guide for details
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Quick Contribution Setup
# Fork and clone the repo
git clone https://github.com/10XScale-in/textxtract.git
cd textxtract
# Set up development environment
pip install -e .[all]
pip install pytest pytest-asyncio black isort mypy
# Run tests
pytest
# Format code
black textxtract tests
isort textxtract tests
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🌟 Support
- 🐛 Bug Reports: GitHub Issues
- 💡 Feature Requests: GitHub Discussions
- 📧 Questions: GitHub Discussions
🙏 Acknowledgments
- Thanks to all contributors who have helped improve this project
- Built with Python and the amazing open-source ecosystem
- Special thanks to the maintainers of underlying libraries