OSD Text Extractor
A Python library for extracting plain text from various document formats for LLM and NLP purposes.
Features
- Multi-format support: Extract text from PDF, DOCX, XLSX, HTML, XML, JSON, Markdown, RTF, CSV, EPUB, FB2, ODS, ODT, and TXT files
- Clean output: Automatically removes non-Latin characters, normalizes whitespace, and filters out formatting artifacts
- LLM-ready: Produces clean, plain text optimized for language model processing
- Robust error handling: Comprehensive exception handling with detailed error messages
- Memory efficient: Handles large files with appropriate size limits and safeguards
- Type safe: Full type hints and mypy compliance
Installation
pip install osd-text-extractor
Quick Start
from osd_text_extractor import extract_text
# Extract text from a file
with open("document.pdf", "rb") as f:
    content = f.read()
text = extract_text(content, "pdf")
print(text)
Supported Formats
| Format | Extension | Description |
|---|---|---|
| PDF | .pdf | Portable Document Format |
| DOCX | .docx | Microsoft Word documents |
| XLSX | .xlsx | Microsoft Excel spreadsheets |
| HTML | .html, .htm | Web pages |
| XML | .xml | XML documents |
| JSON | .json | JSON data files |
| Markdown | .md | Markdown documents |
| RTF | .rtf | Rich Text Format |
| CSV | .csv | Comma-separated values |
| TXT | .txt | Plain text files |
| EPUB | .epub | Electronic books |
| FB2 | .fb2 | FictionBook format |
| ODS | .ods | OpenDocument Spreadsheet |
| ODT | .odt | OpenDocument Text |
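The table lists two extensions for HTML, so deriving the format name by stripping the dot from the extension may not always yield a name the library recognizes. The following is a minimal sketch of an explicit lookup; `EXTENSION_TO_FORMAT` and `format_for` are hypothetical helpers, not part of the library, and the format names are assumed to match the lowercase extensions used elsewhere in this README.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the format name passed to
# extract_text(). The names mirror the table above; whether the library
# accepts an alias such as "htm" directly is not stated in this README.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf", ".docx": "docx", ".xlsx": "xlsx",
    ".html": "html", ".htm": "html", ".xml": "xml",
    ".json": "json", ".md": "md", ".rtf": "rtf",
    ".csv": "csv", ".txt": "txt", ".epub": "epub",
    ".fb2": "fb2", ".ods": "ods", ".odt": "odt",
}


def format_for(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())
```

A caller can skip files for which `format_for` returns `None` instead of relying on an exception from the extractor.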
Usage Examples
Basic Text Extraction
from osd_text_extractor import extract_text
# PDF extraction
with open("report.pdf", "rb") as f:
    pdf_text = extract_text(f.read(), "pdf")
# HTML extraction
html_content = b"<html><body><h1>Title</h1><p>Content</p></body></html>"
html_text = extract_text(html_content, "html")
# JSON extraction
json_content = b'{"title": "Document", "content": "Text content"}'
json_text = extract_text(json_content, "json")
Working with Different File Types
import os
from osd_text_extractor import extract_text
def extract_from_file(file_path):
    # Get file extension
    _, ext = os.path.splitext(file_path)
    format_name = ext[1:].lower()  # Remove dot and lowercase
    # Read file content
    with open(file_path, "rb") as f:
        content = f.read()
    # Extract text
    try:
        text = extract_text(content, format_name)
        return text
    except Exception as e:
        print(f"Failed to extract text from {file_path}: {e}")
        return None

# Usage
text = extract_from_file("document.docx")
if text:
    print(f"Extracted {len(text)} characters")
Batch Processing
import os
from pathlib import Path
from osd_text_extractor import extract_text
def process_directory(directory_path, output_file):
    supported_extensions = {'.pdf', '.docx', '.xlsx', '.html', '.xml',
                            '.json', '.md', '.rtf', '.csv', '.txt',
                            '.epub', '.fb2', '.ods', '.odt'}
    results = []
    for file_path in Path(directory_path).rglob('*'):
        if file_path.suffix.lower() in supported_extensions:
            try:
                with open(file_path, 'rb') as f:
                    content = f.read()
                format_name = file_path.suffix[1:].lower()
                text = extract_text(content, format_name)
                results.append({
                    'file': str(file_path),
                    'text': text,
                    'length': len(text)
                })
                print(f"✓ Processed {file_path}")
            except Exception as e:
                print(f"✗ Failed {file_path}: {e}")
    # Save results
    with open(output_file, 'w', encoding='utf-8') as f:
        for result in results:
            f.write(f"=== {result['file']} ===\n")
            f.write(f"{result['text']}\n\n")
    print(f"Processed {len(results)} files, saved to {output_file}")

# Usage
process_directory("./documents", "extracted_texts.txt")
Text Cleaning
The library automatically cleans extracted text:
- Character filtering: Removes non-Latin characters (Cyrillic, Chinese, Arabic, emojis, etc.)
- Whitespace normalization: Collapses multiple spaces, tabs, and line breaks
- Artifact removal: Strips HTML tags, markdown syntax, and formatting codes
- Emoji removal: Filters out emoji characters
Example of text cleaning:
# Input text with mixed content
raw_text = "English text Русский 中文 with symbols @#$% and emojis 🌍"
# After extraction and cleaning
cleaned_text = "English text with symbols and emojis"
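To make the cleaning rules above more concrete, here is a rough approximation written with the standard library only. It is not the library's implementation: the function name `approximate_clean` and the exact set of characters kept (ASCII letters, digits, whitespace, and basic punctuation) are assumptions chosen to reproduce the example above.

```python
import re

# Assumption: everything except ASCII letters, digits, whitespace, and a
# small set of punctuation marks is treated as noise. The real library may
# keep or drop a different set of characters.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'()-]")


def approximate_clean(text: str) -> str:
    """Replace disallowed characters with spaces, then collapse whitespace."""
    without_noise = _DISALLOWED.sub(" ", text)
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with a single space normalizes spaces, tabs, and newlines.
    return " ".join(without_noise.split())
```

Replacing with a space rather than deleting outright keeps neighbouring words from being glued together when a run of removed characters sits between them.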
Error Handling
The library provides specific exceptions for different error scenarios:
from osd_text_extractor import extract_text
from osd_text_extractor.application.exceptions import UnsupportedFormatError
from osd_text_extractor.domain.exceptions import TextLengthError
from osd_text_extractor.infrastructure.exceptions import ExtractionError
try:
    text = extract_text(content, format_name)
except UnsupportedFormatError:
    print("File format not supported")
except TextLengthError:
    print("No valid text content found")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Security Features
The library includes several security protections:
- Size limits: Prevents processing of excessively large files
- XML bomb protection: Guards against malicious XML with excessive nesting or entity expansion
- Memory safeguards: Limits memory usage during processing
- Input validation: Validates file formats and content structure
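Because `extract_text` takes the whole file as bytes, a caller may also want to cap how much is read before the library sees it. The sketch below shows one way to do that on the caller's side; `MAX_INPUT_BYTES`, `InputTooLargeError`, and `read_limited` are hypothetical names, and the 50 MB value is an arbitrary illustration rather than the library's actual limit.

```python
from typing import BinaryIO

# Hypothetical caller-side limit; the library's own limits are not
# documented in this README.
MAX_INPUT_BYTES = 50 * 1024 * 1024


class InputTooLargeError(ValueError):
    """Raised when a stream holds more bytes than the caller allows."""


def read_limited(stream: BinaryIO, max_bytes: int = MAX_INPUT_BYTES) -> bytes:
    """Read at most max_bytes from a binary stream, or raise if it holds more."""
    # Request one extra byte: if it arrives, the input exceeds the limit,
    # and no more than max_bytes + 1 bytes were ever held in memory.
    data = stream.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise InputTooLargeError(f"input exceeds {max_bytes} bytes")
    return data
```

Used with a file opened in binary mode, the result of `read_limited(f)` can be passed to `extract_text` in place of `f.read()`.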
Performance Considerations
- Memory usage: Files are processed entirely in memory, so consider available RAM when handling large files
- Processing speed: Varies by format complexity (from fastest to slowest: TXT, HTML, PDF, DOCX)
- Concurrent processing: Library is thread-safe for concurrent usage
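Given the thread-safety noted above, batch work can be spread across a thread pool. The following is a minimal sketch under that assumption; `extract_many` is a hypothetical helper, and the extractor is passed in as a callable with the same `(content, format_name)` shape as `extract_text` so the helper does not depend on library internals.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_many(paths, extractor, max_workers=4):
    """Run extractor(content, format_name) for each path on a thread pool.

    Returns a dict mapping each path to its extracted text, or to the
    exception raised while reading or extracting that file.
    """
    paths = list(paths)  # allow any iterable; it is traversed twice below

    def work(path):
        p = Path(path)
        try:
            # Each worker holds one whole file in memory at a time.
            return extractor(p.read_bytes(), p.suffix[1:].lower())
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs paths with results.
        return dict(zip(paths, pool.map(work, paths)))
```

A call such as `extract_many(paths, extract_text)` would then process several files at once. Since every worker loads a full file, `max_workers` also bounds peak memory use.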
Dependencies
Core dependencies:
- beautifulsoup4 - HTML/XML parsing
- lxml - XML processing
- pymupdf - PDF processing
- python-docx - DOCX processing
- openpyxl - XLSX processing
- striprtf - RTF processing
- odfpy - ODS/ODT processing
- emoji - Emoji handling
- dishka - Dependency injection
Development
Setting up development environment
# Clone repository
git clone https://github.com/OneSlap/osd-text-extractor.git
cd osd-text-extractor
# Install UV (package manager)
pip install uv
# Install dependencies
uv sync --dev
# Run tests
uv run pytest
# Run linting
uv run ruff check src/ tests/
uv run ruff format src/ tests/
# Run type checking
uv run mypy src/
Running tests
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src/osd_text_extractor --cov-report=html
# Run specific test file
uv run pytest tests/unit/test_domain/test_domain_entities.py
# Run integration tests only
uv run pytest tests/integration/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Run the test suite (uv run pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Changelog
v0.1.0
- Initial release
- Support for 14 document formats
- Clean architecture with dependency injection
- Comprehensive test suite
- Type safety with mypy
- Security protections for XML processing
Support
- Issues: GitHub Issues
- Documentation: GitHub README
- Source Code: GitHub Repository
Roadmap
- Add support for PowerPoint (PPTX) files
- Implement streaming processing for very large files
- Add OCR support for image-based PDFs
- Improve text structure preservation
- Add configuration options for text cleaning
- Performance optimizations for batch processing