
OSD Text Extractor

A Python library for extracting plain text from various document formats for LLM and NLP purposes.

Features

  • Multi-format support: Extract text from PDF, DOCX, XLSX, HTML, XML, JSON, Markdown, RTF, CSV, EPUB, FB2, ODS, ODT, and TXT files
  • Clean output: Automatically removes non-Latin characters, normalizes whitespace, and filters out formatting artifacts
  • LLM-ready: Produces clean, plain text optimized for language model processing
  • Robust error handling: Comprehensive exception handling with detailed error messages
  • Memory efficient: Handles large files with appropriate size limits and safeguards
  • Type safe: Full type hints and mypy compliance

Installation

pip install osd-text-extractor

Quick Start

from osd_text_extractor import extract_text

# Extract text from a file
with open("document.pdf", "rb") as f:
    content = f.read()

text = extract_text(content, "pdf")
print(text)

Supported Formats

Format    Extension     Description
PDF       .pdf          Portable Document Format
DOCX      .docx         Microsoft Word documents
XLSX      .xlsx         Microsoft Excel spreadsheets
HTML      .html, .htm   Web pages
XML       .xml          XML documents
JSON      .json         JSON data files
Markdown  .md           Markdown documents
RTF       .rtf          Rich Text Format
CSV       .csv          Comma-separated values
TXT       .txt          Plain text files
EPUB      .epub         Electronic books
FB2       .fb2          FictionBook format
ODS       .ods          OpenDocument Spreadsheet
ODT       .odt          OpenDocument Text
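Since extract_text() takes a format name rather than a file path (see Quick Start), callers must derive the name themselves. A small lookup table keeps that mapping in one place; this is an illustrative sketch, and the .htm → "html" alias is an assumption, not confirmed library behavior:

```python
import os

# Hypothetical mapping from file extension to the format name
# expected by extract_text(). Names mirror the table above.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf", ".docx": "docx", ".xlsx": "xlsx",
    ".html": "html", ".htm": "html", ".xml": "xml",
    ".json": "json", ".md": "md", ".rtf": "rtf",
    ".csv": "csv", ".txt": "txt", ".epub": "epub",
    ".fb2": "fb2", ".ods": "ods", ".odt": "odt",
}

def format_for(path: str) -> str:
    # Normalize the extension and look up the format name
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTENSION_TO_FORMAT:
        raise ValueError(f"unsupported extension: {ext}")
    return EXTENSION_TO_FORMAT[ext]
```

Centralizing the mapping avoids repeating the extension-stripping logic in every caller.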

Usage Examples

Basic Text Extraction

from osd_text_extractor import extract_text

# PDF extraction
with open("report.pdf", "rb") as f:
    pdf_text = extract_text(f.read(), "pdf")

# HTML extraction
html_content = b"<html><body><h1>Title</h1><p>Content</p></body></html>"
html_text = extract_text(html_content, "html")

# JSON extraction
json_content = b'{"title": "Document", "content": "Text content"}'
json_text = extract_text(json_content, "json")

Working with Different File Types

import os
from osd_text_extractor import extract_text

def extract_from_file(file_path):
    # Derive the format name from the file extension
    _, ext = os.path.splitext(file_path)
    format_name = ext[1:].lower()  # drop the dot, lowercase
    if format_name == "htm":  # .htm files use the HTML extractor
        format_name = "html"

    # Read file content
    with open(file_path, "rb") as f:
        content = f.read()

    # Extract text
    try:
        return extract_text(content, format_name)
    except Exception as e:
        print(f"Failed to extract text from {file_path}: {e}")
        return None

# Usage
text = extract_from_file("document.docx")
if text:
    print(f"Extracted {len(text)} characters")

Batch Processing

from pathlib import Path
from osd_text_extractor import extract_text

def process_directory(directory_path, output_file):
    supported_extensions = {
        '.pdf', '.docx', '.xlsx', '.html', '.htm', '.xml',
        '.json', '.md', '.rtf', '.csv', '.txt',
        '.epub', '.fb2', '.ods', '.odt',
    }

    results = []

    for file_path in Path(directory_path).rglob('*'):
        if file_path.suffix.lower() in supported_extensions:
            try:
                with open(file_path, 'rb') as f:
                    content = f.read()

                format_name = file_path.suffix[1:].lower()
                if format_name == 'htm':  # .htm files use the HTML extractor
                    format_name = 'html'
                text = extract_text(content, format_name)

                results.append({
                    'file': str(file_path),
                    'text': text,
                    'length': len(text)
                })
                print(f"✓ Processed {file_path}")

            except Exception as e:
                print(f"✗ Failed {file_path}: {e}")

    # Save results
    with open(output_file, 'w', encoding='utf-8') as f:
        for result in results:
            f.write(f"=== {result['file']} ===\n")
            f.write(f"{result['text']}\n\n")

    print(f"Processed {len(results)} files, saved to {output_file}")

# Usage
process_directory("./documents", "extracted_texts.txt")

Text Cleaning

The library automatically cleans extracted text:

  • Character filtering: Removes non-Latin characters (Cyrillic, Chinese, Arabic, and other scripts)
  • Whitespace normalization: Collapses runs of spaces, tabs, and line breaks
  • Artifact removal: Strips HTML tags, Markdown syntax, and formatting codes
  • Emoji removal: Filters out emoji characters

Example of text cleaning:

# Input text with mixed content
raw_text = "English text Русский 中文 with symbols @#$% and emojis 🌍"

# After extraction and cleaning
cleaned_text = "English text with symbols and emojis"
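The library's cleaning pipeline is internal, but its first two rules can be approximated with the standard library. This sketch is not the library's actual implementation; note it keeps ASCII punctuation such as @#$%, whereas the library applies additional artifact filtering:

```python
import re

def approximate_clean(text: str) -> str:
    # Replace runs of non-ASCII characters (Cyrillic, CJK, emoji, ...)
    # with a space, then collapse all whitespace runs.
    ascii_only = re.sub(r"[^\x00-\x7F]+", " ", text)
    return re.sub(r"\s+", " ", ascii_only).strip()

raw = "English text Русский 中文 with symbols @#$% and emojis 🌍"
print(approximate_clean(raw))
```

This is useful for pre- or post-processing text from sources the library does not cover.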

Error Handling

The library provides specific exceptions for different error scenarios:

from osd_text_extractor import extract_text
from osd_text_extractor.application.exceptions import UnsupportedFormatError
from osd_text_extractor.domain.exceptions import TextLengthError
from osd_text_extractor.infrastructure.exceptions import ExtractionError

try:
    text = extract_text(content, format_name)
except UnsupportedFormatError:
    print("File format not supported")
except TextLengthError:
    print("No valid text content found")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Security Features

The library includes several security protections:

  • Size limits: Prevents processing of excessively large files
  • XML bomb protection: Guards against malicious XML with excessive nesting or entity expansion
  • Memory safeguards: Limits memory usage during processing
  • Input validation: Validates file formats and content structure
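The library enforces its own limits internally, but a caller-side pre-flight check can reject oversized files before any parsing work begins. The 50 MB limit below is illustrative, not a value taken from the library:

```python
MAX_BYTES = 50 * 1024 * 1024  # illustrative 50 MB cap, not the library's limit

def check_size(content: bytes, limit: int = MAX_BYTES) -> None:
    # Raise before extraction if the payload exceeds the cap
    if len(content) > limit:
        raise ValueError(
            f"file is {len(content)} bytes, exceeds limit of {limit}"
        )
```

Calling this before extract_text() gives you a clear, early failure instead of relying solely on the library's internal safeguards.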

Performance Considerations

  • Memory usage: Files are processed entirely in memory, so account for available RAM when handling large files
  • Processing speed: Varies with format complexity (roughly fastest to slowest: TXT, HTML, PDF, DOCX)
  • Concurrent processing: The library is thread-safe for concurrent use
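Because the library is thread-safe, batch workloads can fan out across a thread pool. In this sketch, extract_one is a placeholder standing in for a real call to extract_text(content, format_name), so the example stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_one(item):
    # Placeholder extractor: in real use, call
    # extract_text(content, format_name) here instead.
    name, content = item
    return name, content.decode("ascii", errors="ignore").strip()

def extract_many(items, max_workers=4):
    # Fan extraction out across worker threads and collect results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(extract_one, items))

docs = [("a.txt", b" hello "), ("b.txt", b"world")]
results = extract_many(docs)
```

Thread pools suit this workload because extraction is largely CPU- and I/O-bound per file and the calls share no mutable state.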

Dependencies

Core dependencies:

  • beautifulsoup4 - HTML/XML parsing
  • lxml - XML processing
  • pymupdf - PDF processing
  • python-docx - DOCX processing
  • openpyxl - XLSX processing
  • striprtf - RTF processing
  • odfpy - ODS/ODT processing
  • emoji - Emoji handling
  • dishka - Dependency injection

Development

Setting up development environment

# Clone repository
git clone https://github.com/OneSlap/osd-text-extractor.git
cd osd-text-extractor

# Install UV (package manager)
pip install uv

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Run type checking
uv run mypy src/

Running tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/osd_text_extractor --cov-report=html

# Run specific test file
uv run pytest tests/unit/test_domain/test_domain_entities.py

# Run integration tests only
uv run pytest tests/integration/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (uv run pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Changelog

v0.1.0

  • Initial release
  • Support for 14 document formats
  • Clean architecture with dependency injection
  • Comprehensive test suite
  • Type safety with mypy
  • Security protections for XML processing

Roadmap

  • Add support for PowerPoint (PPTX) files
  • Implement streaming processing for very large files
  • Add OCR support for image-based PDFs
  • Improve text structure preservation
  • Add configuration options for text cleaning
  • Performance optimizations for batch processing
