TextXtract

A robust, extensible Python package for synchronous and asynchronous text extraction from PDF, DOCX, DOC, TXT, ZIP, Markdown, RTF, HTML, CSV, JSON, XML, and more.

Documentation

Full documentation is available at: https://10xscale-in.github.io/textxtract/

Features

  • Synchronous and asynchronous extraction APIs
  • Modular file type handlers (PDF, DOCX, DOC, TXT, ZIP, Markdown, RTF, HTML, CSV, JSON, XML, and more)
  • Abstract base classes for extensibility
  • Custom exception handling and logging
  • Configurable encoding, logging, and timeouts
  • Easy to add new file type handlers
  • Comprehensive unit tests with pytest
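
To illustrate what "modular handlers" built on "abstract base classes" can look like in practice, here is a minimal sketch. It is not TextXtract's actual code: `FileTypeHandler`, `TxtHandler`, `CsvHandler`, `HANDLERS`, `extract_text`, and `ExtractionError` are hypothetical names chosen for this example, and the real module layout may differ (see the project documentation).

```python
import csv
import io
from abc import ABC, abstractmethod
from pathlib import Path


class ExtractionError(Exception):
    """Hypothetical custom exception for unsupported or failed extractions."""


class FileTypeHandler(ABC):
    """Hypothetical base class: one subclass per supported file type."""

    @abstractmethod
    def extract(self, data: bytes, encoding: str = "utf-8") -> str:
        """Return the text content of `data`."""


class TxtHandler(FileTypeHandler):
    def extract(self, data: bytes, encoding: str = "utf-8") -> str:
        # Plain text only needs decoding.
        return data.decode(encoding)


class CsvHandler(FileTypeHandler):
    def extract(self, data: bytes, encoding: str = "utf-8") -> str:
        # Join the cells of each row with spaces, one row per line.
        rows = csv.reader(io.StringIO(data.decode(encoding)))
        return "\n".join(" ".join(row) for row in rows)


# Hypothetical registry: adding a file type means adding one entry here.
HANDLERS = {".txt": TxtHandler(), ".csv": CsvHandler()}


def extract_text(data: bytes, filename: str) -> str:
    """Pick a handler from the file extension and run it."""
    suffix = Path(filename).suffix.lower()
    handler = HANDLERS.get(suffix)
    if handler is None:
        raise ExtractionError(f"No handler registered for {suffix!r}")
    return handler.extract(data)


print(extract_text(b"name,city\nAda,London\n", "people.csv"))
```

In a design like this, supporting a new format only requires a new `FileTypeHandler` subclass and a registry entry; the dispatch code stays unchanged.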

Installation

pip install textxtract

Usage Example

import asyncio

from textxtract.sync.extractor import SyncTextExtractor
from textxtract.aio.extractor import AsyncTextExtractor

# Read the file into memory ("document.pdf" is an example path; use your own file)
filename = "document.pdf"
with open(filename, "rb") as f:
    file_bytes = f.read()

# Synchronous extraction
extractor = SyncTextExtractor()
text = extractor.extract(file_bytes, filename)

# Asynchronous extraction
async_extractor = AsyncTextExtractor()
text = asyncio.run(async_extractor.extract_async(file_bytes, filename))
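
The asynchronous API is most useful when several files are processed at once. The sketch below shows one way this might be done with `asyncio.gather` and a per-file timeout. To keep it runnable without any documents on disk, it uses `StubAsyncExtractor`, a hypothetical stand-in that only mimics the `extract_async(file_bytes, filename)` call shown above; `extract_many` and its `timeout` parameter are also assumptions made for this example, not part of the package.

```python
import asyncio


class StubAsyncExtractor:
    """Hypothetical stand-in with the same call shape as extract_async above."""

    async def extract_async(self, file_bytes: bytes, filename: str) -> str:
        await asyncio.sleep(0.01)  # simulate I/O-bound extraction work
        return file_bytes.decode("utf-8")


async def extract_many(extractor, files, timeout=5.0):
    """Extract several (file_bytes, filename) pairs concurrently.

    Each extraction gets its own timeout; asyncio.TimeoutError propagates
    if any single file takes longer than `timeout` seconds.
    """
    tasks = [
        asyncio.wait_for(extractor.extract_async(data, name), timeout=timeout)
        for data, name in files
    ]
    # gather returns results in the same order as the input list.
    return await asyncio.gather(*tasks)


files = [(b"first document", "a.txt"), (b"second document", "b.txt")]
texts = asyncio.run(extract_many(StubAsyncExtractor(), files))
print(texts)
```

With the real package, an `AsyncTextExtractor` instance would presumably take the place of the stub, provided its `extract_async` coroutine accepts the same arguments as in the usage example.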

API Reference

See ARCHITECTURE_PLAN.md for detailed architecture and module layout.

Running Tests

pytest

Contributing

  1. Fork the repository.
  2. Create a new branch.
  3. Add your feature or fix.
  4. Write tests.
  5. Submit a pull request.

License

MIT License
