A Python library for extracting text from different types of files (PDF, DOCX, PPTX, XLSX, ODT, etc.).

PyxTxt

PyxTxt is a simple and powerful Python library to extract text from various file formats.
It supports PDF, DOCX, XLSX, PPTX, ODT, HTML, XML, TXT, legacy Office files, audio transcription, OCR from images, and more.

NEW in v0.2.3+: Added audio transcription (Whisper) and OCR from images (EasyOCR)!


✨ Features

  • Multiple input types: File paths, io.BytesIO buffers, raw bytes objects, and requests.Response objects
  • Wide format support: PDF, DOCX, PPTX, XLSX, ODT, HTML, XML, TXT, Markdown, EPUB, RTF, EML, MSG, LaTeX, legacy Office files (.xls, .ppt, .doc)
  • Audio transcription: MP3, WAV, M4A, FLAC and more using OpenAI Whisper
  • OCR from images: JPEG, PNG, TIFF, BMP using EasyOCR with multilingual support
  • Automatic MIME detection: Uses python-magic for intelligent file type recognition
  • Web-ready: Direct support for downloading and extracting text from URLs
  • Memory efficient: Process files without saving to disk
  • Modern Python: Full type hints and clean API design

📦 Installation

The library is modular, so you can install every module at once:

pip install pyxtxt[all]

or just the modules you need:

pip install pyxtxt[pdf,docx,presentation,spreadsheet,html,markdown,epub,email]

Audio & OCR (Heavy Dependencies)

# Audio transcription (~2GB download for Whisper models)
pip install pyxtxt[audio]

# OCR from images (~1GB download for EasyOCR models)
pip install pyxtxt[ocr]

# Both audio and OCR
pip install pyxtxt[audio,ocr]

Because they rely on the same libraries, installing the html module also enables SVG and XML support. The architecture is designed to grow with new modules for additional formats.

⚠️ Note: You must have libmagic installed on your system (required by python-magic).

The pyproject.toml file should select the correct version for your system. If you run into problems, you can install it manually.

On Ubuntu/Debian:

sudo apt install libmagic1

On Mac (Homebrew):

brew install libmagic

On Windows:

Use python-magic-bin instead of python-magic for easier installation.
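
To confirm that libmagic is reachable from Python after installation, you can ask python-magic to identify a few bytes. The following is a minimal sketch and not part of PyxTxt: the helper name `check_libmagic` is hypothetical, and it relies only on `magic.from_buffer` from python-magic. It reports a problem instead of raising, so it can run on a machine where nothing is installed yet.

```python
def check_libmagic():
    """Return (ok, detail) describing whether python-magic can use libmagic."""
    try:
        import magic  # provided by python-magic or python-magic-bin
    except (ImportError, OSError) as exc:
        # Raised when the package is absent or the native library cannot be loaded.
        return False, f"python-magic unavailable: {exc}"
    try:
        # Ask libmagic for the MIME type of a tiny PDF-like header.
        mime = magic.from_buffer(b"%PDF-1.4\n", mime=True)
    except Exception as exc:  # the binding loaded but the call itself failed
        return False, f"libmagic call failed: {exc}"
    return True, f"libmagic OK, sample detected as {mime}"


ok, detail = check_libmagic()
print(detail)
```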

🛠️ Dependencies

Core Dependencies

  • python-magic (automatic file type detection)

Optional Dependencies by Format

  • PDF: PyMuPDF
  • Office: python-docx, python-pptx, openpyxl, xlrd
  • Web/HTML: beautifulsoup4, lxml
  • OpenDocument: odfpy
  • Markdown: markdown
  • EPUB: ebooklib
  • RTF: striprtf
  • Email: extract-msg (for MSG files)
  • LaTeX: pylatexenc
  • Audio: openai-whisper (heavy ~2GB models)
  • OCR: easyocr, pillow (heavy ~1GB models)

Dependencies are automatically installed based on selected optional groups.
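
If an extraction fails for one format only, it can help to check which optional libraries are actually importable. The sketch below uses only the standard library. The helper name `missing_modules` is hypothetical, and the mapping from format groups to import names is an assumption based on the list above (for example, PyMuPDF is imported as `fitz` and beautifulsoup4 as `bs4`); it does not query PyxTxt itself.

```python
from importlib.util import find_spec

# Import names for the distributions listed above (assumed mapping).
OPTIONAL_MODULES = {
    "PDF": ["fitz"],
    "Office": ["docx", "pptx", "openpyxl", "xlrd"],
    "Web/HTML": ["bs4", "lxml"],
    "OpenDocument": ["odf"],
    "Markdown": ["markdown"],
    "EPUB": ["ebooklib"],
    "RTF": ["striprtf"],
    "Email": ["extract_msg"],
    "LaTeX": ["pylatexenc"],
    "Audio": ["whisper"],
    "OCR": ["easyocr", "PIL"],
}


def missing_modules(groups=OPTIONAL_MODULES):
    """Map each group to the import names that cannot be found."""
    report = {}
    for group, modules in groups.items():
        # find_spec returns None when a top-level module is not installed.
        missing = [name for name in modules if find_spec(name) is None]
        if missing:
            report[group] = missing
    return report


for group, names in missing_modules().items():
    print(f"{group}: missing {', '.join(names)}")
```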

📚 Usage Examples

Basic Usage

from pyxtxt import xtxt

# Extract from file path
text = xtxt("document.pdf")
print(text)

# Extract from BytesIO buffer
import io
with open("document.docx", "rb") as f:
    buffer = io.BytesIO(f.read())
text = xtxt(buffer)
print(text)
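
The same call can be applied to a whole folder. The sketch below is one possible pattern, not part of PyxTxt: `extract_folder` is a hypothetical helper that takes the extraction function as a parameter. It assumes that a failed extraction either raises an exception or returns an empty result, which this document does not specify.

```python
from pathlib import Path


def extract_folder(folder, extractor, pattern="*"):
    """Run `extractor` on every file in `folder`; return (texts, failures)."""
    texts, failures = {}, {}
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue
        try:
            text = extractor(str(path))
        except Exception as exc:  # keep going if one file cannot be read
            failures[path.name] = str(exc)
            continue
        if text:
            texts[path.name] = text
        else:
            # Assumption: an empty or None result means nothing was extracted.
            failures[path.name] = "no text extracted"
    return texts, failures
```

With PyxTxt installed, a call such as `extract_folder("docs", xtxt, "*.pdf")` would collect the text of every PDF in `docs` and list the files that produced nothing.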

NEW: Web Content Support

import requests
from pyxtxt import xtxt, xtxt_from_url

# Method 1: Direct from bytes
response = requests.get("https://example.com/document.pdf")
text = xtxt(response.content)

# Method 2: Direct from Response object  
text = xtxt(response)

# Method 3: URL helper function
text = xtxt_from_url("https://example.com/document.pdf")
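
When downloading yourself, it is worth checking that the request succeeded before extracting; otherwise an HTML error page may be processed as if it were the document. A minimal sketch, assuming a `requests.Response`-like object; `extract_checked` is a hypothetical helper and takes the extraction function as a parameter.

```python
def extract_checked(response, extractor):
    """Extract text from an HTTP response only if the download succeeded."""
    response.raise_for_status()  # requests raises HTTPError on 4xx/5xx
    if not response.content:
        raise ValueError("response body is empty")
    return extractor(response.content)
```

For example, `extract_checked(requests.get(url, timeout=30), xtxt)` adds a timeout and a status check to Method 1 above.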

Audio Transcription (NEW)

from pyxtxt import xtxt

# Transcribe audio files
text = xtxt("meeting_recording.mp3")
text = xtxt("interview.wav")
text = xtxt("podcast.m4a")

# From web audio
import requests
audio_response = requests.get("https://example.com/audio.mp3")
text = xtxt(audio_response.content)

OCR from Images (NEW)

from pyxtxt import xtxt

# Extract text from images
text = xtxt("scanned_document.png")
text = xtxt("screenshot.jpg")
text = xtxt("invoice.tiff")

# From web images
import requests
image_response = requests.get("https://example.com/document.png")
text = xtxt(image_response.content)

Show Available Formats

from pyxtxt import extxt_available_formats

# List supported MIME types
formats = extxt_available_formats()
print(formats)

# Pretty format names
formats = extxt_available_formats(pretty=True)
print(formats)

🌐 Common Web Use Cases

import requests
from pyxtxt import xtxt

# API responses
api_response = requests.post("https://api.example.com/generate-pdf")
text = xtxt(api_response.content)

# File uploads (Flask/Django); `request` is the framework's request object (Flask API shown)
uploaded_bytes = request.files['document'].read()
text = xtxt(uploaded_bytes)

# Audio/video transcription services
audio_response = requests.get("https://api.example.com/recording.mp3")
transcript = xtxt(audio_response.content)

# OCR for uploaded images
image_bytes = request.files['receipt'].read()
text = xtxt(image_bytes)

# Email attachments; `email_msg` is an attachment part from the standard email package
attachment_bytes = email_msg.get_payload(decode=True)
text = xtxt(attachment_bytes)
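
For the email case, the attachment bytes can be collected with the standard library alone. The sketch below is illustrative: `attachment_payloads` is a hypothetical helper, and the demo message is built in memory only to exercise it. Each returned payload is a `bytes` object of the kind passed to `xtxt` above.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def attachment_payloads(raw_message):
    """Return (filename, bytes) for each attachment in a raw email message."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    results = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True)  # undo base64/quoted-printable
        if payload:
            results.append((part.get_filename(), payload))
    return results


# Build a small message in memory to exercise the helper.
demo = EmailMessage()
demo["Subject"] = "Report"
demo.set_content("See attachment.")
demo.add_attachment(b"%PDF-1.4 demo", maintype="application",
                    subtype="pdf", filename="report.pdf")

for name, data in attachment_payloads(demo.as_bytes()):
    print(name, len(data))  # each `data` could then be passed to xtxt(data)
```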

⚠️ Known Limitations

  • Legacy file detection: When using raw streams without filenames, legacy files (.doc, .xls, .ppt) may not be detected correctly, because they share the same file signature and libmagic cannot always tell them apart
  • Filename hints recommended: When available, providing original filenames improves detection accuracy
  • MSWrite .doc files: Require antiword installation:
    sudo apt-get update && sudo apt-get install antiword
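
The first two limitations can be illustrated with a short sketch. Legacy .doc, .xls, and .ppt files are all OLE2 compound files and begin with the same eight bytes, so the signature alone cannot separate them, while a filename suffix can. This is not PyxTxt's detection code; `guess_legacy_mime` and the suffix table are hypothetical names used only for illustration.

```python
from pathlib import Path

# Signature shared by every OLE2 compound file (.doc, .xls, .ppt).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

LEGACY_MIME_BY_SUFFIX = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
}


def guess_legacy_mime(data, filename=None):
    """Return a legacy Office MIME type, or None if it cannot be determined."""
    if not data.startswith(OLE2_MAGIC):
        return None  # not an OLE2 container at all
    if filename is None:
        return None  # the signature alone does not say which format this is
    return LEGACY_MIME_BY_SUFFIX.get(Path(filename).suffix.lower())
```

Given the same bytes, the function returns `None` without a filename and a specific MIME type once the original name (for example `Budget.XLS`) is supplied.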
    

📖 Full Examples

See examples.py for comprehensive usage examples including:

  • Local file processing
  • Memory buffer handling
  • Web content extraction
  • Error handling patterns
  • All supported formats demonstration

Accessing Examples After Installation

After installing PyxTxt from PyPI, you can access the examples file:

import pkg_resources

# Get path to examples file
examples_path = pkg_resources.resource_filename('pyxtxt', 'examples.py')
print(f"Examples file location: {examples_path}")

# Or read the content directly
examples_content = pkg_resources.resource_string('pyxtxt', 'examples.py').decode('utf-8')
print(examples_content)
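
Because setuptools has deprecated pkg_resources, the same lookup can also be written with the standard library's importlib.resources (Python 3.9 or later). This is an alternative sketch, not the project's documented method; `read_package_file` is a hypothetical helper, and it assumes examples.py is shipped inside the installed package as described above.

```python
from importlib.resources import files


def read_package_file(package, filename):
    """Return the text of a file shipped inside an installed package."""
    return files(package).joinpath(filename).read_text(encoding="utf-8")
```

With PyxTxt installed, `print(read_package_file("pyxtxt", "examples.py"))` would print the examples file.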

🔒 License

Distributed under the MIT License. See LICENSE file for details.

The software is provided "as is" without any warranty of any kind.

🤝 Contributing

Pull requests, issues, and feedback are warmly welcome! 🚀

  • Bug reports: Please include file samples and error details
  • Feature requests: Describe your use case and expected behavior
  • Code contributions: Follow existing patterns and add tests

📊 Changelog

v0.2.3+

  • NEW: Audio transcription support (MP3, WAV, M4A, FLAC, etc.)
  • NEW: OCR from images (JPEG, PNG, TIFF, BMP, WebP)
  • NEW: Markdown, EPUB, RTF, EML, MSG, LaTeX support
  • ✅ Separate optional dependencies for heavy features (audio/OCR)
  • ✅ Performance optimizations with model caching
  • ✅ Improved multilingual OCR support (Italian/English)

v0.1.24+

  • ✅ Added support for bytes objects
  • ✅ Added support for requests.Response objects
  • ✅ Added xtxt_from_url() helper function
  • ✅ Improved type hints and error handling
  • ✅ Enhanced web content processing capabilities
