
A Python library for extracting text from different types of files (PDF, DOCX, PPTX, XLSX, ODT, etc.).


PyxTxt


PyxTxt is a simple and powerful Python library to extract text from various file formats.
It supports PDF, DOCX, XLSX, PPTX, ODT, HTML, XML, TXT, legacy XLS files, and more.


✨ Features

  • Extracts text from both file paths and in-memory buffers (io.BytesIO).
  • Supports multiple formats: PDF, DOCX, PPTX, XLSX, ODT, HTML, XML, TXT, legacy Office files (.xls, .doc, .ppt).
  • Automatically detects MIME type using python-magic.
  • Compatible with modern and legacy formats.
  • Can handle streamed content without saving to disk (with some limitations).

📦 Installation

pip install pyxtxt

⚠️ Note: You must have libmagic installed on your system (required by python-magic).

On Ubuntu/Debian:

sudo apt install libmagic1

On Mac (Homebrew):

brew install libmagic

On Windows:

Use python-magic-bin instead of python-magic for easier installation.
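To confirm that the libmagic setup above worked, a quick check along these lines may help. It is a minimal sketch, not part of pyxtxt: the helper name `libmagic_available` is hypothetical, and it only tests that the `magic` module from python-magic (or python-magic-bin) imports and exposes `from_buffer`.

```python
def libmagic_available() -> bool:
    """Return True if python-magic and the underlying libmagic library load."""
    try:
        # Provided by python-magic (or python-magic-bin on Windows).
        # The import itself fails when the libmagic shared library is missing.
        import magic
    except (ImportError, OSError):
        return False
    # python-magic exposes from_buffer; other packages named "magic" may not.
    return hasattr(magic, "from_buffer")


if __name__ == "__main__":
    print("libmagic available:", libmagic_available())
```

If this prints `False`, revisit the platform-specific install step before using pyxtxt.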

🛠️ Dependencies

  • PyMuPDF (fitz)

  • beautifulsoup4

  • python-docx

  • python-pptx

  • odfpy

  • openpyxl

  • lxml

  • xlrd (<2.0.0)

  • python-magic

These dependencies are declared in pyproject.toml and installed automatically by pip.
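If an extraction fails unexpectedly, it can be useful to see which of the packages listed above are actually present in the environment. The following is a small sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names are taken from the list above (on Windows, python-magic-bin may be installed in place of python-magic).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed in the Dependencies section.
DEPENDENCIES = [
    "PyMuPDF",
    "beautifulsoup4",
    "python-docx",
    "python-pptx",
    "odfpy",
    "openpyxl",
    "lxml",
    "xlrd",
    "python-magic",
]


def installed_versions(names=DEPENDENCIES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```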

📚 Usage Example

Extract text from a file path:

from pyxtxt import xtxt

text = xtxt("document.pdf")
print(text)

Extract text from a file-like buffer:

import io

from pyxtxt import xtxt

# Read the file into memory and wrap the bytes in a file-like buffer
with open("document.docx", "rb") as f:
    buffer = io.BytesIO(f.read())

text = xtxt(buffer)
print(text)
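The same pattern applies when the bytes never touch the disk, for example data received from an upload or a network response. The sketch below is illustrative only: `extract_from_bytes` and `fake_extract` are hypothetical names, and the extractor is passed in as a parameter so the example runs without pyxtxt installed. In real use you would pass `xtxt` as the extractor.

```python
import io
from typing import BinaryIO, Callable


def extract_from_bytes(data: bytes, extract: Callable[[BinaryIO], str]) -> str:
    """Wrap raw bytes in a BytesIO buffer and hand it to an extractor."""
    buffer = io.BytesIO(data)
    return extract(buffer)


def fake_extract(buffer: BinaryIO) -> str:
    """Stand-in extractor for this sketch; decodes the buffer as UTF-8 text."""
    return buffer.read().decode("utf-8")


if __name__ == "__main__":
    print(extract_from_bytes(b"hello from memory", fake_extract))
```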

⚠️ Known Limitations

When passing a raw stream (io.BytesIO) without a filename, legacy files (.doc, .xls, .ppt) may not be correctly detected.

This is a limitation of libmagic, not of pyxtxt.

If available, passing the original filename along with the buffer is highly recommended.
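Some background on why the filename helps: legacy .doc, .xls, and .ppt files all share the same OLE2 compound-file container, so their first bytes are identical and content sniffing alone has little to go on. The sketch below illustrates the idea of using the filename extension as a tie-breaker. It is not pyxtxt's implementation, and `refine_legacy_mime` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

# First 8 bytes shared by all OLE2 compound files (.doc, .xls, .ppt).
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

# Extension-based tie-breaker for the three legacy Office formats.
LEGACY_MIME_BY_EXT = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
}


def refine_legacy_mime(head: bytes, filename: Optional[str] = None) -> Optional[str]:
    """Return a specific legacy Office MIME type, or None if it cannot be decided.

    `head` is the start of the stream (at least 8 bytes).
    """
    if not head.startswith(OLE2_SIGNATURE):
        return None  # not an OLE2 container at all
    if filename is None:
        return None  # OLE2, but nothing to tell the three formats apart
    return LEGACY_MIME_BY_EXT.get(Path(filename).suffix.lower())


if __name__ == "__main__":
    head = OLE2_SIGNATURE + b"\x00" * 8
    print(refine_legacy_mime(head, "report.doc"))
    print(refine_legacy_mime(head))
```

Without a filename the helper returns `None`, which mirrors the ambiguity described above.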

🔒 License

Distributed under the MIT License.

The software is provided "as is" without any warranty of any kind.

Pull requests, issues, and feedback are warmly welcome! 🚀
