A Python library for extracting text from different types of files (PDF, DOCX, PPTX, XLSX, ODT, etc.).

PyxTxt

PyxTxt is a simple and powerful Python library to extract text from various file formats.
It supports PDF, DOCX, XLSX, PPTX, ODT, HTML, XML, TXT, legacy XLS files, and more.


✨ Features

  • Extracts text from both file paths and in-memory buffers (io.BytesIO).
  • Supports multiple formats: PDF, DOCX, PPTX, XLSX, ODT, HTML, XML, TXT, legacy Office files (.xls, .ppt).
  • Automatically detects MIME type using python-magic.
  • Compatible with modern and legacy formats.
  • Can handle streamed content without saving to disk (with some limitations).

📦 Installation

The library is modular, so you can install all modules:

pip install pyxtxt[all]

or just the modules you need:

pip install pyxtxt[pdf,odf,docx,presentation,spreadsheet,html]

Because the required libraries are shared, installing the html module also enables SVG and XML support. The architecture is designed to grow, so new modules can add support for other formats.

⚠️ Note: You must have libmagic installed on your system (required by python-magic).

The pyproject.toml file should select the correct version for your system. If you run into problems, you can install libmagic manually.

On Ubuntu/Debian:

sudo apt install libmagic1

On Mac (Homebrew):

brew install libmagic

On Windows:

Use python-magic-bin instead of python-magic for easier installation.
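To check that libmagic is visible to python-magic after installation, a quick sanity check along these lines may help. This is only a sketch: `detect_mime` is a hypothetical helper name, not part of pyxtxt, and it calls python-magic's `magic.from_buffer` directly rather than going through pyxtxt.

```python
def detect_mime(data: bytes):
    """Return the MIME type reported by libmagic, or None if python-magic is unusable."""
    try:
        import magic  # provided by python-magic (or python-magic-bin on Windows)
    except ImportError as exc:
        # python-magic raises ImportError when it is absent or cannot find libmagic
        print(f"python-magic is not usable: {exc}")
        return None
    return magic.from_buffer(data, mime=True)


# The first bytes of a PDF are enough for libmagic to recognise the format.
print(detect_mime(b"%PDF-1.4\n"))
```

If this prints a MIME type string, the system library was found. If it prints the "not usable" message, revisit the installation steps above.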

🛠️ Dependencies

  • PyMuPDF (fitz)

  • beautifulsoup4

  • python-docx

  • python-pptx

  • odfpy

  • openpyxl

  • lxml

  • xlrd (<2.0.0)

  • python-magic

Dependencies are automatically installed from pyproject.toml.
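If you installed only some of the optional modules, a small check like the one below can show which of the listed dependencies are importable in the current environment. It is a sketch using only the standard library: `IMPORT_NAMES` and `missing_dependencies` are hypothetical names introduced here, and the mapping simply pairs each package above with its usual import name.

```python
from importlib.util import find_spec
from typing import List

# PyPI package name -> module name used in "import ..." statements
IMPORT_NAMES = {
    "PyMuPDF": "fitz",
    "beautifulsoup4": "bs4",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "odfpy": "odf",
    "openpyxl": "openpyxl",
    "lxml": "lxml",
    "xlrd": "xlrd",
    "python-magic": "magic",
}


def missing_dependencies() -> List[str]:
    """Return the packages whose module cannot be located in this environment."""
    return [pkg for pkg, module in IMPORT_NAMES.items() if find_spec(module) is None]


print("Missing:", missing_dependencies() or "none")
```

A package reported as missing only matters if you need the corresponding format.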

📚 Usage Example

Extract text from a file path:

from pyxtxt import xtxt

text = xtxt("document.pdf")
print(text)

Extract text from a file-like buffer:

import io

from pyxtxt import xtxt

# Read the file into an in-memory buffer instead of passing a path
with open("document.docx", "rb") as f:
    buffer = io.BytesIO(f.read())

text = xtxt(buffer)
print(text)

Show available formats:

from pyxtxt import extxt_available_formats

# Default call: returns the supported formats
formats = extxt_available_formats()
print(formats)
# Pass True for a pretty-printed version
formats = extxt_available_formats(True)
print(formats)
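For more than one file, the path-based call can be wrapped in a small loop. The sketch below is an illustration, not part of pyxtxt: `extract_folder` is a hypothetical helper, the extractor is passed in as a parameter so that `xtxt` can be plugged in, and how failures are reported (an exception, or an empty or None result) is an assumption, so both cases are handled.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def extract_folder(folder: str, extractor: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Run `extractor` on every regular file in `folder` and keep non-empty results."""
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # ignore sub-directories
        try:
            text = extractor(str(path))
        except Exception as exc:  # assumption: failures may surface as exceptions
            print(f"skipped {path.name}: {exc}")
            continue
        if text:  # assumption: an empty or None result means nothing was extracted
            results[path.name] = text
    return results


# Usage with pyxtxt (requires the package to be installed):
# from pyxtxt import xtxt
# texts = extract_folder("documents", xtxt)
```

Passing the extractor as an argument keeps the loop independent of pyxtxt, which also makes it easy to try out with a stub function.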

⚠️ Known Limitations

When passing a raw stream (io.BytesIO) without a filename, legacy files (.doc, .xls, .ppt) may not be correctly detected.

This is a limitation of libmagic, not of pyxtxt: .doc, .xls and .ppt files all start with exactly the same signature byte sequence (b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'), so the content alone is not enough to tell them apart.

If available, using the original filename is highly recommended.
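The ambiguity can be illustrated with the standard library alone. In the sketch below, `OLE2_SIGNATURE` and `looks_like_legacy_office` are hypothetical names, and the two buffers are fabricated stand-ins rather than real documents. The point is only that a check on the leading bytes gives the same answer for any of the legacy formats.

```python
import io

# Signature shared by legacy .doc, .xls and .ppt files
OLE2_SIGNATURE = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"


def looks_like_legacy_office(buffer: io.BytesIO) -> bool:
    """Return True if the buffer starts with the shared legacy Office signature."""
    position = buffer.tell()
    buffer.seek(0)
    header = buffer.read(len(OLE2_SIGNATURE))
    buffer.seek(position)  # leave the stream where the caller had it
    return header == OLE2_SIGNATURE


# Fabricated payloads: only the first eight bytes matter for this check.
fake_doc = io.BytesIO(OLE2_SIGNATURE + b"word-specific payload")
fake_xls = io.BytesIO(OLE2_SIGNATURE + b"excel-specific payload")

print(looks_like_legacy_office(fake_doc))
print(looks_like_legacy_office(fake_xls))
print(looks_like_legacy_office(io.BytesIO(b"%PDF-1.7")))
```

Both fabricated buffers pass the check even though they stand for different formats, which is why a filename with its extension is the more reliable hint.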

To extract text from documents in Microsoft Word's legacy .doc format, antiword must be installed. On Ubuntu/Debian:

sudo apt-get update
sudo apt-get -y install antiword

🔒 License

Distributed under the MIT License.

The software is provided "as is" without any warranty of any kind.

Pull requests, issues, and feedback are warmly welcome! 🚀
