
A Python library for extracting text from various file types (PDF, DOCX, PPTX, XLSX, ODT, etc.).


PyxTxt


PyxTxt is a simple and powerful Python library to extract text from various file formats.
It supports PDF, DOCX, XLSX, PPTX, ODT, HTML, XML, TXT, legacy XLS files, and more.


✨ Features

  • Extracts text from both file paths and in-memory buffers (io.BytesIO).
  • Supports multiple formats: PDF, DOCX, PPTX, XLSX, ODT, HTML, XML, TXT, legacy Office files (.xls, .doc, .ppt).
  • Automatically detects MIME type using python-magic.
  • Compatible with modern and legacy formats.
  • Can handle streamed content without saving to disk (with some limitations).

📦 Installation

pip install pyxtxt

⚠️ Note: You must have libmagic installed on your system (required by python-magic).

On Ubuntu/Debian:

sudo apt install libmagic1

On Mac (Homebrew):

brew install libmagic

On Windows:

Install python-magic-bin instead of python-magic; it bundles the libmagic binaries, so no separate libmagic installation is needed.

🛠️ Dependencies

  • PyMuPDF (fitz)

  • beautifulsoup4

  • python-docx

  • python-pptx

  • odfpy

  • openpyxl

  • lxml

  • xlrd (<2.0.0)

  • python-magic

Dependencies are automatically installed from pyproject.toml.
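To confirm that the dependencies listed above are actually present in a given environment, a quick check with the standard library's importlib.metadata can help. This is a minimal sketch and not part of PyxTxt: the helper name installed_versions is hypothetical, and the distribution names are copied from the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed in the Dependencies section.
DEPENDENCIES = [
    "PyMuPDF",
    "beautifulsoup4",
    "python-docx",
    "python-pptx",
    "odfpy",
    "openpyxl",
    "lxml",
    "xlrd",
    "python-magic",
]


def installed_versions(names=DEPENDENCIES):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

On Windows, python-magic may be reported as missing if python-magic-bin was installed in its place.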

📚 Usage Example

Extract text from a file path:

from pyxtxt import xtxt

text = xtxt("document.pdf")
print(text)

Extract text from a file-like buffer:

import io

from pyxtxt import xtxt

# Read the file into an in-memory buffer.
with open("document.docx", "rb") as f:
    buffer = io.BytesIO(f.read())

text = xtxt(buffer)
print(text)
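When the bytes come from somewhere other than the local disk (an upload, a network response, a database blob), the same pattern applies: wrap them in io.BytesIO and pass the buffer to xtxt. The sketch below assumes only the xtxt(buffer) call shown above. The helper name text_from_bytes and its extractor parameter are hypothetical, added so the function can be exercised with a stand-in when PyxTxt is not installed.

```python
import io


def text_from_bytes(data, extractor=None):
    """Wrap raw bytes in a BytesIO buffer and hand it to a text extractor.

    `extractor` is any callable that accepts a file-like object. When it is
    omitted, pyxtxt.xtxt is imported and used.
    """
    if extractor is None:
        # Imported lazily so this module loads even without pyxtxt installed.
        from pyxtxt import xtxt
        extractor = xtxt
    buffer = io.BytesIO(data)
    return extractor(buffer)


if __name__ == "__main__":
    # Stand-in extractor for demonstration: decodes the buffer as UTF-8.
    demo = text_from_bytes(b"hello", extractor=lambda buf: buf.read().decode("utf-8"))
    print(demo)
```

Keeping the extractor injectable also makes it easy to unit-test calling code without real documents.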

⚠️ Known Limitations

When passing a raw stream (io.BytesIO) without a filename, legacy files (.doc, .xls, .ppt) may not be correctly detected.

This is a limitation of libmagic, not of pyxtxt.

If available, passing the original filename along with the buffer is highly recommended.
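One likely reason content-based detection struggles here: legacy .doc, .xls, and .ppt files are all stored in the same OLE2 Compound File container and begin with the same eight-byte signature, so the leading bytes alone cannot tell them apart. The following sketch illustrates that ambiguity and how a known filename resolves it. It is not PyxTxt's detection logic; the function names and the extension table are assumptions made for illustration.

```python
import io
import os

# Signature shared by all OLE2 Compound File documents (.doc, .xls, .ppt).
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical lookup table used only by this example.
LEGACY_MIME_BY_EXTENSION = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
}


def is_ole2(buffer):
    """Return True if the buffer starts with the OLE2 signature."""
    position = buffer.tell()
    buffer.seek(0)
    header = buffer.read(len(OLE2_SIGNATURE))
    buffer.seek(position)  # restore the caller's read position
    return header == OLE2_SIGNATURE


def guess_legacy_mime(buffer, filename=None):
    """Guess the MIME type of a legacy Office buffer, using the filename if given."""
    if not is_ole2(buffer):
        return None
    if filename is None:
        # Container recognised, but the specific format is ambiguous.
        return None
    extension = os.path.splitext(filename)[1].lower()
    return LEGACY_MIME_BY_EXTENSION.get(extension)


if __name__ == "__main__":
    sample = io.BytesIO(OLE2_SIGNATURE + b"\x00" * 16)
    print(guess_legacy_mime(sample))
    print(guess_legacy_mime(sample, "report.xls"))
```

Without a filename the function can only say that the buffer is some OLE2 document; with one, the extension settles which format it is.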

🔒 License

Distributed under the MIT License.

The software is provided "as is" without any warranty of any kind.

Pull requests, issues, and feedback are warmly welcome! 🚀

