Document utilities for SRX: extract text from PDF, DOCX, PPTX, XLSX, and audio files (MP3, M4A, WAV)

Project description

srx-lib-docs

Small helpers to extract plain text from common office document formats used by SRX services.

What it includes:

  • extract_text(path_or_bytes, mime_type=None) supports PDF, DOCX, PPTX, XLSX
  • DocumentMarkdownConverter to download and convert PDF/DOCX/PPTX/XLSX to Markdown

Install

PyPI (public):

  • pip install srx-lib-docs

uv (pyproject):

[project]
dependencies = ["srx-lib-docs>=0.1.0"]

Usage

from srx_lib_docs import extract_text
text = extract_text("/path/to/file.pdf")
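
Since extract_text also accepts bytes and an optional mime_type, in-memory content can be passed with an explicit type. The sketch below is illustrative only: mime_for and extract_from_bytes are hypothetical helpers, not part of the package, and it assumes mime_type takes a standard MIME string as in the process_document example.

```python
from pathlib import Path

# Standard MIME types for the four formats listed under "What it includes".
MIME_BY_SUFFIX = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def mime_for(filename: str) -> str:
    """Hypothetical helper: map a file name to its MIME type, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return MIME_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or filename!r}") from None


def extract_from_bytes(data: bytes, filename: str) -> str:
    """Hypothetical wrapper: extract text from in-memory content."""
    # Imported lazily so the mapping helper works without the package installed.
    from srx_lib_docs import extract_text

    # Assumption: mime_type expects a standard MIME string.
    return extract_text(data, mime_type=mime_for(filename))
```

Passing the type explicitly is useful when the bytes come from an upload or a queue and there is no file extension to inspect.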

Markdown conversion with download:

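import asyncio
from srx_lib_docs.markdown import DocumentMarkdownConverter

async def main(url):
    conv = DocumentMarkdownConverter()
    # process_document is a coroutine, so it must be awaited inside async code
    result = await conv.process_document(url, mimetype="application/pdf")
    print(result["markdown_content"])  # plus file_type, file_size, success

asyncio.run(main("https://example.com/file.pdf"))  # placeholder URL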
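
The result is a dict, so it is worth checking success before reading markdown_content. A minimal sketch, assuming success is truthy when the conversion worked; markdown_or_raise is a hypothetical helper and the sample dict is made-up data that stands in for a real result.

```python
def markdown_or_raise(result: dict) -> str:
    """Return the Markdown from a result dict, or raise if the conversion failed."""
    # Assumption: "success" is truthy on a successful conversion.
    if not result.get("success"):
        raise RuntimeError(
            f"conversion failed for file_type={result.get('file_type')!r}"
        )
    return result["markdown_content"]


# Stand-in result using the four keys mentioned above; the values are invented.
sample = {
    "success": True,
    "markdown_content": "# Title\n\nBody text.",
    "file_type": "pdf",
    "file_size": 1234,
}
print(markdown_or_raise(sample))
```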

Notes

  • For XLSX, only the first 20 rows of each sheet are read, to keep extraction lightweight; change the limit in the source if you need more.
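
The row cap described above can be sketched with itertools.islice. This is not the library's actual code: head_rows, sheet_to_text, and MAX_ROWS are hypothetical names, and the generator stands in for whatever row iterator a spreadsheet reader would provide.

```python
from itertools import islice
from typing import Iterable, Iterator, Sequence

MAX_ROWS = 20  # the per-sheet cap stated in the note above


def head_rows(rows: Iterable[Sequence], limit: int = MAX_ROWS) -> Iterator[Sequence]:
    """Yield at most `limit` rows without consuming the rest of the sheet."""
    return islice(rows, limit)


def sheet_to_text(rows: Iterable[Sequence], limit: int = MAX_ROWS) -> str:
    """Join the first `limit` rows into tab-separated lines; None becomes ''."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in head_rows(rows, limit)
    )


# Stand-in for a 100-row sheet; only the first 20 rows are ever generated.
fake_sheet = ((i, f"item-{i}", None) for i in range(100))
text = sheet_to_text(fake_sheet)
print(len(text.splitlines()))  # 20
```

Because islice stops pulling from the iterator once the limit is reached, large sheets cost no more than small ones.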

License

Proprietary © SRX

Download files

Source Distribution

srx_lib_docs-0.1.7.tar.gz (5.6 kB view details)

Uploaded Source

Built Distribution

srx_lib_docs-0.1.7-py3-none-any.whl (6.9 kB view details)

Uploaded Python 3

File details

Details for the file srx_lib_docs-0.1.7.tar.gz.

File metadata

  • Download URL: srx_lib_docs-0.1.7.tar.gz
  • Size: 5.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for srx_lib_docs-0.1.7.tar.gz
Algorithm Hash digest
SHA256 f5b33a6af600fb98f932eb1e5cb0b2a38f096378705db26b0bc8bf547d2cdb4d
MD5 ee2d730dc780c018bc2d1a75abc96d0d
BLAKE2b-256 d0c5a74b29e3c046cd1f44df2794ec3f42d263ac4d01df9e8ce07d740b393844

File details

Details for the file srx_lib_docs-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: srx_lib_docs-0.1.7-py3-none-any.whl
  • Size: 6.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for srx_lib_docs-0.1.7-py3-none-any.whl
Algorithm Hash digest
SHA256 832583a9e263de48b4c47a13d2979d4bc4280705d9526fefe2b9dc59f5ceb8fe
MD5 14bb95354d8327064bcc96be5d71433f
BLAKE2b-256 6c394c3b818e5fd854d023c86485abf0b628b1c0f4f9550b7826f1dc2cbadea1
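
A downloaded file can be checked against the SHA256 digests listed above with the standard library. A minimal sketch; sha256_of and matches are hypothetical helper names, and the file is assumed to be in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches(Path("srx_lib_docs-0.1.7-py3-none-any.whl"), "832583a9e263de48b4c47a13d2979d4bc4280705d9526fefe2b9dc59f5ceb8fe") should return True for an intact download of the wheel.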
