Document utilities for SRX: extract text from PDF, DOCX, PPTX, XLSX

srx-lib-docs

Small helpers to extract plain text from common office document formats used by SRX services.

What it includes:

  • extract_text(path_or_bytes, mime_type=None): extracts plain text from a PDF, DOCX, PPTX, or XLSX document given as a file path or as raw bytes
  • DocumentMarkdownConverter: downloads a PDF, DOCX, PPTX, or XLSX document and converts it to Markdown

Install

PyPI (public):

  • pip install srx-lib-docs

uv (in pyproject.toml):

[project]
dependencies = ["srx-lib-docs>=0.1.0"]

Usage

from srx_lib_docs import extract_text
text = extract_text("/path/to/file.pdf")
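When the input is raw bytes rather than a path, there is no file extension to inspect, so the mime_type argument presumably tells extract_text which format to parse. The sketch below maps file names to the standard MIME types of the four supported formats. The helper name mime_type_for is hypothetical and not part of srx-lib-docs, and whether the library expects exactly these strings is an assumption.

```python
from pathlib import Path

# Standard MIME types for the four formats the README lists.
MIME_BY_SUFFIX = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def mime_type_for(filename: str) -> str:
    """Return the MIME type for a supported file name, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return MIME_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported document type: {suffix or filename!r}") from None


# Hypothetical usage with raw bytes (requires srx-lib-docs to be installed):
#   from srx_lib_docs import extract_text
#   data = Path("report.docx").read_bytes()
#   text = extract_text(data, mime_type=mime_type_for("report.docx"))
```

Raising on an unknown suffix keeps unsupported files from reaching the extractor with a guessed type.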

Markdown conversion with download:

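import asyncio

from srx_lib_docs.markdown import DocumentMarkdownConverter

async def main():
    url = "https://example.com/file.pdf"  # placeholder; use your document URL
    conv = DocumentMarkdownConverter()
    # process_document must be awaited, so call it from inside a coroutine
    result = await conv.process_document(url, mimetype="application/pdf")
    print(result["markdown_content"])  # plus file_type, file_size, success

asyncio.run(main())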
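Since the result carries a success field alongside markdown_content, it seems sensible to check it before using the text. The following is a minimal sketch under assumptions: it treats success as truthy on success and falsy or missing otherwise, and the names ConversionError and markdown_from_result are hypothetical helpers, not part of srx-lib-docs.

```python
from typing import Any, Mapping


class ConversionError(RuntimeError):
    """Raised when a result mapping does not report success (hypothetical)."""


def markdown_from_result(result: Mapping[str, Any]) -> str:
    """Return the Markdown text from a result mapping, or raise ConversionError."""
    # Assumption: `success` is truthy on success and falsy or missing otherwise.
    if not result.get("success"):
        file_type = result.get("file_type")
        raise ConversionError(f"conversion failed for file type {file_type!r}")
    return result["markdown_content"]


# Made-up sample shaped like the keys the README names.
sample = {
    "success": True,
    "markdown_content": "# Title",
    "file_type": "pdf",
    "file_size": 1024,
}
markdown = markdown_from_result(sample)
```

In real use, the mapping returned by process_document would be passed in place of sample.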

Notes

  • For XLSX, only the first 20 rows of each sheet are read, to keep extraction lightweight; change the limit in the code if you need more.
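The row cap can be illustrated independently of any spreadsheet library. This sketch is not the library's implementation: it assumes the sheets are already available as a mapping from sheet name to an iterable of rows, and the function name sheets_to_text and the output layout (a "# name" line followed by tab-separated rows) are made up for the example.

```python
from itertools import islice
from typing import Iterable, Mapping, Sequence

MAX_ROWS_PER_SHEET = 20  # the cap mentioned in the note above


def sheets_to_text(
    sheets: Mapping[str, Iterable[Sequence[object]]],
    max_rows: int = MAX_ROWS_PER_SHEET,
) -> str:
    """Render at most `max_rows` rows of every sheet as tab-separated text."""
    lines = []
    for name, rows in sheets.items():
        lines.append(f"# {name}")
        # islice stops after max_rows items and leaves the rest of a lazy
        # iterator unread, so large sheets are never fully loaded.
        for row in islice(rows, max_rows):
            lines.append("\t".join("" if cell is None else str(cell) for cell in row))
    return "\n".join(lines)
```

Exposing the limit as a parameter, as done here, is one way to make it adjustable without editing the code.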

License

Proprietary © SRX
