Document utilities for SRX: extract text from PDF, DOCX, PPTX, XLSX

Project description

srx-lib-docs

Small helpers to extract plain text from common office document formats used by SRX services.

What it includes:

  • extract_text(path_or_bytes, mime_type=None): extracts plain text from PDF, DOCX, PPTX, and XLSX files, given either a file path or raw bytes
  • DocumentMarkdownConverter: downloads a PDF, DOCX, PPTX, or XLSX document and converts it to Markdown
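
When the input is raw bytes there is no file extension to inspect, so it can help to work out the MIME type up front and pass it explicitly. The sketch below is not part of srx-lib-docs; `MIME_BY_SUFFIX` and `mime_for_path` are hypothetical names, and the mapping simply lists the standard MIME types for the four supported formats.

```python
from pathlib import Path
from typing import Optional, Union

# Standard MIME types for the four formats this README lists.
# This table is an illustration, not the library's internal mapping.
MIME_BY_SUFFIX = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def mime_for_path(path: Union[str, Path]) -> Optional[str]:
    """Return the MIME type for a supported extension, or None otherwise."""
    return MIME_BY_SUFFIX.get(Path(path).suffix.lower())
```

The returned string could then be passed as the `mime_type` argument of `extract_text`, assuming the library accepts these standard values.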

Install

PyPI (public):

  • pip install srx-lib-docs

uv (pyproject):

[project]
dependencies = ["srx-lib-docs>=0.1.0"]

Usage

from srx_lib_docs import extract_text
text = extract_text("/path/to/file.pdf")
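
To run extraction over a whole folder, one option is a small wrapper like the sketch below. It is an illustration only: `extract_directory` and `SUPPORTED_SUFFIXES` are hypothetical names, and the `extract` parameter stands in for `extract_text` (it is injected so the helper does not depend on the package being installed).

```python
from pathlib import Path
from typing import Callable, Dict, Union

# Extensions matching the formats this README lists as supported.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}


def extract_directory(
    root: Union[str, Path],
    extract: Callable[[str], str],
) -> Dict[str, str]:
    """Apply `extract` to every supported file under `root`, recursively.

    Returns a dict mapping each file path (as a string) to its extracted text.
    """
    texts: Dict[str, str] = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            texts[str(path)] = extract(str(path))
    return texts
```

With the package installed, a call would look like `extract_directory("/path/to/docs", extract_text)`.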

Markdown conversion with download:

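import asyncio
from srx_lib_docs.markdown import DocumentMarkdownConverter

async def main(url):
    conv = DocumentMarkdownConverter()
    result = await conv.process_document(url, mimetype="application/pdf")
    print(result["markdown_content"])  # result also carries file_type, file_size, success

asyncio.run(main("https://example.com/file.pdf"))  # placeholder URL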
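
Since the result carries a `success` key alongside `markdown_content`, callers will probably want to check it before using the Markdown. The following is a minimal sketch under two assumptions: that `success` is a truthy/falsy flag, and that the result behaves like a dict. `ConversionError` and `markdown_or_raise` are hypothetical names, not part of the library.

```python
from typing import Any, Mapping


class ConversionError(RuntimeError):
    """Raised when a conversion result reports failure (hypothetical helper)."""


def markdown_or_raise(result: Mapping[str, Any]) -> str:
    """Return result["markdown_content"], or raise if `success` is falsy."""
    # Assumption: a missing or falsy "success" key means the conversion failed.
    if not result.get("success"):
        raise ConversionError(
            f"conversion failed for file_type={result.get('file_type')!r}"
        )
    return result["markdown_content"]
```

The shape of a failed result is not documented here, so the error message only uses `file_type`, one of the keys the usage example mentions.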

Notes

  • For XLSX, only the first 20 rows of each sheet are read, to keep extraction lightweight. The limit is set in the code; adjust it there if you need more rows.
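
The row cap can be pictured with a short sketch. This is not the library's implementation; `sheet_preview` is a hypothetical function that shows one way to stop after a fixed number of rows without loading the rest of the sheet, given any iterable of rows.

```python
from itertools import islice
from typing import Any, Iterable, Sequence


def sheet_preview(rows: Iterable[Sequence[Any]], limit: int = 20) -> str:
    """Render at most `limit` rows as tab-separated lines of text.

    `islice` stops pulling from the iterable after `limit` rows, so a lazy
    row source is never consumed past the cap. Empty cells (None) become "".
    """
    lines = []
    for row in islice(rows, limit):
        lines.append("\t".join("" if cell is None else str(cell) for cell in row))
    return "\n".join(lines)
```

Because the cap is applied lazily, a sheet with many thousands of rows costs about the same to preview as one with twenty.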

License

Proprietary © SRX

Download files

Source Distribution

srx_lib_docs-0.1.6.tar.gz (5.4 kB view details)

Uploaded Source

Built Distribution

srx_lib_docs-0.1.6-py3-none-any.whl (6.7 kB view details)

Uploaded Python 3

File details

Details for the file srx_lib_docs-0.1.6.tar.gz.

File metadata

  • Download URL: srx_lib_docs-0.1.6.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for srx_lib_docs-0.1.6.tar.gz
Algorithm Hash digest
SHA256 1c39977d225a2fd5c38fc6321d03c864b673024b209e9aad00b6ca3f2ac89cc0
MD5 e6cb364c3ac91afb972e3c69571a1ad4
BLAKE2b-256 eddfc47415bcc5e0aa2260b184ce6d571bd1548b02630decc5c97fee88e94307

File details

Details for the file srx_lib_docs-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: srx_lib_docs-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for srx_lib_docs-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 9e5efd519b4443f0750a144c3a09412812150b8a09b667dcd4ca68065672e35d
MD5 b32cba2eabcac3a8d7a5e65aeaf7f08e
BLAKE2b-256 68be52689799df34767581577c1f5947cc7f878d7e2f50be057a0a468e938216
