Python utilities to simplify document file management

polytext

Doc Utils

A Python package for document conversion and text extraction.

Features

  • Convert various document formats (DOCX, ODT, PPT, etc.) to PDF
  • Extract text from PDF, Markdown, image, and audio files
  • Support for both local files and S3/GCS cloud storage
  • Multiple PDF parsing backends (PyPDF, PyMuPDF)
  • Transcribe audio & video files (local or cloud) to text/markdown
  • Extract YouTube video transcripts
  • Extract text from URLs

Installation

# Library only – assumes system requirements are already present
pip install polytext

Heads-up: Polytext’s PDF generator relies on WeasyPrint under the hood.
The PyPI wheel contains only Python code, so WeasyPrint’s native libraries (Pango, Cairo, GDK-PixBuf, HarfBuzz, Fontconfig) must still be installed at the OS level.

System requirements

| Requirement | Notes | macOS (Homebrew) | Ubuntu / Debian |
|---|---|---|---|
| Python | Tested on 3.12. Older versions may fail to locate WeasyPrint’s dylibs. | brew install python@3.12 | sudo apt install python3.12 |
| WeasyPrint (native stack) | Installs Pango, Cairo, etc. | brew install weasyprint | sudo apt install weasyprint |
| LibreOffice | Used for Office → PDF conversion | brew install --cask libreoffice | sudo apt install libreoffice |
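
Since the wheel does not bundle these native dependencies, a quick pre-flight check can save a confusing import error later. The following is a minimal sketch using only the standard library. The library names passed to `find_library` and the executable names `soffice` / `libreoffice` are assumptions and may differ on your platform, so treat a "missing" result as a hint rather than a verdict.

```python
import ctypes.util
import shutil


def check_system_requirements():
    """Report which native prerequisites appear to be installed."""
    # Assumed library names; actual names can vary by platform and packaging.
    native_libs = ["pango-1.0", "cairo", "gdk_pixbuf-2.0", "harfbuzz", "fontconfig"]
    report = {
        name: ctypes.util.find_library(name) is not None for name in native_libs
    }
    # LibreOffice is usually exposed as "soffice" or "libreoffice" on PATH.
    report["libreoffice"] = any(
        shutil.which(exe) for exe in ("soffice", "libreoffice")
    )
    return report


for name, found in check_system_requirements().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

On macOS, libraries installed by Homebrew may not be visible to `find_library` even when WeasyPrint can load them, so a false "missing" is possible there.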

Usage

Converting Documents to PDF

from polytext import convert_to_pdf, ConversionError

try:
    # Convert a document to PDF
    pdf_path = convert_to_pdf('input.docx', 'output.pdf')
    print(f"PDF saved to: {pdf_path}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
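
When converting several files, it can help to keep going after one failure and report the failures at the end. The sketch below shows one way to do that. `convert_many` is a hypothetical helper, not part of polytext, and it takes the converter as an argument so the example runs without the library installed. It assumes the converter is called as `converter(input_path, output_path)` and returns the output path, as `convert_to_pdf` does in the example above.

```python
from pathlib import Path


def convert_many(paths, out_dir, converter):
    """Convert each input to PDF, collecting successes and failures separately."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = {}, {}
    for src in map(Path, paths):
        # Name the output after the input: report.docx -> report.pdf
        target = out_dir / (src.stem + ".pdf")
        try:
            converted[str(src)] = converter(str(src), str(target))
        except Exception as exc:  # narrow this to ConversionError in real use
            failed[str(src)] = str(exc)
    return converted, failed


# Intended use with polytext (not executed here):
#   from polytext import convert_to_pdf
#   converted, failed = convert_many(["a.docx", "b.odt"], "pdfs", convert_to_pdf)
```

Inputs that share a stem (for example `a.docx` and `a.odt`) would write to the same output file, so rename them or extend the helper if that can happen.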

The following features require a Google Gemini API key:

  • audio
  • video
  • image
  • YouTube

from polytext.loader.base import BaseLoader

llm_api_key = "your_google_gemini_api_key"  # Set your Google Gemini API key here

# Instantiate the loader 
loader = BaseLoader(llm_api_key=llm_api_key)
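
Hard-coding the key in source makes it easy to commit by accident. As a hedged alternative, the sketch below reads it from an environment variable. The variable name `GEMINI_API_KEY` and the helper `get_gemini_api_key` are assumptions made for illustration and are not defined by polytext.

```python
import os


def get_gemini_api_key(env=os.environ, var_name="GEMINI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before using "
            "audio, video, image, or YouTube extraction"
        )
    return key


# Intended use with polytext (not executed here):
#   from polytext.loader.base import BaseLoader
#   loader = BaseLoader(llm_api_key=get_gemini_api_key())
```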

Text or Markdown Extraction

from polytext.loader.base import BaseLoader

markdown_output = False  # Set to True to extract text as Markdown
source = "local"  # Change to "cloud" to extract from cloud storage (S3 or GCS)

# Instantiate the loader (optionally set markdown_output, llm_api_key, etc.)
loader = BaseLoader(markdown_output=markdown_output, source=source)

# Extract text from a local file
result = loader.get_text(input_list=["/path/to/document.docx"])
print(result["text"])
# Extract text from a cloud file (create the loader with source="cloud")
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.docx"])
print(result["text"])

# Extract text from a markdown file (local)
result = loader.get_text(input_list=["/path/to/document.md"])
print(result["text"])
# Extract text from a Markdown file in cloud storage
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.md"])
print(result["text"])

# Extract text from an audio file (local; needs a loader created with llm_api_key)
result = loader.get_text(input_list=["/path/to/audio.mp3"])
print(result["text"])
# Extract text from an audio file in cloud storage
result = loader.get_text(input_list=["s3://your-bucket/path/to/audio.mp3"])
print(result["text"])

# Extract text from a video file (local; needs a loader created with llm_api_key)
result = loader.get_text(input_list=["/path/to/video.mp4"])
print(result["text"])
# Extract text from a video file in cloud storage
result = loader.get_text(input_list=["s3://your-bucket/path/to/video.mp4"])
print(result["text"])

# Extract text from an image (local; needs a loader created with llm_api_key)
result = loader.get_text(input_list=["/path/to/image.jpg"])
print(result["text"])
# Extract text from an image in cloud storage
result = loader.get_text(input_list=["s3://your-bucket/path/to/image.jpg"])
print(result["text"])

# Extract the transcript of a YouTube video (needs a loader created with llm_api_key)
result = loader.get_text(input_list=["https://www.youtube.com/watch?v=xxxx"])
print(result["text"])

# Extract text from a URL
result = loader.get_text(input_list=["https://www.domain-name.com/path"])
print(result["text"])
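
The examples above mix local paths and `s3://` URIs, while the loader is configured with a single `source`. One way to keep the two apart is to group inputs by their URI scheme before creating loaders. This is a minimal sketch under stated assumptions: `infer_source` and `group_by_source` are hypothetical helpers, and treating `gs://` as the GCS scheme is an assumption not confirmed by the text above. Web and YouTube URLs fall into the "local" group here, mirroring the example above, where they are passed to a loader created with source="local".

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumed schemes for cloud storage; adjust to what your setup accepts.
CLOUD_SCHEMES = {"s3", "gs"}


def infer_source(path):
    """Return "cloud" for cloud-storage URIs and "local" for everything else."""
    scheme = urlparse(path).scheme.lower()
    return "cloud" if scheme in CLOUD_SCHEMES else "local"


def group_by_source(paths):
    """Group inputs so each group can go to a loader with the matching source."""
    groups = defaultdict(list)
    for path in paths:
        groups[infer_source(path)].append(path)
    return dict(groups)


# Intended use with polytext (not executed here):
#   for source, inputs in group_by_source(all_inputs).items():
#       loader = BaseLoader(source=source)
#       result = loader.get_text(input_list=inputs)
```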

License

MIT License


