Python utilities to simplify document file management

polytext

Doc Utils

A Python package for document conversion and text extraction.

Features

  • Convert various document formats (DOCX, ODT, PPT, etc.) to PDF
  • Extract text from PDF, Markdown, image, and audio files
  • Support for both local files and S3/GCS cloud storage
  • Multiple PDF parsing backends (PyPDF, PyMuPDF)
  • Transcribe audio & video files (local or cloud) to text/markdown
  • Extract YouTube video transcripts
  • Extract text from URLs

Installation

# Library only – assumes system requirements are already present
pip install polytext

Heads-up: Polytext’s PDF generator relies on WeasyPrint under the hood.
The PyPI wheel contains only Python code; you still need WeasyPrint’s native libraries (Pango, Cairo, GDK-PixBuf, HarfBuzz, Fontconfig) installed at the OS level.

System requirements

| Requirement | Notes | macOS (Homebrew) | Ubuntu / Debian |
|---|---|---|---|
| Python | ✔️ Tested on 3.12. Older versions may fail to locate WeasyPrint’s dylibs | brew install python@3.12 | sudo apt install python3.12 |
| WeasyPrint – native stack | installs Pango, Cairo, etc. | brew install weasyprint | sudo apt install weasyprint |
| LibreOffice | used for Office → PDF conversion | brew install --cask libreoffice | sudo apt install libreoffice |
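
The requirements above can be checked before the first conversion. The following is a minimal sketch using only the standard library; it is not part of polytext. The library names in NATIVE_LIBS and the binary names in OFFICE_BINARIES are assumptions derived from the list above, and `ctypes.util.find_library` does not search every location (for example, some Homebrew prefixes), so a warning is a hint rather than proof that something is missing.

```python
import ctypes.util
import shutil
import sys

# Assumed library names for the native stack mentioned above.
NATIVE_LIBS = ["pango-1.0", "cairo", "gdk_pixbuf-2.0", "harfbuzz", "fontconfig"]
# LibreOffice's command-line binary is usually `soffice`, sometimes `libreoffice`.
OFFICE_BINARIES = ["soffice", "libreoffice"]


def check_environment(find_library=ctypes.util.find_library, which=shutil.which):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if sys.version_info < (3, 12):
        version = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {version} is older than the tested 3.12")
    for lib in NATIVE_LIBS:
        if find_library(lib) is None:
            problems.append(f"native library not found: {lib}")
    if not any(which(name) for name in OFFICE_BINARIES):
        problems.append("LibreOffice binary not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The lookup functions are passed in as parameters so the check can be exercised without touching the real system.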

Usage

Converting Documents to PDF

from polytext import convert_to_pdf, ConversionError

try:
    # Convert a document to PDF
    pdf_path = convert_to_pdf('input.docx', 'output.pdf')
    print(f"PDF saved to: {pdf_path}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
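
When several documents need converting, the same call can be wrapped in a loop that records failures instead of stopping at the first one. This is a sketch only: `convert_many` is a hypothetical helper, not a polytext function, and it takes the converter and its error type as arguments so that it makes no assumptions about the library beyond the two-argument call shown above.

```python
from pathlib import Path


def convert_many(paths, out_dir, convert, error_type=Exception):
    """Convert each path to a PDF inside out_dir.

    Returns (converted, failed): two dicts keyed by input path, holding the
    target PDF path and the error message respectively.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = {}, {}
    for path in map(Path, paths):
        # Same file name as the input, with a .pdf extension.
        target = out_dir / (path.stem + ".pdf")
        try:
            convert(str(path), str(target))
            converted[str(path)] = str(target)
        except error_type as exc:
            failed[str(path)] = str(exc)
    return converted, failed
```

With polytext installed, a call would look like `convert_many(["a.docx", "b.odt"], "pdf_out", convert_to_pdf, ConversionError)`.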

The following features require a Google Gemini API key:

  • audio
  • video
  • image
  • youtube

from polytext.loader.base import BaseLoader

llm_api_key = "your_google_gemini_api_key"  # Set your Google Gemini API key here

# Instantiate the loader 
loader = BaseLoader(llm_api_key=llm_api_key)
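
Rather than hard-coding the key in source, it can be read from the environment. The sketch below is one way to do that; the variable name GOOGLE_API_KEY and the helper `load_gemini_key` are illustrative assumptions, not names defined by polytext.

```python
import os


def load_gemini_key(env=os.environ, var_name="GOOGLE_API_KEY"):
    """Return the API key stored in env[var_name], or raise a clear error."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your Google Gemini API key"
        )
    return key
```

The loader can then be created with `BaseLoader(llm_api_key=load_gemini_key())`.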

Text or Markdown Extraction

from polytext.loader.base import BaseLoader

markdown_output = False  # Set to True to extract text as Markdown
source = "local"  # Change to "cloud" to extract from cloud storage (S3 or GCS)

# Instantiate the loader (optionally set markdown_output, llm_api_key, etc.)
loader = BaseLoader(markdown_output=markdown_output, source=source)

# Extract text from a local file
result = loader.get_text(input_list=["/path/to/document.docx"])
print(result["text"])
# Extract text from a cloud file (per the note on `source` above, create the loader with source="cloud")
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.docx"])
print(result["text"])

# Extract text from a markdown file (local)
result = loader.get_text(input_list=["/path/to/document.md"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.md"])
print(result["text"])

# Extract text from an audio file (local)
result = loader.get_text(input_list=["/path/to/audio.mp3"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/audio.mp3"])
print(result["text"])

# Extract text from a video file (local)
result = loader.get_text(input_list=["/path/to/video.mp4"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/video.mp4"])
print(result["text"])

# Extract text from an image (local)
result = loader.get_text(input_list=["/path/to/image.jpg"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/image.jpg"])
print(result["text"])

# Extract transcript from a YouTube video
result = loader.get_text(input_list=["https://www.youtube.com/watch?v=xxxx"])
print(result["text"])

# Extract text from a URL
result = loader.get_text(input_list=["https://www.domain-name.com/path"])
print(result["text"])
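
Because every example above uses the same `get_text(input_list=[...])["text"]` pattern, mixed inputs can be processed in one loop. The sketch below shows two hypothetical helpers that are not part of polytext: `source_kind` guesses whether an input is local, cloud, or a web URL from its scheme (the "gs" scheme for GCS is an assumption, since only "s3" appears above), and `extract_all` collects texts and errors per input. The exceptions raised by the loader are not documented here, so the sketch catches `Exception` broadly.

```python
from urllib.parse import urlparse

# URI schemes treated as cloud storage. "s3" appears in the examples above;
# "gs" for GCS is an assumption.
CLOUD_SCHEMES = {"s3", "gs"}


def source_kind(path):
    """Classify an input as "cloud", "url", or "local" by its URI scheme."""
    scheme = urlparse(path).scheme.lower()
    if scheme in CLOUD_SCHEMES:
        return "cloud"
    if scheme in {"http", "https"}:
        return "url"
    return "local"


def extract_all(loader, inputs):
    """Call loader.get_text once per input; return (texts, errors) dicts."""
    texts, errors = {}, {}
    for item in inputs:
        try:
            texts[item] = loader.get_text(input_list=[item])["text"]
        except Exception as exc:  # the library's exception types are not assumed
            errors[item] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

Grouping inputs by `source_kind` first makes it possible to use one loader per source setting, for example a loader created with source="cloud" for the s3:// paths.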

License

MIT License
