Python utilities to simplify document file management

polytext

Doc Utils

A Python package for document conversion and text extraction.

Features

Convert various document formats (DOCX, ODT, PPT, etc.) to PDF
Extract text from PDF, Markdown, image, and audio files
Support for both local files and S3/GCS cloud storage
Multiple PDF parsing backends (PyPDF, PyMuPDF)
Transcribe audio & video files (local or cloud) to text/markdown
Extract YouTube video transcripts
Extract text from URLs

Installation

# Library only – assumes system requirements are already present
pip install polytext

Heads-up: Polytext’s PDF generator relies on WeasyPrint under the hood.
The PyPI wheel contains only Python code; you still need WeasyPrint’s native libraries (Pango, Cairo, GDK-PixBuf, HarfBuzz, Fontconfig) installed at the OS level.

System requirements

| Requirement | Notes | macOS (Homebrew) | Ubuntu / Debian |
| --- | --- | --- | --- |
| Python | Supported on 3.11 – 3.13; WeasyPrint still requires its native libraries | `brew install python@3.11` | `sudo apt install python3.11` |
| WeasyPrint – native stack | Installs Pango, Cairo, etc. | `brew install weasyprint` | `sudo apt install weasyprint` |
| LibreOffice | Used for Office → PDF conversion | `brew install --cask libreoffice` | `sudo apt install libreoffice` |
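
A quick way to see whether the tools from the table are reachable is to look for their executables on PATH. The sketch below uses only the standard library. The executable names (`weasyprint`, `soffice`, `libreoffice`) are assumptions about how these packages usually expose their command-line tools, not something Polytext documents, and finding an executable does not prove that WeasyPrint's native libraries load correctly.

```python
import shutil

# Assumed executable names; adjust them for your platform or package manager.
REQUIRED_TOOLS = {
    "WeasyPrint": ("weasyprint",),
    "LibreOffice": ("soffice", "libreoffice"),
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools for which no candidate executable is on PATH."""
    missing = []
    for name, candidates in tools.items():
        # A tool counts as present if any one of its candidate names resolves.
        if not any(which(exe) for exe in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system requirements:", ", ".join(missing))
    else:
        print("All checked system requirements were found on PATH.")
```

On macOS, a LibreOffice cask install may not add `soffice` to PATH, so a "missing" result there is a hint to check manually rather than a definitive answer.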

Usage

Converting Documents to PDF

from polytext import convert_to_pdf, ConversionError

try:
    # Convert a document to PDF
    pdf_path = convert_to_pdf('input.docx', 'output.pdf')
    print(f"PDF saved to: {pdf_path}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
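
For several files at once, the same call can be wrapped in a small loop. This is a minimal sketch, not part of Polytext: `convert_many` is a hypothetical helper, and it takes the converter as a parameter so it only assumes a callable of the form `convert(input_path, output_path)`, as in the example above.

```python
from pathlib import Path


def convert_many(paths, out_dir, convert):
    """Convert each input with `convert(src, dst)`; return (converted, failed).

    `convert` is any callable shaped like the call shown above,
    for example polytext's convert_to_pdf (an assumption, not verified here).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for src in map(Path, paths):
        # Name each output after its input, with a .pdf extension.
        dst = out_dir / (src.stem + ".pdf")
        try:
            convert(str(src), str(dst))
            converted.append(dst)
        except Exception as exc:  # with polytext, narrow this to ConversionError
            failed.append((src, exc))
    return converted, failed
```

With Polytext installed, a call might look like `convert_many(["a.docx", "b.odt"], "out", convert_to_pdf)`. Collecting failures instead of stopping at the first one lets a batch finish even when one document cannot be converted.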

The following features require a Google Gemini API key:

audio
video
image
youtube

from polytext.loader.base import BaseLoader

llm_api_key = "your_google_gemini_api_key"  # Set your Google Gemini API key here

# Instantiate the loader 
loader = BaseLoader(llm_api_key=llm_api_key)
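
Rather than hard-coding the key in source, it can be read from the environment. The helper below is a sketch under assumptions: `load_api_key` is hypothetical, and the variable name `GEMINI_API_KEY` is an arbitrary choice for illustration, not a name Polytext is known to read on its own.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY", environ=os.environ):
    """Return the API key stored in `env_var`, or raise a clear error if unset."""
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before using "
            "audio, video, image, or YouTube extraction."
        )
    return key
```

The loader would then be created with `BaseLoader(llm_api_key=load_api_key())`, which keeps the key out of version control.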

Text or Markdown Extraction

from polytext.loader.base import BaseLoader

markdown_output = False  # Set to True to extract text as Markdown
source = "local"  # Change to "cloud" to extract from cloud storage (S3 or GCS)

# Instantiate the loader (optionally set markdown_output, llm_api_key, etc.)
loader = BaseLoader(markdown_output=markdown_output, source=source)

# Extract text from a local file
result = loader.get_text(input_list=["/path/to/document.docx"])
print(result["text"])
# Extract text from a cloud file (assumes the loader was created with source="cloud")
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.docx"])
print(result["text"])

# Extract text from a markdown file (local)
result = loader.get_text(input_list=["/path/to/document.md"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.md"])
print(result["text"])

# Extract text from an audio file (local)
result = loader.get_text(input_list=["/path/to/audio.mp3"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/audio.mp3"])
print(result["text"])

# Extract text from a video file (local)
result = loader.get_text(input_list=["/path/to/video.mp4"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/video.mp4"])
print(result["text"])

# Extract text from an image (local)
result = loader.get_text(input_list=["/path/to/image.jpg"])
print(result["text"])
# Extract text from a cloud file
result = loader.get_text(input_list=["s3://your-bucket/path/to/image.jpg"])
print(result["text"])

# Extract transcript from a YouTube video
result = loader.get_text(input_list=["https://www.youtube.com/watch?v=xxxx"])
print(result["text"])

# Extract text from a URL
result = loader.get_text(input_list=["https://www.domain-name.com/path"])
print(result["text"])
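
Because the loader is configured for either local or cloud input, and some input types need a Gemini API key, it can be useful to inspect an input string before creating the loader. The sketch below is illustrative only: `describe_input`, the `"url"` label, the extension set, and the `gs` scheme for GCS are assumptions, and Polytext's own detection logic may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mappings for illustration; Polytext's own detection may differ.
CLOUD_SCHEMES = {"s3", "gs"}
GEMINI_EXTENSIONS = {".mp3", ".mp4", ".jpg"}
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "youtu.be"}


def describe_input(uri):
    """Return (source, needs_api_key) for one input string."""
    parsed = urlparse(uri)
    if parsed.scheme in ("http", "https"):
        # Of the web inputs, only YouTube transcripts are listed as needing a key.
        return "url", parsed.netloc.lower() in YOUTUBE_HOSTS
    source = "cloud" if parsed.scheme in CLOUD_SCHEMES else "local"
    suffix = PurePosixPath(parsed.path).suffix.lower()
    return source, suffix in GEMINI_EXTENSIONS
```

For example, `describe_input("s3://your-bucket/path/to/audio.mp3")` yields `("cloud", True)`, which suggests creating the loader with `source="cloud"` and an `llm_api_key`.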

License

MIT License

Release history

0.2.7 (Jun 30, 2026)
0.2.7b0, pre-release (Jun 30, 2026)
0.2.6 (Jun 9, 2026)
0.2.5 (Jun 5, 2026)
0.2.4 (Jun 3, 2026)
0.2.3 (May 27, 2026)
0.2.2 (May 26, 2026)
0.2.2b2, pre-release (May 26, 2026)
0.2.2b1, pre-release (May 26, 2026)
0.2.1 (May 12, 2026)
0.2.0 (Apr 2, 2026)
0.1.5b9, pre-release (Mar 10, 2026)
0.1.5b8, pre-release (Mar 10, 2026)
0.1.5b7, pre-release (Mar 10, 2026)
0.1.5b6, pre-release (Mar 9, 2026)
0.1.5b5, pre-release (Mar 4, 2026)
0.1.5b4, pre-release (Jan 9, 2026)
0.1.5b3, pre-release (Dec 22, 2025)
0.1.5b2, pre-release (Dec 22, 2025)
0.1.5b1, pre-release (Oct 24, 2025)
0.1.4 (Oct 17, 2025)
0.1.3b5, pre-release (Aug 1, 2025)
0.1.3b4, pre-release (Jun 10, 2025)
0.1.3b3, pre-release (Jun 5, 2025)
0.1.3b2, pre-release (Jun 5, 2025)
0.1.3b1, pre-release (Jun 4, 2025)
0.1.2 (Mar 6, 2025)
0.1.1 (Mar 5, 2025)
0.1.0 (Feb 27, 2025)

Download files

Source Distribution

polytext-0.2.7.tar.gz (118.4 kB)

Uploaded Jun 30, 2026 Source

Built Distribution

polytext-0.2.7-py3-none-any.whl (105.8 kB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file polytext-0.2.7.tar.gz.

File metadata

Download URL: polytext-0.2.7.tar.gz
Upload date: Jun 30, 2026
Size: 118.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polytext-0.2.7.tar.gz
Algorithm	Hash digest
SHA256	`bfcbe88cf726e68c10a7f7dd8ca330f398e133cf16bd4ebbb73638a9aaa60557`
MD5	`1db7bf36d2bfdba60025c39f451ba378`
BLAKE2b-256	`def84adde6ee20c00b1d4930e0a798a51c15daf986e4ca3fcc7def66d5dda6a3`
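
A downloaded file can be checked against the published SHA256 digest with the standard library's `hashlib`. The helper names below (`sha256_of`, `matches_published_hash`) are hypothetical, and the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution above, the check would be `matches_published_hash("polytext-0.2.7.tar.gz", "bfcbe88cf726e68c10a7f7dd8ca330f398e133cf16bd4ebbb73638a9aaa60557")`.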

File details

Details for the file polytext-0.2.7-py3-none-any.whl.

File metadata

Download URL: polytext-0.2.7-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 105.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polytext-0.2.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`431ac3db2ca61ec01aece5facb416ce4b62e6f87663c653e5c91abc2101f5808`
MD5	`1ee8264325688f50dc172d05e48d49ba`
BLAKE2b-256	`121f33c00d5240645323b0f0a1987bac539b694f262e32aacf658235bea41baa`
