Python utilities to simplify document file management
polytext
Doc Utils
A Python package for document conversion and text extraction.
Features
- Convert various document formats (DOCX, ODT, PPT, etc.) to PDF
- Extract text from PDF, Markdown, image, and audio files
- Support for both local files and S3/GCS cloud storage
- Multiple PDF parsing backends (PyPDF, PyMuPDF)
- Transcribe audio & video files (local or cloud) to text/markdown
- Extract YouTube video transcripts
- Extract text from URLs
Installation
# Library only – assumes system requirements are already present
pip install polytext
Heads-up: Polytext’s PDF generator relies on WeasyPrint under the hood.
The PyPI wheel contains only Python code; you still need WeasyPrint’s native libraries (Pango, Cairo, GDK-PixBuf, HarfBuzz, Fontconfig) installed at the OS level.
System requirements
| Requirement | Notes | macOS (Homebrew) | Ubuntu / Debian |
|---|---|---|---|
| Python | ✔️ Tested on 3.12. Older versions may fail to locate WeasyPrint’s dylibs | brew install python@3.12 | sudo apt install python3.12 |
| WeasyPrint – native stack | installs Pango, Cairo, etc. | brew install weasyprint | sudo apt install weasyprint |
| LibreOffice | used for Office → PDF conversion | brew install --cask libreoffice | sudo apt install libreoffice |
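Before installing, it can help to confirm that the native pieces are visible to Python. The following is a minimal preflight sketch and is not part of polytext: it relies only on the standard library's `ctypes.util.find_library` and `shutil.which`. The library short names and the `soffice` executable name are assumptions and may differ by platform or packaging, so a "MISSING" result is a hint to investigate rather than proof that something is absent.

```python
import shutil
from ctypes.util import find_library

# Short library names are assumptions; distributions may package them differently.
NATIVE_LIBS = ["pango-1.0", "cairo", "gdk_pixbuf-2.0", "harfbuzz", "fontconfig"]


def check_requirements(libs=NATIVE_LIBS, office_cmd="soffice"):
    """Return a dict mapping each requirement name to True if it was found."""
    report = {name: find_library(name) is not None for name in libs}
    # LibreOffice usually exposes a `soffice` executable (assumed name).
    report[office_cmd] = shutil.which(office_cmd) is not None
    return report


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```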
Usage
Converting Documents to PDF
from polytext import convert_to_pdf, ConversionError

try:
    # Convert a document to PDF
    pdf_path = convert_to_pdf('input.docx', 'output.pdf')
    print(f"PDF saved to: {pdf_path}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
The following input types require a Google Gemini API key, passed to the loader as llm_api_key:
- audio
- video
- image
- youtube
from polytext.loader.base import BaseLoader
llm_api_key = "your_google_gemini_api_key" # Set your Google Gemini API key here
# Instantiate the loader
loader = BaseLoader(llm_api_key=llm_api_key)
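Hardcoding the key is fine for a quick test, but reading it from the environment keeps it out of source control. This is a minimal sketch under assumptions: `load_api_key` is a hypothetical helper, and the variable name `GEMINI_API_KEY` is chosen here for illustration, not something polytext is known to read on its own.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The result can then be passed as `BaseLoader(llm_api_key=load_api_key())`.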
Text or Markdown Extraction
from polytext.loader.base import BaseLoader
markdown_output = False # Change if you want to extract text as markdown
source = "local" # Change to "cloud" if you want to extract from cloud storage (S3 or GCS)
# Instantiate the loader (optionally set markdown_output, llm_api_key, etc.)
loader = BaseLoader(markdown_output=markdown_output, source=source)
# Extract text from a local file
result = loader.get_text(input_list=["/path/to/document.docx"])
print(result["text"])
# Extract text from a cloud file (set source="cloud" when creating the loader)
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.docx"])
print(result["text"])
# Extract text from a markdown file (local)
result = loader.get_text(input_list=["/path/to/document.md"])
print(result["text"])
# Extract text from a markdown file (cloud)
result = loader.get_text(input_list=["s3://your-bucket/path/to/document.md"])
print(result["text"])
# Extract text from an audio file (local)
result = loader.get_text(input_list=["/path/to/audio.mp3"])
print(result["text"])
# Extract text from an audio file (cloud)
result = loader.get_text(input_list=["s3://your-bucket/path/to/audio.mp3"])
print(result["text"])
# Extract text from a video file (local)
result = loader.get_text(input_list=["/path/to/video.mp4"])
print(result["text"])
# Extract text from a video file (cloud)
result = loader.get_text(input_list=["s3://your-bucket/path/to/video.mp4"])
print(result["text"])
# Extract text from an image file (local)
result = loader.get_text(input_list=["/path/to/image.jpg"])
print(result["text"])
# Extract text from an image file (cloud)
result = loader.get_text(input_list=["s3://your-bucket/path/to/image.jpg"])
print(result["text"])
# Extract transcript from a YouTube video
result = loader.get_text(input_list=["https://www.youtube.com/watch?v=xxxx"])
print(result["text"])
# Extract text from a URL
result = loader.get_text(input_list=["https://www.domain-name.com/path"])
print(result["text"])
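Since the loader is created with a single `source` value and only some inputs need a Gemini key, it can be useful to classify inputs before building loaders. The helper below is a hypothetical sketch, not polytext's own routing logic: the `gs` scheme for GCS, the extension list, and the `"url"` label are assumptions made for illustration. Only `"local"` and `"cloud"` correspond to the `source` values mentioned above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

CLOUD_SCHEMES = {"s3", "gs"}  # "gs" for GCS is an assumption
# Illustrative audio/video/image extensions, not polytext's supported set.
LLM_EXTENSIONS = {".mp3", ".wav", ".mp4", ".mov", ".jpg", ".jpeg", ".png"}


def describe_input(item):
    """Return (source, needs_llm_key) for one input path or URL."""
    parsed = urlparse(item)
    scheme = parsed.scheme.lower()
    if scheme in CLOUD_SCHEMES:
        source = "cloud"
    elif scheme in {"http", "https"}:
        source = "url"  # label used by this sketch only
    else:
        source = "local"
    host = parsed.netloc.lower()
    is_youtube = source == "url" and ("youtube.com" in host or host == "youtu.be")
    suffix = PurePosixPath(parsed.path).suffix.lower()
    # Audio, video, image, and YouTube inputs are the ones listed as needing a key.
    needs_key = is_youtube or (source != "url" and suffix in LLM_EXTENSIONS)
    return source, needs_key
```

Grouping a mixed input list by the first element of the returned tuple makes it straightforward to create one loader per source and to fail early when a key is required but missing.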
License
MIT License