Comprehensive PDF manipulation and conversion toolkit designed for RPA (Robotic Process Automation) workflows.
PDF documents operations library
rpa-pdf is a Python module that provides a comprehensive set of actions for manipulating PDF files and converting other file formats to PDF. It is designed to integrate easily into Robotic Process Automation (RPA) workflows.
The module is natively compatible with Robocorp and provides streamlined abstractions over complex PDF tasks.
Business Capabilities
rpa-pdf accelerates your automation projects by providing out-of-the-box features for:
- Universal Document Conversion: Convert Office documents (Word, Excel, PowerPoint), HTML, Text, Emails (.msg, .eml), and Images into standard PDF formats without manual intervention.
- Document Manipulation: Merge, compress, split, and rotate PDF files.
- Content Extraction: Extract raw text or embedded images directly from PDF pages.
- Compliance & Stamping: Automatically stamp PDFs with text (watermarks) or Code39 barcodes (useful for invoice tracking, archiving, and indexing).
- Automated Printing: Send PDFs directly to physical or virtual printers without UI prompts.
Architecture & Structure
The package is built with a modular architecture leveraging robust Python libraries (pypdf, fpdf2, Pillow, comtypes, xhtml2pdf, extract-msg) abstracted behind a simple, RPA-friendly API.
- rpa_pdf.Pdf: The core facade exposing primary manipulation methods (merge, compress, print) and text-to-PDF capabilities.
- rpa_pdf.Stamp: Utilities focused entirely on adding text and barcode overlays onto existing documents.
- rpa_pdf.Converter: A universal registry-based format converter that dynamically proxies out to appropriate engines based on file extensions.
- rpa_pdf.TextExtractor & rpa_pdf.ImageExtractor: Extract embedded content from PDFs.
- rpa_pdf.PdfSplitter & rpa_pdf.PdfRotator: Handle page-level PDF mutations.
- Bundled Executables: Bundles necessary fonts and lightweight viewers (like SumatraPDF) to guarantee predictable cross-environment execution, especially for silent printing operations.
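The registry-based dispatch described for rpa_pdf.Converter can be pictured roughly as follows. This is a minimal sketch of the general pattern, not the package's actual implementation: the ExtensionRegistry class, the handler signature, and the text_to_pdf placeholder are all hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical handler signature: (input_path, output_path) -> output_path.
Handler = Callable[[str, str], str]


class ExtensionRegistry:
    """Maps lowercase file extensions to conversion handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, *extensions: str) -> Callable[[Handler], Handler]:
        """Decorator that registers one handler for several extensions."""
        def decorator(handler: Handler) -> Handler:
            for ext in extensions:
                self._handlers[ext.lower()] = handler
            return handler
        return decorator

    def convert(self, input_path: str, output_path: str) -> str:
        """Look up the handler by file extension and delegate to it."""
        ext = Path(input_path).suffix.lower()
        try:
            handler = self._handlers[ext]
        except KeyError:
            raise ValueError(f"No converter registered for '{ext}'") from None
        return handler(input_path, output_path)


registry = ExtensionRegistry()


@registry.register(".txt", ".log")
def text_to_pdf(input_path: str, output_path: str) -> str:
    # Placeholder engine: a real handler would render the PDF here.
    return output_path
```

The appeal of this design is that supporting a new format only means registering one more handler; the calling code keeps using a single convert entry point.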
Installation
We recommend using a modern package manager like uv or standard pip.
# Using pip
pip install rpa-pdf
# Using uv
uv pip install rpa-pdf
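To confirm from Python that the package landed in the active environment, a small check along these lines may help. It relies only on the standard library's importlib.metadata; the installed_version helper is a hypothetical name, not part of rpa-pdf.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("rpa-pdf")
    print(f"rpa-pdf version: {found}" if found else "rpa-pdf is not installed")
```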
Examples
1. Universal File Conversions (Word/Excel/PowerPoint/HTML/Email/Images)
The simplest way to convert any supported format into a PDF is using the universal Converter class.
(Note: Office conversion requires Microsoft Office to be installed on the machine)
from rpa_pdf import Converter
converter = Converter()
# Converts Word documents
converter.convert('c:/temp/document.docx', 'c:/temp/document.pdf')
# Converts Excel spreadsheets
converter.convert('c:/temp/data.xlsx') # Defaults to c:/temp/data.xlsx.pdf
# Converts Emails (.eml, .msg)
converter.convert('c:/temp/email.msg', 'c:/temp/email.pdf')
# Converts HTML & Images
converter.convert('c:/temp/index.html', 'c:/temp/website.pdf')
converter.convert('c:/temp/scan.png', 'c:/temp/scan.pdf')
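In an RPA flow the inputs usually arrive as a folder of mixed files rather than one path at a time. The sketch below plans such a batch using only the standard library. The plan_conversions helper and the CONVERTIBLE set are assumptions made for illustration (the set lists only the extensions used in the examples above), and the commented-out loop uses just the two-argument converter.convert call already shown.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Assumed subset of supported extensions, taken from the examples above.
CONVERTIBLE = {".docx", ".xlsx", ".msg", ".eml", ".html", ".png"}


def plan_conversions(
    folder: str, extensions: Iterable[str] = CONVERTIBLE
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for every convertible file in a folder."""
    wanted = {ext.lower() for ext in extensions}
    pairs = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            # Replace the original extension with .pdf for the target name.
            pairs.append((str(path), str(path.with_suffix(".pdf"))))
    return pairs


# Usage with rpa_pdf (requires the package and, for Office files, Microsoft Office):
# from rpa_pdf import Converter
# converter = Converter()
# for source, target in plan_conversions("c:/temp/inbox"):
#     converter.convert(source, target)
```

Note that two sources sharing a stem (for example report.docx and report.xlsx) would map to the same target here, so a real workflow may want to keep the original extension in the output name.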
2. Merging PDF Files
Combine multiple invoice documents into one package.
from rpa_pdf import Pdf
pdf = Pdf()
pdf.merge(['c:/temp/invoice_1.pdf', 'c:/temp/invoice_2.pdf'], 'c:/temp/merged_invoices.pdf')
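When the input list is built from a directory listing, plain string sorting places invoice_10.pdf before invoice_2.pdf. Assuming the merged output follows the order of the input list (which the example suggests but this README does not state), a natural sort key such as the hypothetical natural_key helper below keeps numbered files in the expected sequence.

```python
import re
from typing import List


def natural_key(path: str) -> List[object]:
    """Build a sort key that compares digit runs numerically.

    re.split with a capturing group alternates text and digit chunks,
    so odd positions always hold digits and even positions hold text.
    """
    parts = re.split(r"(\d+)", path)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


files = ["c:/temp/invoice_10.pdf", "c:/temp/invoice_2.pdf", "c:/temp/invoice_1.pdf"]
ordered = sorted(files, key=natural_key)

# The ordered list can then be passed to the merge call shown above:
# from rpa_pdf import Pdf
# Pdf().merge(ordered, "c:/temp/merged_invoices.pdf")
```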
3. Adding Text and Barcode Stamps
Great for adding tracking codes or "CONFIDENTIAL" watermarks to documents before they are printed or archived.
from rpa_pdf import Stamp
stamp = Stamp()
# Add a text watermark to the center of the first page
stamp.add_text_stamp(
    input_pdf_file_path='c:/temp/input.pdf',
    output_pdf_file_path='c:/temp/watermarked.pdf',
    text='CONFIDENTIAL',
    font_size=40,
    text_horizontal_position='center',
    text_vertical_position='center'
)
# Add a Code39 tracking barcode
stamp.add_code39_stamp(
    input_pdf_file_path='c:/temp/input.pdf',
    output_pdf_file_path='c:/temp/barcoded.pdf',
    code='12345678',
    vertical_position='top',
    horizontal_position='right'
)
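Standard Code 39 can encode only 43 characters: digits, uppercase letters, and the symbols "-", ".", space, "$", "/", "+", and "%". How add_code39_stamp reacts to other characters is not documented here, so a workflow may want to check the payload first. The validate_code39 helper below is a hypothetical pre-check, not part of rpa-pdf.

```python
# The 43 data characters of standard Code 39 ("*" is reserved as start/stop).
CODE39_ALPHABET = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-. $/+%")


def validate_code39(code: str) -> str:
    """Return the code unchanged if every character is encodable in Code 39."""
    if not code:
        raise ValueError("Code 39 payload must not be empty")
    invalid = sorted(set(code) - CODE39_ALPHABET)
    if invalid:
        raise ValueError(f"Characters not encodable in Code 39: {invalid}")
    return code


tracking_code = validate_code39("INV-2024/0042")
```

Lowercase input is rejected rather than silently uppercased, so the printed barcode always matches the value stored in the tracking system.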
4. Splitting & Rotating PDFs
Extract individual pages from a large batch document or fix scanned orientations.
from rpa_pdf import PdfSplitter, PdfRotator
# Split all pages into individual files
splitter = PdfSplitter()
splitter.split('c:/temp/batch_invoices.pdf', 'c:/temp/output_folder')
# Rotate the first page 90 degrees
rotator = PdfRotator()
rotator.rotate('c:/temp/scanned_doc.pdf', angle=90, pages='first')
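PDF page rotation is defined only in multiples of 90 degrees. Whether PdfRotator accepts negative angles or values above 360 is not stated here, so the sketch below normalizes an angle before it is passed on. The normalize_rotation helper is a hypothetical name introduced for illustration.

```python
def normalize_rotation(angle: int) -> int:
    """Map any multiple of 90 onto the equivalent clockwise angle in 0, 90, 180, 270."""
    if angle % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90 degrees, got {angle}")
    # Python's modulo returns a non-negative result for a positive divisor,
    # so -90 becomes 270.
    return angle % 360


angle = normalize_rotation(-90)
```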
5. Content Extraction
Pull text and embedded images out of a PDF.
from rpa_pdf import TextExtractor, ImageExtractor
text_extractor = TextExtractor()
text = text_extractor.extract_text('c:/temp/document.pdf', pages=[0, 1])
img_extractor = ImageExtractor()
img_extractor.extract_images('c:/temp/document.pdf', 'c:/temp/images')
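The pages=[0, 1] argument above suggests zero-based page indexes, although this README does not confirm it. Business users usually specify pages the way a print dialog does ("1-3,5"), so a small translation step can help. The parse_page_ranges helper is a hypothetical sketch built on that zero-based assumption.

```python
from typing import List


def parse_page_ranges(spec: str) -> List[int]:
    """Convert a 1-based range string such as '1-3,5' into sorted zero-based indexes."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: '{part}'")
        # Shift from 1-based inclusive to zero-based indexes.
        pages.update(range(start - 1, end))
    return sorted(pages)


page_indexes = parse_page_ranges("1-3,5")
```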
6. Generate PDF from Raw Text
Directly write string contents to a brand new PDF document.
from rpa_pdf import Pdf
pdf = Pdf()
pdf.text_to_pdf(text="Automated report execution successful.", output_file_path="c:/temp/report.pdf")
7. Compressing a PDF File
Reduce the file size of a PDF before sending it via email.
from rpa_pdf import Pdf
pdf = Pdf()
pdf.compress('c:/temp/large_file.pdf')
8. Print a PDF Silently
Print directly to a named printer using the bundled SumatraPDF engine.
from rpa_pdf import Pdf
pdf = Pdf()
pdf.print('c:/temp/document.pdf', 'printer_name')
Development & Testing
This project uses uv for dependency management and ruff for linting. Tests are run via pytest.
# Install dependencies including development extras
uv sync --all-extras
# Run tests
pytest