Convert PDF documents to Markdown format with AI assistance
PaperShift
A Python library for converting PDF documents to Markdown format with AI assistance. Shift from scanned documents to editable, searchable text.
Features
- Converts PDF documents to well-formatted Markdown
- Processes documents in parallel for faster conversion
- Limits memory usage by processing pages in batches
- Offers a fast mode that trades image resolution for speed
- Reports detailed progress
- Lets you choose the AI model
- Adapts rendering resolution to the output requirements
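The batching and parallelism listed above can be pictured with a short sketch. This is not PaperShift's internal code: `batched_pages`, `process_page`, and `process_document` are hypothetical names, and a thread pool stands in for whatever worker model the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def batched_pages(page_count: int, batch_size: int) -> Iterator[List[int]]:
    """Yield page indices in consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, page_count, batch_size):
        yield list(range(start, min(start + batch_size, page_count)))


def process_page(page_index: int) -> str:
    """Hypothetical stand-in for rendering one page and sending it to a model."""
    return f"<!-- page {page_index + 1} -->"


def process_document(page_count: int, batch_size: int = 5, max_workers: int = 4) -> List[str]:
    """Handle one batch at a time so at most batch_size pages are in flight."""
    results: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched_pages(page_count, batch_size):
            # Executor.map returns results in input order, so pages stay sorted.
            results.extend(pool.map(process_page, batch))
    return results
```

Working batch by batch bounds how many rendered page images exist at once, while the pool keeps up to `max_workers` pages of the current batch moving in parallel.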
Installation
pip install papershift
Usage
from papershift import convert_pdf_to_markdown

# Basic usage
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    api_key="your-openrouter-api-key"
)

# Advanced usage with options
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    output_dir="output_folder",
    dpi=300,
    target_height_px=2048,
    model="openrouter/google/gemini-2.0-flash-001",
    api_key="your-openrouter-api-key",
    max_workers=4,
    batch_size=5,
    fast_mode=True
)

# Save the output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
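For a folder of PDFs, the same call can be wrapped in a small loop. The sketch below is illustrative only: `convert_folder` is a hypothetical helper, not part of PaperShift, and it takes the converter as an argument so it relies on nothing beyond the `pdf_path` keyword and the string return value shown above.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(pdf_dir: str, out_dir: str, convert: Callable[..., str], **options) -> List[Path]:
    """Convert every *.pdf in pdf_dir and write one .md file per document."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Extra keyword options (api_key, model, ...) are passed through unchanged.
        markdown = convert(pdf_path=str(pdf), **options)
        target = out_path / (pdf.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the library installed, a call would look like `convert_folder("pdfs", "markdown", convert_pdf_to_markdown, api_key="your-openrouter-api-key")`.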
Configuration Options
| Parameter | Description | Default |
|---|---|---|
| pdf_path | Path to the PDF file | (Required) |
| output_dir | Directory to save the output Markdown files | None |
| dpi | DPI for image rendering | 300 |
| target_height_px | Target height in pixels | 2048 |
| aspect_threshold | Aspect ratio threshold for height adjustment | 1.5 |
| prompt | Text prompt to send with each page image | "Convert this document to markdown" |
| model | The model to use for processing | "openrouter/google/gemini-2.0-flash-001" |
| api_key | OpenRouter API key | None |
| site_url | Optional site URL for OpenRouter | None |
| app_name | Optional app name for OpenRouter | None |
| combined_output | If True, returns a single string with all pages combined | True |
| verbose | If True, prints progress information | False |
| max_workers | Maximum number of worker processes for PDF conversion | 4 |
| batch_size | Number of pages to process in a single batch | 5 |
| quality | Image quality (1-100) for JPEG compression in fast mode | 95 |
| fast_mode | If True, uses reduced resolution and JPEG format for faster processing | False |
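The relationship between `dpi` and `target_height_px` follows from the PDF unit: one point is 1/72 inch. The arithmetic is sketched below with illustrative function names. Whether PaperShift computes its adaptive resolution this way, and how `aspect_threshold` enters, is not stated in the table, so treat this as an assumption.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def rendered_height_px(page_height_pt: float, dpi: float) -> int:
    """Pixel height of a page rendered at the given DPI."""
    return round(page_height_pt * dpi / POINTS_PER_INCH)


def dpi_for_target_height(page_height_pt: float, target_height_px: int) -> float:
    """DPI at which the rendered page is exactly target_height_px tall."""
    return target_height_px * POINTS_PER_INCH / page_height_pt


letter_height_pt = 792  # US Letter is 11 inches tall
print(rendered_height_px(letter_height_pt, 300))                # 3300
print(round(dpi_for_target_height(letter_height_pt, 2048), 1))  # 186.2
```

So a US Letter page at the default 300 DPI is taller than the default 2048 px target, and hitting that target exactly would need roughly 186 DPI.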
Dependencies
- PyMuPDF: PDF processing library
- litellm: LLM API integration
- openrouter: API for accessing various AI models
- python-dotenv: Environment variable management
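Since python-dotenv is among the dependencies, one way to keep the API key out of source code is to read it from the environment or a `.env` file and pass it as `api_key`. This is a sketch under assumptions: the variable name `OPENROUTER_API_KEY` and the helper `get_api_key` are illustrative, and the description above does not say which variable, if any, the library reads on its own.

```python
import os

try:
    from dotenv import load_dotenv  # provided by python-dotenv
except ImportError:
    load_dotenv = None  # fall back to plain environment variables


def get_api_key(env_var: str = "OPENROUTER_API_KEY") -> str:
    """Return the key from the environment, loading a .env file first if possible."""
    if load_dotenv is not None:
        load_dotenv()  # by default, does not override variables that are already set
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} in the environment or in a .env file")
    return key
```

The result can then be passed explicitly, for example `convert_pdf_to_markdown(pdf_path="document.pdf", api_key=get_api_key())`.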
File details
Details for the file papershift-0.1.0.tar.gz.
File metadata
- Download URL: papershift-0.1.0.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0f2c4aedc008806d61cf57326eb0667293dcb0ad7bb7de500ceec4a2b683e4f |
| MD5 | a76a492e49988bbadb389252300df5c6 |
| BLAKE2b-256 | af413039804fb79930dac3065832ad3903ff726a17da41ce097eee35b7941b22 |
File details
Details for the file papershift-0.1.0-py3-none-any.whl.
File metadata
- Download URL: papershift-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0de24e2d9741b2dcf4769de85895563d3949cf9d38e0a22c3034f9bb6c540a71 |
| MD5 | a7c0e051404b151bd7fd43470637efaa |
| BLAKE2b-256 | d825fc38741e445ade63f4eea99c558f99d46a7b8d1e94dfaaed069420b65c77 |
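A downloaded file can be checked against the published SHA256 digests with Python's standard `hashlib` module. The helper names below are illustrative, and the file is assumed to sit in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the check would be `matches_published_hash("papershift-0.1.0.tar.gz", "e0f2c4aedc008806d61cf57326eb0667293dcb0ad7bb7de500ceec4a2b683e4f")`.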