Convert PDF documents to Markdown format with AI assistance
PaperShift
A Python library for converting PDF documents to Markdown format with AI assistance. Shift from scanned documents to editable, searchable text.
Features
- Converts PDF documents to well-formatted Markdown
- Processes documents in parallel for faster conversion
- Keeps memory usage low through batch processing
- Offers a fast mode that trades resolution for speed
- Reports detailed progress
- Supports customizable AI model selection
- Adapts resolution to output requirements
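To make the parallel and batch-processing features concrete, here is a minimal sketch of the general pattern. It is not PaperShift's internal code: `batched`, `process_in_batches`, and the `worker` callable are hypothetical names, and it uses threads for simplicity even though the `max_workers` option is described as controlling worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(pages, worker, batch_size=5, max_workers=4):
    """Apply `worker` to every page, one batch at a time, keeping page order.

    Illustrative only: `worker` stands in for whatever converts a single page.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(pages, batch_size):
            # pool.map returns results in input order, and only one batch
            # is in flight at a time, which bounds peak memory use.
            results.extend(pool.map(worker, batch))
    return results
```

Smaller batches keep fewer page images in memory at once; more workers shorten each batch at the cost of more concurrent API calls.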
Installation
pip install papershift
Usage
from papershift import convert_pdf_to_markdown

# Basic usage
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    api_key="your-openrouter-api-key"
)

# Advanced usage with options
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    output_dir="output_folder",
    dpi=300,
    target_height_px=2048,
    model="openrouter/google/gemini-2.0-flash-001",
    api_key="your-openrouter-api-key",
    max_workers=4,
    batch_size=5,
    fast_mode=True
)

# Save the output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
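Hard-coding the key as in the snippets above is fine for a quick test but easy to leak into version control. A minimal sketch of reading it from the environment instead follows. The variable name `OPENROUTER_API_KEY` and the helpers `load_api_key` and `convert_with_env_key` are assumptions made for illustration, not part of PaperShift.

```python
import os


def load_api_key(env=None, name="OPENROUTER_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    `name` is an assumed variable name; use whatever your setup defines.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


def convert_with_env_key(pdf_path):
    """Convert `pdf_path` using a key taken from the environment."""
    # Imported lazily so the helper above works without papershift installed.
    from papershift import convert_pdf_to_markdown

    return convert_pdf_to_markdown(pdf_path=pdf_path, api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API.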
Configuration Options
| Parameter | Description | Default |
|---|---|---|
| pdf_path | Path to the PDF file | (Required) |
| output_dir | Directory to save the output markdown files | None |
| dpi | DPI for image rendering | 300 |
| target_height_px | Target height in pixels | 2048 |
| aspect_threshold | Aspect ratio threshold for height adjustment | 1.5 |
| prompt | Text prompt to send with each page image | "Convert this document to markdown" |
| model | The model to use for processing | "openrouter/google/gemini-2.0-flash-001" |
| api_key | OpenRouter API key | None |
| site_url | Optional site URL for OpenRouter | None |
| app_name | Optional app name for OpenRouter | None |
| combined_output | If True, returns a single string with all pages combined | True |
| verbose | If True, prints progress information | False |
| max_workers | Maximum number of worker processes for PDF conversion | 4 |
| batch_size | Number of pages to process in a single batch | 5 |
| quality | Image quality (1-100) for JPEG compression in fast mode | 95 |
| fast_mode | If True, uses reduced resolution and JPEG format for faster processing | False |
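The table does not spell out how `dpi` and `target_height_px` relate, so the arithmetic below may help when choosing values. It relies only on the fact that PDF page sizes are measured in points (72 per inch). The function names are hypothetical, and how PaperShift itself combines these options with `aspect_threshold` is not documented here.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def rendered_height_px(page_height_pt, dpi):
    """Pixel height of a page of `page_height_pt` points rendered at `dpi`."""
    return round(page_height_pt * dpi / POINTS_PER_INCH)


def dpi_for_target_height(page_height_pt, target_height_px):
    """DPI at which the page renders exactly `target_height_px` pixels tall."""
    return target_height_px * POINTS_PER_INCH / page_height_pt


def aspect_ratio(page_width_pt, page_height_pt):
    """Height divided by width; values above 1 mean a portrait page."""
    return page_height_pt / page_width_pt
```

For an A4 page (about 595 x 842 points), rendering at 300 DPI gives roughly 3508 pixels of height, while hitting a 2048-pixel target takes about 175 DPI. Its height-to-width ratio is about 1.41, just under the default `aspect_threshold` of 1.5.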
Dependencies
- PyMuPDF: PDF processing library
- litellm: LLM API integration
- openrouter: API for accessing various AI models
- python-dotenv: Environment variable management