Convert PDF documents and images to Markdown format with AI assistance

PaperShift

A Python library for converting PDF documents and images to Markdown format with AI assistance. Shift from scanned documents and images to editable, searchable text.

Features

Converts PDF documents to well-formatted Markdown
Converts image files (PNG, JPG, etc.) to well-formatted Markdown
Processes documents and images in parallel for faster conversion
Optimizes memory usage with batch processing
Offers a fast mode for quicker processing at lower resolution
Reports detailed progress
Supports customizable AI model selection
Adapts resolution to output requirements

Installation

pip install papershift

Usage

PDF to Markdown

from papershift import convert_pdf_to_markdown

# Basic usage
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    api_key="your-openrouter-api-key"
)

# Advanced usage with options
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    output_dir="output_folder",
    dpi=300,
    target_height_px=2048,
    model="openrouter/google/gemini-2.0-flash-001",
    api_key="your-openrouter-api-key",
    max_workers=4,
    batch_size=5,
    fast_mode=True
)

# Save the output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
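
The examples above pass the API key as a string literal. A minimal sketch of reading it from the environment instead follows. The variable name `OPENROUTER_API_KEY` and both helper functions are assumptions made for illustration and are not part of PaperShift; whether the library itself falls back to an environment variable when `api_key` is None is not stated here.

```python
import os


def get_api_key(env_var: str = "OPENROUTER_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The default variable name is an assumption for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running the conversion")
    return key


def convert_with_env_key(pdf_path: str) -> str:
    """Hypothetical wrapper: convert a PDF with a key taken from the environment."""
    # Imported lazily so this module loads even where papershift is not installed.
    from papershift import convert_pdf_to_markdown

    return convert_pdf_to_markdown(pdf_path=pdf_path, api_key=get_api_key())
```

If the key lives in a `.env` file, calling `load_dotenv()` from python-dotenv (listed under Dependencies) before `get_api_key()` would populate the same variable.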

Image to Markdown

from papershift import convert_image_to_markdown, convert_images_to_markdown

# Convert a single image
markdown_content = convert_image_to_markdown(
    image_path="path/to/your/image.jpg",
    api_key="your-openrouter-api-key"
)

# Convert multiple images with combined output
markdown_content = convert_images_to_markdown(
    image_paths=["image1.jpg", "image2.png", "image3.jpg"],
    output_dir="output_folder",
    api_key="your-openrouter-api-key",
    combined_output=True
)

# Convert multiple images with separate outputs
markdown_files = convert_images_to_markdown(
    image_paths=["image1.jpg", "image2.png", "image3.jpg"],
    output_dir="output_folder",
    api_key="your-openrouter-api-key",
    combined_output=False
)
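
The `image_paths` lists above are written out by hand. For a folder of scans, a small helper along these lines could build the list; `collect_image_paths` and the extension set are hypothetical and not part of PaperShift, and the formats the library actually accepts may differ.

```python
from pathlib import Path

# Extensions treated as images in this sketch; the set PaperShift accepts may differ.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_image_paths(folder: str) -> list[str]:
    """Return image files directly inside `folder`, sorted by file name."""
    paths = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    return [str(p) for p in sorted(paths, key=lambda p: p.name)]
```

The returned list can be passed as `image_paths`. The sort is plain lexicographic, so `page10.png` comes before `page2.png`; zero-padded file names keep page order intact.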

Configuration Options

PDF to Markdown Options

Parameter	Description	Default
pdf_path	Path to the PDF file	(Required)
output_dir	Directory to save the output markdown files	None
dpi	DPI for image rendering	300
target_height_px	Target height in pixels	2048
aspect_threshold	Aspect ratio threshold for height adjustment	1.5
prompt	Text prompt to send with each page image	"Convert this document to markdown"
model	The model to use for processing	"openrouter/google/gemini-2.0-flash-001"
api_key	OpenRouter API key	None
site_url	Optional site URL for OpenRouter	None
app_name	Optional app name for OpenRouter	None
combined_output	If True, returns a single string with all pages combined	True
verbose	If True, prints progress information	False
max_workers	Maximum number of worker processes for PDF conversion	4
batch_size	Number of pages to process in a single batch	5
quality	Image quality (1-100) for JPEG compression in fast mode	95
fast_mode	If True, uses reduced resolution and JPEG format for faster processing	False
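
To make `dpi` and `batch_size` concrete, here is a rough sketch of the arithmetic behind them. It is not PaperShift's implementation: `render_size_px` and `batches` are hypothetical helpers, and the only fixed fact used is that PDF page sizes are measured in points, 72 to the inch.

```python
from typing import Iterator, Sequence


def render_size_px(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel size of a PDF page rendered at `dpi` (72 points per inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def batches(page_numbers: Sequence[int], batch_size: int) -> Iterator[list[int]]:
    """Yield consecutive groups of at most `batch_size` page numbers."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(page_numbers), batch_size):
        yield list(page_numbers[start:start + batch_size])


# A US Letter page (612 x 792 points) at the default 300 DPI.
print(render_size_px(612, 792, 300))      # (2550, 3300)
# Twelve pages with the default batch size of 5.
print(list(batches(list(range(12)), 5)))  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

Rendered pages of this size are large, which is presumably why holding only one batch in memory at a time helps; the exact strategy the library uses is not described here.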

Image to Markdown Options

Parameter	Description	Default
image_path / image_paths	Path to the image file or list of image paths	(Required)
output_dir	Directory to save the output markdown files	None
target_height_px	Target height in pixels	2048
aspect_threshold	Aspect ratio threshold for height adjustment	1.5
prompt	Text prompt to send with each image	"Convert this image to markdown"
model	The model to use for processing	"openrouter/google/gemini-2.0-flash-001"
api_key	OpenRouter API key	None
site_url	Optional site URL for OpenRouter	None
app_name	Optional app name for OpenRouter	None
combined_output	If True, returns a single string with all images combined	True
verbose	If True, prints progress information	False
max_workers	Maximum number of worker processes for parallel processing	4
quality	Image quality (1-100) for JPEG compression in fast mode	95
fast_mode	If True, uses reduced resolution and JPEG format for faster processing	False
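
As an illustration of what a target height means for an input image, the sketch below scales dimensions proportionally so the height matches `target_height_px`. This is ordinary aspect-preserving scaling under the assumption that the target is applied directly to the height; how PaperShift combines it with `aspect_threshold` is not documented here, so that parameter is left out. `scale_to_height` is a hypothetical name.

```python
def scale_to_height(width: int, height: int, target_height_px: int = 2048) -> tuple[int, int]:
    """Return (width, height) scaled so the height equals `target_height_px`.

    The aspect ratio is preserved; the width is rounded to the nearest pixel.
    """
    if width <= 0 or height <= 0 or target_height_px <= 0:
        raise ValueError("dimensions must be positive")
    factor = target_height_px / height
    return max(1, round(width * factor)), target_height_px


# A 3000 x 4000 pixel scan scaled to the default target height.
print(scale_to_height(3000, 4000))  # (1536, 2048)
```

The resulting tuple is in the form that Pillow's `Image.resize` expects, should you want to pre-shrink very large scans yourself.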

Dependencies

PyMuPDF: PDF processing library
Pillow: Image processing library
litellm: LLM API integration
openrouter: API for accessing various AI models
python-dotenv: Environment variable management

Release history

0.1.2: Apr 23, 2025 (this version)
0.1.1: Apr 19, 2025
0.1.0: Apr 19, 2025

Download files

Source Distribution

papershift-0.1.2.tar.gz (15.8 kB)

Uploaded Apr 23, 2025 Source

Built Distribution

papershift-0.1.2-py3-none-any.whl (13.4 kB)

Uploaded Apr 23, 2025 Python 3

File details

Details for the file papershift-0.1.2.tar.gz.

File metadata

Download URL: papershift-0.1.2.tar.gz
Upload date: Apr 23, 2025
Size: 15.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for papershift-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`fec434e1d406da352fc683fdeb41b8e926da5d58f3eacbb0cc6628f134199488`
MD5	`126ffd708de53073e73ea6a38b266cd4`
BLAKE2b-256	`6dcea701dfbc0af76840c811bfbba195f9e9ec2c36e7ae438f7bb15d93c25781`

File details

Details for the file papershift-0.1.2-py3-none-any.whl.

File metadata

Download URL: papershift-0.1.2-py3-none-any.whl
Upload date: Apr 23, 2025
Size: 13.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for papershift-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c5d61531ecad8434e5070f64fb0f0a5ecd87ace49d4391986a2128f24ca62294`
MD5	`296f4c994c78a0f63fb46c7b8650ba7d`
BLAKE2b-256	`3fb33b34e1f5e4aa9ba816281a6f6868d4c37c153b9633e75408ac6c3e76b9db`
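
The SHA256 digests listed above can be used to check a downloaded file before installing it. A minimal sketch with the standard library's `hashlib` follows; the function names are illustrative only.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_sha256("papershift-0.1.2-py3-none-any.whl", "c5d61531ecad8434e5070f64fb0f0a5ecd87ace49d4391986a2128f24ca62294")` should return True for an intact copy of the wheel. pip can perform the same check itself when a requirements file pins hashes and `--require-hashes` is used.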
