Convert PDF documents and images to Markdown format with AI assistance

PaperShift

A Python library for converting PDF documents and images to Markdown format with AI assistance. Shift from scanned documents and images to editable, searchable text.

Features

  • Converts PDF documents to well-formatted Markdown
  • Converts image files (PNG, JPG, etc.) to well-formatted Markdown
  • Processes documents and images in parallel for faster conversion
  • Optimized memory usage with batch processing
  • Fast mode option for quicker processing with lower resolution
  • Detailed progress reporting
  • Customizable AI model selection
  • Adaptive resolution based on output requirements

Installation

pip install papershift

Usage

PDF to Markdown

from papershift import convert_pdf_to_markdown

# Basic usage
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    api_key="your-openrouter-api-key"
)

# Advanced usage with options
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    output_dir="output_folder",
    dpi=300,
    target_height_px=2048,
    model="openrouter/google/gemini-2.0-flash-001",
    api_key="your-openrouter-api-key",
    max_workers=4,
    batch_size=5,
    fast_mode=True
)

# Save the output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
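
The examples above pass the API key as a string literal. To keep the key out of source files, it can be read from the environment instead. The following is a minimal sketch using only the standard library; the helper name `load_api_key` and the variable name `OPENROUTER_API_KEY` are assumptions made for illustration and are not part of PaperShift.

```python
import os


def load_api_key(env_var: str = "OPENROUTER_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption for illustration; the PaperShift
    documentation above does not name one.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned value can then be passed as `api_key=load_api_key()` to `convert_pdf_to_markdown`. Failing early with a clear message avoids sending a request with an empty key.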

Image to Markdown

from papershift import convert_image_to_markdown, convert_images_to_markdown

# Convert a single image
markdown_content = convert_image_to_markdown(
    image_path="path/to/your/image.jpg",
    api_key="your-openrouter-api-key"
)

# Convert multiple images with combined output
markdown_content = convert_images_to_markdown(
    image_paths=["image1.jpg", "image2.png", "image3.jpg"],
    output_dir="output_folder",
    api_key="your-openrouter-api-key",
    combined_output=True
)

# Convert multiple images with separate outputs
markdown_files = convert_images_to_markdown(
    image_paths=["image1.jpg", "image2.png", "image3.jpg"],
    output_dir="output_folder",
    api_key="your-openrouter-api-key",
    combined_output=False
)
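
When the images live in one folder, the `image_paths` list can be built rather than typed by hand. This is a small sketch, not a PaperShift feature: `collect_image_paths` and the set of accepted extensions are assumptions (the feature list only says "PNG, JPG, etc.").

```python
from pathlib import Path

# Extensions treated as images here; an assumption, since the README
# only says "PNG, JPG, etc.".
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_image_paths(folder: str) -> list[str]:
    """Return image files in `folder`, sorted by name for a stable page order."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

The result can be passed as `image_paths=collect_image_paths("scans")`. Sorting matters when `combined_output=True`, because the combined Markdown presumably follows the order of the input list.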

Configuration Options

PDF to Markdown Options

| Parameter | Description | Default |
|---|---|---|
| pdf_path | Path to the PDF file | (Required) |
| output_dir | Directory to save the output markdown files | None |
| dpi | DPI for image rendering | 300 |
| target_height_px | Target height in pixels | 2048 |
| aspect_threshold | Aspect ratio threshold for height adjustment | 1.5 |
| prompt | Text prompt to send with each page image | "Convert this document to markdown" |
| model | The model to use for processing | "openrouter/google/gemini-2.0-flash-001" |
| api_key | OpenRouter API key | None |
| site_url | Optional site URL for OpenRouter | None |
| app_name | Optional app name for OpenRouter | None |
| combined_output | If True, returns a single string with all pages combined | True |
| verbose | If True, prints progress information | False |
| max_workers | Maximum number of worker processes for PDF conversion | 4 |
| batch_size | Number of pages to process in a single batch | 5 |
| quality | Image quality (1-100) for JPEG compression in fast mode | 95 |
| fast_mode | If True, uses reduced resolution and JPEG format for faster processing | False |
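
To make the interplay of `batch_size` and `max_workers` concrete, here is a hedged sketch of how pages could be grouped into batches and each batch processed concurrently. It does not reproduce PaperShift's internals: `make_batches`, `process_in_batches`, and the `convert_page` callable are hypothetical, and the sketch uses threads for simplicity although the table above speaks of worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(page_count: int, batch_size: int = 5) -> list[list[int]]:
    """Split page indices 0..page_count-1 into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pages = list(range(page_count))
    return [pages[i:i + batch_size] for i in range(0, page_count, batch_size)]


def process_in_batches(page_count, convert_page, batch_size=5, max_workers=4):
    """Convert pages batch by batch, running each batch's pages concurrently.

    `convert_page` is a hypothetical callable that takes a page index and
    returns that page's Markdown.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in make_batches(page_count, batch_size):
            # pool.map preserves input order, so pages stay in sequence.
            results.extend(pool.map(convert_page, batch))
    return results
```

Under this scheme only one batch of rendered pages needs to be held in memory at a time, which is the usual motivation for batching: a smaller `batch_size` lowers peak memory, while `max_workers` bounds how many pages are in flight at once.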

Image to Markdown Options

| Parameter | Description | Default |
|---|---|---|
| image_path / image_paths | Path to the image file or list of image paths | (Required) |
| output_dir | Directory to save the output markdown files | None |
| target_height_px | Target height in pixels | 2048 |
| aspect_threshold | Aspect ratio threshold for height adjustment | 1.5 |
| prompt | Text prompt to send with each image | "Convert this image to markdown" |
| model | The model to use for processing | "openrouter/google/gemini-2.0-flash-001" |
| api_key | OpenRouter API key | None |
| site_url | Optional site URL for OpenRouter | None |
| app_name | Optional app name for OpenRouter | None |
| combined_output | If True, returns a single string with all images combined | True |
| verbose | If True, prints progress information | False |
| max_workers | Maximum number of worker processes for parallel processing | 4 |
| quality | Image quality (1-100) for JPEG compression in fast mode | 95 |
| fast_mode | If True, uses reduced resolution and JPEG format for faster processing | False |
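
The table does not spell out how `target_height_px` and `aspect_threshold` interact. One plausible reading, shown here purely as an assumption and not as PaperShift's actual rule, is that images taller than the threshold ratio get a proportionally larger target height so that long pages stay legible. The function name `adjusted_target_height` is hypothetical.

```python
def adjusted_target_height(
    width: int,
    height: int,
    target_height_px: int = 2048,
    aspect_threshold: float = 1.5,
) -> int:
    """Return a target height, enlarged for unusually tall images.

    Assumed rule (not taken from PaperShift): if height/width exceeds
    `aspect_threshold`, scale the target height by the same excess factor.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect = height / width
    if aspect > aspect_threshold:
        return round(target_height_px * aspect / aspect_threshold)
    return target_height_px
```

With the defaults, a 1000 x 1400 image keeps the 2048 px target, while a 1000 x 2000 image (ratio 2.0) would be given a larger one. Check the library source before relying on any specific behavior.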

Dependencies

  • PyMuPDF: PDF processing library
  • Pillow: Image processing library
  • litellm: LLM API integration
  • openrouter: API for accessing various AI models
  • python-dotenv: Environment variable management
