PaperShift

A Python library for converting PDF documents to Markdown format with AI assistance. Shift from scanned documents to editable, searchable text.

Features

  • Converts PDF documents to well-formatted Markdown
  • Processes documents in parallel for faster conversion
  • Optimizes memory usage with batch processing
  • Offers a fast mode for quicker processing at lower resolution
  • Reports detailed progress
  • Supports customizable AI model selection
  • Adapts resolution to output requirements
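
The parallel and batch-processing features can be pictured with a small sketch. This is not PaperShift's implementation: `split_into_batches`, `process_in_batches`, and `handle_page` are hypothetical names, and a thread pool is used here for brevity even though the options table describes worker processes. It only illustrates how a `batch_size` and a `max_workers` setting might interact.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(page_count, batch_size):
    """Return lists of zero-based page indices, at most batch_size per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(range(start, min(start + batch_size, page_count)))
        for start in range(0, page_count, batch_size)
    ]


def process_in_batches(page_count, batch_size, max_workers, handle_page):
    """Apply handle_page to every page index, one batch at a time.

    Only one batch is in flight at any moment, which bounds how many
    page results are being produced simultaneously.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in split_into_batches(page_count, batch_size):
            # pool.map yields results in input order, so pages stay in document order.
            results.extend(pool.map(handle_page, batch))
    return results
```

Under these assumptions, a smaller `batch_size` trades some throughput for a lower peak memory footprint, while `max_workers` caps concurrency within each batch.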

Installation

pip install papershift

Usage

from papershift import convert_pdf_to_markdown

# Basic usage
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    api_key="your-openrouter-api-key"
)

# Advanced usage with options
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    output_dir="output_folder",
    dpi=300,
    target_height_px=2048,
    model="openrouter/google/gemini-2.0-flash-001",
    api_key="your-openrouter-api-key",
    max_workers=4,
    batch_size=5,
    fast_mode=True
)

# Save the output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
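
To keep the API key out of source code, the call can be wrapped in a small helper that reads the key from the environment. The sketch below is an assumption-laden illustration, not part of PaperShift: `build_options` and `convert` are hypothetical helpers, and the variable name `OPENROUTER_API_KEY` is a choice made for this example. Only the `pdf_path` and `api_key` parameter names come from the usage above.

```python
import os
from pathlib import Path


def build_options(pdf_path, env=os.environ, **overrides):
    """Collect keyword arguments for convert_pdf_to_markdown (hypothetical helper)."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got {path.name!r}")
    # OPENROUTER_API_KEY is the variable name assumed for this sketch.
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("set OPENROUTER_API_KEY before converting")
    options = {"pdf_path": str(path), "api_key": api_key}
    options.update(overrides)  # e.g. fast_mode=True, max_workers=4
    return options


def convert(pdf_path, **overrides):
    """Call PaperShift with options assembled by build_options."""
    # Imported lazily so build_options can be used without papershift installed.
    from papershift import convert_pdf_to_markdown

    return convert_pdf_to_markdown(**build_options(pdf_path, **overrides))
```

With the key exported in the shell, `convert("document.pdf", fast_mode=True)` would then behave like the advanced example with the key supplied from the environment.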

Configuration Options

| Parameter | Description | Default |
| --- | --- | --- |
| pdf_path | Path to the PDF file | (Required) |
| output_dir | Directory to save the output markdown files | None |
| dpi | DPI for image rendering | 300 |
| target_height_px | Target height in pixels | 2048 |
| aspect_threshold | Aspect ratio threshold for height adjustment | 1.5 |
| prompt | Text prompt to send with each page image | "Convert this document to markdown" |
| model | The model to use for processing | "openrouter/google/gemini-2.0-flash-001" |
| api_key | OpenRouter API key | None |
| site_url | Optional site URL for OpenRouter | None |
| app_name | Optional app name for OpenRouter | None |
| combined_output | If True, returns a single string with all pages combined | True |
| verbose | If True, prints progress information | False |
| max_workers | Maximum number of worker processes for PDF conversion | 4 |
| batch_size | Number of pages to process in a single batch | 5 |
| quality | Image quality (1-100) for JPEG compression in fast mode | 95 |
| fast_mode | If True, uses reduced resolution and JPEG format for faster processing | False |
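
The relationship between `dpi` and `target_height_px` is easier to judge with a little arithmetic. PDF page sizes are measured in points (72 per inch), so a rendered page's pixel height follows directly from its height in points and the DPI. The sketch below shows only that unit conversion; how PaperShift actually combines `dpi`, `target_height_px`, and `aspect_threshold` is not documented here, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def rendered_height_px(page_height_pt, dpi):
    """Pixel height of a page rendered at the given DPI."""
    return round(page_height_pt * dpi / POINTS_PER_INCH)


def dpi_for_target_height(page_height_pt, target_height_px):
    """DPI at which the page would render at exactly target_height_px."""
    return target_height_px * POINTS_PER_INCH / page_height_pt


LETTER_HEIGHT_PT = 792  # US Letter is 11 inches tall

# At the default dpi of 300, a Letter page is 3300 px tall.
print(rendered_height_px(LETTER_HEIGHT_PT, 300))

# Reaching the default target height of 2048 px needs roughly 186 DPI.
print(round(dpi_for_target_height(LETTER_HEIGHT_PT, 2048), 1))
```

For a Letter page, then, the default target height of 2048 px corresponds to a lower effective resolution than the default 300 DPI.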

Dependencies

  • PyMuPDF: PDF processing library
  • litellm: LLM API integration
  • openrouter: API for accessing various AI models
  • python-dotenv: Environment variable management
