Convert PDF documents to Markdown format with AI assistance

PaperShift

A Python library for converting PDF documents to Markdown format with AI assistance. Shift from scanned documents to editable, searchable text.

Features

  • Converts PDF documents to well-formatted Markdown
  • Processes documents in parallel for faster conversion
  • Reduces memory usage through batch processing
  • Optional fast mode that trades image resolution for speed
  • Detailed progress reporting
  • Customizable AI model selection
  • Adaptive resolution based on output requirements
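The parallel and batch-processing features above can be pictured with a small standalone sketch. This is not PaperShift's implementation: the helper names `batches` and `process_in_batches` are hypothetical, and a thread pool is used here for simplicity even though the library describes its workers as processes. It only illustrates the general idea of handling a fixed number of pages at a time so that memory use stays bounded.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def batches(page_count: int, batch_size: int) -> Iterator[List[int]]:
    """Yield lists of zero-based page indices, at most batch_size per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, page_count, batch_size):
        yield list(range(start, min(start + batch_size, page_count)))


def process_in_batches(
    page_count: int,
    handle_page: Callable[[int], str],
    batch_size: int = 5,
    max_workers: int = 4,
) -> List[str]:
    """Run handle_page over every page, one batch at a time, keeping page order."""
    results: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batches(page_count, batch_size):
            # pool.map returns results in input order, so pages stay in sequence.
            results.extend(pool.map(handle_page, batch))
    return results


if __name__ == "__main__":
    # Stand-in for the real per-page work (rendering plus a model call).
    print(process_in_batches(12, lambda i: f"page {i + 1}"))
```

Only one batch of pages is in flight at any moment, which is the trade-off that a `batch_size` setting typically controls: larger batches keep the workers busier, smaller ones hold fewer page images in memory.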

Installation

pip install papershift

Usage

from papershift import convert_pdf_to_markdown

# Basic usage
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    api_key="your-openrouter-api-key"
)

# Advanced usage with options
markdown_content = convert_pdf_to_markdown(
    pdf_path="path/to/your/document.pdf",
    output_dir="output_folder",
    dpi=300,
    target_height_px=2048,
    model="openrouter/google/gemini-2.0-flash-001",
    api_key="your-openrouter-api-key",
    max_workers=4,
    batch_size=5,
    fast_mode=True
)

# Save the output
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
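Hard-coding the API key as in the examples above is convenient for a first test but easy to leak. One possible alternative, sketched below, reads the key from an environment variable. The variable name `OPENROUTER_API_KEY` and the helpers `build_kwargs` and `convert` are assumptions made for this illustration and are not part of the PaperShift API; only `convert_pdf_to_markdown` and its `pdf_path` and `api_key` parameters come from the documentation above.

```python
import os
from typing import Any, Dict, Optional


def build_kwargs(pdf_path: str, api_key: Optional[str] = None, **options: Any) -> Dict[str, Any]:
    """Assemble keyword arguments, falling back to OPENROUTER_API_KEY for the key."""
    # Assumption: the environment variable name is chosen for this sketch.
    key = api_key or os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key or set OPENROUTER_API_KEY")
    return {"pdf_path": pdf_path, "api_key": key, **options}


def convert(pdf_path: str, **options: Any) -> Any:
    """Call PaperShift with a key taken from the arguments or the environment."""
    # Imported lazily so build_kwargs can be used without the package installed.
    from papershift import convert_pdf_to_markdown

    return convert_pdf_to_markdown(**build_kwargs(pdf_path, **options))
```

With the variable exported in the shell, `convert("path/to/your/document.pdf", fast_mode=True)` would forward the remaining options unchanged.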

Configuration Options

| Parameter | Description | Default |
|---|---|---|
| pdf_path | Path to the PDF file | (Required) |
| output_dir | Directory to save the output markdown files | None |
| dpi | DPI for image rendering | 300 |
| target_height_px | Target height in pixels | 2048 |
| aspect_threshold | Aspect ratio threshold for height adjustment | 1.5 |
| prompt | Text prompt to send with each page image | "Convert this document to markdown" |
| model | The model to use for processing | "openrouter/google/gemini-2.0-flash-001" |
| api_key | OpenRouter API key | None |
| site_url | Optional site URL for OpenRouter | None |
| app_name | Optional app name for OpenRouter | None |
| combined_output | If True, returns a single string with all pages combined | True |
| verbose | If True, prints progress information | False |
| max_workers | Maximum number of worker processes for PDF conversion | 4 |
| batch_size | Number of pages to process in a single batch | 5 |
| quality | Image quality (1-100) for JPEG compression in fast mode | 95 |
| fast_mode | If True, uses reduced resolution and JPEG format for faster processing | False |
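When the same settings are reused across several documents, it can help to collect them in one place and validate them before any API call is made. The sketch below does that with a dataclass whose defaults are copied from the options listed above. `ConversionOptions` is a hypothetical helper, not something PaperShift provides, and the positive-integer checks are an assumption; only the 1-100 range for `quality` is stated in the documentation.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ConversionOptions:
    """Defaults copied from the configuration options; the class is hypothetical."""

    output_dir: Optional[str] = None
    dpi: int = 300
    target_height_px: int = 2048
    aspect_threshold: float = 1.5
    model: str = "openrouter/google/gemini-2.0-flash-001"
    combined_output: bool = True
    verbose: bool = False
    max_workers: int = 4
    batch_size: int = 5
    quality: int = 95
    fast_mode: bool = False

    def __post_init__(self) -> None:
        # The documented range for JPEG quality is 1-100.
        if not 1 <= self.quality <= 100:
            raise ValueError("quality must be between 1 and 100")
        # Assumption: these settings only make sense as positive integers.
        for name in ("dpi", "target_height_px", "max_workers", "batch_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")

    def to_kwargs(self) -> Dict[str, Any]:
        """Return the options as a dict suitable for ** unpacking."""
        return asdict(self)
```

The result can then be unpacked into the call, for example `convert_pdf_to_markdown(pdf_path="doc.pdf", api_key="...", **ConversionOptions(fast_mode=True, quality=80).to_kwargs())`. The `prompt`, `site_url`, and `app_name` parameters are left out of the sketch and can still be passed directly.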

Dependencies

  • PyMuPDF: PDF processing library
  • litellm: LLM API integration
  • openrouter: API for accessing various AI models
  • python-dotenv: Environment variable management

Download files

Source Distribution

papershift-0.1.1.tar.gz (12.3 kB view details)

Uploaded Source

Built Distribution

papershift-0.1.1-py3-none-any.whl (9.5 kB view details)

Uploaded Python 3

File details

Details for the file papershift-0.1.1.tar.gz.

File metadata

  • Download URL: papershift-0.1.1.tar.gz
  • Upload date:
  • Size: 12.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for papershift-0.1.1.tar.gz
Algorithm Hash digest
SHA256 50d23e094bf3457c7763cf416c2107c8f484934c0b4741eecff86ec57be8aa76
MD5 d1e0c1998f522b4ad37404f4281ce688
BLAKE2b-256 18bd7c812bfb1d472bc8e093cef8e532c48f11605403977fb73c1fc39ac9740f

File details

Details for the file papershift-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: papershift-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for papershift-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6e07b2d4b41d064a0e8cf8ab15120da78edc823db64ccce9d30ed223c8d33928
MD5 facf8132b97d49c70f69f4546b8f7f87
BLAKE2b-256 7d194b504e15374bc005705402e29abdf0940b124cca2ee97ad5bf7492a15173
