AI-powered PDF to HTML conversion using mistral-ocr and pixtral-12b

Mistral OCR Processor

A command-line tool that processes PDF documents with Mistral OCR and generates accessible HTML output. It is adapted from a Google Colab notebook so that it can run locally on your machine.

Quick Start

Several scripts are provided to help you get started:

  • mistral_ocr.py - The main script for processing PDFs
  • example.py - An interactive example that guides you through the options
  • test_local_pdf.py - A test script for processing a local PDF file
  • test_url_pdf.py - A test script for processing a PDF from a URL

To run the interactive example:

python example.py

To test with a local PDF:

python test_local_pdf.py

To test with a PDF from a URL:

python test_url_pdf.py

Features

  • Process local PDF files or PDFs downloaded from URLs
  • OCR processing with Mistral OCR
  • Generate accessible alt text for images using Pixtral 12B
  • Convert to accessible HTML designed to be WCAG-compliant
  • Enhance tables with proper accessibility features
  • Save output as HTML file
  • Option to open result in browser
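
As an illustration of what "enhancing tables" can mean in practice, here is a minimal sketch that adds a scope attribute to table header cells so that screen readers can associate headers with data cells. The function name add_header_scope is hypothetical and the regex approach is an assumption made for this example; it is not taken from mistral_ocr.py, which may handle tables differently.

```python
import re

# Hypothetical helper, not part of mistral_ocr.py.
# Matches <th ...> tags that do not already carry a scope attribute.
# The \b after "th" keeps <thead> from matching.
_TH_WITHOUT_SCOPE = re.compile(r"<th\b(?![^>]*\bscope=)([^>]*)>", re.IGNORECASE)


def add_header_scope(html: str) -> str:
    """Add scope="col" to every <th> that does not declare a scope."""
    return _TH_WITHOUT_SCOPE.sub(r'<th scope="col"\1>', html)


table = (
    "<table><thead><tr>"
    '<th>Name</th><th scope="row">Total</th>'
    "</tr></thead></table>"
)
print(add_header_scope(table))
```

Only the first header cell is changed here; the second already declares a scope and is left alone. A real implementation would likely use an HTML parser rather than a regular expression.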

Requirements

  • Python 3.6+
  • Mistral API key
  • Required Python packages:
    • mistralai
    • requests

Installation

  1. Clone the repository:

    git clone https://github.com/mystique920/ai-powered-pdf2html
    cd ai-powered-pdf2html
    
  2. Ensure you have Python 3.6 or later installed

  3. Install required packages using pip:

    pip install -r requirements.txt
    

    Or install packages individually:

    pip install mistralai requests
    
  4. Make the script executable (optional):

    chmod +x mistral_ocr.py
    

Usage

Basic Usage

Process a local PDF file:

python mistral_ocr.py --file path/to/your/document.pdf

Process a PDF from a URL:

python mistral_ocr.py --url https://example.com/document.pdf
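
The two input modes above can be pictured with the following sketch, which returns a local path for the PDF and downloads it first when a URL is given. The function name resolve_pdf is hypothetical, and the sketch uses urllib from the standard library to stay self-contained; the project lists requests as a dependency, so the actual script presumably downloads with that instead.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional


def resolve_pdf(file: Optional[str] = None, url: Optional[str] = None) -> Path:
    """Return a local path to the PDF (hypothetical helper, not from the script)."""
    # Assumption: exactly one of the two sources must be given.
    if (file is None) == (url is None):
        raise ValueError("Provide exactly one of file or url")

    if file is not None:
        path = Path(file)
        if not path.is_file():
            raise FileNotFoundError(f"PDF not found: {path}")
        return path

    # Download the remote PDF into a temporary file that outlives this call.
    with urllib.request.urlopen(url, timeout=30) as response:
        data = response.read()
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
    return Path(tmp.name)
```

Calling resolve_pdf(file="path/to/your/document.pdf") mirrors the --file case, and resolve_pdf(url="https://example.com/document.pdf") mirrors the --url case.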

Command-line Options

  • --file, -f: Path to local PDF file
  • --url, -u: URL to PDF file
  • --api-key, -k: Mistral API key
  • --output, -o: Output HTML file path (default: output.html)
  • --max-images, -m: Maximum number of images to process (default: all)
  • --open-browser, -b: Open the output HTML in browser after processing
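
For readers who want to see how the options fit together, here is a sketch of an argparse parser that accepts the same flags and defaults listed above. It is an illustration, not the parser from mistral_ocr.py; in particular, treating --file and --url as mutually exclusive and required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Convert a PDF to accessible HTML (sketch)."
    )
    # Assumption: exactly one input source is required.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", "-f", help="Path to local PDF file")
    source.add_argument("--url", "-u", help="URL to PDF file")
    parser.add_argument("--api-key", "-k", help="Mistral API key")
    parser.add_argument("--output", "-o", default="output.html",
                        help="Output HTML file path")
    # None stands for "process all images".
    parser.add_argument("--max-images", "-m", type=int, default=None,
                        help="Maximum number of images to process")
    parser.add_argument("--open-browser", "-b", action="store_true",
                        help="Open the output HTML in browser after processing")
    return parser


args = build_parser().parse_args(["--file", "document.pdf", "--max-images", "5"])
print(args.file, args.output, args.max_images, args.open_browser)
# prints: document.pdf output.html 5 False
```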

Examples

Process a local file with a custom API key and open in browser:

python mistral_ocr.py --file document.pdf --api-key YOUR_API_KEY --open-browser

Process a PDF from URL and save to a custom output file:

python mistral_ocr.py --url https://example.com/document.pdf --output result.html

Process a file but limit image processing to 5 images:

python mistral_ocr.py --file document.pdf --max-images 5

API Key Setup

The script requires a valid Mistral API key. You can provide it in one of two ways:

  1. Create a .env file in the same directory as the script with the following content:

    MISTRAL_API_KEY=your_api_key_here
    
  2. Provide the API key directly using the --api-key command-line argument:

    python mistral_ocr.py --file document.pdf --api-key your_api_key_here
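
    The lookup described in this section could be sketched as follows. The helper names read_env_file and resolve_api_key are hypothetical, and the precedence shown (command-line argument first, then the process environment, then the .env file) is an assumption; the actual script may resolve the key in a different order or use a library such as python-dotenv.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines from a .env file (hypothetical helper)."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(cli_key: Optional[str], env_path: Path = Path(".env")) -> str:
    """Return the API key, preferring the CLI value (assumed precedence)."""
    key = (
        cli_key
        or os.environ.get("MISTRAL_API_KEY")
        or read_env_file(env_path).get("MISTRAL_API_KEY")
    )
    if not key:
        raise SystemExit(
            "No Mistral API key found; pass --api-key or set MISTRAL_API_KEY in .env"
        )
    return key
```

    With this sketch, resolve_api_key("your_api_key_here") returns the value passed on the command line, while resolve_api_key(None) falls back to the environment and then to the .env file.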
    

Notes

  • Processing large PDFs may take some time
  • Image alt text generation uses the Pixtral 12B model
  • The HTML output is designed to be WCAG-compliant for accessibility
  • Use --max-images to limit the number of images processed and reduce API usage
  • This application is based on a public Google Colab notebook; the original code is at https://github.com/coldplazma/Accessible-OCR-Mistral-
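
To make the note about limiting images concrete, here is a sketch of how a cap on alt-text generation might work. The function build_img_tags and the describe callback are hypothetical; describe stands in for whatever call asks Pixtral 12B for a description. The generic placeholder used for images beyond the limit is also an assumption, since the README does not say what the script does with them.

```python
from html import escape
from typing import Callable, Iterable, List, Optional


def build_img_tags(
    image_srcs: Iterable[str],
    describe: Callable[[str], str],
    max_images: Optional[int] = None,
) -> List[str]:
    """Build <img> tags, calling describe() for at most max_images images."""
    srcs = list(image_srcs)
    # None means "describe every image", matching the documented default.
    limit = len(srcs) if max_images is None else max(0, max_images)
    tags = []
    for index, src in enumerate(srcs):
        if index < limit:
            alt = describe(src)  # one model call per described image
        else:
            alt = f"Image {index + 1}"  # assumed placeholder, no model call
        tags.append(
            f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'
        )
    return tags


tags = build_img_tags(["a.png", "b.png", "c.png"], lambda s: f"Chart from {s}", max_images=2)
print("\n".join(tags))
```

Escaping the alt text matters because model output can contain quotes or angle brackets that would otherwise break the attribute.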
