AI-powered PDF to HTML conversion using mistral-ocr and pixtral-12b

PDF2HTML AI

A Python package for converting PDF documents to accessible HTML using Mistral OCR and Pixtral 12B. This tool processes PDFs and generates WCAG-compliant HTML output with enhanced accessibility features.

Features

  • Process local PDF files or download from URLs
  • OCR processing with Mistral OCR
  • Generate accessible alt text for images using Pixtral 12B
  • Convert to WCAG-compliant accessible HTML
  • Enhance tables with proper accessibility features
  • Save output as HTML file
  • Option to open result in browser
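To make the table feature more concrete, here is a minimal sketch of what accessible table markup can look like: a `<caption>` plus header cells carrying `scope` attributes. The function name `build_accessible_table` and its arguments are hypothetical and are not part of pdf2html-ai; the package's own table handling may differ.

```python
from html import escape


def build_accessible_table(caption, headers, rows):
    """Render a simple data table with a caption and scoped header cells.

    Hypothetical helper for illustration only. Assumes every row has at
    least one cell and treats the first cell of each row as a row header.
    """
    parts = [
        "<table>",
        f"  <caption>{escape(caption)}</caption>",
        "  <thead>",
        "    <tr>",
    ]
    for header in headers:
        # scope="col" tells assistive technology this cell labels a column.
        parts.append(f'      <th scope="col">{escape(str(header))}</th>')
    parts += ["    </tr>", "  </thead>", "  <tbody>"]
    for row in rows:
        first, *rest = row
        parts.append("    <tr>")
        # scope="row" marks the first cell as the label for its row.
        parts.append(f'      <th scope="row">{escape(str(first))}</th>')
        for cell in rest:
            parts.append(f"      <td>{escape(str(cell))}</td>")
        parts.append("    </tr>")
    parts += ["  </tbody>", "</table>"]
    return "\n".join(parts)


table_html = build_accessible_table(
    "Quarterly totals", ["Region", "Total"], [["North", 10], ["South", 12]]
)
print(table_html)
```

Escaping every cell keeps OCR text that contains `<` or `&` from breaking the markup.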

Requirements

  • Python 3.10+
  • Mistral API key
  • Required Python packages (automatically installed with pip):
    • mistralai
    • requests
    • python-dotenv
    • PyPDF2

Installation

Option 1: Install from PyPI (Recommended)

pip install pdf2html-ai

Option 2: Install from Repository

  1. Clone the repository:

    git clone https://github.com/mystique920/ai-powered-pdf2html
    cd ai-powered-pdf2html
    
  2. Install required packages:

    pip install -r requirements.txt
    

Quick Start

After installing the package, you can use it in your Python code or via the command line.

Using as a Python Package

from pdf2html_ai import process_pdf_with_ocr, convert_ocr_to_accessible_html
from mistralai import Mistral

# Initialize Mistral client
client = Mistral(api_key="your_api_key_here")

# Process a local PDF file
with open("document.pdf", "rb") as f:
    file_content = f.read()
    
ocr_result = process_pdf_with_ocr(client, file_content, "document.pdf")
html_content = convert_ocr_to_accessible_html(client, ocr_result)

# Save the HTML output
with open("output.html", "w", encoding="utf-8") as f:
    f.write(html_content)
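
Before handing bytes to `process_pdf_with_ocr`, it can help to check that the input really is a PDF and to pick a file name when the bytes came from a URL. The sketch below uses only the standard library; `looks_like_pdf` and `filename_from_url` are hypothetical helpers, not functions exported by pdf2html-ai.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# PDF files begin with this marker, followed by the version (e.g. "1.7").
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def filename_from_url(url: str, default: str = "document.pdf") -> str:
    """Take the last path segment of a URL, or fall back to a default name."""
    name = PurePosixPath(urlparse(url).path).name
    return name or default


print(looks_like_pdf(b"%PDF-1.7\n"))
print(filename_from_url("https://example.com/reports/document.pdf"))
```

A caller could run these checks on `file_content` first and skip the OCR request, and the API usage that comes with it, when the input is clearly not a PDF.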

Using the Command Line

Process a local PDF file:

python -m pdf2html_ai.processor --file path/to/your/document.pdf

Process a PDF from a URL:

python -m pdf2html_ai.processor --url https://example.com/document.pdf

Example Scripts

Several example scripts are provided to help you get started:

  • examples/example.py - An interactive example that guides you through the options
  • tests/test_local_pdf.py - A test script for processing a local PDF file
  • tests/test_url_pdf.py - A test script for processing a PDF from a URL

To run the interactive example:

python examples/example.py

Command-line Options

  • --file, -f: Path to local PDF file
  • --url, -u: URL to PDF file
  • --api-key, -k: Mistral API key
  • --output, -o: Output HTML file path (default: output.html)
  • --max-images, -m: Maximum number of images to process (default: all)
  • --open-browser, -b: Open the output HTML in browser after processing
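
For readers who want to wrap or script the tool, the options above map naturally onto `argparse`. This is a sketch that mirrors the documented flags, not the package's actual parser. In particular, treating `--file` and `--url` as mutually exclusive and required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf2html_ai.processor",
        description="Convert a PDF to accessible HTML.",
    )
    # Assumption: exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", "-f", help="Path to local PDF file")
    source.add_argument("--url", "-u", help="URL to PDF file")
    parser.add_argument("--api-key", "-k", help="Mistral API key")
    parser.add_argument(
        "--output", "-o", default="output.html", help="Output HTML file path"
    )
    # None stands for "process all images", matching the documented default.
    parser.add_argument(
        "--max-images", "-m", type=int, default=None,
        help="Maximum number of images to process",
    )
    parser.add_argument(
        "--open-browser", "-b", action="store_true",
        help="Open the output HTML in browser after processing",
    )
    return parser


args = build_parser().parse_args(["--file", "document.pdf", "-m", "5", "-b"])
print(args)
```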

Examples

Process a local file with a custom API key and open in browser:

python -m pdf2html_ai.processor --file document.pdf --api-key YOUR_API_KEY --open-browser

Process a PDF from URL and save to a custom output file:

python -m pdf2html_ai.processor --url https://example.com/document.pdf --output result.html

Process a file but limit image processing to 5 images:

python -m pdf2html_ai.processor --file document.pdf --max-images 5
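
The effect of `--max-images` can be pictured as truncating the list of extracted images before any alt-text requests are made. The helper below is a hypothetical sketch of that idea, assuming the first N images in document order are kept. The package may select images differently.

```python
from itertools import islice


def limit_images(images, max_images=None):
    """Return at most max_images items; None means keep everything."""
    if max_images is None:
        return list(images)
    if max_images < 0:
        raise ValueError("max_images must be non-negative")
    # islice stops early, so a large image iterator is not fully consumed.
    return list(islice(images, max_images))


image_ids = [f"img-{i}" for i in range(8)]
print(limit_images(image_ids, 5))
```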

API Key Setup

The script requires a valid Mistral API key to function. There are two ways to provide the API key:

  1. Create a .env file in your project directory with the following content:

    MISTRAL_API_KEY=your_api_key_here
    
  2. Provide the API key directly using the --api-key command-line argument:

    python -m pdf2html_ai.processor --file document.pdf --api-key your_api_key_here
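
When calling the package from your own code, the same two sources can be combined in a small lookup. The sketch below is an assumption about a sensible precedence (an explicit value wins over the `MISTRAL_API_KEY` environment variable); the source does not say how the tool itself orders them. `resolve_api_key` is a hypothetical helper. With python-dotenv installed, calling `load_dotenv()` beforehand would load values from a `.env` file into the environment.

```python
import os


def resolve_api_key(cli_value=None, environ=None):
    """Pick the API key from an explicit value, else from MISTRAL_API_KEY."""
    env = os.environ if environ is None else environ
    key = cli_value or env.get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError(
            "No Mistral API key found: pass one explicitly or set MISTRAL_API_KEY"
        )
    return key


# Passing a mapping instead of os.environ keeps the example self-contained.
print(resolve_api_key(None, {"MISTRAL_API_KEY": "your_api_key_here"}))
```

Failing early with a clear message avoids a less obvious authentication error later in the OCR request.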
    

Notes

  • Processing large PDFs may take some time
  • Image alt text generation uses the Pixtral 12B model
  • The HTML output is designed to be WCAG-compliant for accessibility
  • You can limit the number of images processed to save API usage
  • The code for this application is based on a public Google Colab notebook
  • The original code is from this repository: https://github.com/coldplazma/Accessible-OCR-Mistral-
  • Most of the modifications and extensions to this tool were made with AI tools. Use it at your own risk.
