AI-powered PDF to HTML conversion using mistral-ocr and pixtral-12b

PDF2HTML AI

A Python package for converting PDF documents to accessible HTML using Mistral OCR and Pixtral 12B. It runs OCR on each PDF and generates HTML designed to be WCAG-compliant, with alt text for images and accessible table markup.

Features

  • Process local PDF files or download from URLs
  • OCR processing with Mistral OCR
  • Generate accessible alt text for images using Pixtral 12B
  • Convert to WCAG-compliant accessible HTML
  • Enhance tables with proper accessibility features
  • Save output as HTML file
  • Option to open result in browser

Requirements

  • Python 3.10+
  • Mistral API key
  • Required Python packages (automatically installed with pip):
    • mistralai
    • requests
    • python-dotenv
    • PyPDF2

Installation

Option 1: Install from PyPI (Recommended)

pip install pdf2html-ai

Option 2: Install from Repository

  1. Clone the repository:

    git clone https://github.com/mystique920/ai-powered-pdf2html
    cd ai-powered-pdf2html
    
  2. Install required packages:

    pip install -r requirements.txt
    

Quick Start

After installing the package, you can use it in your Python code or via the command line.

Using as a Python Package

from pdf2html_ai import process_pdf_with_ocr, convert_ocr_to_accessible_html
from mistralai import Mistral

# Initialize Mistral client
client = Mistral(api_key="your_api_key_here")

# Process a local PDF file
with open("document.pdf", "rb") as f:
    file_content = f.read()
    
ocr_result = process_pdf_with_ocr(client, file_content, "document.pdf")
html_content = convert_ocr_to_accessible_html(client, ocr_result)

# Save the HTML output
with open("output.html", "w", encoding="utf-8") as f:
    f.write(html_content)
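
The example above reads a local file. For a PDF hosted at a URL, the same two functions can be fed downloaded bytes instead. The following is a minimal sketch using only the standard library; the helper names (filename_from_url, fetch_pdf, looks_like_pdf, convert_url) are hypothetical and not part of pdf2html_ai, and the package's own URL handling may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url: str, default: str = "document.pdf") -> str:
    """Use the last path segment of the URL as the file name (hypothetical helper)."""
    name = PurePosixPath(urlparse(url).path).name
    # Fall back to a generic name when the URL does not end in a .pdf file name.
    return name if name.lower().endswith(".pdf") else default


def fetch_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Download the document and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files normally begin with the '%PDF-' marker."""
    return data.startswith(b"%PDF-")


def convert_url(client, url: str) -> str:
    """Download a PDF and convert it, calling the package as in the example above."""
    # Imported here so the helpers above can be used without the package installed.
    from pdf2html_ai import process_pdf_with_ocr, convert_ocr_to_accessible_html

    file_content = fetch_pdf(url)
    if not looks_like_pdf(file_content):
        raise ValueError(f"{url} did not return a PDF")
    ocr_result = process_pdf_with_ocr(client, file_content, filename_from_url(url))
    return convert_ocr_to_accessible_html(client, ocr_result)
```

Checking the first bytes before calling the OCR step avoids spending API usage on an HTML error page returned in place of the PDF.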

Using the Command Line

After installing the package, you can use the pdf2html command directly:

Process a local PDF file:

pdf2html --file path/to/your/document.pdf

Process a PDF from a URL:

pdf2html --url https://example.com/document.pdf

Alternatively, you can run the module directly:

python -m pdf2html_ai.processor --file path/to/your/document.pdf

Example Scripts

Several example scripts are provided to help you get started:

  • examples/example.py - An interactive example that guides you through the options
  • tests/test_local_pdf.py - A test script for processing a local PDF file
  • tests/test_url_pdf.py - A test script for processing a PDF from a URL

To run the interactive example:

python examples/example.py

Command-line Options

  • --file, -f: Path to local PDF file
  • --url, -u: URL to PDF file
  • --api-key, -k: Mistral API key
  • --output, -o: Output HTML file path (default: output.html)
  • --max-images, -m: Maximum number of images to process (default: all)
  • --open-browser, -b: Open the output HTML in browser after processing
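
The options above map naturally onto an argparse parser. The sketch below is an illustration of the documented flags and defaults, not the package's actual source; treating --file and --url as mutually exclusive, and representing "all images" as None, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="pdf2html",
        description="Convert a PDF to accessible HTML.",
    )

    # Assumption: exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", "-f", help="Path to local PDF file")
    source.add_argument("--url", "-u", help="URL to PDF file")

    parser.add_argument("--api-key", "-k", help="Mistral API key")
    parser.add_argument("--output", "-o", default="output.html",
                        help="Output HTML file path")
    # Assumption: None stands for "process all images".
    parser.add_argument("--max-images", "-m", type=int, default=None,
                        help="Maximum number of images to process")
    parser.add_argument("--open-browser", "-b", action="store_true",
                        help="Open the output HTML in browser after processing")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```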

Examples

Process a local file with a custom API key and open in browser:

pdf2html --file document.pdf --api-key YOUR_API_KEY --open-browser

Process a PDF from URL and save to a custom output file:

pdf2html --url https://example.com/document.pdf --output result.html

Process a file but limit image processing to 5 images:

pdf2html --file document.pdf --max-images 5

API Key Setup

The script requires a valid Mistral API key to function. There are two ways to provide the API key:

  1. Create a .env file in your project directory with the following content:

    MISTRAL_API_KEY=your_api_key_here
    
  2. Provide the API key directly using the --api-key command-line argument:

    pdf2html --file document.pdf --api-key your_api_key_here
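
When calling the package from your own Python code, the two sources above can be combined in one small helper. This is a sketch under assumptions: the function name resolve_api_key is hypothetical, and giving the command-line value precedence over MISTRAL_API_KEY is a guess at a sensible order, not documented behavior. If you keep the key in a .env file, calling python-dotenv's load_dotenv() first loads it into the environment.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Mistral API key from an explicit value or the environment.

    Assumption: an explicitly passed key wins over MISTRAL_API_KEY.
    """
    env = os.environ if env is None else env
    key = (cli_value or env.get("MISTRAL_API_KEY", "")).strip()
    if not key:
        raise ValueError(
            "No Mistral API key found: pass --api-key or set MISTRAL_API_KEY"
        )
    return key
```

The returned value can then be passed to Mistral(api_key=...) as in the Python example earlier in this README.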
    

Notes

  • Processing large PDFs may take some time
  • Image alt text generation uses the Pixtral 12B model
  • The HTML output is designed to be WCAG-compliant for accessibility
  • You can limit the number of images processed to save API usage
  • The code for this application is based on a public Google Colab notebook
  • The original code is from this repository: https://github.com/coldplazma/Accessible-OCR-Mistral-
  • Most modifications and extensions to this tool were made using AI tools; use it at your own risk

