AI-powered PDF to HTML conversion using mistral-ocr and pixtral-12b

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

PDF2HTML AI

PDF2HTML AI is a Python package that converts PDF documents to accessible HTML using Mistral OCR and Pixtral 12B. It processes PDFs and generates HTML output that is designed to be WCAG-compliant, with enhanced accessibility features.

Features

- Process local PDF files or download from URLs
- OCR processing with Mistral OCR
- Generate accessible alt text for images using Pixtral 12B
- Convert to WCAG-compliant accessible HTML
- Enhance tables with proper accessibility features
- Save output as HTML file
- Option to open result in browser
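
To make the table-accessibility feature more concrete, here is a minimal sketch of the kind of markup an accessible data table usually carries: a caption plus header cells with a `scope` attribute. The function `render_accessible_table` is hypothetical and written only for illustration; it is not the package's implementation, and the markup the tool actually emits may differ.

```python
from html import escape


def render_accessible_table(caption, headers, rows):
    """Render a data table with a caption and scoped header cells.

    Hypothetical helper for illustration; not part of pdf2html_ai.
    """
    head = "".join(f'<th scope="col">{escape(h)}</th>' for h in headers)
    body_rows = []
    for row in rows:
        # Assumption: the first cell of each row acts as that row's header.
        first, *rest = row
        cells = f'<th scope="row">{escape(first)}</th>'
        cells += "".join(f"<td>{escape(c)}</td>" for c in rest)
        body_rows.append(f"<tr>{cells}</tr>")
    return (
        "<table>"
        f"<caption>{escape(caption)}</caption>"
        f"<thead><tr>{head}</tr></thead>"
        f"<tbody>{''.join(body_rows)}</tbody>"
        "</table>"
    )


print(render_accessible_table("Totals", ["Item", "Count"], [["Pages", "12"]]))
```

Escaping every cell with `html.escape` keeps OCR text that contains `<` or `&` from breaking the generated markup.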

Requirements

- Python 3.10+
- Mistral API key
- Required Python packages (installed automatically by pip):
  - mistralai
  - requests
  - python-dotenv
  - PyPDF2

Installation

Option 1: Install from PyPI (Recommended)

```
pip install pdf2html-ai
```

Option 2: Install from Repository

Clone the repository:

```
git clone https://github.com/mystique920/ai-powered-pdf2html
cd ai-powered-pdf2html
```

Install required packages:
```
pip install -r requirements.txt
```

Quick Start

After installing the package, you can use it in your Python code or via the command line.

Using as a Python Package

```
from pdf2html_ai import process_pdf_with_ocr, convert_ocr_to_accessible_html
from mistralai import Mistral

# Initialize Mistral client
client = Mistral(api_key="your_api_key_here")

# Process a local PDF file
with open("document.pdf", "rb") as f:
    file_content = f.read()

ocr_result = process_pdf_with_ocr(client, file_content, "document.pdf")
html_content = convert_ocr_to_accessible_html(client, ocr_result)

# Save the HTML output
with open("output.html", "w", encoding="utf-8") as f:
    f.write(html_content)
```
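
The example above reads a local file into bytes before handing it to `process_pdf_with_ocr`. As a rough sketch of how the same bytes could be obtained from either a local path or a URL, the hypothetical helper below uses only the standard library. The name `load_pdf_bytes`, the 30-second timeout, and the signature check are assumptions made for illustration and are not part of the package's API.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# PDF files normally start with this signature.
PDF_MAGIC = b"%PDF-"


def load_pdf_bytes(source, timeout=30):
    """Return the raw bytes of a PDF given a local path or an http(s) URL.

    Hypothetical helper for illustration; not part of pdf2html_ai.
    """
    if urlparse(source).scheme in ("http", "https"):
        with urlopen(source, timeout=timeout) as response:
            data = response.read()
    else:
        data = Path(source).read_bytes()
    # Fail early if the content does not look like a PDF.
    if not data.startswith(PDF_MAGIC):
        raise ValueError(f"{source!r} does not look like a PDF")
    return data
```

The returned bytes could then be passed as `file_content` in the example above, for instance `load_pdf_bytes("document.pdf")`.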

Using the Command Line

Once installed, the converter is also available as a command-line tool.

Process a local PDF file:

```
pdf2html --file path/to/your/document.pdf
```

Process a PDF from a URL:

```
pdf2html --url https://example.com/document.pdf
```

Alternatively, you can run the module directly:

```
python -m pdf2html_ai.processor --file path/to/your/document.pdf
```

Example Scripts

Several example scripts are provided to help you get started:

- examples/example.py: An interactive example that guides you through the options
- tests/test_local_pdf.py: A test script for processing a local PDF file
- tests/test_url_pdf.py: A test script for processing a PDF from a URL

To run the interactive example:

```
python examples/example.py
```

Command-line Options

- --file, -f: Path to local PDF file
- --url, -u: URL to PDF file
- --api-key, -k: Mistral API key
- --output, -o: Output HTML file path (default: output.html)
- --max-images, -m: Maximum number of images to process (default: all)
- --open-browser, -b: Open the output HTML in browser after processing
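
For readers who want to wrap or script the tool, the options listed above could be modelled with `argparse` roughly as follows. This is a sketch that mirrors the documented flags, not the package's actual parser. Treating `--file` and `--url` as mutually exclusive, and representing "all images" as `None`, are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf2html",
        description="Convert a PDF to accessible HTML.",
    )
    # Assumption: exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", "-f", help="Path to local PDF file")
    source.add_argument("--url", "-u", help="URL to PDF file")
    parser.add_argument("--api-key", "-k", help="Mistral API key")
    parser.add_argument("--output", "-o", default="output.html",
                        help="Output HTML file path")
    # Assumption: None stands for "process all images".
    parser.add_argument("--max-images", "-m", type=int, default=None,
                        help="Maximum number of images to process")
    parser.add_argument("--open-browser", "-b", action="store_true",
                        help="Open the output HTML in browser after processing")
    return parser


args = build_parser().parse_args(["--file", "document.pdf", "--max-images", "5"])
print(args.file, args.output, args.max_images, args.open_browser)
```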

Examples

Process a local file with a custom API key and open in browser:

```
pdf2html --file document.pdf --api-key YOUR_API_KEY --open-browser
```

Process a PDF from URL and save to a custom output file:

```
pdf2html --url https://example.com/document.pdf --output result.html
```

Process a file but limit image processing to 5 images:

```
pdf2html --file document.pdf --max-images 5
```
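
To illustrate what an image limit can mean in practice, the sketch below requests a description only for the first `max_images` images and attaches it as escaped alt text. Everything here is hypothetical: `build_img_tags` is not a function of the package, `describe` stands in for whatever call produces the Pixtral 12B description, and giving images past the limit an empty `alt` is an assumption, not confirmed behavior of the tool.

```python
from html import escape


def build_img_tags(image_sources, describe, max_images=None):
    """Build <img> tags, describing at most max_images images.

    Hypothetical helper for illustration; not part of pdf2html_ai.
    """
    tags = []
    for index, src in enumerate(image_sources):
        within_limit = max_images is None or index < max_images
        # Assumption: images past the limit are left with an empty alt.
        alt = describe(src) if within_limit else ""
        tags.append(
            f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'
        )
    return tags


# Stand-in for a model call: returns a fixed description per image.
tags = build_img_tags(["fig1.png", "fig2.png"], lambda src: f"Figure from {src}", max_images=1)
print("\n".join(tags))
```

An empty `alt` tells assistive technology to skip the image, so a limit of this kind trades API usage against how much of the document's visual content is described.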

API Key Setup

The script requires a valid Mistral API key to function. There are two ways to provide the API key:

Create a .env file in your project directory with the following content:
```
MISTRAL_API_KEY=your_api_key_here
```
Provide the API key directly using the --api-key command-line argument:
```
pdf2html-ai --file document.pdf --api-key your_api_key_here
```
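
The two options above can be combined in code along these lines. This is a sketch only: `resolve_api_key` is a hypothetical helper, and letting the command-line value take precedence over `MISTRAL_API_KEY` is an assumption about the tool, not documented behavior. When a `.env` file is used, calling `load_dotenv()` from python-dotenv beforehand copies its entries into the process environment, which is where this function looks.

```python
import os


def resolve_api_key(cli_key=None, environ=None):
    """Pick the API key from an explicit argument or the environment.

    Hypothetical helper for illustration; not part of pdf2html_ai.
    """
    environ = os.environ if environ is None else environ
    # Assumption: an explicitly passed key wins over the environment variable.
    key = cli_key or environ.get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError(
            "No Mistral API key found; set MISTRAL_API_KEY or pass --api-key"
        )
    return key
```

Passing `environ` explicitly, for example `resolve_api_key(None, {"MISTRAL_API_KEY": "abc"})`, makes the lookup easy to exercise without touching the real environment.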

Notes

- Processing large PDFs may take some time.
- Image alt text generation uses the Pixtral 12B model.
- The HTML output is designed to be WCAG-compliant for accessibility.
- You can limit the number of images processed to save API usage.
- The code for this application is based on a public Google Colab notebook.
- The original code is from this repository: https://github.com/coldplazma/Accessible-OCR-Mistral-
- This tool was mostly modified and extended using AI tools. Use this tool at your own risk.

Release history

- 0.1.3 (Mar 14, 2025), this version
- 0.1.1 (Mar 14, 2025)
- 0.1.0 (Mar 14, 2025)
