Hybrid OCR with Gemini and Document AI

Gemini OCR

gemini-ocr

Traceable Generative Markdown for PDFs

Gemini OCR is a library designed to convert PDF documents into clean, semantic Markdown while maintaining precise traceability back to the source coordinates. It bridges the gap between the readability of Generative AI (Gemini, Document AI Chunking) and the grounded accuracy of traditional OCR (Google Document AI).

Key Features

  • Generative Markdown: Uses Google's Gemini models (gemini-2.5-flash by default) or Document AI Layout models to generate human-readable Markdown with proper structure (headers, tables, lists).
  • Precision Traceability: Aligns the generated Markdown text back to the original PDF coordinates using detailed OCR data from Google Document AI.
  • Reverse-Alignment Algorithm: Implements a robust "reverse-alignment" strategy that starts with the readable text and finds the corresponding bounding boxes, ensuring the Markdown is the ground truth for content.
  • Confidence Metrics: (New) Includes coverage metrics to quantify how much of the Markdown content is successfully backed by OCR data.
  • Pagination Support: Automatically handles PDF page splitting and merging logic.
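
To make the idea of a coverage metric concrete, here is a minimal sketch of one way such a number could be computed: the fraction of non-whitespace Markdown characters that fall inside at least one aligned span. The `AlignedSpan` type and `coverage_fraction` function are hypothetical names for illustration only; the library's own definition of coverage may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedSpan:
    """Hypothetical record: a half-open [start, end) character range of the
    Markdown text that was matched to an OCR bounding box."""
    start: int
    end: int


def coverage_fraction(markdown: str, spans: list[AlignedSpan]) -> float:
    """Return the fraction of non-whitespace Markdown characters that are
    covered by at least one span (0.0 for empty or whitespace-only text)."""
    covered = [False] * len(markdown)
    for span in spans:
        # Clamp each span to the text so out-of-range spans are harmless.
        for i in range(max(span.start, 0), min(span.end, len(markdown))):
            covered[i] = True

    total = sum(1 for ch in markdown if not ch.isspace())
    if total == 0:
        return 0.0
    hit = sum(1 for i, ch in enumerate(markdown) if covered[i] and not ch.isspace())
    return hit / total


if __name__ == "__main__":
    text = "# Title\n\nBody text"
    # Only the word "Title" (characters 2..6) is backed by OCR here.
    print(f"{coverage_fraction(text, [AlignedSpan(2, 7)]):.2%}")
```

Counting only non-whitespace characters keeps the figure from being inflated or deflated by Markdown layout, which has no counterpart in the OCR output.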

Architecture

The library processes documents in two parallel streams:

  1. Semantic Stream: The PDF is sent to a Generative AI model (e.g., Gemini 2.5 Flash) to produce a clean Markdown representation.
  2. Positional Stream: The PDF is sent to Google Document AI to extract raw bounding boxes and text segments.
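
Because the two streams are independent, they can be requested concurrently. The sketch below illustrates that pattern with `asyncio.gather`; `generate_markdown`, `extract_ocr_segments`, and `run_streams` are hypothetical stand-ins that return canned data, not functions from this library.

```python
import asyncio

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1), illustrative only


async def generate_markdown(pdf_bytes: bytes) -> str:
    """Stand-in for the semantic stream (a generative model call)."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return "# Title\n\nBody text"


async def extract_ocr_segments(pdf_bytes: bytes) -> list[tuple[str, BBox]]:
    """Stand-in for the positional stream: (text, bounding box) pairs."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return [
        ("Title", (0.10, 0.10, 0.40, 0.15)),
        ("Body text", (0.10, 0.20, 0.60, 0.25)),
    ]


async def run_streams(pdf_bytes: bytes) -> tuple[str, list[tuple[str, BBox]]]:
    # Neither request depends on the other, so await both concurrently.
    markdown, segments = await asyncio.gather(
        generate_markdown(pdf_bytes),
        extract_ocr_segments(pdf_bytes),
    )
    return markdown, segments


if __name__ == "__main__":
    md, segs = asyncio.run(run_streams(b"%PDF-"))
    print(md)
    print(segs)
```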

These two streams are then merged using a custom alignment engine (seq_smith + bbox_alignment.py) which:

  1. Normalizes both text sources.
  2. Identifies "anchor" matches: segments that correspond unambiguously in both sources and can be aligned reliably.
  3. Computes a global alignment using the anchors to constrain the search space.
  4. Identifies significant gaps or mismatches.
  5. Recursively re-aligns mismatched regions until a high-quality alignment is achieved.
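
The steps above can be sketched in a few lines. This is not the library's algorithm: it swaps in `difflib.SequenceMatcher` from the standard library for the seq_smith-based scoring, and treats the longest shared run of characters as the anchor. The `normalize` and `align` functions and the `min_anchor` parameter are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace common Markdown punctuation with spaces,
    and collapse runs of whitespace."""
    text = re.sub(r"[#*_`|>-]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def align(a: str, b: str, min_anchor: int = 4) -> list[tuple[int, int, int]]:
    """Return matched blocks as (a_start, b_start, length), ordered by a_start.

    The longest common run in the current window acts as an anchor; the
    regions to its left and right are then aligned recursively, so each
    anchor constrains the search space of the remaining gaps.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    blocks: list[tuple[int, int, int]] = []

    def recurse(alo: int, ahi: int, blo: int, bhi: int) -> None:
        if ahi - alo < min_anchor or bhi - blo < min_anchor:
            return  # window too small to contain a trustworthy anchor
        m = matcher.find_longest_match(alo, ahi, blo, bhi)
        if m.size < min_anchor:
            return  # nothing reliable here; leave the region unaligned
        recurse(alo, m.a, blo, m.b)                    # gap before the anchor
        blocks.append((m.a, m.b, m.size))
        recurse(m.a + m.size, ahi, m.b + m.size, bhi)  # gap after the anchor

    recurse(0, len(a), 0, len(b))
    return blocks


if __name__ == "__main__":
    markdown = normalize("# Invoice\n\n**Total:** 42 USD")
    ocr_text = normalize("PAGE 1 Invoice Total: 42 USD")
    for a_start, b_start, size in align(markdown, ocr_text):
        print(repr(markdown[a_start:a_start + size]), "-> OCR offset", b_start)
```

In this example the OCR text carries an extra page header that the Markdown lacks; the header is simply left unmatched. A production aligner would also need to tolerate character-level OCR errors, which exact matching does not.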

Key Features:

  • Robust to Cleanliness Issues: Handles extra headers/footers, watermarks, and noisy OCR artifacts.
  • Scale-Invariant: Recursion ensures even small missed sections in large documents are recovered.

Quick Start

import asyncio
from pathlib import Path
from gemini_ocr import gemini_ocr, settings

async def main():
    # Configure settings
    ocr_settings = settings.Settings(
        project="my-gcp-project",
        location="us",
        gcp_project_id="my-gcp-project",
        layout_processor_id="projects/.../processors/...",
        ocr_processor_id="projects/.../processors/...",
        mode=settings.OcrMode.GEMINI,
    )

    file_path = Path("path/to/document.pdf")

    # Process the document
    result = await gemini_ocr.process_document(ocr_settings, file_path)

    # Access results
    print(f"Coverage: {result.coverage_percent:.2%}")

    # Get annotated HTML-compatible Markdown
    annotated_md = result.annotate()
    print(annotated_md[:500])  # View first 500 chars

if __name__ == "__main__":
    asyncio.run(main())

Configuration

The gemini_ocr.settings.Settings class controls the behavior:

| Parameter | Type | Description |
| --- | --- | --- |
| project | str | GCP Project Name |
| location | str | GCP Location (e.g., us, eu) |
| gcp_project_id | str | GCP Project ID (might be same as project) |
| layout_processor_id | str | Document AI Processor ID for Layout (if using DOCUMENTAI mode) |
| ocr_processor_id | str | Document AI Processor ID for OCR (required for bounding boxes) |
| mode | OcrMode | GEMINI (default), DOCUMENTAI, or DOCLING |
| gemini_model_name | str | Gemini model to use (default: gemini-2.5-flash) |
| alignment_uniqueness_threshold | float | Min score ratio for unique match (default: 0.5) |
| alignment_min_overlap | float | Min overlap fraction for valid match (default: 0.9) |
| include_bboxes | bool | Whether to perform alignment (default: True) |
| markdown_page_batch_size | int | Pages per batch for Markdown generation (default: 10) |
| ocr_page_batch_size | int | Pages per batch for OCR (default: 10) |
| num_jobs | int | Max concurrent jobs (default: 10) |
| cache_dir | str | Directory to store API response cache (default: .docai_cache) |
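
To illustrate how a page batch size and a concurrency limit typically interact, here is a minimal sketch. It assumes pages are split into consecutive batches and that a semaphore caps the number of batches in flight; `page_batches`, `process_batches`, and the `worker` callback are hypothetical and do not describe the library's internals.

```python
import asyncio
from collections.abc import Awaitable, Callable


def page_batches(num_pages: int, batch_size: int) -> list[range]:
    """Split page indices 0..num_pages-1 into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        range(start, min(start + batch_size, num_pages))
        for start in range(0, num_pages, batch_size)
    ]


async def process_batches(
    num_pages: int,
    batch_size: int,
    num_jobs: int,
    worker: Callable[[range], Awaitable[object]],
) -> list[object]:
    """Run `worker` on every batch, with at most `num_jobs` running at once."""
    semaphore = asyncio.Semaphore(num_jobs)

    async def guarded(batch: range) -> object:
        async with semaphore:  # blocks while num_jobs workers are active
            return await worker(batch)

    # gather returns results in input order, so batches stay in page order.
    return await asyncio.gather(
        *(guarded(batch) for batch in page_batches(num_pages, batch_size))
    )


if __name__ == "__main__":
    async def count_pages(batch: range) -> int:
        await asyncio.sleep(0)  # placeholder for an API call
        return len(batch)

    print(asyncio.run(process_batches(25, 10, 2, count_pages)))
```

With settings like these, smaller batches mean more requests but less work lost when a single request fails, while the job limit bounds how hard the upstream APIs are hit.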

License

This project is licensed under the MIT License - see the LICENSE file for details.
