Hybrid OCR with Gemini and Document AI
Gemini OCR
Traceable Generative Markdown for PDFs
Gemini OCR is a library designed to convert PDF documents into clean, semantic Markdown while maintaining precise traceability back to the source coordinates. It bridges the gap between the readability of Generative AI (Gemini, Document AI Chunking) and the grounded accuracy of traditional OCR (Google Document AI).
Key Features
- Generative Markdown: Uses Google's Gemini models (gemini-2.5-flash by default) or Document AI Layout models to generate human-readable Markdown with proper structure (headers, tables, lists).
- Precision Traceability: Aligns the generated Markdown text back to the original PDF coordinates using detailed OCR data from Google Document AI.
- Reverse-Alignment Algorithm: Implements a robust "reverse-alignment" strategy that starts with the readable text and finds the corresponding bounding boxes, ensuring the Markdown is the ground truth for content.
- Confidence Metrics: (New) Includes coverage metrics to quantify how much of the Markdown content is successfully backed by OCR data.
- Pagination Support: Automatically handles PDF page splitting and merging logic.
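To make the coverage idea concrete, here is a minimal sketch that computes the fraction of non-whitespace Markdown characters that fall inside at least one aligned span. The `coverage_fraction` function and the `(start, end)` span representation are assumptions made for this illustration; the library's own metric may be defined differently.

```python
def coverage_fraction(markdown: str, spans: list[tuple[int, int]]) -> float:
    """Fraction of non-whitespace characters covered by at least one span.

    Each span is a half-open (start, end) character range into `markdown`
    that is assumed to have been matched to an OCR bounding box.
    """
    covered = [False] * len(markdown)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(markdown))):
            covered[i] = True

    total = sum(1 for ch in markdown if not ch.isspace())
    if total == 0:
        return 1.0  # nothing to cover

    hit = sum(
        1 for i, ch in enumerate(markdown) if covered[i] and not ch.isspace()
    )
    return hit / total


text = "# Title\n\nHello world"
# "Title" and "Hello" are backed by OCR spans; "#" and "world" are not.
print(f"Coverage: {coverage_fraction(text, [(2, 7), (9, 14)]):.2%}")
```

Ignoring whitespace keeps the metric from being inflated or deflated by Markdown layout alone.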
Architecture
The library processes documents in two parallel streams:
- Semantic Stream: The PDF is sent to a Generative AI model (e.g., Gemini 2.5 Flash) to produce a clean Markdown representation.
- Positional Stream: The PDF is sent to Google Document AI to extract raw bounding boxes and text segments.
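Because the two streams do not depend on each other, they can be requested concurrently. The sketch below shows the pattern with `asyncio.gather`; `semantic_stream`, `positional_stream`, and the segment dictionary layout are hypothetical placeholders, not the library's functions or data model.

```python
import asyncio


async def semantic_stream(pdf_bytes: bytes) -> str:
    # Placeholder for a generative model call that returns Markdown.
    await asyncio.sleep(0)
    return "# Title\n\nBody text"


async def positional_stream(pdf_bytes: bytes) -> list[dict]:
    # Placeholder for an OCR call that returns text segments with boxes.
    await asyncio.sleep(0)
    return [{"text": "Title", "page": 0, "bbox": (0.1, 0.1, 0.4, 0.15)}]


async def run_streams(pdf_bytes: bytes) -> tuple[str, list[dict]]:
    # Both requests are independent, so they can run concurrently.
    markdown, segments = await asyncio.gather(
        semantic_stream(pdf_bytes), positional_stream(pdf_bytes)
    )
    return markdown, segments


markdown, segments = asyncio.run(run_streams(b"%PDF-1.7 ..."))
```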
These two streams are then merged using a custom alignment engine (seq_smith + bbox_alignment.py) which:
- Normalizes both text sources.
- Identifies "anchor" matches between the two sources for reliable alignment.
- Computes a global alignment using the anchors to constrain the search space.
- Identifies significant gaps or mismatches.
- Recursively re-aligns mismatched regions until a high-quality alignment is achieved.
Properties of the alignment engine:
- Robust to Cleanliness Issues: Handles extra headers/footers, watermarks, and noisy OCR artifacts.
- Scale-Invariant: Recursion ensures even small missed sections in large documents are recovered.
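The anchor-then-recurse idea can be sketched in a few lines. This example substitutes the standard library's `difflib.SequenceMatcher` for seq_smith and works only on normalized strings, so it is a simplified illustration rather than the library's algorithm; `normalize`, `align`, and the `min_len` threshold are names chosen for this sketch.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    # Lowercase and keep only letters and digits so Markdown syntax and
    # whitespace differences do not block matches.
    return re.sub(r"[^a-z0-9]", "", text.lower())


def align(a: str, b: str, a_off: int = 0, b_off: int = 0, min_len: int = 4):
    """Return ordered (a_start, b_start, length) blocks shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size < min_len:
        return []  # no reliable anchor: leave this region unaligned
    # The anchor splits the problem; re-align the regions on either side.
    left = align(a[: m.a], b[: m.b], a_off, b_off, min_len)
    right = align(
        a[m.a + m.size :],
        b[m.b + m.size :],
        a_off + m.a + m.size,
        b_off + m.b + m.size,
        min_len,
    )
    return left + [(a_off + m.a, b_off + m.b, m.size)] + right


md = normalize("Total due 42 USD. Pay within 30 days.")
ocr = normalize("Total due 42 USD. CONFIDENTIAL Pay within 30 days.")
blocks = align(md, ocr)
for a_start, b_start, length in blocks:
    print(a_start, b_start, md[a_start : a_start + length])
```

Here the watermark text "CONFIDENTIAL" appears only in the OCR stream, so it falls into a gap between two aligned blocks instead of corrupting the match. A real implementation would also map each aligned block back to the bounding boxes of the OCR segments it overlaps.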
Quick Start
    import asyncio
    from pathlib import Path

    from gemini_ocr import gemini_ocr, settings


    async def main():
        # Configure settings
        ocr_settings = settings.Settings(
            project="my-gcp-project",
            location="us",
            gcp_project_id="my-gcp-project",
            layout_processor_id="projects/.../processors/...",
            ocr_processor_id="projects/.../processors/...",
            mode=settings.OcrMode.GEMINI,
        )

        file_path = Path("path/to/document.pdf")

        # Process the document
        result = await gemini_ocr.process_document(ocr_settings, file_path)

        # Access results
        print(f"Coverage: {result.coverage_percent:.2%}")

        # Get annotated HTML-compatible Markdown
        annotated_md = result.annotate()
        print(annotated_md[:500])  # View first 500 chars


    if __name__ == "__main__":
        asyncio.run(main())
Configuration
The gemini_ocr.settings.Settings class controls the behavior:
| Parameter | Type | Description |
|---|---|---|
| project | str | GCP Project Name |
| location | str | GCP Location (e.g., us, eu) |
| gcp_project_id | str | GCP Project ID (might be same as project) |
| layout_processor_id | str | Document AI Processor ID for Layout (if using DOCUMENTAI mode) |
| ocr_processor_id | str | Document AI Processor ID for OCR (required for bounding boxes) |
| mode | OcrMode | GEMINI (default), DOCUMENTAI, or DOCLING |
| gemini_model_name | str | Gemini model to use (default: gemini-2.5-flash) |
| alignment_uniqueness_threshold | float | Min score ratio for unique match (default: 0.5) |
| alignment_min_overlap | float | Min overlap fraction for valid match (default: 0.9) |
| include_bboxes | bool | Whether to perform alignment (default: True) |
| markdown_page_batch_size | int | Pages per batch for Markdown generation (default: 10) |
| ocr_page_batch_size | int | Pages per batch for OCR (default: 10) |
| num_jobs | int | Max concurrent jobs (default: 10) |
| cache_dir | str | Directory to store API response cache (default: .docai_cache) |
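To show how a page batch size and a concurrency limit typically interact, here is a self-contained sketch. It borrows the parameter names `batch_size` and `num_jobs` from the table above, but `page_batches` and `process_all` are hypothetical helpers written for this example, and the library's internal scheduling may differ.

```python
import asyncio


def page_batches(num_pages: int, batch_size: int) -> list[range]:
    # Split page indices 0..num_pages-1 into consecutive batches.
    return [
        range(i, min(i + batch_size, num_pages))
        for i in range(0, num_pages, batch_size)
    ]


async def process_all(
    num_pages: int, batch_size: int = 10, num_jobs: int = 10
) -> list[str]:
    semaphore = asyncio.Semaphore(num_jobs)  # cap on concurrent requests

    async def process(batch: range) -> str:
        async with semaphore:
            await asyncio.sleep(0)  # stand-in for an API call
            return f"pages {batch.start}-{batch.stop - 1}"

    # gather preserves input order, so results can be merged page by page.
    return await asyncio.gather(
        *(process(b) for b in page_batches(num_pages, batch_size))
    )


results = asyncio.run(process_all(25))
print(results)
```

Smaller batches mean more requests in flight at once, so the batch sizes and the job limit are usually tuned together against API quotas.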
License
This project is licensed under the MIT License - see the LICENSE file for details.