
glmmedia-ocr

Convert PDFs and images to structured Markdown using local GLM-OCR + Ollama. Fully self-contained — zero ongoing maintenance after install.

npm install -g glmmedia-ocr
glmmedia-ocr scan invoice.pdf
# → invoice.md written

Requirements

Only two things need to be on your machine before installing:

| Requirement | Why | Where |
|-------------|-----|-------|
| Python 3.12 or 3.13 | Runs the GLM-OCR SDK | python.org |
| Ollama (installed, not necessarily running) | Serves the glm-ocr model locally | ollama.com/download |

That's it. Everything else — the Python virtual environment, all dependencies, and the Ollama process lifecycle — is managed automatically by the package.

Note: Python 3.14+ is not yet supported. The GLM-OCR SDK and its dependencies (PyTorch, Transformers) only publish wheels for Python 3.10–3.13.
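Both requirements can be verified with a short preflight check before install. A minimal sketch (the function names here are illustrative, not the package's actual API):

```python
import re
import shutil

# Supported interpreters per the table above; 3.14+ lacks SDK wheels.
SUPPORTED_MINORS = {(3, 12), (3, 13)}

def python_version_ok(version_string: str) -> bool:
    """Return True if `python --version` output names a supported release."""
    m = re.match(r"Python (\d+)\.(\d+)", version_string)
    return bool(m) and (int(m.group(1)), int(m.group(2))) in SUPPORTED_MINORS

def ollama_on_path() -> bool:
    """Ollama only needs to be installed (on PATH), not already running."""
    return shutil.which("ollama") is not None
```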


Installation

npm (recommended)

npm install -g glmmedia-ocr

This triggers a postinstall script that:

  1. Creates a dedicated Python virtual environment inside the package (.venv/)
  2. Installs glmocr[selfhosted] with CPU-only PyTorch into the venv
  3. Verifies the installation by importing the SDK

The first install takes a few minutes while pip downloads ~1-2GB of dependencies. This is a one-time cost.

pip

pip install glmmedia-ocr

Or from source:

git clone https://github.com/glmmedia-ocr/glmmedia-ocr.git
cd glmmedia-ocr
pip install .

This installs the same dependencies directly into your Python environment and registers the glmmedia-ocr CLI command. Both npm and pip packages provide the exact same functionality and CLI interface.

GPU install (optional)

By default, the npm package installs CPU-only PyTorch to avoid GPU resource competition with Ollama. If you have a GPU and want to use it for layout detection:

# npm
GLMOCR_GPU=1 npm install -g glmmedia-ocr

# pip — pip resolves CUDA PyTorch by default
pip install glmmedia-ocr

Reinstall / repair

# npm
npm rebuild glmmedia-ocr

# pip
pip install --force-reinstall glmmedia-ocr

Quick Start

# Single PDF
glmmedia-ocr scan invoice.pdf

# Single image
glmmedia-ocr scan receipt.png

# Multiple images
glmmedia-ocr scan page1.png page2.png page3.png

# Mixed PDFs and images
glmmedia-ocr scan report.pdf page1.png page2.png

# All images in a directory
glmmedia-ocr scan ./images/

# All images in directory + subdirectories
glmmedia-ocr scan ./images/ --recursive

# Shell glob
glmmedia-ocr scan *.png

# Custom output path
glmmedia-ocr scan contract.pdf --output ./results/contract.md

# Higher DPI for better OCR quality
glmmedia-ocr scan receipt.pdf --dpi 300

# Connect to a remote Ollama instance
glmmedia-ocr scan report.pdf --ollama-host 192.168.1.100:11434

# Faster processing with parallel workers
glmmedia-ocr scan book.pdf --concurrency 2

# Debug logging to see layout detection progress
glmmedia-ocr scan document.pdf --log-level DEBUG

First run

On the very first run, the CLI will:

  1. Detect that Ollama is not running and start it automatically
  2. Detect that the glm-ocr:latest model is not pulled and download it (~2.2GB)
  3. Process your input
  4. Shut down Ollama on exit (since it started it)

Subsequent runs skip steps 1 and 2 when Ollama is already running and the model is cached.
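The model check in step 2 boils down to scanning `ollama list` output for the model tag. A minimal sketch of that check (illustrative, not the CLI's actual code):

```python
def model_is_pulled(ollama_list_output: str, model: str = "glm-ocr:latest") -> bool:
    """Scan `ollama list` output for the model tag in the first column."""
    lines = ollama_list_output.splitlines()[1:]  # skip the NAME/ID/SIZE header
    return any(line.split()[:1] == [model] for line in lines)
```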


CLI Reference

glmmedia-ocr scan <input...> [options]

Inputs:
  <file.pdf>                   Single PDF file
  <image.png>                  Single image file (PNG, JPEG, WebP, BMP, TIFF, GIF)
  <img1.png> <img2.png> ...    Multiple image files
  <directory>/                 Directory of images (use --recursive for subfolders)

Input/Output:
  --output <path>              Output .md path (default: auto-generated from input names)
  --recursive                  Scan directories recursively for images

Rendering:
  --dpi <number>               Render DPI for PDFs (default: 200)
  --image-format <format>      Image format: PNG, JPEG, WEBP (default: PNG)
  --min-pixels <number>        Minimum image pixels (default: 12544)
  --max-pixels <number>        Maximum image pixels (default: 71372800)
  --patch-expand-factor <n>    Patch expansion factor (default: 1)
  --t-patch-size <n>           T-patch size (default: 2)
  --image-expect-length <n>    Image expect length (default: 6144)

Generation:
  --max-tokens <number>        Max generation tokens (default: 8192)
  --temperature <float>        Sampling temperature (default: 0.0)
  --top-p <float>              Top-p sampling (default: 0.00001)
  --top-k <number>             Top-k sampling (default: 1)
  --repetition-penalty <float> Repetition penalty (default: 1.1)

Layout (PP-DocLayoutV3):
  --layout-device <device>     Device: cpu, cuda, cuda:N (default: cpu)
  --layout-model-dir <path>    Custom layout model directory
  --layout-threshold <float>   Detection threshold (default: 0.3)
  --layout-batch-size <n>      Layout batch size (default: 1)
  --layout-use-polygon         Use polygon masks for cropping
  --no-layout-nms              Disable layout NMS
  --layout-merge-mode <mode>   Merge overlapping bboxes: large|small (default: large)
  --layout-workers <n>         Layout workers (default: 1)

Result formatting:
  --output-format <format>     Output: markdown, json, both (default: markdown)
  --no-merge-formula-numbers   Disable formula number merging
  --no-merge-text-blocks       Disable text block merging
  --no-format-bullet-points    Disable bullet point formatting

Pipeline:
  --concurrency <number>       Parallel OCR workers (default: 1)
  --page-maxsize <number>      Page queue max size (default: 100)
  --region-maxsize <number>    Region queue max size (default: 2000)

Ollama / API:
  --ollama-host <host>         Ollama host (default: localhost:11434)
  --ollama-num-ctx <n>         Ollama num_ctx for glm-ocr (default: 8192; 0 = omit)
  --api-scheme <scheme>        API scheme: http, https (default: auto)
  --api-key <key>              API key for MaaS providers
  --verify-ssl                 Enable SSL verification
  --connect-timeout <seconds>  Connect timeout (default: 30)
  --request-timeout <seconds>  Request timeout (default: 120)

MaaS (Zhipu Cloud):
  --maas                       Enable MaaS mode (disables local OCR)
  --maas-api-url <url>         MaaS API URL
  --maas-model <model>         MaaS model name
  --maas-api-key <key>         MaaS API key
  --no-maas-verify-ssl         Disable MaaS SSL verification
  --maas-connect-timeout <s>   MaaS connect timeout (default: 30)
  --maas-request-timeout <s>   MaaS request timeout (default: 300)
  --maas-retry-attempts <n>    MaaS retry attempts (default: 2)

Logging:
  --log-level <level>          Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)

Flag Details

Inputs

| Input type | Description |
|------------|-------------|
| <file.pdf> | One or more PDF files. Each page becomes <!-- PAGE N --> in output. |
| <image.png> | One or more image files. Supported: PNG, JPEG, WebP, BMP, TIFF, GIF. |
| <file.pdf> <img.png> | Mixed PDFs and images. Pages are merged in input order. |
| <directory>/ | Directory of images. Scans flat by default; use --recursive for subfolders. |

Input/Output

| Flag | Default | Description |
|------|---------|-------------|
| --output | auto-generated | Where to write the Markdown output. Single input → <name>.md. Multiple inputs → <name1>_<name2>_output.md. --output overrides all. |
| --recursive | off | When a directory is passed, recurse into subdirectories for images. |
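The auto-generated naming above can be sketched as a small helper (the exact separator is an assumption reconstructed from the table, not confirmed against the source):

```python
from pathlib import Path

def default_output_path(inputs: list[str]) -> str:
    """Derive the auto-generated output filename described above."""
    stems = [Path(p).stem for p in inputs]
    if len(stems) == 1:
        return stems[0] + ".md"
    return "_".join(stems) + "_output.md"
```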

Rendering

| Flag | Default | Description |
|------|---------|-------------|
| --dpi | 200 | Resolution for rendering PDF pages to images. Higher DPI improves OCR accuracy but increases processing time and memory usage. Recommended: 200-300. |
| --image-format | PNG | Format for images sent to the OCR API. PNG is lossless (best for code, diagrams). JPEG is smaller (best for text documents). WEBP is smallest but may not be supported by all backends. |
| --min-pixels | 12544 | Minimum image pixel count (112×112). Images smaller than this are upscaled. |
| --max-pixels | 71372800 | Maximum image pixel count. Images larger than this are downscaled. |
| --patch-expand-factor | 1 | Patch expansion factor for image processing. |
| --t-patch-size | 2 | T-patch size for image processing. |
| --image-expect-length | 6144 | Expected image token length. |

Generation

| Flag | Default | Description |
|------|---------|-------------|
| --max-tokens | 8192 | Maximum tokens generated per region. Increase for very dense pages. |
| --temperature | 0.0 | Sampling temperature. 0.0 = deterministic (recommended for OCR). |
| --top-p | 0.00001 | Top-p (nucleus) sampling. Keep very low for OCR. |
| --top-k | 1 | Top-k sampling. 1 = always pick the most likely token. |
| --repetition-penalty | 1.1 | Penalty for repeating tokens. Prevents the model from getting stuck in loops. |

Layout (PP-DocLayoutV3)

| Flag | Default | Description |
|------|---------|-------------|
| --layout-device | cpu | Device for the PP-DocLayoutV3 layout detection model. cpu avoids GPU memory competition with Ollama. Use cuda or cuda:N for GPU. |
| --layout-model-dir | (SDK default) | Path to a custom PP-DocLayoutV3 model directory. Leave unset to use the SDK's built-in default. |
| --layout-threshold | 0.3 | Confidence threshold for layout detection. Lower values detect more regions (may include false positives). |
| --layout-batch-size | 1 | Max images per layout model forward pass. Keep at 1 if you hit OOM. |
| --layout-use-polygon | off | Use polygon masks for region cropping instead of bounding boxes. More precise for rotated or staggered layouts. |
| --no-layout-nms | off | Disable non-maximum suppression for layout detection. |
| --layout-merge-mode | large | How to merge overlapping bounding boxes. large keeps the larger region, small keeps the smaller one. |
| --layout-workers | 1 | Number of layout detection workers. |

Result Formatting

| Flag | Default | Description |
|------|---------|-------------|
| --output-format | markdown | Output format: markdown, json, or both. |
| --no-merge-formula-numbers | off | Disable automatic merging of formula numbers with their equations. |
| --no-merge-text-blocks | off | Disable automatic merging of adjacent text blocks. |
| --no-format-bullet-points | off | Disable automatic bullet point formatting normalization. |

Pipeline

| Flag | Default | Description |
|------|---------|-------------|
| --concurrency | 1 | Number of parallel OCR workers. Increase for faster processing on multi-page documents. Set to 1 for maximum stability with Ollama. |
| --page-maxsize | 100 | Maximum number of pages queued for processing. |
| --region-maxsize | 2000 | Maximum number of regions queued for OCR. |

Ollama / API

| Flag | Default | Description |
|------|---------|-------------|
| --ollama-host | localhost:11434 | Ollama server address. Use this to connect to a remote or non-standard Ollama instance. |
| --ollama-num-ctx | 8192 | Ollama num_ctx parameter for glm-ocr. Prevents GGML tensor size crashes. Set to 0 to omit. |
| --api-scheme | auto | API URL scheme: http or https. Auto-detects based on port (HTTPS if 443). |
| --api-key | null | API key for MaaS providers (Zhipu, OpenAI, etc.). |
| --verify-ssl | off | Enable SSL certificate verification for API requests. |
| --connect-timeout | 30 | Connection timeout in seconds. |
| --request-timeout | 120 | Request timeout in seconds. |

MaaS (Zhipu Cloud)

| Flag | Default | Description |
|------|---------|-------------|
| --maas | off | Enable MaaS mode. Sends requests directly to Zhipu's cloud API. Disables local OCR and Ollama checks. |
| --maas-api-url | Zhipu default | MaaS API endpoint URL. |
| --maas-model | glm-ocr | MaaS model name. |
| --maas-api-key | null | MaaS API key (or set ZHIPU_API_KEY env var). |
| --no-maas-verify-ssl | off | Disable SSL verification for MaaS requests. |
| --maas-connect-timeout | 30 | MaaS connection timeout in seconds. |
| --maas-request-timeout | 300 | MaaS request timeout in seconds. |
| --maas-retry-attempts | 2 | Number of retry attempts for transient MaaS errors. |

Logging

| Flag | Default | Description |
|------|---------|-------------|
| --log-level | INFO | Log level: DEBUG, INFO, WARNING, ERROR. Use DEBUG to see detailed timing and layout detection progress. |

How It Works

Startup Sequence

glmmedia-ocr scan invoice.pdf
│
├─ 1. Preflight Checks
│   ├─ Python 3.12 or 3.13 found?
│   ├─ Ollama binary on PATH? (skipped if --maas)
│   └─ GLM-OCR SDK importable in managed venv?
│
├─ 2. Ollama Lifecycle (skipped if --maas)
│   ├─ Is Ollama already running? (GET localhost:11434)
│   ├─ If yes → use it, leave it running after exit
│   └─ If no → spawn ollama serve, wait until healthy
│
├─ 3. Model Check (skipped if --maas)
│   ├─ Is glm-ocr:latest pulled? (ollama list)
│   └─ If no → ollama pull glm-ocr:latest (~2.2GB, one-time)
│
├─ 4. Pipeline Execution
│   ├─ PDF: Render pages to images (pypdfium2, in-memory, capped to 2000px)
│   │   Images: Load and cap to 2000px (no rendering step)
│   ├─ Run layout detection (PP-DocLayoutV3) — progress logged to stderr
│   ├─ OCR each region via Ollama (/api/generate) or MaaS
│   └─ Merge results with page markers
│
└─ 5. Cleanup
    ├─ Write output .md
    └─ Shut down Ollama (only if CLI started it)
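The "wait until healthy" part of step 2 is a polling loop against Ollama's root endpoint. A minimal sketch using only the standard library (the function name and polling interval are illustrative):

```python
import time
import urllib.error
import urllib.request

def wait_until_healthy(host: str = "localhost:11434",
                       timeout: float = 15.0, interval: float = 0.5) -> bool:
    """Poll Ollama's root endpoint until it answers, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"http://{host}/", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; keep polling
        time.sleep(interval)
    return False
```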

Ollama Ownership Tracking

The CLI tracks whether it started Ollama or found it already running:

| Scenario | CLI behavior |
|----------|--------------|
| Ollama was already running | Uses it, leaves it running on exit |
| CLI started Ollama | Shuts it down on normal exit, SIGINT, or SIGTERM |
| CLI crashes | Still shuts down Ollama via signal trap |

This means you can run Ollama manually before using the CLI, and it won't be touched.
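The ownership rule can be sketched as follows. This is a simplified illustration of the pattern, not the CLI's actual code; a real handler would also re-raise or exit after cleanup:

```python
import atexit
import signal
import subprocess

def ensure_ollama(already_running: bool):
    """Start Ollama only when needed, and arrange shutdown only for our own process."""
    if already_running:
        return None                      # found it running: never touched on exit
    proc = subprocess.Popen(["ollama", "serve"])

    def shutdown(*_args):
        if proc.poll() is None:
            proc.terminate()             # covers normal exit, SIGINT, and SIGTERM

    atexit.register(shutdown)
    signal.signal(signal.SIGINT, shutdown)
    signal.signal(signal.SIGTERM, shutdown)
    return proc
```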


Architecture

┌─────────────────────────────────────────────────────────────┐
│                     User (CLI)                              │
│   glmmedia-ocr scan invoice.pdf  (or *.png, ./images/)     │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│              bin/glmmedia-ocr.js (Node.js)                  │
│                                                             │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │  Preflight  │  │   Ollama     │  │   Model Check     │  │
│  │  Checks     │  │  Lifecycle   │  │   (pull if needed)│  │
│  └──────┬──────┘  └──────┬───────┘  └────────┬──────────┘  │
│         │                │                    │              │
│         └────────────────┼────────────────────┘              │
│                          │                                   │
│              ┌───────────▼────────────┐                      │
│              │  Resolve inputs        │                      │
│              │  (files, dirs, globs)  │                      │
│              └───────────┬────────────┘                      │
│                          │                                   │
│              ┌───────────▼────────────┐                      │
│              │  Generate config.yaml  │                      │
│              │  (full SDK template)   │                      │
│              └───────────┬────────────┘                      │
│                          │                                   │
│              ┌───────────▼────────────┐                      │
│              │  Spawn Python Pipeline │                      │
│              │  lib/pipeline.py       │                      │
│              └───────────┬────────────┘                      │
└──────────────────────────┼──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│              lib/pipeline.py (Python)                       │
│                                                             │
│  ┌──────────────────┐    ┌──────────────────────────────┐  │
│  │  PDF: pypdfium2  │    │  GlmOcr SDK (selfhosted)     │  │
│  │  Image: PIL open │───▶│  ┌────────────────────────┐  │  │
│  │  (2000px cap)    │    │  │ PP-DocLayoutV3         │  │  │
│  └──────────────────┘    │  │ (Transformers + CPU    │  │  │
│                          │  │  PyTorch layout detect) │  │  │
│                          │  └───────────┬────────────┘  │  │
│                          │              │                │  │
│                          │  ┌───────────▼────────────┐  │  │
│                          │  │ OCRClient              │  │  │
│                          │  │ → Ollama /api/generate │  │  │
│                          │  └────────────────────────┘  │  │
│                          └──────────────────────────────┘  │
│                                     │                       │
│                          ┌──────────▼────────────┐          │
│                          │  Merge + Page Markers │          │
│                          │  → output.md          │          │
│                          └───────────────────────┘          │
└─────────────────────────────────────────────────────────────┘

Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Managed .venv | The package owns its Python environment. Never touches the user's global Python. Reproducible, isolated, self-contained. |
| CPU-only PyTorch by default | Avoids GPU memory competition with Ollama. Smaller venv (~1-2GB vs 4GB+). Layout detection on CPU is fast enough for most documents. |
| Ollama /api/generate mode | Official GLM-OCR recommendation for Ollama. More stable than the OpenAI-compatible endpoint for vision requests. |
| pypdfium2 for PDF rendering | Ships its own PDFium binary in the wheel. Zero system dependencies. Renders directly to PIL images in-memory — no temp files, no subprocess calls. |
| 2000px image cap | Balances OCR quality with model stability. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS. Prevents GGML tensor size crashes on Ollama. |
| Full SDK config | Generates a complete config.yaml matching the SDK's template on every run. All 50+ options are exposed as CLI flags. |
| Per-page error tolerance | A failed page gets a placeholder in the output. The rest of the document continues processing. |

Output Format

The output Markdown file contains clear page boundaries:

<!-- PAGE 1 -->

# Invoice

**Invoice Number:** INV-2024-0042
**Date:** January 15, 2024

| Item | Quantity | Price |
|------|----------|-------|
| Widget A | 10 | $50.00 |
| Widget B | 5 | $75.00 |

**Total: $875.00**

---

<!-- PAGE 2 -->

## Terms and Conditions

1. Payment is due within 30 days.
2. Late payments incur a 2% monthly fee.

---

Page Markers

Each page is delimited by:

  • <!-- PAGE N --> — HTML comment identifying the page number
  • --- — Markdown horizontal rule as a visual separator

Failed Pages

If a page fails OCR (e.g., Ollama timeout, model error), it gets a placeholder:

<!-- PAGE 4 -->

<!-- PAGE 4: OCR failed — API request failed after 3 attempts -->

---

The rest of the document continues processing normally.


Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| GLMOCR_GPU | 0 | Set to 1 during install to use GPU PyTorch instead of CPU-only. |

Internal Config (auto-generated)

The CLI generates a temporary YAML config for each run. All SDK options are exposed as CLI flags:

# Example of generated config (abbreviated)
pipeline:
  maas:
    enabled: false
  ocr_api:
    api_host: localhost
    api_port: 11434
    api_path: /api/generate
    api_mode: ollama_generate
    model: glm-ocr:latest
    connect_timeout: 30
    request_timeout: 120
  max_workers: 1
  page_maxsize: 100
  region_maxsize: 2000
  page_loader:
    max_tokens: 8192
    temperature: 0.0
    top_p: 0.00001
    top_k: 1
    repetition_penalty: 1.1
    image_format: PNG
    min_pixels: 12544
    max_pixels: 71372800
  result_formatter:
    output_format: markdown
    enable_merge_formula_numbers: true
    enable_merge_text_blocks: true
    enable_format_bullet_points: true
  layout:
    device: "cpu"
    threshold: 0.3
    batch_size: 1
    use_polygon: false
    layout_nms: true
    layout_merge_bboxes_mode: large

This config is written to a temp directory before each run and cleaned up afterward. Users don't need to manage it manually.
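Writing that per-run config can be sketched like this. The template is deliberately abbreviated (the real generated config mirrors the SDK's full 50+ option template), and the prefix and helper name are illustrative:

```python
import tempfile
from pathlib import Path

# Abbreviated template echoing the example above; the real config has many more keys.
CONFIG_TEMPLATE = """\
pipeline:
  ocr_api:
    api_host: {host}
    api_port: {port}
    api_path: /api/generate
    api_mode: ollama_generate
    model: glm-ocr:latest
  max_workers: {workers}
"""

def write_temp_config(host: str = "localhost", port: int = 11434,
                      workers: int = 1) -> Path:
    """Write a per-run config.yaml into a fresh temp directory and return its path."""
    run_dir = Path(tempfile.mkdtemp(prefix="glmmedia-ocr-"))
    path = run_dir / "config.yaml"
    path.write_text(CONFIG_TEMPLATE.format(host=host, port=port, workers=workers))
    return path
```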


GPU Support

The default installation uses CPU-only PyTorch for layout detection. This is intentional:

  1. No GPU competition — Ollama loads the glm-ocr model into GPU VRAM. Running layout detection on the same GPU can cause OOM errors.
  2. Smaller venv — CPU PyTorch is ~500MB vs ~4GB for CUDA.
  3. Fast enough — PP-DocLayoutV3 is lightweight and runs quickly on CPU for typical document sizes.

Enabling GPU

If you have ample GPU memory and want faster layout detection:

# Uninstall the CPU-only version
npm uninstall -g glmmedia-ocr

# Reinstall with GPU PyTorch
GLMOCR_GPU=1 npm install -g glmmedia-ocr

Then use --layout-device cuda when scanning:

glmmedia-ocr scan document.pdf --layout-device cuda

Recommended GPU Setup

If running both Ollama (glm-ocr model) and layout detection on the same GPU:

  • GPU with 12GB+ VRAM — glm-ocr takes ~2.2GB, layout detection takes ~1-2GB
  • Use --concurrency 1 — Avoids queuing multiple OCR requests that could spike memory
  • Monitor with nvidia-smi — Watch for OOM during processing

Troubleshooting

Python not found or unsupported version

✗ Python 3.12+ not found on PATH. Install from python.org

Fix: Install Python 3.12 or 3.13 from python.org. Make sure it's on your PATH. Python 3.14+ is not yet supported because key dependencies (PyTorch, Transformers) don't publish 3.14 wheels yet.

# Verify
python --version  # Should show 3.12.x or 3.13.x

Ollama not found

✗ Ollama not found on PATH. Install from https://ollama.com/download

Fix: Install Ollama from ollama.com/download.

# Verify
ollama --version

SDK installation failed

✗ GLM-OCR SDK installation failed. Run 'npm rebuild glmmedia-ocr' to retry.

Fix: Rebuild the package:

npm rebuild glmmedia-ocr

If that fails, try a clean reinstall:

npm uninstall -g glmmedia-ocr
npm install -g glmmedia-ocr

Model pull failed

✗ ollama pull failed with code 1

Fix: Check your internet connection and try again. The model is ~2.2GB and requires a stable connection.

# Manual pull to debug
ollama pull glm-ocr:latest

Ollama won't start

✗ Ollama did not become healthy within 15s

Fix: Start Ollama manually and check for errors:

ollama serve
# In another terminal:
ollama list

If Ollama is already running on a different port, use --ollama-host:

glmmedia-ocr scan document.pdf --ollama-host localhost:11435

OCR timeout on large documents

Error: OCR failed — API request failed after 3 attempts

Fix: Increase the request timeout or reduce concurrency:

# Reduce to single worker (most stable)
glmmedia-ocr scan large-document.pdf --concurrency 1

# If using a remote Ollama, ensure the network is stable
glmmedia-ocr scan document.pdf --ollama-host 192.168.1.100:11434

Out of memory

Error: CUDA out of memory

Fix: Use CPU for layout detection:

glmmedia-ocr scan document.pdf --layout-device cpu

Or reduce concurrency:

glmmedia-ocr scan document.pdf --concurrency 1

Corrupt or encrypted PDF

Error: Failed to render PDF: ...

Fix: Ensure the PDF is valid and not password-protected. The current version does not support encrypted PDFs. Use a tool like qpdf to decrypt first:

qpdf --decrypt --password=your-password input.pdf decrypted.pdf
glmmedia-ocr scan decrypted.pdf

No image files found in directory

✗ No image files found in directory: ./images/

Fix: Ensure the directory contains supported image files (PNG, JPEG, WebP, BMP, TIFF, GIF). Use --recursive if images are in subdirectories:

glmmedia-ocr scan ./images/ --recursive

Input not found

✗ Input not found: ./missing.pdf

Fix: Check the file path and ensure the input exists.


Project Structure

glmmedia-ocr/
├── bin/
│   └── glmmedia-ocr.js          # npm CLI entry point
│                                # - Thin wrapper: finds .venv Python
│                                # - Delegates to lib/pipeline.py
│
├── scripts/
│   └── postinstall.js           # npm package setup
│                                # - Creates .venv
│                                # - pip install glmocr[selfhosted] + CPU torch
│                                # - Verifies installation
│
├── lib/
│   └── pipeline.py              # PDF/Image-to-Markdown pipeline (npm path)
│                                # - pypdfium2: PDF → PIL images (2000px cap)
│                                # - PIL: load images directly (2000px cap)
│                                # - GlmOcr SDK: layout detection + OCR
│                                # - Logging: surfaces SDK progress to stderr
│                                # - Merge with page markers → .md
│
├── src/glmmedia_ocr/            # Pure Python CLI package (pip path)
│   ├── __init__.py              # Package version
│   ├── __main__.py              # python -m glmmedia_ocr entry
│   ├── cli.py                   # Full CLI: args, Ollama, config, spinner
│   ├── config.py                # Config YAML generation
│   ├── inputs.py                # Input resolution (files, dirs, types)
│   ├── ollama.py                # Ollama lifecycle management
│   ├── pipeline.py              # Rendering + OCR + output
│   └── spinner.py               # Animated terminal spinner
│
├── pyproject.toml               # Python package metadata + deps
├── .venv/                       # Created at npm install time (gitignored)
├── .gitignore
├── package.json                 # npm package metadata
└── README.md

Distribution Channels

| Channel | Entry point | Code path |
|---------|-------------|-----------|
| npm | bin/glmmedia-ocr.js | JS wrapper → lib/pipeline.py |
| pip | src/glmmedia_ocr/cli.py | Pure Python (full implementation) |

Both provide the same CLI interface and functionality. They are independent implementations — changes to one should be mirrored in the other.

What's NOT Here

| Not included | Why |
|--------------|-----|
| node_modules/ | Zero npm dependencies — uses Node.js built-ins only |
| vendor/poppler/ | pypdfium2 ships its own PDFium binary in its pip wheel |
| config.yaml | Generated dynamically per run, cleaned up after |
| *.md output files | Generated by the CLI, not part of the package |
| dist/, build/, *.egg-info/ | Build artifacts (gitignored) |

Under the Hood

Input Resolution

The CLI accepts PDFs, images, and directories. When a directory is passed, it collects all supported image files (flat or recursive with --recursive). Mixed input types (PDF + image) are supported — pages are merged in input order into a single output file with sequential <!-- PAGE N --> markers.
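A minimal sketch of that resolution logic (the extension set matches the supported formats listed above; the function name is illustrative):

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff", ".gif"}

def resolve_inputs(paths, recursive=False):
    """Expand files and directories into an ordered list of inputs."""
    resolved = []
    for raw in paths:
        p = Path(raw)
        if p.is_dir():
            pattern = "**/*" if recursive else "*"
            resolved.extend(sorted(
                f for f in p.glob(pattern) if f.suffix.lower() in IMAGE_EXTS
            ))
        else:
            resolved.append(p)            # PDFs and single images pass through as-is
    return resolved
```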

PDF Rendering

Uses pypdfium2, which bundles the PDFium engine (same as Chromium). Renders PDF pages directly to PIL images in-memory at the specified DPI. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS resampling. No temp files, no subprocess calls, no system dependencies.

Image Loading

Images are opened with PIL and capped to 2000px on their longest dimension via LANCZOS resampling. This ensures consistent quality while preventing GGML tensor size crashes on Ollama.
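The size arithmetic behind that cap can be sketched as a pure function; in the actual pipeline the resulting size would feed a PIL LANCZOS resize:

```python
def capped_size(width: int, height: int, cap: int = 2000) -> tuple[int, int]:
    """Target size under the longest-side cap, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= cap:
        return width, height              # this step never upscales
    scale = cap / longest
    return round(width * scale), round(height * scale)
```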

Layout Detection

Uses PP-DocLayoutV3 via HuggingFace Transformers. Detects text blocks, tables, formulas, images, and other regions on each page. Runs on CPU by default to avoid GPU memory competition with Ollama. Progress is logged to stderr when --log-level DEBUG is used.

OCR

Each detected region is sent to the glm-ocr model via Ollama's native /api/generate endpoint. The model returns structured Markdown for each region.
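An /api/generate request for one region carries the model name, a prompt, and the region image base64-encoded. A sketch of the payload shape (the prompt text and any SDK-specific options are not shown; treat them as assumptions):

```python
import base64
import json

def build_generate_payload(region_png: bytes, prompt: str,
                           model: str = "glm-ocr:latest") -> str:
    """JSON body for one Ollama /api/generate request."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(region_png).decode("ascii")],
        "stream": False,                  # one complete response per region
    })
```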

Result Merging

Per-page results are merged with <!-- PAGE N --> markers and --- separators. Failed pages get error placeholders instead of aborting the entire document.
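The merge step can be sketched as follows, using the markers shown under Output Format; the placeholder text for a failed page is abbreviated here:

```python
def merge_pages(pages):
    """Join per-page Markdown with page markers and separators.
    Entries are page text, or None for a page whose OCR failed."""
    parts = []
    for n, body in enumerate(pages, start=1):
        parts.append(f"<!-- PAGE {n} -->")
        parts.append(body if body is not None
                     else f"<!-- PAGE {n}: OCR failed -->")
        parts.append("---")
    return "\n\n".join(parts) + "\n"
```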


License

MIT
