Tesseract OCR engine for OCR Bridge

OCR Bridge - Tesseract Engine

Tesseract OCR engine implementation for OCR Bridge.

Overview

This package provides a Tesseract OCR engine that integrates with the OCR Bridge architecture. Tesseract is a widely used open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google.

Features

  • Multiple Formats: JPEG, PNG, TIFF, PDF
  • Multi-page PDFs: Automatic page splitting and merging
  • Language Support: 100+ languages via Tesseract language packs
  • Configurable: PSM, OEM, and DPI settings
  • hOCR Output: Structured XHTML markup with bounding boxes

Installation

pip install ocrbridge-tesseract

Note: The Tesseract binary must be installed separately:

# Ubuntu/Debian
apt-get install tesseract-ocr tesseract-ocr-eng

# macOS
brew install tesseract

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
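Because the binary is installed outside of pip, it can be useful to confirm that it is reachable before running any OCR. The following is a minimal sketch using only the Python standard library; the helper names find_tesseract and tesseract_version are illustrative and are not part of this package.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the absolute path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line printed by `<binary> --version`, or None if unavailable."""
    path = find_tesseract(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases write the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract not found on PATH")
```

If the script reports that the binary is missing, revisit the platform-specific install step above or check your PATH.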

Usage

The engine is automatically discovered by OCR Bridge via entry points.

Parameters

  • lang (str): Language code(s), e.g., "eng", "eng+fra" (default: "eng")
  • psm (int): Page segmentation mode 0-13 (default: 3)
  • oem (int): OCR engine mode 0-3 (default: 1)
  • dpi (int): DPI for PDF conversion, 70-2400 (default: 300)
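For orientation, the parameters above correspond to options of the Tesseract command line (-l, --psm, --oem, --dpi in Tesseract 4 and later). The sketch below validates the documented ranges and builds an equivalent command. CliParams and build_command are hypothetical names for this illustration, and this is not necessarily how the package invokes Tesseract internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CliParams:
    """Hypothetical stand-in mirroring the documented parameters and defaults."""

    lang: str = "eng"
    psm: int = 3
    oem: int = 1
    dpi: int = 300

    def __post_init__(self) -> None:
        # Language codes are joined with "+", e.g. "eng+fra"; reject empty parts.
        if not self.lang or any(not part for part in self.lang.split("+")):
            raise ValueError(f"invalid lang: {self.lang!r}")
        if not 0 <= self.psm <= 13:
            raise ValueError("psm must be in 0-13")
        if not 0 <= self.oem <= 3:
            raise ValueError("oem must be in 0-3")
        if not 70 <= self.dpi <= 2400:
            raise ValueError("dpi must be in 70-2400")


def build_command(image: str, output_base: str, params: CliParams = CliParams()) -> List[str]:
    """Build a `tesseract` argument list that requests hOCR output."""
    return [
        "tesseract", image, output_base,
        "-l", params.lang,
        "--psm", str(params.psm),
        "--oem", str(params.oem),
        "--dpi", str(params.dpi),
        "hocr",  # config name that selects hOCR output
    ]
```

Passing the result to subprocess.run would write the hOCR file next to output_base, provided the Tesseract binary and the requested language packs are installed.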

Example

from pathlib import Path
from ocrbridge.engines.tesseract import TesseractEngine, TesseractParams

engine = TesseractEngine()

# Process with defaults
hocr = engine.process(Path("document.pdf"))

# Process with custom parameters
params = TesseractParams(
    lang="eng+fra",
    psm=6,
    oem=1,
    dpi=300
)
hocr = engine.process(Path("document.pdf"), params)
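As a follow-up, the sketch below shows one way to pull words and bounding boxes out of hOCR markup using only the standard library. It assumes the hOCR is available as a string and that words are marked with class "ocrx_word" and a "bbox x0 y0 x1 y1" entry in the title attribute, as in typical Tesseract output; WordBoxParser and extract_words are illustrative names, not part of this package.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, ...]
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []
        self._depth = 0  # spans nested inside the current word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            # A span nested inside a word (e.g. per-character info).
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def extract_words(hocr: str) -> List[Tuple[str, BBox]]:
    """Return every recognized word with its (x0, y0, x1, y1) bounding box."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Coordinates are in pixels of the rendered page image, so for PDFs they depend on the dpi setting used during conversion.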
