Skip to main content

OCR tool for botanical documents using layout analysis and LLMs/OCR engines.

Project description

Chinese Version | English Version

Herbarium-OCR

Project Overview

Herbarium-OCR is an open-source OCR tool primarily designed for extracting text from floras, papers, and handwritten or printed labels of herbarium specimens from Central Eurasian countries, aiming to support research in plant systematics and ecology. It can also process scanned documents and photos from other regions and languages. Users are advised to first consider commercial OCR solutions for stable service and support, such as ABBYY, Google Document AI, and TextIn.

The workflow includes:

  1. Optional auto-rotation (if enabled) to correct full image/page orientation.
  2. Layout analysis using the DocLayout-YOLO model to extract text block images.
  3. Optional image enhancement (contrast, denoising, sharpening) applied to cropped text blocks.
  4. Text recognition using a supported OCR engine.

Supported OCR Engines:

  • Large Language Models via OpenAI-compatible interface (OpenAI SDK), such as Gemini (gemini-2.0-flash), Qwen (qwen-vl-plus), ChatGLM (glm-4v-plus). Theoretically compatible with local Ollama or vLLM.
  • XFYun OCR Services via HTTP API (OCR technology provided by iFlytek):
    • General Text Recognition (xfyun-general-ocr): Supports Chinese/English (API Doc - Chinese).
    • Printed Text Recognition (Multilingual) (xfyun-printed-ocr): Supports various languages (API Doc - Chinese).
      (Note: XFYun integration is not extensively tested and may require paid quotas).
  • Local OCR Engine:
    • Surya OCR (surya-ocr): A Torch-based OCR engine (GitHub). The current integration requires approximately 7GB of VRAM per process for layout+OCR. Please check your CUDA device specifications before use. (Note: Local OCR integration is not extensively tested).

Image Preprocessing Features (Configurable):

  • Auto-Rotation: Corrects image orientation using Tesseract OCR (requires Tesseract). Disabled by default.
  • Image Enhancement: Applies contrast enhancement, denoising (color or grayscale mode), and sharpening to cropped text blocks. Disabled by default.

Output Formats: Supports Markdown, JSON, XML, and HTML. By default, only a full.json file containing all details is generated. Other formats can be requested via the --output_format argument.

Batch Processing: pdf_batch and image_batch modes support parallel processing using multiple processes. The number of worker processes is configurable (default is 1).

Development Status and Maintenance

This project was developed by the author during graduate studies. Due to time constraints and research commitments, future maintenance and feature development will primarily rely on community collaboration. Users are encouraged to:

  • Submit bug reports or detailed feature requests in the Issues section on Gitee or GitHub.
  • Contribute code fixes or new features via Pull Requests.

System Requirements

  • Python: 3.10 or higher
  • Git: For source installation
  • Hardware: CUDA-enabled GPU (Optional, accelerates layout analysis and local OCR)
  • Dependencies (See requirements.txt):
    • toml: For parsing configuration files
    • Core libraries: PyTorch, OpenCV, Pillow, PyMuPDF, openai, doclayout_yolo, tqdm
  • Optional Dependencies:
    • Tesseract OCR engine and pytesseract: For auto-rotation feature
    • surya-ocr: For local OCR (CUDA device with >8GB VRAM strongly recommended)
    • requests: For XFYun OCR services

Installation

Installation from PyPI (Recommended)

Install Herbarium-OCR via PyPI:

pip install herbarium-ocr

To install with all optional features (Tesseract support, XFYun client, Surya client):

pip install "herbarium-ocr[full]"

Note: If enabling auto-rotation, you must separately install the Tesseract OCR engine for your operating system.

GPU Support: For accelerated processing, install a CUDA-enabled version of PyTorch from the PyTorch website.

Installation from Source

Clone the repository if you need to contribute or use the latest development version: From Gitee:

git clone https://gitee.com/esenzhou/Herbarium-OCR-Public.git
cd Herbarium-OCR-Public

From GitHub:

git clone https://github.com/GrootOtter/Herbarium-OCR-Public.git
cd Herbarium-OCR-Public

Install dependencies (ideally in a virtual environment):

pip install -r requirements.txt

To enable auto-rotation:

pip install pytesseract
# Also install the Tesseract OCR engine itself (see below)

Install Tesseract OCR Engine (Only if enabling auto-rotation):

  • Linux: Use package manager, e.g., sudo apt install tesseract-ocr (Debian/Ubuntu).
  • Windows: Download and install from Tesseract Wiki. Ensure the executable path is added to the system PATH environment variable.

GPU Support: Install a CUDA-enabled version of PyTorch from the PyTorch website.

Usage

Running from PyPI Installation

Use the following command-line tools after installation:

Main Processing: herbarium-ocr

Process PDF or image files for OCR.

herbarium-ocr --mode <mode> --input <input_path> --model <model_name> [options]
  • Modes: pdf, pdf_batch, image, image_batch
  • Options:
    • --languages: Comma-separated language codes (e.g., hy,ru)
    • --output_format: markdown, json, xml, html (generates this in addition to full.json)
    • --preprocess_images: Enable image block enhancements
    • -v, --verbose: Enable debug logging
    • -c, --config: Path to custom TOML config file

Example:

herbarium-ocr --mode pdf --input document.pdf --model gemini --output_format html

Convert Output: herbarium-ocr-convert

Convert an existing full.json output file to other formats.

herbarium-ocr-convert <input_path_full.json> --to <format>... [-v]
  • Formats: markdown, md, html, htm, xml, json (filtered version)

Example:

herbarium-ocr-convert output_full.json --to markdown html

Test Preprocessing: herbarium-ocr-preprocess

Test the preprocessing pipeline (rotation attempt, enhancements).

herbarium-ocr-preprocess --input <input_path> [-c <config_path>] [-v]

Example:

herbarium-ocr-preprocess --input image.jpg

Check Layout Model: herbarium-ocr-check-layout

Display supported layout classes from the built-in model.

herbarium-ocr-check-layout [-c <config_path>] [-v]

Example:

herbarium-ocr-check-layout -c my_config.toml -v

Running from Source

If you cloned the repository, run scripts from the project root using python -m:

Main Processing

python -m Main.herbarium_ocr --mode <mode> --input <input_path> --model <model_name> [options]

Example:

python -m Main.herbarium_ocr --mode pdf --input document.pdf --model gemini --output_format html

Convert Output

python -m Main.convert <input_path_full.json> --to <format>... [-v]

Example:

python -m Main.convert output_full.json --to markdown

Test Preprocessing

python -m Main.image_processer --input <input_path> [-c <config_path>] [-v]

Example:

python -m Main.image_processer --input image.jpg

Check Layout Model

python -m Main.check_layout_model [-c <config_path>] [-v]

Example:

python -m Main.check_layout_model -c my_config.toml -v

Configuration

Customize via herbarium_ocr_config.toml. Search order: -c path > User dir > Defaults. Only include settings you want to override.

Example Config (herbarium_ocr_config.toml):

[OCR_CONFIG]
languages = "en,ru"         # Default language hints
output_format = "html"      # Default conversion format
preprocess_images = true    # Enable block enhancement
enhance_contrast = true
denoise = false             # Disable slow denoising
sharpen = true
attempt_auto_rotation = true # Enable Tesseract rotation
# tesseract_cmd_path = "/usr/local/bin/tesseract" # Tesseract path (Example)
min_rotation_confidence = 50
max_workers = 0             # Use all CPU cores for batch

[DOCLAYOUT_CONFIG]
RELEVANT_TEXT_CLASSES = ["title", "plain text"]
DOCLAYOUT_CONF_THRESHOLD = 0.25

[MODEL_CONFIGS]
# Add a new model definition (Example using OpenRouter)
  [MODEL_CONFIGS.openrouter]                # Name used with the --model argument (e.g., --model openrouter)
  type = "openai_compatible"                # Specifies which client handles this (OpenAI compatible)
  language_mode = "list_hint"               # How the client uses the --languages arg (accepts list as hint)
  api_key_env = "OPENROUTER_API_KEY"        # Environment variable name holding the API key
  base_url = "https://openrouter.ai/api/v1" # Base URL for the API endpoint (provider: OpenRouter)
  model_id = "google/gemma-3-27b-it:free"   # Specific model identifier (get from provider's documentation)
  rpm_limit = 20                            # Requests Per Minute limit (check provider's documentation/limits)

  # Add local Ollama model (Example, untested)
  [MODEL_CONFIGS.ollama_llava]
  type = "openai_compatible"
  language_mode = "list_hint"
  api_key_env = "OLLAMA_API_KEY" # Can be dummy value like "ollama"
  base_url = "http://localhost:11434/v1"
  model_id = "gemma3:27b" # Your loaded model name
  rpm_limit = 10
  max_dimension = 0 # Disable client image processing

  # Modify existing gemini config
  [MODEL_CONFIGS.gemini]
  model_id = "gemini-2.0-flash-lite"
  rpm_limit = 30

  # Modify XFyun printed OCR params
  [MODEL_CONFIGS.xfyun-printed-ocr]
  param_value = "ru" # Default language Russian
  max_dimension = 2000
  jpeg_quality = 90

Note: Run herbarium-ocr-check-layout or python -m Main.check_layout_model to see supported RELEVANT_TEXT_CLASSES.

API Key/Credential Setup (Environment Variables):

  • OpenAI-Compatible Models:

  • Obtain the corresponding API keys from the respective LLM provider’s website. This project accesses the following models (--model parameter):

    • gemini gemini-2.0-flash
    • grok grok-2-vision-1212
    • qwen qwen-vl-plus
    • glm-4 glm-4v-plus-0111
    • yi yi-vision-v2
    • kimi moonshot-v1-8k-vision-preview
    • doubao doubao-1.5-vision-pro-250328
    • Other LLMs supporting the OpenAI interface (configure the API endpoint in herbarium_ocr_config.toml)
  • Set environment variables:

    Linux

    Temporary Setup (current session only):

    export GOOGLE_API_KEY="your-google-api-key"          # For Gemini
    export XAI_API_KEY="your-xai-api-key"                # For Grok
    export DASHSCOPE_API_KEY="your-dashscope-api-key"    # For Qwen
    export ZHIPUAI_API_KEY="your-zhipuai-api-key"        # For GLM-4 
    export YI_API_KEY="your-yi-api-key"                  # For Yi
    ...
    

    Permanent Setup :

    Add the above export commands to your shell configuration file (e.g., ~/.bashrc, ~/.zshrc):

    echo 'export GOOGLE_API_KEY="your-google-api-key"' >> ~/.bashrc
    

    Reload the shell configuration:

    source ~/.bashrc  # or source ~/.zshrc
    

    Windows

    Temporary Setup (current session only):

    Open PowerShell and run:

    $env:GOOGLE_API_KEY = "your-google-api-key"
    

    Permanent Setup :

    [System.Environment]::SetEnvironmentVariable("GOOGLE_API_KEY", "your-google-api-key", "User")
    

    Alternatively, set environment variables via the GUI:

    1. Search for "Environment Variables" in the Windows Start menu.
    2. Select "Edit the system environment variables" or "Edit environment variables for your account."
    3. Under "User variables," add a new variable with the name (e.g., GOOGLE_API_KEY) and value (e.g., your-google-api-key).
  • XFYun OCR API (--model parameter: xfyun-general-ocr, xfyun-printed-ocr):

    • Requires setting three environment variables: SPARK_APPID, SPARK_API_KEY, SPARK_API_SECRET.
    • Obtain these values from your application in the XFYun Open Platform Console.
    • Set the environment variables as described above.

Troubleshooting

  • Layout Detection Failures: Modify RELEVANT_TEXT_CLASSES and DOCLAYOUT_CONF_THRESHOLD in config. Enabling herbarium-ocr-preprocess might improve detection confidence.
  • API Key Errors: Use -v to verify environment variables are correctly set and checked.
  • XFYun 403 Forbidden: Check API credentials and ensure your system clock is accurate (within 5 minutes of UTC).
  • Tesseract Errors: Ensure Tesseract engine and pytesseract library are installed and configured correctly (PATH or tesseract_cmd_path).

Contributing

Contributions are welcome! Please use:

  • Issues: Report bugs or suggest features on Gitee or GitHub.
  • Pull Requests: Submit code fixes or new features. Future development relies significantly on community involvement.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file. The included doclayout_yolo_docstructbench_imgsz1024.pt model file is also under AGPL-3.0.

Acknowledgments

Thanks to the developers of key open-source projects and libraries such as DocLayout-YOLO, PyMuPDF, Pillow, OpenAI Python SDK, Requests, Tesseract OCR, and PyTorch. Special thanks to Gemini and Grok for their code instructions. Also, thanks to the Herbarium of Xinjiang Institute of Ecology and Geography, CAS (XJBI) for supporting this work.

See Other Excellent OCR Projects

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

herbarium_ocr-0.1.2.tar.gz (37.5 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

herbarium_ocr-0.1.2-py3-none-any.whl (37.5 MB view details)

Uploaded Python 3

File details

Details for the file herbarium_ocr-0.1.2.tar.gz.

File metadata

  • Download URL: herbarium_ocr-0.1.2.tar.gz
  • Upload date:
  • Size: 37.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for herbarium_ocr-0.1.2.tar.gz
Algorithm Hash digest
SHA256 fb2686685fb3aa906beeefb7e0bd618064b6c6db396b791e041dc72723ffef26
MD5 58661f70e3fe49070a950fb78e91a2a4
BLAKE2b-256 4b306ae42787a1cf411099b04a7a70823af78dae357e4dd3fae064c316746bed

See more details on using hashes here.

File details

Details for the file herbarium_ocr-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: herbarium_ocr-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 37.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for herbarium_ocr-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 da2d95b109a15d71e208cefd98c08ff37efc4d7a884feb0ec9a7cf78fabc4ae1
MD5 2246306ca69c27399cee25984692b6e8
BLAKE2b-256 9fc006ca923c39b8ee0f5dedff7a31fd0ee95f6f25ed189e0c6fbc011ec8de8d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page