OCR tool for botanical documents using layout analysis and LLMs/OCR engines.

These details have not been verified by PyPI

Project links

Project description

Herbarium-OCR

Project Overview

Herbarium-OCR is an open-source OCR tool primarily designed for extracting text from floras, papers, and handwritten or printed labels of herbarium specimens from Central Eurasian countries, aiming to support research in plant systematics and ecology. It can also process scanned documents and photos from other regions and languages. Users are advised to first consider commercial OCR solutions for stable service and support, such as ABBYY, Google Document AI, and TextIn.

The workflow includes:

Optional auto-rotation (if enabled) to correct full image/page orientation.
Layout analysis using the DocLayout-YOLO model to extract text block images.
Optional image enhancement (contrast, denoising, sharpening) applied to cropped text blocks.
Text recognition using a supported OCR engine.

Supported OCR Engines:

Large Language Models via OpenAI-compatible interface (OpenAI SDK), such as Gemini (gemini-2.0-flash), Qwen (qwen-vl-plus), ChatGLM (glm-4v-plus). Theoretically compatible with local Ollama or vLLM.
XFYun OCR Services via HTTP API (OCR technology provided by iFlytek):
- General Text Recognition (xfyun-general-ocr): Supports Chinese/English (API Doc - Chinese).
- Printed Text Recognition (Multilingual) (xfyun-printed-ocr): Supports various languages (API Doc - Chinese).
  (Note: XFYun integration is not extensively tested and may require paid quotas).
Local OCR Engine:
- Surya OCR (surya-ocr): A Torch-based OCR engine (GitHub). The current integration requires approximately 7GB of VRAM per process for layout+OCR. Please check your CUDA device specifications before use. (Note: Local OCR integration is not extensively tested).

Image Preprocessing Features (Configurable):

Auto-Rotation: Corrects image orientation using Tesseract OCR (requires Tesseract). Disabled by default.
Image Enhancement: Applies contrast enhancement, denoising (color or grayscale mode), and sharpening to cropped text blocks. Disabled by default.

Output Formats: Supports Markdown, JSON, XML, and HTML. By default, only a full.json file containing all details is generated. Other formats can be requested via the --output_format argument.

Batch Processing: pdf_batch and image_batch modes support parallel processing using multiple processes. The number of worker processes is configurable (default is 1).

Development Status and Maintenance

This project was developed by the author during graduate studies. Due to time constraints and research commitments, future maintenance and feature development will primarily rely on community collaboration. Users are encouraged to:

Submit bug reports or detailed feature requests in the Issues section on Gitee or GitHub.
Contribute code fixes or new features via Pull Requests.

System Requirements

Python: 3.10 or higher
Git: For source installation
Hardware: CUDA-enabled GPU (Optional, accelerates layout analysis and local OCR)
Dependencies (See requirements.txt):
- toml: For parsing configuration files
- Core libraries: PyTorch, OpenCV, Pillow, PyMuPDF, openai, doclayout_yolo, tqdm
Optional Dependencies:
- Tesseract OCR engine and pytesseract: For auto-rotation feature
- surya-ocr: For local OCR (CUDA device with >8GB VRAM strongly recommended)
- requests: For XFYun OCR services

Installation

Installation from PyPI (Recommended)

Install Herbarium-OCR via PyPI:

pip install herbarium-ocr

To install with all optional features (Tesseract support, XFYun client, Surya client):

pip install "herbarium-ocr[full]"

Note: If enabling auto-rotation, you must separately install the Tesseract OCR engine for your operating system.

GPU Support: For accelerated processing, install a CUDA-enabled version of PyTorch from the PyTorch website.

Installation from Source

Clone the repository if you need to contribute or use the latest development version: From Gitee:

git clone https://gitee.com/esenzhou/Herbarium-OCR-Public.git
cd Herbarium-OCR-Public

From GitHub:

git clone https://github.com/GrootOtter/Herbarium-OCR-Public.git
cd Herbarium-OCR-Public

Install dependencies (ideally in a virtual environment):

pip install -r requirements.txt

To enable auto-rotation:

pip install pytesseract
# Also install the Tesseract OCR engine itself (see below)

Install Tesseract OCR Engine (Only if enabling auto-rotation):

Linux: Use package manager, e.g., sudo apt install tesseract-ocr (Debian/Ubuntu).
Windows: Download and install from Tesseract Wiki. Ensure the executable path is added to the system PATH environment variable.

GPU Support: Install a CUDA-enabled version of PyTorch from the PyTorch website.

Usage

Running from PyPI Installation

Use the following command-line tools after installation:

Main Processing: `herbarium-ocr`

Process PDF or image files for OCR.

herbarium-ocr --mode <mode> --input <input_path> --model <model_name> [options]

Modes: pdf, pdf_batch, image, image_batch
Options:
- --languages: Comma-separated language codes (e.g., hy,ru)
- --output_format: markdown, json, xml, html (generates this in addition to full.json)
- --preprocess_images: Enable image block enhancements
- -v, --verbose: Enable debug logging
- -c, --config: Path to custom TOML config file

Example:

herbarium-ocr --mode pdf --input document.pdf --model gemini --output_format html

Convert Output: `herbarium-ocr-convert`

Convert an existing full.json output file to other formats.

herbarium-ocr-convert <input_path_full.json> --to <format>... [-v]

Formats: markdown, md, html, htm, xml, json (filtered version)

Example:

herbarium-ocr-convert output_full.json --to markdown html

Test Preprocessing: `herbarium-ocr-preprocess`

Test the preprocessing pipeline (rotation attempt, enhancements).

herbarium-ocr-preprocess --input <input_path> [-c <config_path>] [-v]

Example:

herbarium-ocr-preprocess --input image.jpg

Check Layout Model: `herbarium-ocr-check-layout`

Display supported layout classes from the built-in model.

herbarium-ocr-check-layout [-c <config_path>] [-v]

Example:

herbarium-ocr-check-layout -c my_config.toml -v

Running from Source

If you cloned the repository, run scripts from the project root using python -m:

Main Processing

python -m Main.herbarium_ocr --mode <mode> --input <input_path> --model <model_name> [options]

Example:

python -m Main.herbarium_ocr --mode pdf --input document.pdf --model gemini --output_format html

Convert Output

python -m Main.convert <input_path_full.json> --to <format>... [-v]

Example:

python -m Main.convert output_full.json --to markdown

Test Preprocessing

python -m Main.image_processer --input <input_path> [-c <config_path>] [-v]

Example:

python -m Main.image_processer --input image.jpg

Check Layout Model

python -m Main.check_layout_model [-c <config_path>] [-v]

Example:

python -m Main.check_layout_model -c my_config.toml -v

Configuration

Customize via herbarium_ocr_config.toml. Search order: -c path > User dir > Defaults. Only include settings you want to override.

Example Config (herbarium_ocr_config.toml):

[OCR_CONFIG]
languages = "en,ru"         # Default language hints
output_format = "html"      # Default conversion format
preprocess_images = true    # Enable block enhancement
enhance_contrast = true
denoise = false             # Disable slow denoising
sharpen = true
attempt_auto_rotation = true # Enable Tesseract rotation
# tesseract_cmd_path = "/usr/local/bin/tesseract" # Tesseract path (Example)
min_rotation_confidence = 50
max_workers = 0             # Use all CPU cores for batch

[DOCLAYOUT_CONFIG]
RELEVANT_TEXT_CLASSES = ["title", "plain text"]
DOCLAYOUT_CONF_THRESHOLD = 0.25

[MODEL_CONFIGS]
# Add a new model definition (Example using OpenRouter)
  [MODEL_CONFIGS.openrouter]                # Name used with the --model argument (e.g., --model openrouter)
  type = "openai_compatible"                # Specifies which client handles this (OpenAI compatible)
  language_mode = "list_hint"               # How the client uses the --languages arg (accepts list as hint)
  api_key_env = "OPENROUTER_API_KEY"        # Environment variable name holding the API key
  base_url = "https://openrouter.ai/api/v1" # Base URL for the API endpoint (provider: OpenRouter)
  model_id = "google/gemma-3-27b-it:free"   # Specific model identifier (get from provider's documentation)
  rpm_limit = 20                            # Requests Per Minute limit (check provider's documentation/limits)

  # Add local Ollama model (Example, untested)
  [MODEL_CONFIGS.ollama_llava]
  type = "openai_compatible"
  language_mode = "list_hint"
  api_key_env = "OLLAMA_API_KEY" # Can be dummy value like "ollama"
  base_url = "http://localhost:11434/v1"
  model_id = "gemma3:27b" # Your loaded model name
  rpm_limit = 10
  max_dimension = 0 # Disable client image processing

  # Modify existing gemini config
  [MODEL_CONFIGS.gemini]
  model_id = "gemini-2.0-flash-lite"
  rpm_limit = 30

  # Modify XFyun printed OCR params
  [MODEL_CONFIGS.xfyun-printed-ocr]
  param_value = "ru" # Default language Russian
  max_dimension = 2000
  jpeg_quality = 90

Note: Run herbarium-ocr-check-layout or python -m Main.check_layout_model to see supported RELEVANT_TEXT_CLASSES.

API Key/Credential Setup (Environment Variables):

OpenAI-Compatible Models:
Obtain the corresponding API keys from the respective LLM provider’s website. This project accesses the following models (--model parameter):
- gemini gemini-2.0-flash
- grok grok-2-vision-1212
- qwen qwen-vl-plus
- glm-4 glm-4v-plus-0111
- yi yi-vision-v2
- kimi moonshot-v1-8k-vision-preview
- doubao doubao-1.5-vision-pro-250328
- Other LLMs supporting the OpenAI interface (configure the API endpoint in herbarium_ocr_config.toml)

Set environment variables:

Linux

Temporary Setup (current session only):

export GOOGLE_API_KEY="your-google-api-key"          # For Gemini
export XAI_API_KEY="your-xai-api-key"                # For Grok
export DASHSCOPE_API_KEY="your-dashscope-api-key"    # For Qwen
export ZHIPUAI_API_KEY="your-zhipuai-api-key"        # For GLM-4 
export YI_API_KEY="your-yi-api-key"                  # For Yi
...

Permanent Setup :

Add the above export commands to your shell configuration file (e.g., ~/.bashrc, ~/.zshrc):

echo 'export GOOGLE_API_KEY="your-google-api-key"' >> ~/.bashrc

Reload the shell configuration:

source ~/.bashrc  # or source ~/.zshrc

Windows

Temporary Setup (current session only):

Open PowerShell and run:

$env:GOOGLE_API_KEY = "your-google-api-key"

Permanent Setup :

[System.Environment]::SetEnvironmentVariable("GOOGLE_API_KEY", "your-google-api-key", "User")

Alternatively, set environment variables via the GUI:

Search for "Environment Variables" in the Windows Start menu.
Select "Edit the system environment variables" or "Edit environment variables for your account."
Under "User variables," add a new variable with the name (e.g., GOOGLE_API_KEY) and value (e.g., your-google-api-key).

XFYun OCR API (--model parameter: xfyun-general-ocr, xfyun-printed-ocr):
- Requires setting three environment variables: SPARK_APPID, SPARK_API_KEY, SPARK_API_SECRET.
- Obtain these values from your application in the XFYun Open Platform Console.
- Set the environment variables as described above.

Troubleshooting

Layout Detection Failures: Modify RELEVANT_TEXT_CLASSES and DOCLAYOUT_CONF_THRESHOLD in config. Enabling herbarium-ocr-preprocess might improve detection confidence.
API Key Errors: Use -v to verify environment variables are correctly set and checked.
XFYun 403 Forbidden: Check API credentials and ensure your system clock is accurate (within 5 minutes of UTC).
Tesseract Errors: Ensure Tesseract engine and pytesseract library are installed and configured correctly (PATH or tesseract_cmd_path).

Contributing

Contributions are welcome! Please use:

Issues: Report bugs or suggest features on Gitee or GitHub.
Pull Requests: Submit code fixes or new features. Future development relies significantly on community involvement.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file. The included doclayout_yolo_docstructbench_imgsz1024.pt model file is also under AGPL-3.0.

Acknowledgments

Thanks to the developers of key open-source projects and libraries such as DocLayout-YOLO, PyMuPDF, Pillow, OpenAI Python SDK, Requests, Tesseract OCR, and PyTorch. Special thanks to Gemini and Grok for their code instructions. Also, thanks to the Herbarium of Xinjiang Institute of Ecology and Geography, CAS (XJBI) for supporting this work.

See Other Excellent OCR Projects

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.0

May 7, 2025

This version

0.1.2

May 7, 2025

0.1.1

May 7, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

herbarium_ocr-0.1.2.tar.gz (37.5 MB view details)

Uploaded May 7, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

herbarium_ocr-0.1.2-py3-none-any.whl (37.5 MB view details)

Uploaded May 7, 2025 Python 3

File details

Details for the file herbarium_ocr-0.1.2.tar.gz.

File metadata

Download URL: herbarium_ocr-0.1.2.tar.gz
Upload date: May 7, 2025
Size: 37.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for herbarium_ocr-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`fb2686685fb3aa906beeefb7e0bd618064b6c6db396b791e041dc72723ffef26`
MD5	`58661f70e3fe49070a950fb78e91a2a4`
BLAKE2b-256	`4b306ae42787a1cf411099b04a7a70823af78dae357e4dd3fae064c316746bed`

See more details on using hashes here.

File details

Details for the file herbarium_ocr-0.1.2-py3-none-any.whl.

File metadata

Download URL: herbarium_ocr-0.1.2-py3-none-any.whl
Upload date: May 7, 2025
Size: 37.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for herbarium_ocr-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da2d95b109a15d71e208cefd98c08ff37efc4d7a884feb0ec9a7cf78fabc4ae1`
MD5	`2246306ca69c27399cee25984692b6e8`
BLAKE2b-256	`9fc006ca923c39b8ee0f5dedff7a31fd0ee95f6f25ed189e0c6fbc011ec8de8d`

See more details on using hashes here.

herbarium-ocr 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Herbarium-OCR

Project Overview

Development Status and Maintenance

System Requirements

Installation

Installation from PyPI (Recommended)

Installation from Source

Usage

Running from PyPI Installation

Main Processing: herbarium-ocr

Convert Output: herbarium-ocr-convert

Test Preprocessing: herbarium-ocr-preprocess

Check Layout Model: herbarium-ocr-check-layout

Running from Source

Main Processing

Convert Output

Test Preprocessing

Check Layout Model

Configuration

Linux

Windows

Troubleshooting

Contributing

License

Acknowledgments

See Other Excellent OCR Projects

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Main Processing: `herbarium-ocr`

Convert Output: `herbarium-ocr-convert`

Test Preprocessing: `herbarium-ocr-preprocess`

Check Layout Model: `herbarium-ocr-check-layout`