Skip to main content

Arabic OCR pipeline built on OnnxTR with fine-tuned Arabic models

Project description

mawshor

mawshor

Arabic OCR pipeline built on OnnxTR with fine-tuned ONNX models.


Sample Input Image

Model Prediction Output

Features

  • Arabic-first STR: recognition model fine-tuned on Arabic script.
  • Orientation correction: detects and corrects both page-level rotation and crop-level skew before inference (--straighten-pages)
  • LLM postprocessing: low-confidence OCR words are sent to any OpenAI-compatible LLM for context-aware correction (--postprocess)
  • GPU-accelerated: runs on CUDA via ONNX Runtime; CPU fallback available

Models

Four fine-turned Arabic models are loaded from HuggingFace (madskills/):

Model Architecture Task
onnxtr-fast_base-arabic FAST Text detection
onnxtr-parseq-arabic PARSeq Text recognition
onnxtr-mobilenet_v3_small-crop-orientation-arabic MobileNet V3 Small Crop orientation correction
onnxtr-mobilenet_v3_small-page-orientation-arabic MobileNet V3 Small Page orientation correction

Models were fine-tuned on synthetic Arabic datasets using DocTR's models as a base.

Requirements

  • Python 3.10+
  • CUDA-capable GPU (CPU fallback available but not the primary target)
pip install -r requirements.txt

Usage

python core.py <path> [options]

<path> can be a single image/PDF or a directory. Supported image formats: PNG, JPG, JPEG, BMP, TIFF.

Options

Flag Short Description
--straighten-pages -s Detect and correct page/crop orientation before OCR
--postprocess -p Send low-confidence words to an LLM for correction
--save Save output to a .txt file next to each input file
--raw-output -r Print the raw predictor output
--llm-endpoint OpenAI-compatible API base URL (default: http://localhost:11434/v1)
--llm-model Model name for postprocessing (default: qwen3.5:4b)
--llm-api-key API key (default: ollama)

Examples

# Basic OCR on a single image
python core.py document.jpg

# OCR a directory and save results
python core.py ./scans/ --save

# OCR with page straightening and LLM postprocessing via local Ollama
python core.py document.jpg --straighten-pages --postprocess

# Use a different model or remote endpoint
python core.py document.jpg --postprocess \
  --llm-endpoint https://api.openai.com/v1 \
  --llm-model gpt-4o \
  --llm-api-key sk-...

Postprocessing

When --postprocess is enabled, OCR output is filtered by confidence and sent to an LLM:

  • Words with confidence ≥ 0.8 are passed as-is
  • Words with confidence between 0.75–0.8 are passed and flagged as low-confidence
  • Words with confidence < 0.75 are dropped before sending

The LLM is prompted as an Arabic copyeditor to fix likely OCR errors, merge/split words, and clean up spacing: without changing meaning or adding content.

Any OpenAI-compatible endpoint works. Ollama runs out of the box with the defaults.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mawshor-0.1.0.tar.gz (229.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mawshor-0.1.0-py3-none-any.whl (7.8 kB view details)

Uploaded Python 3

File details

Details for the file mawshor-0.1.0.tar.gz.

File metadata

  • Download URL: mawshor-0.1.0.tar.gz
  • Upload date:
  • Size: 229.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mawshor-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d87c82609112455fb3674d93bf69e54fbfbd4da5bbd0011d088bb207780900ff
MD5 20fa822a51b2d37b082f2c38cfc9cd69
BLAKE2b-256 5f432bda6c1aa4b0c8e6ebd25b7488754394aace58e1d1f1d879044a69d50426

See more details on using hashes here.

Provenance

The following attestation bundles were made for mawshor-0.1.0.tar.gz:

Publisher: publish.yml on tarekio/mawshor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mawshor-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mawshor-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 7.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mawshor-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c31a0f9c07ebee032eaefddea7201fbdbed7aa8ac6111c0f5c599a3442863970
MD5 10d0708298f09d5e79d7706b352f7f67
BLAKE2b-256 f1ce782df194fd8c17a61a25a95de4ed33bf19e090d47dd863d5e661362fb1c3

See more details on using hashes here.

Provenance

The following attestation bundles were made for mawshor-0.1.0-py3-none-any.whl:

Publisher: publish.yml on tarekio/mawshor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page