Arabic OCR pipeline built on OnnxTR with fine-tuned Arabic models
Project description
mawshor
Arabic OCR pipeline built on OnnxTR with fine-tuned ONNX models.
Sample Input Image |
Model Prediction Output |
Features
- Arabic-first STR: recognition model fine-tuned on Arabic script.
- Orientation correction: detects and corrects both page-level rotation and crop-level skew before inference (
--straighten-pages) - LLM postprocessing: low-confidence OCR words are sent to any OpenAI-compatible LLM for context-aware correction (
--postprocess) - GPU-accelerated: runs on CUDA via ONNX Runtime; CPU fallback available
Models
Four fine-turned Arabic models are loaded from HuggingFace (madskills/):
| Model | Architecture | Task |
|---|---|---|
onnxtr-fast_base-arabic |
FAST | Text detection |
onnxtr-parseq-arabic |
PARSeq | Text recognition |
onnxtr-mobilenet_v3_small-crop-orientation-arabic |
MobileNet V3 Small | Crop orientation correction |
onnxtr-mobilenet_v3_small-page-orientation-arabic |
MobileNet V3 Small | Page orientation correction |
Models were fine-tuned on synthetic Arabic datasets using DocTR's models as a base.
Requirements
- Python 3.10+
- CUDA-capable GPU (CPU fallback available but not the primary target)
pip install -r requirements.txt
Usage
python core.py <path> [options]
<path> can be a single image/PDF or a directory. Supported image formats: PNG, JPG, JPEG, BMP, TIFF.
Options
| Flag | Short | Description |
|---|---|---|
--straighten-pages |
-s |
Detect and correct page/crop orientation before OCR |
--postprocess |
-p |
Send low-confidence words to an LLM for correction |
--save |
Save output to a .txt file next to each input file |
|
--raw-output |
-r |
Print the raw predictor output |
--llm-endpoint |
OpenAI-compatible API base URL (default: http://localhost:11434/v1) |
|
--llm-model |
Model name for postprocessing (default: qwen3.5:4b) |
|
--llm-api-key |
API key (default: ollama) |
Examples
# Basic OCR on a single image
python core.py document.jpg
# OCR a directory and save results
python core.py ./scans/ --save
# OCR with page straightening and LLM postprocessing via local Ollama
python core.py document.jpg --straighten-pages --postprocess
# Use a different model or remote endpoint
python core.py document.jpg --postprocess \
--llm-endpoint https://api.openai.com/v1 \
--llm-model gpt-4o \
--llm-api-key sk-...
Postprocessing
When --postprocess is enabled, OCR output is filtered by confidence and sent to an LLM:
- Words with confidence ≥ 0.8 are passed as-is
- Words with confidence between 0.75–0.8 are passed and flagged as low-confidence
- Words with confidence < 0.75 are dropped before sending
The LLM is prompted as an Arabic copyeditor to fix likely OCR errors, merge/split words, and clean up spacing: without changing meaning or adding content.
Any OpenAI-compatible endpoint works. Ollama runs out of the box with the defaults.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mawshor-0.1.0.tar.gz.
File metadata
- Download URL: mawshor-0.1.0.tar.gz
- Upload date:
- Size: 229.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d87c82609112455fb3674d93bf69e54fbfbd4da5bbd0011d088bb207780900ff
|
|
| MD5 |
20fa822a51b2d37b082f2c38cfc9cd69
|
|
| BLAKE2b-256 |
5f432bda6c1aa4b0c8e6ebd25b7488754394aace58e1d1f1d879044a69d50426
|
Provenance
The following attestation bundles were made for mawshor-0.1.0.tar.gz:
Publisher:
publish.yml on tarekio/mawshor
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
mawshor-0.1.0.tar.gz -
Subject digest:
d87c82609112455fb3674d93bf69e54fbfbd4da5bbd0011d088bb207780900ff - Sigstore transparency entry: 1519288152
- Sigstore integration time:
-
Permalink:
tarekio/mawshor@9263ba11f8ea544b1a9449baa34b949407884e1a -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/tarekio
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9263ba11f8ea544b1a9449baa34b949407884e1a -
Trigger Event:
release
-
Statement type:
File details
Details for the file mawshor-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mawshor-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c31a0f9c07ebee032eaefddea7201fbdbed7aa8ac6111c0f5c599a3442863970
|
|
| MD5 |
10d0708298f09d5e79d7706b352f7f67
|
|
| BLAKE2b-256 |
f1ce782df194fd8c17a61a25a95de4ed33bf19e090d47dd863d5e661362fb1c3
|
Provenance
The following attestation bundles were made for mawshor-0.1.0-py3-none-any.whl:
Publisher:
publish.yml on tarekio/mawshor
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
mawshor-0.1.0-py3-none-any.whl -
Subject digest:
c31a0f9c07ebee032eaefddea7201fbdbed7aa8ac6111c0f5c599a3442863970 - Sigstore transparency entry: 1519288164
- Sigstore integration time:
-
Permalink:
tarekio/mawshor@9263ba11f8ea544b1a9449baa34b949407884e1a -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/tarekio
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@9263ba11f8ea544b1a9449baa34b949407884e1a -
Trigger Event:
release
-
Statement type: