Clean Arabic text extraction from PDFs and scanned images — OCR + visual-order repair in one pipeline
arabic-extract
Clean Arabic text extraction from PDFs and scanned images — one call, clean output.
Combines PDF text extraction, image OCR, and arabic-repair into a single pipeline. Handles the visual-order problem that breaks standard Arabic NLP pipelines.
The problem it solves
Arabic PDFs and scanned documents store text in visual order with presentation-form characters. Standard normalization (Unicode NFKC, CAMeL Tools) maps the presentation forms back to base letters but cannot restore the reversed word order, so retrieval recall stays broken at ~27%. arabic-extract applies arabic-repair automatically, restoring both letter forms and word order before the text reaches your NLP pipeline.
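To make the failure mode concrete, here is a minimal sketch using only the Python standard library. The two-word sample is hand-built for illustration (it is not taken from a real PDF), and `naive_repair` is a hypothetical stand-in: a plain reversal that only works for a single line of pure Arabic. It is not the algorithm arabic-repair uses.

```python
import unicodedata

# Logical (reading) order: "كتاب جديد" ("a new book").
LOGICAL = "\u0643\u062A\u0627\u0628 \u062C\u062F\u064A\u062F"

# The same line as a visual-order extraction might yield it: characters
# reversed and stored as Arabic Presentation Forms-B code points.
# Hand-built sample for illustration only.
VISUAL = "\uFEAA\uFEF3\uFEAA\uFE9F \uFE8F\uFE8E\uFE98\uFEDB"


def nfkc_only(text: str) -> str:
    """Map presentation forms back to base letters; order is untouched."""
    return unicodedata.normalize("NFKC", text)


def naive_repair(text: str) -> str:
    """Hypothetical stand-in for a real repair step.

    Reverses the whole line after NFKC. This is only valid for a single
    line of pure Arabic; digits, Latin runs, and diacritics need real
    bidi-aware handling.
    """
    return nfkc_only(text)[::-1]


print(nfkc_only(VISUAL) == LOGICAL)     # False: letters fixed, order still reversed
print(naive_repair(VISUAL) == LOGICAL)  # True for this simple sample
```

The first check shows why normalization alone is not enough: every letter is now a base code point, yet the string still reads backwards, so token-level matching against logical-order queries fails.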
Install
pip install arabic-extract[pdf] # PDF text-layer extraction
pip install arabic-extract[tesseract] # + image OCR via Tesseract (needs binary)
pip install arabic-extract[easyocr] # + image OCR via EasyOCR (pure Python, ~200 MB)
pip install arabic-extract[pymupdf] # + scanned PDF rendering via PyMuPDF
pip install arabic-extract[all] # everything
Tesseract binary (for the tesseract extra):
- Windows: download from https://github.com/UB-Mannheim/tesseract/wiki — install the Arabic language pack
- Linux: sudo apt install tesseract-ocr tesseract-ocr-ara
- macOS: brew install tesseract && brew install tesseract-lang
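Before running OCR, it can help to confirm that the binary and the Arabic language data are actually installed. The following is a small sketch that shells out to `tesseract --list-langs`; the helper name `tesseract_has_arabic` is made up for this example and is not part of arabic-extract.

```python
import shutil
import subprocess


def tesseract_has_arabic() -> bool:
    """Return True if a `tesseract` binary is on PATH and lists `ara`."""
    if shutil.which("tesseract") is None:
        return False
    try:
        proc = subprocess.run(
            ["tesseract", "--list-langs"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Some Tesseract versions print the list to stderr, so check both.
    langs = (proc.stdout + "\n" + proc.stderr).split()
    return "ara" in langs


if __name__ == "__main__":
    if tesseract_has_arabic():
        print("Arabic OCR ready")
    else:
        print("tesseract binary or 'ara' language data missing")
```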
Quick start
import arabic_extract as aocr
# PDF — auto-detects text layer vs scanned, repairs each page
result = aocr.extract("document.pdf")
print(result.text) # clean logical Arabic, all pages joined
print(result.pages) # per-page breakdown
print(result.contamination) # how many words needed repair
# Scanned image
result = aocr.extract("scan.jpg")
print(result.text)
# Explicit PDF extraction
result = aocr.extract_pdf("document.pdf", engine="tesseract")
# Explicit image extraction
result = aocr.extract_image("scan.png", engine="easyocr")
# Chain into CAMeL Tools (normalize=True is the default)
result = aocr.extract("document.pdf", normalize=True)
How it works
Input PDF or image
    │
    ├─ PDF with text layer → pdfplumber extracts text (visual order)
    │                              ↓
    ├─ Scanned PDF         → render page as image → OCR engine
    │                              ↓
    └─ Image file          → OCR engine (Tesseract or EasyOCR)
                                   ↓
                    arabic-repair (de-shape + restore order)
                                   ↓
                    NFKC / CAMeL Tools normalization
                                   ↓
                    Clean logical Arabic text
A single PDF can have mixed pages — some with a text layer, some scanned. Each page is handled correctly.
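The per-page decision can be pictured roughly as follows. This is a sketch of the idea under stated assumptions, not the package's internals: `PageResult`, `route_pages`, and the `ocr` callable are hypothetical names, and the meaning given to `"text_layer_empty"` (no text layer and no OCR fallback available) is a guess based on the method labels the library reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PageResult:
    page_number: int
    method: str  # "text_layer" | "ocr" | "text_layer_empty"
    text: str


def route_pages(
    text_layers: List[Optional[str]],
    ocr: Optional[Callable[[int], str]] = None,
) -> List[PageResult]:
    """Pick an extraction path per page.

    `text_layers[i]` is whatever the text-layer extractor returned for
    page i+1 (None or whitespace means no usable text layer). `ocr` is a
    hypothetical callable that OCRs a page given its 1-based number.
    """
    results = []
    for number, layer in enumerate(text_layers, start=1):
        if layer and layer.strip():
            results.append(PageResult(number, "text_layer", layer))
        elif ocr is not None:
            results.append(PageResult(number, "ocr", ocr(number)))
        else:
            # Assumed meaning: no text layer and no OCR engine to fall back on.
            results.append(PageResult(number, "text_layer_empty", ""))
    return results


pages = route_pages(["نص الصفحة", "   ", None], ocr=lambda n: f"<ocr page {n}>")
print([p.method for p in pages])  # ['text_layer', 'ocr', 'ocr']
```

Deciding per page rather than per file is what lets a document with a digital cover letter and scanned attachments come out correctly in one call.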
Per-page results
result = aocr.extract("document.pdf")
for page in result.pages:
    print(f"Page {page.page_number} [{page.method}]: {page.text[:80]}")
    # method: "text_layer" | "ocr" | "text_layer_empty"
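The per-page breakdown also lends itself to quick quality checks, such as counting how many pages went through OCR or flagging pages that came back empty. The sketch below relies only on the `page_number`, `method`, and `text` attributes shown above; the helper names are made up for this example, and `SimpleNamespace` objects stand in for real page results so it runs without the package installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_methods(pages) -> Counter:
    """Count pages per extraction method."""
    return Counter(page.method for page in pages)


def pages_without_text(pages) -> list:
    """Page numbers whose extracted text is empty or whitespace."""
    return [p.page_number for p in pages if not p.text.strip()]


# Stand-in objects so the sketch runs without the package installed.
demo_pages = [
    SimpleNamespace(page_number=1, method="text_layer", text="نص"),
    SimpleNamespace(page_number=2, method="ocr", text="نص آخر"),
    SimpleNamespace(page_number=3, method="text_layer_empty", text=""),
]
print(summarize_methods(demo_pages))
print(pages_without_text(demo_pages))  # [3]
```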
Ecosystem
| Package | Role |
|---|---|
| arabic-rt | Core shaping / fix / unfix engine |
| arabic-repair | Detect and repair visual-order contamination |
| arabic-extract | Full PDF + image extraction pipeline |
| arabic-benchmark | Benchmark proving the reordering gap |
License
MPL-2.0