Clean Arabic text extraction from PDFs and scanned images — OCR + visual-order repair in one pipeline

Maintainers

balswyan

arabic-extract

Clean Arabic text extraction from PDFs and scanned images — one call, clean output.

Combines PDF text extraction, image OCR, and arabic-repair into a single pipeline. Handles the visual-order problem that breaks standard Arabic NLP pipelines.

The problem it solves

Arabic PDFs and scanned documents often store text in visual order with presentation-form characters. Standard normalization (Unicode NFKC, CAMeL Tools) maps the presentation forms back to base letters but cannot restore the reversed word order, so retrieval recall stays broken at ~27%. arabic-extract applies arabic-repair automatically, restoring both letter forms and word order before the text reaches your NLP pipeline.
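The gap is easy to reproduce with the standard library alone. The sketch below builds a one-word string the way a visual-order extractor might emit it (assumed here: a purely right-to-left line stored left to right in presentation forms) and shows that NFKC restores the base letters but leaves them backwards. The final reversal is a toy check for this single word, not the algorithm arabic-repair uses.

```python
import unicodedata

# Logical spelling of the word "كتب": kaf, teh, beh.
logical = "\u0643\u062A\u0628"

# The same word as a visual-order extractor might emit it (assumption):
# presentation forms, stored left to right as drawn on the page.
visual = "\uFE90\uFE98\uFEDB"  # beh final, teh medial, kaf initial

normalized = unicodedata.normalize("NFKC", visual)

# NFKC maps each presentation form back to a base letter in the Arabic block...
letters_fixed = all(0x0600 <= ord(ch) <= 0x06FF for ch in normalized)
# ...but the letters are still in reverse order.
order_fixed = normalized == logical
# Toy check: reversing this one word recovers the logical spelling.
reversed_matches = normalized[::-1] == logical

print(letters_fixed, order_fixed, reversed_matches)
```

Plain reversal only works for this isolated word; real lines mix digits, Latin text and punctuation, which is why a dedicated repair step is needed.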

Install

pip install "arabic-extract[pdf]"          # PDF text-layer extraction
pip install "arabic-extract[tesseract]"    # + image OCR via Tesseract (needs binary)
pip install "arabic-extract[easyocr]"      # + image OCR via EasyOCR (pure Python, ~200 MB)
pip install "arabic-extract[pymupdf]"      # + scanned PDF rendering via PyMuPDF
pip install "arabic-extract[all]"          # everything

Tesseract binary (for the tesseract extra):

Windows: download from https://github.com/UB-Mannheim/tesseract/wiki — install the Arabic language pack
Linux: sudo apt install tesseract-ocr tesseract-ocr-ara
macOS: brew install tesseract && brew install tesseract-lang
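Before running OCR it can help to confirm that the binary and the Arabic language data are both present. The following is a minimal sketch using only the standard library and Tesseract's own `--list-langs` flag; the helper name `tesseract_has_arabic` is made up for this example and is not part of arabic-extract.

```python
import shutil
import subprocess


def tesseract_has_arabic(binary: str = "tesseract") -> bool:
    """Return True if the binary is on PATH and lists the 'ara' language."""
    if shutil.which(binary) is None:
        return False
    try:
        proc = subprocess.run(
            [binary, "--list-langs"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Older Tesseract builds print the language list to stderr, so check both.
    listed = (proc.stdout + "\n" + proc.stderr).splitlines()
    return any(line.strip() == "ara" for line in listed)


if __name__ == "__main__":
    print("Arabic OCR data available:", tesseract_has_arabic())
```

If this prints False after installation, the usual causes are a missing PATH entry (Windows) or a missing language pack.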

Quick start

import arabic_extract as aocr

# PDF — auto-detects text layer vs scanned, repairs each page
result = aocr.extract("document.pdf")
print(result.text)           # clean logical Arabic, all pages joined
print(result.pages)          # per-page breakdown
print(result.contamination)  # how many words needed repair

# Scanned image
result = aocr.extract("scan.jpg")
print(result.text)

# Explicit PDF extraction
result = aocr.extract_pdf("document.pdf", engine="tesseract")

# Explicit image extraction
result = aocr.extract_image("scan.png", engine="easyocr")

# Chain into CAMeL Tools (normalize=True is the default)
result = aocr.extract("document.pdf", normalize=True)
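
For a folder of documents, the same call can be wrapped in a small loop. This is a sketch rather than a library feature: `extract_folder` and the `SUPPORTED` suffix set are hypothetical, and the extraction function is passed in as a parameter so the helper relies only on the `.text` attribute shown above.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumption: an illustrative suffix list, not the package's official one.
SUPPORTED = {".pdf", ".jpg", ".jpeg", ".png"}


def extract_folder(folder: str, extract: Callable[[str], object]) -> Dict[str, str]:
    """Run `extract` on each supported file in `folder` and collect result.text."""
    texts: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in SUPPORTED:
            continue
        try:
            result = extract(str(path))
        except Exception as exc:  # the library's exception types are not documented here
            print(f"skipped {path.name}: {exc}")
            continue
        texts[path.name] = result.text
    return texts
```

With the package installed, `extract_folder("scans", aocr.extract)` would return a mapping from file name to repaired text.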

How it works

Input PDF or image
    │
    ├─ PDF with text layer  → pdfplumber extracts text (visual order)
    │                                     ↓
    ├─ Scanned PDF          → render page as image → OCR engine
    │                                     ↓
    └─ Image file           → OCR engine (Tesseract or EasyOCR)
                                          ↓
                               arabic-repair (de-shape + restore order)
                                          ↓
                               NFKC / CAMeL Tools normalization
                                          ↓
                               Clean logical Arabic text

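A single PDF can have mixed pages, some with a text layer and some scanned. Each page is routed independently: pages with a text layer go through text extraction, and scanned pages are rendered as images and sent to the OCR engine.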
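
The per-page decision in the diagram can be sketched as a small routing function. This is an illustration, not the package's actual logic: `route_page` is a hypothetical name, the method labels are borrowed from the Per-page results section, and the meaning given to "text_layer_empty" (no usable text and no OCR fallback) is an assumption.

```python
from typing import Callable, Optional, Tuple


def route_page(
    text_layer: Optional[str],
    ocr: Optional[Callable[[], str]] = None,
) -> Tuple[str, str]:
    """Pick an extraction path for one page and return (method, raw_text).

    text_layer: text pulled from the PDF page, or None/"" if there is none.
    ocr: zero-argument callable that renders and OCRs the page, if available.
    """
    if text_layer and text_layer.strip():
        return "text_layer", text_layer
    if ocr is not None:
        return "ocr", ocr()
    # Assumption: this label marks a page with no usable text and no OCR fallback.
    return "text_layer_empty", ""


# A mixed document: one text page, one scanned page, one page with nothing usable.
mixed = [("some text", None), ("", lambda: "ocr text"), (None, None)]
for number, (layer, ocr_fn) in enumerate(mixed, start=1):
    print(number, route_page(layer, ocr_fn)[0])
```

Whatever path a page takes, its raw text would still pass through the repair and normalization steps shown in the diagram.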

Per-page results

result = aocr.extract("document.pdf")

for page in result.pages:
    print(f"Page {page.page_number} [{page.method}]: {page.text[:80]}")
    # method: "text_layer" | "ocr" | "text_layer_empty"
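
The per-page fields make it straightforward to audit a document after extraction. The sketch below uses a stand-in `FakePage` class (hypothetical, carrying only the three attributes shown above) so it runs without the package; the same functions should work on `result.pages` if the real objects expose those attributes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakePage:
    """Stand-in for a per-page result; only the three documented attributes."""
    page_number: int
    method: str
    text: str


def summarize(pages: Iterable[FakePage]) -> Counter:
    """Count pages per extraction method."""
    return Counter(page.method for page in pages)


def pages_needing_review(pages: Iterable[FakePage]) -> List[int]:
    """Page numbers whose extraction produced no text at all."""
    return [p.page_number for p in pages if not p.text.strip()]


pages = [
    FakePage(1, "text_layer", "نص"),
    FakePage(2, "ocr", "نص"),
    FakePage(3, "text_layer_empty", ""),
]
print(summarize(pages))
print(pages_needing_review(pages))
```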

Ecosystem

Package	Role
arabic-rt	Core shaping / fix / unfix engine
arabic-repair	Detect and repair visual-order contamination
arabic-extract	Full PDF + image extraction pipeline
arabic-benchmark	Benchmark proving the reordering gap

License

MPL-2.0

Project details

This version

0.1.0

Jun 8, 2026

Download files

Source Distribution

arabic_extract-0.1.0.tar.gz (10.0 kB)

Uploaded Jun 8, 2026 Source

Built Distribution

arabic_extract-0.1.0-py3-none-any.whl (10.0 kB)

Uploaded Jun 8, 2026 Python 3

File details

Details for the file arabic_extract-0.1.0.tar.gz.

File metadata

Download URL: arabic_extract-0.1.0.tar.gz
Upload date: Jun 8, 2026
Size: 10.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arabic_extract-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9dc5facf7f9d2f4515c783f7e42d332cb3b27d9da0db7425fb51d1e1e56b78fe`
MD5	`156cc4436c1ca057d6934e9acbe04550`
BLAKE2b-256	`95fc5e9d91dcdfc515775e8a2b140baf1f9de777cf07f6fca94ff6cd5f77c4bc`
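
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper names `sha256_of` and `matches` are made up for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with an expected digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `matches("arabic_extract-0.1.0.tar.gz", "9dc5facf7f9d2f4515c783f7e42d332cb3b27d9da0db7425fb51d1e1e56b78fe")` should return True for an intact download.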

Provenance

The following attestation bundles were made for arabic_extract-0.1.0.tar.gz:

Publisher: publish.yml on balswyan/arabic-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_extract-0.1.0.tar.gz
- Subject digest: 9dc5facf7f9d2f4515c783f7e42d332cb3b27d9da0db7425fb51d1e1e56b78fe
- Sigstore transparency entry: 1753727123
- Sigstore integration time: Jun 8, 2026
Source repository:
- Permalink: balswyan/arabic-extract@d6b91ea068ce68834d23b8d8724a26f2efccb7fe
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/balswyan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d6b91ea068ce68834d23b8d8724a26f2efccb7fe
- Trigger Event: release

File details

Details for the file arabic_extract-0.1.0-py3-none-any.whl.

File metadata

Download URL: arabic_extract-0.1.0-py3-none-any.whl
Upload date: Jun 8, 2026
Size: 10.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arabic_extract-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dc949559c712be5c1cc5ed39d895bb6a96937698323a99d85714adae7ee95043`
MD5	`94c52ab193f0c68d418dd0041ad8f43e`
BLAKE2b-256	`107dc65d37a330d409ac09d49681534987672546d96b03cdecd13ae7f6a4bb68`

Provenance

The following attestation bundles were made for arabic_extract-0.1.0-py3-none-any.whl:

Publisher: publish.yml on balswyan/arabic-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_extract-0.1.0-py3-none-any.whl
- Subject digest: dc949559c712be5c1cc5ed39d895bb6a96937698323a99d85714adae7ee95043
- Sigstore transparency entry: 1753727147
- Sigstore integration time: Jun 8, 2026
Source repository:
- Permalink: balswyan/arabic-extract@d6b91ea068ce68834d23b8d8724a26f2efccb7fe
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/balswyan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d6b91ea068ce68834d23b8d8724a26f2efccb7fe
- Trigger Event: release
