
Clean Arabic text extraction from PDFs and scanned images — OCR + visual-order repair in one pipeline


arabic-extract

Clean Arabic text extraction from PDFs and scanned images — one call, clean output.

Combines PDF text extraction, image OCR, and arabic-repair into a single pipeline. Handles the visual-order problem that breaks standard Arabic NLP pipelines.

The problem it solves

Arabic PDFs and scanned documents often store text in visual order, using presentation-form characters. Standard normalization (Unicode NFKC, CAMeL Tools) maps the presentation forms back to base letters but cannot restore the reversed word order, so retrieval recall stays broken at ~27%. arabic-extract applies arabic-repair automatically, restoring both letter forms and word order before the text reaches your NLP pipeline.
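To make the failure mode concrete, here is a toy illustration that uses only Python's standard library. The sample word and variable names are chosen for illustration. The naive reversal at the end only shows what NFKC leaves undone; it is not a description of how arabic-repair works internally.

```python
import unicodedata

# Logical-order word: kaf, teh, beh.
logical = "\u0643\u062A\u0628"

# The same word as a PDF might store it: presentation-form glyphs in
# left-to-right visual order (beh final, teh medial, kaf initial).
visual = "\uFE90\uFE98\uFEDB"

# NFKC maps each presentation form back to its base letter...
normalized = unicodedata.normalize("NFKC", visual)

# ...but keeps the stored sequence, so the letters are still reversed.
print(normalized == logical)        # False
print(normalized[::-1] == logical)  # True
```

A search index built from `normalized` would not match a query typed as `logical`, which is the gap the repair step is meant to close.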

Install

pip install "arabic-extract[pdf]"          # PDF text-layer extraction
pip install "arabic-extract[tesseract]"    # + image OCR via Tesseract (needs the Tesseract binary)
pip install "arabic-extract[easyocr]"      # + image OCR via EasyOCR (pure Python, ~200 MB)
pip install "arabic-extract[pymupdf]"      # + scanned PDF rendering via PyMuPDF
pip install "arabic-extract[all]"          # everything

Tesseract binary: the tesseract extra needs the Tesseract executable installed separately on your system.
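A quick way to confirm the binary is reachable before running OCR is sketched below. The helper name `tesseract_languages` is hypothetical and not part of arabic-extract. It relies only on `shutil.which` and Tesseract's own `--list-langs` flag, and it assumes the Arabic model is registered under Tesseract's usual `ara` language code.

```python
import shutil
import subprocess


def tesseract_languages(binary="tesseract"):
    """Return the language codes Tesseract reports, or None if the binary is missing.

    Hypothetical helper for illustration; not part of arabic-extract.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    proc = subprocess.run(
        [path, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the list to stderr instead of stdout.
    lines = (proc.stdout or proc.stderr).splitlines()
    # The first line is a header; the remaining lines are language codes.
    return {line.strip() for line in lines[1:] if line.strip()}


langs = tesseract_languages()
if langs is None:
    print("Tesseract binary not found on PATH")
elif "ara" not in langs:
    print("Tesseract found, but no Arabic ('ara') language data")
else:
    print("Tesseract with Arabic language data is available")
```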

Quick start

import arabic_extract as aocr

# PDF — auto-detects text layer vs scanned, repairs each page
result = aocr.extract("document.pdf")
print(result.text)           # clean logical Arabic, all pages joined
print(result.pages)          # per-page breakdown
print(result.contamination)  # how many words needed repair

# Scanned image
result = aocr.extract("scan.jpg")
print(result.text)

# Explicit PDF extraction
result = aocr.extract_pdf("document.pdf", engine="tesseract")

# Explicit image extraction
result = aocr.extract_image("scan.png", engine="easyocr")

# Chain into CAMeL Tools (normalize=True is the default)
result = aocr.extract("document.pdf", normalize=True)

How it works

Input PDF or image
    │
    ├─ PDF with text layer  → pdfplumber extracts text (visual order)
    │                                     ↓
    ├─ Scanned PDF          → render page as image → OCR engine
    │                                     ↓
    └─ Image file           → OCR engine (Tesseract or EasyOCR)
                                          ↓
                               arabic-repair (de-shape + restore order)
                                          ↓
                               NFKC / CAMeL Tools normalization
                                          ↓
                               Clean logical Arabic text

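A single PDF can have mixed pages, some with a text layer and some scanned. Each page is routed independently: pages with a text layer go through pdfplumber, and scanned pages are rendered to an image and passed to the OCR engine.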
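The per-page routing in the diagram can be sketched roughly as follows. This is a minimal sketch, not the library's implementation: `route_page`, `run_ocr`, and `repair` are hypothetical names, the text-layer and OCR steps are injected as plain values and callables so the logic runs without any PDF dependency, and the meaning assigned to `"text_layer_empty"` (no usable text layer and no OCR engine available) is an assumption.

```python
from typing import Callable, Optional, Tuple


def route_page(
    text_layer: Optional[str],
    run_ocr: Optional[Callable[[], str]] = None,
    repair: Callable[[str], str] = lambda s: s,
) -> Tuple[str, str]:
    """Return (method, text) for one page. Hypothetical helper for illustration."""
    # A non-blank text layer is used directly, then repaired.
    if text_layer and text_layer.strip():
        return "text_layer", repair(text_layer)
    # Otherwise fall back to OCR when an engine is available.
    if run_ocr is not None:
        return "ocr", repair(run_ocr())
    # Assumed meaning: nothing to extract and no OCR engine to try.
    return "text_layer_empty", ""


pages = [
    route_page("text from the PDF text layer"),
    route_page("", run_ocr=lambda: "text from the OCR engine"),
    route_page(None),
]
print([method for method, _ in pages])
# ['text_layer', 'ocr', 'text_layer_empty']
```

Passing the repair step as a callable keeps the routing decision separate from the text cleanup, which mirrors the diagram: every branch converges on the same repair and normalization stages.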

Per-page results

result = aocr.extract("document.pdf")

for page in result.pages:
    print(f"Page {page.page_number} [{page.method}]: {page.text[:80]}")
    # method: "text_layer" | "ocr" | "text_layer_empty"
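The per-page `method` field makes it easy to see how much of a document fell back to OCR. The sketch below assumes only the `page_number` and `method` attributes shown above. `summarize_methods` is a hypothetical helper, and the demo uses stand-in objects rather than a real extraction result so that it runs without arabic-extract installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_methods(pages):
    """Count pages per extraction method and list the pages that needed OCR."""
    counts = Counter(page.method for page in pages)
    ocr_pages = [page.page_number for page in pages if page.method == "ocr"]
    return counts, ocr_pages


# Stand-ins for result.pages; with the real library, pass result.pages instead.
demo_pages = [
    SimpleNamespace(page_number=1, method="text_layer"),
    SimpleNamespace(page_number=2, method="ocr"),
    SimpleNamespace(page_number=3, method="ocr"),
]

counts, ocr_pages = summarize_methods(demo_pages)
print(dict(counts))  # {'text_layer': 1, 'ocr': 2}
print(ocr_pages)     # [2, 3]
```

OCR output is generally noisier than a text layer, so the list of OCR pages is a reasonable place to start when spot-checking extraction quality.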

Ecosystem

Package            Role
arabic-rt          Core shaping / fix / unfix engine
arabic-repair      Detect and repair visual-order contamination
arabic-extract     Full PDF + image extraction pipeline
arabic-benchmark   Benchmark proving the reordering gap

License

MPL-2.0


