arabic-repair
Detect and repair visually-baked Arabic text extracted from PDFs, OCR engines, and legacy sources.
The problem
Arabic text stored in old PDF streams, scanned documents (OCR), and legacy systems is often baked: characters are stored as Unicode Arabic Presentation Forms (Forms-A, U+FB50–U+FDFF, and Forms-B, U+FE70–U+FEFF) in reversed visual order rather than logical reading order. Standard tools such as Unicode NFKC normalization and CAMeL Tools map the presentation forms back to base letters but do not restore the character order, so the text remains scrambled.
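The effect can be reproduced with the standard library alone. The snippet below is a minimal illustration, not part of arabic-repair: it builds a hand-made baked string for one word and shows that NFKC recovers the base letters while leaving them in visual order. The sample word and variable names are chosen only for this example.

```python
import unicodedata

# The word "كتاب" in logical order: kaf, teh, alef, beh.
logical = "\u0643\u062A\u0627\u0628"

# The same word as a baked string: contextual presentation forms
# (beh isolated, alef final, teh medial, kaf initial) stored in
# visual left-to-right order, i.e. reversed.
baked = "\uFE8F\uFE8E\uFE98\uFEDB"

normalized = unicodedata.normalize("NFKC", baked)

# NFKC maps every presentation form back to a base Arabic letter...
assert all("\u0600" <= ch <= "\u06FF" for ch in normalized)
# ...but the letters are still in visual order, so the word is reversed.
assert normalized != logical
assert normalized[::-1] == logical
```

Reversing a single word by hand works here only because the sample contains nothing but Arabic letters; real extracted text mixes digits, Latin runs, and punctuation.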
arabic-repair fixes both: it de-shapes the presentation forms and restores logical word order,
then hands clean text to your downstream NLP pipeline.
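To make the two steps concrete, here is a deliberately naive sketch of de-shaping followed by reordering. It is an assumption-laden illustration and not arabic-repair's actual algorithm: `naive_repair` and both regular expressions are hypothetical names, and the reordering only handles Arabic text mixed with ASCII letters and digits.

```python
import re
import unicodedata

# Assigned Arabic Presentation Forms: Forms-A plus Forms-B
# (U+FEFF, the byte order mark, is deliberately excluded).
_PRESENTATION = re.compile("[\uFB50-\uFDFF\uFE70-\uFEFC]")
# Runs of ASCII letters and digits, which read left to right.
_LTR_RUN = re.compile(r"[A-Za-z0-9]+")


def naive_repair(line: str) -> str:
    """Toy repair: de-shape with NFKC, then undo the visual ordering."""
    if not _PRESENTATION.search(line):
        return line  # nothing baked, leave the line untouched
    deshaped = unicodedata.normalize("NFKC", line)
    flipped = deshaped[::-1]  # visual order -> logical order
    # The flip also reversed left-to-right runs, so flip those back.
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], flipped)


# "كتاب 2024" as it would appear baked: the number on the left,
# followed by the reversed presentation forms.
baked = "2024 \uFE8F\uFE8E\uFE98\uFEDB"
assert naive_repair(baked) == "\u0643\u062A\u0627\u0628 2024"
```

A production repair step has to deal with cases this sketch ignores, such as ligatures, punctuation mirroring, and partially baked lines, which is the gap the library is meant to fill.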
Install
```shell
pip install arabic-repair
```
Quick start
```python
import arabic_repair as ar

# raw_text is a str obtained from your PDF extractor or OCR engine
# Repair a string from a PDF extractor or OCR engine
clean = ar.repair(raw_text)

# Inspect contamination before committing to repair
info = ar.detect(raw_text)
print(info.contamination_type)  # "fully_baked" | "partially_baked" | "clean"
print(info.contaminated_ratio)  # 0.0 – 1.0

# Chain into CAMeL Tools for full normalization
from camel_tools.utils.normalize import normalize_unicode
fully_clean = normalize_unicode(ar.repair(raw_text))

# Stream large documents line by line
with open("big_doc.txt", encoding="utf-8") as f:
    for line in ar.repair_stream(f):
        process(line)  # process() stands in for your own per-line handler
```
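For intuition about what a contamination ratio could measure, the sketch below counts presentation-form characters against all Arabic characters in a string. This is an assumption about one plausible definition, not the library's documented behaviour: `toy_detect`, `ToyDetection`, and the classification thresholds are hypothetical and only reuse the field names and labels shown in the quick start.

```python
from dataclasses import dataclass


def _is_presentation_form(ch: str) -> bool:
    # Arabic Presentation Forms-A and the assigned part of Forms-B.
    return "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"


def _is_base_arabic(ch: str) -> bool:
    # The main Arabic block (letters, digits, and marks).
    return "\u0600" <= ch <= "\u06FF"


@dataclass
class ToyDetection:
    contamination_type: str   # "fully_baked" | "partially_baked" | "clean"
    contaminated_ratio: float  # 0.0 to 1.0


def toy_detect(text: str) -> ToyDetection:
    """Hypothetical detector: share of Arabic characters that are baked."""
    baked = sum(1 for ch in text if _is_presentation_form(ch))
    base = sum(1 for ch in text if _is_base_arabic(ch))
    total = baked + base
    ratio = baked / total if total else 0.0
    # Assumed thresholds: all baked, none baked, or anything in between.
    if ratio == 0.0:
        kind = "clean"
    elif ratio == 1.0:
        kind = "fully_baked"
    else:
        kind = "partially_baked"
    return ToyDetection(kind, ratio)


print(toy_detect("\u0628\uFE8E"))
```

Checking the ratio first lets a pipeline skip clean input and flag partially baked documents for manual review instead of repairing them blindly.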
What it fixes / what it doesn't
| Capability | arabic-repair | NFKC | CAMeL Tools |
|---|---|---|---|
| Presentation forms → base letters | ✓ | ✓ | ✓ |
| Visual order → logical order | ✓ | ✗ | ✗ |
| Alef variant normalization | ✗ | ✗ | ✓ |
| Yaa / teh-marbuta normalization | ✗ | ✗ | ✓ |
| Diacritics | ✗ | ✗ | ✓ |
Use arabic-repair first, then CAMeL Tools for linguistic normalization.
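The recommended ordering can be expressed as a small lazy pipeline that applies each stage to every line in turn. The sketch below uses stand-in stages so that it runs with the standard library alone: `run_pipeline`, `toy_repair`, and `toy_normalize_alef` are hypothetical names, and neither stand-in reflects how arabic-repair or CAMeL Tools is implemented.

```python
import unicodedata
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[str], str]


def run_pipeline(lines: Iterable[str], stages: List[Stage]) -> Iterator[str]:
    """Lazily apply each stage, in order, to every line."""
    for line in lines:
        for stage in stages:
            line = stage(line)
        yield line


def toy_repair(line: str) -> str:
    # Stand-in for the repair step: de-shape, then reverse the line.
    # Assumes a pure-Arabic line with no trailing newline.
    return unicodedata.normalize("NFKC", line)[::-1]


# Stand-in for linguistic normalization: fold common alef variants
# (madda above, hamza above, hamza below) to bare alef.
_ALEF_VARIANTS = str.maketrans(
    {"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"}
)


def toy_normalize_alef(line: str) -> str:
    return line.translate(_ALEF_VARIANTS)


# "أحمد" baked: dal final, meem medial, hah initial, alef-hamza isolated.
baked_lines = ["\uFEAA\uFEE4\uFEA3\uFE83"]
result = list(run_pipeline(baked_lines, [toy_repair, toy_normalize_alef]))
assert result == ["\u0627\u062D\u0645\u062F"]
```

Running the repair stage first matters in this sketch because the alef table only matches base letters; a presentation form such as U+FE83 would pass through the normalization stage unchanged.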
License
MPL-2.0
File details
Details for the file arabic_repair-0.1.0.tar.gz.
File metadata
- Download URL: arabic_repair-0.1.0.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f0898f238c083053393fa89c514875939a7e7b81a0299422bd40c213180b7b47 |
| MD5 | 187011a340cef022974789cdf1bf6c28 |
| BLAKE2b-256 | 166755245d10e3903ef09fffe11c3e4b5c139b72c641e508ff6ff067b2cb0f86 |
Provenance
The following attestation bundles were made for arabic_repair-0.1.0.tar.gz:
Publisher: publish.yml on balswyan/arabic-repair
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_repair-0.1.0.tar.gz
- Subject digest: f0898f238c083053393fa89c514875939a7e7b81a0299422bd40c213180b7b47
- Sigstore transparency entry: 1716794035
- Permalink: balswyan/arabic-repair@4535829189f7bd1e4c78e061f2842101e84fcd09
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/balswyan
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4535829189f7bd1e4c78e061f2842101e84fcd09
- Trigger Event: release
File details
Details for the file arabic_repair-0.1.0-py3-none-any.whl.
File metadata
- Download URL: arabic_repair-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e75a7da78761f2b6d5c26ba8b9611e5f5c7d9f68cc7fc3847c7eb55d0895ecb4 |
| MD5 | 37c46a864064ee897bef8689fe6b476d |
| BLAKE2b-256 | 7e81e0ef0a537d86f6f450c0428542c3e0f9e46b763d7462787f121b80f9f1b0 |
Provenance
The following attestation bundles were made for arabic_repair-0.1.0-py3-none-any.whl:
Publisher: publish.yml on balswyan/arabic-repair
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_repair-0.1.0-py3-none-any.whl
- Subject digest: e75a7da78761f2b6d5c26ba8b9611e5f5c7d9f68cc7fc3847c7eb55d0895ecb4
- Sigstore transparency entry: 1716794189
- Permalink: balswyan/arabic-repair@4535829189f7bd1e4c78e061f2842101e84fcd09
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/balswyan
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4535829189f7bd1e4c78e061f2842101e84fcd09
- Trigger Event: release