Arabic text normalization and cleaning — pure-Python, composable, reproducible.
araclean
Status: pre-release (0.x). The v1 normalization core is complete and fully tested; the API may still shift before 1.0.
araclean is non-destructive by default: the bare call is lossless encoding repair
(Unicode form, presentation forms, tatweel, bidi/zero-width characters, look-alike
letters, whitespace) and never silently strips tashkeel or folds letters. Everything
lossy is opt-in through named, serializable profiles (SEARCH, ML, SOCIAL,
CLASSICAL), so the exact preprocessing a corpus went through can be published and
reproduced.
>>> from araclean import normalize
>>> normalize("العـــربية") # lossless encoding repair (default)
'العربية'
>>> normalize("اَلسّلامُ عليكم", profile="search") # opt-in lossy folds for recall
'السلام عليكم'
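To make the split between default repair and opt-in lossy folds concrete, here is a minimal standard-library sketch. It is not araclean's implementation: the step functions, the `PROFILES` table, the `search_like` name, and `normalize_sketch` are all hypothetical, and the real profiles apply more steps than shown here. NFKC is used only as a convenient stand-in for mapping presentation forms to base letters.

```python
import json
import re
import unicodedata

TATWEEL = "\u0640"
# A small, assumed set of invisible characters: zero-width space, bidi marks,
# bidi embeddings/overrides/isolates, and the byte-order mark.
INVISIBLE = re.compile("[\u200b\u200e\u200f\u202a-\u202e\u2066-\u2069\ufeff]")
# Core tashkeel marks, fathatan (U+064B) through sukun (U+0652).
TASHKEEL = re.compile("[\u064b-\u0652]")


def repair(text: str) -> str:
    """Encoding-style repair that leaves tashkeel and letter identity alone."""
    text = unicodedata.normalize("NFKC", text)  # presentation forms -> base letters
    text = text.replace(TATWEEL, "")            # drop elongation characters
    text = INVISIBLE.sub("", text)              # drop invisible control characters
    return " ".join(text.split())               # collapse runs of whitespace


def strip_tashkeel(text: str) -> str:
    """Lossy fold: the diacritics cannot be recovered afterwards."""
    return TASHKEEL.sub("", text)


STEPS = {"repair": repair, "strip_tashkeel": strip_tashkeel}

# Hypothetical profiles: each is just an ordered list of step names.
PROFILES = {
    "default": ["repair"],
    "search_like": ["repair", "strip_tashkeel"],
}


def normalize_sketch(text: str, profile: str = "default") -> str:
    """Apply the named profile's steps in order."""
    for step_name in PROFILES[profile]:
        text = STEPS[step_name](text)
    return text


def dump_profile(profile: str) -> str:
    """Serialize a profile so the preprocessing can be published with a corpus."""
    return json.dumps({"profile": profile, "steps": PROFILES[profile]}, sort_keys=True)
```

Because a profile in this sketch is plain data (a name plus an ordered list of step names), it can be written out with `dump_profile` next to a dataset and replayed later. This is the same idea behind araclean's named, serializable profiles.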
Documentation
Full documentation lives at https://mhdmartini.github.io/araclean/:
- Getting started — install, first call, choosing a profile.
- Profiles — every step each profile applies, lossless vs lossy (generated from the code).
- Guides — the CLI, pandas & polars, tuning and composing pipelines, custom steps, reproducibility, and stopwords.
- Why araclean — the rationale and what sets it apart.
- API reference and CLI reference.
Every Python example in the docs is executed as a doctest in CI, and the generated pages (profiles, glossary, CLI reference) are drift-checked against the code.
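For readers unfamiliar with doctests, the sketch below shows the general mechanism using only the standard-library `doctest` module. The snippet text and the `run_snippet` helper are illustrative assumptions; the project's actual CI wiring is not shown here.

```python
import doctest

# An illustrative docs snippet: a prompt line followed by its expected output.
DOC_SNIPPET = '''
>>> "  a   b ".split()
['a', 'b']
'''


def run_snippet(text: str) -> doctest.TestResults:
    """Parse the examples in `text`, run them, and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, globs={}, name="snippet", filename=None, lineno=0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.summarize(verbose=False)
```

A documentation example whose real output no longer matches the printed output shows up as a non-zero `failed` count, which is how stale examples get caught.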
Install
pip install araclean
Optional extras (declared now, populated by later slices):
pip install "araclean[cli]" # command-line interface
pip install "araclean[pandas]" # pandas Series accessor
pip install "araclean[polars]" # polars accessor
pip install "araclean[all]" # everything
The core install is lean: it requires only pydantic v2 — no compiler, Java, or data download.
Development
This project uses uv for environments and pre-commit for the quality gate.
uv sync # create the dev environment
uv run pre-commit install # wire up the pre-commit hooks
uv run pre-commit run --all-files
The gate runs ruff, mypy --strict, pyright, pytest, and cspell
(canonical Arabic terminology per GLOSSARY.md).
Commits & versioning
The version and changelog are derived from commit messages by Commitizen (see ADR-0008). Every commit must be a Conventional Commit; the format is enforced by a commit-msg hook (wired by uv run pre-commit install) and in CI on every PR.
feat(steps): add RemoveTashkeel step # → minor bump
fix(pipeline): preserve step order # → patch bump
feat(api)!: rename normalize() argument # breaking → minor (we are pre-1.0)
A feat commit bumps the minor version, fix and all other types bump the patch, and a breaking change (! or a BREAKING CHANGE: footer) also bumps only the minor, because the project stays in 0.x until 1.0 is declared. To cut a release:
uv run cz bump # compute the bump, update pyproject.toml + uv.lock + CHANGELOG.md, tag vX.Y.Z
uv run cz changelog # preview release notes without bumping
Never hand-edit [project].version — Commitizen owns it.
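The pre-1.0 bump rules described in this section can be sketched as a small function. This illustrates the rules as stated here and is not Commitizen's implementation: `bump_for`, `apply_bump`, and the simplified header regex are hypothetical names and assumptions.

```python
import re

# Simplified Conventional Commit header: type, optional (scope), optional "!", subject.
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<subject>.+)$"
)


def bump_for(message: str) -> str:
    """Return "minor" or "patch" for one commit message under the 0.x rules."""
    header, _, body = message.partition("\n")
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"not a Conventional Commit header: {header!r}")
    breaking = bool(match["bang"]) or any(
        line.startswith("BREAKING CHANGE:") for line in body.splitlines()
    )
    # Pre-1.0: a breaking change bumps the minor, the same as a feat.
    if breaking or match["type"] == "feat":
        return "minor"
    return "patch"


def apply_bump(version: str, bump: str) -> str:
    """Apply a "minor" or "patch" bump to an X.Y.Z version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, `bump_for("feat(api)!: rename normalize() argument")` yields a minor bump rather than a major one, matching the rule that the project stays in 0.x until 1.0 is declared.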
License