araclean

Arabic text normalization and cleaning — pure-Python, composable, reproducible.

Status: pre-release (0.x). The v1 normalization core is complete and fully tested; the API may still shift before 1.0.

araclean is non-destructive by default: the bare call is lossless encoding repair (Unicode form, presentation forms, tatweel, bidi/zero-width characters, look-alike letters, whitespace) and never silently strips tashkeel or folds letters. Everything lossy is opt-in through named, serializable profiles (SEARCH, ML, SOCIAL, CLASSICAL), so the exact preprocessing a corpus went through can be published and reproduced.

>>> from araclean import normalize
>>> normalize("العـــربية")                          # lossless encoding repair (default)
'العربية'
>>> normalize("اَلسّلامُ عليكم", profile="search")   # opt-in lossy folds for recall
'السلام عليكم'
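To make the split between lossless repair and opt-in lossy steps concrete, here is a minimal standard-library sketch. It is not araclean's implementation: `ProfileSketch`, `normalize_sketch`, and the character sets are assumptions made for illustration, it skips look-alike letters entirely, and it models only one lossy step (the real SEARCH profile may apply more folds).

```python
import json
import re
import unicodedata
from dataclasses import asdict, dataclass

TATWEEL = "\u0640"
# Illustrative, non-exhaustive set of zero-width and bidi control characters.
INVISIBLES = re.compile("[\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff]")
# Tashkeel: fathatan through sukun, plus superscript alef.
TASHKEEL = re.compile("[\u064b-\u0652\u0670]")


@dataclass(frozen=True)
class ProfileSketch:
    """Hypothetical stand-in for a named profile; every lossy step defaults to off."""

    name: str = "default"
    strip_tashkeel: bool = False

    def to_json(self) -> str:
        # Sorted keys give a stable string that can be published with a corpus.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ProfileSketch":
        return cls(**json.loads(payload))


def _fold_presentation_form(ch: str) -> str:
    # Map Arabic presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFC) to base letters.
    cp = ord(ch)
    if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
        return unicodedata.normalize("NFKC", ch)
    return ch


def normalize_sketch(text: str, profile: ProfileSketch = ProfileSketch()) -> str:
    # Repair steps that always run.
    text = "".join(_fold_presentation_form(ch) for ch in text)
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    text = INVISIBLES.sub("", text)
    text = " ".join(text.split())  # collapses every whitespace run, newlines included
    # Lossy step, applied only when the profile asks for it.
    if profile.strip_tashkeel:
        text = TASHKEEL.sub("", text)
    return text


search = ProfileSketch(name="search", strip_tashkeel=True)
marked = "\u0627\u064e\u0644\u0633\u0651\u0644\u0627\u0645\u064f \u0639\u0644\u064a\u0643\u0645"
print(normalize_sketch(marked) == marked)   # the default call keeps tashkeel
print(normalize_sketch(marked, search))     # tashkeel removed on request
print(search.to_json())                     # the exact settings, ready to publish
```

Because the profile is plain data, a JSON string stored next to a corpus is enough to rebuild the same settings later with `ProfileSketch.from_json`.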

Documentation

Full documentation lives at https://mhdmartini.github.io/araclean/.

Every Python example in the docs is executed as a doctest in CI, and the generated pages (profiles, glossary, CLI reference) are drift-checked against the code.
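As a rough illustration of what running a documentation example as a doctest involves, the sketch below uses only the standard library's `doctest` module. The function `strip_tatweel` and the helper `run_docstring` are hypothetical; they are not part of araclean and do not describe its CI configuration.

```python
import doctest


def strip_tatweel(text: str) -> str:
    r"""Remove tatweel (U+0640) from text.

    >>> strip_tatweel("a\u0640b")
    'ab'
    """
    return text.replace("\u0640", "")


def run_docstring(func) -> doctest.TestResults:
    # Parse the examples out of the docstring and execute them,
    # comparing each printed result with the expected text.
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_docstring(strip_tatweel)
print(results.attempted, results.failed)
```

If the documented output ever drifts from what the code returns, `failed` becomes non-zero, which is the property a CI job can gate on.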

Install

pip install araclean

Optional extras (declared now; their dependencies will be populated in later development slices):

pip install "araclean[cli]"     # command-line interface
pip install "araclean[pandas]"  # pandas Series accessor
pip install "araclean[polars]"  # polars accessor
pip install "araclean[all]"     # everything

The core install is lean: it requires only pydantic v2 — no compiler, Java, or data download.
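One way to check a claim like this after installing is to read the requirements recorded in the installed package's metadata. The sketch below uses `importlib.metadata` from the standard library; the helper names and the sample requirement strings (including `example-cli-lib`) are assumptions for illustration, not araclean's actual metadata.

```python
import re
from importlib import metadata

# Requirements that belong to an optional extra carry an `extra == "..."` marker.
EXTRA_MARKER = re.compile(r"\bextra\s*==")


def split_requirements(requirements: list[str]) -> tuple[list[str], list[str]]:
    """Separate core requirements from those gated behind an extra."""
    core, extras = [], []
    for req in requirements:
        _, _, marker = req.partition(";")
        (extras if EXTRA_MARKER.search(marker) else core).append(req)
    return core, extras


def declared_requirements(dist_name: str) -> list[str]:
    """Return the requirement strings of an installed distribution, or [] if absent."""
    try:
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []


# Hypothetical sample data; a real check would pass declared_requirements("araclean").
sample = ["pydantic>=2", 'example-cli-lib; extra == "cli"']
core, extras = split_requirements(sample)
print(core)
print(extras)
```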

Development

This project uses uv for environments and pre-commit for the quality gate.

uv sync                       # create the dev environment
uv run pre-commit install     # wire up the pre-commit hooks
uv run pre-commit run --all-files

The gate runs ruff, mypy --strict, pyright, pytest, and cspell (canonical Arabic terminology per GLOSSARY.md).

Commits & versioning

The version and changelog are derived from commit messages by Commitizen — see ADR-0008. Every commit must be a Conventional Commit; the format is enforced by a commit-msg hook (wired by uv run pre-commit install) and in CI on every PR.

feat(steps): add RemoveTashkeel step     # → minor bump
fix(pipeline): preserve step order        # → patch bump
feat(api)!: rename normalize() argument   # breaking → minor (we are pre-1.0)

A feat commit bumps the minor version, and fix and all other types bump the patch. A breaking change (! or a BREAKING CHANGE: footer) also bumps only the minor, because the project stays in 0.x until 1.0 is declared. To cut a release:

uv run cz bump        # compute the bump, update pyproject.toml + uv.lock + CHANGELOG.md, tag vX.Y.Z
uv run cz changelog   # preview release notes without bumping

Never hand-edit [project].version — Commitizen owns it.
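The bump rule described in this section can be sketched in a few lines of Python. This is only a reading of the rule as stated here, not Commitizen's implementation or configuration; `bump_kind` and `next_version` are hypothetical helpers, and the sketch deliberately handles 0.x versions only.

```python
import re

# type, optional (scope), optional "!" for breaking, then ": subject"
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<subject>.+)$"
)


def bump_kind(message: str) -> str:
    """Classify one Conventional Commit message as a 'minor' or 'patch' bump."""
    header, _, body = message.partition("\n")
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"not a Conventional Commit: {header!r}")
    breaking = bool(match["breaking"]) or "BREAKING CHANGE:" in body
    # Pre-1.0: breaking changes and features both land on the minor component.
    if breaking or match["type"] == "feat":
        return "minor"
    return "patch"


def next_version(current: str, messages: list[str]) -> str:
    """Apply the largest bump found in `messages` to a 0.x version string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if major != 0:
        raise ValueError("this sketch only covers 0.x versions")
    kinds = {bump_kind(message) for message in messages}
    if not kinds:
        return current
    if "minor" in kinds:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("0.2.0", ["fix(pipeline): preserve step order"]))
print(next_version("0.2.0", ["feat(api)!: rename normalize() argument"]))
```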

License

MIT
