Rust-backed transliteration similar to Python Unidecode, with optional PyO3 bindings for Python
unidecode-rs
Rust implementation of the Unidecode transliteration logic with optional
PyO3 bindings to expose a drop-in replacement for the Python unidecode
package.
This repository contains:
- src/: Rust implementation and PyO3 bindings (optional feature python).
- python/: a small Python shim that provides upstream-compatible signatures and forwards to the compiled extension when available.
- tests/: Rust unit tests and a parity harness for upstream Python tests.
- bench/: benchmark helpers comparing pure-Python unidecode vs the compiled unidecode-rs extension.
Quickstart — Rust library usage
Add unidecode-rs as a dependency in your Cargo.toml (example):
[dependencies]
unidecode-rs = { git = "https://github.com/gmaOCR/unidecode-rs", tag = "v0.0.1" }
Then call the API from Rust:
use unidecode_rs::unidecode;

fn main() {
    // Transliterate a mixed accented string to ASCII.
    let out = unidecode("Héllo Wörld — café");
    println!("{}", out);
}
See src/lib.rs and src/lib_py.rs for additional exported functions.
Quickstart — Python users (drop-in replacement)
If you want to replace the pure-Python unidecode package with the
Rust-backed implementation (faster), follow these steps.
- Build and install the Python wheel using maturin (local develop):
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip maturin
cd unidecode-rs
maturin develop --release --features python
This will build the compiled extension and install a small Python package
that exposes the same API surface as upstream unidecode.
- Replace imports in your Python code
If your code does from unidecode import unidecode, the recommended way is
to install unidecode-rs into the same environment (see above). The
repository also contains a small shim at unidecode-rs/python/unidecode_rs
which ensures the exported callables use the same parameter names and raise
the same exception types as upstream.
- Compatibility notes
  - The shim aims to provide function signatures and semantics identical to upstream unidecode, including errors handling and surrogate behavior. Where upstream behavior depends on narrow/wide Python builds, we mirror the upstream tests by warning and stripping surrogates.
  - If you need inspect.signature compatibility, the shim exposes the textual signature (string, errors=None, replace_str=None) so tooling that introspects signatures works as expected.
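As a rough illustration of what signature compatibility means for introspection tools, the sketch below defines a hypothetical stand-in function with the parameter names quoted above and inspects it. The function body is a placeholder for illustration only, not the shim's actual code.

```python
import inspect


def unidecode(string, errors=None, replace_str=None):
    """Hypothetical stand-in with the parameter names the shim advertises."""
    # Placeholder body: the real shim forwards to the compiled extension.
    return string


sig = inspect.signature(unidecode)
print(sig)                   # (string, errors=None, replace_str=None)
print(list(sig.parameters))  # ['string', 'errors', 'replace_str']
```

With the extension installed, the same two calls could be pointed at the shim's callable to check what tooling will see.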
Benchmarks
See bench/bench_unidecode_compare.py — it compares call latency and
throughput for representative inputs. During development the Rust
implementation showed sizable speedups (multi‑x) vs the pure Python
implementation for large inputs.
Publishing to PyPI (OIDC)
This repository includes a GitHub Actions workflow to publish manylinux
wheels to PyPI using OIDC token minting (no long-lived PyPI token in the
repo). See .github/workflows/publish-pypi.yml for implementation. To
publish:
- Tag a release on GitHub (e.g. v1.2.3) and push the tag.
- The workflow builds manylinux wheels using maturin and exchanges an OIDC token for a short-lived PyPI API token (mint). The workflow then uploads dists to PyPI.
Note: see the workflow file for details and required runner permissions.
Development notes
- Use cargo test for Rust unit tests.
- Use maturin develop --release --features python to iterate on Python bindings and local tests.
- The repo contains a parity harness that runs the upstream unidecode Python tests against this compiled extension to track functional parity.
License
Distributed under the project license (see LICENSE).
unidecode-rs — Unicode → ASCII transliteration faithful to Python
Fast Rust implementation (optional Python bindings via PyO3) targeting bit‑for‑bit equivalence with Python Unidecode. Provides:
- Same output as Unidecode for all covered tables
- Noticeably higher performance (see perf snapshot in tests)
- Golden tests comparing dynamically against the Python version
- High coverage on critical paths (bitmap + per‑block dispatch)
Quick summary
- Rust usage: unidecode_rs::unidecode("déjà") -> "deja"
- Python usage: build extension with maturin develop --features python
- Idempotence: unidecode(unidecode(x)) == unidecode(x) (after the first pass everything is ASCII)
- Golden tests: ensure exact parity with Python
Rust example
use unidecode_rs::unidecode;
fn main() {
println!("{}", unidecode("PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ")); // PRILIS ZLUTOUCKY KUN
}
Install / build (Rust only)
cargo add unidecode-rs
# or add manually in Cargo.toml then
cargo build
Build the Python extension (development)
Prerequisites: Rust stable, Python ≥3.8, pip.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip maturin
maturin develop --release --features python
python -c "import unidecode_rs; print(unidecode_rs.unidecode('déjà vu'))"
To build a distributable wheel:
maturin build --release --features python -i python
pip install target/wheels/*.whl
Python API
import unidecode_rs
print(unidecode_rs.unidecode("Příliš žluťoučký kůň"))
Minimal API: single function unidecode(text: str, errors: Optional[str] = None, replace_str: Optional[str] = None) -> str.
Idempotence — what is it?
A function is idempotent if applying it multiple times yields the same result as applying it once. Here:
unidecode(unidecode(s)) == unidecode(s)
After the first transliteration the output is pure ASCII; a second pass does nothing. A dedicated test validates this over multi‑script samples.
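A minimal sketch of such an idempotence check follows. It uses a stand-in transliterator built on the standard library (accent stripping via NFKD), which is an assumption made so the example runs without the compiled extension; it is not the Unidecode algorithm. The helper name is_idempotent is likewise hypothetical.

```python
import unicodedata


def toy_transliterate(text):
    """Stand-in (NOT unidecode): strip accents, drop any remaining non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def is_idempotent(func, samples):
    """Return True if func(func(s)) == func(s) holds for every sample."""
    return all(func(func(s)) == func(s) for s in samples)


samples = ["déjà vu", "Příliš žluťoučký kůň", "plain ascii", ""]
print(is_idempotent(toy_transliterate, samples))
```

Once the extension is built, unidecode_rs.unidecode can be passed as func in place of the stand-in.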
Golden tests (Python parity)
golden_equivalence tests run the Python Unidecode library in a subprocess and diff outputs across samples (Latin + accents, Cyrillic, Greek, CJK, emoji). Any mismatch fails the test.
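The comparison step of a golden test can be sketched as below. Both transliterators here are toy stand-ins (assumptions for illustration): the "reference" strips accents with the standard library, and the "candidate" uses a deliberately incomplete table so that a mismatch shows up. The real tests are written in Rust and invoke Python Unidecode in a subprocess; this only shows the diffing idea.

```python
import unicodedata

CANDIDATE_TABLE = {"é": "e", "à": "a"}  # deliberately incomplete toy table


def reference(text):
    """Toy reference: NFKD accent stripping, non-ASCII dropped."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def candidate(text):
    """Toy candidate: table lookup, unmapped non-ASCII characters dropped."""
    return "".join(ch if ord(ch) < 128 else CANDIDATE_TABLE.get(ch, "") for ch in text)


def diff_outputs(ref, cand, samples):
    """Collect (input, expected, actual) for every sample where outputs differ."""
    return [(s, ref(s), cand(s)) for s in samples if ref(s) != cand(s)]


mismatches = diff_outputs(reference, candidate, ["déjà", "ñandú"])
print(mismatches)  # [('ñandú', 'nandu', 'and')]
```

A golden test would fail whenever the returned list is non-empty.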
Targeted run:
cargo test -- --nocapture golden_equivalence
Coverage & critical paths
Dispatch design:
- Presence bitmap per 256‑codepoint block (BLOCK_BITMAPS) for quick negative checks.
- Large generated match providing PHF table access per block.
Extra tests (lookup_paths.rs + internal tests in lib.rs) exercise:
- Bit zero ⇒ lookup returns None (negative path)
- Bit one ⇒ lookup returns non‑empty string
- Out‑of‑range block ⇒ early exit
- ASCII parity / idempotence
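The lookup paths listed above can be modelled in a few lines. This is a sketch under stated assumptions, not the crate's code: the name BLOCK_BITMAPS comes from the text, but the data layout (one 256-bit integer per block, plain dicts standing in for the generated per-block tables) and the toy entries are invented for illustration.

```python
# Toy per-block tables keyed by block index (code point >> 8); offsets are cp & 0xFF.
TOY_TABLES = {
    0x00: {0xE9: "e", 0xF6: "o"},  # é (U+00E9), ö (U+00F6)
    0x04: {0x34: "d", 0x30: "a"},  # д (U+0434), а (U+0430)
}
MAX_BLOCK = 0x04


def build_bitmap(entries):
    """Set one bit per populated offset within a 256-codepoint block."""
    bitmap = 0
    for offset in entries:
        bitmap |= 1 << offset
    return bitmap


# One presence bitmap per block; blocks without entries get an all-zero bitmap.
BLOCK_BITMAPS = [build_bitmap(TOY_TABLES.get(b, {})) for b in range(MAX_BLOCK + 1)]


def lookup(cp):
    block, offset = cp >> 8, cp & 0xFF
    if block >= len(BLOCK_BITMAPS):
        return None  # out-of-range block: early exit
    if not (BLOCK_BITMAPS[block] >> offset) & 1:
        return None  # bit zero: negative path, the table is never touched
    return TOY_TABLES[block][offset]  # bit one: a mapping is guaranteed to exist


print(lookup(ord("é")), lookup(ord("д")), lookup(0x0100), lookup(0x4E2D))
```

The point of the bitmap is that the common "no mapping" case costs one shift and one mask, without consulting the larger table.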
Generate a local report with cargo llvm-cov (or an alias, if configured). Detailed guidance lives in docs/COVERAGE.md.
cargo llvm-cov --html
Upstream test harness
Beyond Rust & golden tests, a Python harness reuses the original upstream Unidecode test suite to assert behavioral parity.
Main file: tests/python/test_reference_suite.py
Characteristics:
- Dynamically loads the upstream base test class (via _reference/upstream_loader.py).
- Monkeypatches unidecode.unidecode to point to the Rust implementation (unidecode_rs.unidecode).
- Implements full errors= modes (ignore, replace, strict, preserve) for parity.
- Overrides surrogate tests with lean variants to avoid warning noise while maintaining assertions.
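To make the four errors= modes concrete, here is a toy sketch of how they are commonly understood to behave in upstream Unidecode: ignore drops unmapped characters, replace substitutes replace_str, preserve keeps the original character, and strict raises. The table, the exception class, and the treatment of None defaults (ignore and "?") are assumptions of this sketch, not the extension's implementation.

```python
TOY_TABLE = {"é": "e", "ö": "o"}  # toy mapping, not the real tables


class ToyTransliterationError(ValueError):
    """Hypothetical error type used only by this sketch."""


def toy_unidecode(string, errors=None, replace_str=None):
    # Assumed defaults: None behaves like 'ignore' and '?' respectively.
    errors = "ignore" if errors is None else errors
    replace_str = "?" if replace_str is None else replace_str
    out = []
    for index, ch in enumerate(string):
        if ord(ch) < 128:
            out.append(ch)              # ASCII passes through unchanged
        elif ch in TOY_TABLE:
            out.append(TOY_TABLE[ch])   # mapped character
        elif errors == "ignore":
            continue                    # drop unmapped character
        elif errors == "replace":
            out.append(replace_str)
        elif errors == "preserve":
            out.append(ch)              # keep the original character
        elif errors == "strict":
            raise ToyTransliterationError(f"no replacement for {ch!r} at index {index}")
        else:
            raise ValueError(f"unknown errors mode: {errors!r}")
    return "".join(out)


for mode in ("ignore", "replace", "preserve"):
    print(mode, repr(toy_unidecode("café ☃", errors=mode)))
```

Note that preserve is the one mode whose output may still contain non-ASCII characters.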
Run only this suite:
pytest -q tests/python/test_reference_suite.py
Expected (evolving) report:
14 passed, 2 xfailed, 4 xpassed # current example
xfail / xpass policy:
- A temporary xfail is removed once the feature is implemented; a former xfail that passes becomes a normal pass.
Parity roadmap:
- (Done) Implement errors= modes.
- Finalize surrogate handling parity (optional warning replication toggle).
- Extend tables to cover remaining mathematical alphanumeric symbols not yet mapped (e.g., script variants currently partial).
- Add multi‑corpus benchmarks (Latin, mixed CJK, emoji) for stable metrics.
- Provide exhaustive table diff script (block by block) with machine‑readable output.
Current limitations:
- Some mathematical script / stylistic letter ranges may still map to empty until table extension is complete.
- Generated table lines reported as unexecuted in coverage are data-only and carry little semantic value.
How to contribute:
- Add a targeted parity test (Rust or Python) reproducing a divergence.
- Extend the table or adjust logic.
- Run pytest tests/python/test_reference_suite.py and cargo test.
- Update this section if a batch of former gaps is closed.
Performance
A micro performance snapshot in golden_equivalence.rs::performance_snapshot runs 5 iterations on mixed‑script text vs Python. Numbers are indicative only; for robust measurement use Criterion benchmarks or larger corpora.
Repository layout
src/ # Core library sources + generated tables
benches/ # Criterion or std benches (Rust)
scripts/ # Developer helper scripts (bench_compare, coverage)
tests/ # Rust integration & golden tests
tests/python/ # Python parity & upstream harness
docs/ # Coverage and performance documentation
docs/PERFORMANCE_PLAN.md details next-step performance ideas.
Philosophy
- Fidelity: match Python before adding new rules.
- Safety: no panics for any valid Unicode scalar value.
- Performance: avoid unnecessary copies (ASCII fast path, heuristic pre‑allocation).
- Maintainability: generated code isolated, core logic compact.
Development / tests
cargo test
# (optional) fallback feature using deunicode
cargo test --features fallback-deunicode
Python tests (after building extension):
pytest tests/python
License
MIT. Tables derived from public data of the Python Unidecode project.
Acknowledgements
- Original Python project Unidecode
- Rust & PyO3 community