Provides pure ASCII transliterations of Unicode strings.

Python Fast Unidecode

This repo is a fork of the rust-unidecode repository that ports the original Rust implementation for use from Python. It also changes the source code to speed up the translation of ASCII characters, which brings this implementation on par with the Python unidecode implementation for that set of characters.

As a result, this package should give you the same output as the Python unidecode implementation. On average, based on the benchmark/speed_benchmark.py benchmark, it is much faster when translating non-ASCII characters (more than roughly 3x) and comparable to slightly slower (by a few percent) on ASCII characters. Results depend on caching and other factors; for non-ASCII characters the speedup sometimes exceeds 10x. The benchmarks were run on Python 3.13.
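To get a rough feel for this kind of comparison on your own machine, a timing helper could look like the sketch below. It is not the project's benchmark/speed_benchmark.py; the name `time_transliteration` is hypothetical, and the function under test is passed in as a parameter so that either package's `unidecode` can be measured.

```python
import timeit
from typing import Callable


def time_transliteration(func: Callable[[str], str], text: str, number: int = 1000) -> float:
    """Return the per-call time in seconds of func(text).

    The fastest of three runs is used to reduce noise from other processes.
    """
    best = min(timeit.repeat(lambda: func(text), number=number, repeat=3))
    return best / number


# Example use (assumes both packages are installed):
#   from fast_unidecode import unidecode as fast
#   from unidecode import unidecode as reference
#   text = "北亰 Æneid étude " * 100
#   print(time_transliteration(reference, text) / time_transliteration(fast, text))
```

Absolute numbers from a helper like this vary with input length, interpreter version, and cache state, so it is best used to compare ratios on representative text.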

License

This project is licensed under the MIT License.

Important Note: Unlike the original Python unidecode package, which is distributed under the GNU General Public License (GPL), fast-unidecode is released under the permissive MIT license. This makes it suitable for use in a wider range of projects, including commercial and closed-source applications. Distributing software that incorporates a GPL-licensed library can create an obligation to release your own source code under the GPL, a requirement that the MIT license does not have.

Benchmark code is not a part of the distributed package.

Installation

pip install fast_unidecode

Installation from source

First build the package with maturin, then install the resulting wheel with pip.

maturin build --release
pip install target/wheels/fast_unidecode...

Usage

>>> from fast_unidecode import unidecode

>>> print(unidecode("Æneid"))
AEneid

>>> print(unidecode("北亰"))
Bei Jing

rust-unidecode (Original README.md)

Documentation

The rust-unidecode library is a Rust port of Sean M. Burke's famous Text::Unidecode module for Perl. It transliterates Unicode strings such as "Æneid" into pure ASCII ones such as "AEneid". For a detailed explanation of the rationale behind such a library, refer to the documentation of the original module and to the article Burke wrote in 2001.

The data set used to translate the Unicode was ported directly from the Text::Unidecode module using a Perl script, so rust-unidecode should produce identical output.

Examples

extern crate unidecode;
use unidecode::unidecode;

fn main() {
    // unidecode() takes a &str and returns an ASCII-only String.
    assert_eq!(unidecode("Æneid"), "AEneid");
    assert_eq!(unidecode("étude"), "etude");
    assert_eq!(unidecode("北亰"), "Bei Jing");
    assert_eq!(unidecode("ᔕᓇᓇ"), "shanana");
    assert_eq!(unidecode("げんまい茶"), "genmaiCha ");
}

Guarantees and Warnings

Here are some guarantees you have when calling unidecode():

  • The String returned will be valid ASCII; the decimal representation of every char in the string will be between 0 and 127, inclusive.
  • Every ASCII character (0x0000 - 0x007F) is mapped to itself.
  • All Unicode characters will translate to a string containing newlines ("\n") or ASCII characters in the range 0x0020 - 0x007E. So for example, no Unicode character will translate to \u{01}. The exception is if the ASCII character itself is passed in, in which case it will be mapped to itself. (So '\u{01}' will be mapped to "\u{01}".)
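These guarantees can be spot-checked from Python. The sketch below assumes the Python binding inherits them from the Rust implementation; `find_guarantee_violations` and `ALLOWED` are hypothetical names, and the transliteration function is passed in as a parameter rather than imported.

```python
from typing import Callable, Iterable, List

# "\n" plus the printable ASCII range 0x20 - 0x7E.
ALLOWED = {"\n"} | {chr(c) for c in range(0x20, 0x7F)}


def find_guarantee_violations(
    transliterate: Callable[[str], str], chars: Iterable[str]
) -> List[str]:
    """Return the input characters whose output breaks a stated guarantee."""
    bad = []
    for ch in chars:
        out = transliterate(ch)
        if ord(ch) < 0x80:
            # Every ASCII character must map to itself.
            ok = out == ch
        else:
            # Non-ASCII input may only yield "\n" or printable ASCII
            # (an empty result is also accepted).
            ok = all(c in ALLOWED for c in out)
        if not ok:
            bad.append(ch)
    return bad


# Example use (assumes fast_unidecode is installed):
#   from fast_unidecode import unidecode
#   print(find_guarantee_violations(unidecode, ["a", "\x01", "é", "北"]))
```

An empty list means every sampled character behaved as described; any characters returned are worth reporting upstream.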

There are, however, some things you should keep in mind:

  • As stated, some transliterations do produce \n characters.
  • Some Unicode characters transliterate to an empty string, either on purpose or because rust-unidecode does not know about the character.
  • Some Unicode characters are unknown and transliterate to "[?]".
  • Many Unicode characters transliterate to multi-character strings. For example, 北 is transliterated as "Bei ".

This information was paraphrased from the original Text::Unidecode documentation.
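If the newlines, trailing spaces, and "[?]" markers mentioned above are unwanted, the output can be post-processed. The following is a minimal sketch; `to_clean_ascii` is a hypothetical helper and not part of either package, and the transliteration function is passed in as a parameter.

```python
import re
from typing import Callable


def to_clean_ascii(text: str, transliterate: Callable[[str], str], unknown: str = "[?]") -> str:
    """Transliterate text, then drop unknown markers and normalise whitespace."""
    out = transliterate(text)
    out = out.replace(unknown, "")  # remove the unknown-character marker
    out = re.sub(r"\s+", " ", out)  # collapse newlines and repeated spaces
    return out.strip()              # drop the trailing space left by e.g. "Bei "


# Example use (assumes fast_unidecode is installed):
#   from fast_unidecode import unidecode
#   print(to_clean_ascii("北亰", unidecode))
```

Note that this also removes any literal "[?]" already present in the input, so it suits identifiers and search keys better than text that must round-trip faithfully.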
