
Convert unicode characters that resemble ASCII to their equivalent.


uni2ascii

Convert unicode to closest ASCII equivalent.

Many Unicode glyphs look very similar to ASCII characters. This script converts those code points to their ASCII equivalents. For example, the string і lоѵе üńісοdе contains only 2 ASCII letters, even though everything except üń looks fine at first glance. To convert it to ASCII:

> echo і lоѵе üńісοdе | uni2ascii
i love unicode

By default, uni2ascii leaves untouched any non-ASCII characters it doesn't know about. This can be overridden with command-line arguments. Run uni2ascii -h for help.

You can also call it from Python:

from uni2ascii import uni2ascii

# Replace the lookalike characters that uni2ascii knows about with ASCII.
ascii_string = uni2ascii('і lоѵе üńісοdе')
print(ascii_string)

It's quite easy to add new transliterations by copying and pasting the offending strings into the code. See the function get_translits() in __init__.py. Feel free to contact me or open a pull request if you find useful ones that aren't there.
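To make the idea of a transliteration table concrete, here is a minimal sketch. It is not the package's get_translits(); the names EXAMPLE_TRANSLITS and apply_translits are hypothetical, and the real table may be organized differently. It only shows the general technique of mapping lookalike code points to ASCII.

```python
# Hypothetical illustration only: this is NOT the package's get_translits().
# Each key is a single lookalike character; each value is its ASCII stand-in.
EXAMPLE_TRANSLITS = {
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0475": "v",  # CYRILLIC SMALL LETTER IZHITSA
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def apply_translits(text, table=EXAMPLE_TRANSLITS):
    """Replace every character listed in `table` with its ASCII value."""
    return text.translate(str.maketrans(table))


print(apply_translits("\u0456 l\u043e\u0475\u0435"))
```

Characters that are not in the table pass through unchanged, which mirrors the default behaviour described above.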

Notes

uni2ascii was written to handle particular data we had on hand. There are plenty of missing transliterations. I'm happy to add new ones!
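If you want to find candidates for missing transliterations in your own data, a small helper along these lines may be useful. It is a sketch using only the standard library; report_non_ascii is a hypothetical name and is not part of uni2ascii.

```python
import unicodedata


def report_non_ascii(text):
    """Return (code point, Unicode name) pairs for each distinct non-ASCII character."""
    seen = {}
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            # Fall back to "UNKNOWN" for characters without an official name.
            seen[ch] = unicodedata.name(ch, "UNKNOWN")
    return [(f"U+{ord(ch):04X}", name) for ch, name in seen.items()]


for code_point, name in report_non_ascii("na\u00efve caf\u00e9"):
    print(code_point, name)
```

Running it on text that has already been passed through uni2ascii would list whatever characters were left untouched.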

Input encoding must be utf-8.
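If your data is in another encoding, one option is to re-encode it before handing it to uni2ascii. The sketch below assumes you already know the source encoding (Latin-1 is used purely as an example); to_utf8 is a hypothetical helper, not part of the package.

```python
def to_utf8(raw, source_encoding="latin-1"):
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# 0xFC is "u with diaeresis" in Latin-1.
print(to_utf8(b"\xfc"))
```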

Feel free to modify it. It is unlikely to work exactly as you need out of the box.

The code no longer works in Python 2. I added some Unicode normalization from unicodedata and haven't figured out how to make it work in Python 2 and Python 3 simultaneously.
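For readers unfamiliar with unicodedata, the following sketch shows one common normalization technique in Python 3: decompose with NFKD, then drop combining marks. This illustrates the standard library only; it is an assumption that uni2ascii does something similar, and strip_accents is a hypothetical name.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks that result."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# U+00FC (u with diaeresis) and U+0144 (n with acute) lose their accents.
print(strip_accents("\u00fc\u0144"))  # prints "un"
```

Normalization alone does not help with lookalikes such as Cyrillic letters, since they have no decomposition to ASCII; those need an explicit transliteration table.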

This was not designed to thwart homograph attacks, but rather to help with text normalization of English, where unicode sometimes sneaks in.

Install

pip install uni2ascii-janin

For the most up to date:

pip install git+https://github.com/ajanin/uni2ascii.git

Alternatives

The Python module unidecode

Very similar in spirit, but doesn't handle punctuation and makes some choices I disagree with.

iconv

If you call iconv -t //TRANSLIT, it does some of what uni2ascii does, but many conversions that are important to our application are missing.

Image processing approaches

A few projects actually look at the rendered pixels to determine whether two glyphs are too similar. I love the idea, but wanted more fine-grained control.
