
Detect confusable usage of Unicode homoglyphs and prevent homograph attacks.

Project description


"A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar." (Wikipedia: Homoglyph)

Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.

  • AlaskaJazz is single script: only Latin characters.

  • ΑlaskaJazz is mixed-script: the first character is a Greek letter.
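To see why the two usernames differ even though they render almost identically, it can help to print the Unicode name of each character. The sketch below uses only Python's standard unicodedata module; the helper name describe is made up for this illustration and is not part of confusable_homoglyphs.

```python
import unicodedata

def describe(text):
    """Return (character, Unicode name) pairs for every character in text."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text]

genuine = "AlaskaJazz"        # first character is U+0041
impostor = "\u0391laskaJazz"  # first character is U+0391

print(describe(genuine)[0])   # ('A', 'LATIN CAPITAL LETTER A')
print(describe(impostor)[0])  # ('Α', 'GREEK CAPITAL LETTER ALPHA')
print(genuine == impostor)    # False
```

A plain string comparison already tells the two names apart; the difficulty is that a human reader cannot.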

You might also want to avoid people being tricked into entering their password on www.microsоft.com or www.faϲebook.com instead of www.microsoft.com or www.facebook.com. Here is a utility to play with these confusable homoglyphs.

Not all mixed-script strings have to be ruled out, though. You could exclude only those mixed-script strings that contain characters which might be confused with a character from Unicode blocks of your choosing.

  • Allo and ρττ are fine: single script.

  • Alloτ is fine: mixed script, but τ is not confusable.

  • Alloρ is dangerous: mixed script and ρ could be confused with p.

Documentation

confusables

from confusable_homoglyphs import confusables

confusables.is_mixed_script

confusables.is_mixed_script(unicode_string)

Boolean: True if unicode_string is mixed-script, i.e. it contains characters from more than one script.
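The idea behind a mixed-script check can be approximated with the standard library alone. The sketch below is not the library's implementation: it assumes, purely for illustration, that the first word of a character's Unicode name (LATIN, GREEK, CYRILLIC, ...) is a good enough stand-in for its script, and the names script_guess and looks_mixed_script are hypothetical.

```python
import unicodedata

def script_guess(ch):
    """Guess a script alias from the first word of the character's Unicode name.

    Assumption: this shortcut is for illustration only; it is not how
    confusable_homoglyphs assigns aliases.
    """
    if not ch.isalpha():
        return "COMMON"  # digits, punctuation, spaces: treated as script-neutral
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def looks_mixed_script(text):
    """True if the letters in text come from more than one guessed script."""
    scripts = {script_guess(ch) for ch in text} - {"COMMON"}
    return len(scripts) > 1

print(looks_mixed_script("AlaskaJazz"))          # False
print(looks_mixed_script("\u0391laskaJazz"))     # True
print(looks_mixed_script("\u03c1\u03c4\u03c4"))  # False (all Greek)
```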

confusables.is_confusable

confusables.is_confusable(unicode_string, greedy=False, preferred_aliases=[])

Takes a character or string and returns each of its characters that appears in Unicode’s list of confusable characters.

If greedy=False, it returns only the first confusable character found, without looking at the rest of the string. With greedy=True, it returns all of them.

preferred_aliases=[] can take a list of Unicode block aliases to be considered as your ‘base’ Unicode blocks:

  • Consider paρa:

    • with preferred_aliases=['latin'], the 3rd character ρ would be returned because this Greek letter can be confused with Latin p.

    • with preferred_aliases=['greek'], the 1st character p would be returned because this Latin letter can be confused with Greek ρ.

    • with preferred_aliases=[] and greedy=True, you’ll discover the 29 characters that can be confused with p, the 23 characters that look like a, and the one that looks like ρ (which is, of course, p aka LATIN SMALL LETTER P).
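The interplay of greedy and preferred_aliases described above can be modelled with a toy lookup table. This is a minimal sketch under loud assumptions: SAMPLE is a tiny hand-written table (the real data lists far more look-alikes), find_confusables is a hypothetical stand-in rather than the library function, and the shape of the returned dictionaries is chosen only for this illustration.

```python
# Hypothetical sample data: character -> (script alias, look-alike characters).
SAMPLE = {
    "p": ("LATIN", ["\u03c1"]),        # Greek rho
    "a": ("LATIN", ["\u0430"]),        # Cyrillic a
    "\u03c1": ("GREEK", ["p"]),
    "\u0430": ("CYRILLIC", ["a"]),
}

def find_confusables(text, greedy=False, preferred_aliases=()):
    """Toy model: report characters that have look-alikes in the sample table."""
    preferred = {alias.upper() for alias in preferred_aliases}
    found, seen = [], set()
    for ch in text:
        if ch in seen or ch not in SAMPLE:
            continue
        seen.add(ch)
        alias, lookalikes = SAMPLE[ch]
        if alias in preferred:
            continue  # the character belongs to a 'base' block: trusted
        if preferred:
            # keep only look-alikes that live in one of the 'base' blocks
            lookalikes = [g for g in lookalikes if SAMPLE[g][0] in preferred]
        if lookalikes:
            found.append({"character": ch, "alias": alias, "homoglyphs": lookalikes})
            if not greedy:
                break  # non-greedy: stop at the first hit
    return found

text = "pa\u03c1a"  # paρa
print([r["character"] for r in find_confusables(text, preferred_aliases=["latin"])])  # ['ρ']
print([r["character"] for r in find_confusables(text, preferred_aliases=["greek"])])  # ['p']
print([r["character"] for r in find_confusables(text, greedy=True)])                  # ['p', 'a', 'ρ']
```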

confusables.is_dangerous

confusables.is_dangerous(unicode_string, preferred_aliases=[])

Boolean: True if is_mixed_script(unicode_string) and is_confusable(unicode_string).

The preferred_aliases argument is simply passed to is_confusable.
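The "mixed-script and confusable" rule can be sketched as a composition of two small checks. The code below is an assumption-laden toy, not the library's code: LOOKALIKES contains a single hand-picked entry, scripts are guessed from the first word of the Unicode character name, and looks_dangerous with its preferred parameter is a hypothetical name for this illustration.

```python
import unicodedata

# Hypothetical sample: Greek rho resembles Latin p. Tau is deliberately absent.
LOOKALIKES = {"\u03c1": "p"}

def script_of(ch):
    """Guess the script from the first word of the Unicode name (illustrative)."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def looks_dangerous(text, preferred="LATIN"):
    """True if text mixes scripts AND has a look-alike from outside `preferred`."""
    scripts = {script_of(ch) for ch in text if ch.isalpha()}
    mixed = len(scripts) > 1
    confusable = any(
        ch in LOOKALIKES and script_of(ch) != preferred for ch in text
    )
    return mixed and confusable

print(looks_dangerous("Allo"))                # False: single script
print(looks_dangerous("\u03c1\u03c4\u03c4"))  # False: single script (Greek)
print(looks_dangerous("Allo\u03c4"))          # False: mixed, but tau is not in the table
print(looks_dangerous("Allo\u03c1"))          # True: mixed, and rho resembles p
```

Requiring both conditions keeps legitimate single-script names, such as an all-Greek username, from being flagged.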

Is the data up to date?

Yep.

The Unicode block aliases and names for each character are extracted from this file provided by the Unicode Consortium.

The matrix of which characters can be confused with which other characters is built using this file provided by the Unicode Consortium.

This data is stored in two JSON files: categories.json and confusables.json. If you delete them, both will be recreated by downloading and parsing the two files mentioned above, and the results will be stored as JSON again.
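The "parse once, cache as JSON, rebuild if the cache is missing" pattern can be sketched as follows. Everything here is an assumption made for illustration: the sample lines are hand-written in a "source ; target ; type # comment" layout instead of being downloaded, and parse_lines, load_or_rebuild and the cache file name are hypothetical, not the library's internals.

```python
import json
import os
import tempfile

# Hand-written sample lines (assumed layout: "source ; target ; type # comment").
SAMPLE_LINES = [
    "# comment lines and blank lines are skipped",
    "03C1 ;\t0070 ;\tMA\t# GREEK SMALL LETTER RHO -> LATIN SMALL LETTER P",
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
]

def parse_lines(lines):
    """Build a symmetric look-alike table from 'source ; target' lines."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        source, target = [field.strip() for field in data.split(";")[:2]]
        src = "".join(chr(int(cp, 16)) for cp in source.split())
        dst = "".join(chr(int(cp, 16)) for cp in target.split())
        table.setdefault(src, []).append(dst)
        table.setdefault(dst, []).append(src)
    return table

def load_or_rebuild(path, lines):
    """Load the JSON cache if it exists; otherwise parse the lines and write it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    table = parse_lines(lines)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(table, handle, ensure_ascii=False)
    return table

cache = os.path.join(tempfile.mkdtemp(), "confusables_sample.json")
first = load_or_rebuild(cache, SAMPLE_LINES)  # cache missing: parse and write
second = load_or_rebuild(cache, [])           # cache present: read from JSON
print(first == second)                        # True
print(first["\u03c1"])                        # ['p']
```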


