
Python implementation of eudex, a fast phonetic reduction/hashing algorithm

Project description

This is the Python port of ticki/eudex. You can install it via:

pip install eudex

Eudex ([juːˈdɛks]) is a phonetic reduction/hashing algorithm that provides locality-sensitive “hashes” of words, based on their spelling and pronunciation.

It is derived from the classification of the pulmonic consonants.

Eudex is about two orders of magnitude faster than Soundex, and several orders of magnitude faster than Levenshtein distance, making it feasible to run on large sets of strings in a very short time.

Example

>>> from eudex import eudex
>>> eudex('Jesus'), eudex('Yesus')
(216172782115094804, 16429131440648880404)  # the base-10 values look very different
>>> bin(eudex('Jesus') ^ eudex('Yesus')).count('1')  # number of set bits after XORing the hashes
6  # very low Hamming distance, so the words sound similar

Features

  • High quality locality-sensitive hashing based on pronunciation.

  • Works with, but not limited to, English, Catalan, German, Spanish, Italian, and Swedish.

  • Sophisticated phonetic mapping.

  • Better quality than Soundex.

  • Takes non-English letters into account.

  • Extremely fast.

  • Vowel sensitive.

FAQ

Why aren’t Rupert and Robert mapped to the same value, like in Soundex?

Eudex is not a phonetic classifier; it is a phonetic hasher. It maps words in a manner that exposes the difference between them.

The results seem completely random. What is wrong?

You are probably assuming that the hashes of similar-sounding words are numerically close to each other, which they are not. Instead, their Hamming distance (i.e., XOR the values and count the set bits) will be low.
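The comparison described above can be sketched as a small helper. This is a minimal illustration, not part of the package: the name `hamming_distance` is hypothetical, and the two integers are the hash values shown in the Example section for 'Jesus' and 'Yesus'.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two integer hashes.

    Hypothetical helper for illustration; it is not provided by the
    eudex package itself.
    """
    # XOR leaves a 1 in every bit position where the two hashes differ.
    return bin(hash_a ^ hash_b).count("1")


# Hash values copied from the Example section of this README.
jesus = 216172782115094804
yesus = 16429131440648880404

# The decimal values are far apart, but only a few bits differ.
print(hamming_distance(jesus, yesus))
```

With the package installed, the same helper could presumably be called as `hamming_distance(eudex('Jesus'), eudex('Yesus'))`.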

Does it support non-English letters?

Yes, it supports all the C1 letters (e.g., ü, ö, æ, ß, é and so on), and it takes their respective sounds into account.

Is it English-only?

No, it works on most European languages as well. However, it is limited to the Latin alphabet.

Does it take digraphs into account?

The table is designed to encapsulate digraphs as well, though there is no separate table for these (like in Metaphone).

Does it replace Levenshtein?

It is not a general replacement for Levenshtein distance, but it can replace it in certain use cases, e.g., searching for spell-check suggestions.
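As a rough sketch of the spell-check use case, candidates can be ranked by the Hamming distance between their hashes and the hash of the query word. The function name `rank_by_hash_distance` and its `hash_fn` parameter are assumptions made for this illustration. In practice you would pass `eudex` as `hash_fn`. A plain bit count is used here, which may not weight differences the way a dedicated phonetic distance would.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_hash_distance(
    word: str,
    candidates: Iterable[str],
    hash_fn: Callable[[str], int],
) -> List[Tuple[int, str]]:
    """Return (distance, candidate) pairs, closest candidates first.

    Hypothetical helper: hash_fn is any function mapping a word to an
    integer hash, for example eudex.eudex.
    """
    target = hash_fn(word)
    # Distance is the number of differing bits between the two hashes.
    scored = [(bin(target ^ hash_fn(c)).count("1"), c) for c in candidates]
    # Tuples sort by distance first, then alphabetically on ties.
    return sorted(scored)
```

A call might look like `rank_by_hash_distance('Jesus', word_list, eudex)`, where `word_list` is your own dictionary of candidate words. Keeping only the first few results gives a suggestion list without computing an edit distance for every pair.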

What languages is it tested for?

It is tested on the English, Catalan, German, Spanish, Swedish, and Italian dictionaries, and has been confirmed to have decent to good quality on all of them.

Implementations

How does Eudex work?

See how_it_works.md.


Credits: This README is based on the [ticki/eudex](https://github.com/ticki/eudex) README.

