Skip to main content

A edit distance edition based on graphs.

Project description

graph-edit-distance

A set of edit distance edition methods based on graphs. These methods allow to calculate the edition cost of an entity (for example a word), among a big quantity of terms with less computational cost than the usual methods.

At the moment, only a normal Levenshtein algorithm is applied, but it is very easy to add new algorithms thanks to the project structure.

How to use

Although you can use all kind of sequences of objects, which are hashable and comparable, with the class Graph, we are going to use the class TextGraph specially suited for text entities. You can create an edit distance graph with this command:

from grapheditdistance import TextGraph

g = TextGraph()

And next to add entities:

# Adding entities individually
g.add('hi')
g.add('hello')
# Adding a sequence of entities
g.index(['hola', 'adiós', 'goodbye', 'punto de venta', 'puerta'])

Finally, you can calculate the edition distance of a new word against all those previously added terms:

# Search the term with spelling mistakes "pumto de ventas"
results = g.seq_search('pumto de ventas', threshold=0.8, nbest=0)
# It should return
[(
    'pumto de ventas',
    'punto de venta',
    2.0,
    '[(None), (None), (replace[m -> n], 1), (None), (None), (None), (None), (None), \
    (None), (None), (None), (None), (None), (None), (insert[s], 1), (Final)]'
)]

Where the first element of the tuple is the original entity, the second the found one, the third the edition distance weight, and the last the list of edition operation applied to change the original query for obtaining the previously indexed one. This method will only return the entities which edition distance is less than the given threshold of 0.8 respect to the length of the original entity. That means, if the original entity has 15 character, the maximum number of errors is 3 (len(entity) * (1 - threshold)).

At the moment, the method search() is not implemented yet. this algorith

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

grapheditdistance-0.0.2.tar.gz (22.2 kB view details)

Uploaded Source

File details

Details for the file grapheditdistance-0.0.2.tar.gz.

File metadata

  • Download URL: grapheditdistance-0.0.2.tar.gz
  • Upload date:
  • Size: 22.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.8.3

File hashes

Hashes for grapheditdistance-0.0.2.tar.gz
Algorithm Hash digest
SHA256 b5e587af7c9225be432f7eba33061b75112511bf499158266e47499acb0af7f4
MD5 67d8c8e05379764ed1e830fe5f70d4ad
BLAKE2b-256 9f05c46a4fb8b234e15dc8bf4da05294073f17fe5b12681ecd83c021b2e103fa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page