Skip to main content

A edit distance edition based on graphs.

Project description

graph-edit-distance

A set of edit distance edition methods based on graphs. These methods allow to calculate the edition cost of an entity (for example a word), among a big quantity of terms with less computational cost than the usual methods.

At the moment, only a normal Levenshtein algorithm is applied, but it is very easy to add new algorithms thanks to the project structure.

How to use

Although you can use all kind of sequences of objects, which are hashable and comparable, with the class Graph, we are going to use the class TextGraph specially suited for text entities. You can create an edit distance graph with this command:

from grapheditdistance import TextGraph

g = TextGraph()

And next to add entities:

# Adding entities individually
g.add('hi')
g.add('hello')
# Adding a sequence of entities
g.index(['hola', 'adiós', 'goodbye', 'punto de venta', 'puerta'])

Finally, you can calculate the edition distance of a new word against all those previously added terms:

# Search the term with spelling mistakes "pumto de ventas"
results = g.seq_search('pumto de ventas', threshold=0.8, nbest=0)
# It should return
[(
    'pumto de ventas',
    'punto de venta',
    2.0,
    '[(None), (None), (replace[m -> n], 1), (None), (None), (None), (None), (None), \
    (None), (None), (None), (None), (None), (None), (insert[s], 1), (Final)]'
)]

Where the first element of the tuple is the original entity, the second the found one, the third the edition distance weight, and the last the list of edition operation applied to change the original query for obtaining the previously indexed one. This method will only return the entities which edition distance is less than the given threshold of 0.8 respect to the length of the original entity. That means, if the original entity has 15 character, the maximum number of errors is 3 (len(entity) * (1 - threshold)).

At the moment, the method search() is not implemented yet. this algorith

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

grapheditdistance-0.0.4.tar.gz (22.5 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page