Skip to main content

No project description provided

Project description

FuzzDex

FuzzDex is a fast Python library, written in Rust. It implements an in-memory fuzzy index that works like an error-tolerant dictionary keyed by a human input.

Algorithm

You load into fuzzdex series of short phrases - like street names consisting of one or multiple words, with an numerical index that identifies this street names in related dictionary of cities.

Then, you can query the index using a must-token (currently only one, but could be altered to use more) and additional should-tokens to read a total of limit of possibly matching phrases.

Must-token is trigramized (warszawa -> war ars rsz sza zaw awa) and all phrases containing given trigrams are initially read from the index. Trigrams have scores, the more common they are, the less they increase the phrase score. Trigrams of should-tokens additionally alter the score (positively when they match), but don't add additional phrases from index. Phrases are then sorted by score.

Top phrases are filtered to contain an optional constraint and the must-token with a maximal editing distance (Levenshtein) until limit of phrases is gathered.

Internally, the results of a must-token search are LRU cached as in practise it's pretty often repeated. Should-tokens vary and they are always recalculated.

Usecases

It was designed to match parts of a user supplied physical addresses to a data extracted from the OpenStreet map - in order to find streets and cities.

Address is first tokenized and then it's parts are matched against fuzzy dictionary of cities and streets. Additional constraints can limit the matched streets only to given city - or finding cities that have a given street.

Data is first searched for using trigrams (warszawa -> war ars rsz sza zaw awa), and then additionally filtered using maximal Levenshtein editing distance.

Original solution used fuzzy query of the Elasticsearch database, which worked - but was 21x slower in our tests.

Example

import fuzzdex
# Create two fuzzy indices with cities and streets.
cities = fuzzdex.FuzzDex()
# Warsaw has streets: Czerniakowska, Nowy Świat and Wawelska
cities.add_phrase("Warsaw", 1, constraints={1, 2, 3})
# Wrocław only Czerniawska
cities.add_phrase("Wrocław", 2, constraints={4})

streets = fuzzdex.FuzzDex()
# Streets with reversed constraints and own indices:
streets.add_phrase("Czerniakowska", 1, constraints={1})
streets.add_phrase("Nowy Świat", 2, constraints={1})
streets.add_phrase("Wawelska", 3, constraints={1})

streets.add_phrase("Czerniawska", 4, constraints={2})

# This recalculates trigram scores and makes index immutable:
cities.finish()
streets.finish()

# warszawa matches warsaw at editing distance 2.
cities.search("warszawa", [], max_distance=2, limit=60)
#    [{'origin': 'Warsaw', 'index': 1, 'token': 'warsaw',
#      'distance': 2, 'score': 200000.0, 'should_score': 0.0}]

# `świat` adds additional should score to the result and places it higher
# in case the limit is set:
streets.search("nowy", ["świat"], max_distance=2, constraint=1)
#    [{'origin': 'Nowy Świat', 'index': 2, 'token': 'nowy',
#      'distance': 0, 'score': 5.999, 'should_score': 7.4999}]

# Won't match with constraint 2.
streets.search("nowy", ["świat"], constraint=2)
#    []

# Quering for `czerniawska` will return `czerniakowska` (no constraints),
# but with a lower score and higher distance:
In [22]: streets.search("czerniawska", [], max_distance=2)
Out[22]:
#  [{'origin': 'Czerniawska', 'index': 4, 'token': 'czerniawska',
#   'distance': 0, 'score': 9.49995231628418, 'should_score': 0.0},
#  {'origin': 'Czerniakowska', 'index': 1, 'token': 'czerniakowska',
#   'distance': 2, 'score': 6.4999680519104, 'should_score': 0.0}]

Installation, development

You can install fuzzdex from PyPI when using one of the architectures it's published for (x86_64, few Python versions).

pip3 install fuzzdex

Or use maturin to build it locally:

pipenv install --dev
pipenv shell
maturin develop -r
pytest

You can also use cargo and copy or link the .so file directly (rename libfuzzdex.so to fuzzdex.so):

cargo build --release
ln -s target/release/libfuzzdex.so fuzzdex.so

build.sh has commands for building manylinux packages for PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

fuzzdex-0.6.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.5+ x86-64

File details

Details for the file fuzzdex-0.6.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 046a009a856cc02a4906830df66500ba2ec929c6271cd4130e9264e7bd99970b
MD5 89427ffafe9f9b26186994d15faf251e
BLAKE2b-256 bd6ce64d6d6f8338bb785c8fac13bc8b417fa219129acd479ce2734e9b2c1b2f

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 e14038a6907a3c4df83d77dcd50da67fd4782fd05a33275debddf26f16f3783c
MD5 8b6b55b3468501e83e3bd1417b53bead
BLAKE2b-256 9f68bf4be8aba9afb3b3d8c17f3bde7fe0a8014d11d264358a01a6fb9d8788d3

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 069604446d4b689e38ca67f3d3af5bf7d6ba8a0095fe1bd1d386e63799887382
MD5 886b44523be4c5c0d403b02543a47d50
BLAKE2b-256 947cb697473071cf793544f513a39074344e63f68b46648136f7121dfcd93a7f

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 2eae37bc77255585c5ca2fafd89d1f7d6350e8cd59446e4fa854d97741b2d868
MD5 6f97eac9681acca20ea1e9adcc351f34
BLAKE2b-256 18e6e7e09999b891df5b2e6e9c0198db4e09310d20c4b5023aef0084e456daf9

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 2d729b2a9059063eaec0093494b29a86b9becd143aecbc94b935f65428cb7df4
MD5 215170d19622dbe04107ca261821ec86
BLAKE2b-256 99892eba10eb41f63a6d599dc9acb6af9b14dd53047238fc172ef419517df688

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page