Skip to main content

No project description provided

Project description

FuzzDex

FuzzDex is a fast Python library, written in Rust. It implements an in-memory fuzzy index that works like an error-tolerant dictionary keyed by a human input.

Algorithm

You load into fuzzdex series of short phrases - like street names consisting of one or multiple words, with an numerical index that identifies this street names in related dictionary of cities.

Then, you can query the index using a must-token (currently only one, but could be altered to use more) and additional should-tokens to read a total of limit of possibly matching phrases.

Must-token is trigramized (warszawa -> war ars rsz sza zaw awa) and all phrases containing given trigrams are initially read from the index. Trigrams have scores, the more common they are, the less they increase the phrase score. Trigrams of should-tokens additionally alter the score (positively when they match), but don't add additional phrases from index. Phrases are then sorted by score.

Top phrases are filtered to contain an optional constraint and the must-token with a maximal editing distance (Levenshtein) until limit of phrases is gathered.

Internally, the results of a must-token search are LRU cached as in practise it's pretty often repeated. Should-tokens vary and they are always recalculated.

Usecases

It was designed to match parts of a user supplied physical addresses to a data extracted from the OpenStreet map - in order to find streets and cities.

Address is first tokenized and then it's parts are matched against fuzzy dictionary of cities and streets. Additional constraints can limit the matched streets only to given city - or finding cities that have a given street.

Data is first searched for using trigrams (warszawa -> war ars rsz sza zaw awa), and then additionally filtered using maximal Levenshtein editing distance.

Original solution used fuzzy query of the Elasticsearch database, which worked - but was 21x slower in our tests.

Example

import fuzzdex
# Create two fuzzy indices with cities and streets.
cities = fuzzdex.FuzzDex()
# Warsaw has streets: Czerniakowska, Nowy Świat and Wawelska
cities.add_phrase("Warsaw", 1, constraints={1, 2, 3})
# Wrocław only Czerniawska
cities.add_phrase("Wrocław", 2, constraints={4})

streets = fuzzdex.FuzzDex()
# Streets with reversed constraints and own indices:
streets.add_phrase("Czerniakowska", 1, constraints={1})
streets.add_phrase("Nowy Świat", 2, constraints={1})
streets.add_phrase("Wawelska", 3, constraints={1})

streets.add_phrase("Czerniawska", 4, constraints={2})

# This recalculates trigram scores and makes index immutable:
cities.finish()
streets.finish()

# warszawa matches warsaw at editing distance 2.
cities.search(["warszawa"], [], max_distance=2, limit=60)
#    [{'origin': 'Warsaw', 'index': 1, 'token': 'warsaw',
#      'distance': 2, 'score': 200000.0, 'should_score': 0.0}]
#
# NOTE: Currently only a single `must` token is supported.
#
# `świat` adds additional should score to the result and places it higher
# in case the limit is set:
streets.search(["nowy"], ["świat"], max_distance=2, constraint=1)
#    [{'origin': 'Nowy Świat', 'index': 2, 'token': 'nowy',
#      'distance': 0, 'score': 5.999, 'should_score': 7.4999}]

# Won't match with constraint 2.
streets.search(["nowy"], ["świat"], constraint=2)
#    []

# Quering for `czerniawska` will return `czerniakowska` (no constraints),
# but with a lower score and higher distance:
In [22]: streets.search(["czerniawska"], [], max_distance=2)
Out[22]:
#  [{'origin': 'Czerniawska', 'index': 4, 'token': 'czerniawska',
#   'distance': 0, 'score': 9.49995231628418, 'should_score': 0.0},
#  {'origin': 'Czerniakowska', 'index': 1, 'token': 'czerniakowska',
#   'distance': 2, 'score': 6.4999680519104, 'should_score': 0.0}]

Installation, development

You can install fuzzdex from PyPI when using one of the architectures it's published for (x86_64, few Python versions).

pip3 install fuzzdex

Or use maturin to build it locally:

pipenv install --dev
pipenv shell
maturin develop -r
pytest

You can also use cargo and copy or link the .so file directly (rename libfuzzdex.so to fuzzdex.so):

cargo build --release
ln -s target/release/libfuzzdex.so fuzzdex.so

build.sh has commands for building manylinux packages for PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

fuzzdex-1.2.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.5+ x86-64

fuzzdex-1.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.5+ x86-64

fuzzdex-1.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.5+ x86-64

fuzzdex-1.2.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.5+ x86-64

fuzzdex-1.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.5+ x86-64

File details

Details for the file fuzzdex-1.2.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.2.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 3af5e446f5747244939e843d5e7c4baa4d5774ffcb9f84d2361b6ca279e386e5
MD5 42da06b35a6ee228f702247c83dc5b59
BLAKE2b-256 c69c4fba37487988b39b2b4e0f4ab618bbb03662bfa45e26590b0f27b85cd0c2

See more details on using hashes here.

File details

Details for the file fuzzdex-1.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 d7ad5bf11f6b164f8d47840a92c993070cbfa5e2d0cfb655c4a4853c26546313
MD5 2a36b886ac82cca76d92d55c353c5663
BLAKE2b-256 eb21edce86bbadfd15cbe6fc1955dec9bd8f14e06fb7f27fe32a5c5bbadd306c

See more details on using hashes here.

File details

Details for the file fuzzdex-1.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 4222fdf8a790247a998ac3080d473e4ba90b7b5f3f53b1f67915c2cd24f51f29
MD5 13ab1ead0918fdb9039f2be198a10d2f
BLAKE2b-256 1aa13b16b7374435a5b1d9a4544136156392d1b6c6e9553151bc0df1425eaec9

See more details on using hashes here.

File details

Details for the file fuzzdex-1.2.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.2.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 b38ebba889a7df4c5b8fb55760e56e9c787bd2050158f8b217080e2490f13476
MD5 43dfba837e2d14f1bdd6c31f27c5f8b7
BLAKE2b-256 0660e181b66083c1bdde1c80c9f871fcf8dee7d4e0722a921c5dec915f066e65

See more details on using hashes here.

File details

Details for the file fuzzdex-1.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 1994b1ae47a95f27b281cd769048c753b31ab947b41c0ca8a7664c42e0c4efd7
MD5 d9d00939135a8b3dae99443dcbbbf08a
BLAKE2b-256 d0cda8bf3dfaa3c8815cb12a1a9db9659c239689267acb2630dda521e2c92c4c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page