Skip to main content

No project description provided

Project description

FuzzDex

FuzzDex is a fast Python library, written in Rust. It implements an in-memory fuzzy index that works like an error-tolerant dictionary keyed by a human input.

Algorithm

You load into fuzzdex series of short phrases - like street names consisting of one or multiple words, with an numerical index that identifies this street names in related dictionary of cities.

Then, you can query the index using a must-token (currently only one, but could be altered to use more) and additional should-tokens to read a total of limit of possibly matching phrases.

Must-token is trigramized (warszawa -> war ars rsz sza zaw awa) and all phrases containing given trigrams are initially read from the index. Trigrams have scores, the more common they are, the less they increase the phrase score. Trigrams of should-tokens additionally alter the score (positively when they match), but don't add additional phrases from index. Phrases are then sorted by score.

Top phrases are filtered to contain an optional constraint and the must-token with a maximal editing distance (Levenshtein) until limit of phrases is gathered.

Internally, the results of a must-token search are LRU cached as in practise it's pretty often repeated. Should-tokens vary and they are always recalculated.

Usecases

It was designed to match parts of a user supplied physical addresses to a data extracted from the OpenStreet map - in order to find streets and cities.

Address is first tokenized and then it's parts are matched against fuzzy dictionary of cities and streets. Additional constraints can limit the matched streets only to given city - or finding cities that have a given street.

Data is first searched for using trigrams (warszawa -> war ars rsz sza zaw awa), and then additionally filtered using maximal Levenshtein editing distance.

Original solution used fuzzy query of the Elasticsearch database, which worked - but was 21x slower in our tests.

Example

import fuzzdex
# Create two fuzzy indices with cities and streets.
cities = fuzzdex.FuzzDex()
# Warsaw has streets: Czerniakowska, Nowy Świat and Wawelska
cities.add_phrase("Warsaw", 1, constraints={1, 2, 3})
# Wrocław only Czerniawska
cities.add_phrase("Wrocław", 2, constraints={4})

streets = fuzzdex.FuzzDex()
# Streets with reversed constraints and own indices:
streets.add_phrase("Czerniakowska", 1, constraints={1})
streets.add_phrase("Nowy Świat", 2, constraints={1})
streets.add_phrase("Wawelska", 3, constraints={1})

streets.add_phrase("Czerniawska", 4, constraints={2})

# This recalculates trigram scores and makes index immutable:
cities.finish()
streets.finish()

# warszawa matches warsaw at editing distance 2.
cities.search("warszawa", [], max_distance=2, limit=60)
#    [{'origin': 'Warsaw', 'index': 1, 'token': 'warsaw',
#      'distance': 2, 'score': 200000.0, 'should_score': 0.0}]

# `świat` adds additional should score to the result and places it higher
# in case the limit is set:
streets.search("nowy", ["świat"], max_distance=2, constraint=1)
#    [{'origin': 'Nowy Świat', 'index': 2, 'token': 'nowy',
#      'distance': 0, 'score': 5.999, 'should_score': 7.4999}]

# Won't match with constraint 2.
streets.search("nowy", ["świat"], constraint=2)
#    []

# Quering for `czerniawska` will return `czerniakowska` (no constraints),
# but with a lower score and higher distance:
In [22]: streets.search("czerniawska", [], max_distance=2)
Out[22]:
#  [{'origin': 'Czerniawska', 'index': 4, 'token': 'czerniawska',
#   'distance': 0, 'score': 9.49995231628418, 'should_score': 0.0},
#  {'origin': 'Czerniakowska', 'index': 1, 'token': 'czerniakowska',
#   'distance': 2, 'score': 6.4999680519104, 'should_score': 0.0}]

Installation, development

You can install fuzzdex from PyPI when using one of the architectures it's published for (x86_64, few Python versions).

pip3 install fuzzdex

Or use maturin to build it locally:

pipenv install --dev
pipenv shell
maturin develop -r
pytest

You can also use cargo and copy or link the .so file directly (rename libfuzzdex.so to fuzzdex.so):

cargo build --release
ln -s target/release/libfuzzdex.so fuzzdex.so

build.sh has commands for building manylinux packages for PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

fuzzdex-0.6.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.5+ x86-64

fuzzdex-0.6.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.5+ x86-64

File details

Details for the file fuzzdex-0.6.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 107b36b59789134571ddbbf8e4aa9b3ede0f990cc335207ff2fa040580f824be
MD5 c539d099f7cfc3e15ce9c9cca8035a9a
BLAKE2b-256 344ec8e097e3e794884b8864b058b4b479a3deee05fed26fabb2e18e3320ebdd

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 bed933b18ebd12534133cb3c1202c6243c667586ff382e3ea5f54f421695a00b
MD5 70f92c545f98ccf83838ca3696be33db
BLAKE2b-256 fe027ff10a1bfdc98abef9b306a8b4fd8eac93038a5c6dc4b4dc06e87329a30f

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 43971ccdd30e8c4e95d1f9f23697095dff65ebd921bce6445d428189724f0478
MD5 bc796cbb21c06776d08c12174459b248
BLAKE2b-256 e77a1f93de8916659c711b7a66cf65b4e3d318e20713ebf6dc4c3d058f893937

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 2fb2e8fd8ff8e31dfc6e05fd0432ae3eb1ddd819c826781ec22fe4bc88840575
MD5 0576774273456c8100549d03b8e5d7f4
BLAKE2b-256 77d2c8c89a494c0c4c53a917fa49cff70bd7cc1a5db956348dccffdcef69199d

See more details on using hashes here.

File details

Details for the file fuzzdex-0.6.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.6.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 24e018254e740af46a86a01907042f2fe45c395481ef4269be0ded3d0dc80296
MD5 0a152b949bf7e3e4d373ff0404280a46
BLAKE2b-256 038dfd07128d9e1862c59a206bd58f8afc04d2efba849df1b7e3585651efafed

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page