Skip to main content

No project description provided

Project description

FuzzDex

FuzzDex is a fast Python library, written in Rust. It implements an in-memory fuzzy index that works like an error-tolerant dictionary keyed by a human input.

Algorithm

You load into fuzzdex series of short phrases - like street names consisting of one or multiple words, with an numerical index that identifies this street names in related dictionary of cities.

Then, you can query the index using a must-token (currently only one, but could be altered to use more) and additional should-tokens to read a total of limit of possibly matching phrases.

Must-token is trigramized (warszawa -> war ars rsz sza zaw awa) and all phrases containing given trigrams are initially read from the index. Trigrams have scores, the more common they are, the less they increase the phrase score. Trigrams of should-tokens additionally alter the score (positively when they match), but don't add additional phrases from index. Phrases are then sorted by score.

Top phrases are filtered to contain an optional constraint and the must-token with a maximal editing distance (Levenshtein) until limit of phrases is gathered.

Internally, the results of a must-token search are LRU cached as in practise it's pretty often repeated. Should-tokens vary and they are always recalculated.

Usecases

It was designed to match parts of a user supplied physical addresses to a data extracted from the OpenStreet map - in order to find streets and cities.

Address is first tokenized and then it's parts are matched against fuzzy dictionary of cities and streets. Additional constraints can limit the matched streets only to given city - or finding cities that have a given street.

Data is first searched for using trigrams (warszawa -> war ars rsz sza zaw awa), and then additionally filtered using maximal Levenshtein editing distance.

Original solution used fuzzy query of the Elasticsearch database, which worked - but was 21x slower in our tests.

Example

import fuzzdex
# Create two fuzzy indices with cities and streets.
cities = fuzzdex.FuzzDex()
# Warsaw has streets: Czerniakowska, Nowy Świat and Wawelska
cities.add_phrase("Warsaw", 1, constraints={1, 2, 3})
# Wrocław only Czerniawska
cities.add_phrase("Wrocław", 2, constraints={4})

streets = fuzzdex.FuzzDex()
# Streets with reversed constraints and own indices:
streets.add_phrase("Czerniakowska", 1, constraints={1})
streets.add_phrase("Nowy Świat", 2, constraints={1})
streets.add_phrase("Wawelska", 3, constraints={1})

streets.add_phrase("Czerniawska", 4, constraints={2})

# This recalculates trigram scores and makes index immutable:
cities.finish()
streets.finish()

# warszawa matches warsaw at editing distance 2.
cities.search("warszawa", [], max_distance=2, limit=60)
#    [{'origin': 'Warsaw', 'index': 1, 'token': 'warsaw',
#      'distance': 2, 'score': 200000.0, 'should_score': 0.0}]

# `świat` adds additional should score to the result and places it higher
# in case the limit is set:
streets.search("nowy", ["świat"], max_distance=2, constraint=1)
#    [{'origin': 'Nowy Świat', 'index': 2, 'token': 'nowy',
#      'distance': 0, 'score': 5.999, 'should_score': 7.4999}]

# Won't match with constraint 2.
streets.search("nowy", ["świat"], constraint=2)
#    []

# Quering for `czerniawska` will return `czerniakowska` (no constraints),
# but with a lower score and higher distance:
In [22]: streets.search("czerniawska", [], max_distance=2)
Out[22]:
#  [{'origin': 'Czerniawska', 'index': 4, 'token': 'czerniawska',
#   'distance': 0, 'score': 9.49995231628418, 'should_score': 0.0},
#  {'origin': 'Czerniakowska', 'index': 1, 'token': 'czerniakowska',
#   'distance': 2, 'score': 6.4999680519104, 'should_score': 0.0}]

Installation, development

You can install fuzzdex from PyPI when using one of the architectures it's published for (x86_64, few Python versions).

pip3 install fuzzdex

Or use maturin to build it locally:

pipenv install --dev
pipenv shell
maturin develop -r
pytest

You can also use cargo and copy or link the .so file directly (rename libfuzzdex.so to fuzzdex.so):

cargo build --release
ln -s target/release/libfuzzdex.so fuzzdex.so

build.sh has commands for building manylinux packages for PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

fuzzdex-0.4.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.5+ x86-64

fuzzdex-0.4.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.5+ x86-64

fuzzdex-0.4.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.5+ x86-64

fuzzdex-0.4.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.5+ x86-64

fuzzdex-0.4.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.5+ x86-64

File details

Details for the file fuzzdex-0.4.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.4.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 72a9d6fee230ebe15e8b490928d678023e4df4956ec0fd42707afe639410a8d5
MD5 3717ff7203441b5b17846956afecec83
BLAKE2b-256 53faea01afa39ff9a2bf83d36c8f307012b1b072fb50f3a2969e3d2cc80273ad

See more details on using hashes here.

File details

Details for the file fuzzdex-0.4.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.4.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 a5296bfdc966ed60787092b1de2884cc5b9d9829bdec993e3c5d216f31dfb873
MD5 b13b505816e1de8b561284375feedf75
BLAKE2b-256 7ebf5693b2f2c630e62ad6ccd3dacb9abe7e924b74201aca8c3f2bee5d40128c

See more details on using hashes here.

File details

Details for the file fuzzdex-0.4.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.4.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 cff5fdc8f878481b6adf8690cc57bf07529ddc8135a66bd330bce0996b45b9ef
MD5 d7dd2d9225a9b59a4c3168691aa9b0ee
BLAKE2b-256 75996cd0b65e86ba359a94118b3a5c0c47b94d1d3b54a5d6e82a7186ba20d10f

See more details on using hashes here.

File details

Details for the file fuzzdex-0.4.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.4.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 0f2e3dc5b5d13e131048b2de18e7e52d2001f6749d449bdc5a85cf0b2160a48e
MD5 1d50fc204b74adcc135a5301b16d6533
BLAKE2b-256 ce4a8d0751ac76f97e85ea16d39fca1935adf8d45106f978147c5d79a22f7c63

See more details on using hashes here.

File details

Details for the file fuzzdex-0.4.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-0.4.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 652e19b8a320d84ea8b4aea350550fad231205a895f6a1f9338f8df23ed7bd2d
MD5 a2d5be7114a4756ce61315502bc206f2
BLAKE2b-256 1de1899759949034b704079558e296bbbd61f1cc0b6412d0c4ec0a89696f7f46

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page