Skip to main content

No project description provided

Project description

FuzzDex

FuzzDex is a fast Python library, written in Rust. It implements an in-memory fuzzy index that works like an error-tolerant dictionary keyed by a human input.

Algorithm

You load into fuzzdex series of short phrases - like street names consisting of one or multiple words, with an numerical index that identifies this street names in related dictionary of cities.

Then, you can query the index using a must-token (currently only one, but could be altered to use more) and additional should-tokens to read a total of limit of possibly matching phrases.

Must-token is trigramized (warszawa -> war ars rsz sza zaw awa) and all phrases containing given trigrams are initially read from the index. Trigrams have scores, the more common they are, the less they increase the phrase score. Trigrams of should-tokens additionally alter the score (positively when they match), but don't add additional phrases from index. Phrases are then sorted by score.

Top phrases are filtered to contain an optional constraint and the must-token with a maximal editing distance (Levenshtein) until limit of phrases is gathered.

Internally, the results of a must-token search are LRU cached as in practise it's pretty often repeated. Should-tokens vary and they are always recalculated.

Usecases

It was designed to match parts of a user supplied physical addresses to a data extracted from the OpenStreet map - in order to find streets and cities.

Address is first tokenized and then it's parts are matched against fuzzy dictionary of cities and streets. Additional constraints can limit the matched streets only to given city - or finding cities that have a given street.

Data is first searched for using trigrams (warszawa -> war ars rsz sza zaw awa), and then additionally filtered using maximal Levenshtein editing distance.

Original solution used fuzzy query of the Elasticsearch database, which worked - but was 21x slower in our tests.

Example

import fuzzdex
# Create two fuzzy indices with cities and streets.
cities = fuzzdex.FuzzDex()
# Warsaw has streets: Czerniakowska, Nowy Świat and Wawelska
cities.add_phrase("Warsaw", 1, constraints={1, 2, 3})
# Wrocław only Czerniawska
cities.add_phrase("Wrocław", 2, constraints={4})

streets = fuzzdex.FuzzDex()
# Streets with reversed constraints and own indices:
streets.add_phrase("Czerniakowska", 1, constraints={1})
streets.add_phrase("Nowy Świat", 2, constraints={1})
streets.add_phrase("Wawelska", 3, constraints={1})

streets.add_phrase("Czerniawska", 4, constraints={2})

# This recalculates trigram scores and makes index immutable:
cities.finish()
streets.finish()

# warszawa matches warsaw at editing distance 2.
cities.search(["warszawa"], [], max_distance=2, limit=60)
#    [{'origin': 'Warsaw', 'index': 1, 'token': 'warsaw',
#      'distance': 2, 'score': 200000.0, 'should_score': 0.0}]
#
# NOTE: Currently only a single `must` token is supported.
#
# `świat` adds additional should score to the result and places it higher
# in case the limit is set:
streets.search(["nowy"], ["świat"], max_distance=2, constraint=1)
#    [{'origin': 'Nowy Świat', 'index': 2, 'token': 'nowy',
#      'distance': 0, 'score': 5.999, 'should_score': 7.4999}]

# Won't match with constraint 2.
streets.search(["nowy"], ["świat"], constraint=2)
#    []

# Quering for `czerniawska` will return `czerniakowska` (no constraints),
# but with a lower score and higher distance:
In [22]: streets.search(["czerniawska"], [], max_distance=2)
Out[22]:
#  [{'origin': 'Czerniawska', 'index': 4, 'token': 'czerniawska',
#   'distance': 0, 'score': 9.49995231628418, 'should_score': 0.0},
#  {'origin': 'Czerniakowska', 'index': 1, 'token': 'czerniakowska',
#   'distance': 2, 'score': 6.4999680519104, 'should_score': 0.0}]

Installation, development

You can install fuzzdex from PyPI when using one of the architectures it's published for (x86_64, few Python versions).

pip3 install fuzzdex

Or use maturin to build it locally:

pipenv install --dev
pipenv shell
maturin develop -r
pytest

You can also use cargo and copy or link the .so file directly (rename libfuzzdex.so to fuzzdex.so):

cargo build --release
ln -s target/release/libfuzzdex.so fuzzdex.so

build.sh has commands for building manylinux packages for PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

fuzzdex-1.0.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.5+ x86-64

fuzzdex-1.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.5+ x86-64

fuzzdex-1.0.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.5+ x86-64

fuzzdex-1.0.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.5+ x86-64

fuzzdex-1.0.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.5+ x86-64

File details

Details for the file fuzzdex-1.0.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.0.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 a2b901ed052401488c4874cc2033317d1c765636d2a6ac31022639ab01086572
MD5 0392cc65a75739d3d1e533f8e9374d95
BLAKE2b-256 2bb0dd7cfaca6fbeb58f0d5da2efbd69a9a2e0b99220b9414da19015c5775b46

See more details on using hashes here.

File details

Details for the file fuzzdex-1.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 9890b508ef20f15dbf46d3a91f106db46f2b566b456550605a781d2afe0b214a
MD5 2ef7b7f286deb42544cbfc0f63bfed16
BLAKE2b-256 6cd6633e658025d597baa21c7e478031b299e9d494285324417429a7238d8374

See more details on using hashes here.

File details

Details for the file fuzzdex-1.0.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.0.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 fd1652f8b508e0cdc36fea6bb8d8647976088e66872127a09870002e846c9b8d
MD5 ceb0ffb7e15f1052b1758cea18ded96c
BLAKE2b-256 1a436b70c076773c4f77763f46d9b7cbeaa013a191ea8c25486f93a73f0ed1a6

See more details on using hashes here.

File details

Details for the file fuzzdex-1.0.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.0.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 4ec2e2e5bcd28951456ebe33ceec94d12658859f36868273cdb4a4d2d0e83175
MD5 d293bc4688b2130b001efd0b25893194
BLAKE2b-256 d8958fbb17d5372038da1de7d8626133434be5bac316179bf5df2cbf3bfdb26f

See more details on using hashes here.

File details

Details for the file fuzzdex-1.0.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for fuzzdex-1.0.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 4a99c3c4c54ac432e0768c3123acca537e9a47e7ae23f67fff9abcc9723def20
MD5 dae459ce3a0c6405d7b7bd78df9493be
BLAKE2b-256 5a4885aafdb29fc65d4afcec363e5bfba130157c638faed675a26f991052905a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page