Skip to main content

Finds best pairings of elements between two sequences

Project description

Fuzzy Sequence Matcher

https://img.shields.io/pypi/v/fuzzy_sequence_matcher.svg https://img.shields.io/travis/catherinedevlin/fuzzy_sequence_matcher.svg Documentation Status

Purpose

Finds best pairings of elements between two sequences.

Each element can only be used at most once, and the order of elements is preserved. That is, if X1 -> Y2, then X2 cannot match to Y1 or Y2; it must match to Y3 or later. This is appropriate for sequences where Y is a garbled or mutated copy of X.

Example

>>> from fuzzy_sequence_matcher.fuzzy_sequence_matcher import best_matches
>>> from jellyfish import jaro_distance
>>> declaration = "We hold these truths to be self evident".split()
>>> degradation = ("I guess wee hold them tooths and stuff "
...     "for being sort of evidence, y'know?").split()
>>> best_matches(declaration, degradation, scorer=jaro_distance)
[('We', 'wee'), ('hold', 'hold'), ('these', 'them'), ('truths', 'tooths'), ('to', 'for'), ('be', 'being'), ('self', 'sort'), ('evident', 'evidence,')]

Features

  • Match any objects you can write a scoring function for

  • No dependencies outside standard library

Scoring function

The matching is done with a scoring function you specify. It should look something like:

def score(element_from_seq1: Any, element_from_seq2: Any) -> float

with high scores indicating better matches.

For comparing strings, you might use jellyfish.jaro_distance. For comparing numbers, -abs(n1 - n2) works.

threshold

By default, fuzzy_sequence_matcher finds the combination that maximizes the sum scores of all the pairings. However, when one sequence is much longer than the other, the number of possible combinations grows impractically large to try them all. The itertools documentation gives the number of combinations as

len(Y)! / len(X)! / (len(Y) - len(X))!

when len(Y) >= len(X).

This function is exposed as fuzzy_sequence_matcher.n_combinations.

If the number of possible combinations exceeds a threshold - by default, 1_000_000, which happens when the long seq is ~ 15 or more elements longer than the short - then elements from the longer sequence will be dropped from consideration entirely, starting with those whose best match to the shorter sequence is worst, until n_combinations is under the threshold. This could conceivably give a result that is not the ideal-scoring set of matches.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Thanks to Dayton Dynamic Languages for advice and brainstorming

History

0.1.0 (2019-12-14)

  • First release on PyPI.

0.1.1 (2019-12-26)

  • Allow scorers that handle the arguments differently.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fuzzy_sequence_matcher-0.1.1.tar.gz (13.7 kB view details)

Uploaded Source

Built Distribution

fuzzy_sequence_matcher-0.1.1-py2.py3-none-any.whl (6.5 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file fuzzy_sequence_matcher-0.1.1.tar.gz.

File metadata

  • Download URL: fuzzy_sequence_matcher-0.1.1.tar.gz
  • Upload date:
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for fuzzy_sequence_matcher-0.1.1.tar.gz
Algorithm Hash digest
SHA256 c386a12999d98448acd5e8bf4b9632d4f15bba9419fca842cc4189b7a562674b
MD5 186a081c67ac7e5c041f5f0be067b026
BLAKE2b-256 67da753340e647b460adcdd3e4e721386ef32f64a0a9d1f9c2c2e93034296348

See more details on using hashes here.

File details

Details for the file fuzzy_sequence_matcher-0.1.1-py2.py3-none-any.whl.

File metadata

  • Download URL: fuzzy_sequence_matcher-0.1.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 6.5 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.8.0

File hashes

Hashes for fuzzy_sequence_matcher-0.1.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 c425b44e54fec2f6abae3acbbcae11bb7fc27361a1f183d4515b4c67411900a5
MD5 9bdeda6cf2e70ca10ecb93f6b108fcaa
BLAKE2b-256 00ed14f1224684296db00ac309330c98c76f73fe920c6c6bb8356e44bdd7c470

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page