A small package that enables super-fast TF-IDF based string matching.

Project description

tfidf_matcher is a package for fuzzy matching large datasets together. Most fuzzy matching libraries, such as fuzzywuzzy, get great results but perform very poorly on large datasets due to their O(n^2) complexity (every string is compared against every other string).

How does it work?

This package provides two functions:

  • ngrams(): Simple ngram generator.
  • matcher(): Matches a list of strings against a reference corpus. It does this by:
    • Vectorizing the reference corpus using TF-IDF into a term-document matrix.
    • Fitting a K-NearestNeighbours model to the sparse matrix.
    • Vectorizing the list of strings to be matched and passing the result to the KNN model to calculate the cosine distance (the out-of-the-box cosine_similarity function in sklearn is very memory-inefficient for our use case).
    • Some data manipulation to emit k_matches closest matches.
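
The steps above can be sketched in plain Python. This is only an illustration of the idea, not the package's code: the package works with sklearn and sparse matrices, whereas this sketch uses dictionaries and a brute-force search in place of a nearest-neighbours model. The helper names (char_ngrams, fit_idf, tfidf_vector, closest) and the company names are made up for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(corpus, n=3):
    """Learn smoothed inverse document frequencies from the reference corpus."""
    doc_freq = Counter()
    for doc in corpus:
        # Count each n-gram once per document.
        doc_freq.update(set(char_ngrams(doc, n)))
    total = len(corpus)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Build an L2-normalised sparse TF-IDF vector as {ngram: weight}."""
    # N-grams never seen in the reference corpus are ignored.
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def closest(query, corpus, k=3, n=3):
    """Brute-force search: the k corpus entries with the highest cosine similarity."""
    idf = fit_idf(corpus, n)
    q = tfidf_vector(query, idf, n)
    scored = []
    for doc in corpus:
        d = tfidf_vector(doc, idf, n)
        # Both vectors are unit length, so the dot product is the cosine similarity.
        sim = sum(w * d.get(g, 0.0) for g, w in q.items())
        scored.append((sim, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


lookup = ["Acme Holdings Ltd", "Apex Trading Company", "Zenith Logistics"]
matches = closest("Acme Holdngs Limited", lookup, k=2)
best_score, best_name = matches[0]
```

The misspelled query still shares most of its trigrams with the correct entry, which is why n-gram TF-IDF tolerates typos. The brute-force loop here is the part a fitted nearest-neighbours model replaces when the lookup corpus is large.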

Yeah ok, but how do I use it?

Define two lists: your original list (the list you want matches for) and your lookup list (the list you want to match against). Typically your lookup list will be much longer than your original list. Pass them into the matcher function along with the number of matches you want returned from the lookup list, using the `k_matches` argument. The result is a pandas DataFrame containing one row per item in your original list, along with `k_matches` columns containing the closest matches from the lookup list, and a match score for the closest match (which is 1 minus the cosine distance between the matches, normalised to [0,1]).
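
To make the score concrete, here is a small sketch of how a cosine distance turns into a match score. It assumes the score is computed as described above (1 minus the cosine distance); the vectors and the helper name cosine_distance are invented for illustration. Because TF-IDF weights are never negative, the cosine similarity of two such vectors already lies in [0, 1].

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy non-negative weight vectors standing in for TF-IDF rows.
query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # same n-gram profile, different scale
no_overlap = [0.0, 0.0, 3.0]      # shares no n-grams with the query

score_identical = 1.0 - cosine_distance(query_vec, same_direction)
score_disjoint = 1.0 - cosine_distance(query_vec, no_overlap)
```

A string with the same n-gram profile as the query scores close to 1, and a string with no n-grams in common scores 0.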

Simply import with `import tfidf_matcher as tm`, and call the matcher function with `tm.matcher()`. It takes the following arguments:

  • `original`: List of strings you want to match.
  • `lookup`: List of strings you want to match against.
  • `k_matches`: Number of the closest results from `lookup` to return (1 per column).
  • `ngram_length`: Length of the n-grams used in the algorithm. Anecdotal testing shows 2 or 3 to be optimal, but feel free to tinker.
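
Putting the arguments together, a call might look like the sketch below. The company names are made up, and the wrapper function find_matches is just a convenience for this example; only matcher() and its four arguments come from the description above. The import sits inside the function so the snippet can be loaded without the package installed.

```python
original = ["Acme Holdngs Limited", "Zenith Logistic"]
lookup = [
    "Acme Holdings Ltd",
    "Apex Trading Company",
    "Zenith Logistics",
    "Zenith Media Group",
]


def find_matches(original, lookup, k_matches=3, ngram_length=3):
    """Return the DataFrame produced by tfidf_matcher for the given lists."""
    import tfidf_matcher as tm  # requires the tfidf_matcher package

    return tm.matcher(
        original=original,
        lookup=lookup,
        k_matches=k_matches,
        ngram_length=ngram_length,
    )
```

Calling find_matches(original, lookup) should give one row per entry in original. It is probably safest to keep k_matches no larger than the length of lookup, since a nearest-neighbours search cannot return more neighbours than it was fitted on.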

Strengths and Weaknesses

  • Quick. Very quick.
  • Can emit however many closest matches you want. I found that 3 worked best.
  • Not very well tested so potentially unstable results. Worked well for 640 company names matched against a lookup corpus of >700,000 company names.
  • It’s pretty complicated to get to grips with the method if you wanted to apply it in different ways. The underlying algorithms are pretty hard to reason about when you jump to the definition of, say, TfidfVectorizer from sklearn. I just about understand the method, which I adapted from this blog post by Josh Taylor, which itself was adapted from another blog post.

Who do I thank?

As above, credit for the method goes to Josh Taylor and van den Blog. I wanted to adapt the methods to work nicely on a company matching problem I was having, and decided to build out my resultant code into a package for two reasons:

  1. Package building experience.
  2. Utility for future projects which may require large-domain fuzzy matching.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tfidf_matcher-0.2.1.zip (12.2 kB)

Uploaded Source

Built Distribution

tfidf_matcher-0.2.1-py2.py3-none-any.whl (7.8 kB)

Uploaded Python 2, Python 3

File details

Details for the file tfidf_matcher-0.2.1.zip.

File metadata

  • Download URL: tfidf_matcher-0.2.1.zip
  • Upload date:
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5

File hashes

Hashes for tfidf_matcher-0.2.1.zip
Algorithm Hash digest
SHA256 11f9c96718afee9c8cd343f2fcc57b9e6c5eda0c879dc8f75f2548c57f6b47f3
MD5 8243b18a3f6833a46a5ffc53f4a59510
BLAKE2b-256 0573226667c480d503826981da2f242709e5c2bf247a74aa95ba27b5e44cd337

File details

Details for the file tfidf_matcher-0.2.1-py2.py3-none-any.whl.

File metadata

  • Download URL: tfidf_matcher-0.2.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 7.8 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5

File hashes

Hashes for tfidf_matcher-0.2.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 3b883355d584ac59f834d77d889be20c972cee856948c100877470d3d87b2358
MD5 a430a3f9e8ba13509dafa1b7feb20bab
BLAKE2b-256 27fcbce2327e3de47a49b0f2650f29cc1f978ccde2b7a6123b1c9f48328d7633
