A small package that enables super-fast TF-IDF based string matching.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

tfidf_matcher is a package for fuzzy matching large datasets together. Most fuzzy matching libraries, such as fuzzywuzzy, get great results but don't scale well due to their O(n^2) complexity.

How does it work?

This package provides two functions:

ngrams(): Simple ngram generator.
matcher(): Matches a list of strings against a reference corpus. It does this by:
- Vectorizing the reference corpus using TF-IDF into a term-document matrix.
- Fitting a K-NearestNeighbours model to the sparse matrix.
- Vectorizing the list of strings to be matched and passing it into the KNN model to calculate the cosine distance (the OOTB cosine_similarity function in sklearn is very memory-inefficient for our use case).
- Reshaping the results to emit the k_matches closest matches.
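
The steps above can be sketched with the standard library alone. This is not the package's code: the package uses sklearn's vectorizer and nearest-neighbour model, whereas this is a brute-force illustration of the same idea. The names char_ngrams, fit_idf, tfidf_vector and top_k_matches are hypothetical, and the lowercasing and smoothed IDF formula are assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()  # assumption: case-insensitive matching
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(corpus, n=3):
    """Compute smoothed inverse document frequencies over the reference corpus."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(char_ngrams(doc, n)))
    total = len(corpus)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Build an L2-normalised sparse TF-IDF vector (dict of n-gram -> weight)."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine_similarity(a, b):
    """Dot product of two unit-length sparse vectors, i.e. 1 - cosine distance."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def top_k_matches(original, lookup, k_matches=3, ngram_length=3):
    """Return, for each original string, its k closest lookup strings with scores."""
    idf = fit_idf(lookup, ngram_length)
    lookup_vecs = [tfidf_vector(s, idf, ngram_length) for s in lookup]
    results = {}
    for query in original:
        q = tfidf_vector(query, idf, ngram_length)
        scored = [(cosine_similarity(q, v), s) for v, s in zip(lookup_vecs, lookup)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results[query] = [(s, round(score, 3)) for score, s in scored[:k_matches]]
    return results


# Made-up company names, purely for illustration.
lookup = ["Acme Corporation", "Acme Holdings Ltd", "Globex Inc", "Initech LLC"]
matches = top_k_matches(["acme corp"], lookup, k_matches=2)
print(matches)
```

The brute-force loop compares every query against every lookup entry, so it only suits small inputs. It is meant to show what the vectorize-then-nearest-neighbour steps compute, not how to do it at scale.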

Yeah ok, but how do I use it?

Define two lists: your original list (the strings you want matches for) and your lookup list (the strings you want to match against). Typically your lookup list will be much longer than your original list. Pass both into the matcher function, along with the number of matches you want returned from the lookup list via the k_matches argument. The result is a pandas DataFrame containing one row per item in your original list, along with k_matches columns containing the closest matches from the lookup list, and a match score for the closest match (1 minus the cosine distance between the two strings, normalised to [0,1]).

Import the package with `import tfidf_matcher as tm`, and call the matcher function with `tm.matcher()`. It takes the following arguments:

original: List of strings you want to match.
lookup: List of strings you want to match against.
k_matches: Number of the closest results from lookup to return (1 per column).
ngram_length: Length of ngrams used in the algorithm. Anecdotal testing shows 2 or 3 to be optimal, but feel free to tinker.
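
A minimal usage sketch, assuming the package is installed and that the arguments listed above can be passed by keyword under those names. The company names are made up, and run_matcher is a hypothetical wrapper used here only so the import happens when the function is called.

```python
# Illustrative company names; any two lists of strings would do.
original = ["acme corp", "globex incorporated"]
lookup = ["Acme Corporation", "Acme Holdings Ltd", "Globex Inc", "Initech LLC"]


def run_matcher():
    """Match `original` against `lookup` and return the resulting DataFrame."""
    # Imported lazily so this snippet can be loaded without the package installed.
    import tfidf_matcher as tm

    # Assumption: keyword names match the argument names documented above.
    return tm.matcher(
        original=original,
        lookup=lookup,
        k_matches=3,
        ngram_length=3,
    )
```

Calling run_matcher() should return a DataFrame with one row per entry in original. With a lookup list this small, keep k_matches no larger than the number of lookup entries.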

Strengths and Weaknesses

Quick. Very quick.
Can emit however many closest matches you want. I found that 3 worked best.
Not very well tested so potentially unstable results. Worked well for 640 company names matched against a lookup corpus of >700,000 company names.

Who do I thank?

For the method, thank Josh Taylor and Chris van den Berg. I wanted to adapt their methods to work nicely on a company matching problem I was having, and decided to build out my resultant code into a package for two reasons:

Package building experience.
Utility for future projects which may require large-domain fuzzy matching.

I understand the algorithms behind k-Nearest Neighbours & TF-IDF Vectorisation, but it was through implementing the ideas in the blogs linked that I was able to build this project out.

Release history

- 0.3.0 (this version): Apr 6, 2023
- 0.2.1: Feb 20, 2020
- 0.2.0: Feb 20, 2020
- 0.1.0: Feb 19, 2020

Download files

Source Distribution

tfidf_matcher-0.3.0.tar.gz (8.1 kB)

Uploaded Apr 6, 2023 Source

Built Distribution

tfidf_matcher-0.3.0-py3-none-any.whl (8.0 kB)

Uploaded Apr 6, 2023 Python 3

File details

Details for the file tfidf_matcher-0.3.0.tar.gz.

File metadata

Download URL: tfidf_matcher-0.3.0.tar.gz
Upload date: Apr 6, 2023
Size: 8.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.7

File hashes

Hashes for tfidf_matcher-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`26939ab3f7f12c6c7421af83ab974c1707aa9724eb99de2f66f00bd5d5d5b468`
MD5	`d72520ea9ad44d7443ff12b318580b2f`
BLAKE2b-256	`cccdaf378c05b1f199879f2deed7c31abc4175dad73052f08832a4b4752cdbf6`

File details

Details for the file tfidf_matcher-0.3.0-py3-none-any.whl.

File metadata

Download URL: tfidf_matcher-0.3.0-py3-none-any.whl
Upload date: Apr 6, 2023
Size: 8.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.7

File hashes

Hashes for tfidf_matcher-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1f5ba14465a9d6a3fbb244d77319240908110ab81cd453bbb26ab99fff687210`
MD5	`38399ed465677d876f0a5ff4d6c57fd6`
BLAKE2b-256	`02bf9c086937798418e7eb283b97638e478ec39333bbed646b85c98024fd4b8c`

tfidf-matcher 0.3.0
