A Fuzzy Matching Approach for Clustering Strings

Project description

Fuzz Up [W.I.P.]

fuzzup offers a simple approach for clustering string entities based on Levenshtein distance, using fuzzy matching in conjunction with a simple rule-based clustering method.

fuzzup also provides functions for computing the prominence of the resulting entity clusters and to match them with entity whitelists.

An important use case for fuzzup is organizing, structuring and analyzing output from Named-Entity Recognition (NER).

fuzzup has been designed to fit the output from NER predictions from the Hugging Face transformers NER pipeline specifically.
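
For orientation, the sketch below shows the shape of the predictions that the Hugging Face transformers NER pipeline returns when run with aggregation_strategy="simple": a list of dictionaries with the keys entity_group, score, word, start and end. The sample values are hand-written for illustration, and the helper entity_strings is a hypothetical name, not part of fuzzup; how fuzzup itself consumes this structure is not shown here.

```python
# Hand-written sample in the shape returned by the Hugging Face `transformers`
# "ner" pipeline with aggregation_strategy="simple".
# The values are illustrative only, not real model output.
ner_output = [
    {"entity_group": "PER", "score": 0.99, "word": "Donald Trump", "start": 0, "end": 12},
    {"entity_group": "PER", "score": 0.97, "word": "Donald Trumph", "start": 40, "end": 53},
    {"entity_group": "LOC", "score": 0.98, "word": "Copenhagen", "start": 70, "end": 80},
]


def entity_strings(predictions, min_score=0.0):
    """Return the surface strings of predictions scoring at least `min_score`.

    Hypothetical helper for illustration; not a fuzzup function.
    """
    return [p["word"] for p in predictions if p["score"] >= min_score]


words = entity_strings(ner_output, min_score=0.5)
print(words)
```

The resulting list of strings is the kind of input that the clustering steps described under Workflow operate on.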

Installation guide

fuzzup can be installed from the Python Package Index (PyPI) by:

pip install fuzzup

If you want the development version, install directly from GitHub.

Workflow

fuzzup offers functionality for:

  1. Computing all of the mutual string distances (Levenshtein distances/fuzzy ratios) between the string entities
  2. Forming clusters of string entities based on the distances from (1)
  3. Computing prominence of the clusters from (2) based on the number of entity occurrences, their positions in the text etc.
  4. Matching entities (clusters) with entity whitelists

Together these steps constitute an end-to-end approach for organizing and structuring the output from NER. Here is an example of how to use fuzzup for forming entity clusters based on edit distances.
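
The snippet below does not use fuzzup's own API. It is a minimal standard-library sketch of the four workflow steps, written under several assumptions: the similarity is a Levenshtein distance normalised by the longer string, the "rule-based" clustering is taken to mean connected components over pairs whose similarity reaches a cutoff, and prominence is reduced to a plain occurrence count. The function names and the cutoff parameter are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Step 1: normalised similarity in [0, 100]; 100 means identical."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def cluster(strings, cutoff=80.0):
    """Step 2 (assumed rule): link strings with similarity >= cutoff and
    return the connected components, using a small union-find."""
    unique = sorted(set(strings))
    parent = {s: s for s in unique}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if similarity(a.lower(), b.lower()) >= cutoff:
                parent[find(b)] = find(a)

    groups = {}
    for s in unique:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


def prominence(clusters, strings):
    """Step 3 (simplified): rank clusters by total occurrences of their members."""
    counts = Counter(strings)
    ranked = [(sum(counts[s] for s in members), members) for members in clusters]
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)


def match_whitelist(clusters, whitelist, cutoff=80.0):
    """Step 4: best whitelist entry per cluster, or None if below the cutoff."""
    matches = []
    for members in clusters:
        score, entry = max(
            ((similarity(m.lower(), w.lower()), w) for m in members for w in whitelist),
            default=(0.0, None),
        )
        matches.append(entry if score >= cutoff else None)
    return matches


# Illustrative entity mentions, e.g. the "word" fields of NER predictions.
mentions = ["Donald Trump", "Donald Trumph", "donald trump",
            "Copenhagen", "Kopenhagen", "Aarhus", "Donald Trump"]

clusters = cluster(mentions)
print(clusters)
print(prominence(clusters, mentions))
print(match_whitelist(clusters, ["Donald Trump", "Copenhagen"]))
```

Connected components are a deliberately simple choice: one chain of near-duplicates can pull loosely related strings into the same cluster, so the cutoff usually needs tuning on real entities.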

To do

  • document whitelist matching in showcase
  • update readme with workflow
  • tests for whitelist
  • cutoff_threshold -> score_cutoff -> cdist
  • try and tune on junges entities
  • run against tores list
  • document whitelist
  • update docs

Background

fuzzup is developed as part of Ekstra Bladet’s activities on Platform Intelligence in News (PIN). PIN is an industrial research project carried out in collaboration between the Technical University of Denmark, University of Copenhagen and Copenhagen Business School, with funding from Innovation Fund Denmark. The project runs from 2020 to 2023 and develops recommender systems and natural language processing systems geared toward news publishing, some of which, like fuzzup, are open source.

Read more

The detailed documentation and motivation for fuzzup including code references and extended workflow examples can be accessed here.

Contact

We hope that you will find fuzzup useful.

Please direct any questions and feedback to us!

If you want to contribute (which we encourage), open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.

Download files

Source Distribution

fuzzup-0.1.3.tar.gz (8.8 kB)

Uploaded Source

Built Distribution

fuzzup-0.1.3-py3-none-any.whl (9.0 kB)

Uploaded Python 3

File details

Details for the file fuzzup-0.1.3.tar.gz.

File metadata

  • Download URL: fuzzup-0.1.3.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12

File hashes

Hashes for fuzzup-0.1.3.tar.gz
Algorithm Hash digest
SHA256 d2da20bf3442a450ad61c280b251c1fa4c9a2054feda0fefec5377e5fa18e355
MD5 6bf8ca2955ea14ac314636a38119e56c
BLAKE2b-256 9c3f9ad18eb7d52b1f91a24bccfd816ab542646b8b986f9a10aea1b3a6cec5f2

File details

Details for the file fuzzup-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: fuzzup-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 9.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12

File hashes

Hashes for fuzzup-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 7d9c800775db5708df1d869ddae6a6cc40b796e33480a2b9f79d05d68e8949fb
MD5 40e82941679190b4a0dc3323f3eee80b
BLAKE2b-256 85f00bb672fcdff79c34b43d0192c2fc1229b5ac7dc589f375b4a1f843b62893
