A Fuzzy Matching Approach for Clustering Strings

Fuzz Up [W.I.P.]

fuzzup offers a simple approach for clustering strings based on Levenshtein Distance using Fuzzy Matching in conjunction with Hierarchical Clustering.

Installation guide

fuzzup can be installed from the Python Package Index (PyPI) by:

pip install fuzzup

If you want the development version, install it directly from GitHub.

Workflow

fuzzup organizes strings by forming clusters from them. It does so in 3 steps:

  1. Compute all pairwise string distances (Levenshtein distances/fuzzy ratios) between the strings
  2. Form clusters of strings (using hierarchical clustering) based on the distances from (1)
  3. Rank the clusters by counting the number of nodes (strings) in each cluster
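
Steps (1) and (2) above can be sketched in plain Python. This is only an illustration and not fuzzup's implementation: the function names, the case-insensitive normalization, the choice of single linkage and the cutoff of 0.4 are all assumptions made for the example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; ignoring case is an assumption."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def single_linkage_clusters(strings, cutoff=0.4):
    """Group strings by single-linkage clustering cut at `cutoff`.

    Merging every pair whose distance is at most the cutoff (union-find)
    yields the same groups as cutting a single-linkage dendrogram there.
    The cutoff value is hypothetical, not a fuzzup default.
    """
    unique = sorted(set(strings))
    parent = {s: s for s in unique}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a, b in combinations(unique, 2):
        if normalized_distance(a, b) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for s in unique:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())


names = ["joe biden", "Joe Biden", "joe bidden", "nancy pelosi"]
print(single_linkage_clusters(names))
# [['Joe Biden', 'joe bidden', 'joe biden'], ['nancy pelosi']]
```

Single linkage is used here only because it is easy to express without extra dependencies; other linkage methods or a different cutoff would generally produce different clusters.
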
# TODO: update example with tuned model.
# strings we want to cluster
>>> person_names = ['Donald Trump', 'Donald Trump',
...                 'J. biden', 'joe biden', 'Biden',
...                 'Bide', 'mark esper', 'Christopher c . miller',
...                 'jim mattis', 'Nancy Pelosi', 'trumps',
...                 'Trump', 'Donald', 'miller']

>>> from fuzzup.gear import form_clusters_and_rank
>>> form_clusters_and_rank(person_names)
[{'PROMOTED_STRING': 'Donald Trump',
  'STRINGS': ['Donald Trump', 'Trump', 'trumps'],
  'COUNT': 4,
  'RANK': 1},
 {'PROMOTED_STRING': 'joe biden',
  'STRINGS': ['Bide', 'Biden', 'J. biden', 'joe biden'],
  'COUNT': 4,
  'RANK': 1},
 {'PROMOTED_STRING': 'Christopher c . miller',
  'STRINGS': ['Christopher c . miller', 'miller'],
  'COUNT': 2,
  'RANK': 3},
 {'PROMOTED_STRING': 'Nancy Pelosi',
  'STRINGS': ['Nancy Pelosi', 'mark esper'],
  'COUNT': 2,
  'RANK': 3},
 {'PROMOTED_STRING': 'jim mattis',
  'STRINGS': ['jim mattis'],
  'COUNT': 1,
  'RANK': 5},
 {'PROMOTED_STRING': 'Donald', 'STRINGS': ['Donald'], 'COUNT': 1, 'RANK': 5}]
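
The COUNT and RANK values in the output above appear to follow two rules: COUNT includes duplicate input strings (the first cluster lists 3 distinct strings but has a COUNT of 4), and tied clusters share a rank, with the next rank skipping ahead (1, 1, 3, 3, 5, 5). The sketch below reproduces those numbers from already-formed clusters. It is inferred from the example output rather than taken from fuzzup's source, and `rank_clusters` is a hypothetical helper name.

```python
from collections import Counter


def rank_clusters(clusters, strings):
    """Count occurrences per cluster and assign tie-sharing ranks."""
    occurrences = Counter(strings)  # duplicates in the input are counted
    counted = [
        {"STRINGS": sorted(cluster),
         "COUNT": sum(occurrences[s] for s in cluster)}
        for cluster in clusters
    ]
    # Largest clusters first; the sort is stable, so ties keep input order.
    counted.sort(key=lambda c: c["COUNT"], reverse=True)
    for position, cluster in enumerate(counted, start=1):
        previous = counted[position - 2] if position > 1 else None
        if previous is not None and previous["COUNT"] == cluster["COUNT"]:
            cluster["RANK"] = previous["RANK"]  # tie: share the rank
        else:
            cluster["RANK"] = position          # otherwise rank = position
    return counted


person_names = ['Donald Trump', 'Donald Trump',
                'J. biden', 'joe biden', 'Biden',
                'Bide', 'mark esper', 'Christopher c . miller',
                'jim mattis', 'Nancy Pelosi', 'trumps',
                'Trump', 'Donald', 'miller']

# Clusters copied from the example output above.
clusters = [['Donald Trump', 'Trump', 'trumps'],
            ['Bide', 'Biden', 'J. biden', 'joe biden'],
            ['Christopher c . miller', 'miller'],
            ['Nancy Pelosi', 'mark esper'],
            ['jim mattis'],
            ['Donald']]

for cluster in rank_clusters(clusters, person_names):
    print(cluster["RANK"], cluster["COUNT"], cluster["STRINGS"])
```

How fuzzup selects the PROMOTED_STRING is not stated here, so the sketch leaves that field out.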

Background

fuzzup is developed as part of Ekstra Bladet’s activities on Platform Intelligence in News (PIN). PIN is an industrial research project carried out in collaboration between the Technical University of Denmark, the University of Copenhagen and Copenhagen Business School, with funding from Innovation Fund Denmark. The project runs from 2020 to 2023 and develops recommender systems and natural language processing systems geared for news publishing, some of which are open-sourced, like fuzzup.

Read more

The detailed documentation and motivation for fuzzup including code references and extended workflow examples can be accessed here.

Contact

We hope that you will find fuzzup useful.

Please direct any questions and feedback to us!

If you want to contribute (which we encourage), open a pull request.

If you encounter a bug or want to suggest an enhancement, please open an issue.
