Python library to create equivalence dictionaries between a set of texts and a catalog using Levenshtein distance.
WibblyWobbly
Python library to create equivalence dictionaries between a set of texts and a catalog using thefuzz (formerly FuzzyWuzzy).
It is a common nightmare for data scientists: your human users captured the data according to a "catalog", but it is full of mistakes. WibblyWobbly automates matching the data to the catalog, while allowing manual review of suspicious cases and rejecting bad matches.
Requirements
- Python 3 or higher
- thefuzz
- python-Levenshtein (optional)
- unidecode
- pandas
Installation
Using pip via PyPI
pip install wibblywobbly
WibblyWobbly extends thefuzz; it is recommended to install python-Levenshtein too.
pip install thefuzz
pip install python-Levenshtein
Usage
Match data to a catalog
Import wibblywobbly and load your data and catalog as lists. If you are using pandas, use .to_list().
import wibblywobbly as ww
catalog = ["Mouse", "Cat", "Dog", "Human"]
data = ["mice", "CAT ", "doggo", "PERSON", 999]
WibblyWobbly compares the data to the catalog and returns the most likely options and a similarity score. If it cannot find a good match it will return the original data.
It automatically accepts the catalog options whose similarity score is higher than thr_accept and rejects those whose score is lower than thr_reject. These threshold values can be adjusted depending on the data quality. Non-string values are ignored.
By default it returns a pandas DataFrame, which can be saved as a CSV or Excel file with .to_csv() or .to_excel().
ww.map_list_to_catalog(data, catalog, thr_accept=95, thr_reject=40)
| | Data | Option1 | Score1 | Option2 | Score2 | Option3 | Score3 |
|---|---|---|---|---|---|---|---|
| 0 | CAT | Cat | 100 | None | NaN | None | NaN |
| 1 | doggo | Dog | 90 | Mouse | 20.0 | Human | 0.0 |
| 2 | mice | Mouse | 44 | Cat | 29.0 | Human | 22.0 |
| 3 | PERSON | PERSON | 0 | None | NaN | None | NaN |
| 4 | 999 | 999 | 0 | None | NaN | None | NaN |
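The accept/reject logic described above can be sketched in plain Python. This is only an illustration under stated assumptions, not WibblyWobbly's implementation: it scores with the standard library's difflib instead of thefuzz, so scores can differ from the table, and the helper names `similarity` and `classify`, the normalization step, and the handling of scores exactly equal to a threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (difflib ratio, a stand-in for thefuzz)."""
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    return round(100 * SequenceMatcher(None, a_norm, b_norm).ratio())


def classify(value, catalog, thr_accept=95, thr_reject=40):
    """Return (chosen_value, score, status) for a single data value."""
    if not isinstance(value, str):
        # Non-string values are passed through untouched.
        return value, 0, "ignored"
    # Pick the catalog option with the highest score.
    score, best = max((similarity(value, option), option) for option in catalog)
    if score >= thr_accept:
        return best, score, "accepted"
    if score < thr_reject:
        # Bad match: keep the original value.
        return value, score, "rejected"
    # In between the two thresholds: a candidate for manual review.
    return best, score, "review"


catalog = ["Mouse", "Cat", "Dog", "Human"]
data = ["mice", "CAT ", "doggo", "PERSON", 999]

for value in data:
    print(repr(value), "->", classify(value, catalog))
```

Raising thr_accept sends more values to manual review, while raising thr_reject discards more weak matches.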
WibblyWobbly can also return a dictionary with the best options. This dictionary can be used to clean a pandas dataframe with .replace() and .map().
ww.map_list_to_catalog(data, catalog, output_format="dictionary")
{'mice': 'mice', 999: 999, 'doggo': 'Dog', 'PERSON': 'PERSON', 'CAT ': 'Cat'}
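As a hedged illustration of the cleaning step, the sketch below applies the dictionary shown above to a pandas column. The DataFrame, the column names `animal`, `animal_mapped` and `animal_replaced`, and the choice to pass only the changed entries to .replace() are assumptions made for this example.

```python
import pandas as pd

# Dictionary copied from the example output above.
equivalences = {"mice": "mice", 999: 999, "doggo": "Dog",
                "PERSON": "PERSON", "CAT ": "Cat"}

# Hypothetical raw data with one column to clean.
df = pd.DataFrame({"animal": ["mice", "CAT ", "doggo", "PERSON", "CAT "]})

# .map() looks every value up in the dictionary;
# values that are missing from the dictionary become NaN.
df["animal_mapped"] = df["animal"].map(equivalences)

# .replace() only swaps the values it is given and leaves the rest as-is,
# so it is enough to pass the entries that actually change.
changes = {old: new for old, new in equivalences.items() if old != new}
df["animal_replaced"] = df["animal"].replace(changes)

print(df)
```

.map() is the stricter choice because unseen values surface as NaN, whereas .replace() silently keeps them.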
It is possible to set a reject_value that replaces rejected values. Non-string values are left unchanged.
ww.map_list_to_catalog(data, catalog, output_format="dictionary", reject_value='Other')
{'mice': 'Other', 999: 999, 'doggo': 'Dog', 'PERSON': 'Other', 'CAT ': 'Cat'}
WibblyWobbly can also raise warnings about suspicious values to facilitate visual inspection.
ww.map_list_to_catalog(data, catalog, output_format="dictionary", thr_accept=95, thr_reject=40, warnings=True)
WOBBLY: mice Options: Mouse (44), Cat (29), Human (22)
WOBBLY: doggo Options: Dog (90), Mouse (20), Human (0)
{'mice': 'Mouse', 999: 999, 'doggo': 'Dog', 'PERSON': 'PERSON', 'CAT ': 'Cat'}
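To make the idea of a "suspicious" value concrete, here is a minimal sketch that flags every value whose best score falls between the two thresholds. It does not call WibblyWobbly: the scores are copied from the results table above, the function name `wobbly_report` is hypothetical, and treating a score equal to thr_reject as suspicious is an assumption.

```python
# Candidate options and scores copied from the results table above.
candidates = {
    "CAT ": [("Cat", 100)],
    "doggo": [("Dog", 90), ("Mouse", 20), ("Human", 0)],
    "mice": [("Mouse", 44), ("Cat", 29), ("Human", 22)],
    "PERSON": [("PERSON", 0)],
}


def wobbly_report(candidates, thr_accept=95, thr_reject=40):
    """List the values that are neither clearly accepted nor clearly rejected."""
    lines = []
    for value, options in candidates.items():
        best_score = options[0][1]  # options are sorted best-first
        if thr_reject <= best_score < thr_accept:
            listed = ", ".join(f"{name} ({score})" for name, score in options)
            lines.append(f"WOBBLY: {value} Options: {listed}")
    return lines


report = wobbly_report(candidates)
print("\n".join(report))
```

With thr_accept=95 and thr_reject=40 this flags doggo (90) and mice (44), the same two values reported in the warnings above.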
Versions
- 0.2.0
  - Now uses thefuzz
  - Rough clustering algorithm
  - Hierarchical dictionaries
  - Happy New Year!
- 0.1.0
  - We are online!
  - Basic operations to match list to catalogs
Thanks
The thefuzz team, you are amazing!
Syats for helping with the hierarchical code.
You see, most people think that time is a strict progression of cause to effect, but actually, from a non-linear, non-subjective point of view, it’s more like a big ball of... Wibbly-Wobbly... Timey-Wimey... stuff.
The Doctor