Deduplication using the RapidFuzz library.
Project description
dedupe-FuzzyWuzzy
Deduplicate a data set using the rapidfuzz library, which provides FuzzyWuzzy-style fuzzy string matching but is much faster.
Installation
pip install dedupe-FuzzyWuzzy
Basic Usage
It is simple to use: import the library and pass your dataframe to deduplication.deduplication, along with a list of the columns to use for deduplication and a threshold that sets how strict the matching criteria should be. A higher threshold is stricter and gives you fewer matches, whereas a lower threshold gives you more matches.
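To make the effect of the threshold concrete, here is a minimal sketch of threshold-based matching on a 0-100 similarity scale. It is an illustration only: it uses difflib from the Python standard library as a stand-in for a rapidfuzz scorer so that it runs without extra installs, and the helper names similarity, is_match and count_matches are hypothetical, not part of dedupe-FuzzyWuzzy.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib stands in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float) -> bool:
    """Treat two strings as duplicates when their score reaches the threshold."""
    return similarity(a, b) >= threshold


# Hypothetical sample pairs: identical, near-identical, and unrelated.
PAIRS = [
    ("Acme Warehouse", "Acme Warehouse"),
    ("Acme Warehouse", "Acme Warehouse Ltd"),
    ("Acme Warehouse", "Zenith Depot"),
]


def count_matches(threshold: float) -> int:
    """Count how many sample pairs are considered duplicates at this threshold."""
    return sum(is_match(a, b, threshold) for a, b in PAIRS)


if __name__ == "__main__":
    for threshold in (90, 80):
        print(f"threshold={threshold}: {count_matches(threshold)} matching pair(s)")
```

With these sample pairs, the stricter threshold accepts only the identical pair, while the looser one also accepts the near-identical pair. Scores from rapidfuzz scorers will differ from difflib's, so treat the numbers as illustrative.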
Deduplication
import pandas as pd
from dedupe_FuzzyWuzzy import deduplication

# Load the dataframe
df = pd.read_csv('messy.csv')

# Run deduplication on the chosen columns
df_1 = deduplication.deduplication(df, ['Site name', 'Address'], threshold=90)

# Write the output to CSV
df_1.to_csv('dedupeOutput.csv')
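For intuition about what row-level fuzzy deduplication over several columns can look like, the following is a rough sketch under stated assumptions. It is not the package's implementation, which is not described here: it joins the chosen columns into one key, scores keys with difflib (a standard-library stand-in for rapidfuzz), and greedily keeps the first row of each group of near-duplicates. The function names score and dedupe_rows and the sample rows are hypothetical.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib stands in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def row_key(row: dict, columns: list) -> str:
    """Join the selected column values into a single comparison string."""
    return " ".join(str(row[column]) for column in columns)


def dedupe_rows(rows: list, columns: list, threshold: float) -> list:
    """Greedy O(n^2) pass: keep a row only if no already-kept row matches it."""
    kept = []
    for row in rows:
        key = row_key(row, columns)
        is_duplicate = any(
            score(key, row_key(other, columns)) >= threshold for other in kept
        )
        if not is_duplicate:
            kept.append(row)
    return kept


# Hypothetical sample data using the column names from the usage example.
rows = [
    {"Site name": "Acme Warehouse", "Address": "12 High Street"},
    {"Site name": "Acme Warehouse", "Address": "12 High St"},
    {"Site name": "Zenith Depot", "Address": "4 Mill Lane"},
]

unique = dedupe_rows(rows, ["Site name", "Address"], threshold=85)

if __name__ == "__main__":
    for row in unique:
        print(row)
```

Here the second row is dropped as a near-duplicate of the first, and the unrelated third row is kept. A pairwise pass like this scales quadratically with the number of rows, which is one reason a fast scorer such as rapidfuzz matters on larger data sets.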
Credits
This would not have been possible without the rapidfuzz package by @maxbachmann, so kudos to you!
Hashes for dedupe_FuzzyWuzzy-1.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ad069252b1f2aeff6263965950cad1d95193385440bdc93a5f75dfb920c980e1
MD5 | fdb06aeb37fc428830e903627bf840d9
BLAKE2b-256 | 9c658cdd095ed9c38a41a5949feb5dc904301e88d6bafc297e56232b2dfffae9