Deduplication using RapidFuzz library.
dedupe-FuzzyWuzzy
Deduplicates a data set using the rapidfuzz library, which provides the same kind of fuzzy string matching as FuzzyWuzzy but is much faster.
Installation
pip install dedupe-FuzzyWuzzy
Basic Usage
Usage is simple: import the library and pass the DataFrame to dedupeFuzzy, along with a list of the columns to use for deduplication and a threshold that sets how strict the matching criteria should be. A higher threshold matches more strictly and gives you fewer matches, whereas a lower threshold gives you more matches.
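To make the effect of the threshold concrete, here is a minimal sketch. It does not use this package or rapidfuzz: the standard library's difflib.SequenceMatcher stands in for a rapidfuzz scorer so that the snippet runs without extra installs, and the helper names similarity and is_duplicate are hypothetical. Scores from rapidfuzz scorers such as fuzz.ratio are also on a 0 to 100 scale, but the exact values may differ from those shown here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stdlib stand-in, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(a: str, b: str, threshold: float) -> bool:
    """Treat two strings as duplicates when their score reaches the threshold."""
    return similarity(a, b) >= threshold


full = "Acme Store, 12 Main Street"
short = "Acme Store, 12 Main St"

print(round(similarity(full, short), 1))   # 91.7
print(is_duplicate(full, short, 90))       # True: the looser threshold matches
print(is_duplicate(full, short, 95))       # False: the stricter threshold rejects
```

The same pair of strings is a match at 90 and a non-match at 95, which is why raising the threshold leaves you with fewer matches.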
Deduplication
import pandas as pd
from dedupe_FuzzyWuzzy import deduplication

# Load the DataFrame
df = pd.read_csv('messy.csv')

# Run deduplication on the chosen columns
df_1 = deduplication.deduplication(df, ['Site name', 'Address'], threshold=90)

# Write the output to CSV
df_1.to_csv('dedupeOutput.csv')
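For readers curious how threshold-based deduplication over several columns can work, the following is a rough sketch only. It is not this package's implementation, and the source does not say whether the package drops duplicate rows or merely flags them. The sketch assumes a simple greedy approach that keeps the first row of each group, uses difflib from the standard library in place of rapidfuzz, and operates on plain dictionaries rather than a DataFrame. The names row_key, score, and drop_fuzzy_duplicates are hypothetical.

```python
from difflib import SequenceMatcher


def row_key(row, columns):
    """Join the chosen columns into one normalized comparison string."""
    return " ".join(str(row[c]).strip().lower() for c in columns)


def score(a, b):
    """0-100 similarity score (stdlib stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def drop_fuzzy_duplicates(rows, columns, threshold=90):
    """Keep a row only if it is not too similar to a row already kept."""
    kept, kept_keys = [], []
    for row in rows:
        key = row_key(row, columns)
        if any(score(key, seen) >= threshold for seen in kept_keys):
            continue  # near-duplicate of an earlier row, so skip it
        kept.append(row)
        kept_keys.append(key)
    return kept


rows = [
    {"Site name": "Acme Store", "Address": "12 Main Street"},
    {"Site name": "Acme Store", "Address": "12 Main St"},
    {"Site name": "Beta Shop", "Address": "99 Oak Avenue"},
]

unique_rows = drop_fuzzy_duplicates(rows, ["Site name", "Address"], threshold=90)
for row in unique_rows:
    print(row["Site name"], "|", row["Address"])
```

With a threshold of 90 the second row is treated as a duplicate of the first, while a threshold of 95 keeps all three rows. Each row is compared against every row kept so far, so the cost grows roughly quadratically with the number of rows; this is one reason a fast scorer such as rapidfuzz matters for larger data sets.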
Credits
This would not have been possible without the rapidfuzz package, whose author is @maxbachmann, so kudos to you!
Download files
File details
Details for the file dedupe_FuzzyWuzzy-1.0.2.tar.gz.
File metadata
- Download URL: dedupe_FuzzyWuzzy-1.0.2.tar.gz
- Upload date:
- Size: 2.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.6.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b640948d4acecd74916db628a18971d9a4792a640e3d7d7f1f61d86965d1ffa |
| MD5 | 44779366fb0d6d5fc48cff9fad9bafcd |
| BLAKE2b-256 | 12cfac7513c9439042c0f94668b55717ab1cb27a366e7d5330d5681a2a1575c2 |
File details
Details for the file dedupe_FuzzyWuzzy-1.0.2-py3-none-any.whl.
File metadata
- Download URL: dedupe_FuzzyWuzzy-1.0.2-py3-none-any.whl
- Upload date:
- Size: 3.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.6.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad069252b1f2aeff6263965950cad1d95193385440bdc93a5f75dfb920c980e1 |
| MD5 | fdb06aeb37fc428830e903627bf840d9 |
| BLAKE2b-256 | 9c658cdd095ed9c38a41a5949feb5dc904301e88d6bafc297e56232b2dfffae9 |