Deduplication using the RapidFuzz library.

Project description

dedupe-FuzzyWuzzy

Deduplicates a data set using the rapidfuzz library, which provides the same string matching as FuzzyWuzzy but is a lot faster.

Installation

pip install dedupe-FuzzyWuzzy

Basic Usage

It is very simple to use: import the library and pass your dataframe to dedupe_FuzzyWuzzy.deduplication, along with a list of the columns to use for deduplication and a threshold that sets how strict the matching criteria should be. A higher threshold is stricter and gives you fewer matches, whereas a lower threshold gives you more matches. I suggest having a look at fuzzywuzzy's documentation to understand this better.
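To make the effect of the threshold concrete, here is a minimal sketch. It uses Python's standard-library difflib as a stand-in scorer on a 0-100 scale, so it is not rapidfuzz and not this package's internals. The helper names similarity and matching_pairs and the sample names are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Score two strings on a 0-100 scale (stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_pairs(values, threshold):
    """Return every pair of values whose score meets or exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(values, 2)
        if similarity(a, b) >= threshold
    ]


# Hypothetical sample data
names = ["Acme Corp", "Acme Corp.", "Acme Corporation", "Zenith Ltd"]

# A high threshold keeps only near-identical pairs
strict = matching_pairs(names, threshold=90)

# A lower threshold also accepts looser variants of the same name
loose = matching_pairs(names, threshold=60)

print(len(strict), "pairs at threshold 90")
print(len(loose), "pairs at threshold 60")
```

Every pair found at the stricter threshold is also found at the looser one, so lowering the threshold can only add candidate duplicates. The scores from rapidfuzz scorers such as token_set_ratio will differ from this stand-in, so tune the threshold against your own data.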

Deduplication

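import pandas as pd
# assumed import: the original snippet uses fuzz without importing it;
# rapidfuzz's fuzz module provides token_set_ratio
from rapidfuzz import fuzz
import dedupe_FuzzyWuzzy

# load dataframe
df = pd.read_csv('messy.csv')

# initiate deduplication on the chosen columns
df_1 = dedupe_FuzzyWuzzy.deduplication(df, ['Site name', 'Address'], threshold=90, scorer=fuzz.token_set_ratio)

# send output to csv
df_1.to_csv('dedupeOutput.csv')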

Credits

This would not have been possible without the rapidfuzz package, whose author is @maxbachmann, so kudos to you!

