pandas-dedupe
The Dedupe library made easy with Pandas.
Installation
pip install pandas-dedupe
Usage
Basic Deduplication
import pandas as pd
import pandas_dedupe
# Load the dataframe
df = pd.read_csv('test_names.csv')
# Run deduplication on the listed fields
df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'middle_initial'])
# Write the output to CSV
df_final.to_csv('deduplication_output.csv')
# ------------------------------ additional details ------------------------------
# A training file and a settings file will be created while running Dedupe.
# Keeping these files eliminates the need to retrain your model in the future.
# To retrain your model, delete the settings and training files.
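The retraining note above can be scripted. What follows is a minimal sketch, not part of pandas-dedupe: the helper name `reset_model_files` is hypothetical, and the two default file names are an assumption about what the library writes to the working directory. Check your own directory listing and pass the real names if they differ.

```python
from pathlib import Path

# ASSUMPTION: these names are a guess at the settings and training files
# written during deduplication. Verify them in your working directory.
ASSUMED_MODEL_FILES = (
    "dedupe_dataframe_learned_settings",
    "dedupe_dataframe_training.json",
)


def reset_model_files(directory=".", filenames=ASSUMED_MODEL_FILES):
    """Delete the named files from `directory` if they exist.

    Returns the list of file names that were actually removed, so the
    caller can tell whether a retrain will be triggered on the next run.
    """
    removed = []
    for name in filenames:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling `reset_model_files()` before `pandas_dedupe.dedupe_dataframe(...)` should force a fresh training session, provided the file names match. Files not listed in `filenames` are left untouched.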
Basic Matching / Record Linkage
import pandas as pd
import pandas_dedupe
# Load the dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
# Run matching on the listed fields
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
# Write the output to CSV
df_final.to_csv('linkage_output.csv')
# ------------------------------ additional details ------------------------------
# Use identical field names in both dataframes when linking.
# Record linkage should only be used on dataframes that have already been deduplicated.
# A training file and a settings file will be created while running Dedupe.
# Keeping these files eliminates the need to retrain your model in the future.
# To retrain your model, delete the settings and training files.
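Because linking expects identical field names in both dataframes, it can help to verify that before starting a training session. The sketch below is one way to do so; `missing_fields` and `check_linkable` are hypothetical helpers, not part of pandas-dedupe, and they work on any collection of column names.

```python
def missing_fields(columns, fields):
    """Return the entries of `fields` that are absent from `columns`, in order."""
    present = set(columns)
    return [field for field in fields if field not in present]


def check_linkable(columns_a, columns_b, fields):
    """Raise ValueError if either set of columns lacks a linkage field."""
    problems = {
        "first dataframe": missing_fields(columns_a, fields),
        "second dataframe": missing_fields(columns_b, fields),
    }
    # Keep only the dataframes that are actually missing something.
    problems = {label: gaps for label, gaps in problems.items() if gaps}
    if problems:
        raise ValueError(f"Linkage fields missing: {problems}")
```

With pandas, this would be called as `check_linkable(dfa.columns, dfb.columns, fields)` before `pandas_dedupe.link_dataframes(...)`. If a field is named differently in one file, `DataFrame.rename(columns={...})` can align the names first.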
Credits
Many thanks to the folks at DataMade for making the Dedupe library publicly available. People interested in a code-free implementation of the Dedupe library can find one at Dedupe.io.