The Dedupe library made easy with Pandas.

pandas-dedupe

Installation

pip install pandas-dedupe

Video Tutorials

Basic Deduplication

Usage

Basic Deduplication

import pandas as pd
import pandas_dedupe

# Load the dataframe
df = pd.read_csv('test_names.csv')

# Run deduplication on the listed fields
df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'middle_initial'])

# Write the output to CSV
df_final.to_csv('deduplication_output.csv')


#------------------------------additional details------------------------------

# A training file and a settings file will be created while running Dedupe.
# Keeping these files will eliminate the need to retrain your model in the future.
# If you would like to retrain your model, just delete the settings and training files.
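
The retraining step described above can be scripted. The following is a minimal sketch, assuming the saved files are named dedupe_dataframe_learned_settings and dedupe_dataframe_training.json and sit in the working directory. Those file names and the reset_model_files helper are assumptions for illustration, not part of the pandas-dedupe API, so check the names your installed version actually writes.

```python
from pathlib import Path

# Assumed default file names (not confirmed by this README); adjust them to
# match the files that appear in your working directory after a run.
DEFAULT_MODEL_FILES = (
    "dedupe_dataframe_learned_settings",
    "dedupe_dataframe_training.json",
)


def reset_model_files(directory=".", names=DEFAULT_MODEL_FILES):
    """Delete saved settings/training files so the next run retrains.

    Returns the list of paths that were actually removed.
    """
    removed = []
    for name in names:
        path = Path(directory) / name
        # Skip files that do not exist, so the helper is safe to call twice.
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Calling reset_model_files() before dedupe_dataframe should force a fresh training session. Pass a different names tuple if the files in your directory are named differently.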

Basic Matching / Record Linkage

import pandas as pd
import pandas_dedupe

# Load the dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')

# Run matching on the listed fields
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])

# Write the output to CSV
df_final.to_csv('linkage_output.csv')


#------------------------------additional details------------------------------

# Use identical field names when linking dataframes.

# Record linkage should only be used on dataframes that have been deduplicated.

# A training file and a settings file will be created while running Dedupe.
# Keeping these files will eliminate the need to retrain your model in the future.
# If you would like to retrain your model, just delete the settings and training files.
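
Because linking expects identical field names in both dataframes, it can help to align and check the columns before calling link_dataframes. The following is a minimal sketch using only standard pandas calls. The align_field_names helper, the rename_b mapping, and the sample column names are hypothetical and are not part of pandas-dedupe.

```python
import pandas as pd


def align_field_names(dfa, dfb, fields, rename_b=None):
    """Rename dfb's columns per rename_b, then verify both frames have every field.

    Returns the renamed copy of dfb; raises KeyError if a field is missing.
    """
    dfb = dfb.rename(columns=rename_b or {})
    for label, frame in (("dfa", dfa), ("dfb", dfb)):
        missing = [f for f in fields if f not in frame.columns]
        if missing:
            raise KeyError(f"{label} is missing fields: {missing}")
    return dfb


# Hypothetical sample data: the second frame uses different column names.
dfa = pd.DataFrame({"first_name": ["Ann"], "last_name": ["Lee"]})
dfb = pd.DataFrame({"fname": ["Ann"], "lname": ["Lee"]})

fields = ["first_name", "last_name"]
dfb = align_field_names(
    dfa, dfb, fields, rename_b={"fname": "first_name", "lname": "last_name"}
)
```

After this step, dfa, dfb, and fields can be passed to link_dataframes as in the example above.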

Credits

Many thanks to the folks at DataMade for making the Dedupe library publicly available. If you would like a code-free implementation of the Dedupe library, see Dedupe.io.

Files for pandas-dedupe, version 0.24
Filename                             Size    File type  Python version
pandas_dedupe-0.24-py3-none-any.whl  6.3 kB  Wheel      py3
pandas_dedupe-0.24.tar.gz            4.8 kB  Source     None
