
The Dedupe library made easy with Pandas.


pandas-dedupe


Installation

pip install pandas-dedupe

Usage

Basic Deduplication

import pandas as pd
import pandas_dedupe

# load dataframe
df = pd.read_csv('test_names.csv')

# initiate deduplication
df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'middle_initial'])

# send output to csv
df_final.to_csv('deduplication_output.csv')


#------------------------------additional details------------------------------

# A training file and a settings file will be created while running Dedupe.
# Keeping these files will eliminate the need to retrain your model in the future.
# If you would like to retrain your model, just delete the settings and training files.
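
The retraining note above can be automated with a small helper. This is only a sketch: the two file names below are assumptions (they are not stated in this README), so check which settings and training files actually appear in your working directory and adjust the tuple. The helper `reset_model_files` is a hypothetical name, not part of pandas-dedupe.

```python
from pathlib import Path

# Assumed file names, for illustration only. Replace them with the names of
# the settings and training files that appear in your working directory.
ASSUMED_MODEL_FILES = (
    "dedupe_dataframe_learned_settings",
    "dedupe_dataframe_training.json",
)


def reset_model_files(directory=".", names=ASSUMED_MODEL_FILES):
    """Delete the named files in `directory` if present; return the names removed."""
    removed = []
    for name in names:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling `reset_model_files()` before `dedupe_dataframe` would force a fresh training session; skipping the call reuses whatever was saved earlier.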

Basic Matching / Record Linkage

import pandas as pd
import pandas_dedupe

# load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')

# initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])

# send output to csv
df_final.to_csv('linkage_output.csv')


#------------------------------additional details------------------------------

# Use identical field names when linking dataframes.

# Record linkage should only be used on dataframes that have been deduplicated.

# A training file and a settings file will be created while running Dedupe.
# Keeping these files will eliminate the need to retrain your model in the future.
# If you would like to retrain your model, just delete the settings and training files.
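
Because linking expects identical field names in both dataframes, it can help to check the column names before starting a training session. The sketch below uses a hypothetical helper, `missing_fields`, which is not part of pandas-dedupe; it accepts any iterable of column names, so plain lists stand in for `dfa.columns` and `dfb.columns` here.

```python
def missing_fields(columns_a, columns_b, fields):
    """Return, for each dataframe, the requested fields absent from its columns."""
    cols_a, cols_b = set(columns_a), set(columns_b)
    return {
        "a": [f for f in fields if f not in cols_a],
        "b": [f for f in fields if f not in cols_b],
    }


# Plain lists stand in for dfa.columns and dfb.columns.
gaps = missing_fields(
    ["field_1", "field_2", "id"],
    ["field_1", "Field_2"],
    ["field_1", "field_2"],
)
print(gaps)  # {'a': [], 'b': ['field_2']}
```

With real dataframes, pass `dfa.columns` and `dfb.columns`. A mismatched column such as `Field_2` can then be aligned with `dfb.rename(columns={'Field_2': 'field_2'})` before calling `link_dataframes`.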

Credits

Many thanks to the folks at DataMade for making the Dedupe library publicly available. People interested in a code-free implementation of the Dedupe library can use Dedupe.io.
