# pandas-dedupe

The Dedupe library made easy with Pandas.

# Installation

    pip install pandas-dedupe

# Video Tutorials

[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)

# Basic Usage

### Deduplication

    import pandas as pd
    import pandas_dedupe

    # load dataframe
    df = pd.read_csv('test_names.csv')

    # initiate deduplication
    df_final = pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'middle_initial'])

    # send output to csv
    df_final.to_csv('deduplication_output.csv')

    # ------------------------------additional details------------------------------

    # A training file and a settings file will be created while running Dedupe.
    # Keeping these files will eliminate the need to retrain your model in the future.
    # If you would like to retrain your model, just delete the settings and training files.
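
Retraining by deleting the saved files can be scripted. The helper below is a minimal sketch, not part of pandas-dedupe: `reset_dedupe_model` is a hypothetical name, and the two default file names are assumptions, so check your working directory for the names your installed version actually writes.

```python
from pathlib import Path

# ASSUMED file names; verify them against the files created in your
# working directory after a first training run.
ASSUMED_MODEL_FILES = (
    "dedupe_dataframe_learned_settings",
    "dedupe_dataframe_training.json",
)


def reset_dedupe_model(directory=".", filenames=ASSUMED_MODEL_FILES):
    """Delete saved settings/training files so the next run retrains.

    Returns the names of the files that were actually removed.
    """
    removed = []
    for name in filenames:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling `reset_dedupe_model()` before `dedupe_dataframe(...)` would then force a fresh training session; files that do not exist are skipped.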

### Matching / Record Linkage

    import pandas as pd
    import pandas_dedupe

    # load dataframes
    dfa = pd.read_csv('file_a.csv')
    dfb = pd.read_csv('file_b.csv')

    # initiate matching
    df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])

    # send output to csv
    df_final.to_csv('linkage_output.csv')

    # ------------------------------additional details------------------------------

    # Use identical field names when linking dataframes.

    # Record linkage should only be used on dataframes that have been deduplicated.

    # A training file and a settings file will be created while running Dedupe.
    # Keeping these files will eliminate the need to retrain your model in the future.
    # If you would like to retrain your model, just delete the settings and training files.
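
Because both dataframes must use identical field names, a quick pre-check can catch mismatches before training starts. The sketch below is illustrative only: `missing_fields` is a hypothetical helper, and it works on plain lists of column names (for example `list(dfa.columns)`), so it does not need pandas to run.

```python
def missing_fields(columns_a, columns_b, fields):
    """Return, per dataframe, the requested fields that are absent."""
    cols_a, cols_b = set(columns_a), set(columns_b)
    return {
        "dfa": [f for f in fields if f not in cols_a],
        "dfb": [f for f in fields if f not in cols_b],
    }


fields = ["field_1", "field_2"]
problems = missing_fields(
    ["field_1", "field_2", "id"],  # e.g. list(dfa.columns)
    ["field_1", "Field_2", "id"],  # e.g. list(dfb.columns); note the capital F
    fields,
)
print(problems)  # {'dfa': [], 'dfb': ['field_2']}
```

Field names are compared case-sensitively here, so `Field_2` does not count as `field_2`.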

### Fuzzy Left Join

    import pandas as pd
    import pandas_dedupe

    # load dataframes
    df_left = pd.read_csv('file_a.csv')
    df_right = pd.read_csv('file_b.csv')

    # initiate matching
    df_final = pandas_dedupe.left_join(df_left, df_right, ['field_1', 'field_2', 'field_3', 'field_4'])

    # send output to csv
    df_final.to_csv('df_left_join_output.csv')

    # ------------------------------additional details------------------------------

    # Use identical field names when linking dataframes.

    # Record linkage should only be used on dataframes that have been deduplicated.

    # A training file and a settings file will be created while running Dedupe.
    # Keeping these files will eliminate the need to retrain your model in the future.
    # If you would like to retrain your model, just delete the settings and training files.

# Advanced Usage

### Canonicalize Fields

    pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', 'payment_type'], canonicalize=True)

    # ------------------------------additional details------------------------------

    # Creates a standardized version of every element by field & cluster id. For instance,
    # if you had the field "first_name", and the first cluster id had 3 items, "John",
    # "John", and "Johnny", the canonicalized version would have "John" listed for all
    # three in a new field called "first_name - canonical".

    # If you prefer to canonicalize only a few of your fields, you can set the parameter
    # to a list of the fields you want a canonical version for. In the example above, you
    # could have written canonicalize=['first_name', 'last_name'], and you would get
    # a canonical version for first_name and last_name, but not for payment_type.
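
To make the idea concrete, here is a simplified sketch of canonicalization on plain dictionaries. It picks the most frequent value in each cluster, which is an assumption made for illustration; the rule Dedupe actually applies may differ. `canonicalize_field` is a hypothetical helper, and the `cluster id` column name is assumed.

```python
from collections import Counter, defaultdict


def canonicalize_field(rows, field, cluster_key="cluster id"):
    """Add '<field> - canonical' holding the most common value per cluster.

    Simplified stand-in for illustration, not the library's implementation.
    """
    values_by_cluster = defaultdict(list)
    for row in rows:
        values_by_cluster[row[cluster_key]].append(row[field])

    # Most frequent value wins; ties go to the value seen first.
    canonical = {
        cluster: Counter(values).most_common(1)[0][0]
        for cluster, values in values_by_cluster.items()
    }
    return [
        {**row, f"{field} - canonical": canonical[row[cluster_key]]}
        for row in rows
    ]


rows = [
    {"cluster id": 0, "first_name": "John"},
    {"cluster id": 0, "first_name": "John"},
    {"cluster id": 0, "first_name": "Johnny"},
    {"cluster id": 1, "first_name": "Mary"},
]
result = canonicalize_field(rows, "first_name")
```

All three rows in cluster 0 end up with `"John"` in the new `first_name - canonical` field, while the original `first_name` values are left untouched.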

### Specifying Types

    # Price example
    pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name', ('salary', 'Price')])

    # has missing example
    pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])

    # crf example
    pandas_dedupe.dedupe_dataframe(df, [('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])

    # ------------------------------additional details------------------------------

    # If a type is not explicitly listed, String will be used.

    # A tuple (parentheses) is required to declare all other types. If you prefer to use a tuple
    # for String as well, ('first_name', 'String'), that's fine.

    # If you want to specify either a 'crf' or 'has missing' parameter, a tuple with three elements
    # must be used. ('first_name', 'String', 'crf') works, ('first_name', 'crf') does not work.
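
The rules above can be summarized as a small normalizer that turns each field specification into a Dedupe-style variable definition dictionary. This is a sketch of the rules as described here, not the library's internal code; `to_variable_definition` is a hypothetical name, and the dictionary layout is an assumption.

```python
def to_variable_definition(spec):
    """Normalize a field spec (str or tuple) into a definition dict."""
    if isinstance(spec, str):
        # A bare field name defaults to the String type.
        return {"field": spec, "type": "String"}
    if len(spec) == 2:
        field, type_name = spec
        return {"field": field, "type": type_name}
    if len(spec) == 3:
        field, type_name, option = spec
        if option not in ("crf", "has missing"):
            raise ValueError(f"unknown option: {option!r}")
        return {"field": field, "type": type_name, option: True}
    raise ValueError(f"unsupported field spec: {spec!r}")


definitions = [
    to_variable_definition(spec)
    for spec in ["last_name", ("salary", "Price", "has missing")]
]
```

Under this reading, `('first_name', 'crf')` would be interpreted as a field of type `crf`, which is why the three-element form is required for options.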

### Recall Weight & Sample Size

Within the dedupe_dataframe() function, optional parameters exist for specifying recall_weight and sample_size:

* recall_weight - Ranges from 0 to 2. Setting a recall weight of 2 means we care twice as much about recall as we do about precision.
* sample_size - The sample size used for training, given as a float from 0 to 1. The default is 0.3 (30% of the data).
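
One common way to express "caring more about recall than precision" is the weighted F-score. The sketch below assumes recall_weight plays the role of the weight beta in that score; it is meant to build intuition and is not taken from the Dedupe source.

```python
def weighted_f_score(precision, recall, recall_weight=1.0):
    """F-beta score, with beta = recall_weight (assumed analogy)."""
    beta_sq = recall_weight ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Two hypothetical thresholds with mirrored precision and recall.
high_recall = weighted_f_score(precision=0.5, recall=1.0, recall_weight=2)
high_precision = weighted_f_score(precision=1.0, recall=0.5, recall_weight=2)
```

With a recall weight of 2, the high-recall setting scores higher than the high-precision one, even though the two would tie at a weight of 1.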

# Types

Dedupe supports a variety of datatypes; a full listing with documentation can be found [here](https://docs.dedupe.io/en/latest/Variable-definition.html#).

pandas-dedupe officially supports the following datatypes:

* String - Standard string comparison using a string distance metric. This is the default type.
* Text - Comparison for sentences or paragraphs of text. Uses a cosine similarity metric.
* Price - For comparing positive, nonzero numerical values.
* DateTime - For comparing dates.
* LatLong - (39.990334, 70.012) will not match (40.01, 69.98) using a string distance metric, even though the points are geographically close. The LatLong type resolves this by calculating the haversine distance between the compared coordinates. LatLong requires the field to be in the format (Lat, Lng). The value can be a string, a tuple containing two strings, a tuple containing two floats, or a tuple containing two integers. If the format cannot be processed, you will get a traceback.
* Exact - Tests whether fields are an exact match.
* Exists - Sometimes, the presence or absence of data can be useful in predicting a match. The Exists type tests whether both, one, or neither of the fields are null.
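
To show why LatLong behaves differently from a string comparison, here is a standalone haversine calculation for the two points mentioned above. It is a sketch for intuition only; the Earth radius constant and the `haversine_km` name are choices made here, not part of pandas-dedupe.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


distance = haversine_km((39.990334, 70.012), (40.01, 69.98))
```

The two points are only a few kilometres apart, even though the coordinate strings share few characters.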

Additional supported parameters are:

* has missing - Can be used if one of your data fields contains null values.
* crf - Use conditional random fields for comparisons rather than a distance metric. May be more accurate in some cases, but runs much slower. Works with String and ShortString types.

# Contributors

[Tyler Marrs](http://tylermarrs.com/)

[Tawni Marrs](https://github.com/tawnimarrs)

# Credits

Many thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).
