The Dedupe library made easy with Pandas.

pandas-dedupe

Installation

pip install pandas-dedupe

Video Tutorials

Basic Deduplication

Basic Usage

A training file and a settings file will be created while running Dedupe. Keeping these files will eliminate the need to retrain your model in the future.

If you would like to retrain your model from scratch, just delete the settings and training files.
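Retraining from scratch comes down to removing those two files before the next run. The sketch below assumes the files are named `dedupe_dataframe_learned_settings` and `dedupe_dataframe_training.json` and live in the working directory; both names, and the `reset_model` helper itself, are assumptions for illustration, so check which files your own run actually created.

```python
from pathlib import Path

# Assumed file names; replace them with the files your run actually wrote.
MODEL_FILES = (
    "dedupe_dataframe_learned_settings",
    "dedupe_dataframe_training.json",
)


def reset_model(directory=".", filenames=MODEL_FILES):
    """Delete saved settings/training files and return the names removed."""
    removed = []
    for name in filenames:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling `reset_model()` before `dedupe_dataframe` would then force a fresh training session; files that do not exist are simply skipped.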

Deduplication (dedupe_dataframe)

dedupe_dataframe deduplicates a single dataset in which multiple records may refer to the same entity.

import pandas as pd
import pandas_dedupe

#load dataframe
df = pd.read_csv('test_names.csv')

#initiate deduplication
df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])

#send output to csv
df_final.to_csv('deduplication_output.csv')
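Once records have been grouped, a common follow-up is to keep one row per cluster. The sketch below runs on a hand-written stand-in for the output rather than on a real call to the library, and it assumes the result carries `cluster id` and `confidence` columns; treat those column names as assumptions and adjust them to match your actual output.

```python
import pandas as pd

# Hand-written stand-in for the output of dedupe_dataframe.
# The 'cluster id' and 'confidence' column names are assumptions.
df_final = pd.DataFrame({
    "first_name": ["Jon", "John", "Mary", "Marie", "Alex"],
    "last_name": ["Smith", "Smith", "Jones", "Jones", "Lee"],
    "cluster id": [0, 0, 1, 1, 2],
    "confidence": [0.91, 0.97, 0.88, 0.80, 1.00],
})

# For each cluster, find the index label of the highest-confidence row.
best_index = df_final.groupby("cluster id")["confidence"].idxmax()

# Keep only those rows, giving one representative record per cluster.
deduplicated = df_final.loc[best_index].reset_index(drop=True)
print(deduplicated[["first_name", "last_name", "cluster id"]])
```

Picking the highest-confidence row is only one possible rule; the first row or the most complete row per cluster would work the same way.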

Gazetteer deduplication (gazetteer_dataframe)

gazetteer_dataframe matches a messy dataset against a 'canonical dataset' (i.e. the gazette).

import pandas as pd
import pandas_dedupe

#load dataframe
df_clean = pd.read_csv('gazette.csv')
df_messy = pd.read_csv('test_names.csv')

#initiate deduplication
df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'fullname', canonicalize=True)

#send output to csv
df_final.to_csv('gazetteer_deduplication_output.csv')

Matching / Record Linkage

Use identical field names when linking dataframes. Record linkage should only be used on dataframes that have been deduplicated.

import pandas as pd
import pandas_dedupe

#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')

#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])

#send output to csv
df_final.to_csv('linkage_output.csv')
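Because the field names must be identical in both dataframes, it can help to align and check the columns before linking. The following is a minimal sketch using plain pandas; the column names and the `column_map` mapping are hypothetical.

```python
import pandas as pd

dfa = pd.DataFrame({"first_name": ["Ann"], "last_name": ["Lee"]})
dfb = pd.DataFrame({"fname": ["Anne"], "lname": ["Lee"]})

# Hypothetical mapping from dfb's column names to the names used in dfa.
column_map = {"fname": "first_name", "lname": "last_name"}
dfb_aligned = dfb.rename(columns=column_map)

# Confirm every field to be linked on exists in both dataframes.
fields = ["first_name", "last_name"]
missing = [
    field for field in fields
    if field not in dfa.columns or field not in dfb_aligned.columns
]
if missing:
    raise ValueError(f"Fields missing from one of the dataframes: {missing}")
```

After this check, `dfa`, `dfb_aligned`, and `fields` could be passed to `link_dataframes` in the same way as in the example above.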

Advanced Usage

Canonicalize Fields

The canonicalize parameter will standardize names in a given cluster. Original fields are also kept.

pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)

Update Threshold (dedupe_dataframe and gazetteer_dataframe only)

Group records into clusters only if the cophenetic similarity of the cluster is greater than the threshold.

pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], threshold=.7)

Update Existing Model (dedupe_dataframe and gazetteer_dataframe only)

Setting update_model=True allows the user to update the existing model.

pandas_dedupe.dedupe_dataframe(df, ['first_name', 'last_name'], update_model=True)

Recall Weight & Sample Size

The dedupe_dataframe() function has two optional parameters specifying recall_weight and sample_size:

recall_weight - Ranges from 0 to 2. When set to 2, recall counts twice as much as precision.
sample_size - Specifies the sample size used for training as a float from 0 to 1. The default is 0.3 (30% of the data).
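One common way to read a recall weight is as the β in an F-beta score, where β = 2 means recall matters twice as much as precision. Whether pandas-dedupe applies exactly this formula internally is an assumption here; the sketch only illustrates how such a weight shifts the trade-off, using two made-up operating points.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Two hypothetical (precision, recall) operating points.
high_precision = (0.95, 0.60)
high_recall = (0.60, 0.95)

for weight in (1.0, 2.0):
    print(
        f"recall weight {weight}: "
        f"high-precision={f_beta(*high_precision, beta=weight):.3f}, "
        f"high-recall={f_beta(*high_recall, beta=weight):.3f}"
    )
```

With a weight of 1 the two points score the same; with a weight of 2 the high-recall point scores higher, which is the behaviour the parameter description suggests.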

Specifying Types

If you'd like to specify dates, spatial data, etc., do so here. The structure must be: ('field', 'type', 'additional_parameter'). The additional_parameter element can be omitted. The default type is String.

See the full list of types below.

# Price Example
pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])

# has missing Example (link_dataframes takes two dataframes)
pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])

# crf Example
pandas_dedupe.dedupe_dataframe(df, [('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
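To make the three accepted shapes concrete (a bare field name, a two-item tuple, and a three-item tuple), the sketch below expands each one into a dictionary. The `normalize_field` helper is hypothetical and is not the library's actual parsing code; it only mirrors the structure described above.

```python
def normalize_field(spec):
    """Expand a field spec (str or tuple) into a dict describing the field."""
    # A bare string is a field name with the default type.
    if isinstance(spec, str):
        return {"field": spec, "type": "String"}

    if not isinstance(spec, tuple) or not 2 <= len(spec) <= 3:
        raise ValueError(f"Unsupported field spec: {spec!r}")

    definition = {"field": spec[0], "type": spec[1]}

    # The optional third element switches on one extra behaviour.
    if len(spec) == 3:
        extra = spec[2]
        if extra not in ("has missing", "crf"):
            raise ValueError(f"Unsupported additional parameter: {extra!r}")
        definition[extra] = True

    return definition


fields = ["last_name", ("salary", "Price", "has missing"), ("m_initial", "Exact")]
for spec in fields:
    print(normalize_field(spec))
```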

Types

Dedupe supports a variety of datatypes; a full list with documentation can be found here.

pandas-dedupe officially supports the following datatypes:

String - Standard string comparison using string distance metric. This is the default type.
Text - Comparison for sentences or paragraphs of text. Uses cosine similarity metric.
Price - For comparing positive, nonzero numerical values.
DateTime - For comparing dates.
LatLong - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance metric, even though the points are geographically close. The LatLong type resolves this by calculating the haversine distance between the compared coordinates. LatLong requires the field to be in the format (Lat, Long). The value can be a string, a tuple containing two strings, a tuple containing two floats, or a tuple containing two integers. If the format cannot be processed, you will get a traceback.
Exact - Tests whether fields are an exact match.
Exists - Sometimes, the presence or absence of data can be useful in predicting a match. The Exists type tests whether both, one, or neither of the fields are null.
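The haversine distance mentioned for LatLong can be sketched with the standard library alone. This is the textbook formula applied to the two example points from the list above, not the code the library runs; the `haversine_km` name and the mean Earth radius of 6371 km are choices made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(point_a, point_b):
    """Great-circle distance in kilometres between two (lat, long) pairs."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


distance = haversine_km((39.990334, 70.012), (40.01, 69.98))
print(f"{distance:.1f} km apart")
```

The two points come out only a few kilometres apart, even though their string representations share few characters, which is why a distance-based comparison suits coordinate fields better than a string metric.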

Additional supported parameters are:

has missing - Can be used if one of your data fields contains null values.
crf - Use conditional random fields for comparisons rather than distance metric. May be more accurate in some cases, but runs much slower. Works with String and ShortString types.

Contributors

Tyler Marrs - Refactored code, added docstrings, added threshold parameter

Tawni Marrs - Refactored code, added docstrings

ieriii - Added update_model parameter, updated codebase to use Dedupe 2.0, added support for multiprocessing, added gazetteer_dataframe.

Daniel Marczin - Extensive updates to documentation to enhance readability.

Credits

Many thanks to the folks at DataMade for making the Dedupe library publicly available. People interested in a code-free implementation of the Dedupe library can find a link here: Dedupe.io.

Project details

Release history

1.5.0 (this version) - Jul 21, 2021
1.4.0 - Oct 31, 2020
1.3.1 - Jun 5, 2020
1.1.1 - Mar 24, 2020
1.0.0 - Nov 15, 2019
0.42 - Mar 25, 2019
0.31 - Feb 20, 2019
0.24 - Dec 21, 2018
0.22 - Dec 11, 2018
0.21 - Dec 9, 2018
0.2 - Dec 7, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandas_dedupe-1.5.0.tar.gz (11.3 kB view details)

Uploaded Jul 21, 2021 Source

Built Distribution

pandas_dedupe-1.5.0-py3-none-any.whl (12.4 kB view details)

Uploaded Jul 21, 2021 Python 3

File details

Details for the file pandas_dedupe-1.5.0.tar.gz.

File metadata

Download URL: pandas_dedupe-1.5.0.tar.gz
Upload date: Jul 21, 2021
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.2

File hashes

Hashes for pandas_dedupe-1.5.0.tar.gz
Algorithm	Hash digest
SHA256	`750a4198c958462a1469d0967bb54817df45f976a1fe9df2252bb182d6a10c29`
MD5	`c4851fa65ec0cffd358726fd64a2e40a`
BLAKE2b-256	`451ff24ba1dbb5ff59f07dc8829c9d80c3ff9d1d4367f21d7d482243a92f3f4e`

File details

Details for the file pandas_dedupe-1.5.0-py3-none-any.whl.

File metadata

Download URL: pandas_dedupe-1.5.0-py3-none-any.whl
Upload date: Jul 21, 2021
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.2

File hashes

Hashes for pandas_dedupe-1.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ea36afe7c5db1dd4c5c5402505a8216ae0aefe96f12b60a2360eee185748aaba`
MD5	`e844ee413331eacd308ae25d7a6ada5c`
BLAKE2b-256	`abb7aa64ace5729a9de74a155cd97366f7115fe7e075e8f3a6ba07f94ca432d2`

