A simple interface to datamade/dedupe to make probabilistic record linkage easy.

SuperDeduper

SuperDeduper has been renamed to pgdedupe. All subsequent development will occur under the new name.

SuperDeduper is a work in progress that aims to provide a standard interface for deduplicating large databases, with custom pre-processing and post-processing steps.

Interface

SuperDeduper provides a simple command-line program, superdeduper. Two configuration files control it: one specifies the deduplication parameters and the other specifies the database connection settings. To run deduplication on a generated dataset, first create a database.yml file that specifies the following parameters:

user:
password:
database:
host:
port:
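
Before running the tool, it can help to confirm that every connection setting in database.yml has a value. The sketch below is not part of superdeduper: the helper names are hypothetical, and it assumes the file contains only flat `key: value` lines so that it can rely on the standard library alone. A real script would more likely read the file with a YAML library.

```python
# Hypothetical helpers for sanity-checking database.yml; not superdeduper's API.
REQUIRED_KEYS = ("user", "password", "database", "host", "port")


def parse_flat_yaml(text):
    """Parse flat `key: value` lines (an assumed, simplified file layout)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and whole-line comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def missing_db_settings(settings):
    """Return the required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]
```

Calling `missing_db_settings(parse_flat_yaml(open("database.yml").read()))` returns an empty list when all five parameters are filled in, and the names of the unfilled parameters otherwise.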

You can now create a sample CSV file with:

$ python generate_fake_dataset.py
creating people: 100%|█████████████████████| 9500/9500 [00:21<00:00, 445.38it/s]
adding twins: 100%|█████████████████████████| 500/500 [00:00<00:00, 1854.72it/s]
writing csv:  47%|███████████▋             | 4666/10000 [00:42<00:55, 96.28it/s]

Once complete, store this example dataset in a database with:

$ python test/initialize_db.py
CREATE SCHEMA
DROP TABLE
CREATE TABLE
COPY 197617
ALTER TABLE
ALTER TABLE
UPDATE 197617

Now you can deduplicate this dataset. The following command runs dedupe along with the custom pre-processing and post-processing steps defined in config.yml:

$ superdeduper --config config.yml --db database.yml

Custom pre- and post-processing

In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps that improve both run time and results. This makes it a hybrid of fuzzy matching and record linkage.

  • Pre-processing: Before running dedupe, this script performs an exact-match deduplication. Some systems create many identical rows, which makes it hard for dedupe to build an effective blocking strategy and generally makes the fuzzy matching much harder and more time-intensive.

  • Post-processing: After running dedupe, this script performs an optional exact-match merge across subsets of columns. For example, in some instances an exact match on just the last name and social security number is sufficient evidence that two clusters are indeed the same identity.
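
The two exact-match steps can be sketched in plain Python. This is a minimal illustration and not the project's implementation, which works against database tables. The function names, the use of dicts as records, integer cluster ids, and the column names `last_name` and `ssn` are all assumptions made for the example.

```python
from collections import defaultdict


def exact_dedupe(records, columns):
    """Pre-processing sketch: collapse rows that are identical on `columns`.

    Returns the unique rows and, for each unique row, the indices of the
    original rows it stands for.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec[c] for c in columns)
        groups[key].append(idx)
    unique = [dict(zip(columns, key)) for key in groups]
    members = list(groups.values())
    return unique, members


def merge_clusters(records, cluster_ids, columns):
    """Post-processing sketch: merge clusters that match exactly on `columns`.

    `cluster_ids[i]` is the cluster assigned to `records[i]` by the fuzzy
    matching step. Clusters are joined with a small union-find structure,
    and each merged cluster takes the smallest id among its members.
    """
    parent = {cid: cid for cid in set(cluster_ids)}

    def find(cid):
        # Follow parent links to the root, shortening the path as we go.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    first_seen = {}
    for rec, cid in zip(records, cluster_ids):
        key = tuple(rec[c] for c in columns)
        if any(value is None for value in key):
            continue  # a missing value is not treated as evidence of a match
        if key in first_seen:
            root_a, root_b = find(first_seen[key]), find(cid)
            if root_a != root_b:
                parent[max(root_a, root_b)] = min(root_a, root_b)
        else:
            first_seen[key] = cid
    return [find(cid) for cid in cluster_ids]
```

In this sketch, `exact_dedupe` would run first to shrink the input handed to dedupe, and `merge_clusters` would run on dedupe's output with a column subset such as `["last_name", "ssn"]`.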

Further steps

This script is based on, and extends, the example in dedupe-examples. A future goal is to offer this common interface across all database types, and potentially to allow reading from flat CSV files as well.

History

0.1.0 (2016-12-14)

  • First release on PyPI.
