superdeduper

A simple interface to datamade/dedupe to make probabilistic record linkage easy.

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English

Project description

SuperDeduper

SuperDeduper has been renamed to pgdedupe. All subsequent development will occur under the new name.

A work-in-progress that provides a standard interface for deduplicating large databases, with custom pre-processing and post-processing steps.

Free software: MIT license
Documentation: https://superdeduper.readthedocs.io.

Interface

This package provides a simple command-line program, superdeduper. Two configuration files specify the deduplication parameters and the database connection settings. To run deduplication on a generated dataset, first create a database.yml file that specifies the following connection parameters:

user:
password:
database:
host:
port:
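
As an illustration of how these five settings might be consumed, the sketch below reads a flat key: value file and assembles a connection string. It is not superdeduper's own loading code. It assumes a PostgreSQL-style "keyword=value" connection string, assumes the file stays flat (a real YAML parser would normally be used instead), and assumes values contain no spaces. The function names and the example values are hypothetical placeholders.

```python
def parse_flat_config(text):
    """Parse flat `key: value` lines into a dict of strings."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


# database.yml key -> connection-string keyword (assumed PostgreSQL naming);
# only `database` is renamed, to `dbname`.
LIBPQ_KEYWORDS = {
    "user": "user",
    "password": "password",
    "database": "dbname",
    "host": "host",
    "port": "port",
}


def build_dsn(settings):
    """Join the non-empty settings into a 'keyword=value' connection string."""
    return " ".join(
        f"{keyword}={settings[key]}"
        for key, keyword in LIBPQ_KEYWORDS.items()
        if settings.get(key)
    )


# Placeholder values for illustration only.
example = """\
user: dedupe_user
password: example_password
database: dedupe_db
host: localhost
port: 5432
"""
print(build_dsn(parse_flat_config(example)))
```

Keys left empty, as in the template above, are simply omitted from the resulting string.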

You can now create a sample CSV file with:

$ python generate_fake_dataset.py
creating people: 100%|█████████████████████| 9500/9500 [00:21<00:00, 445.38it/s]
adding twins: 100%|█████████████████████████| 500/500 [00:00<00:00, 1854.72it/s]
writing csv:  47%|███████████▋             | 4666/10000 [00:42<00:55, 96.28it/s]

Once complete, store this example dataset in a database with:

$ python test/initialize_db.py
CREATE SCHEMA
DROP TABLE
CREATE TABLE
COPY 197617
ALTER TABLE
ALTER TABLE
UPDATE 197617

Now you can deduplicate this dataset. The following command runs dedupe along with the custom pre-processing and post-processing steps defined in config.yml:

$ superdeduper --config config.yml --db database.yml

Custom pre- and post-processing

In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps to improve the run-time and results, making this a hybrid between fuzzy matching and record linkage.

Pre-processing: Before running dedupe, this script does an exact-match deduplication. Some systems create many identical rows; this can make it challenging for dedupe to create an effective blocking strategy, and it generally makes the fuzzy matching much harder and more time-intensive.
Post-processing: After running dedupe, this script does an optional exact-match merge across subsets of columns. For example, in some instances an exact match on just the last name and social security number is sufficient evidence that two clusters are indeed the same identity.
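
The two steps above can be sketched in plain Python. This is a minimal in-memory illustration under stated assumptions, not the script's actual implementation, which operates at the database level. The function names, the record layout, and the column names (last_name, ssn) are hypothetical; the cluster ids stand in for whatever dedupe assigned.

```python
from collections import defaultdict


def exact_dedupe(rows):
    """Pre-processing sketch: group record ids that have identical field values.

    `rows` maps record id -> tuple of field values. Returns a dict mapping each
    distinct tuple to the list of record ids sharing it, so only one
    representative per group needs to go through fuzzy matching.
    """
    groups = defaultdict(list)
    for record_id, values in rows.items():
        groups[values].append(record_id)
    return dict(groups)


def merge_clusters_on_columns(records, cluster_of, columns):
    """Post-processing sketch: merge clusters whose records agree on `columns`.

    `records` maps record id -> dict of fields, `cluster_of` maps record id ->
    cluster id. Returns a new record id -> cluster id mapping.
    """
    parent = {c: c for c in set(cluster_of.values())}

    def find(c):
        # Follow parent links to the root, shortening the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    first_cluster_for_key = {}
    for record_id, fields in records.items():
        key = tuple(fields.get(col) for col in columns)
        if any(value in (None, "") for value in key):
            continue  # never merge on missing values
        cluster = find(cluster_of[record_id])
        if key in first_cluster_for_key:
            parent[cluster] = find(first_cluster_for_key[key])  # union
        else:
            first_cluster_for_key[key] = cluster
    return {rid: find(c) for rid, c in cluster_of.items()}


# Made-up records: 1 and 2 were left in different clusters by fuzzy matching.
records = {
    1: {"first_name": "John", "last_name": "Smith", "ssn": "000-00-0001"},
    2: {"first_name": "Jon", "last_name": "Smith", "ssn": "000-00-0001"},
    3: {"first_name": "Ann", "last_name": "Jones", "ssn": "000-00-0002"},
}
cluster_of = {1: 0, 2: 1, 3: 2}
merged = merge_clusters_on_columns(records, cluster_of, ["last_name", "ssn"])
```

In this example, records 1 and 2 end up in the same cluster because they match exactly on last name and social security number, while record 3 is left alone. Skipping keys with missing values is a design choice of the sketch: two empty fields should not count as evidence of the same identity.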

Further steps

This script is based on, and extends, the example in dedupe-examples. Possible next steps are to support this common interface across all database types and, potentially, to allow reading from flat CSV files.

History

0.1.0 (2016-12-14)

First release on PyPI.

Release history

0.1.7 (Apr 19, 2017, this version)
0.1.6 (Mar 28, 2017)
0.1.5 (Mar 28, 2017)
0.1.4 (Mar 24, 2017)
0.1.3 (Mar 24, 2017)
0.1.2 (Mar 17, 2017)
0.1.0 (Feb 21, 2017)

Download files

Source Distribution

superdeduper-0.1.7.tar.gz (69.3 kB)

Uploaded Apr 19, 2017 Source

Built Distribution

superdeduper-0.1.7-py2.py3-none-any.whl (13.1 kB)

Uploaded Apr 19, 2017 Python 2, Python 3

File details

Details for the file superdeduper-0.1.7.tar.gz.

File metadata

Download URL: superdeduper-0.1.7.tar.gz
Upload date: Apr 19, 2017
Size: 69.3 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for superdeduper-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`195ab4c86d28c3410d079465769e18e991c38eba66b802d192bd1a6143bfd79c`
MD5	`20f7761d040cadb290788b90c71964b2`
BLAKE2b-256	`a8e93172347a56814232ca52e66d57e553133cffbe2f17c26c938e6d8c222415`

File details

Details for the file superdeduper-0.1.7-py2.py3-none-any.whl.

File metadata

Download URL: superdeduper-0.1.7-py2.py3-none-any.whl
Upload date: Apr 19, 2017
Size: 13.1 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for superdeduper-0.1.7-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`9e644770da3b48d1002df94593422f66edb41ea50dc7b5f4a8b275f57177d34f`
MD5	`8fe4dc0610be89c4151b3a86ce1f14f8`
BLAKE2b-256	`ad2fa57f3c6ee78f8bdad320b658e5748ad75ddc2c5e1873564d7b46c12bf391`
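
A downloaded file can be checked against the SHA256 digests listed above. The sketch below uses only Python's standard hashlib module; the helper names are hypothetical, and it assumes the file has already been downloaded to a local path.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example call, assuming the sdist sits in the current directory:
# matches_published_hash(
#     "superdeduper-0.1.7.tar.gz",
#     "195ab4c86d28c3410d079465769e18e991c38eba66b802d192bd1a6143bfd79c",
# )
```

A result of False means the local file differs from the one the digest was computed for, so it should not be installed.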
