
A simple interface to datamade/dedupe to make probabilistic record linkage easy.


ssdedupe

This is a fork of dssg/pgdedupe. It is now a separate repo for the Microsoft SQL Server implementation (see PR#40).

This package is for working with Microsoft SQL Server. I will be slowly removing support for PostgreSQL; please use pgdedupe when working with PostgreSQL.

A work-in-progress to provide a standard interface for deduplication of large databases with custom pre-processing and post-processing steps.


This provides a simple command-line program, ssdedupe. Two configuration files specify the deduplication parameters and database connection settings. To run deduplication on a generated dataset, create a database.yml file that specifies the following parameters:


To connect to Microsoft SQL Server, an additional parameter, type: mssql, needs to be added to the database.yml file.
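As a rough illustration of how a loader could branch on that parameter, here is a minimal sketch. It is not ssdedupe's own code: the function name backend_for is hypothetical, the settings are assumed to be already parsed from database.yml into a dict, and falling back to PostgreSQL when type is absent is an assumption based on the pgdedupe heritage.

```python
# Hypothetical sketch; not taken from the ssdedupe source.
def backend_for(settings):
    """Return the backend name implied by parsed database.yml settings."""
    # Assumption: a missing `type` key means the PostgreSQL behaviour
    # inherited from pgdedupe. Only `type: mssql` is named in this README.
    db_type = str(settings.get("type", "postgresql")).lower()
    if db_type not in ("postgresql", "mssql"):
        raise ValueError(f"unsupported database type: {db_type!r}")
    return db_type
```

Rejecting unknown values early keeps a typo in database.yml from silently selecting the wrong backend.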

You can now create a sample CSV file with:

$ python --csv people.csv
creating people: 100%|█████████████████████| 9500/9500 [00:21<00:00, 445.38it/s]
adding twins: 100%|█████████████████████████| 500/500 [00:00<00:00, 1854.72it/s]
writing csv:  47%|███████████▋             | 4666/10000 [00:42<00:55, 96.28it/s]

Once complete, store this example dataset in a database with:

$ python test/ --db database.yml --csv people.csv
COPY 197617
UPDATE 197617

Now you can deduplicate this dataset. This will run dedupe as well as the custom pre-processing and post-processing steps as defined in config.yml:

$ ssdedupe --config config.yml --db database.yml

Custom pre- and post-processing

In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps to improve the run-time and results, making this a hybrid between fuzzy matching and record linkage.

  • Pre-processing: Before running dedupe, this script does an exact-match deduplication. Some systems create many identical rows; this can make it challenging for dedupe to create an effective blocking strategy and generally makes the fuzzy matching much harder and more time-intensive.
  • Post-processing: After running dedupe, this script does an optional exact-match merge across subsets of columns. For example, in some instances an exact match on just the last name and social security number is sufficient evidence that two clusters are indeed the same identity.
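The two steps above can be sketched in plain Python. This is an illustrative, in-memory approximation under stated assumptions, not the script's actual implementation (which works against the database): the function names, the id field, and the cluster labels are all hypothetical, and the fuzzy-matching step in between is represented only by a hand-written cluster_of mapping.

```python
def exact_dedupe(rows, columns):
    """Pre-processing sketch: collapse rows that are identical on `columns`.

    Returns the kept rows plus, for each kept row, the ids it stands for.
    """
    position = {}   # key tuple -> index into `unique`
    unique, groups = [], []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in position:
            position[key] = len(unique)
            unique.append(row)
            groups.append([])
        groups[position[key]].append(row["id"])
    return unique, groups


def merge_clusters(rows, cluster_of, merge_columns):
    """Post-processing sketch: merge clusters sharing exact values on
    `merge_columns`, using a small union-find over cluster labels."""
    parent = {c: c for c in set(cluster_of.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    seen = {}  # key tuple -> a cluster label already holding that key
    for row in rows:
        key = tuple(row[c] for c in merge_columns)
        if any(value is None for value in key):
            continue  # a missing value is not evidence of identity
        cluster = find(cluster_of[row["id"]])
        if key in seen:
            parent[find(seen[key])] = cluster
        else:
            seen[key] = cluster
    return {row_id: find(c) for row_id, c in cluster_of.items()}


# Made-up example data.
rows = [
    {"id": 1, "first": "Ann", "last": "Lee", "ssn": "111"},
    {"id": 2, "first": "Ann", "last": "Lee", "ssn": "111"},
    {"id": 3, "first": "Anne", "last": "Lee", "ssn": "111"},
    {"id": 4, "first": "Bob", "last": "Ray", "ssn": "222"},
]
unique, groups = exact_dedupe(rows, ["first", "last", "ssn"])
# Stand-in for the fuzzy-matching output: one cluster label per kept row.
cluster_of = {1: "a", 3: "b", 4: "c"}
merged = merge_clusters(unique, cluster_of, ["last", "ssn"])
```

Here rows 1 and 2 collapse before fuzzy matching, and the clusters holding rows 1 and 3 are merged afterwards because they agree exactly on last name and social security number.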

Further steps

This script was based upon and extended from the example in dedupe-examples. It would be nice to use this common interface across all database types, and potentially even allow reading from flat CSV files.


0.2.1 (2017-05-03)

  • Make command line arguments required, resulting in better error messages.
  • Refactored testing scripts to be more user-friendly.

0.2.0 (2017-04-19)

  • First release on PyPI (as pgdedupe).

0.1.0 (2016-12-14)

  • First release on PyPI (as superdeduper).


Files for ssdedupe, version 0.0.3

  • ssdedupe-0.0.3.tar.gz (301.7 kB), source distribution
