workflow support for reproducible deduplication and merging

Installation:

pip install mergic

The mergic package provides a command-line script called mergic that uses Python’s built-in difflib.SequenceMatcher.ratio() for calculating string distances. A major strength of mergic is that the distance function is easy to customize via the mergic.Blender class. Making a custom mergic script is as easy as:

#!/usr/bin/env python
# custom_mergic.py
import mergic

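def distance(a, b):  # Any custom distance you want to try! e.g.,
    # the last index at which a and b agree, or 0 if they never agree
    # (a bare max() over no matches would raise ValueError)
    matches = [i for i, (x, y) in enumerate(zip(a, b)) if x == y]
    return max(matches) if matches else 0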

blender = mergic.Blender(distance)
blender.script()

Now custom_mergic.py can be used just like the standard mergic script! You can also use a custom function for generating the keys that values are de-duped to; by default mergic.Blender uses the longest value in each group, taking the first in sorted order when several values tie for longest.
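As a rough illustration of both ideas, here is a minimal sketch using only the standard library. It assumes that a distance can be obtained as one minus SequenceMatcher.ratio() (the text above says only that ratio() is used), and the names sequence_distance and longest_first_key are hypothetical, not part of mergic. How a key function is handed to mergic.Blender is not shown here, since that signature is not given above.

```python
import difflib


def sequence_distance(a, b):
    """Distance in [0, 1]: 0.0 for identical strings, 1.0 for nothing shared."""
    # ratio() is a similarity in [0, 1], so subtract it from 1 (an assumption
    # about how a similarity might be turned into a distance).
    return 1 - difflib.SequenceMatcher(None, a, b).ratio()


def longest_first_key(values):
    """Pick the longest value; among ties, the first in sorted order."""
    # max() returns the first maximal element it sees, so sorting first
    # makes the tie-break deterministic.
    return max(sorted(values), key=len)


print(sequence_distance("mergic", "mergic"))    # 0.0
print(sequence_distance("abc", "xyz"))          # 1.0
print(longest_first_key(["bob", "amy", "al"]))  # amy
```

A function like sequence_distance has the same shape as the distance(a, b) in custom_mergic.py, so it could presumably be passed to mergic.Blender in the same way.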

The distance calculation, cutoff evaluation, and partition creation are currently all in mergic make:

# see all the possible partitions by their statistics
mergic make names_all.txt

# make a partition using a cutoff of 0.303
mergic make 0.303 > partition.json

Edit the partition until it’s good. Save it as partition_edited.json.

You can check that your partition is valid and see a cute summary:

mergic check partition_edited.json
# 669 items in 354 groups
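To make "valid" concrete, here is a minimal sketch of the kind of check involved. It assumes a partition is a JSON object mapping each key to a list of original values, which is an assumption about the file layout rather than something stated above, and check_partition is a hypothetical helper, not mergic's own code.

```python
import json
from collections import Counter


def check_partition(partition):
    """Return (n_items, n_groups); raise ValueError if any item repeats."""
    counts = Counter(item for group in partition.values() for item in group)
    repeated = sorted(item for item, n in counts.items() if n > 1)
    if repeated:
        raise ValueError("items listed more than once: %s" % repeated)
    return sum(counts.values()), len(partition)


# Hypothetical partition text; real mergic output may be laid out differently.
example = json.loads(
    '{"Rafael Nadal": ["Rafael Nadal", "R. Nadal"], "Andy Murray": ["Andy Murray"]}'
)
n_items, n_groups = check_partition(example)
print("%d items in %d groups" % (n_items, n_groups))  # 3 items in 2 groups
```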

You could proceed directly, but there are also diffing tools! Generate a diff:

mergic diff partition.json partition_edited.json > partition_diff.json

You can apply a diff to reconstruct an edited version:

mergic apply partition.json partition_diff.json > partition_rebuilt.json

Now if you mergic diff the files partition_edited.json and partition_rebuilt.json the result should just be {} (no difference).
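The round trip can be sketched in a few lines. This is not mergic's actual diff format or implementation; it is one possible scheme, assuming partitions are dicts mapping keys to non-empty lists of values, with hypothetical function names diff_partitions and apply_diff.

```python
def diff_partitions(original, edited):
    """Groups of `edited` that do not appear unchanged in `original`."""
    return {key: group for key, group in edited.items()
            if sorted(original.get(key, [])) != sorted(group)}


def apply_diff(original, diff):
    """Rebuild the edited partition from `original` plus a diff."""
    moved = {item for group in diff.values() for item in group}
    # Keep the original groups the diff does not touch, then add the
    # changed groups on top.
    rebuilt = {key: group for key, group in original.items()
               if not moved.intersection(group)}
    rebuilt.update(diff)
    return rebuilt


# Hypothetical data: the edit merges "Bob" into the "Robert" group.
original = {"Ann": ["Ann", "Anne"], "Bob": ["Bob"], "Robert": ["Robert"]}
edited = {"Ann": ["Ann", "Anne"], "Robert": ["Robert", "Bob"]}

diff = diff_partitions(original, edited)
rebuilt = apply_diff(original, diff)
print(diff)                            # {'Robert': ['Robert', 'Bob']}
print(diff_partitions(edited, rebuilt))  # {}
```

Storing only the changed groups keeps the diff small and readable, which is the appeal of keeping a diff alongside the original partition.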

To generate a CSV merge table that you’ll be able to use with any other tool:

mergic table partition_edited.json > partition.csv

Now the file partition.csv has two columns, original and mergic, where original contains all the values that appeared in the original data and mergic contains the deduplicated keys. You can join this on to your original data and go to town.
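As one way to do that join without extra tools, here is a minimal sketch using Python's csv module. The two column names original and mergic come from the description above; the sample rows, the name and points columns, and the in-memory strings standing in for files are all made up for illustration.

```python
import csv
import io

# Hypothetical contents of partition.csv (columns: original, mergic).
table_csv = (
    "original,mergic\n"
    "R. Nadal,Rafael Nadal\n"
    "Rafael Nadal,Rafael Nadal\n"
    "A. Murray,Andy Murray\n"
)
# Hypothetical original data with a name column to deduplicate.
data_csv = "name,points\nR. Nadal,10\nRafael Nadal,5\nA. Murray,7\n"

# Build a lookup from each original value to its deduplicated key.
lookup = {row["original"]: row["mergic"]
          for row in csv.DictReader(io.StringIO(table_csv))}

totals = {}
for row in csv.DictReader(io.StringIO(data_csv)):
    key = lookup.get(row["name"], row["name"])  # fall back to the raw value
    totals[key] = totals.get(key, 0) + int(row["points"])

print(totals)  # {'Rafael Nadal': 15, 'Andy Murray': 7}
```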

Distances

Here are some popular distances and how to compute them in Python:

  • Levenshtein edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.

# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance("fuzzy", "wuzzy")
# 1

  • SeatGeek’s fuzzywuzzy: a set of similarity variants, described in a blog post, that people have found to work well in practice. Its results are integer percent similarities; one way to make a distance is to subtract the result from 100.

# pip install fuzzywuzzy
from fuzzywuzzy import fuzz
100 - fuzz.ratio("Levensthein", "Leviathan")
# 50

There are a ton of distances, even just within the two packages mentioned! You can also roll your own! (This is encouraged!)
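As an example of rolling your own, here is a minimal pure-Python sketch of edit distance using the standard two-row dynamic programming approach. The function name levenshtein is just illustrative, and the result is a raw edit count; whether it needs rescaling before use with a mergic cutoff is not addressed here.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("fuzzy", "wuzzy"))  # 1
```

Since it takes two strings and returns a number, a function like this has the same shape as the distance(a, b) in custom_mergic.py.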

Source Distribution

mergic-0.0.2.tar.gz (5.2 kB)
