workflow support for reproducible deduplication and merging
Installation:
pip install mergic
The mergic package provides a command-line script called mergic that uses Python’s built-in difflib.SequenceMatcher.ratio() for calculating string distances, but a major strength of mergic is that it enables easy customization of the distance function via the mergic.Blender class. Making a custom mergic script is as easy as:
#!/usr/bin/env python
# custom_mergic.py
import mergic
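def distance(a, b):  # Any custom distance you want to try! e.g.,
    # last index at which the two strings agree (0 if they never agree)
    matches = [i for i, (x, y) in enumerate(zip(a, b)) if x == y]
    return max(matches) if matches else 0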
blender = mergic.Blender(distance)
blender.script()
Now custom_mergic.py can be used just like the standard mergic script! You can also use a custom function for generating the keys that values are de-duped to; by default mergic.Blender uses the longest of a group’s values, taking the first in sorted order when several tie for longest.
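The default key rule can be sketched in plain Python. The helper name first_longest is hypothetical, and the way a key function is handed to mergic.Blender is not shown because this page does not give that signature.

```python
def first_longest(values):
    """Pick a group's key: the longest value, first in sorted order on ties."""
    # max() returns the first maximal element it meets, so sorting first
    # makes ties resolve to the earliest value in sorted order.
    return max(sorted(values), key=len)


group = ["Bob Smith", "Smith, Bob", "B. Smith", "Bob Smyth"]
print(first_longest(group))
```

A different rule (shortest value, most frequent value, and so on) would only need a different function body.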
The distance calculation, cutoff evaluation, and partition creation are currently all in mergic make:
# see all the possible partitions by their statistics
mergic make names_all.txt
# make a partition using a cutoff of 0.303
mergic make 0.303 > partition.json
Edit the partition until it’s good. Save it as partition_edited.json.
You can check that your partition is valid and see a cute summary:
mergic check partition_edited.json
# 669 items in 354 groups
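For intuition, a validity check of this kind might look like the sketch below. It assumes a partition is a JSON object mapping each key to the list of original values merged into it; that layout and the function name summarize_partition are assumptions for illustration, not mergic's documented format.

```python
import json
from collections import Counter


def summarize_partition(partition):
    """Return (item_count, group_count), raising if any item is repeated."""
    counts = Counter(item for values in partition.values() for item in values)
    repeated = sorted(item for item, n in counts.items() if n > 1)
    if repeated:
        raise ValueError("items in more than one group: %s" % repeated)
    return len(counts), len(partition)


# Toy data standing in for partition_edited.json (hypothetical contents).
text = '{"Robert Smith": ["Bob Smith", "Robert Smith"], "Ann Lee": ["Ann Lee"]}'
items, groups = summarize_partition(json.loads(text))
print("%d items in %d groups" % (items, groups))
```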
You could proceed directly, but there are also diffing tools! Generate a diff:
mergic diff partition.json partition_edited.json > partition_diff.json
You can apply a diff to reconstruct an edited version:
mergic apply partition.json partition_diff.json > partition_rebuilt.json
Now if you mergic diff the files partition_edited.json and partition_rebuilt.json the result should just be {} (no difference).
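The round trip can be illustrated with a toy diff and apply written from scratch. This is a sketch of the idea only: it assumes partitions are dicts mapping a key to a list of values and that both partitions cover the same items. The function names and the diff layout are hypothetical and may not match what mergic actually writes to partition_diff.json.

```python
def partition_diff(original, edited):
    """Keep only the groups of `edited` that differ from `original`."""
    return {key: values for key, values in edited.items()
            if original.get(key) != values}


def partition_apply(original, changes):
    """Rebuild the edited partition from `original` plus a diff."""
    moved = {item for values in changes.values() for item in values}
    rebuilt = {}
    for key, values in original.items():
        kept = [item for item in values if item not in moved]
        if kept:  # groups emptied by the edit disappear
            rebuilt[key] = kept
    rebuilt.update(changes)
    return rebuilt


original = {
    "Bob Smith": ["Bob Smith", "Bob Smyth"],
    "Ann Lee": ["Ann Lee"],
    "Anne Lee": ["Anne Lee"],
}
edited = {
    "Bob Smith": ["Bob Smith", "Bob Smyth"],
    "Anne Lee": ["Ann Lee", "Anne Lee"],
}
changes = partition_diff(original, edited)
rebuilt = partition_apply(original, changes)
print(partition_diff(edited, rebuilt))
```

Storing only the changed groups keeps the diff small and reviewable, which is the point of saving it alongside the machine-generated partition.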
To generate a CSV merge table that you’ll be able to use with any other tool:
mergic table partition_edited.json > partition.csv
Now the file partition.csv has two columns, original and mergic, where original contains all the values that appeared in the original data and mergic contains the deduplicated keys. You can join this on to your original data and go to town.
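As a minimal sketch of that join using only the standard library: the column names original and mergic come from the description above, while the sample rows and the records list are made up for illustration.

```python
import csv
import io

# Stand-in for partition.csv (hypothetical rows).
table_csv = io.StringIO(
    "original,mergic\n"
    "Bob Smith,Robert Smith\n"
    "Robert Smith,Robert Smith\n"
    "Ann Lee,Ann Lee\n"
)
lookup = {row["original"]: row["mergic"] for row in csv.DictReader(table_csv)}

# Hypothetical original records: (name, amount).
records = [("Bob Smith", 3), ("Robert Smith", 5), ("Ann Lee", 2)]

# Aggregate amounts under the deduplicated key.
totals = {}
for name, amount in records:
    key = lookup[name]
    totals[key] = totals.get(key, 0) + amount
print(totals)
```

With a real file, replacing the io.StringIO object with open("partition.csv", newline="") should be the only change needed.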
Distances
Here are some popular distances and how to do them with Python:
Levenshtein string edit distance: The classic! It has many implementations; one of them is python-Levenshtein.
# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance("fuzzy", "wuzzy")
# 1
SeatGeek’s fuzzywuzzy: As described in a blog post, this package implements several similarity variants that people have found to work well in practice. Its results are integer percent similarities; one way to turn one into a distance is to subtract it from 100.
# pip install fuzzywuzzy
from fuzzywuzzy import fuzz
100 - fuzz.ratio("Levensthein", "Leviathan")
# 50
There are a ton of distances, even just within the two packages mentioned! You can also roll your own! (This is encouraged!)
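Here is a sketch of two dependency-free distances to start from. ratio_distance turns difflib's similarity ratio into a distance by subtracting it from 1; whether mergic's default does exactly this is an assumption. bigram_distance is a Jaccard distance over character bigrams. Both function names are made up for this example.

```python
from difflib import SequenceMatcher


def ratio_distance(a, b):
    """1 minus difflib's similarity ratio: 0.0 for identical strings."""
    return 1 - SequenceMatcher(None, a, b).ratio()


def bigram_distance(a, b):
    """Jaccard distance between the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    if not union:  # both strings are shorter than two characters
        return 0.0 if a == b else 1.0
    return 1 - len(grams_a & grams_b) / len(union)


for pair in [("fuzzy", "wuzzy"), ("fuzzy", "fuzzy")]:
    print(pair, ratio_distance(*pair), bigram_distance(*pair))
```

Either function takes two strings and returns a number, so it can be passed to mergic.Blender in place of distance in the custom_mergic.py script.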