workflow support for reproducible deduplication and merging
With the mergic.Blender:
Distance calculation, cutoff evaluation, and partition creation are currently all handled by mergic.py make:
# see all the possible partitions by their statistics
./mergic.py make names_all.txt
# make a partition using a cutoff of 0.303
./mergic.py make 0.303 > partition.json
Edit the partition until it’s good. Save it as partition_edited.json.
You can check that your partition is valid and see a cute summary:
./mergic.py check partition_edited.json
# 669 items in 354 groups
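As a rough illustration of what a validity check like this might involve, here is a minimal sketch. It assumes a partition is a JSON object mapping each group key to a list of original values; that layout is an assumption for illustration, not something confirmed above. The function summarize_partition is hypothetical and is not part of mergic.

```python
from collections import Counter


def summarize_partition(partition):
    """Return (n_items, n_groups); raise ValueError if an item is repeated.

    ASSUMPTION: `partition` maps each group key to a list of original
    values. This layout is a guess made for illustration only.
    """
    # Count how many times each item appears across all groups.
    counts = Counter(
        item for members in partition.values() for item in members
    )
    repeated = sorted(item for item, n in counts.items() if n > 1)
    if repeated:
        raise ValueError(f"items in more than one group: {repeated}")
    return len(counts), len(partition)


# Hypothetical example data.
example = {
    "Rafael Nadal": ["Rafael Nadal", "R. Nadal"],
    "Andy Murray": ["Andy Murray"],
}
n_items, n_groups = summarize_partition(example)
print(f"{n_items} items in {n_groups} groups")
# 3 items in 2 groups
```

The idea is simply that every item should land in exactly one group, so a hand-edited file that accidentally lists a value twice gets caught before you build a merge table from it.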
You could proceed directly, but there are also diffing tools! Generate a diff:
./mergic.py diff partition.json partition_edited.json > partition_diff.json
You can apply a diff to reconstruct an edited version:
./mergic.py apply partition.json partition_diff.json > partition_rebuilt.json
Now if you run mergic.py diff on partition_edited.json and partition_rebuilt.json, the result should be just {} (no difference).
To generate a CSV merge table that you’ll be able to use with any other tool:
./mergic.py table partition_edited.json > partition.csv
Now the file partition.csv has two columns, original and mergic: original contains every value that appeared in the original data, and mergic contains the deduplicated keys. You can join this onto your original data and go to town.
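One way to do that join with only the standard library is sketched below. It assumes the CSV starts with a header row naming the two columns original and mergic. The helper names, the player field, and the sample rows are all hypothetical.

```python
import csv
import io


def load_merge_table(csv_file):
    """Read a merge table into a dict of original value -> deduplicated key.

    ASSUMPTION: the file has a header row with columns `original`, `mergic`.
    """
    reader = csv.DictReader(csv_file)
    return {row["original"]: row["mergic"] for row in reader}


def add_merged_key(rows, field, merge_table):
    """Yield copies of `rows` with a `mergic` column looked up from `field`.

    Values missing from the merge table fall back to themselves.
    """
    for row in rows:
        merged = dict(row)
        merged["mergic"] = merge_table.get(row[field], row[field])
        yield merged


# Hypothetical stand-in for partition.csv; in practice, open the real file.
table_csv = io.StringIO(
    "original,mergic\n"
    "R. Nadal,Rafael Nadal\n"
    "Rafael Nadal,Rafael Nadal\n"
)
merge_table = load_merge_table(table_csv)

# Hypothetical original data keyed by a messy `player` column.
data = [
    {"player": "R. Nadal", "wins": "3"},
    {"player": "Rafael Nadal", "wins": "5"},
]
joined = list(add_merged_key(data, "player", merge_table))
for row in joined:
    print(row["player"], "->", row["mergic"])
```

Because the table is plain CSV, the same join works just as well in a spreadsheet, a database, or a data frame library.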
Distances
Here are some popular distances and how to do them with Python:
Levenshtein string edit distance: The classic! It has many implementations; one of them is python-Levenshtein.
# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance("fuzzy", "wuzzy")
# 1
SeatGeek’s fuzzywuzzy: As described in a blog post, this package offers some distance variants that people have found to work well in practice. Its responses are phrased as integer percent similarities; one way to make a distance is to subtract the similarity from 100.
# pip install fuzzywuzzy
from fuzzywuzzy import fuzz
100 - fuzz.ratio("Levensthein", "Leviathan")
# 50
There are a ton of distances, even just within the two packages mentioned! You can also roll your own! (This is encouraged!)
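For a sense of what rolling your own can look like, here is a small sketch of a custom distance built on the standard library's difflib. The function name normalized_distance and the choice to ignore case and extra whitespace are assumptions made for illustration, and how a custom distance gets handed to mergic is not covered here.

```python
from difflib import SequenceMatcher


def normalized_distance(a, b):
    """Return a distance in [0, 1] that ignores case and extra whitespace.

    Hypothetical example of a hand-rolled distance: 0.0 means the
    normalized strings are identical, values near 1.0 mean they share
    almost nothing.
    """
    # Lowercase and collapse runs of whitespace before comparing.
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    # SequenceMatcher.ratio() is a similarity in [0, 1]; flip it.
    return 1.0 - SequenceMatcher(None, a_norm, b_norm).ratio()


print(normalized_distance("Fuzzy  Wuzzy", "fuzzy wuzzy"))
# 0.0
```

The normalization step is where domain knowledge pays off: stripping titles, punctuation, or known abbreviations before comparing often matters more than the choice of underlying string metric.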