workflow support for reproducible deduplication and merging

With the mergic.Blender:

The distance calculation, cutoff evaluation, and partition creation are currently all in mergic.py make:

# see all the possible partitions by their statistics
./mergic.py make names_all.txt

# make a partition using a cutoff of 0.303
./mergic.py make 0.303 > partition.json
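To get a feel for what a cutoff does, here is a minimal sketch of one common way to turn pairwise distances into a partition: link any two items whose distance is at or below the cutoff, then take the connected groups (single linkage). This is an illustration under that assumption, not necessarily what mergic.py make does internally; partition_by_cutoff and toy_distance are hypothetical names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_distance(a, b):
    # Hypothetical distance for illustration: 0.0 for identical strings,
    # 1.0 for strings with nothing in common.
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def partition_by_cutoff(items, distance, cutoff):
    """Group items linked by chains of pairs with distance <= cutoff.

    Assumes single-linkage grouping; items must be distinct and hashable.
    """
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if distance(a, b) <= cutoff:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


names = ["apple", "apples", "zzz"]
print(partition_by_cutoff(names, toy_distance, 0.3))
```

A lower cutoff gives more, smaller groups; a higher cutoff merges more aggressively, which is why it helps to look at the statistics for all possible partitions before picking one.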

Edit the partition until it’s good. Save it as partition_edited.json.

You can check that your partition is valid and see a cute summary:

./mergic.py check partition_edited.json
# 669 items in 354 groups
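If you want to run a similar sanity check from your own code, here is a rough sketch. It assumes the partition file is a JSON object mapping each group's key to a list of the original values in that group; that layout is an assumption made for illustration, and summarize_partition is a hypothetical helper, not part of mergic.

```python
import json


def summarize_partition(partition):
    """Check that no item appears in two groups, then return a short summary.

    Assumes `partition` is a dict of {group_key: [original values]}.
    """
    seen = set()
    for key, members in partition.items():
        if not members:
            raise ValueError("group %r is empty" % key)
        for member in members:
            if member in seen:
                raise ValueError("%r appears in more than one group" % member)
            seen.add(member)
    return "%d items in %d groups" % (len(seen), len(partition))


# Hypothetical example data, parsed the way a partition file might be.
example = json.loads(
    '{"Serena Williams": ["S. Williams", "Serena Williams"],'
    ' "Roger Federer": ["Roger Federer"]}'
)
print(summarize_partition(example))
```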

You could proceed directly, but there are also diffing tools! Generate a diff:

./mergic.py diff partition.json partition_edited.json > partition_diff.json

You can apply a diff to reconstruct an edited version:

./mergic.py apply partition.json partition_diff.json > partition_rebuilt.json

Now if you run mergic.py diff on partition_edited.json and partition_rebuilt.json, the result should just be {} (no difference).

To generate a CSV merge table that you’ll be able to use with any other tool:

./mergic.py table partition_edited.json > partition.csv

Now the file partition.csv has two columns, original and mergic, where original contains every value that appeared in the original data and mergic contains the deduplicated key for that value. You can join this onto your original data and go to town.
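As a sketch of that join using only the standard library: the snippet below assumes the CSV starts with a header row naming the two columns original and mergic, and it uses made-up records and an in-memory file in place of your real data and partition.csv.

```python
import csv
import io


def load_merge_table(csv_file):
    """Read a merge table into a dict of {original value: deduplicated key}.

    Assumes a header row with the column names 'original' and 'mergic'.
    """
    reader = csv.DictReader(csv_file)
    return {row["original"]: row["mergic"] for row in reader}


# Stand-in for open("partition.csv"); the rows are invented for illustration.
table_csv = io.StringIO(
    "original,mergic\n"
    "S. Williams,Serena Williams\n"
    "Serena Williams,Serena Williams\n"
)
merge = load_merge_table(table_csv)

# Hypothetical original data that uses the messy names.
records = [
    {"player": "S. Williams", "wins": 3},
    {"player": "Serena Williams", "wins": 5},
]

# Replace each name with its deduplicated key, then aggregate.
totals = {}
for record in records:
    key = merge.get(record["player"], record["player"])
    totals[key] = totals.get(key, 0) + record["wins"]

print(totals)
```

Falling back to the original value when a name is missing from the table keeps unseen records from being dropped silently; raising an error instead is a reasonable alternative if the table is supposed to be complete.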

Distances

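Here are some popular distances and how to compute them in Python:

  • Levenshtein edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another.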

# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance("fuzzy", "wuzzy")
# 1

  • SeatGeek’s fuzzywuzzy: a set of similarity variants, described in a SeatGeek blog post, that people have found to work well in practice. Its results are integer percent similarities; one way to make a distance is to subtract the score from 100.

# pip install fuzzywuzzy
from fuzzywuzzy import fuzz
100 - fuzz.ratio("Levensthein", "Leviathan")
# 50

There are a ton of distances, even just within the two packages mentioned! You can also roll your own! (This is encouraged!)
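As one example of rolling your own, here is a small sketch of a custom distance built on the standard library's difflib. It ignores case and extra whitespace before comparing, which is a design choice made for illustration; normalized_distance is a hypothetical name, and how you hand a function like this to mergic is not shown here.

```python
from difflib import SequenceMatcher


def normalized_distance(a, b):
    """Distance in [0, 1]: 0.0 for a match after normalization, 1.0 for no overlap."""
    # Lowercase and collapse runs of whitespace so trivial differences don't count.
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    return 1.0 - SequenceMatcher(None, a, b).ratio()


print(normalized_distance("Fuzzy", "  fuzzy "))
print(normalized_distance("fuzzy", "wuzzy"))
```

Baking normalization into the distance means fewer near-duplicates end up just above the cutoff, at the cost of treating values that differ only in case as identical.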
