workflow support for reproducible deduplication and merging

Project description

Installation:

pip install mergic

The mergic package provides a command-line script called mergic. By default it calculates string distances with Python’s built-in difflib.SequenceMatcher.ratio(), but a major strength of mergic is that the distance function is easy to customize via the mergic.Blender class. Making a custom mergic script is as easy as:

#!/usr/bin/env python
# custom_mergic.py
import mergic

def distance(a, b):  # Any custom distance you want to try! e.g.,
    return max(i for i, (x, y) in enumerate(zip(a, b)) if x == y)

blender = mergic.Blender(distance)
blender.script()

Now custom_mergic.py can be used just like the standard mergic script! You can also use a custom function for generating the keys that values are de-duped to; by default mergic.Blender will use the first longest of a group’s values in sorted order.
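The default key rule mentioned here ("the first longest of a group’s values in sorted order") can be sketched in plain Python. This only illustrates the rule; it is not mergic's own code, and the name first_longest is hypothetical.

```python
def first_longest(values):
    """Return the first of the longest strings once values are sorted."""
    # max() returns the first maximal element it meets, so sorting first
    # makes ties on length resolve to the alphabetically earliest value.
    return max(sorted(values), key=len)


group = ["Bob Smith", "Robert Smith", "Robert Smyth", "R. Smith"]
print(first_longest(group))
```

Both "Robert Smith" and "Robert Smyth" are longest here, and sorting puts "Robert Smith" first, so that is the key this rule would pick.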

The distance calculation, cutoff evaluation, and partition creation are currently all in mergic make:

# see all the possible partitions by their statistics
mergic make names_all.txt

# make a partition using a cutoff of 0.303
mergic make 0.303 > partition.json
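
To make the role of the cutoff concrete, here is a minimal sketch of one way to turn pairwise distances and a cutoff into a partition: link any two items whose distance is at or below the cutoff, then take the connected groups. The distance (1 minus difflib's ratio) and the single-link grouping are assumptions for illustration rather than a description of mergic's internals, and partition_by_cutoff is a hypothetical name.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a, b):
    # Assumed distance: 1 - similarity ratio (0.0 means identical).
    return 1 - SequenceMatcher(None, a, b).ratio()


def partition_by_cutoff(items, cutoff):
    """Group items so that any pair within `cutoff` shares a group."""
    parent = {item: item for item in items}

    def find(item):
        # Follow parent links up to the representative of the group.
        while parent[item] != item:
            item = parent[item]
        return item

    for a, b in combinations(items, 2):
        if distance(a, b) <= cutoff:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


names = ["Roger Federer", "Roger Federer ", "R. Federer", "Rafael Nadal"]
print(partition_by_cutoff(names, 0.303))
```

With this sample data the three Federer spellings end up in one group and "Rafael Nadal" stays alone. A lower cutoff splits groups apart and a higher one merges more of them.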

Edit the partition until it’s good. Save it as partition_edited.json.

You can check that your partition is valid and see a cute summary:

mergic check partition_edited.json
# 669 items in 354 groups
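
As a rough idea of what a validity check involves, the sketch below assumes a partition is a JSON object mapping each key to the list of original values merged into it. That layout is an assumption made for illustration, and check_partition is a hypothetical helper, not mergic's implementation.

```python
import json
from collections import Counter


def check_partition(partition):
    """Validate a {key: [values]} mapping and return a short summary."""
    counts = Counter(v for values in partition.values() for v in values)
    repeated = sorted(v for v, n in counts.items() if n > 1)
    if repeated:
        # A value listed twice would make the merge ambiguous.
        raise ValueError("values listed more than once: %r" % repeated)
    return "%d items in %d groups" % (len(counts), len(partition))


# Hypothetical partition content, inlined here instead of read from a file.
text = """
{"Roger Federer": ["R. Federer", "Roger Federer"],
 "Rafael Nadal": ["Rafael Nadal"]}
"""
print(check_partition(json.loads(text)))
```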

You could proceed directly, but there are also diffing tools! Generate a diff:

mergic diff partition.json partition_edited.json > partition_diff.json

You can apply a diff to reconstruct an edited version:

mergic apply partition.json partition_diff.json > partition_rebuilt.json

Now if you mergic diff the files partition_edited.json and partition_rebuilt.json the result should just be {} (no difference).

To generate a CSV merge table that you’ll be able to use with any other tool:

mergic table partition_edited.json > partition.csv

Now the file partition.csv has two columns, original and mergic, where original contains all the values that appeared in the original data and mergic contains the deduplicated keys. You can join this on to your original data and go to town.
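
For example, the join can be done with nothing more than the standard csv module. The sketch below uses small in-memory stand-ins for partition.csv and for the original data; the name and wins columns of the original data are made up for illustration.

```python
import csv
import io

# Stand-in for partition.csv: the two columns described above.
table_csv = io.StringIO(
    "original,mergic\n"
    "R. Federer,Roger Federer\n"
    "Roger Federer,Roger Federer\n"
    "Rafael Nadal,Rafael Nadal\n"
)
# Hypothetical original data with a `name` column to be cleaned up.
data_csv = io.StringIO(
    "name,wins\n"
    "R. Federer,2\n"
    "Roger Federer,3\n"
    "Rafael Nadal,4\n"
)

# Build a lookup from each original value to its deduplicated key.
lookup = {row["original"]: row["mergic"] for row in csv.DictReader(table_csv)}

# Attach the key to every row, then aggregate on it.
totals = {}
for row in csv.DictReader(data_csv):
    key = lookup.get(row["name"], row["name"])  # unmatched names pass through
    totals[key] = totals.get(key, 0) + int(row["wins"])

print(totals)
```

After the join, the two Federer spellings are counted under the single key "Roger Federer".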

Distances

Here are some popular distances and how to do them with Python:

  • Levenshtein edit distance: The classic string edit distance, counting the single-character insertions, deletions, and substitutions needed to turn one string into the other.

# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance("fuzzy", "wuzzy")
# 1

  • SeatGeek’s fuzzywuzzy: As described in a blog post, it offers some distance variants that people have found to work well in practice. Its results are phrased as integer percent similarities; one way to make a distance is to subtract from 100.

# pip install fuzzywuzzy
from fuzzywuzzy import fuzz
100 - fuzz.ratio("Levensthein", "Leviathan")
# 40

There are a ton of distances, even just within the two packages mentioned! You can also roll your own! (This is encouraged!)
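
As a starting point for rolling your own, here is a minimal sketch built only on the standard library. It turns difflib's similarity ratio into a distance by subtracting from 1, and adds a variant that normalizes case and surrounding whitespace first. The function names are hypothetical, and the "1 minus ratio" form is an assumption about what makes a sensible distance, not a statement of what mergic does internally.

```python
from difflib import SequenceMatcher


def difflib_distance(a, b):
    """Distance in [0, 1]: 0.0 for identical strings, 1.0 for nothing shared."""
    return 1 - SequenceMatcher(None, a, b).ratio()


def casefold_distance(a, b):
    """Same distance, ignoring case and surrounding whitespace."""
    return difflib_distance(a.strip().lower(), b.strip().lower())


print(difflib_distance("fuzzy", "wuzzy"))   # roughly 0.2
print(casefold_distance(" Roger Federer", "roger federer"))
```

Either function takes two strings and returns a number, which is the shape of function passed to mergic.Blender in the custom script above.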
