A Python library for grouping duplicate data efficiently.

Introduction

dupegrouper can be used for various deduplication use cases. Its intended purpose is to provide a uniform API for both exact and near deduplication, while collecting duplicate instances into sets, i.e. "groups".

Deduplicating data is a hard task: validating an approach takes time and can require a lot of testing and iterating through approaches that may or may not be applicable to your dataset.

dupegrouper abstracts away the task of actually deduplicating, so that you can focus on the most important thing: implementing an appropriate "strategy" to achieve your stated end goal.

In fact, a "strategy" is key to dupegrouper's API. dupegrouper has:

  • Ready-to-use deduplication strategies
  • Pandas and Polars support
  • A flexible API

Check out the API Documentation.

Installation

pip install dupegrouper

Example

import dupegrouper

dg = dupegrouper.DupeGrouper(df) # input dataframe

dg.add_strategy(dupegrouper.strategies.Exact())

dg.dedupe("address")

dg.df # retrieve dataframe
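To make the idea of "groups" concrete, here is a minimal standard-library sketch of exact grouping. It is a conceptual illustration only, not dupegrouper's implementation, and it makes no claim about the shape of the dataframe returned by dg.df. The function name group_exact is hypothetical.

```python
def group_exact(values):
    """Assign one group id per distinct value, in order of first appearance.

    Conceptual sketch only: this is NOT how dupegrouper is implemented.
    """
    first_seen = {}  # value -> group id given at first appearance
    group_ids = []
    for value in values:
        # setdefault returns the existing id, or stores and returns a new one
        group_ids.append(first_seen.setdefault(value, len(first_seen) + 1))
    return group_ids


addresses = ["1 High St", "2 Low Rd", "1 High St", "3 Mid Ln", "2 Low Rd"]
group_ids = group_exact(addresses)
print(group_ids)  # [1, 2, 1, 3, 2]
```

Rows that share a group id are exact duplicates of one another; near-deduplication strategies widen the rule for when two rows share an id.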

Usage Guide

Adding Strategies

dupegrouper comes with ready-to-use deduplication strategies:

  • dupegrouper.strategies.Exact
  • dupegrouper.strategies.Fuzzy
  • dupegrouper.strategies.TfIdf

You can then add these in the order you want to apply them:

# Deduplicate the address column

dg = dupegrouper.DupeGrouper(df)

dg.add_strategy(dupegrouper.strategies.Exact())
dg.add_strategy(dupegrouper.strategies.Fuzzy(tolerance=0.3))

dg.dedupe("address")
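One plausible reason to put Exact before Fuzzy is cost: exact repeats can be skipped cheaply, so the pairwise fuzzy comparison only has to look at distinct values. The sketch below illustrates this with the standard library's difflib. It is not dupegrouper's algorithm, the helper name fuzzy_canonical is hypothetical, and it assumes that tolerance means allowed dissimilarity (two strings match when their similarity ratio is at least 1 - tolerance), which this README does not confirm.

```python
from difflib import SequenceMatcher


def fuzzy_canonical(values, tolerance):
    """Map each value to the first earlier value that is similar enough.

    Assumption (not confirmed by the library docs): `tolerance` is the
    allowed dissimilarity, so the match threshold is 1 - tolerance.
    """
    threshold = 1 - tolerance
    canonical = {}        # value -> representative value
    representatives = []  # distinct values that started their own group
    for value in values:
        if value in canonical:  # exact repeat: no fuzzy comparison needed
            continue
        for rep in representatives:
            ratio = SequenceMatcher(None, value.lower(), rep.lower()).ratio()
            if ratio >= threshold:
                canonical[value] = rep
                break
        else:
            # no representative was close enough: start a new group
            canonical[value] = value
            representatives.append(value)
    return canonical


addresses = ["12 Baker Street", "12 Baker Street", "12 Baker Stret", "99 Elm Avenue"]
mapping = fuzzy_canonical(addresses, tolerance=0.3)
```

Here the misspelt "12 Baker Stret" maps to "12 Baker Street", while "99 Elm Avenue" stays in its own group.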

Or, add a map of strategies:

# Also deduplicates the address column

dg = dupegrouper.DupeGrouper(df)

dg.add_strategy({
    "address": [
        dupegrouper.strategies.Exact(),
        dupegrouper.strategies.Fuzzy(tolerance=0.3),
    ]
})

dg.dedupe() # No Argument!

Custom Strategies

An instance of dupegrouper.DupeGrouper can accept custom functions too:

import pandas as pd

def my_func(df: pd.DataFrame, attr: str, /, match_str: str) -> dict[str, str]:
    """deduplicates df if any given row contains `match_str`"""
    my_map = {}
    for irow, _ in df.iterrows():
        left: str = df.at[irow, attr]
        my_map[left] = left
        for jrow, _ in df.iterrows():
            right: str = df.at[jrow, attr]
            if match_str in left.lower() and match_str in right.lower():
                my_map[left] = right
                break
    return my_map

The use case above warrants a custom implementation: my_func deduplicates rows only if they contain the partial string match_str. You can then add your custom function as a strategy:

dg = dupegrouper.DupeGrouper(df)

dg.add_strategy((my_func, {"match_str": "london"}))

print(dg.strategies) # returns ("my_func",)

dg.dedupe("address")
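To see what kind of mapping my_func returns without needing a dataframe, the following sketch reproduces its logic on a plain list of strings. The name match_map and the sample addresses are hypothetical; only the mapping rule is taken from my_func above.

```python
def match_map(values, match_str):
    """Pure-Python analogue of my_func: same mapping rule, no dataframe.

    Every value containing `match_str` (case-insensitively) maps to the first
    such value; every other value maps to itself.
    """
    hits = [v for v in values if match_str in v.lower()]
    target = hits[0] if hits else None
    return {v: (target if match_str in v.lower() else v) for v in values}


addresses = ["10 Downing St, London", "Flat 2, LONDON", "5 Rue Cler, Paris"]
result = match_map(addresses, "london")
# Both London addresses map to "10 Downing St, London";
# the Paris address maps to itself.
```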

[!NOTE] Your custom function's signature must take two positional arguments followed by keyword arguments:

(df: DataFrame, attr: str, /, **kwargs) -> dict[str, str]

Where attr is the attribute you wish to deduplicate.
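If you want to check a callable against this shape before registering it, a small helper based on the standard library's inspect module can do so. This is a hypothetical convenience, not part of dupegrouper, and it applies a strict reading of the note (the first two parameters must be positional-only); whether the library enforces this itself is not stated here.

```python
import inspect


def looks_like_strategy(func):
    """Return True if `func` matches (df, attr, /, **kwargs)-style signatures.

    Hypothetical helper, not part of dupegrouper.
    """
    params = list(inspect.signature(func).parameters.values())
    if len(params) < 2:
        return False
    head, tail = params[:2], params[2:]
    # The first two parameters must sit before the `/` marker.
    head_ok = all(p.kind is inspect.Parameter.POSITIONAL_ONLY for p in head)
    # Everything after them must be passable by keyword.
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
        inspect.Parameter.VAR_KEYWORD,
    )
    tail_ok = all(p.kind in keyword_kinds for p in tail)
    return head_ok and tail_ok


def good(df, attr, /, match_str):
    return {}


def bad(df, attr, match_str):  # missing the positional-only marker
    return {}
```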

[!WARNING] In the current implementation, any custom callable will also always dedupe exact matches!

Creating a Comprehensive Strategy

You can use the above techniques for a comprehensive strategy to deduplicate your data:

import dupegrouper
import pandas as pd  # or: import polars as pl

df = pd.read_csv("example.csv")

dg = dupegrouper.DupeGrouper(df)

strategies = {
    "address": [
        dupegrouper.strategies.Exact(),
        dupegrouper.strategies.Fuzzy(tolerance=0.5),
        (my_func, {"match_str": "london"}),
    ],
    "email": [
        dupegrouper.strategies.Exact(),
        dupegrouper.strategies.Fuzzy(tolerance=0.3),
        dupegrouper.strategies.TfIdf(tolerance=0.4, ngram=3, topn=2),
    ],
}

dg.add_strategy(strategies)

dg.dedupe()

df = dg.df
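The TfIdf strategy used for the email column takes an ngram parameter. As background, the sketch below shows one common way to compare strings with character n-gram TF-IDF vectors and cosine similarity, using only the standard library. It is a generic illustration, not dupegrouper's implementation: the function names are hypothetical, the smoothed IDF formula is just one common choice, treating ngram as a character n-gram size is an assumption, and topn is not modelled.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping, lower-cased character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build one sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(docs)
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


emails = ["anna@example.com", "anna@exmaple.com", "bob@other.org"]
vectors = tfidf_vectors(emails, n=3)
typo_similarity = cosine(vectors[0], vectors[1])   # shares many trigrams
other_similarity = cosine(vectors[0], vectors[2])  # shares no trigrams
```

The transposed-letter email scores well above the unrelated one, which is the kind of signal a tolerance threshold can then act on.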

Extending the API for Custom Implementations

For simple custom implementations, it's recommended that you use the approach discussed for custom functions (see Custom Strategies).

However, you can derive directly from the abstract base class dupegrouper.strategy.DeduplicationStrategy, and thus make direct use of the efficient core deduplication methods implemented in this library, as described in its API. This will expose a dedupe() method, ready for direct use within an instance of DupeGrouper, in much the same way that other dupegrouper.strategies are passed in as strategies.
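The general pattern of an abstract base class that exposes dedupe() while subclasses supply the matching rule can be sketched as follows. Everything here is a hypothetical stand-in written for illustration: SketchStrategy is not dupegrouper's DeduplicationStrategy, and the method names other than dedupe() are invented. Consult the API documentation for the real base class and the methods it requires.

```python
from abc import ABC, abstractmethod


class SketchStrategy(ABC):
    """Hypothetical stand-in; NOT dupegrouper's DeduplicationStrategy."""

    @abstractmethod
    def canonical_map(self, values):
        """Return a dict mapping each value to its group representative."""

    def dedupe(self, values):
        # Shared logic lives in the base class; subclasses only supply the map.
        mapping = self.canonical_map(values)
        return [mapping[v] for v in values]


class CaseInsensitive(SketchStrategy):
    """Treat values that differ only by letter case as duplicates."""

    def canonical_map(self, values):
        first = {}  # lower-cased value -> first original spelling seen
        return {v: first.setdefault(v.lower(), v) for v in values}


emails = ["A@x.com", "a@x.com", "b@y.org"]
deduped = CaseInsensitive().dedupe(emails)
print(deduped)  # ['A@x.com', 'A@x.com', 'b@y.org']
```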

About

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for more details.
