A Python library for grouping duplicate data efficiently.
Introduction
dupegrouper can be used for various deduplication use cases. Its intended purpose is to provide a uniform API for both exact and near deduplication, whilst also offering record selection based on the first instance of a set of duplicates, i.e. a "group".
Deduplicating data is a hard task: validating an approach takes time, and can require extensive testing and iteration through methods that may, or may not, suit your dataset.
dupegrouper abstracts away the mechanics of deduplication so that you can focus on the most important thing: implementing an appropriate "strategy" to achieve your stated end goal ...
...In fact, a "strategy" is key to dupegrouper's API. dupegrouper has:
Ready-to-use deduplication strategies
dupegrouper currently offers the following deduplication strategies:
| string type | numeric type |
|---|---|
| Exact string | Jaccard* |
| Fuzzy matching | Cosine similarity* |
| TfIdf | - |
| LSH* | - |
* due for implementation in a future version
You can also implement custom deduplication logic, which dupegrouper can readily accept, as described in Custom Strategies.
Multiple backend support
dupegrouper aims to scale in line with your problem. The following backends are currently supported:
- Pandas
- Polars
- PySpark
A flexible API
Check out the API Documentation
Installation
pip install dupegrouper
Example
import dupegrouper
dg = dupegrouper.DupeGrouper(df) # input dataframe
dg.add_strategy(dupegrouper.strategies.Exact())
dg.dedupe("address")
dg.df # retrieve dataframe
Usage Guide
Adding Strategies
dupegrouper comes with ready-to-use deduplication strategies:
- dupegrouper.strategies.Exact
- dupegrouper.strategies.Fuzzy
- dupegrouper.strategies.TfIdf
Strategies can be added one by one and are executed in the order in which they are added. In the case below, the address column will first be deduplicated exactly, and then with fuzzy matching.
# Deduplicate the address column
dg = dupegrouper.DupeGrouper(df)
dg.add_strategy(dupegrouper.strategies.Exact())
dg.add_strategy(dupegrouper.strategies.Fuzzy(tolerance=0.3))
dg.dedupe("address")
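To build intuition for what fuzzy matching catches that exact matching misses, a string-similarity score can be sketched with the standard library's difflib. Note this is purely illustrative: dupegrouper's internal scoring, and how it interprets the tolerance parameter, are defined by its own API.

```python
from difflib import SequenceMatcher

# Two addresses an exact match would treat as distinct
a = "12 Fleet Street, London"
b = "12 Fleet St, London"

# Ratio in [0, 1]; higher means more similar
similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
print(f"similarity: {similarity:.2f}")  # high, despite the differing spellings
```

A fuzzy strategy with a suitably chosen tolerance would group these two records, whereas an exact strategy would not.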
Or, you can add a map of strategies. In this case, strategies are executed in their defined order, for each map key. The implementation below produces the same result as above.
# Also deduplicates the address column
dg = dupegrouper.DupeGrouper(df)
dg.add_strategy({
    "address": [
        dupegrouper.strategies.Exact(),
        dupegrouper.strategies.Fuzzy(tolerance=0.3),
    ]
})
print(dg.strategies)
# {'address': ('Exact', 'Fuzzy')}
dg.dedupe() # No Argument!
A call of dedupe() will reset the strategies:
...
print(dg.strategies)
# {'address': ('Exact', 'Fuzzy', 'TfIdf')}
dg.dedupe()
print(dg.strategies)
# None
Custom Strategies
Maybe you need some custom deduplication methodology. An instance of dupegrouper.DupeGrouper can accept custom functions too.
def my_func(df, attr: str, /, **kwargs) -> dict[str, str]:
    my_map = {}
    for row in df:
        # e.g. use **kwargs
        my_map = ...
    return my_map
Above, my_func is a (very) boilerplate custom deduplication implementation:
- it accepts a dataframe (df)
- it deduplicates on a specific attribute (attr)
- it accepts other keyword arguments specific to your problem (**kwargs)

Look closely at the function signature: your function needs to match it exactly. The function must return a map in which each key-value pair represents a deduplication match, with the value being the newly selected record for that "group".
[!WARNING] In the current implementation, there is no guarantee that a generator can be used to `yield` deduplicated value maps
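As a concrete sketch of such a function, the following hypothetical strategy groups all values containing a given substring under the first such value. The match_str keyword and the contains_match name are illustrative, not part of dupegrouper's API; a plain dict of columns stands in for a dataframe here, since only df[attr] iteration is needed.

```python
def contains_match(df, attr: str, /, **kwargs) -> dict[str, str]:
    """Hypothetical custom strategy: group all values of `attr` that
    contain `match_str` (case-insensitively) under the first such value."""
    match_str = kwargs.get("match_str", "").lower()
    my_map: dict[str, str] = {}
    canonical = None
    for value in df[attr]:
        if match_str in str(value).lower():
            if canonical is None:
                canonical = value  # first match selected as the group's record
            my_map[value] = canonical
    return my_map

df = {"address": ["1 Fleet St, London", "London, 1 Fleet St", "2 Rue de Paris"]}
print(contains_match(df, "address", match_str="london"))
# {'1 Fleet St, London': '1 Fleet St, London', 'London, 1 Fleet St': '1 Fleet St, London'}
```

Both London addresses map to the first matching record, which becomes the selected record for that group; the Paris address is untouched.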
You can proceed to add your custom function as a strategy:
dg = dupegrouper.DupeGrouper(df)
dg.add_strategy((my_func, {"match_str": "london"}))
print(dg.strategies) # returns ("my_func",)
dg.dedupe("address")
[!WARNING] In the current implementation, any custom callable will also always dedupe exact matches!
Creating a Comprehensive Strategy
You can use the above techniques for a comprehensive strategy to deduplicate your data:
import dupegrouper
import pandas as pd  # | polars | pyspark
df = pd.read_csv("example.csv")
dg = dupegrouper.DupeGrouper(df)
strategies = {
"address": [
dupegrouper.strategies.Exact(),
dupegrouper.strategies.Fuzzy(tolerance=0.5),
(my_func, {"match": "london"}), # any address that contains "london"
],
"email": [
dupegrouper.strategies.Exact(),
dupegrouper.strategies.Fuzzy(tolerance=0.3),
dupegrouper.strategies.TfIdf(tolerance=0.4, ngram=3, topn=2),
],
}
dg.add_strategy(strategies)
dg.dedupe()
df = dg.df
Using the PySpark backend
dupegrouper can be used as described in Creating a Comprehensive Strategy. dupegrouper is partition-aware, and will deduplicate each partition, per worker node, given the defined strategy. Such a distributed implementation puts the onus on you to plan appropriately:
- partitions must already be containers of expected duplicates
- partitioning (or re-partitioning) must be planned ahead of time
The above problem is typically dealt with by using a "blocking key" as the partitioning/repartitioning key. Whilst several approaches may be valid, a blocking key is typically computed as a general property shared by records expected to contain duplicates. For example, that might be the first N characters of the attribute being deduplicated.
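A minimal, illustrative sketch of such a blocking key (the function name and normalisation are assumptions, not part of dupegrouper):

```python
def blocking_key(value: str, n: int = 4) -> str:
    """Illustrative blocking key: the first `n` alphanumeric
    characters of an attribute value, lower-cased."""
    return "".join(ch for ch in value.lower() if ch.isalnum())[:n]

addresses = ["1 Fleet St, London", "1 Fleet Street, London", "2 Rue de Paris"]
print([blocking_key(a) for a in addresses])
# ['1fle', '1fle', '2rue'] -- the two Fleet St variants share a key
```

Repartitioning the dataframe by such a key before deduplicating ensures that expected duplicates land in the same partition, and therefore on the same worker node.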
Extending the API for Custom Implementations
It's recommended that for simple custom implementations you use the approach discussed for custom functions. (see Custom Strategies).
However, you can derive directly from the abstract base class dupegrouper.strategy.DeduplicationStrategy, and thus make direct use of the efficient, core deduplication methods implemented in this library, as described in its API. This will expose a dedupe() method, ready for direct use within an instance of DupeGrouper, in much the same way that other dupegrouper.strategies are passed in as strategies.
About
License
This project is licensed under the Apache-2.0 License. See the LICENSE file for more details.