A Python library for grouping duplicate data efficiently.
Introduction
dupegrouper can be used for various deduplication use cases. Its intended purpose is to implement a uniform API that allows for both exact and near deduplication, whilst collecting duplicate instances into sets, i.e. "groups".
Deduplicating data is a hard task: validating an approach takes time and can require a lot of testing and iterating through approaches that may, or may not, be applicable to your dataset.
dupegrouper abstracts away the task of actually deduplicating, so that you can focus on the most important thing: implementing an appropriate "strategy" to achieve your stated end goal.
In fact, a "strategy" is key to dupegrouper's API. dupegrouper has:
- Ready-to-use deduplication strategies
- Pandas and Polars support
- A flexible API
Check out the API Documentation.
Installation
pip install dupegrouper
Example
import dupegrouper
dg = dupegrouper.DupeGrouper(df) # input dataframe
dg.add_strategy(dupegrouper.strategies.Exact())
dg.dedupe("address")
dg.df # retrieve dataframe
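To picture what collecting duplicates into "groups" means, here is a minimal standard-library sketch. It is not dupegrouper's implementation and makes no claim about the exact shape of `dg.df`; the helper name `group_exact` and the choice of the first occurrence's position as the group id are assumptions made for illustration only.

```python
def group_exact(values: list[str]) -> list[int]:
    """Return one group id per value; identical values share an id.

    Hypothetical helper for illustration, not part of dupegrouper.
    """
    first_seen: dict[str, int] = {}
    group_ids: list[int] = []
    for position, value in enumerate(values, start=1):
        # The first occurrence defines the group; later copies reuse its id.
        group_ids.append(first_seen.setdefault(value, position))
    return group_ids


addresses = ["1 High St", "2 Low Rd", "1 High St", "3 Mid Ln", "2 Low Rd"]
print(group_exact(addresses))  # [1, 2, 1, 4, 2]
```

Rows that share a group id are duplicates of each other, and no row is dropped, so you can still decide afterwards which member of each group to keep.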
Usage Guide
Adding Strategies
dupegrouper comes with ready-to-use deduplication strategies:
- dupegrouper.strategies.Exact
- dupegrouper.strategies.Fuzzy
- dupegrouper.strategies.TfIdf
You can then add these in the order you want to apply them:
# Deduplicate the address column
dg = dupegrouper.DupeGrouper(df)
dg.add_strategy(dupegrouper.strategies.Exact())
dg.add_strategy(dupegrouper.strategies.Fuzzy(tolerance=0.3))
dg.dedupe("address")
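Why the order of strategies matters can be sketched without the library. The following is a standard-library illustration, not dupegrouper's code: `exact`, `fuzzy` and `dedupe` are hypothetical stand-ins, similarity comes from `difflib.SequenceMatcher`, and it assumes that `tolerance` means the allowed dissimilarity (so `0.3` accepts a similarity ratio of at least `0.7`).

```python
from difflib import SequenceMatcher
from typing import Callable

# A strategy here is just a predicate deciding whether two values match.
Matcher = Callable[[str, str], bool]


def exact() -> Matcher:
    return lambda left, right: left == right


def fuzzy(tolerance: float) -> Matcher:
    # Assumption: tolerance is the allowed dissimilarity, not the similarity.
    threshold = 1 - tolerance
    return lambda left, right: SequenceMatcher(None, left, right).ratio() >= threshold


def dedupe(values: list[str], strategies: list[Matcher]) -> list[int]:
    groups = list(range(1, len(values) + 1))  # every record starts alone
    for matches in strategies:  # strategies run in the order given
        for i, left in enumerate(values):
            for j in range(i):
                if groups[i] != groups[j] and matches(left, values[j]):
                    # Fold record i's whole group into the earlier group.
                    old, new = groups[i], groups[j]
                    groups = [new if g == old else g for g in groups]
    return groups


addresses = ["12 Baker Street", "12 Baker Street", "12 Baker St", "99 Elm Road"]
print(dedupe(addresses, [exact()]))              # [1, 1, 3, 4]
print(dedupe(addresses, [exact(), fuzzy(0.3)]))  # [1, 1, 1, 4]
```

Each later strategy works on the groups left by the earlier ones, so cheap exact matching can shrink the problem before a more expensive fuzzy pass runs.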
Or, add a map of strategies:
# Also deduplicates the address column
dg = dupegrouper.DupeGrouper(df)
dg.add_strategy({
    "address": [
        dupegrouper.strategies.Exact(),
        dupegrouper.strategies.Fuzzy(tolerance=0.3),
    ]
})
dg.dedupe() # No Argument!
Custom Strategies
An instance of dupegrouper.DupeGrouper can accept custom functions too.
import pandas as pd

def my_func(df: pd.DataFrame, attr: str, /, match_str: str) -> dict[str, str]:
    """deduplicates df if any given row contains `match_str`"""
    my_map = {}
    for irow, _ in df.iterrows():
        left: str = df.at[irow, attr]
        my_map[left] = left  # by default, a value maps to itself
        for jrow, _ in df.iterrows():
            right: str = df.at[jrow, attr]
            # map to the first row in which both values contain match_str
            if match_str in left.lower() and match_str in right.lower():
                my_map[left] = right
                break
    return my_map
Above, my_func is a custom implementation: it deduplicates rows only if they contain the partial string match_str. You can then add your custom function as a strategy:
dg = dupegrouper.DupeGrouper(df)
dg.add_strategy((my_func, {"match_str": "london"}))
print(dg.strategies) # returns ("my_func",)
dg.dedupe("address")
[!NOTE] Your custom function's signature must be two positional arguments followed by keyword arguments:
(df: DataFrame, attr: str, /, **kwargs) -> dict[str, str]
where attr is the attribute you wish to deduplicate.
[!WARNING] In the current implementation, any custom callable will also always dedupe exact matches!
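If you want to catch a mis-shaped custom function before handing it over, the signature rule in the note above can be checked with the standard `inspect` module. This is a sketch under assumptions: `follows_contract` is a hypothetical helper, not part of dupegrouper, and it says nothing about how (or whether) the library validates callables itself.

```python
import inspect


def follows_contract(func) -> bool:
    """True if func looks like (df, attr, /, **kwargs).

    Hypothetical helper for illustration, not part of dupegrouper.
    """
    params = list(inspect.signature(func).parameters.values())
    if len(params) < 2:
        return False
    kind = inspect.Parameter
    positional = (kind.POSITIONAL_ONLY, kind.POSITIONAL_OR_KEYWORD)
    by_keyword = (kind.POSITIONAL_OR_KEYWORD, kind.KEYWORD_ONLY, kind.VAR_KEYWORD)
    # The first two parameters must be passable positionally (df, attr) ...
    head_ok = all(p.kind in positional for p in params[:2])
    # ... and everything after them must be passable by keyword.
    tail_ok = all(p.kind in by_keyword for p in params[2:])
    return head_ok and tail_ok


def good(df, attr, /, match_str):
    return {}


def too_short(df):
    return {}


def takes_varargs(df, attr, *extra):
    return {}


print(follows_contract(good))           # True
print(follows_contract(too_short))      # False
print(follows_contract(takes_varargs))  # False
```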
Creating a Comprehensive Strategy
You can use the above techniques for a comprehensive strategy to deduplicate your data:
import dupegrouper
import pandas as pd # or polars
df = pd.read_csv("example.csv")
dg = dupegrouper.DupeGrouper(df)
strategies = {
    "address": [
        dupegrouper.strategies.Exact(),
        dupegrouper.strategies.Fuzzy(tolerance=0.5),
        (my_func, {"match_str": "london"}),
    ],
    "email": [
        dupegrouper.strategies.Exact(),
        dupegrouper.strategies.Fuzzy(tolerance=0.3),
        dupegrouper.strategies.TfIdf(tolerance=0.4, ngram=3, topn=2),
    ],
}
dg.add_strategy(strategies)
dg.dedupe()
df = dg.df
Extending the API for Custom Implementations
For simple custom implementations, the recommended approach is to use custom functions (see Custom Strategies).
However, you can also derive directly from the abstract base class dupegrouper.strategy.DeduplicationStrategy, and thus make direct use of the efficient, core deduplication methods implemented in this library, as described in its API. This will expose a dedupe() method, ready for direct use within an instance of DupeGrouper, in much the same way that other dupegrouper.strategies are passed in as strategies.
About
License
This project is licensed under the Apache-2.0 License. See the LICENSE file for more details.