Linking rows of pandas dataframes
# pandas-linker
`pandas-linker` runs comparison windows over different sortings of a pandas DataFrame and links the rows via assigned UUIDs. This library does not actually do any duplicate detection. Instead it provides a harness to run your own comparison functions on your data.
This library is meant for datasets too large to compare every row with every other row. Instead, you choose a sort order for the DataFrame and compare each row only with the other rows inside a sliding window.
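To get a feel for the savings, here is a minimal sketch in plain Python that counts the comparisons needed with and without a window. It is not the library's internal implementation; the helper names `all_pairs` and `window_pairs` are made up for this illustration, and rows are represented only by their position in the sorted order.

```python
from itertools import combinations


def all_pairs(n):
    """Every unordered pair of row positions: n * (n - 1) / 2 pairs."""
    return list(combinations(range(n), 2))


def window_pairs(n, window_size):
    """Pairs of positions that fit together inside a window of
    `window_size` consecutive rows (hypothetical helper)."""
    pairs = []
    for i in range(n):
        # Row i is only paired with the rows that follow it in the window.
        for j in range(i + 1, min(i + window_size, n)):
            pairs.append((i, j))
    return pairs


n = 1000
print(len(all_pairs(n)))         # 499500
print(len(window_pairs(n, 10)))  # 8955
```

The trade-off is that two matching rows are only compared if some sort order places them within one window of each other, which is why running several passes with different sort columns can help.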
## Install
```
pip install pandas-linker
```
## Example
Let's say you have a DataFrame like this:
[ix] | name | country
------|------|--------
0 | Pete | Spain
1 | Mary | USA
2 | Bart | US
3 | Mary | US
and you want to detect similar rows and mark them as such. Here's how to do that:
```python
from pandas_linker import get_linker


def compare_rows(a, b):
    '''Example function that decides if two rows represent the same entity.'''
    return a['name'] in b['name'] or b['name'] in a['name']


# df is a pandas.DataFrame with a unique index
with get_linker(df, field='uid') as linker:
    print('Comparing in 10 row window sorted by name')
    linker(sort_cols=['name'], window_size=10, cmp=compare_rows)
    print('Comparing in 15 row window sorted by country')
    linker(sort_cols=['country'], window_size=15, cmp=compare_rows)
```
After running the linker, the DataFrame has a new `uid` column:
[ix] | name | country | uid
------|------|---------|----
0 | Pete | Spain | 7509781940fc471cad5dc32944652d70
1 | Mary | USA | 8f8dccd91568472daf740e9160349d6c
2 | Bart | US | 12b55fbe80f64d378193acd727b0e051
3 | Mary | US | 8f8dccd91568472daf740e9160349d6c
Note that both "Mary" rows in the DataFrame have been identified as representing
the same entity and were assigned the same UUID.
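Once the `uid` column exists, ordinary pandas operations are enough to work with the links. The sketch below rebuilds the result table above by hand (so it runs without `pandas-linker`) and shows two possible follow-ups: selecting rows that were linked to another row, and keeping one row per entity. The variable names `linked` and `entities` are just illustrative.

```python
import pandas as pd

# The result table from the example, rebuilt by hand for illustration.
df = pd.DataFrame({
    'name': ['Pete', 'Mary', 'Bart', 'Mary'],
    'country': ['Spain', 'USA', 'US', 'US'],
    'uid': [
        '7509781940fc471cad5dc32944652d70',
        '8f8dccd91568472daf740e9160349d6c',
        '12b55fbe80f64d378193acd727b0e051',
        '8f8dccd91568472daf740e9160349d6c',
    ],
})

# Rows whose uid also appears on at least one other row.
linked = df[df['uid'].duplicated(keep=False)]
print(linked)

# One representative row per entity (first occurrence of each uid).
entities = df.drop_duplicates(subset='uid', keep='first')
print(entities)
```

Which row to keep as the representative is a choice that depends on your data; `keep='first'` is only the simplest option.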