Python utility to extract differences between two pandas dataframes.
Project description
Installation
Install pandas_diff with pip
pip install pandas_diff
Usage/Examples
import pandas_diff as pd_diff
import pandas as pd
# Create two example dataframes
df_infinity = pd.DataFrame([
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 0},
    {"hero": "thor", "hammers": 1},
])
df_endgame = pd.DataFrame([
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
])
# Get the differences, using "hero" as the key
df = pd_diff.get_diffs(df_infinity, df_endgame, "hero")
df
#operation object_keys object_values object_json attribute_changed old_value new_value
#0 create [hero] captain marvel {'hero': 'captain marvel', 'power': 'strength'... NaN NaN NaN
#1 delete [hero] black_widow {'hero': 'black_widow', 'power': 'spy', 'hamme... NaN NaN NaN
#2 modify [hero] thor {'hero': 'thor', 'power': nan, 'hammers': 2.0} hammers 1 2
#3 modify [hero] hulk {'hero': 'hulk', 'power': 'smart', 'hammers': ... power strength smart
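To make the create / delete / modify classification above concrete, here is a minimal pure-Python sketch of a key-based diff over lists of dicts. It is an illustration only, not pandas_diff's actual implementation: the function name `diff_records` and the event dict layout are hypothetical, and it assumes each key appears at most once per side (if a key repeats, the last row wins).

```python
def diff_records(old_rows, new_rows, key):
    """Compare two lists of dicts that share a key field.

    Returns a list of event dicts whose "operation" is
    "create", "delete", or "modify". Hypothetical helper,
    not part of pandas_diff.
    """
    # Index each side by key; if a key repeats, the last row wins.
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    events = []
    # Keys only present in the new data were created.
    for k in sorted(new_by_key.keys() - old_by_key.keys()):
        events.append({"operation": "create", "key": k, "object": new_by_key[k]})
    # Keys only present in the old data were deleted.
    for k in sorted(old_by_key.keys() - new_by_key.keys()):
        events.append({"operation": "delete", "key": k, "object": old_by_key[k]})
    # Keys present on both sides: emit one event per changed attribute.
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        old, new = old_by_key[k], new_by_key[k]
        for attr in sorted(old.keys() | new.keys()):
            if old.get(attr) != new.get(attr):
                events.append({
                    "operation": "modify",
                    "key": k,
                    "attribute_changed": attr,
                    "old_value": old.get(attr),
                    "new_value": new.get(attr),
                })
    return events


infinity = [
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 1},
]
endgame = [
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
]

events = diff_records(infinity, endgame, "hero")
for event in events:
    print(event["operation"], event["key"])
```

The sketch sorts keys only to make the event order deterministic; the row order produced by pandas_diff may differ.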
Why pandas_diff? Use cases
Migrating from batch to an event driven architecture
At work, we use many data pipelines to pull information from external platforms (Active Directory, GitHub, Jira). Each run loads the new data by replacing the entire table.
With pandas_diff we detect how the infrastructure changes between executions and stream those change events into a Kafka cluster, so other teams can subscribe to the events they care about. Also, by defining a pandas_diff step in the master pipeline, every item in our project has its life cycle events tracked.
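As a rough sketch of the streaming step, the snippet below turns diff rows into (topic, payload) pairs that a message producer could send. The helper `to_event_message`, the `source` argument, and the topic naming scheme are assumptions made for illustration; only the field names (`operation`, `object_values`, `attribute_changed`, `old_value`, `new_value`) come from the output shown earlier. No Kafka client is used here.

```python
import json


def to_event_message(row, source):
    """Turn one diff row (as a dict) into a (topic, payload) pair.

    Hypothetical helper: the topic is "<source>.<operation>" and the
    payload is a UTF-8 encoded JSON document.
    """
    topic = f"{source}.{row['operation']}"
    payload = {
        "key": row["object_values"],
        "attribute_changed": row.get("attribute_changed"),
        "old_value": row.get("old_value"),
        "new_value": row.get("new_value"),
    }
    return topic, json.dumps(payload, sort_keys=True).encode("utf-8")


# Hand-written rows that mimic the columns of the diff output.
rows = [
    {"operation": "create", "object_values": "captain marvel"},
    {
        "operation": "modify",
        "object_values": "thor",
        "attribute_changed": "hammers",
        "old_value": 1,
        "new_value": 2,
    },
]

messages = [to_event_message(row, source="heroes") for row in rows]
for topic, payload in messages:
    # A real pipeline would hand each pair to its Kafka producer here.
    print(topic, payload)
```

With a real diff DataFrame, the rows could be obtained with `df.to_dict("records")`; NaN cells would need converting to `None` first, since `json.dumps` does not write them as valid JSON.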
Events log
For every item in a table, pandas_diff gives you an event log of how the resources are being consumed.
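One possible way to keep such a log is to append the events from each pipeline run to a per-item history. The sketch below assumes events are plain dicts with the `operation` and `object_values` fields seen in the diff output; `append_to_log` and `run_id` are hypothetical names, not part of pandas_diff.

```python
from collections import defaultdict


def append_to_log(log, events, run_id):
    """Append each event to the history of the item it refers to."""
    for event in events:
        # Copy the event and tag it with the run that produced it.
        entry = dict(event, run=run_id)
        log[event["object_values"]].append(entry)
    return log


log = defaultdict(list)

# Events from two consecutive runs (hand-written for illustration).
append_to_log(log, [{"operation": "create", "object_values": "thor"}], run_id=1)
append_to_log(
    log,
    [{
        "operation": "modify",
        "object_values": "thor",
        "attribute_changed": "hammers",
        "old_value": 1,
        "new_value": 2,
    }],
    run_id=2,
)

for entry in log["thor"]:
    print(entry["run"], entry["operation"])
```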
Roadmap
Support for a standalone app
Blacklist of columns
Documentation
History
0.1.0 (2021-12-02)
* First release on PyPI.
0.7.10 (2021-12-05)
* Bugfix rst
0.7.11 (2021-12-05)
* Bugfix rst
0.7.11 (2021-12-05)
* Add pandas req
0.7.12 (2021-12-05)
* Bump to test doc
0.7.13 (2021-12-05)
* bump version
0.7.14 (2021-12-05)
* bump version
0.7.15 (2021-12-05)
* bump version
0.7.16 (2021-12-05)
* bump version
0.7.18 (2021-12-05)
* bump version
0.7.18 (2021-12-05)
* Add codacy badge
0.7.19 (2021-12-05)
* Add codacy badge
Built Distribution
Hashes for pandas_diff-0.7.19-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 09a4d70017fdba55cdf06e3119d8ed8054b1b07d9636fb556038e27067110d9f |
MD5 | 7b2cfbc2ad8b74a1df96472d899175e2 |
BLAKE2b-256 | 978e1133ec0d3b492f76f60ed253b8d5ad608c21688695605a821174d293e86f |