Python utility to extract differences between two pandas dataframes.
Project description
Installation
Install pandas_diff with pip
pip install pandas_diff
Usage/Examples
import pandas_diff as pd_diff
import pandas as pd
# Create two example dataframes
df_infinity_war = pd.DataFrame([
{"hero" : "hulk" , "power" : "strength"},
{"hero" : "black_widow" , "power" : "spy"},
{"hero" : "thor" , "hammers" : 0 },
{"hero" : "thor" , "hammers" : 1 } ] )
df_endgame = pd.DataFrame([
{"hero" : "hulk" , "power" : "smart"},
{"hero" : "captain marvel" , "power" : "strength"},
{"hero" : "thor" , "hammers" : 2 } ] )
# Get differences, using the key "hero"
df = pd_diff.get_diffs(df_infinity_war ,df_endgame ,"hero")
df
#operation object_keys object_values object_json attribute_changed old_value new_value
#0 create [hero] captain marvel {'hero': 'captain marvel', 'power': 'strength'... NaN NaN NaN
#1 delete [hero] black_widow {'hero': 'black_widow', 'power': 'spy', 'hamme... NaN NaN NaN
#2 modify [hero] thor {'hero': 'thor', 'power': nan, 'hammers': 2.0} hammers 1 2
#3 modify [hero] hulk {'hero': 'hulk', 'power': 'smart', 'hammers': ... power strength smart
Why pandas diff ? Cases of use
Migrating from batch to an event driven architecture
In my work, we use a lot of data pipelines to get info from external platforms, (active directory, github, jira). We load the new data replacing the entire table.
By using pandas_diff we detect how the infraestructure changes between executions, and stream those change events into a kafka cluster, so other teams could suscribe to their favourite events. Also, by defining a pandas_diff step in the master pipeline, every item in our project has ther life cycle events controlled.
Events log
For every item in a table, by using pandas_diff you will have an event log to audit of how the resources are being consumed.
Conciliation
To conciliate one datasource against the source of truth. Eg: You have a CMDB controlling with info regarding virtual machines. As there are several methods for creating those VMs, you use pandas_diff to replicate state of the infraestructure against the CMDB.
Features
Filtering of columns
Roadmap
Support for stand alone app
Documentation
History
0.7.18 (2021-12-05)
* Add codacy badge
0.7.19 (2021-12-05)
* Feat filter column
0.7.20 (2021-12-05)
* Feat filter column
0.7.21 (2021-12-05)
* Add filter fest
0.7.22 (2021-12-06)
* Add confition keys exist in df’s
1.1.0 (2021-12-06)
* Add confition keys exist in df’s 1.2.0 (2021-12-06) ——————
* Improve doc
1.2.0 (2021-12-06)
* Improve doc
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for pandas_diff-1.3.0-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6154d333a08dd0b1ee3ba7f9a7c64f4c735111c406fe07a0859c42f0235e49d5 |
|
MD5 | 1d6ccefe49ef57b824cd050b1589377b |
|
BLAKE2b-256 | e936380a82bbec75adccc83c06df0a3d612335771c5f71e4c39ed31e398ea3b7 |