Python utility to extract differences between two pandas dataframes.
Project description
Installation
Install pandas_diff with pip
pip install pandas_diff
Usage/Examples
import pandas_diff as pd_diff
import pandas as pd
# Create two example dataframes
df_infinity_war = pd.DataFrame([
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 0},
    {"hero": "thor", "hammers": 1},
])
df_endgame = pd.DataFrame([
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
])
# Get differences, using the key "hero"
df = pd_diff.get_diffs(df_infinity_war, df_endgame, "hero")
df
#operation object_keys object_values object_json attribute_changed old_value new_value
#0 create [hero] captain marvel {'hero': 'captain marvel', 'power': 'strength'... NaN NaN NaN
#1 delete [hero] black_widow {'hero': 'black_widow', 'power': 'spy', 'hamme... NaN NaN NaN
#2 modify [hero] thor {'hero': 'thor', 'power': nan, 'hammers': 2.0} hammers 1 2
#3 modify [hero] hulk {'hero': 'hulk', 'power': 'smart', 'hammers': ... power strength smart
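The result is an ordinary pandas DataFrame, so it can be sliced by the operation column. The sketch below rebuilds a reduced copy of the output above by hand (so it runs without pandas_diff installed), counts the rows per operation, and lists which attribute changed for each modified hero. The variable names are illustrative, and the column names are assumed from the printed header.

```python
import pandas as pd

# Hand-built, reduced copy of the result shown above. Column names are taken
# from the printed header; object_keys and object_json are omitted for brevity.
diffs = pd.DataFrame([
    {"operation": "create", "object_values": "captain marvel",
     "attribute_changed": None, "old_value": None, "new_value": None},
    {"operation": "delete", "object_values": "black_widow",
     "attribute_changed": None, "old_value": None, "new_value": None},
    {"operation": "modify", "object_values": "thor",
     "attribute_changed": "hammers", "old_value": 1, "new_value": 2},
    {"operation": "modify", "object_values": "hulk",
     "attribute_changed": "power", "old_value": "strength", "new_value": "smart"},
])

# Count how many rows each operation produced.
counts = diffs["operation"].value_counts().to_dict()

# Keep only the attribute-level modifications and map hero -> changed attribute.
modified = diffs[diffs["operation"] == "modify"]
changed_attrs = dict(zip(modified["object_values"], modified["attribute_changed"]))

print(counts)
print(changed_attrs)
```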
Why pandas_diff? Use cases
Migrating from batch to an event driven architecture
At work, we run many data pipelines that pull information from external platforms (Active Directory, GitHub, Jira). Each run loads the new data by replacing the entire table.
By using pandas_diff we detect how the infrastructure changes between executions and stream those change events into a Kafka cluster, so other teams can subscribe to the events they care about. Also, by defining a pandas_diff step in the master pipeline, every item in our project has its life cycle events tracked.
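One way to turn diff rows into change events is to serialise each row as a JSON message. This is a minimal sketch, assuming rows shaped like the output above; the `to_event` helper and the `<source>.<operation>` topic naming are hypothetical, and the actual Kafka producer call is left out because it depends on the client library you use.

```python
import json

# Rows shaped like the get_diffs output above, written by hand here.
# With a real result you could start from df.to_dict("records"), converting
# NaN values to None first so that the filter below drops them.
rows = [
    {"operation": "create", "object_values": "captain marvel",
     "attribute_changed": None, "old_value": None, "new_value": None},
    {"operation": "modify", "object_values": "thor",
     "attribute_changed": "hammers", "old_value": 1, "new_value": 2},
]


def to_event(row, source):
    """Hypothetical helper: build a (topic, JSON payload) pair for one diff row."""
    topic = f"{source}.{row['operation']}"  # assumed topic naming scheme
    payload = {key: value for key, value in row.items() if value is not None}
    return topic, json.dumps(payload, sort_keys=True)


events = [to_event(row, "heroes") for row in rows]
for topic, message in events:
    # A real pipeline would hand topic and message to its Kafka producer here.
    print(topic, message)
```

Routing each operation to its own topic lets consumers subscribe only to creations, deletions, or modifications.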
Events log
For every item in a table, pandas_diff gives you an event log for auditing how resources are being consumed.
Reconciliation
To reconcile a data source against the source of truth. E.g.: you have a CMDB holding information about virtual machines. As there are several ways of creating those VMs, you use pandas_diff to reconcile the state of the infrastructure against the CMDB.
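To illustrate the idea behind reconciliation, here is a sketch in plain pandas (not pandas_diff's own implementation) that uses an outer merge with `indicator=True` to find VMs that are missing from the CMDB or no longer present in the infrastructure. The `vm` and `cpus` columns and the sample data are made up for the example.

```python
import pandas as pd

# Made-up sample data: what the CMDB records vs. what actually exists.
cmdb = pd.DataFrame([{"vm": "web-1", "cpus": 2}, {"vm": "db-1", "cpus": 4}])
actual = pd.DataFrame([{"vm": "web-1", "cpus": 2}, {"vm": "cache-1", "cpus": 1}])

# An outer merge keeps rows from both sides; the _merge column records
# whether each key was found in the left frame, the right frame, or both.
merged = cmdb.merge(
    actual, on="vm", how="outer", suffixes=("_cmdb", "_actual"), indicator=True
)

# VMs that exist in the infrastructure but are not recorded in the CMDB.
missing_in_cmdb = sorted(merged.loc[merged["_merge"] == "right_only", "vm"])
# VMs recorded in the CMDB that no longer exist in the infrastructure.
stale_in_cmdb = sorted(merged.loc[merged["_merge"] == "left_only", "vm"])

print(missing_in_cmdb, stale_in_cmdb)
```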
Features
Filtering of columns
Roadmap
Support for standalone app
Documentation
History
0.7.18 (2021-12-05)
* Add codacy badge
0.7.19 (2021-12-05)
* Feat filter column
0.7.20 (2021-12-05)
* Feat filter column
0.7.21 (2021-12-05)
* Add filter fest
0.7.22 (2021-12-06)
* Add condition: keys exist in the DataFrames
1.1.0 (2021-12-06)
* Add condition: keys exist in the DataFrames
1.2.0 (2021-12-06)
* Improve doc
1.3.0 (2021-12-06)
* Remove workflows
1.4.0 (2021-12-06)
* Remove workflows
1.4.0 (2023-09-01)
* Improve doc
1.4.1 (2023-09-01)
* Improve doc
1.4.2 (2023-09-17)
* Bugfix version string
1.4.3 (2023-09-17)
* bugfix version tag
1.4.4 (2023-09-17)
* bugfix version tag
1.4.5 (2023-09-17)
* bugfix history string
1.4.6 (2023-09-17)
* bugfix history string
1.4.7 (2023-09-17)
* bugfix release description
Download files
Source Distribution
File details
Details for the file pandas_diff-1.4.7.tar.gz.
File metadata
- Download URL: pandas_diff-1.4.7.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.14.0 pkginfo/1.8.1 requests/2.26.0 setuptools/58.0.4 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | fe5e4567ec3402eb77096a04cd7f2488950722fcdc488ca14bb71364f07fbdb1
MD5 | c2e3c979e39731f2c4836e5e41de91dd
BLAKE2b-256 | b719115c112b5d1f21900a0409e08db618e1d156c0d5ecb8c55c4f8d6bab7c8a