
Python utility to extract differences between two pandas dataframes.

Project description


Installation

Install pandas_diff with pip

pip install pandas_diff

Usage/Examples

import pandas as pd
import pandas_diff as pd_diff

# Create two example dataframes
df_infinity_war = pd.DataFrame([
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 0},
    {"hero": "thor", "hammers": 1},
])
df_endgame = pd.DataFrame([
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
])

# Get differences, using the key "hero"
df = pd_diff.get_diffs(df_infinity_war, df_endgame, "hero")

df

#    operation  object_keys   object_values                                        object_json  attribute_changed  old_value  new_value
# 0     create       [hero]  captain marvel  {'hero': 'captain marvel', 'power': 'strength'...                NaN        NaN        NaN
# 1     delete       [hero]     black_widow  {'hero': 'black_widow', 'power': 'spy', 'hamme...                NaN        NaN        NaN
# 2     modify       [hero]            thor     {'hero': 'thor', 'power': nan, 'hammers': 2.0}            hammers          1          2
# 3     modify       [hero]            hulk  {'hero': 'hulk', 'power': 'smart', 'hammers': ...              power   strength      smart
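
Since the result is a regular dataframe, the usual pandas tools apply to it. The sketch below does not call pandas_diff: it uses a hand-built stand-in with a subset of the columns shown above (the values are copied from the example output) to count changes per operation and to pick out the modified attributes.

```python
import pandas as pd

# Hand-built stand-in for a get_diffs result (assumption: same column names
# as in the example output above; only a subset of the columns is included).
diffs = pd.DataFrame([
    {"operation": "create", "object_values": "captain marvel",
     "attribute_changed": None, "old_value": None, "new_value": None},
    {"operation": "delete", "object_values": "black_widow",
     "attribute_changed": None, "old_value": None, "new_value": None},
    {"operation": "modify", "object_values": "thor",
     "attribute_changed": "hammers", "old_value": 1, "new_value": 2},
    {"operation": "modify", "object_values": "hulk",
     "attribute_changed": "power", "old_value": "strength", "new_value": "smart"},
])

# How many rows of each operation type are there?
counts = diffs["operation"].value_counts().to_dict()

# Keep only the attribute-level modifications.
modified = diffs[diffs["operation"] == "modify"]

# Map each modified item to the attribute that changed.
changed_attrs = dict(zip(modified["object_values"], modified["attribute_changed"]))
```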

Why pandas_diff? Use cases

Migrating from batch to an event driven architecture

At work we run many data pipelines that pull information from external platforms (Active Directory, GitHub, Jira). Each run loads the new data by replacing the entire table.

With pandas_diff we detect how the infrastructure changes between executions and stream those change events into a Kafka cluster, so other teams can subscribe to the events they care about. Because the master pipeline includes a pandas_diff step, the life cycle events of every item in our project are tracked.
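
A minimal sketch of the "diff rows to events" step, under some assumptions: the rows are plain dicts (for instance from `df.to_dict("records")`, with NaN already replaced by None so they serialize as valid JSON), and the event fields and the `<source>.<operation>` type naming are made up for illustration. Publishing to Kafka is left out on purpose; each JSON string would be handed to whatever producer you use.

```python
import json
from datetime import datetime, timezone


def diff_rows_to_events(rows, source):
    """Turn diff rows (dicts) into JSON strings ready to be published.

    The event layout is hypothetical; it is not something pandas_diff defines.
    """
    events = []
    for row in rows:
        event = {
            "source": source,
            # e.g. "github.create"; consumers could filter on this field.
            "type": f"{source}.{row['operation']}",
            "key": row["object_values"],
            "attribute": row.get("attribute_changed"),
            "old": row.get("old_value"),
            "new": row.get("new_value"),
            "emitted_at": datetime.now(timezone.utc).isoformat(),
        }
        events.append(json.dumps(event))
    return events


# Two rows shaped like the example output above.
rows = [
    {"operation": "create", "object_values": "captain marvel"},
    {"operation": "modify", "object_values": "hulk",
     "attribute_changed": "power", "old_value": "strength", "new_value": "smart"},
]
events = diff_rows_to_events(rows, source="github")
```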

Events log

Running pandas_diff on every load of a table gives you an event log for each item, which you can use to audit how resources are being consumed.
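
One way to picture such a log, as a sketch only: keep the diff rows of every run and group them by item. The run labels, the `vm-01` item and the `cpus` attribute are invented for the example, and the rows are plain dicts with the column names from the output above.

```python
from collections import defaultdict


def build_item_history(runs):
    """Group diff rows from several runs into a per-item event log.

    `runs` is an iterable of (run_label, rows) pairs, oldest first.
    """
    history = defaultdict(list)
    for run_label, rows in runs:
        for row in rows:
            history[row["object_values"]].append({
                "run": run_label,
                "operation": row["operation"],
                "attribute": row.get("attribute_changed"),
                "old": row.get("old_value"),
                "new": row.get("new_value"),
            })
    return dict(history)


# Hypothetical diffs collected over three pipeline executions.
runs = [
    ("run-1", [{"operation": "create", "object_values": "vm-01"}]),
    ("run-2", [{"operation": "modify", "object_values": "vm-01",
                "attribute_changed": "cpus", "old_value": 2, "new_value": 4}]),
    ("run-3", [{"operation": "delete", "object_values": "vm-01"}]),
]
history = build_item_history(runs)
```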

Reconciliation

Reconcile a data source against the source of truth. E.g., you have a CMDB holding information about virtual machines. Because those VMs can be created in several ways, you use pandas_diff to replicate the state of the infrastructure into the CMDB.
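
A rough sketch of applying the differences, assuming the diff was computed with the CMDB contents as the first dataframe and the observed infrastructure as the second, so that "create" means "missing from the CMDB". The CMDB is modelled as a plain dict keyed by VM name, and the VM names and the `cpus` attribute are made up; a real CMDB would be updated through its own API.

```python
def apply_diffs(cmdb, rows):
    """Return a copy of `cmdb` with the diff rows applied.

    Rows are dicts using the column names from the example output above.
    """
    updated = {name: dict(record) for name, record in cmdb.items()}
    for row in rows:
        item = row["object_values"]
        if row["operation"] == "create":
            # object_json carries the full record of the new item.
            updated[item] = dict(row["object_json"])
        elif row["operation"] == "delete":
            updated.pop(item, None)
        elif row["operation"] == "modify":
            updated.setdefault(item, {})[row["attribute_changed"]] = row["new_value"]
    return updated


# Hypothetical CMDB contents and the differences found against the real VMs.
cmdb = {
    "vm-01": {"name": "vm-01", "cpus": 2},
    "vm-02": {"name": "vm-02", "cpus": 4},
}
rows = [
    {"operation": "create", "object_values": "vm-03",
     "object_json": {"name": "vm-03", "cpus": 8}},
    {"operation": "delete", "object_values": "vm-02"},
    {"operation": "modify", "object_values": "vm-01",
     "attribute_changed": "cpus", "old_value": 2, "new_value": 4},
]
reconciled = apply_diffs(cmdb, rows)
```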

Features

  • Filtering of columns

Roadmap

  • Support for stand alone app

Documentation

History

0.7.18 (2021-12-05)

* Add codacy badge

0.7.19 (2021-12-05)

* Feat filter column

0.7.20 (2021-12-05)

* Feat filter column

0.7.21 (2021-12-05)

* Add filter fest

0.7.22 (2021-12-06)

* Add condition: keys exist in df’s

1.1.0 (2021-12-06)

* Add condition: keys exist in df’s

1.2.0 (2021-12-06)

* Improve doc

1.3.0 (2021-12-06)

* Remove workflows

1.4.0 (2021-12-06)

* Remove workflows

1.4.0 (2023-09-01)

* Improve doc

1.4.1 (2023-09-01)

* Improve doc

1.4.2 (2023-09-17)

* Bugfix version string

1.4.3 (2023-09-17)

* bugfix version tag

1.4.4 (2023-09-17)

* bugfix version tag

1.4.5 (2023-09-17)

* bugfix history string

1.4.6 (2023-09-17)

* bugfix history string

1.4.7 (2023-09-17)

* bugfix release description

