
Python utility to extract differences between two pandas dataframes.


Installation

Install pandas_diff with pip

pip install pandas_diff

Usage/Examples

import pandas as pd
import pandas_diff as pd_diff

# Create two example dataframes
df_infinity_war = pd.DataFrame([
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 0},
    {"hero": "thor", "hammers": 1},
])
df_endgame = pd.DataFrame([
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
])

# Get differences, using the key "hero"
df = pd_diff.get_diffs(df_infinity_war, df_endgame, "hero")

df

#   operation object_keys   object_values                                         object_json attribute_changed old_value new_value
#0     create      [hero]  captain marvel  {'hero': 'captain marvel', 'power': 'strength'...               NaN       NaN       NaN
#1     delete      [hero]     black_widow  {'hero': 'black_widow', 'power': 'spy', 'hamme...               NaN       NaN       NaN
#2     modify      [hero]            thor      {'hero': 'thor', 'power': nan, 'hammers': 2.0}           hammers         1         2
#3     modify      [hero]            hulk  {'hero': 'hulk', 'power': 'smart', 'hammers': ...             power  strength     smart
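
To make the three operations concrete, here is a minimal sketch of a key-based diff over plain lists of dicts. It is an illustration of the idea only, not how pandas_diff is implemented; the function name diff_records and the "last row wins" handling of repeated keys are assumptions of this sketch.

```python
def diff_records(old_rows, new_rows, key):
    """Compare two lists of dicts by `key` and return a list of change records."""
    # Index rows by key. If a key repeats, the last row wins (assumption of this sketch).
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}

    changes = []
    # Keys only in the new data were created
    for k in sorted(new.keys() - old.keys()):
        changes.append({"operation": "create", "object_values": k})
    # Keys only in the old data were deleted
    for k in sorted(old.keys() - new.keys()):
        changes.append({"operation": "delete", "object_values": k})
    # Keys in both: emit one record per attribute whose value differs
    for k in sorted(old.keys() & new.keys()):
        for attr in sorted(set(old[k]) | set(new[k])):
            before, after = old[k].get(attr), new[k].get(attr)
            if before != after:
                changes.append({
                    "operation": "modify",
                    "object_values": k,
                    "attribute_changed": attr,
                    "old_value": before,
                    "new_value": after,
                })
    return changes


infinity_war = [
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 0},
    {"hero": "thor", "hammers": 1},
]
endgame = [
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
]
changes = diff_records(infinity_war, endgame, "hero")
```

The sketch yields one create, one delete and two modify records for this data; the row order and the extra columns in the real output come from the library itself.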

Why pandas_diff? Use cases

Migrating from batch to an event driven architecture

At my workplace, we run many data pipelines that pull information from external platforms (Active Directory, GitHub, Jira). Each run loads the new data by replacing the entire table.

With pandas_diff, we detect how the infrastructure changes between executions and stream those change events into a Kafka cluster, so other teams can subscribe to the events they care about. Also, by defining a pandas_diff step in the master pipeline, every item in our project has its life cycle events tracked.
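
As a rough sketch of the streaming step, the code below turns diff rows into JSON change events and hands them to a send callable. Everything here is an assumption made for illustration: the helper names to_event and publish, the "<source>.<operation>" event type, the "<source>-changes" topic name, and the send(topic, payload) signature, which stands in for whatever Kafka producer you use. The rows are hand-written dicts shaped like the get_diffs output shown earlier.

```python
import json
import math
from datetime import datetime, timezone


def _clean(value):
    """Replace float NaN (pandas' missing-value marker) with None so the JSON is valid."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def to_event(row, source):
    """Turn one diff row (a plain dict) into a JSON-encoded change event."""
    event = {
        # Hypothetical naming scheme: "<source>.<operation>", e.g. "github.modify"
        "type": f"{source}.{row['operation']}",
        "key": _clean(row["object_values"]),
        "attribute": _clean(row.get("attribute_changed")),
        "old_value": _clean(row.get("old_value")),
        "new_value": _clean(row.get("new_value")),
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }
    # default=str is a fallback for values json cannot encode natively (e.g. NumPy scalars)
    return json.dumps(event, default=str)


def publish(rows, send, source):
    """Call send(topic, payload) once per diff row; `send` wraps your producer."""
    for row in rows:
        send(f"{source}-changes", to_event(row, source))


# Hand-written rows shaped like the get_diffs output
rows = [
    {"operation": "create", "object_values": "captain marvel",
     "attribute_changed": float("nan"), "old_value": float("nan"),
     "new_value": float("nan")},
    {"operation": "modify", "object_values": "hulk",
     "attribute_changed": "power", "old_value": "strength", "new_value": "smart"},
]
sent = []  # stand-in for a real producer: just collect what would be sent
publish(rows, lambda topic, payload: sent.append((topic, payload)), source="heroes")
```

In a real pipeline the rows could come from df.to_dict("records") on the dataframe returned by get_diffs, and send would wrap the producer client of your choice.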

Events log

For every item in a table, pandas_diff gives you an event log that you can use to audit how resources are being consumed.

Reconciliation

To reconcile a data source against the source of truth. E.g., you have a CMDB that holds information about virtual machines. Because there are several ways to create those VMs, you use pandas_diff to replicate the state of the infrastructure into the CMDB.
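
One way to picture the replication step is to replay the change records against the target store. The sketch below uses a plain dict as a stand-in for the CMDB; apply_changes is a hypothetical helper, and it assumes each create record carries the full object as a dict in object_json, which may not match the library's actual column type.

```python
def apply_changes(state, changes):
    """Apply diff records to `state`, a dict mapping key -> attribute dict."""
    for change in changes:
        key = change["object_values"]
        operation = change["operation"]
        if operation == "create":
            # Assumption: object_json holds the full new object as a dict
            state[key] = dict(change["object_json"])
        elif operation == "delete":
            state.pop(key, None)
        elif operation == "modify":
            state.setdefault(key, {})[change["attribute_changed"]] = change["new_value"]
        else:
            raise ValueError(f"unknown operation: {operation!r}")
    return state


# A dict standing in for the CMDB, keyed by VM name
cmdb = {
    "vm-1": {"name": "vm-1", "cpus": 2},
    "vm-2": {"name": "vm-2", "cpus": 4},
}
changes = [
    {"operation": "create", "object_values": "vm-3",
     "object_json": {"name": "vm-3", "cpus": 8}},
    {"operation": "delete", "object_values": "vm-2"},
    {"operation": "modify", "object_values": "vm-1",
     "attribute_changed": "cpus", "old_value": 2, "new_value": 4},
]
apply_changes(cmdb, changes)
```

After the call, the stand-in CMDB contains vm-1 with 4 CPUs and the new vm-3, and vm-2 is gone.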

Features

  • Filtering of columns

Roadmap

  • Support for a standalone app


History

0.7.18 (2021-12-05)

* Add codacy badge

0.7.19 (2021-12-05)

* Feat filter column

0.7.20 (2021-12-05)

* Feat filter column

0.7.21 (2021-12-05)

* Add filter fest

0.7.22 (2021-12-06)

* Add condition: keys must exist in the dataframes

1.1.0 (2021-12-06)

* Add condition: keys must exist in the dataframes

1.2.0 (2021-12-06)

* Improve doc

1.3.0 (2021-12-06)

* Remove workflows

1.4.0 (2021-12-06)

* Remove workflows

1.4.0 (2023-09-01)

* Improve doc

1.4.1 (2023-09-01)

* Improve doc

1.4.2 (2023-09-17)

* Bugfix version string

1.4.3 (2023-09-17)

* bugfix version tag

1.4.4 (2023-09-17)

* bugfix version tag

1.4.5 (2023-09-17)

* bugfix history string

1.4.6 (2023-09-17)

* bugfix history string

1.4.7 (2023-09-17)

* bugfix release description

