Skip to main content

No project description provided

Project description

Introduction

This repository was created to provide reusable tools for validating "double-runs" in the context of the Cockpit migration project, but also for other refactoring projects (MLOps, LACI, etc.).

Usage

Install the doublerun package using pip:

In Databricks:

%pip install doublerun

If you are using an old databricks runtime and you use outdated pandas or pyspark versions, run this instead to avoid updating pandas and pyspark (at your own risk):

%pip install --no-deps doublerun

Locally:

pip install --proxy http://127.0.0.1:3128 doublerun

Pandas Usage

Import the comparison functions:

import pandas as pd
from doublerun.pandas import (
    comparison_count,
    comparison_schema,
    comparison_columns,
    comparison_records,
    global_comparison
)

df1 = pd.DataFrame({
        'Column1' : [1, 2, 3],
        'Column2' : ['A', 'B', 'C']
    })

df2 = pd.DataFrame({
    'Column1' : [1, 2, 3, 4],
    'Column3' : ['E', 'B', 'C', 'F']
})

Compare the number of rows between the two DataFrames.

df1_count, df2_count = comparison_count(df1, df2)

Output:

************************COMPARAISON DU NOMBRE DE LIGNES*************************

    Le nombre de lignes des deux DataFrames ne sont pas égaux :
        - Pour le DataFrame 1 on a 3 lignes.
        - Pour le DataFrame 2 on a 4 lignes.

Compare the schema of the two DataFrames and return the missing columns in each.

missing_cols_in_1, missing_cols_in_2 = comparison_schema(df1, df2)

Output:

*****************************COMPARAISON DU SCHEMA******************************

    Les schémas ne sont pas équivalents.
    Dans le DataFrame 1 il y a ces colonnes manquantes par rapport au DataFrame 2:
        ['Column3']
    
    Dans le DataFrame 2 il y a ces colonnes manquantes par rapport au DataFrame 1:
        ['Column2']

Compare the number of columns in the two DataFrames and return the columns in common.

common_columns = comparison_columns(df1, df2)

Output:

****************************COMPARAISON DES COLONNES****************************

    Le nombre de colonnes est identiques entre les deux DataFrames.
        Nombre de colonnes communes entre les deux DataFrames: 1
        Colonnes communes entre les deux DataFrames: 
            ['Column1']

Compare the records of the two DataFrames and return:

  • common : Records present in both DataFrames.
  • left_only, right_only : Records present only in df1 or df2.
common, left_only, right_only = comparison_records(df1, df2)

Output:

***************COMPARAISON DES DONNÉES ENTRE LES DEUX DATAFRAMES****************

    Certaines lignes ne sont pas présentes dans les deux DataFrames:
            0 lignes sont uniquement dans le DataFrame 1.
            
            1 lignes sont uniquement dans le DataFrame 2.
            
            3 lignes sont présentes dans les deux DataFrames.

Global Comparison

You can also run the global_comparison function and print all of the above at once.

common, left_only, right_only = global_comparison(df1, df2)

Spark Usage

Usage with spark DataFrames is exactly the same but functions need to be imported from doublerun.spark instead:

import pandas as pd
from doublerun.spark import (
    comparison_count,
    comparison_schema,
    comparison_columns,
    comparison_records,
    global_comparison
)

df1 = pd.DataFrame({
        'Column1' : [1, 2, 3],
        'Column2' : ['A', 'B', 'C']
    })

df2 = pd.DataFrame({
    'Column1' : [1, 2, 3, 4],
    'Column3' : ['E', 'B', 'C', 'F']
})

df1 = spark.createDataFrame(df1)
df2 = spark.createDataFrame(df2)

comparison_count(df1, df2)
comparison_schema(df1, df2)
comparison_columns(df1, df2)
common, left_diff, right_diff = comparison_records(df1, df2)
# common, left_diff, right_diff = global_comparison(df1, df2)

Contributing

In order to contribute, create your branch with a meaningful title representing a feature you would like to develop (Examples: pandas_visualisation_mismatches, pandas_high_perf_dask, spark_notebooks, etc.). Please, have a look at existing branches before creating a new one.

Then, make a pull request to the dev branch to make sure no conflicts are created when we will be merging multiple branches together.

Credits

Thanks to Bilel BOUACHA of the HyperVision Team for providing the basis for the code contained in this package. This code was slightly refactored to be used as a general comparison tool between two spark or pandas DataFrames.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

doublerun-0.0.3.tar.gz (6.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

doublerun-0.0.3-py3-none-any.whl (5.7 kB view details)

Uploaded Python 3

File details

Details for the file doublerun-0.0.3.tar.gz.

File metadata

  • Download URL: doublerun-0.0.3.tar.gz
  • Upload date:
  • Size: 6.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for doublerun-0.0.3.tar.gz
Algorithm Hash digest
SHA256 a3e19e520f111c2227c14c3bc15e9fd5d80a50c57570df8a26d729a06d5bf935
MD5 202b19118e4835e9024d1414ba471573
BLAKE2b-256 315f0187aeefe18eb00f5c26b65cf1334f3859a56cd73e56fea7f72523e852e0

See more details on using hashes here.

File details

Details for the file doublerun-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: doublerun-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 5.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for doublerun-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 2095593e2ed30b241862fbee561606c50c4a8427dcb05acd2c6873765f53632f
MD5 8a57873e9c472ac6246eb379df3f393f
BLAKE2b-256 42ea908e6ae1348c41775f92320f5c06038efb3de545f5a20bda4a59df6dc274

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page