Project description

Introduction

This repository provides reusable tools for validating "double-runs". It was created for the Cockpit migration project, but it can also be used in other refactoring projects (MLOps, LACI, etc.).

Usage

Install the doublerun package using pip:

In Databricks:

%pip install doublerun

If you are on an old Databricks runtime with outdated pandas or pyspark versions, run this instead so that pip does not upgrade pandas and pyspark (at your own risk):

%pip install --no-deps doublerun

Locally:

pip install --proxy http://127.0.0.1:3128 doublerun
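Before choosing between the plain install and the --no-deps variant, it can help to see which versions the environment already provides. The snippet below is a minimal sketch that uses only the standard library; the helper name installed_version is hypothetical and is not part of doublerun.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print what the current environment provides before picking an install command.
for name in ("pandas", "pyspark", "doublerun"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```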

Pandas Usage

Import the comparison functions:

import pandas as pd
from doublerun.pandas import (
    comparison_count,
    comparison_schema,
    comparison_columns,
    comparison_records,
    global_comparison
)

df1 = pd.DataFrame({
    'Column1' : [1, 2, 3],
    'Column2' : ['A', 'B', 'C']
})

df2 = pd.DataFrame({
    'Column1' : [1, 2, 3, 4],
    'Column3' : ['E', 'B', 'C', 'F']
})

Compare the number of rows between the two DataFrames.

df1_count, df2_count = comparison_count(df1, df2)

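Output (the package prints its messages in French; here it reports that the row counts differ, with 3 rows in DataFrame 1 and 4 rows in DataFrame 2):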

************************COMPARAISON DU NOMBRE DE LIGNES*************************

    Le nombre de lignes des deux DataFrames ne sont pas égaux :
        - Pour le DataFrame 1 on a 3 lignes.
        - Pour le DataFrame 2 on a 4 lignes.

Compare the schema of the two DataFrames and return the missing columns in each.

missing_cols_in_1, missing_cols_in_2 = comparison_schema(df1, df2)

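Output (in French: the schemas are not equivalent; Column3 is missing from DataFrame 1 and Column2 is missing from DataFrame 2):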

*****************************COMPARAISON DU SCHEMA******************************

    Les schémas ne sont pas équivalents.
    Dans le DataFrame 1 il y a ces colonnes manquantes par rapport au DataFrame 2:
        ['Column3']
    
    Dans le DataFrame 2 il y a ces colonnes manquantes par rapport au DataFrame 1:
        ['Column2']
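The schema check above can be reproduced with plain pandas. This is a minimal sketch of the idea, not the implementation used by doublerun; the helper name missing_columns is hypothetical.

```python
import pandas as pd


def missing_columns(left, right):
    """Return (columns of `right` absent from `left`, columns of `left` absent from `right`)."""
    missing_in_left = [col for col in right.columns if col not in left.columns]
    missing_in_right = [col for col in left.columns if col not in right.columns]
    return missing_in_left, missing_in_right


df1 = pd.DataFrame({"Column1": [1, 2, 3], "Column2": ["A", "B", "C"]})
df2 = pd.DataFrame({"Column1": [1, 2, 3, 4], "Column3": ["E", "B", "C", "F"]})

missing_in_1, missing_in_2 = missing_columns(df1, df2)
print(missing_in_1)  # ['Column3']
print(missing_in_2)  # ['Column2']
```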

Compare the number of columns in the two DataFrames and return the columns in common.

common_columns = comparison_columns(df1, df2)

Output (in French: both DataFrames have the same number of columns, with one column in common):

****************************COMPARAISON DES COLONNES****************************

    Le nombre de colonnes est identiques entre les deux DataFrames.
        Nombre de colonnes communes entre les deux DataFrames: 1
        Colonnes communes entre les deux DataFrames: 
            ['Column1']

Compare the records of the two DataFrames and return:

common: records present in both DataFrames.
left_only, right_only: records present only in df1 or only in df2, respectively.

common, left_only, right_only = comparison_records(df1, df2)

Output (in French: 0 rows are only in DataFrame 1, 1 row is only in DataFrame 2, and 3 rows are in both):

***************COMPARAISON DES DONNÉES ENTRE LES DEUX DATAFRAMES****************

    Certaines lignes ne sont pas présentes dans les deux DataFrames:
            0 lignes sont uniquement dans le DataFrame 1.
            
            1 lignes sont uniquement dans le DataFrame 2.
            
            3 lignes sont présentes dans les deux DataFrames.
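The counts above are what one would get by comparing the two DataFrames on their shared columns only. The sketch below illustrates that idea with a pandas outer merge and its indicator column. It is an assumption about the behaviour, not the code of doublerun: the helper name split_records is hypothetical, and collapsing duplicate rows is a choice made for this example.

```python
import pandas as pd


def split_records(left, right):
    """Split rows into (common, left_only, right_only) using the shared columns only."""
    shared = [col for col in left.columns if col in right.columns]
    if not shared:
        raise ValueError("the two DataFrames have no column in common")
    # Assumption: duplicate rows are collapsed so each distinct record counts once.
    merged = left[shared].drop_duplicates().merge(
        right[shared].drop_duplicates(), on=shared, how="outer", indicator=True
    )
    # The `_merge` column added by indicator=True says where each row came from.
    parts = {
        side: merged.loc[merged["_merge"] == side, shared].reset_index(drop=True)
        for side in ("both", "left_only", "right_only")
    }
    return parts["both"], parts["left_only"], parts["right_only"]


df1 = pd.DataFrame({"Column1": [1, 2, 3], "Column2": ["A", "B", "C"]})
df2 = pd.DataFrame({"Column1": [1, 2, 3, 4], "Column3": ["E", "B", "C", "F"]})

common, left_only, right_only = split_records(df1, df2)
print(len(common), len(left_only), len(right_only))  # 3 0 1
```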

Global Comparison

You can also call the global_comparison function, which runs all of the comparisons above and prints their results at once.

common, left_only, right_only = global_comparison(df1, df2)
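In a double-run, the three returned values can be turned into a simple pass/fail summary. The sketch below assumes they are pandas DataFrames (or anything that supports len()). The helper name double_run_summary is hypothetical, and the three hand-built DataFrames stand in for the values returned by global_comparison.

```python
import pandas as pd


def double_run_summary(common, left_only, right_only):
    """Count the records in each group and flag whether the two runs match."""
    counts = {
        "common": len(common),
        "left_only": len(left_only),
        "right_only": len(right_only),
    }
    # The runs match when no record is exclusive to either side.
    counts["match"] = counts["left_only"] == 0 and counts["right_only"] == 0
    return counts


# Stand-ins for the values returned by the comparison (assumption for this example).
common = pd.DataFrame({"Column1": [1, 2, 3]})
left_only = pd.DataFrame({"Column1": []})
right_only = pd.DataFrame({"Column1": [4]})

summary = double_run_summary(common, left_only, right_only)
print(summary)  # {'common': 3, 'left_only': 0, 'right_only': 1, 'match': False}
```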

Spark Usage

Usage with Spark DataFrames is exactly the same, but the functions must be imported from doublerun.spark instead:

import pandas as pd
from doublerun.spark import (
    comparison_count,
    comparison_schema,
    comparison_columns,
    comparison_records,
    global_comparison
)

df1 = pd.DataFrame({
    'Column1' : [1, 2, 3],
    'Column2' : ['A', 'B', 'C']
})

df2 = pd.DataFrame({
    'Column1' : [1, 2, 3, 4],
    'Column3' : ['E', 'B', 'C', 'F']
})

# `spark` is the active SparkSession (predefined in Databricks notebooks)
df1 = spark.createDataFrame(df1)
df2 = spark.createDataFrame(df2)

comparison_count(df1, df2)
comparison_schema(df1, df2)
comparison_columns(df1, df2)
common, left_diff, right_diff = comparison_records(df1, df2)
# common, left_diff, right_diff = global_comparison(df1, df2)

Contributing

To contribute, create a branch with a meaningful name that reflects the feature you would like to develop (examples: pandas_visualisation_mismatches, pandas_high_perf_dask, spark_notebooks, etc.). Please check the existing branches before creating a new one.

Then open a pull request against the dev branch, so that no conflicts arise when multiple branches are merged together.

Credits

Thanks to Bilel BOUACHA of the HyperVision Team for providing the basis for the code contained in this package. The code was slightly refactored so that it can be used as a general comparison tool between two Spark or pandas DataFrames.

Project details

Release history

0.0.3 (this version): Oct 31, 2023
0.0.2: Oct 27, 2023
0.0.1: Oct 27, 2023
0.0.0: Oct 27, 2023

Download files

Source Distribution

doublerun-0.0.3.tar.gz (6.3 kB), uploaded Oct 31, 2023 (source)

Built Distribution

doublerun-0.0.3-py3-none-any.whl (5.7 kB), uploaded Oct 31, 2023 (Python 3)

File details

Details for the file doublerun-0.0.3.tar.gz.

File metadata

Download URL: doublerun-0.0.3.tar.gz
Upload date: Oct 31, 2023
Size: 6.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for doublerun-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`a3e19e520f111c2227c14c3bc15e9fd5d80a50c57570df8a26d729a06d5bf935`
MD5	`202b19118e4835e9024d1414ba471573`
BLAKE2b-256	`315f0187aeefe18eb00f5c26b65cf1334f3859a56cd73e56fea7f72523e852e0`
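A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names sha256_of_file and matches_published_hash are hypothetical, and the path to the downloaded archive is left for the reader to supply.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the digest of a local file with a published SHA256 value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash would be called with the local path of doublerun-0.0.3.tar.gz and the SHA256 value from the table above.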

File details

Details for the file doublerun-0.0.3-py3-none-any.whl.

File metadata

Download URL: doublerun-0.0.3-py3-none-any.whl
Upload date: Oct 31, 2023
Size: 5.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for doublerun-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2095593e2ed30b241862fbee561606c50c4a8427dcb05acd2c6873765f53632f`
MD5	`8a57873e9c472ac6246eb379df3f393f`
BLAKE2b-256	`42ea908e6ae1348c41775f92320f5c06038efb3de545f5a20bda4a59df6dc274`
