Skip to main content

DATAFRAMES COMPARISON

Project description

What is this about?

Imagine you want to compare various iterations of a table or data file. You may need to ensure that your versions of a list that is maintained by more than one person or system are same. When you have to solve this manually, it is difficult. So, let's make use of this library right now!

How does it work?

The script takes an entry's ID value from the first table, looks for it in the second table and sees if it can find it there. If it doesn't, it will list this entry as an entry that was found in table 1, but not in table 2. This will happen vice versa for table 2 against table 1. Moreover, the script compares values of entries with the same ID, and if they are different, the script lists these differences.

What does it require?

The table entries MUST have a unique identifier, or key value. For example, this could be an order number, or an employee number - but it could also be other characters, as long as it is unique to an entry. This key value MUST be located in the first column of your table. The orders of the other columns, as well as the orders of the rows in general, is completely irrelevant.You have a use case, too, we hope this library will save you some time!

Version Release Beta

0.0.1 (29/06/2022)

pip install pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

pip install numpy

NumPy is a Python library used for working with arrays. It also has functions for working in domain of linear algebra, fourier transform, and matrices.

import pandas as pd
import numpy as np

'''// 
DESCRIPTION:
The comparison requires comparing table1 against table2 and vice versa in order to find values that are in one table, 
but not the other,therefore, the files are imported based on current comparison mode, and in mode 2to1, their order 
is switched to avoid differences being identified based on different data formats: convert numbers to equal format/
"syntax", then to string 
///'''

def toString(x):
    # print('from ', x, type(x))
    if type(x) == str:
        return x
    if type(x) == np.float64 or type(x) == np.int64:
        x.tolist()
    if x == None:
        x = 'NaN'
    x = str(float(x))
    if x[-1] == '0':
        x = x[0:-2]  # removing .0 from the float
# print('to ', x, type(x))
return str(x)


'''///----Comparison funct----///''' 

def runComparison(findDifferingValues, fileName1, fileName2, df1, df2):
    df_comparison = pd.DataFrame()  # tables have to be initalized because they are later referred to for concatenation
    df_entrynotFound = pd.DataFrame()

# in both tables, the first column must be the one that contains the entries' ID/key
keyCol = df2.iloc[:, 0]  # 1st column in the table we're comparing to
keyColArray = list(map(toString, keyCol.values))  # converting entries to string for comparison
# print('to str: ', test, pd.Series(keyColArray))
keyCol = pd.Series(keyColArray)

'''//---compare rows funct---//'''

def compareRows(rowA):

    key = toString(rowA[0])  # saving the key value from 1st table so I can get corresponding value from 2nd table
    deviations = pd.DataFrame()  # here's where I'll store the info on deviating entries

    # now going through every col entry in rowA
    for category, value in rowA.items():

        if category not in df2.columns:  # if category(column) does not exist in table 2, ignore & break
            break

        valueBList = pd.Series(df2[category].where(
            keyCol == key).values)  # identifiying value for same ID/col in table2 > array with comparison resut for each row
        valueB = list(filter(lambda x: pd.isna(x) == False,
                             valueBList))  # from results array, filtering out only those values that are not 'NaN'

        if (pd.isna(valueBList.values)).all():  # if the result contains only NaN values > cell is empty:
            if (pd.isna(value) or value == float(
                    'nan')):  # see if corresponding value in rowA is NaN, too, in which case don't list as difference
                break
            else:  # if corresponding value in rowA is not NaN, redefine valueB so it's not an empty array
                valueB = ['#empty!']

        value = toString(value)
        valueB[0] = toString(valueB[0])

        # print(value, valueB[0])
        if value != valueB[
            0]:  # now actually comparing value of key/colName in table1 vs same key/colName in table2

            # print('deviating value: ', valueB[0])
            # print(f'Difference found in entry with ID {key}, in category: {category}') #optional: log the info
            newRow = [[key, category, value, str(valueB[0])]]
            # print(newRow) #activate when debugging
            add = pd.DataFrame(newRow, columns=[('Entry ID in ' + fileName1), 'Difference found in',
                                                ('Value in ' + fileName1), ('Value in ' + fileName2)])
            deviations = pd.concat([deviations, add])

    return deviations

 #--------------------------------end of compare rows funct----------------------------------//'''

for currentRow, values in df1.iterrows():  # going through each row in table 1

    rowDf1 = pd.Series(df1.iloc[currentRow])  # grab values from current row of table 1, save as Series
    key = toString(rowDf1.values[0])  # set entry in 1st column as key to look for in table 2

    '''//now, look for this key in table 2. If it is not found, add to "not found" sheet for final results file//'''
    if df2.loc[keyCol == key].empty:
        newRow = [[key, rowDf1.index[1], rowDf1.values[1]]]
        result = pd.DataFrame(newRow, columns=[('Key not found in ' + fileName2), ('name next col ' + fileName1),
                                               ('value next col ' + fileName1)])
        df_entrynotFound = pd.concat([df_entrynotFound, result])  # add result to compare df
        # print(f'ID no. {key} not found in table 2') #optional: log the info

        '''//if it is found: identify differences and store them in table for for final results file, except for 
            comparison mode '2to1' where differences have already been found + listed and we are only interested 
            in entries that are in table2 but not table1 //'''
    
    if df2.loc[keyCol == key].empty == False and findDifferingValues == True:
        result = compareRows(
            rowDf1)    # pass rowDf1 to compare funct to compare it to corresponding row in table2
        df_comparison = pd.concat([df_comparison, result])
        # print(result)

return [df_comparison, df_entrynotFound]


'''//-----End of comparison funct-------//'''

'''//save results as CSV files, as well as different sheets in same Excel file. 
    Use excel format and writer for that. Index not included in files//'''

def saveToFile(data, name, writer):
    data.to_excel(writer, sheet_name=str(name), index=False)
    #data.to_csv(str(name), index=False)

Usage

Install the published library:

    pip install comp-df-syren==0.2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

comp_df_syren-0.2.1-py3-none-any.whl (7.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page