Project description
Data Quality Framework Governance (DQFG)
Data Quality Framework Governance is a structured approach to assessing, monitoring, and improving the quality of data. An effective Data Quality Framework considers these dimensions and integrates them into a structured approach to ensure that data serves its intended purpose, supports informed decision-making, and maintains the trust of users and stakeholders.
Data Quality is an ongoing process that requires continuous monitoring, assessment, and improvement to adapt to changing data requirements and evolving business needs.
Example: calling a function from the library.
from DataQualityFrameworkGovernance import Uniqueness as uq
print(uq.duplicate_rows(dataframe))
Library structure
Accuracy
accuracy_tolerance_numeric
Calculates the accuracy of a set of numeric values (base column) by comparing each to a known correct value (lookup column) within a user-defined tolerance percentage.
from DataQualityFrameworkGovernance import Accuracy as ac
print(ac.accuracy_tolerance_numeric(dataframe, 'base_column', 'lookup_column', tolerance_percentage))
email_pattern
Validates email addresses in a dataset by verifying that they follow a valid email format.
from DataQualityFrameworkGovernance import Accuracy as ac
print(ac.email_pattern(dataframe,'email_column_name'))
filter_number_range
A number-range filter ensures that data values conform to expected bounds or constraints. It applies in a variety of contexts, including exam scores, weather readings, pricing, stock prices, age, income, vehicle speed limits, water levels, and many other scenarios.
from DataQualityFrameworkGovernance import Accuracy as ac
print(ac.filter_number_range(dataframe, 'range_column_name', lower_bound, upper_bound))
Example:
print(ac.filter_number_range(df, 'Age', 4, 12))
(Output extracts rows where 'Age' is between 4 (lower bound) and 12 (upper bound) from the dataset 'df')
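The example above can be sketched in plain pandas. This is a minimal illustration of the kind of inclusive boolean-mask filter the function is described as performing; the dataframe and its values are hypothetical:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Age": [3, 4, 8, 12, 15]})

lower_bound, upper_bound = 4, 12
# Series.between is inclusive of both bounds by default.
in_range = df[df["Age"].between(lower_bound, upper_bound)]
```

Here `in_range` keeps the rows with ages 4, 8, and 12, matching the described behaviour.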
filter_datetime_range
The datetime-range filter checks that date values fall within predetermined bounds. It applies in a variety of contexts, including catching outliers in dates of birth, ages, and similar fields.
from DataQualityFrameworkGovernance import Accuracy as ac
print(ac.filter_datetime_range(dataframe, 'range_column_name', 'from_date', 'to_date', 'date_format'))
Example:
print(ac.filter_datetime_range(df, 'Date', '2023-01-15', '2023-03-01', '%Y-%m-%d'))
Important: specify the date format using strftime codes (e.g. '%Y-%m-%d %H:%M:%S.%f'); any format may be used, as long as the parameter matches the data.
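A plain-pandas sketch of the same idea, with hypothetical data: parse the column using the stated format, then keep rows inside the date window.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Date": ["2023-01-10", "2023-02-01", "2023-03-15"]})

dates = pd.to_datetime(df["Date"], format="%Y-%m-%d")
# Keep rows between the two bounds (inclusive).
mask = dates.between(pd.Timestamp("2023-01-15"), pd.Timestamp("2023-03-01"))
filtered = df[mask]
```

Only '2023-02-01' falls inside the window, so `filtered` contains that single row.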
Completeness
missing_values
Summary of missing values in each column.
from DataQualityFrameworkGovernance import Completeness as cp
print(cp.missing_values(dataframe))
overall_completeness_percentage
Percentage of missing values in a DataFrame.
from DataQualityFrameworkGovernance import Completeness as cp
print(cp.overall_completeness_percentage(dataframe))
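The two completeness checks above can be sketched in plain pandas on hypothetical data: count missing values per column, then express the filled cells as a percentage of all cells.

```python
import pandas as pd
import numpy as np

# Hypothetical data with two missing cells (one per column).
df = pd.DataFrame({"name": ["Tom", None, "Jerry"],
                   "age": [10.0, 20.0, np.nan]})

missing_per_column = df.isna().sum()                       # per-column counts
completeness_pct = 100 * df.notna().sum().sum() / df.size  # filled / total cells
```

With 4 of 6 cells filled, completeness here is roughly 66.67%.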
Consistency
start_end_date_consistency
Checks whether data in two date columns is consistent, i.e. whether the "Start Date" and "End Date" columns are in correct chronological order.
from DataQualityFrameworkGovernance import Consistency as ct
print(ct.start_end_date_consistency(dataframe, 'start_date_column_name', 'end_date_column_name', 'date_format'))
Important: specify the date format using strftime codes (e.g. '%Y-%m-%d %H:%M:%S.%f'); any format may be used, as long as the parameter matches the data.
count_start_end_date_consistency
Counts rows where the "Start Date" and "End Date" columns are in correct chronological order.
from DataQualityFrameworkGovernance import Consistency as ct
print(ct.count_start_end_date_consistency(dataframe, 'start_date_column_name', 'end_date_column_name', 'date_format'))
Important: specify the date format using strftime codes (e.g. '%Y-%m-%d %H:%M:%S.%f'); any format may be used, as long as the parameter matches the data.
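Both consistency checks above reduce to the same comparison, sketched here in plain pandas on hypothetical data: parse both columns, flag rows where start precedes (or equals) end, and count the flags.

```python
import pandas as pd

# Hypothetical data; the second row is out of chronological order.
df = pd.DataFrame({
    "start_date": ["2023-01-01", "2023-05-01"],
    "end_date":   ["2023-02-01", "2023-04-01"],
})

start = pd.to_datetime(df["start_date"], format="%Y-%m-%d")
end = pd.to_datetime(df["end_date"], format="%Y-%m-%d")
consistent = start <= end                # row-level consistency flags
consistent_count = int(consistent.sum())  # how many rows pass
```

Here the first row is consistent and the second is not, so the count is 1.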
Uniqueness
duplicate_rows
Identify and display duplicate rows in a dataset.
from DataQualityFrameworkGovernance import Uniqueness as uq
print(uq.duplicate_rows(dataframe))
unique_column_values
Display unique column values in a dataset.
from DataQualityFrameworkGovernance import Uniqueness as uq
print(uq.unique_column_values(dataframe, 'column_name'))
unique_column_count
Count unique column values in a dataset.
from DataQualityFrameworkGovernance import Uniqueness as uq
print(uq.unique_column_count(dataframe, 'column_name'))
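The three uniqueness checks above map onto standard pandas operations; this sketch uses hypothetical data to show duplicate detection, unique values, and unique counts:

```python
import pandas as pd

# Hypothetical data containing one fully duplicated row.
df = pd.DataFrame({"id": [1, 2, 2], "name": ["Tom", "Jerry", "Jerry"]})

duplicates = df[df.duplicated(keep=False)]    # all copies of duplicated rows
unique_names = df["name"].unique().tolist()   # distinct values in a column
unique_count = df["name"].nunique()           # number of distinct values
```

`keep=False` flags every copy of a duplicated row, so both "Jerry" rows appear in `duplicates`.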
Validity
validate_age
Validate age based on the criteria in a dataset.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.validate_age(dataframe, 'age_column', min_age, max_age))
validate_age_count
Count age based on the criteria in a dataset.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.validate_age_count(dataframe, 'age_column', min_age, max_age))
is_within_range
Checks whether all values in a given list are present in a specific column of a dataset, and returns a status message indicating whether all values were found. List values must be enclosed in square brackets.
#Examples
#array list = ["Tom", "Jerry", "Donald"] - Text
#array list = [10, 20, 30] - Numeric
#array list = [True, False] - Boolean
#array list = [0, 1] - Flag
from DataQualityFrameworkGovernance import Validity as vl
print(vl.is_within_range(dataframe, 'column_name_to_look', [array_list]))
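A plain-pandas sketch of this membership check, on hypothetical data: compare the expected values against the set of values actually present in the column.

```python
import pandas as pd

# Hypothetical data; 'Donald' is absent from the column.
df = pd.DataFrame({"name": ["Tom", "Jerry", "Spike"]})
expected_values = ["Tom", "Jerry", "Donald"]

column_values = set(df["name"])
all_found = set(expected_values) <= column_values       # subset test
missing = [v for v in expected_values if v not in column_values]
```

Since "Donald" is missing, `all_found` is False and `missing` lists the absent value.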
is_number_in_column
Examines each value in a column and appends a new column to the dataset, indicating whether each value is numeric.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.is_number_in_column(dataframe, 'column_name'))
is_number_in_dataset
Examines each value in a dataset and appends a new column for each existing column, indicating whether the values are numeric.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.is_number_in_dataset(dataframe))
#Example for specific column selection
print(vl.is_number_in_dataset(dataframe[['column1','column7']]))
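The numeric check above can be sketched with `pd.to_numeric` on hypothetical data: values that fail to parse become NaN, so `notna()` yields the numeric flag.

```python
import pandas as pd

# Hypothetical data; flags which values parse as numbers.
df = pd.DataFrame({"value": ["10", "abc", "3.5"]})

# errors="coerce" turns non-numeric strings into NaN.
df["value_is_number"] = pd.to_numeric(df["value"], errors="coerce").notna()
```

Both "10" and "3.5" parse as numbers, while "abc" does not.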
is_text_in_column
Examines each value in a column and appends a new column to the dataset, indicating whether each value is text. The result is False if the string contains a number.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.is_text_in_column(dataframe, 'column_name'))
is_text_in_dataset
Examines each value in a dataset and appends a new column for each existing column, indicating whether the values are text. The result is False if the string contains a number.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.is_text_in_dataset(dataframe))
#Example for specific column selection
print(vl.is_text_in_dataset(dataframe[['column1','column7']]))
is_date_in_column
Examines each value in a column and appends a new column to the dataset, indicating whether each value is a valid date in a specified format.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.is_date_in_column(dataframe,'column_name', date_format))
Important: specify the date format using strftime codes (e.g. '%Y-%m-%d %H:%M:%S.%f'); any format may be used, as long as the parameter matches the data.
is_date_in_dataset
Examines each value in a dataset and appends a new column for each existing column, indicating whether the values are valid dates in a specified format.
from DataQualityFrameworkGovernance import Validity as vl
print(vl.is_date_in_dataset(dataframe, date_format))
#Example for specific column selection
print(vl.is_date_in_dataset(dataframe[['column1','column7']], date_format='%Y-%m-%d'))
Important: specify the date format using strftime codes (e.g. '%Y-%m-%d %H:%M:%S.%f'); any format may be used, as long as the parameter matches the data.
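The date-validity checks above can be sketched with `pd.to_datetime` on hypothetical data: values that do not match the given format become NaT, so `notna()` yields the validity flag.

```python
import pandas as pd

# Hypothetical data; only the first value matches the '%Y-%m-%d' format.
df = pd.DataFrame({"when": ["2023-01-15", "15/01/2023"]})

# errors="coerce" turns values that fail the format into NaT.
parsed = pd.to_datetime(df["when"], format="%Y-%m-%d", errors="coerce")
df["when_is_date"] = parsed.notna()
```

The second value uses day/month/year ordering, fails the stated format, and is flagged False.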
Datastats
count_rows
Count the number of rows in a DataFrame.
from DataQualityFrameworkGovernance import Datastats as ds
print(ds.count_rows(dataframe))
count_columns
Count the number of columns in a DataFrame.
from DataQualityFrameworkGovernance import Datastats as ds
print(ds.count_columns(dataframe))
count_dataset
Count the number of rows & columns in a DataFrame.
from DataQualityFrameworkGovernance import Datastats as ds
print(ds.count_dataset(dataframe))
limit_max_length
Limits a string to a specified maximum length. For example, when applied to the input string 'ABCDEFGH' with a start of 0 and a length of 5, the function returns 'ABCDE', truncating the string to its first five characters.
from DataQualityFrameworkGovernance import Datastats as ds
print(ds.limit_max_length(dataframe, column_name, start_length, length))
#Example: 'ABCDEFGH' input string returns 'ABCDE'
print(limit_max_length(df,'column_name',0,5))
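A plain-pandas sketch of the same truncation, on a hypothetical series: `str.slice` takes the characters from the start position up to the given length.

```python
import pandas as pd

# Hypothetical data; strings shorter than the limit pass through unchanged.
s = pd.Series(["ABCDEFGH", "XY"])

truncated = s.str.slice(0, 5)  # keep characters 0..4
```

'ABCDEFGH' becomes 'ABCDE', while 'XY' is already within the limit and is unchanged.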
Data Interoperability
data_migration_reconciliation
Data migration reconciliation is a crucial step in ensuring the accuracy and integrity of data transferred between a source and a target system. The process compares the source and target data to identify any disparities. If the columns in the two datasets differ, the process returns a message asking for the source and target datasets to be aligned. Once structural alignment is confirmed, a comprehensive check is performed by comparing the content of each column.
Any inconsistencies between the source and target data are flagged as mismatches. This includes the identification of specific 'column name(s)' where discrepancies occur, 'row number or position' and 'mismatched records' in both the source and target datasets. This comprehensive reporting ensures that discrepancies can be easily located and addressed, promoting data accuracy and the successful completion of the migration process.
from DataQualityFrameworkGovernance import Interoperability as io
print(io.data_migration_reconciliation(source_dataframe, target_dataframe))
#Example of saving source and target dataframe from csv file
import pandas as pd
source_dataframe = pd.read_csv('source_data.csv')
target_dataframe = pd.read_csv('target_data.csv')
Result
| Column | Row no. / Position | Source Data | Target Data |
|---|---|---|---|
| Column name | 2 | 33 | 3 |
| Column name | 289 | Donald Trump | Donald Duck |
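The reconciliation described above can be sketched in plain pandas on hypothetical data: confirm the column sets match, build a cell-level difference mask, then report column, row position, and both values for every mismatch.

```python
import pandas as pd

# Hypothetical source/target data with one mismatched cell.
source = pd.DataFrame({"id": [1, 2], "name": ["Donald Duck", "Mickey"]})
target = pd.DataFrame({"id": [1, 2], "name": ["Donald Trump", "Mickey"]})

# Structural alignment check, then cell-by-cell comparison.
assert list(source.columns) == list(target.columns)
diff_mask = source.ne(target)
mismatches = [
    {"column": col, "row": int(i),
     "source": source.at[i, col], "target": target.at[i, col]}
    for col in source.columns
    for i in source.index[diff_mask[col]]
]
```

Here the single mismatch is in column 'name' at row 0, mirroring the report layout shown above.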
Supporting Python libraries:
- Pandas
- re
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file DataQualityFrameworkGovernance-0.0.10.9.tar.gz.
File metadata
- Download URL: DataQualityFrameworkGovernance-0.0.10.9.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 24180a6053d5a3d3fddbc402a2d16c47a3f4be7bf4ad5a057517f8c073922bf8 |
| MD5 | 11cffe28268088a4dcb04c52e3d79559 |
| BLAKE2b-256 | e1e503c2767b2ca4428e67a85fcf8c5c947b5b328868ee876b5a179e6184b887 |
File details
Details for the file DataQualityFrameworkGovernance-0.0.10.9-py3-none-any.whl.
File metadata
- Download URL: DataQualityFrameworkGovernance-0.0.10.9-py3-none-any.whl
- Upload date:
- Size: 12.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6bba15c9c0ea0d6cc48cec3da44816b234570a9d5c1476e8a78632b1fcd34d48 |
| MD5 | 770530afec806b7fe20c28e28d00abf2 |
| BLAKE2b-256 | 3e022bea0f51c99ad73fba36aaae2141d8a87bf9d0748532a796d58a99141c54 |