Skip to main content

A Python Tool for Structured Data Quality Profiling

Project description

DQMaRC: A Python Tool for Structured Data Quality Profiling

Version: 1.0.2 Author: Anthony Lighterness and Michael Adcock
License: MIT License and Open Government License v3

Project Status: Active – The project has reached a stable, usable state and is being actively developed.


Overview

DQMaRC (Data Quality Markup and Ready-to-Connect) is a Python tool designed to facilitate comprehensive data quality profiling of structured tabular data. It allows data analysts, engineers, and scientists to systematically assess and manage the quality of their datasets across multiple dimensions including completeness, validity, uniqueness, timeliness, consistency, and accuracy.

DQMaRC can be used both programmatically within Python scripts and interactively through a Shiny web application front-end user interface, making it versatile for different use cases ranging from ad-hoc analysis to integration within larger data pipelines.

Key Features

  • Multi-dimensional Data Quality Checks: Evaluate datasets across key dimensions including Completeness, Validity, Uniqueness, Timeliness, Consistency, and Accuracy.
  • Customisable Test Parameters: Configure data quality test parameters easily via python or a user friendly spreadsheet to tailor your data quality assessment to your dataset.
  • Interactive Shiny App: Setup, run, explore and visualise data quality issues interactively through a Shiny app for Python.
  • Integration with Data Pipelines: Easily integrate DQMaRC into your data processing pipelines for scheduled data quality checks.
  • Detailed Reporting: Generate comprehensive reports detailing data quality issues at both the cell and aggregate levels.

Installation

Using Pip or Conda

You can install DQMaRC using pip or conda. Ensure you have a virtual environment activated.

pip DQMaRC
conda install DQMaRC

Dependencies

The package dependencies are listed in the requirements.txt file and will be installed automatically during the installation of DQMaRC.

Getting Started

1. Import Libraries

Start by importing the necessary libraries and DQMaRC modules in your Python environment.

import pandas as pd
from DQMaRC import DataQuality

2. Load Your Data

Load the dataset you wish to profile.

# Load your data
df = pd.read_csv('path_to_your_data.csv')

3. Initialise DQMaRC and Set Test Parameters

Initialise the DQ tool and set the test parameters. You can generate a template or import predefined parameters.

# Initialise the Data Quality object
dq = DataQuality(df)

# Generate test parameters template
test_params = dq.get_param_template()

# (Optional) Load pre-configured test parameters
# test_params = pd.read_csv('path_to_test_parameters.csv')

# Set the test parameters
dq.set_test_params(test_params)

4. Run Data Quality Checks

Run the data quality checks across all dimensions.

dq.run_all_metrics()

5. Retrieve and Save Results

Retrieve the full results and join them with your original dataset for detailed analysis.

# Get the full results
full_results = dq.raw_results()

# Join results with the original dataset
df_with_results = df.join(full_results, how="left")

# Save results to a CSV file
df_with_results.to_csv('path_to_save_results.csv', index=False)

Using the Shiny App

In addition to programmatic usage, DQMaRC includes an interactive Shiny web app for Python that allows users to explore and visualise data quality issues.

You can test the DQMaRC ShinyLive Demo by copying and pasting the URL located HERE into your webbrowser. This link will take you to a ShinyLive Editor where you can test the DQMaRC functionality. If you encounter an error, try refreshing the webpage once or twice. If you still encounter an error after this, please feel free to get in touch by contacting us or raising an issue on our repository.

PLEASE NOTE The ShinyLive UI is recommended only for testing and getting used to the DQMaRC too functionality. This interface is deployed on your machine, meaning it is only as secure as your machine is. It will store data you upload in its local memory before being wiped when you exit the app.

Running the Shiny App

To run the Shiny app, use the following command in your terminal:

shiny run --reload --launch-browser path_to_your_app/app.py

Deploying the Shiny App

For deploying the Shiny app on a server, follow the official Shiny for Python deployment guide.

Documentation

Comprehensive documentation for DQMaRC, including detailed API references and user guides, is available HERE or in the project docs/ directory.

Repo Structure

Top-level Structure


DQMaRC	
│   requirements.txt 			# package dependencies
│   setup.py	 			# setup configuration for the python package distribution
│       
├───docs	 			# user docs material
│   │...   
│           
├───DQMaRC  				# source code
│   │   Accuracy.py
│   │   app.py
│   │   Completeness.py
│   │   Consistency.py
│   │   DataQuality.py
│   │   Dimension.py
│   │   Timeliness.py
│   │   Uniqueness.py
│   │   UtilitiesDQMaRC.py
│   │   Validity.py
│   │   __init__.py
│   │   
│   ├───data	 			# data used in the tutorial(s)
│   │   │   DQ_df_full.csv
│   │   │   test_params_definitions.csv
│   │   │   toydf_subset.csv
│   │   │   toydf_subset_test_params_24.05.16.csv
│   │   │   
│   │   └───lookups	 		# data standards and or value lists for data validity checks
│   │           LU_toydf_gender.csv
│   │           LU_toydf_ICD10_v5.csv
│   │           LU_toydf_M_stage.csv
│   │           LU_toydf_tumour_stage.csv
│   │           
│   ├───notebooks	
│   │      Backend_Tutorial.ipynb   	# Tutorial for python users
│...

Contributing

Contributions to DQMaRC are welcome! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project.

License

DQMaRC is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

This project was developed by Anthony Lighterness and Michael Adcock. Special thanks to all contributors and testers who helped in the development of this tool.

Citation

Please use the following citation if you use DQMaRC:

Lighterness, A., Adcock, M.A., and Price, G. (2024). DQMaRC: A Python Tool for Structured Data Quality Profiling (Version 1.0.0) [Software]. Available from https://github.com/christie-nhs-data-science/DQMaRC.

Notice on Maintenance and Support

Please Note: This library is an open-source project maintained by a small team of contributors. While we strive to keep the package updated and well-maintained, ongoing support and development may vary depending on resource availability.

We strongly encourage users to engage with the project by reporting any issues, errors, or suggestions for improvements. Your feedback is invaluable in helping us identify and prioritise areas for improvement. Please feel free to submit questions, bug reports, or feature requests via our GitHub issues page or by reaching out.

Thank you for your understanding and for contributing to the growth and improvement of this project!


For more information, please visit the project repository

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dqmarc-1.0.3.tar.gz (328.6 kB view details)

Uploaded Source

Built Distribution

DQMaRC-1.0.3-py3-none-any.whl (337.5 kB view details)

Uploaded Python 3

File details

Details for the file dqmarc-1.0.3.tar.gz.

File metadata

  • Download URL: dqmarc-1.0.3.tar.gz
  • Upload date:
  • Size: 328.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for dqmarc-1.0.3.tar.gz
Algorithm Hash digest
SHA256 fa5c6fc7fe69152fcacb2f7dcb8557ea90a2a5e4728163d57896fe4e87e235fa
MD5 8ef5f424365dee0c9f90224b5335d012
BLAKE2b-256 1caf76a36b06c71215005aab3c81537cd06fa3f00a3edd65b77f441b28c54756

See more details on using hashes here.

File details

Details for the file DQMaRC-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: DQMaRC-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 337.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for DQMaRC-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 57f9e5a53486bb07a47fca57df462bdcd52a219d8d0b54f291b60c6cf75a6165
MD5 566cad88c4bd28a7af97d642fc94516a
BLAKE2b-256 35dd7a5fa11eceecfbc21dab543066eb15d1ee3397fbc5f75256f0f513c44cf5

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page