A Python Tool for Structured Data Quality Profiling
DQMaRC: A Python Tool for Structured Data Quality Profiling
Version: 1.0.2
Author: Anthony Lighterness and Michael Adcock
License: MIT License and Open Government License v3
Overview
DQMaRC (Data Quality Markup and Ready-to-Connect) is a Python tool designed to facilitate comprehensive data quality profiling of structured tabular data. It allows data analysts, engineers, and scientists to systematically assess and manage the quality of their datasets across multiple dimensions including completeness, validity, uniqueness, timeliness, consistency, and accuracy.
DQMaRC can be used both programmatically within Python scripts and interactively through a Shiny web application front-end user interface, making it versatile for different use cases ranging from ad-hoc analysis to integration within larger data pipelines.
Key Features
- Multi-dimensional Data Quality Checks: Evaluate datasets across key dimensions including Completeness, Validity, Uniqueness, Timeliness, Consistency, and Accuracy.
- Customisable Test Parameters: Configure data quality test parameters via Python or a user-friendly spreadsheet to tailor the assessment to your dataset.
- Interactive Shiny App: Set up, run, explore and visualise data quality issues interactively through a Shiny app for Python.
- Integration with Data Pipelines: Easily integrate DQMaRC into your data processing pipelines for scheduled data quality checks.
- Detailed Reporting: Generate comprehensive reports detailing data quality issues at both the cell and aggregate levels.
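To make the difference between cell-level and aggregate-level results concrete, here is a minimal sketch in plain Python. It is not DQMaRC's implementation; the records, column names, and helper functions are illustrative assumptions, and only completeness and uniqueness are shown.

```python
from collections import Counter

# Toy records; None marks a missing value. Column names are illustrative only.
rows = [
    {"patient_id": "A1", "gender": "F", "age": 34},
    {"patient_id": "A2", "gender": None, "age": 51},
    {"patient_id": "A2", "gender": "M", "age": None},
]

def completeness_flags(records):
    """Cell level: 1 where a value is missing, 0 where it is present."""
    return [{col: int(val is None) for col, val in rec.items()} for rec in records]

def aggregate(flags):
    """Aggregate level: percentage of flagged cells per column."""
    totals = Counter()
    for rec in flags:
        totals.update(rec)  # adds each column's 0/1 flag to its running total
    return {col: round(100 * totals[col] / len(flags), 1) for col in flags[0]}

def duplicate_flags(records, key):
    """Cell level uniqueness: 1 where the key value occurs more than once."""
    counts = Counter(rec[key] for rec in records)
    return [int(counts[rec[key]] > 1) for rec in records]

cell_flags = completeness_flags(rows)
summary = aggregate(cell_flags)
dupes = duplicate_flags(rows, "patient_id")
print(summary)
print(dupes)
```

The cell-level flags keep one value per original cell, so they can be joined back to the source rows, while the aggregate view condenses them into one figure per column.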
Installation
Using Pip or Conda
You can install DQMaRC using pip or conda. Ensure you have a virtual environment activated.
pip install DQMaRC
conda install DQMaRC
Dependencies
The package dependencies are listed in the requirements.txt file and will be installed automatically during the installation of DQMaRC.
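After installing, you may want to confirm which version ended up in the active environment. The sketch below uses only the standard library; it assumes the distribution is registered under the name DQMaRC, and the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str = "DQMaRC"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

found = installed_version()
print(f"DQMaRC version: {found}" if found else "DQMaRC is not installed in this environment")
```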
Getting Started
1. Import Libraries
Start by importing the necessary libraries and DQMaRC modules in your Python environment.
import pandas as pd
from DQMaRC import DataQuality
2. Load Your Data
Load the dataset you wish to profile.
# Load your data
df = pd.read_csv('path_to_your_data.csv')
3. Initialise DQMaRC and Set Test Parameters
Initialise the DQ tool and set the test parameters. You can generate a template or import predefined parameters.
# Initialise the Data Quality object
dq = DataQuality(df)
# Generate test parameters template
test_params = dq.get_param_template()
# (Optional) Load pre-configured test parameters
# test_params = pd.read_csv('path_to_test_parameters.csv')
# Set the test parameters
dq.set_test_params(test_params)
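If you prefer to edit the parameters in a spreadsheet, one possible workflow is to write the template to CSV, edit it, and read it back before calling dq.set_test_params(). The sketch below uses a stand-in DataFrame so it runs without DQMaRC; the column names Field and check_missing are hypothetical placeholders, not the real template schema.

```python
import io

import pandas as pd

# Stand-in for the frame returned by dq.get_param_template().
# The column names here are hypothetical placeholders, not DQMaRC's real schema.
template = pd.DataFrame(
    {
        "Field": ["patient_id", "gender"],
        "check_missing": [False, False],
    }
)

# Switch one check on for one field.
edited = template.copy()
edited.loc[edited["Field"] == "gender", "check_missing"] = True

# Round-trip through CSV, as you would when editing in a spreadsheet.
# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
edited.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Guard against columns being dropped or renamed during manual editing.
missing_cols = set(template.columns) - set(reloaded.columns)
if missing_cols:
    raise ValueError(f"Edited parameters lost columns: {sorted(missing_cols)}")

print(reloaded)
```

Checking the reloaded columns against the template before passing the frame on catches accidental renames early.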
4. Run Data Quality Checks
Run the data quality checks across all dimensions.
dq.run_all_metrics()
5. Retrieve and Save Results
Retrieve the full results and join them with your original dataset for detailed analysis.
# Get the full results
full_results = dq.raw_results()
# Join results with the original dataset
df_with_results = df.join(full_results, how="left")
# Save results to a CSV file
df_with_results.to_csv('path_to_save_results.csv', index=False)
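Note that pandas join() aligns rows on index labels, so the join above relies on the results sharing the index of the original DataFrame. Whether raw_results() preserves a custom index is not stated here, so a quick check can be worthwhile. The sketch below uses stand-in frames and a hypothetical result column name to show what a mismatch looks like and one way to handle it.

```python
import pandas as pd

# Original data with a non-default index, e.g. after filtering rows.
df = pd.DataFrame({"age": [34, None, 51]}, index=[10, 11, 12])

# Stand-in for dq.raw_results(); the column name is a hypothetical placeholder.
results = pd.DataFrame({"age_is_missing": [0, 1, 0]}, index=[0, 1, 2])

# join() aligns on index labels, so mismatched labels yield NaN rather than an error.
misaligned = df.join(results, how="left")

# If the rows are known to be in the same order, copy df's index onto the results first.
if len(results) != len(df):
    raise ValueError("Results and data have different row counts")
aligned = df.join(results.set_axis(df.index, axis=0), how="left")

print(misaligned)
print(aligned)
```

Relabelling is only safe when both frames have the same row order; otherwise, reset the index on the original data before profiling.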
Using the Shiny App
In addition to programmatic usage, DQMaRC includes an interactive Shiny web app for Python that allows users to explore and visualise data quality issues.
You can test the DQMaRC ShinyLive Demo by copying and pasting the URL located HERE into your web browser. This link takes you to a ShinyLive Editor where you can try out DQMaRC's functionality. If you encounter an error, try refreshing the page once or twice. If the error persists, please get in touch by contacting us or raising an issue on our repository.
PLEASE NOTE: The ShinyLive UI is recommended only for testing and getting used to the DQMaRC tool's functionality. This interface runs on your machine, meaning it is only as secure as your machine is. Data you upload is held in local memory and wiped when you exit the app.
Running the Shiny App
To run the Shiny app, use the following command in your terminal:
shiny run --reload --launch-browser path_to_your_app/app.py
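If you would rather start the app from a Python script, for example as part of a project task runner, the same command can be assembled and passed to subprocess. This is a minimal sketch: the helper names shiny_command and launch are made up for this example, and it assumes the shiny executable is on your PATH.

```python
import shutil
import subprocess
import sys

def shiny_command(app_path: str) -> list[str]:
    """Build the argument list for the `shiny run` command shown above."""
    return ["shiny", "run", "--reload", "--launch-browser", app_path]

def launch(app_path: str) -> int:
    """Run the app and return the process exit code (1 if shiny is not found)."""
    if shutil.which("shiny") is None:
        print("The 'shiny' command was not found; is Shiny for Python installed?", file=sys.stderr)
        return 1
    return subprocess.run(shiny_command(app_path)).returncode

# Show the command that launch() would execute, without starting the app.
print(" ".join(shiny_command("path_to_your_app/app.py")))
```

Calling launch("path_to_your_app/app.py") blocks until the app is stopped.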
Deploying the Shiny App
For deploying the Shiny app on a server, follow the official Shiny for Python deployment guide.
Documentation
Comprehensive documentation for DQMaRC, including detailed API references and user guides, is available HERE or in the project docs/ directory.
Repo Structure
Top-level Structure
DQMaRC
│ requirements.txt # package dependencies
│ setup.py # setup configuration for the python package distribution
│
├───docs # user docs material
│ │...
│
├───DQMaRC # source code
│ │ Accuracy.py
│ │ app.py
│ │ Completeness.py
│ │ Consistency.py
│ │ DataQuality.py
│ │ Dimension.py
│ │ Timeliness.py
│ │ Uniqueness.py
│ │ UtilitiesDQMaRC.py
│ │ Validity.py
│ │ __init__.py
│ │
│ ├───data # data used in the tutorial(s)
│ │ │ DQ_df_full.csv
│ │ │ test_params_definitions.csv
│ │ │ toydf_subset.csv
│ │ │ toydf_subset_test_params_24.05.16.csv
│ │ │
│ │ └───lookups # data standards and/or value lists for data validity checks
│ │ LU_toydf_gender.csv
│ │ LU_toydf_ICD10_v5.csv
│ │ LU_toydf_M_stage.csv
│ │ LU_toydf_tumour_stage.csv
│ │
│ ├───notebooks
│ │ Backend_Tutorial.ipynb # Tutorial for python users
│...
Contributing
Contributions to DQMaRC are welcome! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project.
License
DQMaRC is licensed under the MIT License. See the LICENSE file for more details.
Acknowledgments
This project was developed by Anthony Lighterness and Michael Adcock. Special thanks to all contributors and testers who helped in the development of this tool.
Citation
Please use the following citation if you use DQMaRC:
Lighterness, A., Adcock, M.A., and Price, G. (2024). DQMaRC: A Python Tool for Structured Data Quality Profiling (Version 1.0.0) [Software]. Available from https://github.com/christie-nhs-data-science/DQMaRC.
Notice on Maintenance and Support
Please Note: This library is an open-source project maintained by a small team of contributors. While we strive to keep the package updated and well-maintained, ongoing support and development may vary depending on resource availability.
We strongly encourage users to engage with the project by reporting any issues, errors, or suggestions for improvements. Your feedback is invaluable in helping us identify and prioritise areas for improvement. Please feel free to submit questions, bug reports, or feature requests via our GitHub issues page or by reaching out.
Thank you for your understanding and for contributing to the growth and improvement of this project!
For more information, please visit the project repository.