SDQCPy is a comprehensive Python package designed for synthetic data management, quality control, and validation.

SDQCPy: A Comprehensive Python Package for Synthetic Data Management

Chinese version

Features

SDQCPy offers a comprehensive toolkit for synthetic data generation, quality assessment, and analysis:

  1. Data Synthesis: Generate synthetic data using various models.
  2. Quality Evaluation: Assess synthetic data quality through statistical tests, classification metrics, explainability analysis, and causal inference.
  3. End-to-End Analysis: Perform holistic analysis by integrating multiple evaluation methods to provide a comprehensive view of synthetic data quality.
  4. Results Display: Store the results in an HTML file.

Installation

You can install SDQCPy using pip:

pip install sdqcpy

Alternatively, you can install it from the source:

git clone https://github.com/T0217/sdqcpy.git
cd sdqcpy
pip install -e .
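
After installing, a quick check along these lines can confirm that the package's modules are importable. This is only a sketch: the module names are taken from the import statements used elsewhere in this README, and `missing_modules` is a hypothetical helper, not part of SDQCPy.

```python
from importlib.util import find_spec

# Module names as they appear in the import statements of this README.
SDQC_MODULES = ["sdqc_integration", "sdqc_data", "sdqc_synthesize"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(SDQC_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All SDQCPy modules are importable.")
```

If any name is reported as not importable, re-run the installation in the same Python environment you use to run your scripts.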

Results Display

SDQCPy provides a SequentialAnalysis class that runs the sequential analysis and stores the results in an HTML file.

Usage

Demo

The following code runs the sequential analysis and stores the results in an HTML file:

from sdqc_integration import SequentialAnalysis
from sdqc_data import read_data
import logging
import warnings

# Ignore warnings and set logging level to ERROR
warnings.filterwarnings('ignore')
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

# Set random seed
random_seed = 17

# To use your own data, replace these calls with pandas readers (e.g. pd.read_csv)
raw_data = read_data('3_raw')
synthetic_data = read_data('3_synth')

output_path = 'raw_synth.html'

# Perform sequential analysis
sequential = SequentialAnalysis(
    raw_data=raw_data,
    synthetic_data=synthetic_data,
    random_seed=random_seed,
    use_cols=None,
)
results = sequential.run()
sequential.visualize_html(output_path)

Data Synthesis

SDQCPy supports several synthesis methods, which are implemented using ydata-synthetic and SDV.

[!TIP]

Only minimal examples are shown here; the parameters of each model can be adjusted as needed.

  • YData Synthesizer

    import pandas as pd
    from sdqc_synthesize import YDataSynthesizer
    
    raw_data = pd.read_csv("raw_data.csv")  # Please replace with your own data path
    ydata_synth = YDataSynthesizer(data=raw_data)
    synthetic_data = ydata_synth.generate()
    

[!IMPORTANT]

In its latest version, ydata-synthetic has switched to using ydata-sdk. However, since data synthesis is only a supplementary feature of this library, SDQCPy has not yet been updated to match.

  • SDV Synthesizer

    import pandas as pd
    from sdqc_synthesize import SDVSynthesizer
    
    raw_data = pd.read_csv("raw_data.csv")  # Please replace with your own data path
    sdv_synth = SDVSynthesizer(data=raw_data)
    synthetic_data = sdv_synth.generate()
    

Workflow

SDQCPy uses the process shown below to perform the quality check and analysis:

---
title: Main Idea
---
flowchart TB
	%% Define the style
	classDef default stroke:#000,fill:none

	%% Define the nodes
	initial([Input Real Data and Synthetic Data])
	step1[Statistical Test]
	step2[Classification]
	step3[Explainability]
	step4[Causal Analysis]
	endprocess[Export HTML file]

    %% Define the relationships between nodes
    initial --> step1
    step1 --> step2
    step2 --> step3
    step3 --> step4
    step4 --> endprocess
  • Statistical Test: SDQCPy employs various methods for descriptive analysis, distribution comparison, and correlation testing, tailored to different data types.
  • Classification: SDQCPy employs machine learning models (SVC, RandomForestClassifier, XGBClassifier, LGBMClassifier) to evaluate the similarity between the real and synthetic data.
  • Explainability: SDQCPy employs several current mainstream explainability methods (Model-Based, SHAP, PFI) to evaluate the explainability of the synthetic data.
  • Causal Analysis: SDQCPy employs several causal structure learning methods and evaluation metrics to compare the adjacency matrices of the raw and synthetic data. These methods are implemented using gCastle.
  • End-to-End Analysis (named SequentialAnalysis): To avoid calling the individual modules one by one, all of the functions above are integrated into a single class. If you have specific needs, you can also use the individual modules on their own.
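
To make two of the ideas above concrete, here is a minimal pure-Python sketch of the kind of comparison the Statistical Test and Causal Analysis steps perform. It is not SDQCPy's implementation: `ks_statistic` and `adjacency_mismatches` are hypothetical helpers written for illustration. The first computes a two-sample Kolmogorov-Smirnov statistic for one numeric column, and the second counts the entries where two 0/1 adjacency matrices disagree.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Empirical CDFs only change at observed values, so checking those is enough.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def adjacency_mismatches(adj_real, adj_synth):
    """Count the entries where two equally shaped adjacency matrices differ."""
    same_shape = len(adj_real) == len(adj_synth) and all(
        len(row_r) == len(row_s) for row_r, row_s in zip(adj_real, adj_synth)
    )
    if not same_shape:
        raise ValueError("adjacency matrices must have the same shape")
    return sum(
        a != b
        for row_r, row_s in zip(adj_real, adj_synth)
        for a, b in zip(row_r, row_s)
    )


if __name__ == "__main__":
    # Toy values for illustration only, not real results.
    print(ks_statistic([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]))
    print(adjacency_mismatches([[0, 1], [0, 0]], [[0, 1], [1, 0]]))
```

A statistic near 0 means the two columns have similar distributions, and a mismatch count of 0 means the learned causal graphs are identical. A library such as gCastle provides richer graph metrics than this simple count.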

Support

Need help? Found a bug? Have ideas for collaboration? Reach out via GitHub Issues.

[!IMPORTANT]

Before reporting an issue on GitHub, please check the existing Issues to avoid duplicates.

If you wish to contribute to this library, please first open an Issue to discuss your proposed changes. Once discussed, you are welcome to submit a Pull Request.

License

Apache-2.0 @T0217
