Data Quality Check for Machine Learning
Project description
DQC Toolkit is a Python library and framework designed to help improve Machine Learning models by identifying and mitigating label errors in training datasets. Currently, DQC Toolkit offers CrossValCurate for curation of text classification datasets (binary / multi-class) using cross-validation-based selection.
Installation
DQC Toolkit can be installed using pip as shown below
pip install dqc-toolkit
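To confirm that the installation succeeded, one option is to look up the installed distribution with the Python standard library. This is a minimal sketch: the helper name installed_version is hypothetical and is not part of DQC Toolkit.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "dqc-toolkit" is the distribution name used in the pip command above.
print(installed_version("dqc-toolkit"))
```

A printed value of None suggests that the package is not visible to the interpreter running the check, for example because pip installed it into a different environment.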
Quick Start
Assuming your text classification data is stored as a pandas dataframe data, with each sample represented by the text column and its corresponding noisy label represented by the label column, here is how you use CrossValCurate -
from dqc import CrossValCurate
cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])
The result is stored in data_curated, a pandas dataframe similar to data with the following columns -
>>> data_curated.columns
['text', 'label', 'label_correctness_score', 'is_label_correct', 'predicted_label', 'prediction_probability']
- 'label_correctness_score' represents a normalized score quantifying the correctness of 'label'.
- 'is_label_correct' is a boolean flag indicating whether the given 'label' is correct (True) or incorrect (False).
- 'predicted_label' and 'prediction_probability' represent the curation model's prediction and the corresponding probability score.
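Once the curated dataframe is available, the columns described above can be used to separate trusted samples from those flagged for review. The sketch below does not call DQC Toolkit: it builds a small hypothetical stand-in for data_curated using the documented column names, and every value in it is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the output of CrossValCurate.fit_transform.
# Only the column names come from the documentation; the values are invented.
data_curated = pd.DataFrame(
    {
        "text": ["great movie", "terrible plot", "loved it", "waste of time"],
        "label": ["positive", "positive", "positive", "negative"],
        "label_correctness_score": [0.93, 0.12, 0.88, 0.95],
        "is_label_correct": [True, False, True, True],
        "predicted_label": ["positive", "negative", "positive", "negative"],
        "prediction_probability": [0.93, 0.88, 0.88, 0.95],
    }
)

# Keep only the samples whose given label was judged correct.
clean = data_curated[data_curated["is_label_correct"]].reset_index(drop=True)

# Collect the flagged samples for manual review, lowest score first.
flagged = data_curated[~data_curated["is_label_correct"]].sort_values(
    "label_correctness_score"
)

print(len(clean), len(flagged))
```

Whether flagged samples should be dropped, relabelled with 'predicted_label', or reviewed by hand depends on the dataset; the sketch only shows how to split them out.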
For more details regarding different hyperparameters available in CrossValCurate, please refer to the API documentation.
Download files
File details
Details for the file dqc_toolkit-0.1.2.tar.gz.
File metadata
- Download URL: dqc_toolkit-0.1.2.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5b79c49eb4212e944d7697955a79eb9af4b3256308d8ee52ae34841fa4077423
MD5 | a04034adf99aa35518f2d0ffd8b00a31
BLAKE2b-256 | 13ffdf333ebc535604011db9e3eba7efdb7826c8a1834dd446b2a07b9772027c
File details
Details for the file dqc_toolkit-0.1.2-py3-none-any.whl.
File metadata
- Download URL: dqc_toolkit-0.1.2-py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 39179b23cd624dac5fa049fd8652a0a1d7cbde3369e603681017862aec26cdf7
MD5 | 7d867c2a8505ed22772b26e9051e9e5e
BLAKE2b-256 | c0e9331ee29ddd34630f0ea5ae6648c5d2ab744d6e4caf49fa3b0386ac9292a8
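The published digests can be used to check a manually downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper names sha256_of and matches_published_hash are hypothetical, and the expected values are the SHA256 digests listed for the two 0.1.2 files.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file hash listings for release 0.1.2.
EXPECTED_SHA256 = {
    "dqc_toolkit-0.1.2.tar.gz": "5b79c49eb4212e944d7697955a79eb9af4b3256308d8ee52ae34841fa4077423",
    "dqc_toolkit-0.1.2-py3-none-any.whl": "39179b23cd624dac5fa049fd8652a0a1d7cbde3369e603681017862aec26cdf7",
}


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

For example, matches_published_hash(Path("dqc_toolkit-0.1.2.tar.gz")) should return True for an intact copy of the source distribution and False for a corrupted or renamed file.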