Data Quality Check for Machine Learning

DQC Toolkit is a Python library and framework designed to help improve Machine Learning models by identifying and mitigating label errors in training datasets. Currently, DQC Toolkit offers CrossValCurate for curating text classification datasets (binary or multi-class) using cross-validation-based selection.

Installation

DQC Toolkit can be installed with pip:

pip install dqc-toolkit
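
To confirm that the installation succeeded, one option is to query the installed distribution's version from Python. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of DQC Toolkit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "dqc-toolkit" is the distribution name used in the pip command above
dqc_version = installed_version("dqc-toolkit")
print(dqc_version if dqc_version else "dqc-toolkit is not installed")
```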

Quick Start

Assume your text classification data is stored as a pandas dataframe named data, where the text column holds each sample and the label column holds its corresponding (possibly noisy) label. CrossValCurate can then be used as follows:

from dqc import CrossValCurate

cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])
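
The snippet above expects a dataframe named data to exist already. As a rough sketch of how such a dataframe might be prepared, the example below builds one from a few made-up samples and drops rows with a missing text or label. The sample values are purely illustrative, and removing incomplete rows beforehand is an assumption about sensible preprocessing, not a documented requirement of CrossValCurate.

```python
import pandas as pd

# Hypothetical toy samples; replace these with your own dataset
raw = pd.DataFrame(
    {
        "text": ["great movie", "terrible plot", None, "loved it"],
        "label": ["positive", "negative", "positive", "positive"],
    }
)

# Fail early if the two columns used in the Quick Start are absent
missing_columns = {"text", "label"} - set(raw.columns)
if missing_columns:
    raise ValueError(f"Missing required columns: {sorted(missing_columns)}")

# Assumption: rows without a text or a label are not useful for curation
data = raw.dropna(subset=["text", "label"]).reset_index(drop=True)

print(data)
```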

The result is stored in data_curated, a pandas dataframe similar to data with the following columns:

>>> data_curated.columns
['text', 'label', 'label_correctness_score', 'is_label_correct', 'predicted_label', 'prediction_probability']
  • 'label_correctness_score' represents a normalized score quantifying the correctness of 'label'.
  • 'is_label_correct' is a boolean flag indicating whether the given 'label' is correct (True) or incorrect (False).
  • 'predicted_label' and 'prediction_probability' represent the curation model's prediction and the corresponding probability score.
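
As a sketch of how these columns might be used downstream, the example below separates samples whose labels were judged correct from those flagged as incorrect, and orders the flagged ones for manual review. The dataframe here is a hand-written stand-in for the output of fit_transform: every value in it is made up for illustration, and the variable names clean and flagged are hypothetical.

```python
import pandas as pd

# Stand-in for the output of cvc.fit_transform; all values are illustrative only
data_curated = pd.DataFrame(
    {
        "text": ["great movie", "terrible plot", "loved it", "waste of time"],
        "label": ["positive", "negative", "negative", "positive"],
        "label_correctness_score": [0.95, 0.90, 0.10, 0.20],
        "is_label_correct": [True, True, False, False],
        "predicted_label": ["positive", "negative", "positive", "negative"],
        "prediction_probability": [0.95, 0.90, 0.90, 0.80],
    }
)

# Keep only the samples whose given label was judged correct
clean = data_curated[data_curated["is_label_correct"]].reset_index(drop=True)

# Collect the flagged samples, lowest correctness score first, for manual review
flagged = data_curated[~data_curated["is_label_correct"]].sort_values(
    "label_correctness_score"
)

print(f"{len(clean)} samples kept, {len(flagged)} flagged for review")
```

Whether to discard flagged samples or relabel them with predicted_label depends on the use case; the filter above only separates the two groups.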

For more details regarding different hyperparameters available in CrossValCurate, please refer to the API documentation.
