Data Quality Check for Machine Learning

Project description

DQC Toolkit is a Python library and framework that aims to improve Machine Learning models by identifying and mitigating label errors in training datasets. Currently, DQC Toolkit offers CrossValCurate for curating text classification datasets (binary or multi-class) using cross-validation-based selection.

Installation

DQC Toolkit can be installed with pip as shown below:

pip install dqc-toolkit

Quick Start

Assuming your text classification data is stored in a pandas DataFrame named data, with each sample's text in the text column and its corresponding noisy label in the label column, here is how to use CrossValCurate:

from dqc import CrossValCurate

cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])

The result is stored in data_curated, a pandas DataFrame similar to data with the following columns:

>>> data_curated.columns
['text', 'label', 'label_correctness_score', 'is_label_correct', 'predicted_label', 'prediction_probability']
  • 'label_correctness_score' represents a normalized score quantifying the correctness of 'label'.
  • 'is_label_correct' is a boolean flag indicating whether the given 'label' is correct (True) or incorrect (False).
  • 'predicted_label' and 'prediction_probability' represent the curation model's prediction and the corresponding probability score.
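
As a rough illustration of how these columns might be used downstream, the sketch below splits a curated DataFrame into rows judged correct and rows flagged for review. The DataFrame here is a hand-built stand-in with made-up values, not real CrossValCurate output, and the sketch assumes that a lower 'label_correctness_score' means a less trustworthy label.

```python
import pandas as pd

# Hand-built stand-in for the output of CrossValCurate.fit_transform.
# All values are invented for illustration only.
data_curated = pd.DataFrame(
    {
        "text": ["great movie", "terrible film", "loved every minute"],
        "label": ["positive", "positive", "positive"],
        "label_correctness_score": [0.95, 0.10, 0.88],
        "is_label_correct": [True, False, True],
        "predicted_label": ["positive", "negative", "positive"],
        "prediction_probability": [0.97, 0.91, 0.89],
    }
)

# Keep only the samples whose given label was judged correct.
clean = data_curated[data_curated["is_label_correct"]]

# Collect flagged samples for manual review, least trustworthy first
# (assumes a lower score means a less trustworthy label).
suspect = data_curated[~data_curated["is_label_correct"]].sort_values(
    "label_correctness_score"
)

print(len(clean), "rows kept;", len(suspect), "rows flagged for review")
```

Whether flagged rows should be dropped, relabelled with 'predicted_label', or reviewed by hand depends on the dataset and is left open here.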

For more details regarding different hyperparameters available in CrossValCurate, please refer to the API documentation.
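
To give an intuition for what cross-validation-based selection means in general, here is a toy, standard-library-only sketch: each sample is held out in turn, a trivial word-count model is built on the remaining folds, and the sample is flagged when the held-out prediction disagrees with its given label. This is not CrossValCurate's implementation; the function name flag_suspect_labels, the word-count model, and the toy data are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def flag_suspect_labels(texts, labels, n_folds=3):
    """Return one boolean per sample: True if the held-out prediction
    agrees with the given label. Illustrative only, not DQC's algorithm."""
    n = len(texts)
    flags = [None] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]

        # "Train" a trivial model: count how often each word appears
        # with each label in the training folds.
        word_label_counts = defaultdict(Counter)
        for i in train:
            for word in texts[i].lower().split():
                word_label_counts[word][labels[i]] += 1
        # Fallback prediction when no word of a sample was seen in training.
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]

        # Predict each held-out sample by summing the label counts of its words.
        for i in held_out:
            votes = Counter()
            for word in texts[i].lower().split():
                votes.update(word_label_counts[word])
            predicted = votes.most_common(1)[0][0] if votes else majority
            flags[i] = predicted == labels[i]
    return flags


texts = [
    "great movie", "terrible film", "great acting",
    "terrible plot", "great fun", "terrible bore",
    "great cast", "terrible mess", "great story",
]
labels = ["pos", "neg", "pos", "neg", "neg", "neg", "pos", "neg", "pos"]

flags = flag_suspect_labels(texts, labels)
suspects = [i for i, ok in enumerate(flags) if not ok]
print(suspects)  # [4]: "great fun" carries the noisy label "neg"
```

A real curation model would use a proper text classifier and probability scores rather than raw word counts; the point here is only the hold-out-and-compare structure.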
