Data Quality Check for Machine Learning

DQC Toolkit is a Python library and framework designed to help improve Machine Learning models by identifying and mitigating label errors in training datasets. Currently, DQC Toolkit offers CrossValCurate for curating text classification datasets (binary or multi-class) using cross-validation-based selection.

Installation

Install DQC Toolkit with pip:

pip install dqc-toolkit

Quick Start

Assuming your text classification data is stored in a pandas DataFrame named data, with each sample's text in the text column and its (possibly noisy) label in the label column, CrossValCurate can be used as follows:

from dqc import CrossValCurate

cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])
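
If your data does not yet have the text and label columns described above, it can be reshaped with plain pandas before calling CrossValCurate. The following is a minimal sketch under stated assumptions: the raw column names review and sentiment and the sample rows are hypothetical, and dropping rows with missing values is one reasonable choice, not a documented requirement of the library.

```python
import pandas as pd

# Hypothetical raw data: the column names 'review' and 'sentiment'
# and the rows themselves are made up for illustration.
raw = pd.DataFrame(
    {
        "review": ["great phone", "battery died in a day", None, "love it"],
        "sentiment": ["positive", "negative", "positive", "positive"],
    }
)

# Rename to the 'text' / 'label' columns used in the Quick Start,
# and drop rows where either field is missing (an assumption, see above).
data = (
    raw.rename(columns={"review": "text", "sentiment": "label"})
    .dropna(subset=["text", "label"])
    .reset_index(drop=True)
)

print(data)
```

The resulting data[['text', 'label']] has the shape the Quick Start passes to fit_transform.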

The result, stored in data_curated, is a pandas DataFrame similar to data with the following columns:

>>> data_curated.columns
['text', 'label', 'label_correctness_score', 'is_label_correct', 'predicted_label', 'prediction_probability']
  • 'label_correctness_score' represents a normalized score quantifying the correctness of 'label'.
  • 'is_label_correct' is a boolean flag indicating whether the given 'label' is correct (True) or incorrect (False).
  • 'predicted_label' and 'prediction_probability' represent the curation model's prediction and the corresponding probability score.
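
One way to act on these columns is to keep the rows flagged as correct for training and queue the rest for manual review. The sketch below uses only pandas and a hand-built stand-in for data_curated; every value in it is made up for illustration and is not real output from CrossValCurate.

```python
import pandas as pd

# Toy stand-in for data_curated with the documented columns.
# All values are invented for illustration only.
data_curated = pd.DataFrame(
    {
        "text": ["great phone", "battery died", "love it", "works fine"],
        "label": ["positive", "positive", "positive", "negative"],
        "label_correctness_score": [0.93, 0.12, 0.88, 0.35],
        "is_label_correct": [True, False, True, False],
        "predicted_label": ["positive", "negative", "positive", "positive"],
        "prediction_probability": [0.93, 0.88, 0.88, 0.65],
    }
)

# Rows whose given label was judged correct: candidates for training.
clean = data_curated[data_curated["is_label_correct"]]

# Rows flagged as mislabeled, least trustworthy labels first,
# so a reviewer can start with the most suspicious samples.
suspect = data_curated[~data_curated["is_label_correct"]].sort_values(
    "label_correctness_score"
)

print(f"kept {len(clean)} rows, flagged {len(suspect)} for review")
print(suspect[["text", "label", "predicted_label"]])
```

Whether flagged rows are dropped, relabeled with predicted_label, or reviewed by hand is a project decision; the columns support any of the three.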

For details on the hyperparameters available in CrossValCurate, refer to the API documentation.
