
Data Quality Check for Machine Learning

DQC Toolkit is a Python library and framework for improving Machine Learning models by identifying and mitigating label errors in training datasets. It currently offers two tools, CrossValCurate and LLMCurate. CrossValCurate uses cross validation to detect and correct label errors in text classification (binary or multi-class). LLMCurate extends "PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars" to compute LLM-based confidence scores for free-text labels.

Installation

DQC Toolkit can be installed with pip -

pip install dqc-toolkit
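
To confirm that the install succeeded, the installed version can be queried from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of DQC Toolkit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints a version string if dqc-toolkit is installed, otherwise None.
print(installed_version("dqc-toolkit"))
```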

Quick Start

CrossValCurate

Assuming your text classification data is stored in a pandas dataframe data, with each sample's text in the text column and its (possibly noisy) label in the label column, here is how you use CrossValCurate -

from dqc import CrossValCurate

cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])

The result stored in data_curated is a pandas dataframe similar to data with the following columns -

>>> data_curated.columns
['text', 'label', 'label_correctness_score', 'is_label_correct', 'predicted_label', 'prediction_probability']
  • 'label_correctness_score' represents a normalized score quantifying the correctness of 'label'.
  • 'is_label_correct' is a boolean flag indicating whether the given 'label' is correct (True) or incorrect (False).
  • 'predicted_label' and 'prediction_probability' represent the curation model's prediction and the corresponding probability score.
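
The flag and score columns can drive a simple review workflow. The sketch below does not call the library: it uses a small hand-built dataframe with the column names listed above, and both the values and the helper name `split_by_label_quality` are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for the output of cvc.fit_transform(...).
# The values are invented for illustration, not produced by the library.
data_curated = pd.DataFrame(
    {
        "text": ["great movie", "terrible plot", "loved it", "boring and slow"],
        "label": ["positive", "positive", "positive", "negative"],
        "label_correctness_score": [0.95, 0.10, 0.90, 0.85],
        "is_label_correct": [True, False, True, True],
        "predicted_label": ["positive", "negative", "positive", "negative"],
        "prediction_probability": [0.97, 0.92, 0.93, 0.88],
    }
)


def split_by_label_quality(df):
    """Hypothetical helper: separate rows flagged correct from rows to review."""
    clean = df[df["is_label_correct"]]
    # Lowest correctness score first, so reviewers see the worst cases early.
    suspect = df[~df["is_label_correct"]].sort_values("label_correctness_score")
    return clean, suspect


clean, suspect = split_by_label_quality(data_curated)
print(len(clean), "rows kept;", len(suspect), "rows to review")
print(suspect[["text", "label", "predicted_label"]])
```

Whether to drop the suspect rows, relabel them with 'predicted_label', or send them for manual review depends on the dataset and is left to the user.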

LLMCurate

Assuming data is a pandas dataframe whose column column_to_curate holds the text to be curated, here is how you use LLMCurate -

    
    from dqc import LLMCurate

    llmc = LLMCurate(model, tokenizer)
    ds = llmc.run(
            data,
            column_to_curate,
            ds_column_mapping,
            prompt_variants,
            llm_response_cleaned_column_list,
            answer_start_token,
            answer_end_token,
            batch_size,
            max_new_tokens
            )

where

  • model and tokenizer are the instantiated LLM model and tokenizer objects respectively
  • ds_column_mapping is a dictionary mapping the entities used in the LLM prompt to the corresponding columns in data. For example, ds_column_mapping={'INPUT' : 'input_column'} means that the text under input_column is passed to the LLM in the format "[INPUT]row['input_column'][/INPUT]" for each row in data
  • prompt_variants is the list of LLM prompts used to curate column_to_curate, and llm_response_cleaned_column_list is the corresponding list of column names in which the reference responses generated with each prompt are stored
  • answer_start_token and answer_end_token are optional text phrases representing the start and end of the answer respectively.
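
To make the tagged input format and the answer tokens concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper names `render_prompt_inputs` and `extract_answer`, the newline used to join multiple entities, and the "[ANSWER]" markers in the usage lines are all assumptions made for illustration.

```python
def render_prompt_inputs(row, ds_column_mapping):
    """Wrap each mapped column value in [ENTITY]...[/ENTITY] tags."""
    parts = []
    for entity, column in ds_column_mapping.items():
        parts.append(f"[{entity}]{row[column]}[/{entity}]")
    # Joining entities with a newline is an assumption of this sketch.
    return "\n".join(parts)


def extract_answer(response, answer_start_token="", answer_end_token=""):
    """Return the text between the start and end phrases when they are present."""
    answer = response
    if answer_start_token and answer_start_token in answer:
        # Keep everything after the first occurrence of the start phrase.
        answer = answer.split(answer_start_token, 1)[1]
    if answer_end_token and answer_end_token in answer:
        # Keep everything before the first occurrence of the end phrase.
        answer = answer.split(answer_end_token, 1)[0]
    return answer.strip()


row = {"input_column": "The battery lasts two days."}
print(render_prompt_inputs(row, {"INPUT": "input_column"}))
print(extract_answer("Sure. [ANSWER]positive[/ANSWER] Done.", "[ANSWER]", "[/ANSWER]"))
```

When both tokens are left empty, `extract_answer` in this sketch simply returns the stripped response, which mirrors the fact that the tokens are optional.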

ds is a dataset object with the following additional features -

  1. Feature for each column name in llm_response_cleaned_column_list
  2. LLM Confidence score for each text in column_to_curate

For details on the hyperparameters available in CrossValCurate and LLMCurate, please refer to the API documentation.

Download files

Source Distribution

dqc_toolkit-0.2.0.tar.gz (27.1 kB)

Uploaded Source

Built Distribution

dqc_toolkit-0.2.0-py3-none-any.whl (32.0 kB)

Uploaded Python 3

File details

Details for the file dqc_toolkit-0.2.0.tar.gz.

File metadata

  • Download URL: dqc_toolkit-0.2.0.tar.gz
  • Upload date:
  • Size: 27.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for dqc_toolkit-0.2.0.tar.gz
Algorithm Hash digest
SHA256 0093885e01c558fd5906b2b95a82a2bedae8a906165ba1ec5c2d182e10c4b07e
MD5 4f35851015f6688fe887c26a502012b4
BLAKE2b-256 c00cab4d956b4b2b537be563df964c9ae2fab0beff9fccd9e7aab913ef79dc53

File details

Details for the file dqc_toolkit-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: dqc_toolkit-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 32.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.14

File hashes

Hashes for dqc_toolkit-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7feb9b24c1b7f8b319d8d9017a440fb4d92c6aecdad5b3701cd9202c23003182
MD5 580778503cdca6a9df4fb18ce2ada24f
BLAKE2b-256 1060257e76802bfe85aa9cf84450a028e3c2f50785c206db63446f9b389d4917
