Data Quality Check for Machine Learning
DQC Toolkit is a Python library and framework designed to help improve Machine Learning models by identifying and mitigating label errors in training datasets. Currently, DQC Toolkit offers CrossValCurate and LLMCurate. CrossValCurate uses cross validation to detect and correct label errors in text classification (binary / multi-class). LLMCurate extends the approach of "PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars" to compute LLM-based confidence scores for free-text labels.
Installation
DQC Toolkit can be installed with pip as shown below
pip install dqc-toolkit
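After installing, it can be useful to confirm which version is active in the current environment. The snippet below is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of DQC Toolkit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "dqc-toolkit" is the distribution name used in the pip command above.
    print(installed_version("dqc-toolkit"))
```

A result of None means the package is not visible to the interpreter that ran the check, which often points to pip having installed into a different environment.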
Quick Start
CrossValCurate
Assuming your text classification data is stored as a pandas dataframe 'data', with each sample represented by the 'text' column and its corresponding noisy label represented by the 'label' column, here is how you use CrossValCurate -
from dqc import CrossValCurate
cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])
The result stored in data_curated is a pandas dataframe similar to data with the following columns -
>>> data_curated.columns
['text', 'label', 'label_correctness_score', 'is_label_correct', 'predicted_label', 'prediction_probability']
'label_correctness_score' represents a normalized score quantifying the correctness of 'label'. 'is_label_correct' is a boolean flag indicating whether the given 'label' is correct (True) or incorrect (False). 'predicted_label' and 'prediction_probability' represent the curation model's prediction and the corresponding probability score.
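One common next step is to separate the rows flagged as mislabeled so they can be reviewed or dropped. The sketch below assumes the curated rows have been converted to a list of dicts (for example with pandas' data_curated.to_dict("records")); the helper split_by_label_flag and the sample values are hypothetical and only illustrate the column semantics described above.

```python
def split_by_label_flag(records):
    """Split curated rows into (kept, flagged) lists based on 'is_label_correct'."""
    kept, flagged = [], []
    for row in records:
        if row["is_label_correct"]:
            kept.append(row)
        else:
            flagged.append(row)
    return kept, flagged


# Made-up rows shaped like the curated output; the values are illustrative only.
sample_records = [
    {"text": "great movie", "label": "positive", "label_correctness_score": 0.95,
     "is_label_correct": True, "predicted_label": "positive", "prediction_probability": 0.95},
    {"text": "terrible plot", "label": "positive", "label_correctness_score": 0.10,
     "is_label_correct": False, "predicted_label": "negative", "prediction_probability": 0.90},
]

kept, flagged = split_by_label_flag(sample_records)
for row in flagged:
    # For flagged rows, 'predicted_label' holds the curation model's suggestion.
    print(row["text"], "| given:", row["label"], "| suggested:", row["predicted_label"])
```

Whether to drop flagged rows or relabel them with 'predicted_label' depends on how much you trust the curation model for your data; reviewing a sample by hand first is a reasonable precaution.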
LLMCurate
Assuming 'data' is a pandas dataframe containing samples with the target text for curation under the column column_to_curate, here is how you use LLMCurate -
from dqc import LLMCurate

llmc = LLMCurate(model, tokenizer)
ds = llmc.run(
    data,
    column_to_curate,
    ds_column_mapping,
    prompt_variants,
    llm_response_cleaned_column_list,
    answer_start_token,
    answer_end_token,
    batch_size,
    max_new_tokens
)
where
- model and tokenizer are the instantiated LLM model and tokenizer objects respectively
- ds_column_mapping is the dictionary mapping of entities used in the LLM prompt to the corresponding columns in data. For example, ds_column_mapping={'INPUT' : 'input_column'} would imply that text under input_column in data would be passed to the LLM in the format "[INPUT]row['input_column'][/INPUT]" for each row in data
- prompt_variants is the list of LLM prompts to be used to curate column_to_curate and llm_response_cleaned_column_list is the corresponding list of column names to store the reference responses generated using each prompt
- answer_start_token and answer_end_token are optional text phrases representing the start and end of the answer respectively.
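To make the ds_column_mapping format and the answer tokens concrete, here is a small illustration in plain Python. It is not the library's internal code: format_row and extract_answer are hypothetical helpers, the "[ANSWER]" tokens are example values, and concatenating multiple entities without a separator is an assumption.

```python
def format_row(row, ds_column_mapping):
    """Wrap each mapped column value in [ENTITY]...[/ENTITY] tags.

    Assumption: multiple entities are simply concatenated in mapping order.
    """
    return "".join(
        f"[{entity}]{row[column]}[/{entity}]"
        for entity, column in ds_column_mapping.items()
    )


def extract_answer(response, answer_start_token="", answer_end_token=""):
    """Return the text between the two tokens; an empty or missing token leaves that side untrimmed."""
    start = 0
    if answer_start_token:
        idx = response.find(answer_start_token)
        if idx != -1:
            start = idx + len(answer_start_token)
    end = len(response)
    if answer_end_token:
        idx = response.find(answer_end_token, start)
        if idx != -1:
            end = idx
    return response[start:end].strip()


row = {"input_column": "The battery lasts two days."}
print(format_row(row, {"INPUT": "input_column"}))
# prints: [INPUT]The battery lasts two days.[/INPUT]

raw = "Sure! [ANSWER]long battery life[/ANSWER] Hope this helps."
print(extract_answer(raw, "[ANSWER]", "[/ANSWER]"))
# prints: long battery life
```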
ds is a dataset object with the following additional features -
- Feature for each column name in llm_response_cleaned_column_list
- LLM Confidence score for each text in column_to_curate
For more details regarding different hyperparameters available in CrossValCurate and LLMCurate, please refer to the API documentation.