Transparent Data Valuation

OpenDataVal: a Unified Benchmark for Data Valuation

Assessing the quality of individual data points is critical for improving model performance and mitigating biases. However, there is no way to systematically benchmark different algorithms.

OpenDataVal is an open-source initiative that provides a diverse array of datasets/models (image, NLP, and tabular), data valuation algorithms, and evaluation tasks, all usable with just a few lines of code.

OpenDataVal also provides leaderboards for data evaluation tasks. We've curated and added artificial noise to some datasets. Create your own DataEvaluator to top the leaderboards.

:sparkles: Features

| Feature | Status | Links | Notes |
| --- | --- | --- | --- |
| Datasets | Stable | Docs | Embeddings available for image/NLP datasets |
| Models | Stable | Docs | Support available for sk-learn models |
| Data Evaluators | Stable | Docs | |
| Experiments | Stable | Docs | |
| CLI | Experimental | `opendataval --help` | No support for null values |

:hourglass_flowing_sand: Installation options

Install with pip
```
pip install opendataval
```
Clone the repo and install
```
git clone https://github.com/opendataval/opendataval.git
cd opendataval
make install
```
a. Install optional dependencies if you're contributing
```
make install-dev
```
b. If you want to pull in Kaggle datasets, I'd recommend looking into how to add a kaggle folder to the current directory. Tutorial here
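
After either install route, it can be useful to confirm which version (if any) is visible to the current Python environment. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of opendataval.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("opendataval")
    print(found if found else "opendataval is not installed in this environment")
```

If this reports that the package is missing right after `pip install opendataval`, the most likely cause is that `pip` and `python` point at different environments.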

:zap: Quick Start

The snippet below sets up an experiment over a list of DataEvaluators. Feel free to change the source code as needed for a project.

```python
from opendataval.experiment import ExperimentMediator
from opendataval.experiment.exper_methods import (
    discover_corrupted_sample,
    noisy_detection,
)

exper_med = ExperimentMediator.model_factory_setup(
    dataset_name='iris',
    force_download=False,
    train_count=100,
    valid_count=50,
    test_count=50,
    model_name='ClassifierMLP',
    train_kwargs={'epochs': 5, 'batch_size': 20},
)
# ChildEvaluator is a placeholder for your own DataEvaluator subclass
list_of_data_evaluators = [ChildEvaluator(), ...]  # Define evaluators here
eval_med = exper_med.compute_data_values(list_of_data_evaluators)

# Runs a noisy-data discovery experiment for each DataEvaluator and plots the results
data, fig = eval_med.plot(discover_corrupted_sample)

# Runs a non-plottable experiment
data = eval_med.evaluate(noisy_detection)
```

:computer: CLI

opendataval comes with a quick CLI tool. The tool is under development, and the template for a csv input is found at cli.csv. Note that for kwarg arguments, the input must be valid JSON.

To use it, run the following command if installed with `make install`:

```
opendataval --file cli.csv -n [job_id] -o [path/to/file/]
```

To run without installing the script:

```
python opendataval --file cli.csv -n [job_id] -o [path/to/file/]
```
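
Since kwarg arguments must be valid JSON, a Python-style dict literal with single quotes will not parse. The sketch below shows one way to check a cell before putting it in the csv. It uses only the standard `json` module; `parse_kwargs` is a hypothetical helper for illustration and says nothing about how the CLI itself reads the file.

```python
import json


def parse_kwargs(cell: str) -> dict:
    """Parse a kwargs cell as JSON and insist that it is a JSON object."""
    parsed = json.loads(cell)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError("kwargs must be a JSON object")
    return parsed


# Double-quoted keys are valid JSON.
valid = parse_kwargs('{"epochs": 5, "batch_size": 20}')

# Single-quoted keys are valid Python but not valid JSON.
try:
    parse_kwargs("{'epochs': 5, 'batch_size': 20}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
```

A convenient way to produce a valid cell from an existing Python dict is `json.dumps(my_kwargs)`.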

:control_knobs: API

Here are the 4 interacting parts of opendataval:

1. `DataFetcher`: loads data and holds metadata regarding splits.
2. `Model`: a trainable prediction model.
3. `DataEvaluator`: measures the data values of input data points for a specified model.
4. `ExperimentMediator`: facilitates experiments regarding data values across several `DataEvaluator`s.
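
To make the division of labour among these parts concrete, here is a toy sketch that uses only the standard library. None of the names below are opendataval classes: the two lists stand in for what a data fetcher would provide, `CentroidModel` stands in for a model, and `leave_one_out_values` stands in for a data evaluator. Leave-one-out is used here simply because it is one of the easiest valuation schemes to write down, and the dataset is invented for illustration.

```python
from statistics import mean

# Toy 1-D data as (feature, label). The last training point is mislabeled on purpose.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (1.2, 1), (3.0, 0)]
valid = [(0.1, 0), (0.3, 0), (0.9, 1), (1.3, 1)]


class CentroidModel:
    """Stand-in model: predicts the label whose feature mean is closest."""

    def fit(self, data):
        by_label = {}
        for x, y in data:
            by_label.setdefault(y, []).append(x)
        self.centroids = {y: mean(xs) for y, xs in by_label.items()}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda y: abs(x - self.centroids[y]))


def accuracy(model, data):
    """Fraction of points in data that the model labels correctly."""
    return mean(1.0 if model.predict(x) == y else 0.0 for x, y in data)


def leave_one_out_values(train_data, valid_data):
    """Value of point i = accuracy with all points minus accuracy without point i."""
    full = accuracy(CentroidModel().fit(train_data), valid_data)
    values = []
    for i in range(len(train_data)):
        subset = train_data[:i] + train_data[i + 1:]
        values.append(full - accuracy(CentroidModel().fit(subset), valid_data))
    return values


values = leave_one_out_values(train, valid)
lowest = min(range(len(values)), key=values.__getitem__)
```

In this toy setup the mislabeled point ends up with the lowest (negative) value, because removing it improves validation accuracy. That is the kind of signal the noisy-data experiments look for.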

`ExperimentMediator`

`ExperimentMediator` helps make a cohesive and controlled experiment. NOTE: Warnings are raised if errors occur in a specific `DataEvaluator`.

```python
expermed = ExperimentMediator(fetcher, model, train_kwargs, metric_name).compute_data_values(data_evaluators)
```

Run experiments by passing in an experiment function: `(DataEvaluator, DataFetcher, ...) -> dict[str, Any]`. There are 5 found in `exper_methods.py`, three of which are plottable.

```python
df = expermed.evaluate(noisy_detection)
df, figure = expermed.plot(discover_corrupted_sample)
```
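
To illustrate the general shape of an experiment function (something that takes an evaluator and returns a dictionary of results), here is a simplified, self-contained sketch. `ToyEvaluator`, its `data_values` attribute, and `flag_lowest_f1` are hypothetical stand-ins invented for this example. This is not how the library's `noisy_detection` is implemented, and the function is not guaranteed to plug into `evaluate` as written.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyEvaluator:
    """Hypothetical stand-in that only carries precomputed data values."""

    data_values: list[float]


def flag_lowest_f1(
    evaluator: ToyEvaluator,
    noisy_indices: set[int],
    flag_fraction: float = 0.25,
) -> dict[str, Any]:
    """Flag the lowest-valued fraction of points as noisy and score the guess with F1."""
    n = len(evaluator.data_values)
    k = max(1, round(n * flag_fraction))
    # Indices sorted from lowest to highest data value.
    order = sorted(range(n), key=lambda i: evaluator.data_values[i])
    flagged = set(order[:k])

    true_pos = len(flagged & noisy_indices)
    precision = true_pos / len(flagged)
    recall = true_pos / len(noisy_indices) if noisy_indices else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"flagged": sorted(flagged), "f1": f1}


toy = ToyEvaluator(data_values=[0.25, 0.25, 0.0, 0.5, -0.25, 0.1, -0.4, 0.3])
result = flag_lowest_f1(toy, noisy_indices={4, 6})
```

Returning a plain dictionary keeps the results easy to collect across several evaluators into one table.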

For more examples, please refer to the Documentation

:medal_sports: opendataval Leaderboards

For datasets that start with the prefix `challenge`, we provide leaderboards. Compute the data values with an `ExperimentMediator` and use the `save_dataval` function to save a csv. Upload it here! Uploading will allow us to systematically compare your `DataEvaluator` against others in the field.

The available challenges are currently:

- `challenge-iris`

```python
exper_med = ExperimentMediator.model_factory_setup(
    dataset_name='challenge-...', model_name=model_name, train_kwargs={...}, metric_name=metric_name
)
exper_med.compute_data_values([custom_data_evaluator]).evaluate(save_dataval, save_output=True)
```

:wave: Contributing

If you have a quick suggestion, recommendation, or bug fix, please open an issue. If you want to contribute to the project, whether through data sets, experiments, presets, or fixes, please see our Contribution page.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

:bulb: Vision

- Clean, descriptive specification syntax: based on modern object-oriented design principles for data science.
- Fair model assessment and benchmarking: easily build and evaluate your Data Evaluators.
- Easily extensible: add your own data sets, data evaluators, models, tests, etc.!

:classical_building: License

Distributed under the MIT License. See LICENSE.txt for more information.

