This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.
A benchmark of data-centric tasks from across the machine learning lifecycle.
⚡️ Quickstart
pip install dcbench
Optional: some parts of dcbench rely on optional dependencies. If you know which optional dependencies you'd like to install, you can request them as extras instead, for example:
pip install dcbench[dev]
See setup.py for a full list of optional dependencies.
To install the latest development version from GitHub:
pip install "dcbench[dev] @ git+https://github.com/data-centric-ai/dcbench@main"
Using a Jupyter notebook or some other interactive environment, you can import the library and explore the data-centric problems in the benchmark:
import dcbench
dcbench.tasks
💡 What is dcbench?
This benchmark evaluates data-centric aspects of improving the quality of machine learning workflows.
It features a growing list of data-centric tasks:
- Minimal data cleaning (miniclean)
- Label Correction (labelfix)
- Slice Discovery (errmod)
- Minimal training dataset selection (minitrain)
Each task features a collection of problems. What is a problem? A useful analogy: chess problems are to a full chess game as problems are to the full data-centric ML lifecycle. For example, many machine-learning workflows include a label correction phase where labels are audited and fixed. Our benchmark includes a collection of label cleaning problems, each with a different dataset and set of sullied labels to be cleaned.
⚙️ How does it work?
Problem
The benchmark supports a diverse set of problems that may look very different from one another. For example, a slice discovery problem has different inputs and outputs than a data cleaning problem. To deal with this, we group problems by task. In dcbench, each task is represented by a subclass of Problem (e.g. SliceDiscoveryProblem, MinicleanProblem). The problems themselves are represented by instances of these subclasses.
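The relationship between tasks and problems can be illustrated with a small toy sketch. The classes below (ToyProblem, ToySliceDiscoveryProblem, ToyMinicleanProblem) and the required_artefacts attribute are hypothetical names invented for this example; they are not part of dcbench and only mirror the idea that a task is a subclass while a problem is an instance.

```python
from dataclasses import dataclass, field
from typing import ClassVar, Dict, Tuple


@dataclass
class ToyProblem:
    """Hypothetical base class: one subclass per task, one instance per problem."""

    id: str
    # Maps an artefact name to a location (plain strings here, for illustration only).
    artefacts: Dict[str, str] = field(default_factory=dict)

    # Artefact names that every problem of a given task must provide (an assumption).
    required_artefacts: ClassVar[Tuple[str, ...]] = ()

    def __post_init__(self):
        missing = set(self.required_artefacts) - set(self.artefacts)
        if missing:
            raise ValueError(
                f"{type(self).__name__} {self.id!r} is missing {sorted(missing)}"
            )


class ToySliceDiscoveryProblem(ToyProblem):
    # A slice discovery problem needs both a dataset and a model to analyse.
    required_artefacts = ("dataset", "model")


class ToyMinicleanProblem(ToyProblem):
    # A data cleaning problem only needs the dataset to clean.
    required_artefacts = ("dataset",)


# Each task is a class; each problem is an instance of that class.
toy_tasks = [ToyMinicleanProblem, ToySliceDiscoveryProblem]
toy_problem = ToySliceDiscoveryProblem(
    id="p1", artefacts={"dataset": "data.csv", "model": "model.pt"}
)
```

Grouping by subclass lets each task declare its own inputs and outputs while code that only needs an id or the artefact listing can treat every problem uniformly.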
We can get a list of all the problem classes in dcbench with:
import dcbench
dcbench.tasks
# OUT:
[<class 'dcbench.tasks.miniclean.problem.MinicleanProblem'>, <class 'dcbench.tasks.slice.SliceDiscoveryProblem'>]
dcbench includes a set of problems for each task. We can list them with:
from dcbench import SliceDiscoveryProblem
SliceDiscoveryProblem.instances
# Out: TODO, get the actual dataframe output here
dataframe
We can get one of these problems with:
problem = SliceDiscoveryProblem.from_id("eda4")
Artefact
Each problem is made up of a set of artefacts: for example, a dataset with labels to clean, or a dataset and a model to perform error analysis on. In dcbench, these artefacts are represented by instances of Artefact. We can think of each Problem object as a container for Artefact objects.
problem.artefacts
# Out:
{
"dataset": CSVArtefact()
}
artefact: CSVArtefact = problem.artefacts["dataset"]
Note that Artefact objects don't actually hold their underlying data in memory. Instead, they hold pointers to where the Artefact lives in dcbench cloud storage and, if it's been downloaded, where it lives locally on disk. This makes the Problem objects very lightweight.
Downloading to disk. By default, dcbench downloads artefacts to ~/.dcbench/artefacts, but this can be configured in the dcbench settings (TODO: add support for configuration). To download an Artefact via the Python API, use artefact.download(). You can also download all the artefacts in a problem with problem.download().
Loading into memory. dcbench includes loading functionality for each artefact type. To load an artefact into memory you can use artefact.load(). Note that this will also download the artefact if it hasn't yet been downloaded.
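The pointer-based design and the download-on-load behaviour described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not dcbench's implementation: ToyCSVArtefact, its attributes, and the is_downloaded property are hypothetical, and a local file copy stands in for fetching from cloud storage.

```python
import csv
import shutil
import tempfile
from pathlib import Path


class ToyCSVArtefact:
    """Hypothetical artefact: stores locations only, never the data itself."""

    def __init__(self, remote_path: Path, local_dir: Path):
        self.remote_path = Path(remote_path)  # stand-in for a cloud storage location
        self.local_path = Path(local_dir) / self.remote_path.name

    @property
    def is_downloaded(self) -> bool:
        return self.local_path.exists()

    def download(self) -> None:
        # A real implementation would fetch over the network; a file copy stands in here.
        self.local_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(self.remote_path, self.local_path)

    def load(self) -> list:
        # Download on first use, then read the local copy into memory.
        if not self.is_downloaded:
            self.download()
        with open(self.local_path, newline="") as f:
            return list(csv.DictReader(f))


# Usage: a temporary directory plays the role of both the "cloud" and the local cache.
root = Path(tempfile.mkdtemp())
remote = root / "cloud" / "dataset.csv"
remote.parent.mkdir(parents=True)
remote.write_text("id,label\n1,cat\n2,dog\n")

artefact = ToyCSVArtefact(remote, local_dir=root / "cache")
downloaded_before = artefact.is_downloaded  # nothing has been copied yet
rows = artefact.load()  # triggers the download, then parses the CSV
```

Because the object holds only two paths until load() is called, creating many such objects stays cheap regardless of how large the underlying files are.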
Finally, we should point out that problem is a Python mapping, so we can index it directly to load artefacts.
import pandas as pd  # needed for the pd.DataFrame annotation below
# this is equivalent to problem.artefacts["dataset"].load()
df: pd.DataFrame = problem["dataset"]
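One way to obtain this indexing behaviour is to implement the collections.abc.Mapping interface and have item access call load() on the underlying artefact. The sketch below assumes that approach; ToyArtefact and ToyMappingProblem are hypothetical names for illustration, and how dcbench itself implements the mapping is not shown here.

```python
from collections.abc import Mapping


class ToyArtefact:
    """Hypothetical artefact whose load() returns the underlying data."""

    def __init__(self, data):
        self._data = data

    def load(self):
        return self._data


class ToyMappingProblem(Mapping):
    """Hypothetical problem: indexing returns loaded data, not the artefact object."""

    def __init__(self, artefacts: dict):
        self.artefacts = artefacts

    def __getitem__(self, key):
        # Delegate to the artefact so that problem[key] yields in-memory data.
        return self.artefacts[key].load()

    def __iter__(self):
        return iter(self.artefacts)

    def __len__(self):
        return len(self.artefacts)


toy_problem = ToyMappingProblem({"dataset": ToyArtefact([1, 2, 3])})
data = toy_problem["dataset"]  # same result as toy_problem.artefacts["dataset"].load()
```

Implementing only __getitem__, __iter__, and __len__ is enough for Mapping to supply keys(), items(), and membership tests.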
✉️ About
dcbench is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!
Hashes for dcbench-0.0.3-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0772f183d081b712eb90007b36e2c8af0797b2fecd14d80aed22406753b70328
MD5 | 13c86a4c73e91c94068d9d304a39880b
BLAKE2b-256 | bec3eaa03d37a0a8f425ed9f26c8562b04c3cea48ee86fb7e88022ccfc622994