This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.
⚡️ Quickstart
pip install dcbench
Using a Jupyter notebook or some other interactive environment, you can import the library and explore the data-centric problems in the benchmark:
import dcbench
dcbench.problem_classes
💡 What is dcbench?
This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.
It features a growing list of tasks:
- Minimal data cleaning (miniclean)
- Task-specific label correction (labelfix)
- Discovery of validation error modalities (errmod)
- Minimal training dataset selection (minitrain)
Each task features a collection of scenarios, which are defined by datasets and ML pipeline elements (e.g. a model or feature pre-processors).
⚙️ How does it work?
Problem
This benchmark is a collection of data-centric problems. What is a data-centric problem? A useful analogy is: chess problems are to a full chess game as data-centric problems are to the full data-centric ML lifecycle. For example, many machine-learning workflows include a label cleaning phase where labels are audited and corrected. Therefore, our benchmark includes a collection of label cleaning problems each with a different dataset and set of sullied labels to be cleaned.
The benchmark supports a diverse set of problems that may look very different from one another. For example, a slice discovery problem has different inputs and outputs than a data cleaning problem. To deal with this, we group problems by problem class. In dcbench, each problem class is represented by a subclass of Problem (e.g. SliceDiscoveryProblem, MiniCleanProblem). The problems themselves are represented by instances of these subclasses.
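The split between problem classes and problem instances can be illustrated without dcbench itself. The following is a minimal sketch, not dcbench's implementation: ToyProblem, its two subclasses, the toy_problem_classes list, and the ids "p1" and "p2" are all hypothetical names invented for this example.

```python
from typing import Dict, List, Type

# Hypothetical registry of problem classes (illustration only, not dcbench code).
toy_problem_classes: List[Type["ToyProblem"]] = []


class ToyProblem:
    """Base class: each subclass plays the role of one problem class."""

    instances: Dict[str, "ToyProblem"] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.instances = {}               # every subclass gets its own id table
        toy_problem_classes.append(cls)  # and is added to the registry

    def __init__(self, problem_id: str):
        self.id = problem_id
        type(self).instances[problem_id] = self

    @classmethod
    def from_id(cls, problem_id: str) -> "ToyProblem":
        # Look up an already-registered instance of this problem class.
        return cls.instances[problem_id]


class ToySliceDiscoveryProblem(ToyProblem):
    pass


class ToyMiniCleanProblem(ToyProblem):
    pass


# Individual problems are instances of the subclasses.
slice_problem = ToySliceDiscoveryProblem("p1")
clean_problem = ToyMiniCleanProblem("p2")
```

In this sketch, subclassing is enough to register a new problem class, and each class keeps its own table of instances, so problems with different inputs and outputs never share a lookup.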
We can get a list of all the problem classes in dcbench with:
import dcbench
dcbench.problem_classes
# Out:
[SliceDiscoveryProblem, MiniCleanProblem]
dcbench includes a set of problems for each task. We can list them with:
from dcbench import SliceDiscoveryProblem
SliceDiscoveryProblem.instances
# Out: a dataframe (TODO: show the actual output here)
We can get one of these problems with:
problem = SliceDiscoveryProblem.from_id("eda4")
Artefact
Each problem is made up of a set of artefacts: for example, a dataset with labels to clean, or a dataset and a model to perform error analysis on. In dcbench, these artefacts are represented by instances of Artefact. We can think of each Problem object as a container for Artefact objects.
problem.artefacts
# Out:
{
"dataset": CSVArtefact()
}
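# get the Artefact object itself; indexing the problem directly would load its data instead
artefact: CSVArtefact = problem.artefacts["dataset"]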
Note that Artefact objects don't actually hold their underlying data in memory. Instead, they hold pointers to where the Artefact lives in dcbench cloud storage and, if it's been downloaded, where it lives locally on disk. This makes the Problem objects very lightweight.
Downloading to disk. By default, dcbench downloads artefacts to ~/.dcbench/artefacts, but this can be configured in the dcbench settings (TODO: add support for configuration). To download an Artefact via the Python API, use artefact.download(). You can also download all the artefacts in a problem with problem.download().
Loading into memory. dcbench includes loading functionality for each artefact type. To load an artefact into memory, you can use artefact.load(). Note that this will also download the artefact if it hasn't yet been downloaded.
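The pointer-based design described above (hold locations, download on demand, load on request) can be sketched with the standard library alone. This is an illustration under stated assumptions, not dcbench's code: ToyCSVArtefact and its attributes are hypothetical, and a file copy inside a temporary directory stands in for a download from cloud storage.

```python
import csv
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


class ToyCSVArtefact:
    """Holds only two paths; the data stays on disk until load() is called."""

    def __init__(self, remote_path: Path, local_path: Path):
        self.remote_path = remote_path  # stand-in for a cloud storage location
        self.local_path = local_path    # where the downloaded copy should live

    @property
    def is_downloaded(self) -> bool:
        return self.local_path.exists()

    def download(self) -> None:
        # A real implementation would fetch over the network; here we copy a file.
        self.local_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(self.remote_path, self.local_path)

    def load(self) -> List[Dict[str, str]]:
        # Download on first use, then parse the local copy.
        if not self.is_downloaded:
            self.download()
        with self.local_path.open(newline="") as f:
            return list(csv.DictReader(f))


# Demo: one temporary directory plays the role of both "cloud" and local cache.
workdir = Path(tempfile.mkdtemp())
remote = workdir / "cloud" / "dataset.csv"
remote.parent.mkdir(parents=True)
remote.write_text("id,label\n0,cat\n1,dog\n")

artefact = ToyCSVArtefact(remote, workdir / "cache" / "dataset.csv")
downloaded_before = artefact.is_downloaded  # nothing in the cache yet
rows = artefact.load()                      # triggers the download, then parses
downloaded_after = artefact.is_downloaded
```

Because the object stores only paths, creating many artefacts costs almost nothing; the expensive work happens the first time load() is called.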
Finally, we should point out that problem is a Python mapping, so we can index it directly to load artefacts.
import pandas as pd

# this is equivalent to problem.artefacts["dataset"].load()
df: pd.DataFrame = problem["dataset"]
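One way such a mapping could be built is on top of collections.abc.Mapping, with indexing delegated to each artefact's load method. The sketch below is a hypothetical illustration rather than dcbench's implementation: ToyArtefact, ToyMappingProblem, and load_count are names invented here.

```python
from collections.abc import Mapping
from typing import Any, Callable, Dict, Iterator


class ToyArtefact:
    """Hypothetical artefact that produces its data only when load() is called."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader
        self.load_count = 0  # counts how often the data was actually loaded

    def load(self) -> Any:
        self.load_count += 1
        return self._loader()


class ToyMappingProblem(Mapping):
    """Read-only mapping from artefact name to the loaded artefact data."""

    def __init__(self, artefacts: Dict[str, ToyArtefact]):
        self.artefacts = artefacts

    def __getitem__(self, key: str) -> Any:
        # Indexing the problem loads the artefact, rather than returning it.
        return self.artefacts[key].load()

    def __iter__(self) -> Iterator[str]:
        return iter(self.artefacts)

    def __len__(self) -> int:
        return len(self.artefacts)


toy_problem = ToyMappingProblem(
    {"dataset": ToyArtefact(lambda: [{"id": 0, "label": "cat"}])}
)
data = toy_problem["dataset"]  # same as toy_problem.artefacts["dataset"].load()
```

Keeping the raw artefact objects available under a separate attribute, as in this sketch, lets callers choose between the lightweight pointer and the loaded data.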
✉️ About
dcbench is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!