
Data-Centric AI Benchmark

This is a benchmark that tests various data-centric aspects of improving the quality of machine learning workflows.

It features a growing list of tasks:

  • Minimal data cleaning (miniclean)
  • Task-specific label correction (labelfix)
  • Discovery of validation error modalities (errmod)
  • Minimal training dataset selection (minitrain)

Each task features a collection of scenarios, each defined by a dataset and ML pipeline elements (e.g. a model, feature pre-processors, etc.).

Basic Usage

The very first step is to install the PyPI package:

pip install dcai
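
A quick way to confirm the installation worked is to import the package from a Python interpreter:

import dcai  # should succeed without errors if the installation worked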

Then, we advise using Jupyter notebooks or some other interactive environment. You start off by importing the library and listing all the available scenarios:

from dcai import scenarios

scenarios.list()

You can then load a specific scenario and view its artefacts:

scenario = scenarios.get("miniclean/bank")
scenario.artefacts

In the above example, we are loading the bank scenario of the miniclean task. We can then load all of its artefacts into a dictionary:

a = scenario.artefacts.load()

This automatically downloads all the available artefacts, saves a local copy, and loads them into memory. Artefacts can be accessed directly from the dictionary. We can then go ahead and write the code that produces a scenario-specific solution:

model.fit(a["X_train_dirty"], a["y_train"])

X_train_selection = ...
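
As a concrete illustration, here is a minimal baseline sketch for the miniclean/bank scenario. It is not the intended solution: it assumes the loaded artefacts are pandas DataFrames and simply imputes missing values with scikit-learn to produce X_train_selection; any strategy that yields an object of the expected form can take its place.

import pandas as pd
from sklearn.impute import SimpleImputer

# Check which artefacts this scenario actually provides (keys are scenario-specific).
print(list(a.keys()))

X_dirty = a["X_train_dirty"]

# Naive baseline: fill each column's missing values with its most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
X_train_selection = pd.DataFrame(
    imputer.fit_transform(X_dirty),
    columns=X_dirty.columns,
    index=X_dirty.index,
)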

Once we have an object (e.g. X_train_selection) containing the scenario-specific solution, we can package it into a solution object:

solution = scenario.solve(X_train_selection=X_train_selection)

We can then evaluate the solution and inspect the result:

solution.evaluate()
solution.result

Once you're happy with the obtained result, you can bundle your solution artefacts and check where they were saved:

solution.save()
solution.location

After obtaining /path/to/your/artefacts, you can upload it as a bundle to CodaLab:

cl upload /path/to/your/artefacts

This command will display the URL of your uploaded bundle. It assumes that you have a user account on CodaLab.

After that, you simply go to our FORM LINK, fill it in with all the required details, and paste the bundle link so we can run a full evaluation on it.

Congratulations! Your solution is now uploaded to our system and after evaluation it will show up on the leaderboard.

Adding a Submitted Solution to the Repo

This step is performed manually by us (although it could be automated). It looks like this:

dcai add-solution \
    --scenario miniclean/bank \
    --name MySolution \
    --paper https://arxiv.org/abs/... \
    --code https://github.com/... \
    --artefacts-url https://worksheets.codalab.org/rest/bundles/...

Performing the Full Evaluation

This step is performed by GitHub Actions and is triggered after each commit.

dcai evaluate --leaderboard-output /path/to/leaderboard/dir

