Skip to main content

MMD Computation Package for Data Preparation Bench

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

Data-Preparation-Bench

A benchmark for evaluating the data preparation capabilities of large language models (LLMs). The benchmark is organized into two modules:

Modules

1. Data Synthesis & Augmentation

Given raw metadata, the model is tasked with synthesizing or augmenting datasets to improve downstream model training.

2. Data Quality Assessment

Given raw metadata, the model is tasked with predicting the training data's impact on downstream task performance.

Quick Start

Usage

This project uses uv for dependency management. To get started:

git clone https://github.com/haolpku/Data-Preparation-Bench.git
cd Data-Preparation-Bench
uv sync

To use your own datasets, modify the configuration dictionaries and formatters in compute_mmd.py:

DS1_CONFIG = {
    "name": "oda-math",
    "data_path": "OpenDataArena/ODA-Math-460k",
    "data_size": 5000,
    "split": "train",
    "shuffle_seed": 42,
}
formatter1 = AlpacaFormatter(
    user_key="question",
    assistant_key="response",
)

DS2_CONFIG = {
    "name": "infinity-instruct",
    "data_path": "BAAI/Infinity-Instruct",
    "data_size": 5000,
    "split": "train",
    "shuffle_seed": 42,
}
formatter2 = ShareGptFormatter(
    conversations_key="conversations",
)

Typically, you only need to update data_path with your dataset and define a formatter that converts raw items to the required format. After making these changes, run the MMD computation with:

uv run examples/compute_mmd.py

Development

To set up the development environment locally:

uv sync --extra dev
uv run pre-commit install

Before committing, format and lint the code:

uv run pre-commit run --all-files

Experiment Settings

Please refer to Experiment.md for detailed experiment configurations.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

open_mmd-1.0.0.tar.gz (21.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

open_mmd-1.0.0-py3-none-any.whl (19.7 kB view details)

Uploaded Python 3

File details

Details for the file open_mmd-1.0.0.tar.gz.

File metadata

  • Download URL: open_mmd-1.0.0.tar.gz
  • Upload date:
  • Size: 21.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for open_mmd-1.0.0.tar.gz
Algorithm Hash digest
SHA256 99c959b26acae4f204ec29a44f64bcf099a71a8a09c3409874bfb57ffde1f8f3
MD5 fbd3f56b37ac9e14937a4e42eff40dc3
BLAKE2b-256 5e0fb9d9eec1a769a8b6e55766ae81eff46d1b079a110a50cf378562ae76d417

See more details on using hashes here.

File details

Details for the file open_mmd-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: open_mmd-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 19.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for open_mmd-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6df050f33503747388e3a944aeaa14a3e9caec50dab518f793be9e0f4c703bd2
MD5 7f49a603f8d889e76163f05ff52fd74f
BLAKE2b-256 644a0a3cc1a6b4d17c7940533d6dae224e110429c28bba88f02818c7ad69cafa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page