Skip to main content

Distance Computation Package for Data Preparation Bench

Reason this release was yanked:

published by mistake

Project description

Data-Preparation-Bench

A benchmark for evaluating the data preparation capabilities of large language models (LLMs). The benchmark is organized into two modules:

Modules

1. Data Synthesis & Augmentation

Given raw metadata, the model is tasked with synthesizing or augmenting datasets to improve downstream model training.

2. Data Quality Assessment

Given raw metadata, the model is tasked with predicting the training data's impact on downstream task performance.

Quick Start

Usage

The package is published on PyPI and can be installed via pip:

pip install distflow

For vLLM embedding support, install the optional dependency:

pip install distflow[vllm]

This project uses uv for dependency management. To get started:

git clone https://github.com/haolpku/Data-Preparation-Bench.git
cd Data-Preparation-Bench
uv sync

To use your own datasets, modify the configuration dictionaries and formatters in compute_mmd.py:

DS1_CONFIG = {
    "name": "oda-math",
    "data_path": "OpenDataArena/ODA-Math-460k",
    "data_size": 5000,
    "split": "train",
    "shuffle_seed": 42,
}
formatter1 = AlpacaFormatter(
    user_key="question",
    assistant_key="response",
)

DS2_CONFIG = {
    "name": "infinity-instruct",
    "data_path": "BAAI/Infinity-Instruct",
    "data_size": 5000,
    "split": "train",
    "shuffle_seed": 42,
}
formatter2 = ShareGptFormatter(
    conversations_key="conversations",
)

Typically, you only need to update data_path with your dataset and define a formatter that converts raw items to the required format. After making these changes, run the MMD computation with:

uv run examples/compute_mmd.py

Development

To set up the development environment locally:

uv sync --extra dev
uv run pre-commit install

Before committing, format and lint the code:

uv run pre-commit run --all-files

Experiment Settings

Please refer to Experiment.md for detailed experiment configurations.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

distflow-0.0.0.tar.gz (18.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

distflow-0.0.0-py3-none-any.whl (24.0 kB view details)

Uploaded Python 3

File details

Details for the file distflow-0.0.0.tar.gz.

File metadata

  • Download URL: distflow-0.0.0.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for distflow-0.0.0.tar.gz
Algorithm Hash digest
SHA256 421ea8d6efd5c59388fe527db4a6750e700334ef92ed32ccc6041d935ce3025b
MD5 715c68a5299f9e7c17c2ff842a02abab
BLAKE2b-256 106b37acd6555d75e50cdb8eead68f38860d30bd462cbfe7d9ea90acef399a33

See more details on using hashes here.

File details

Details for the file distflow-0.0.0-py3-none-any.whl.

File metadata

  • Download URL: distflow-0.0.0-py3-none-any.whl
  • Upload date:
  • Size: 24.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for distflow-0.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 83e1f8807707c32cb07fe42e3a0f0e548e71956dd57946bd62596a2014a2b32b
MD5 60efbb13ec5e22cfa4d51e997835174c
BLAKE2b-256 55d21df546eab5120458c3446996ee5157b76a26631c7ac049db9f7ba31afb84

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page