
hf_clean_benchmarks

This repository contains code for cleaning your training data of benchmark data to help combat data snooping.

This repository is heavily inspired by the BigCode repository and is largely a refactoring of their code, in particular the work of Chenghao Mou (awesome work!).

Install

pip install hf_clean_benchmarks

How to use

Using the API

First, specify which benchmarks you want to clean your data of. You do this by creating a list of dictionaries, one per benchmark, each giving the benchmark's name on the Hugging Face Hub, the splits to use, and the columns that contain the benchmark text. For example, to clean your data of the HumanEval and LAMBADA benchmarks:

# Benchmarks to clean
benchmarks = [
    {
        "name": "openai_humaneval",
        "splits": ["test"],
        "columns": ["prompt", "canonical_solution", "test"],
    },
    {
        "name": "lambada",
        "splits": ["test"],
        "columns": ["text"],
    },
]

You then pass this list to the BenchmarkCleaner class, which downloads the benchmarks and constructs the suffix array for each one. You can then use its clean method to clean a Hugging Face dataset. For example:

from datasets import load_dataset
from hf_clean_benchmarks.core import BenchmarkCleaner

cleaner = BenchmarkCleaner(benchmarks, threshold=0.1, num_perm=128)

# load your dataset
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

# clean the dataset
cleaned_dataset = cleaner.clean(dataset, column="content")
Checking for false positives...: 100%|██████████| 8780/8780 [00:34<00:00, 251.05it/s]
Checking for false positives...: 100%|██████████| 8805/8805 [07:34<00:00, 19.39it/s]
[11/06/22 10:34:43] INFO     Data Number                   : 10000                                      core.py:210
                    INFO     Duplicate Number              : 4033                                       core.py:211
                    INFO     Duplicate Rate                : 40.33%                                     core.py:212
                    INFO     Total Time                    : 493.73 seconds                             core.py:213
cleaned_dataset
Dataset({
    features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang', '__id__'],
    num_rows: 5967
})
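
The cleaned dataset is a regular datasets.Dataset, so you can persist it for later training runs however you normally would. A minimal sketch using the standard datasets API (the output paths below are illustrative, not part of this package):

# Save the cleaned dataset to disk in Arrow format (path is illustrative)
cleaned_dataset.save_to_disk("the-stack-smol-python-cleaned")

# ...or export it as JSON Lines
cleaned_dataset.to_json("the-stack-smol-python-cleaned.jsonl")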

Using the CLI

First, specify which benchmarks you want to clean your data of. You do this by creating a JSON file containing a list of benchmark configurations, each giving the benchmark's name on the Hugging Face Hub, the splits to use, and the columns that contain the benchmark text. For example, to clean your data of the HumanEval and LAMBADA benchmarks:

file: benchmarks.json

[
    {
        "name": "openai_humaneval",
        "splits": ["test"],
        "columns": ["prompt", "canonical_solution", "test"]
    },
    {
        "name": "lambada",
        "splits": ["test"],
        "columns": ["text"]
    }
]

You then pass this JSON file to the clean_dataset command, which downloads the benchmarks, constructs the suffix array for each one, and cleans the specified Hugging Face dataset. For example:

clean_dataset \
    --dataset_name bigcode/the-stack-smol \
    --column_name content \
    --benchmark_configs_path benchmarks.json \
    --output_path /tmp/test.jsonl \
    --data_dir data/python \
    --dataset_split train \
    --save_json
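
With --save_json, the cleaned rows are written to the path given by --output_path. Assuming that output is JSON Lines, as the .jsonl extension above suggests, you could load it back with the standard datasets JSON loader; a minimal sketch:

from datasets import load_dataset

# Load the cleaned output written by clean_dataset (path taken from the example above)
cleaned = load_dataset("json", data_files="/tmp/test.jsonl", split="train")
print(cleaned)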
