This repository contains code for removing benchmark data from your training data, to help combat data snooping.
Project description
hf_clean_benchmarks
This repository is heavily inspired by the BigCode repository and is mostly a refactoring of their code. In particular, the code it builds on was mainly written by Chenghao Mou (awesome work!).
Install
pip install hf_clean_benchmarks
How to use
Using the API
First you need to specify which benchmarks you want to clean your data of. You do this by creating a list of dictionaries, one per benchmark, giving the benchmark's name in Hugging Face's datasets repository, the splits to use, and the columns containing the benchmark text. For example, if you want to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:
# Benchmarks to clean
benchmarks = [
    {
        "name": "openai_humaneval",
        "splits": ["test"],
        "columns": ["prompt", "canonical_solution", "test"],
    },
    {
        "name": "lambada",
        "splits": ["test"],
        "columns": ["text"],
    },
]
You then pass this list to the BenchmarkCleaner class. This class will download the benchmarks and construct a suffix array for each one. You can then use the clean method to clean a Hugging Face dataset. For example:
from datasets import load_dataset
from hf_clean_benchmarks.core import BenchmarkCleaner
cleaner = BenchmarkCleaner(benchmarks, threshold=0.1, num_perm=128)
# load your dataset
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
# clean the dataset
cleaned_dataset = cleaner.clean(dataset, column="content")
Checking for false positives...: 100%|██████████| 8780/8780 [00:34<00:00, 251.05it/s]
Checking for false positives...: 100%|██████████| 8805/8805 [07:34<00:00, 19.39it/s]
[11/06/22 10:34:43] INFO Data Number : 10000 core.py:210
INFO Duplicate Number : 4033 core.py:211
INFO Duplicate Rate : 40.33% core.py:212
INFO Total Time : 493.73 seconds core.py:213
cleaned_dataset
Dataset({
features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang', '__id__'],
num_rows: 5967
})
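If you want to persist the cleaned dataset, the standard datasets library API works as usual. The snippet below is a minimal sketch; the output paths are placeholders, and to_json / save_to_disk come from the datasets library, not from hf_clean_benchmarks:
# Save the cleaned dataset as JSON Lines (one record per line)...
cleaned_dataset.to_json("the-stack-smol-cleaned.jsonl")
# ...or in Arrow format, so it can be reloaded later with datasets.load_from_disk
cleaned_dataset.save_to_disk("the-stack-smol-cleaned")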
Using the CLI
First you need to specify which benchmarks you want to clean your data of. You do this by creating a JSON file containing a list of objects, one per benchmark, giving the benchmark's name in Hugging Face's datasets repository, the splits to use, and the columns containing the benchmark text. For example, if you want to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:
file: benchmarks.json
[
    {
        "name": "openai_humaneval",
        "splits": ["test"],
        "columns": ["prompt", "canonical_solution", "test"]
    },
    {
        "name": "lambada",
        "splits": ["test"],
        "columns": ["text"]
    }
]
You then pass this JSON file to the clean_dataset command, which will download the benchmarks, construct a suffix array for each one, and clean the given Hugging Face dataset. For example:
clean_dataset \
    --dataset_name bigcode/the-stack-smol \
    --column_name content \
    --benchmark_configs_path benchmarks.json \
    --output_path /tmp/test.jsonl \
    --data_dir data/python \
    --dataset_split train \
    --save_json
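With --save_json, the file written to --output_path is presumably a JSON Lines dump of the cleaned dataset (the path above ends in .jsonl). As a sketch, assuming that path, you can load it back with the datasets library:
from datasets import load_dataset

# Load the cleaned output written by the clean_dataset command
cleaned = load_dataset("json", data_files="/tmp/test.jsonl", split="train")
print(cleaned)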
File details
Details for the file hf_clean_benchmarks-0.0.1.tar.gz.
File metadata
- Download URL: hf_clean_benchmarks-0.0.1.tar.gz
- Upload date:
- Size: 12.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | a9181477b9429aa97b43ac2a19a2cab674ea98a34d288c3d6f69fb120abbc2dd
MD5 | 9d734f954aed272e44eed79335e0f67d
BLAKE2b-256 | be11419652ab74aaedd7181c503d3ba820905e95c50ac0f03051e33fc0916437
File details
Details for the file hf_clean_benchmarks-0.0.1-py3-none-any.whl.
File metadata
- Download URL: hf_clean_benchmarks-0.0.1-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 99c571f776277cd919c80f9b7b68fdc0bd03ade551cd9d3b3aa1e5ce7a53c9f9
MD5 | 5f6e31c93c2aac77bb82f42cfc54928a
BLAKE2b-256 | ea7a4c4e17d6eec4c2930bbfe8553e454cd9314d39dfba259901cd72cca046bc