Data Selection via Compression-Based Alignment

Data Selection for Language Models via Compression

This repository hosts the ZIP-FIT data selection framework, designed to effectively and efficiently select relevant training data for language models from any data source based on a specified target dataset.

ZIP-FIT is optimized for:

Rapid, large-scale data selection from extensive raw text datasets.
Identifying data that closely aligns with the distribution of a given target dataset (e.g., domain-specific data, HumanEval, etc.).
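To give a feel for what "compression-based alignment" can mean, here is a minimal sketch built on normalized compression distance (NCD), a standard compression-based similarity measure. It is an illustration only: the helper names (compressed_size, ncd, rank_by_alignment) are hypothetical, and whether the package scores documents in exactly this way is an assumption, not something this README states.

```python
import gzip
from typing import List, Sequence


def compressed_size(text: str) -> int:
    """Return the gzip-compressed size of `text` in bytes."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance; lower values mean more shared structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def rank_by_alignment(source: Sequence[str], target: Sequence[str], k: int) -> List[str]:
    """Return the k source texts with the highest mean (1 - NCD) against the target."""

    def score(doc: str) -> float:
        return sum(1.0 - ncd(doc, t) for t in target) / len(target)

    return sorted(source, key=score, reverse=True)[:k]


if __name__ == "__main__":
    target = ["def add(a, b):\n    return a + b\n" * 5]
    source = [
        "The weather was pleasant all week, with light winds from the north.",
        "def sub(a, b):\n    return a - b\n" * 5,
    ]
    print(rank_by_alignment(source, target, k=1))
```

This brute-force version compresses every source/target pair, so it is only meant for small experiments; it needs nothing beyond the standard library and a CPU.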

Compute needed:

1 CPU node

ZIP-FIT figure

Quickstart

Install with pip:

pip install zip-fit

To select data, initialize a ZIPFIT object and call its run() method:

from zip_fit import ZIPFIT

source_dataset = "<path>"  # path to the source (raw) data
target_dataset = "<path>"  # path to the target data
top_k = 10000

zipfit = ZIPFIT(source_dataset, target_dataset, k=top_k, output_file="top_k_sequences.jsonl")
zipfit.run()

Running this generates a jsonl file named 'top_k_sequences.jsonl' containing 10,000 documents. For best performance, use uncompressed jsonl files on local file storage for all data paths, and use as many CPU cores as possible. You can provide custom functions for reading the data paths and extracting the text field from each example through constructor parameters; the example below passes them for the target dataset as target_load_fn and target_parse_fn.
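Since local, uncompressed jsonl is the recommended input format, the following sketch shows one way to write any iterable of dict examples to a jsonl file and to read a jsonl file back, for instance to inspect the selected output. The helper names write_jsonl and read_jsonl and the "text" field are assumptions made for illustration; the README does not specify the record schema of the output file, so check the first record before relying on a particular key.

```python
import json
from typing import Dict, Iterable, Iterator


def write_jsonl(examples: Iterable[Dict], path: str) -> int:
    """Write one JSON object per line to `path`; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of `path`."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical records; "text" is an assumed field name.
    n = write_jsonl([{"text": "first document"}, {"text": "second document"}], "source.jsonl")
    print(n, next(read_jsonl("source.jsonl")))
```

Any iterable that yields dicts can be passed to write_jsonl, which makes it a simple way to materialize a dataset on local disk once instead of streaming it on every run.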

Examples

Hugging Face datasets can also be used for either source_dataset or target_dataset. Note, however, that streaming a large raw dataset directly may be slow, so this approach is better suited to target datasets:

from zip_fit import ZIPFIT
from datasets import load_dataset

source_dataset = '/path/to/source.jsonl'
target_dataset = 'openai/openai_humaneval'

# Define the function to load the target dataset
def target_load_dataset_fn(dataset):
    ds = load_dataset(dataset, split='test', trust_remote_code=True)
    return ds

# Define the function to parse examples from the target dataset
def target_parse_example_fn(ex):
    text = f"Problem description: {ex['prompt']} \nCanonical solution: {ex['canonical_solution']}"
    return text

# Create an instance of ZIPFIT
zip_fit_instance = ZIPFIT(
    source_dataset=source_dataset,
    target_dataset=target_dataset,
    target_load_fn=target_load_dataset_fn,
    target_parse_fn=target_parse_example_fn,
    k=100000,
    output_file="top_k_sequences.jsonl",
    compression_algorithm='gzip'  # Change to 'lz4' if desired
)

# Run the ZIPFIT process
zip_fit_instance.run()

You can specify different compression algorithms. The ZIP-FIT paper uses gzip, but other compression algorithms such as lz4 are faster.
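To compare the two algorithms on your own data before choosing one, a rough sketch like the following may help. It is independent of the package: the compress and benchmark helpers are hypothetical, lz4 support relies on the third-party lz4 package (pip install lz4), and how ZIPFIT handles compression_algorithm internally is not shown here.

```python
import gzip
import time
from typing import Tuple

try:
    import lz4.frame as lz4_frame  # third-party, optional
except ImportError:
    lz4_frame = None


def compress(data: bytes, algorithm: str = "gzip") -> bytes:
    """Compress `data` with the named algorithm ('gzip' or 'lz4')."""
    if algorithm == "gzip":
        return gzip.compress(data)
    if algorithm == "lz4":
        if lz4_frame is None:
            raise RuntimeError("lz4 is not installed; run `pip install lz4`")
        return lz4_frame.compress(data)
    raise ValueError(f"unsupported algorithm: {algorithm!r}")


def benchmark(data: bytes, algorithm: str, repeats: int = 20) -> Tuple[int, float]:
    """Return (compressed size in bytes, mean seconds per call) over `repeats` runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        out = compress(data, algorithm)
    elapsed = (time.perf_counter() - start) / repeats
    return len(out), elapsed


if __name__ == "__main__":
    sample = ("def add(a, b):\n    return a + b\n" * 2000).encode("utf-8")
    for name in ("gzip", "lz4"):
        try:
            size, secs = benchmark(sample, name)
            print(f"{name}: {size} bytes, {secs * 1e3:.3f} ms per call")
        except RuntimeError as err:
            print(f"{name}: skipped ({err})")
```

Timings depend on the machine and the data, so treat the numbers as a guide rather than a guarantee; a faster compressor may also compress less tightly, which could change which documents are selected.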

Citation Information

Paper: https://arxiv.org/abs/2410.18194

@article{obbad2024zipfit,
  author = {Elyas Obbad and Iddah Mlauzi and Brando Miranda and Rylan Schaeffer and Kamal Obbad and Suhana Bedi and Sanmi Koyejo},
  title = {ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment},
  year = {2024},
  journal = {arXiv preprint arXiv:2410.18194},
}
