An implementation of UniMax, a language sampling method for multilingual pretraining

UniMax

unimax_sampling implements the UniMax sampling method introduced by Chung et al. (2023). UniMax balances language representation in multilingual large language models by spreading the training budget as uniformly as possible across languages while explicitly capping the number of repeats (epochs) over each language's corpus. The cap mitigates overfitting on tail languages, and the uniform allocation delivers more even coverage of head languages.
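To make the capping idea concrete, here is a minimal sketch of the allocation loop as described by Chung et al. (2023). It is written independently of this package: `unimax_sketch` and its return shape are hypothetical names chosen for illustration, not the unimax_sampling API, and the package's own implementation may differ in details.

```python
def unimax_sketch(character_counts, character_budget, max_epochs):
    """Illustrative UniMax allocation (hypothetical helper, not the package API)."""
    budgets = {}
    remaining_budget = character_budget
    # Visit languages from smallest to largest corpus so that tail
    # languages hit their epoch cap first and free budget for the rest.
    languages = sorted(character_counts, key=character_counts.get)
    for index, language in enumerate(languages):
        languages_left = len(languages) - index
        # Uniform share of whatever budget is still unallocated.
        uniform_share = remaining_budget / languages_left
        # Never repeat a corpus more than max_epochs times.
        cap = character_counts[language] * max_epochs
        budgets[language] = min(uniform_share, cap)
        remaining_budget -= budgets[language]

    epochs = {lang: budgets[lang] / character_counts[lang] for lang in budgets}
    total = sum(budgets.values())
    probabilities = {lang: budgets[lang] / total for lang in budgets}
    return budgets, epochs, probabilities


# Toy example: three corpora of 10, 100 and 1000 characters, budget of 600.
budgets, epochs, probabilities = unimax_sketch(
    {"a": 10, "b": 100, "c": 1000},
    character_budget=600,
    max_epochs=4,
)
print(budgets)
print(epochs)
```

In the toy example, "a" is capped at 4 × 10 = 40 characters, and "b" and "c" split the remaining 560 characters evenly at 280 each.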

Installation

# UniMax algorithm only
pip install unimax_sampling
# Including optional dependencies for the count-characters sub-command
pip install unimax_sampling[count]

Programmatic Usage

from unimax import unimax, count_characters

# (Optional) count characters in each dataset, only available when optional dependencies are installed via unimax_sampling[count]
character_counts = {}
for subset in ("swe_Latn", "fas_Arab", "ekk_Latn", "isl_Latn", "fao_Latn"):
    character_counts[subset.split("_")[0]] = count_characters("HuggingFaceFW/fineweb-2", subset)

# Compute UniMax distribution from available characters per language
character_counts = {
    "swe": 179955884499,
    "fas": 184595788282,
    "ekk": 42541080893,
    "isl": 10027573389,
    "fao": 549707867,
}

distribution = unimax(
    character_counts,
    character_budget=250_000_000_000,
    max_epochs=4,
)

Output:

UniMaxDistribution(
    budgets={
        "fao": 2198831468,
        "isl": 40110293556,
        "ekk": 69230291658.66667,
        "swe": 69230291658.66666,
        "fas": 69230291658.66666,
    },
    epochs={
        "fao": 4.0,
        "isl": 4.0,
        "ekk": 1.627375003300828,
        "swe": 0.3847070177860806,
        "fas": 0.37503722215431134,
    },
    probabilities={
        "fao": 0.008795325872,
        "isl": 0.160441174224,
        "ekk": 0.2769211666346667,
        "swe": 0.27692116663466665,
        "fas": 0.27692116663466665,
    },
)
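One way the resulting probabilities might be used is to decide which language each training document is drawn from. The sketch below assumes only the probability values printed above, copied into a plain dict; it does not rely on any attribute of `UniMaxDistribution`, and the seed and draw count are arbitrary choices for illustration.

```python
import random
from collections import Counter

# Probabilities copied from the example output above.
probabilities = {
    "fao": 0.008795325872,
    "isl": 0.160441174224,
    "ekk": 0.2769211666346667,
    "swe": 0.27692116663466665,
    "fas": 0.27692116663466665,
}

rng = random.Random(0)  # fixed seed so repeated runs draw the same sequence
languages = list(probabilities)
weights = list(probabilities.values())

# Draw the source language for 10,000 hypothetical training documents.
draws = rng.choices(languages, weights=weights, k=10_000)
counts = Counter(draws)

for language in languages:
    print(language, counts[language] / len(draws))
```

The empirical frequencies should land close to the target probabilities, with the three largest languages sampled about equally often despite their different corpus sizes.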

Command-line Usage

For convenience, the package can also be run as a command-line utility.

Counting Characters

python -m unimax count-characters <dataset_path> <language_code> <output_json_file> [-c <dataset_configuration>] [-s <split>]

Note: count-characters requires unimax_sampling to be installed via pip install unimax_sampling[count].

Example:

python -m unimax count-characters HuggingFaceFW/fineweb-2 fra french.json -c fra_Latn

Calculating the UniMax Distribution

python -m unimax unimax <character_count_files> -c <character_budget> -m <max_epochs> [-r <language_codes>] [-o <distribution_json_file>]

Example:

python -m unimax unimax character_counts.json -c 250000000000 -m 4 -o distribution.json

References

Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. 2023. UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda.
