Implementation of the UniMax sampling method for effective language sampling for multilingual pretraining
UniMax
unimax_sampling implements the UniMax sampling method introduced by Chung et al. (2023). This method aims to balance language representation in multilingual large language models by explicitly capping the number of repeats over each language's corpus, thereby mitigating overfitting on tail languages while delivering more uniform coverage of head languages.
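As a rough illustration of the capping idea, the budget allocation described by Chung et al. (2023) can be sketched in a few lines of plain Python. This is not the package's source code: the function name `unimax_sketch` and its tuple return value are hypothetical and chosen only for this example. Languages are visited from the smallest corpus to the largest, each receives an equal share of the budget that is still unallocated, and that share is clipped to `max_epochs` passes over the language's corpus.

```python
def unimax_sketch(character_counts, character_budget, max_epochs):
    """Hypothetical sketch of a UniMax-style allocation (not the package API)."""
    remaining_budget = float(character_budget)
    budgets = {}
    # Visit languages from the smallest corpus to the largest.
    ordered = sorted(character_counts, key=character_counts.get)
    for index, language in enumerate(ordered):
        languages_left = len(ordered) - index
        # Equal share of whatever budget has not been allocated yet.
        uniform_share = remaining_budget / languages_left
        # Never repeat a corpus more than max_epochs times.
        cap = character_counts[language] * max_epochs
        budgets[language] = min(uniform_share, cap)
        remaining_budget -= budgets[language]
    total = sum(budgets.values())
    epochs = {lang: budgets[lang] / character_counts[lang] for lang in budgets}
    probabilities = {lang: budgets[lang] / total for lang in budgets}
    return budgets, epochs, probabilities


counts = {
    "swe": 179955884499,
    "fas": 184595788282,
    "ekk": 42541080893,
    "isl": 10027573389,
    "fao": 549707867,
}
budgets, epochs, probabilities = unimax_sketch(counts, 250_000_000_000, 4)
for language in budgets:
    print(language, budgets[language], epochs[language], probabilities[language])
```

With these counts, the two smallest corpora (`fao` and `isl`) reach the four-epoch cap, and the three larger ones split the remaining budget evenly.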
Installation
# UniMax algorithm only
pip install unimax_sampling
# Including optional dependencies for the count-characters sub-command
pip install unimax_sampling[count]
Programmatic Usage
from unimax import unimax, count_characters
# (Optional) count characters in each dataset, only available when optional dependencies are installed via unimax_sampling[count]
character_counts = {}
for subset in ("swe_Latn", "fas_Arab", "ekk_Latn", "isl_Latn", "fao_Latn"):
    character_counts[subset.split("_")[0]] = count_characters("HuggingFaceFW/fineweb-2", subset)
# Compute UniMax distribution from available characters per language
character_counts = {
    "swe": 179955884499,
    "fas": 184595788282,
    "ekk": 42541080893,
    "isl": 10027573389,
    "fao": 549707867,
}
distribution = unimax(
    character_counts,
    character_budget=250_000_000_000,
    max_epochs=4,
)
Output:
UniMaxDistribution(
    budgets={
        "fao": 2198831468,
        "isl": 40110293556,
        "ekk": 69230291658.66667,
        "swe": 69230291658.66666,
        "fas": 69230291658.66666,
    },
    epochs={
        "fao": 4.0,
        "isl": 4.0,
        "ekk": 1.627375003300828,
        "swe": 0.3847070177860806,
        "fas": 0.37503722215431134,
    },
    probabilities={
        "fao": 0.008795325872,
        "isl": 0.160441174224,
        "ekk": 0.2769211666346667,
        "swe": 0.27692116663466665,
        "fas": 0.27692116663466665,
    },
)
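One way the resulting probabilities might be consumed is to decide which language each training example is drawn from. The sketch below copies the probabilities from the output above into a plain dictionary, so it makes no assumption about how a `UniMaxDistribution` object exposes its fields. The helper `sample_languages` is hypothetical and relies only on the standard library.

```python
import random
from collections import Counter

# Probabilities copied from the example output above.
probabilities = {
    "fao": 0.008795325872,
    "isl": 0.160441174224,
    "ekk": 0.2769211666346667,
    "swe": 0.27692116663466665,
    "fas": 0.27692116663466665,
}


def sample_languages(probabilities, num_draws, seed=0):
    """Draw language codes in proportion to their sampling probability."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    languages = list(probabilities)
    weights = [probabilities[language] for language in languages]
    return rng.choices(languages, weights=weights, k=num_draws)


draws = sample_languages(probabilities, num_draws=10_000)
observed = Counter(draws)
for language in probabilities:
    # Empirical frequencies should land close to the target probabilities.
    print(language, observed[language] / len(draws))
```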
Commandline Usage
For convenience, the package can also be run as a command-line utility.
Counting Characters
python -m unimax count-characters <dataset_path> <language_code> <output_json_file> [-c <dataset_configuration>] [-s <split>]
[!NOTE]
count-characters requires unimax_sampling to be installed via pip install unimax_sampling[count]
Example:
python -m unimax count-characters HuggingFaceFW/fineweb-2 fra french.json -c fra_Latn
Calculating the UniMax Distribution
python -m unimax unimax <character_count_files> -c <character_budget> -m <max_epochs> [-r <language_codes>] [-o <distribution_json_file>]
Example:
python -m unimax unimax character_counts.json -c 250000000000 -m 4 -o distribution.json
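The two sub-commands can be chained when several languages are involved. The following dry-run sketch only builds and prints the command lines, using the flags documented above. It assumes one count file per language and that several count files may be passed positionally, which the plural `<character_count_files>` placeholder suggests but this README does not state outright. The helper names and the output file names are hypothetical.

```python
import shlex


def count_command(dataset_path, language_code, output_file, configuration):
    """Build a count-characters invocation from the documented arguments."""
    return [
        "python", "-m", "unimax", "count-characters",
        dataset_path, language_code, output_file, "-c", configuration,
    ]


def unimax_command(count_files, character_budget, max_epochs, output_file):
    """Build a unimax invocation; passing several count files is an assumption."""
    return [
        "python", "-m", "unimax", "unimax", *count_files,
        "-c", str(character_budget), "-m", str(max_epochs), "-o", output_file,
    ]


subsets = ("swe_Latn", "isl_Latn", "fao_Latn")
commands = []
count_files = []
for subset in subsets:
    language_code = subset.split("_")[0]
    output_file = f"{language_code}.json"  # hypothetical file name
    count_files.append(output_file)
    commands.append(
        count_command("HuggingFaceFW/fineweb-2", language_code, output_file, subset)
    )
commands.append(unimax_command(count_files, 250_000_000_000, 4, "distribution.json"))

for command in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(command))
```

To execute the commands rather than print them, each list could be passed to `subprocess.run(command, check=True)`.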
References
Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. 2023. UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda.