DialUp! Generating linguistically plausible artificial dialects; preprocessing low-resource language inputs.
DialUp
This package contains code for the noising and denoising techniques introduced in DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models and Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization.
Install
pip3 install dialup
Noising
Introduction
This code generates synthetic dialectal data from text in a given language by applying linguistically motivated augmentation (noising) that simulates dialectal variation to source-language text. We present several kinds of noisers, introduced briefly below and described in detail in our papers.
Noisers
- Phonological: Simulates regular sound change by swapping out sounds (approximated by graphemes) for phonetically similar sounds.
- Morphological: Noises suffixes of words.
- Lexical: Noises function and content words separately. Content words are swapped out for non-words generated by a chargram model. Function words are noised using a high dial of phonological noise.
- Random char: Makes random character substitutions.
- Random word: Makes random word substitutions.
These noisers can also be applied in composition.
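To give an intuition for what the phonological noiser does, here is a toy sketch (not the package's actual implementation): graphemes are swapped for phonetically similar ones with some probability theta. The similarity map below is a made-up illustration; the real noiser uses linguistically grounded sound classes.

```python
import random

# Made-up grapheme similarity map for illustration only.
SIMILAR = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}

def phonological_noise(text: str, theta: float, rng: random.Random) -> str:
    """Swap each mapped grapheme for a 'similar' one with probability theta."""
    out = []
    for ch in text:
        if ch.lower() in SIMILAR and rng.random() < theta:
            swap = SIMILAR[ch.lower()]
            out.append(swap.upper() if ch.isupper() else swap)
        else:
            out.append(ch)
    return "".join(out)

rng = random.Random(0)
print(phonological_noise("buona giornata", 1.0, rng))
```

With theta = 1.0 every mapped grapheme is swapped; with theta = 0.0 the text is unchanged. The package exposes analogous theta "dials" per noiser.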
Example
To create artificial dialectal versions of your text, you will need monolingual data in your language; it is used to train a character 3-gram model as a component of the noiser. Ideally, it should contain at least a few thousand sentences.
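The character 3-gram model mentioned above can be pictured roughly as follows. This is a minimal sketch of counting character trigrams from monolingual sentences (the package trains its own model internally; the boundary-marker convention here is an assumption for illustration).

```python
from collections import Counter

def train_chargrams(sentences, n=3):
    """Count character n-grams, with ^/$ as start/end boundary markers."""
    counts = Counter()
    for s in sentences:
        s = f"^{s}$"
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

grams = train_chargrams(["ciao", "cibo"])
print(grams["^ci"])  # both toy sentences start with "ci"
```

Sampling from such counts is what lets the lexical noiser generate plausible-looking non-words in the language's orthography.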
You can optionally configure your own composition of the above noisers (see below).
Here is an example for noising some Italian text:
>>> from dialup import Noiser, print_languages_with_inbuilt_noising_support
>>> print_languages_with_inbuilt_noising_support()
Supported languages for artificial related dialect / variant generation: ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra']
For any other language, you can include support by following the steps here: https://github.com/niyatibafna/dialup/tree/master/mtd/generating_artificial_dialects.
>>> noiser_ita = Noiser(lang = "ita", noiser_params=None, text_file = "WikiMatrix.en-it.it") # noiser_params not set, using default parameters
Character set: {'Ì', 'Q', 'v', 'H', 'n', 'P', 'Î', 'î', 'Ó', 'å', 'C', 'O', 'Û', 'ø', 'J', 'd', 'j', 'M', 'z', 'Z', 'ð', 'ò', 'û', 'A', 'Ü', 'Õ', 'ï', 'Þ', 'u', 'V', 'K', 'Ð', 'À', 'ù', 'I', 'à', 'R', 'Ç', 'Ñ', 'Í', 'Ý', 'L', 'È', 'é', '×', 'ñ', 'D', 'æ', 'ä', 'T', 'á', 'ö', 'ý', 'Ä', 'S', 'x', 'õ', 'Ò', '÷', 'Ô', 'í', 'ß', 'p', 'Â', 'E', 'ü', 'w', 'k', 'Æ', 'ã', 'e', 'f', 't', 'y', 'ô', 'ú', 'Y', 'W', 'ó', 'Ø', 'o', 'i', 'F', 'Ù', 'â', 'g', 'Å', 'B', 'Ú', 'ç', 'ê', 'm', 'Á', 'l', 'ì', 'b', 'N', 'þ', 'É', 'q', 'è', 'Ê', 'h', 'Ë', 'c', 's', 'a', 'Ï', 'ÿ', 'U', 'X', 'Ö', 'r', 'Ã', 'G', 'ë'}
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 650576
Training chargram model with chargram length 3...
Finished training chargram model with chargram length 3!
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 879046
Skipping random_char_aug as all thetas are 0
Skipping random_word_aug as all thetas are 0
>>> input = "È importante rendere la traduzione automatica robusta alla variazione dialettale."
>>> noised_input = noiser_ita.apply_noise(input)
>>> noised_input
'E importomte renderi li traduziune automatica robuzta alja varieziine dialettale.'
Custom parametrization
You can use your own parameterization of the noisers by passing a config like the one below as noiser_params:
lang = "ita"
text_file = "WikiMatrix.en-it.it"
params = {
    "lexical_aug": {
        "lang": lang,
        "theta_content_global": 0.001,
        "theta_func_global": 0.8,
        "text_file": text_file
    },
    "morph_aug": {
        "lang": lang,
        "theta_morph_global": 0.5,
        "text_file": text_file
    },
    "phonological_aug": {
        "lang": lang,
        "theta_phon": 0.07,
        "text_file": text_file
    },
    "random_char_aug": {
        "lang": lang,
        "theta_random_char": 0
    },
    "random_word_aug": {
        "lang": lang,
        "theta_random_word": 0,
        "text_file": text_file
    },
}
noiser_ita = Noiser(lang=lang, noiser_params=params, text_file=text_file)
The config shown here is the default parameterization, used when none is passed.
Denoising
Denoising replaces low-resource language (LRL) words in the input text with their high-resource language (HRL) equivalents, using bilingual dictionaries.
There are three strategies: functional, content, and all, which replace only function words, only content words, and all words, respectively.
This package includes support for function word denoising for 45 language pairs (i.e. no need to pass your own lexicon).
This package can also be used for any language pair whose HRL is one of ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra'] and whose LRL is any other language, provided you have a bilingual lexicon.
Note that your lexicon can include both function and content words; the strategy you use determines which class of words is replaced in your LRL text. Make sure your lexicon has suitable coverage for your application scenario!
Example
>>> from dialup import Denoiser, print_language_pairs_with_inbuilt_denoising_support
>>> print_language_pairs_with_inbuilt_denoising_support()
Language pairs with included function word lexicons (strategy='functional'): ['acf-hat', 'ajp-arb', 'arz-arb', 'bho-hin', 'crs-hat', 'glg-ita', 'jav-ind', 'mai-hin', 'pag-ind', 'scn-ita', 'sun-ind', 'vec-ita', 'acm-arb', 'apc-arb', 'ast-ita', 'cat-ita', 'fij-ind', 'hne-hin', 'lij-ita', 'mfe-hat', 'plt-ind', 'smo-ind', 'tgl-ind', 'zsm-ind', 'acq-arb', 'ars-arb', 'awa-hin', 'ceb-ind', 'fra-ita', 'ilo-ind', 'lmo-ita', 'mri-ind', 'por-ita', 'spa-ita', 'tuk-tur', 'aeb-arb', 'ary-arb', 'azj-tur', 'crh-tur', 'fur-ita', 'mag-hin', 'oci-ita', 'ron-ita', 'srd-ita', 'uzn-tur']
You can also perform denoising for any of the following high-resource languages: ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra'] with any other language provided you pass a bilingual lexicon.
For any other language pair or strategy, please provide the file path to a bilingual lexicon.
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy = "functional") # Use included lexicon for this pair OR
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy = "functional", bilingual_lexicon_path = "cat-ita.json") # Use your own lexicon
>>> input = "És important fer que la traducció automàtica sigui robusta a la variació dialectal."
>>> denoised_input = denoiser.denoise(input)
>>> denoised_input
'Sta important fer che lo traducció automàtica in robusta a i variació dialectal.'
Bilingual lexicon format
Pass in the file path to a JSON file that looks like this:
{
    <word in LRL>: {
        <translated word in HRL>: <confidence score>,
        <translated word in HRL>: <confidence score>,
        ...
    },
    ...
}
Confidence scores are optional; if present, the translation with the highest confidence is picked.
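The highest-confidence lookup described above can be sketched like this. This is a hypothetical helper, not part of the dialup API, and the Catalan-Italian entries are made up for illustration.

```python
# Toy lexicon in the format above (entries are invented examples).
lexicon = {
    "és": {"è": 0.9, "ed": 0.1},
    "la": {"la": 1.0},
}

def best_translation(word, lexicon):
    """Return the highest-confidence HRL translation, or the word itself
    if the lexicon has no entry for it."""
    candidates = lexicon.get(word)
    if not candidates:
        return word
    return max(candidates, key=candidates.get)

print(best_translation("és", lexicon))
```

Words absent from the lexicon are left untouched, which is why lexicon coverage matters for denoising quality.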
Cite
If you use our code, please cite:
@inproceedings{bafna-etal-2024-evaluating,
title = "Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization",
author = "Bafna, Niyati and Murray, Kenton and Yarowsky, David",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1044/",
doi = "10.18653/v1/2024.emnlp-main.1044",
pages = "18742--18762"
}
@article{bafna2025dialup,
title={DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models},
author={Bafna, Niyati and Chang, Emily and Robinson, Nathaniel R and Mortensen, David R and Murray, Kenton and Yarowsky, David and Sirin, Hale},
journal={arXiv preprint arXiv:2501.16581},
year={2025}
}
(Accepted at ACL 2025)
Contributors: Niyati Bafna, Emily Chang
File details
Details for the file dialup-1.0.1.tar.gz.
File metadata
- Download URL: dialup-1.0.1.tar.gz
- Upload date:
- Size: 301.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52a730eaafc236623b20985cae342cb8d7321d5fe07470096e847ef791ac2737 |
| MD5 | 02b62165ec811f83bcff25212b58b047 |
| BLAKE2b-256 | e9b02bfb77cf6a020b290926b182fbac5f1587d79134e14161c098f1a77e76e7 |
File details
Details for the file dialup-1.0.1-py3-none-any.whl.
File metadata
- Download URL: dialup-1.0.1-py3-none-any.whl
- Upload date:
- Size: 330.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 327c5e5f5bd8fc18e081b4eca77d945c56f45cadadc1729589be620eb3902904 |
| MD5 | fdf5afbf17f1eab8a3e00e5cd76919e0 |
| BLAKE2b-256 | ab34621625749e8bca7764e241ef1372aba58126f94cad8648244d2db79baa1b |