
DialUp! Generating linguistically plausible artificial dialects; preprocessing low-resource language inputs.


DialUp

This package contains code for the noising and denoising techniques introduced in DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models and Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization.

Install

pip3 install dialup

Noising

Introduction

This code generates synthetic dialectal data from text in a given language, by applying linguistically motivated augmentations (noising) that simulate dialectal variation to source-language text. We present several kinds of noisers, introduced briefly below and described in detail in our papers.

Noisers

  • Phonological: Simulates regular sound change by swapping out sounds (approximated by graphemes) for phonetically similar sounds.
  • Morphological: Noises suffixes of words.
  • Lexical: Noises function and content words separately. Content words are swapped out for non-words generated by a chargram model. Function words are noised using a high dial of phonological noise.
  • Random char: Makes random character substitutions.
  • Random word: Makes random word substitutions.

These noisers can also be applied in composition (see Custom parametrization below).

Example

To create artificial dialectal versions of your text, you will need monolingual data in your language; this is used to train a character 3-gram model as a component of the noiser. Ideally, it should contain at least a few thousand sentences.

You can optionally configure your own composition of the above noisers (see below).

Here is an example for noising some Italian text:

>>> from dialup import Noiser, print_languages_with_inbuilt_noising_support
>>> print_languages_with_inbuilt_noising_support()
Supported languages for artificial related dialect / variant generation:  ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra']
For any other language, you can include support by following the steps here: https://github.com/niyatibafna/dialup/tree/master/mtd/generating_artificial_dialects.
>>> noiser_ita = Noiser(lang="ita", noiser_params=None, text_file="WikiMatrix.en-it.it")  # noiser_params not set, so default parameters are used
Character set: {'Ì', 'Q', 'v', 'H', 'n', 'P', 'Î', 'î', 'Ó', 'å', 'C', 'O', 'Û', 'ø', 'J', 'd', 'j', 'M', 'z', 'Z', 'ð', 'ò', 'û', 'A', 'Ü', 'Õ', 'ï', 'Þ', 'u', 'V', 'K', 'Ð', 'À', 'ù', 'I', 'à', 'R', 'Ç', 'Ñ', 'Í', 'Ý', 'L', 'È', 'é', '×', 'ñ', 'D', 'æ', 'ä', 'T', 'á', 'ö', 'ý', 'Ä', 'S', 'x', 'õ', 'Ò', '÷', 'Ô', 'í', 'ß', 'p', 'Â', 'E', 'ü', 'w', 'k', 'Æ', 'ã', 'e', 'f', 't', 'y', 'ô', 'ú', 'Y', 'W', 'ó', 'Ø', 'o', 'i', 'F', 'Ù', 'â', 'g', 'Å', 'B', 'Ú', 'ç', 'ê', 'm', 'Á', 'l', 'ì', 'b', 'N', 'þ', 'É', 'q', 'è', 'Ê', 'h', 'Ë', 'c', 's', 'a', 'Ï', 'ÿ', 'U', 'X', 'Ö', 'r', 'Ã', 'G', 'ë'}
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 650576
Training chargram model with chargram length 3...
Finished training chargram model with chargram length 3!
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 879046
Skipping random_char_aug as all thetas are 0
Skipping random_word_aug as all thetas are 0
>>> input = "È importante rendere la traduzione automatica robusta alla variazione dialettale."
>>> noised_input = noiser_ita.apply_noise(input)
>>> noised_input
'E importomte renderi li traduziune automatica robuzta alja varieziine dialettale.'
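A common next step is noising a whole corpus rather than a single sentence. Here is a minimal sketch under the assumption (consistent with the transcript above) that apply_noise takes and returns a single sentence string; the corpus file names are placeholders:

from dialup import Noiser

# Sketch: noise a corpus line by line to create a synthetic dialect corpus.
# "clean.it.txt" and "noised.it.txt" are placeholder file names.
noiser_ita = Noiser(lang="ita", noiser_params=None, text_file="WikiMatrix.en-it.it")

with open("clean.it.txt", encoding="utf-8") as fin, \
     open("noised.it.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(noiser_ita.apply_noise(line.strip()) + "\n")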

Custom parametrization

You can use your own parameterization of the noisers by passing a config like the one below as noiser_params:

text_file = "WikiMatrix.en-it.it"
params = {
    "lexical_aug": {
        "lang": lang,
        "theta_content_global": 0.001,
        "theta_func_global": 0.8,
        "text_file": text_file
    },
    "morph_aug": {
        "lang": lang,
        "theta_morph_global": 0.5,
        "text_file": text_file
    },
    "phonological_aug": {
        "lang": lang,
        "theta_phon": 0.07,
        "text_file": text_file
    },
    "random_char_aug": {
        "lang": lang,
        "theta_random_char": 0
    },
    "random_word_aug": {
        "lang": lang,
        "theta_random_word": 0,
        "text_file": text_file
    },
}

noiser_ita = Noiser(lang=lang, noiser_params=params, text_file=text_file)

The example config given here is the default parameterization, used when none is passed.
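To change the composition, adjust the thetas: a noiser whose thetas are all 0 is skipped entirely (note the "Skipping random_char_aug as all thetas are 0" lines in the transcript above). As an illustration, here is a hypothetical phonology-only configuration, using the same config keys as the default; the theta value is made up for the sketch:

lang = "ita"
text_file = "WikiMatrix.en-it.it"

# Phonological noise only: all other noisers' thetas are 0, so they are skipped.
phon_only_params = {
    "lexical_aug": {"lang": lang, "theta_content_global": 0, "theta_func_global": 0, "text_file": text_file},
    "morph_aug": {"lang": lang, "theta_morph_global": 0, "text_file": text_file},
    "phonological_aug": {"lang": lang, "theta_phon": 0.2, "text_file": text_file},  # illustrative value
    "random_char_aug": {"lang": lang, "theta_random_char": 0},
    "random_word_aug": {"lang": lang, "theta_random_word": 0, "text_file": text_file},
}

noiser_phon = Noiser(lang=lang, noiser_params=phon_only_params, text_file=text_file)

Higher thetas should correspond to heavier noising, and thus to artificial dialects that are more distant from the source language.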

Denoising

Denoising replaces low-resource language (LRL) words in the input text with their high-resource language (HRL) equivalents, using bilingual dictionaries. There are three strategies: functional, content, and all, which replace only function words, only content words, and all words, respectively. This package includes inbuilt support for function word denoising for 45 language pairs (i.e., no need to pass your own lexicon). It can also be used for any language pair whose HRL is one of ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra'] and whose LRL is any other language, provided you have a bilingual lexicon. Note that your lexicon can include both function and content words; the strategy you use determines which class of words is replaced in your LRL text. Make sure your lexicon has suitable coverage for your application scenario!

Example

>>> from dialup import Denoiser, print_language_pairs_with_inbuilt_denoising_support
>>> print_language_pairs_with_inbuilt_denoising_support()
Language pairs with included function word lexicons (strategy='functional'):  ['acf-hat', 'ajp-arb', 'arz-arb', 'bho-hin', 'crs-hat', 'glg-ita', 'jav-ind', 'mai-hin', 'pag-ind', 'scn-ita', 'sun-ind', 'vec-ita', 'acm-arb', 'apc-arb', 'ast-ita', 'cat-ita', 'fij-ind', 'hne-hin', 'lij-ita', 'mfe-hat', 'plt-ind', 'smo-ind', 'tgl-ind', 'zsm-ind', 'acq-arb', 'ars-arb', 'awa-hin', 'ceb-ind', 'fra-ita', 'ilo-ind', 'lmo-ita', 'mri-ind', 'por-ita', 'spa-ita', 'tuk-tur', 'aeb-arb', 'ary-arb', 'azj-tur', 'crh-tur', 'fur-ita', 'mag-hin', 'oci-ita', 'ron-ita', 'srd-ita', 'uzn-tur']
You can also perform denoising for any of the following high-resource languages: ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra'] with any other language provided you pass a bilingual lexicon.
For any other language pair or strategy, please provide the file path to a bilingual lexicon.
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy = "functional") # Use included lexicon for this pair OR
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy = "functional", bilingual_lexicon_path = "cat-ita.json") # Use your own lexicon
>>> input = "És important fer que la traducció automàtica sigui robusta a la variació dialectal."
>>> denoised_input = denoiser.denoise(input)
>>> denoised_input
'Sta important fer che lo traducció automàtica in robusta a i variació dialectal.'
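Since the three strategies differ only in which word class they replace, they can be compared on the same input. A sketch, assuming a lexicon file that covers both function and content words ("cat-ita.json" is the placeholder path from above); with the inbuilt lexicons, only strategy="functional" is available:

from dialup import Denoiser

text = "És important fer que la traducció automàtica sigui robusta a la variació dialectal."

# "content" and "all" replace content words, so they need a lexicon that covers them.
for strategy in ["functional", "content", "all"]:
    denoiser = Denoiser(lrl="cat", hrl="ita", strategy=strategy,
                        bilingual_lexicon_path="cat-ita.json")
    print(strategy, "->", denoiser.denoise(text))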

Bilingual lexicon format

Pass in the path to a JSON file that looks like this:

{
    <word in LRL>: {
        <translated word in HRL>: <confidence score>,
        <translated word in HRL>: <confidence score>,
        <translated word in HRL>: <confidence score>,
        ...
    }, ...
}

Confidence scores are optional; if present, the translation with the highest confidence is picked.
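If you are building such a lexicon yourself (e.g., from word alignments or an existing dictionary), writing it out with the json module is enough. A minimal sketch; the entries and scores below are made up for illustration:

import json

# Toy Catalan -> Italian lexicon; words and confidence scores are illustrative only.
lexicon = {
    "és": {"è": 0.9, "sta": 0.1},
    "que": {"che": 0.95},
    "amb": {"con": 0.8},
}

with open("cat-ita.json", "w", encoding="utf-8") as f:
    json.dump(lexicon, f, ensure_ascii=False, indent=2)

Pass the resulting path as bilingual_lexicon_path when constructing the Denoiser, as in the example above.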

Cite

If you use our code, please cite:

@inproceedings{bafna-etal-2024-evaluating,
  title = "Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization",
  author = "Bafna, Niyati and Murray, Kenton and Yarowsky, David",
  editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, Florida, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.emnlp-main.1044/",
  doi = "10.18653/v1/2024.emnlp-main.1044",
  pages = "18742--18762"
}

@article{bafna2025dialup,
  title = {DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models},
  author = {Bafna, Niyati and Chang, Emily and Robinson, Nathaniel R and Mortensen, David R and Murray, Kenton and Yarowsky, David and Sirin, Hale},
  journal = {arXiv preprint arXiv:2501.16581},
  year = {2025}
}
(Accepted at ACL 2025)

Contributors: Niyati Bafna, Emily Chang
