
DialUp! Generating linguistically plausible artificial dialects; preprocessing low-resource language inputs.


DialUp

This package contains code for the noising and denoising techniques introduced in DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models and Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization.

Install

pip3 install dialup

Noising

Introduction

This code generates synthetic dialectal data from text in a given language, by applying linguistically motivated augmentations (noising) that simulate dialectal variation to source-language text. We present several kinds of noisers, introduced briefly below and described in detail in our papers.

Noisers

  • Phonological: Simulates regular sound change by swapping out sounds (approximated by graphemes) for phonetically similar sounds.
  • Morphological: Noises suffixes of words.
  • Lexical: Noises function and content words separately. Content words are swapped out for non-words generated by a chargram model. Function words are noised using a high dial of phonological noise.
  • Random char: Makes random character substitutions.
  • Random word: Makes random word substitutions.

These noisers can also be applied in composition (see Custom parametrization below).

Example

To create artificial dialectal versions of your text, you will need monolingual data in your language; this is used to train a character 3-gram model as a component of the noiser. Ideally, it should contain at least a few thousand sentences.

You can optionally configure your own composition of the above noisers (see below).

Here is an example for noising some Italian text:

>>> from dialup import Noiser, print_languages_with_inbuilt_noising_support
>>> print_languages_with_inbuilt_noising_support()
Supported languages for artificial related dialect / variant generation:  ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra']
For any other language, you can include support by following the steps here: https://github.com/niyatibafna/dialup/tree/master/mtd/generating_artificial_dialects.
>>> noiser_ita = Noiser(lang="ita", noiser_params=None, text_file="WikiMatrix.en-it.it")  # noiser_params not set, so default parameters are used
Character set: {'Ì', 'Q', 'v', 'H', 'n', 'P', 'Î', 'î', 'Ó', 'å', 'C', 'O', 'Û', 'ø', 'J', 'd', 'j', 'M', 'z', 'Z', 'ð', 'ò', 'û', 'A', 'Ü', 'Õ', 'ï', 'Þ', 'u', 'V', 'K', 'Ð', 'À', 'ù', 'I', 'à', 'R', 'Ç', 'Ñ', 'Í', 'Ý', 'L', 'È', 'é', '×', 'ñ', 'D', 'æ', 'ä', 'T', 'á', 'ö', 'ý', 'Ä', 'S', 'x', 'õ', 'Ò', '÷', 'Ô', 'í', 'ß', 'p', 'Â', 'E', 'ü', 'w', 'k', 'Æ', 'ã', 'e', 'f', 't', 'y', 'ô', 'ú', 'Y', 'W', 'ó', 'Ø', 'o', 'i', 'F', 'Ù', 'â', 'g', 'Å', 'B', 'Ú', 'ç', 'ê', 'm', 'Á', 'l', 'ì', 'b', 'N', 'þ', 'É', 'q', 'è', 'Ê', 'h', 'Ë', 'c', 's', 'a', 'Ï', 'ÿ', 'U', 'X', 'Ö', 'r', 'Ã', 'G', 'ë'}
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 650576
Training chargram model with chargram length 3...
Finished training chargram model with chargram length 3!
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 879046
Skipping random_char_aug as all thetas are 0
Skipping random_word_aug as all thetas are 0
>>> input = "È importante rendere la traduzione automatica robusta alla variazione dialettale."
>>> noised_input = noiser_ita.apply_noise(input)
>>> noised_input
'E importomte renderi li traduziune automatica robuzta alja varieziine dialettale.'
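A common next step is noising a whole corpus rather than a single sentence. Here is a minimal sketch under the assumption (consistent with the transcript above) that apply_noise takes and returns a single sentence string; the corpus file names are placeholders:

from dialup import Noiser

# Sketch: noise a corpus line by line to create a synthetic dialect corpus.
# "clean.it.txt" and "noised.it.txt" are placeholder file names.
noiser_ita = Noiser(lang="ita", noiser_params=None, text_file="WikiMatrix.en-it.it")

with open("clean.it.txt", encoding="utf-8") as fin, \
     open("noised.it.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(noiser_ita.apply_noise(line.strip()) + "\n")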

Custom parametrization

You can use your own parameterization of the noisers by passing a config like the one below as noiser_params:

text_file = "WikiMatrix.en-it.it"
params = {
    "lexical_aug": {
        "lang": lang,
        "theta_content_global": 0.001,
        "theta_func_global": 0.8,
        "text_file": text_file
    },
    "morph_aug": {
        "lang": lang,
        "theta_morph_global": 0.5,
        "text_file": text_file
    },
    "phonological_aug": {
        "lang": lang,
        "theta_phon": 0.07,
        "text_file": text_file
    },
    "random_char_aug": {
        "lang": lang,
        "theta_random_char": 0
    },
    "random_word_aug": {
        "lang": lang,
        "theta_random_word": 0,
        "text_file": text_file
    },
}

noiser_ita = Noiser(lang=lang, noiser_params=params, text_file=text_file)

The example config given here is the default parameterization, used when none is passed.
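To change the composition, adjust the thetas: a noiser whose thetas are all 0 is skipped entirely (note the "Skipping random_char_aug as all thetas are 0" lines in the transcript above). As an illustration, here is a hypothetical phonology-only configuration, using the same config keys as the default; the theta value is made up for the sketch:

lang = "ita"
text_file = "WikiMatrix.en-it.it"

# Phonological noise only: all other noisers' thetas are 0, so they are skipped.
phon_only_params = {
    "lexical_aug": {"lang": lang, "theta_content_global": 0, "theta_func_global": 0, "text_file": text_file},
    "morph_aug": {"lang": lang, "theta_morph_global": 0, "text_file": text_file},
    "phonological_aug": {"lang": lang, "theta_phon": 0.2, "text_file": text_file},  # illustrative value
    "random_char_aug": {"lang": lang, "theta_random_char": 0},
    "random_word_aug": {"lang": lang, "theta_random_word": 0, "text_file": text_file},
}

noiser_phon = Noiser(lang=lang, noiser_params=phon_only_params, text_file=text_file)

Higher thetas should correspond to heavier noising, and thus to artificial dialects that are more distant from the source language.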

Denoising

Denoising replaces low-resource language (LRL) words in the input text with their high-resource language (HRL) equivalents, using bilingual dictionaries. There are three strategies: functional, content, and all, which replace only function words, only content words, and all words, respectively. This package includes inbuilt support for function word denoising for 45 language pairs (i.e., no need to pass your own lexicon). It can also be used for any language pair whose HRL is one of ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra'] and whose LRL is any other language, provided you have a bilingual lexicon. Note that your lexicon can include both function and content words; the strategy you use determines which class of words is replaced in your LRL text. Make sure your lexicon has suitable coverage for your application scenario!

Example

>>> from dialup import Denoiser, print_language_pairs_with_inbuilt_denoising_support
>>> print_language_pairs_with_inbuilt_denoising_support()
Language pairs with included function word lexicons (strategy='functional'):  ['acf-hat', 'ajp-arb', 'arz-arb', 'bho-hin', 'crs-hat', 'glg-ita', 'jav-ind', 'mai-hin', 'pag-ind', 'scn-ita', 'sun-ind', 'vec-ita', 'acm-arb', 'apc-arb', 'ast-ita', 'cat-ita', 'fij-ind', 'hne-hin', 'lij-ita', 'mfe-hat', 'plt-ind', 'smo-ind', 'tgl-ind', 'zsm-ind', 'acq-arb', 'ars-arb', 'awa-hin', 'ceb-ind', 'fra-ita', 'ilo-ind', 'lmo-ita', 'mri-ind', 'por-ita', 'spa-ita', 'tuk-tur', 'aeb-arb', 'ary-arb', 'azj-tur', 'crh-tur', 'fur-ita', 'mag-hin', 'oci-ita', 'ron-ita', 'srd-ita', 'uzn-tur']
You can also perform denoising for any of the following high-resource languages: ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra'] with any other language provided you pass a bilingual lexicon.
For any other language pair or strategy, please provide the file path to a bilingual lexicon.
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy = "functional") # Use included lexicon for this pair OR
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy = "functional", bilingual_lexicon_path = "cat-ita.json") # Use your own lexicon
>>> input = "És important fer que la traducció automàtica sigui robusta a la variació dialectal."
>>> denoised_input = denoiser.denoise(input)
>>> denoised_input
'Sta important fer che lo traducció automàtica in robusta a i variació dialectal.'
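Since the three strategies differ only in which word class they replace, they can be compared on the same input. A sketch, assuming a lexicon file that covers both function and content words ("cat-ita.json" is the placeholder path from above); with the inbuilt lexicons, only strategy="functional" is available:

from dialup import Denoiser

text = "És important fer que la traducció automàtica sigui robusta a la variació dialectal."

# "content" and "all" replace content words, so they need a lexicon that covers them.
for strategy in ["functional", "content", "all"]:
    denoiser = Denoiser(lrl="cat", hrl="ita", strategy=strategy,
                        bilingual_lexicon_path="cat-ita.json")
    print(strategy, "->", denoiser.denoise(text))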

Bilingual lexicon format

Pass in the path to a JSON file that looks like this:

{
    <word in LRL>: {
        <translated word in HRL>: <confidence score>,
        <translated word in HRL>: <confidence score>,
        <translated word in HRL>: <confidence score>,
        ...
    }, ...
}

Confidence scores are optional; if present, the translation with the highest confidence is picked.
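If you are building such a lexicon yourself (e.g., from word alignments or an existing dictionary), writing it out with the json module is enough. A minimal sketch; the entries and scores below are made up for illustration:

import json

# Toy Catalan -> Italian lexicon; words and confidence scores are illustrative only.
lexicon = {
    "és": {"è": 0.9, "sta": 0.1},
    "que": {"che": 0.95},
    "amb": {"con": 0.8},
}

with open("cat-ita.json", "w", encoding="utf-8") as f:
    json.dump(lexicon, f, ensure_ascii=False, indent=2)

Pass the resulting path as bilingual_lexicon_path when constructing the Denoiser, as in the example above.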

Cite

If you use our code, please cite:

@inproceedings{bafna-etal-2024-evaluating,
  title = "Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization",
  author = "Bafna, Niyati and Murray, Kenton and Yarowsky, David",
  editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, Florida, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.emnlp-main.1044/",
  doi = "10.18653/v1/2024.emnlp-main.1044",
  pages = "18742--18762"
}

@article{bafna2025dialup,
  title = {DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models},
  author = {Bafna, Niyati and Chang, Emily and Robinson, Nathaniel R and Mortensen, David R and Murray, Kenton and Yarowsky, David and Sirin, Hale},
  journal = {arXiv preprint arXiv:2501.16581},
  year = {2025}
}
(Accepted at ACL 2025)

Contributors: Niyati Bafna, Emily Chang
