Language distances and information

This repository combines a number of sources to obtain information about languages and distances between them. It focuses on ISO 639-3 languages.

Usage

The package is located in src/distals/, and can be used from the command line: python3 src/distals/distals.py.

Its main function is to provide a user with distance metrics between two languages. This can be obtained by adding --lang1 and --lang2 to the command. For example:

python3 src/distals/distals.py --lang1 fry --lang2 dan

The distances are calculated based on information from existing databases. This information is loaded from a pickle file (default distals-db.pickle.gz), and a full database is included in this repo. The database can be updated with the --update_database and --update_textbased options.
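As a minimal sketch of what reading such a file could look like, the snippet below opens a gzip-compressed pickle with the standard library. It assumes the file is a plain gzip-compressed pickle, as the .pickle.gz extension suggests. The function name load_database is hypothetical and is not part of the package, and the structure of the loaded object is not described here, so the sketch only returns it.

```python
import gzip
import pickle


def load_database(path="distals-db.pickle.gz"):
    """Load a gzip-compressed pickle file and return the unpickled object.

    Hypothetical helper: assumes the file is a gzip-compressed pickle,
    as suggested by the .pickle.gz extension.
    """
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Example (requires the database file to be present):
# db = load_database("distals-db.pickle.gz")
# print(type(db))
```

Pickle files can execute arbitrary code when loaded, so only load database files obtained from a source you trust.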

The package also depends on conversion between language codes and language names. The database for this is included in langname-db.pickle.gz; it can be re-created with the --cache_langnames option.

For now, we assume all source data to be available in the data/ folder. The data can be updated using the scripts/0.update.sh script. So to update the database completely, one has to run:

./scripts/0.update.sh
python3 src/distals/distals.py --update_database

To update the text-based features, the LTI-LangId corpus needs to be downloaded, which takes a substantial amount of time (weeks). The steps for doing so can be seen in 0.get_miltale.sh.

It should be noted that all metrics are designed to have values between 0 and 1, and they are not directional. In cases where a metric could not be estimated, the code returns a -1.
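Because -1 marks a missing value rather than a real distance, it should be filtered out before metrics are combined downstream. The following is a minimal sketch of such a filter. The helper name mean_available and the input dictionary are hypothetical, and this is not necessarily how the tool computes its own "average" rows.

```python
def mean_available(distances):
    """Average the distances, skipping the -1 'not available' sentinel.

    Returns -1 if no metric is available, mirroring the convention above.
    Hypothetical helper, not part of the package.
    """
    available = [value for value in distances.values() if value != -1]
    if not available:
        return -1
    return sum(available) / len(available)


# Illustrative values; "loc" is marked as unavailable here for demonstration.
scores = {"wiki_size": 0.8167, "speakers": 0.8657, "loc": -1}
print(round(mean_available(scores), 4))
```

Including the -1 values in a plain mean would pull the result down and could even make it negative, which is why they are dropped first.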

Metrics

  • aes_dist: First collects the Agglomerated Endangerment Scale (AES) category for each language, and then calculates how many categories apart they are. See also https://glottolog.org/langdoc/status. The values were extracted with the scripts/getAES.py script, based on Glottolog 5.0.

  • asjp_lev_dist: Calculates the LDND distance on the ASJP word lists as defined in ``Adding typology to lexicostatistics: A combined approach to language classification''. Unfortunately, there is no 1-1 mapping between the language codes in ASJP and ISO 639-3 codes, so we made an automatic mapping based on the language name provided in ASJP and other sources. The script for this is in scripts/complete_lists.py and the results are in `data/aspj_conv`. We use the normalized Levenshtein distance as provided by ASJP (https://asjp.clld.org/software). When multiple versions of a word are available, we use the average (this was underspecified in the original paper, and we could not find a reference implementation).

  • lang2vec: Cosine distance between lang2vec vectors, only taking into account values that overlap. Note that this metric is thus hard to compare across language pairs, as different linguistic features will be included/excluded for different language pairs.

  • lang2vec_knn: Cosine distance between lang2vec vectors which have been completed through KNN by the original paper.

  • lang_fam: Distance based on the position of the two languages in the language family trees. If the two languages are in different trees, the value is always 2.0. If both languages are in the same tree, it is based on #overlapping edges / total edges of the deepest language of the two.

  • lang_group: distance between language groups as defined in ``The State and Fate of Linguistic Diversity and Inclusion in the NLP World''. Unfortunately, I could not obtain the language codes, but have made an automatic mapping (scripts/complete_lang2tax.py), which is available in `data/lang2tax.txt.codes`.

  • script: We use the set of scripts used for a language as collected by ``GlotScript: A Resource and Tool for Low Resource Writing System Identification''. We then calculate the percentage of overlap and take its inverse (1 - overlap) to obtain a distance metric. We ignore Braille (brai) in the calculations, as the information for this script is incomplete.

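  • speakers: Number of speakers as reported by ASJP; these are based on numbers from an old version of Ethnologue. Transformed to a distance metric by dividing the smallest by the largest number and subtracting the result from 1 (in the example output below, 740,000 and 5,510,600 speakers give 0.8657).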

  • wiki_size: Wikipedia size, which is extracted from a download of the Wikipedia page ``List_of_Wikipedias'', downloaded on 17-04-2024. Transformed to a distance metric by dividing the smallest by the largest number and subtracting the result from 1 (in the example output below, 56,299 and 307,173 give 0.8167).
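Three of the metrics above can be illustrated with a short sketch. This is not the package's implementation, and all function names are hypothetical. The ratio distance is written as 1 - smallest/largest because that reproduces the speakers and wiki_size values in the example output further down. The script distance assumes that "overlap" means shared scripts divided by all scripts used by either language (Jaccard overlap), which the description leaves open. The masked cosine distance assumes missing lang2vec values are marked with '--', as in the example output.

```python
import math

MISSING = "--"  # assumed marker for unknown lang2vec values


def ratio_distance(count1, count2):
    """1 - smallest/largest; -1 if a count is missing or not positive."""
    if count1 is None or count2 is None or count1 <= 0 or count2 <= 0:
        return -1
    return 1 - min(count1, count2) / max(count1, count2)


def script_distance(scripts1, scripts2, ignored=frozenset({"brai"})):
    """1 - Jaccard overlap of the script sets, ignoring Braille (assumption)."""
    set1 = set(scripts1) - ignored
    set2 = set(scripts2) - ignored
    if not set1 or not set2:
        return -1
    return 1 - len(set1 & set2) / len(set1 | set2)


def masked_cosine_distance(vec1, vec2):
    """Cosine distance over positions where both vectors have a value."""
    pairs = [(x, y) for x, y in zip(vec1, vec2) if x != MISSING and y != MISSING]
    dot = sum(x * y for x, y in pairs)
    norm1 = math.sqrt(sum(x * x for x, _ in pairs))
    norm2 = math.sqrt(sum(y * y for _, y in pairs))
    if norm1 == 0 or norm2 == 0:
        return -1  # no overlapping values, or an all-zero vector
    return 1 - dot / (norm1 * norm2)


# Speaker counts and Wikipedia sizes for fry and dan from the example output.
print(round(ratio_distance(740_000, 5_510_600), 4))
print(round(ratio_distance(56_299, 307_173), 4))
print(script_distance({"latn"}, {"latn"}))
```

The masking step is also why lang2vec scores are hard to compare across language pairs: the set of positions that survive the filter differs from pair to pair.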

Citations

Please provide the correct citations when using any of these metrics. People have spent a lot of their valuable time providing us with this data. Also, I would be interested to hear about your project if you find this repository useful, so I would appreciate a link or short description e-mailed to me (robv@itu.dk).

  • aes_dist:
@misc{glottolog,
    title = "Glottolog 5.0.",
    author = "Hammarström, Harald and Forkel, Robert and Haspelmath, Martin and Bank, Sebastian",
    year = 2024,
    url = "https://doi.org/10.5281/zenodo.10804357",
    publisher = "Leipzig: Max Planck Institute for Evolutionary Anthropology",
    misc = "Available online at http://glottolog.org, Accessed on 2024-04-24."
}
  • asjp_lev_dist:
@misc{ASJP,
author = {Wichmann, Søren and Holman, Eric W. and Brown, Cecil H.},
year = {2022},
title = {The {ASJP} Database (version 20)}
}
  • lang2vec:
@inproceedings{littell-etal-2017-uriel,
    title = "{URIEL} and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors",
    author = "Littell, Patrick  and
      Mortensen, David R.  and
      Lin, Ke  and
      Kairis, Katherine  and
      Turner, Carlisle  and
      Levin, Lori",
    editor = "Lapata, Mirella  and
      Blunsom, Phil  and
      Koller, Alexander",
    booktitle = "Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers",
    month = apr,
    year = "2017",
    address = "Valencia, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/E17-2002",
    pages = "8--14"
}
  • lang_fam:
@misc{glottolog,
    title = "Glottolog 5.0.",
    author = "Hammarström, Harald and Forkel, Robert and Haspelmath, Martin and Bank, Sebastian",
    year = 2024,
    url = "https://doi.org/10.5281/zenodo.10804357",
    publisher = "Leipzig: Max Planck Institute for Evolutionary Anthropology",
    misc = "Available online at http://glottolog.org, Accessed on 2024-04-24."
}
  • lang_group:
@inproceedings{joshi-etal-2020-state,
    title = "The State and Fate of Linguistic Diversity and Inclusion in the {NLP} World",
    author = "Joshi, Pratik  and
      Santy, Sebastin  and
      Budhiraja, Amar  and
      Bali, Kalika  and
      Choudhury, Monojit",
    editor = "Jurafsky, Dan  and
      Chai, Joyce  and
      Schluter, Natalie  and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.560",
    doi = "10.18653/v1/2020.acl-main.560",
    pages = "6282--6293"
}
  • script:
@article{kargaran2023glotscript,
  title={GlotScript: A Resource and Tool for Low Resource Writing System Identification},
  author={Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  journal={arXiv preprint arXiv:2309.13320},
  year={2023}
}
  • speakers:
@misc{ASJP,
author = {Wichmann, Søren and Holman, Eric W. and Brown, Cecil H.},
year = {2022},
title = {The {ASJP} Database (version 20)}
}
  • wiki_size:
https://en.wikipedia.org/wiki/List_of_Wikipedias

Example output

rob@cir:/data/rob/lang_dist$ python3 src/distals/distals.py --lang1 fry --lang2 dan
loading from: ./distals-db.pickle.gz
7855 languages loaded
========================================
Information for fry
wiki_size: 56,299
nlp_state: 1. The Scraping-Bys
speakers: 740,000
AES: 5. not endangered
loc: (5.86091, 53.143)
lang2vec: [1.0, 1.0, 0.0, ..., '--', '--', '--']
lang2vec_knn: [1.0, 1.0, 0.0, ..., 1.0, 0.0, 0.0]
grambank: {'GB020': 1, 'GB021': 1, 'GB022': 1, ..., 'GB520': 0, 'GB521': 0, 'GB522': 0}
glot_tree: ["'Western Frisian [west2354][fry]-l-'", "'Westlauwers-Terschelling Frisian [west2902]'", "'Modern West Frisian [mode1264]'", ..., "'Germanic [germ1287]'", "'Classical Indo-European [clas1257]'", "'Indo-European [indo1319]'"]
scripts: {'latn'}
asjp: [['1', 'ik'], ['2', 'do, yo'], ['3', 'vEi'], ..., ['95', 'fol'], ['96', 'nEy, nEi'], ['100', 'nam3']]
whitespace: 0.160835
punctuation: 0.031726

========================================
Information for dan
wiki_size: 307,173
nlp_state: 3. The Rising Stars
speakers: 5,510,600
AES: 5. not endangered
loc: (9.36284, 54.8655)
lang2vec: [1.0, 0.0, 0.0, ..., '--', '--', '--']
lang2vec_knn: [1.0, 0.0, 0.0, ..., 1.0, 0.0, 0.0]
grambank: {'GB020': 1, 'GB021': 1, 'GB022': 1, ..., 'GB520': 0, 'GB521': 0, 'GB522': 0}
glot_tree: ["'Danish [dani1285][dan]-l-'", "'South Scandinavian [sout3248]'", "'North Germanic [nort3160]'", "'Northwest Germanic [nort3152]'", "'Germanic [germ1287]'", "'Classical Indo-European [clas1257]'", "'Indo-European [indo1319]'"]
scripts: {'latn'}
asjp: [['1', 'yoy'], ['2', 'du'], ['3', 'vi'], ..., ['98', 'ron7'], ['99', 'tE7a'], ['100', 'now7n']]
whitespace: 0.156298
punctuation: 0.028514

========================================
Distances between fry and dan (-1 if the feature is not available for both)
METADATA
wiki_size: 0.8167
nlp_state: 0.4000
speakers: 0.8657
AES: 0.0000
loc: 0.0149
average: 0.5206

TYPOLOGY
lang2vec: 0.1598
lang2vec_knn: 0.1204
grambank: 0.0280
gb_clause: 0.0269
gb_nominal_domain: 0.0267
gb_numeral: 0.0353
gb_pronoun: 0.0000
gb_verbal_domain: 0.0328
glot_tree: 0.5325
scripts: 0.0000
average: 0.0280

WORDLISTS
asjp: 0.3397
concepts: 0.0400
average: 0.1898

TEXTBASED
whitespace: 0.0282
punctuation: 0.1012
JSD: 0.1979
average: 0.1979

Coverage:

7855 language codes found.
l2v_avg 3910
l2v_knn 3910
num_wikiarticles 286
speakers 5119
asjp 5581
glot_tree 7855
scripts 7393
state_and_fate 2264
AES 7718
loc 7624
speakers_l 5536
scripts_l 6425
conceptualizer 1271
grambank 2324
textdata found for 2110 iso-codes

Update

  • generate a new database
  • push/upload database
  • update link to db in src/distals/distals.py
  • push code
  • update number in setup.py
  • add to pip:
rm dist/*
python3 setup.py  sdist bdist_wheel
pip3 install dist/distals-0.1-py3-none-any.whl  --break-system-packages --force-reinstall
twine upload dist/*
