Language distances and information
This repository combines a number of sources to provide information about languages and distances between them. It focuses on ISO639-3 languages.
Usage
The package is located in src/distals/, and can be used from the
command line: python3 src/distals/distals.py.
Its main function is to provide a user with distance metrics between
two languages. These can be obtained by adding --lang1 and --lang2
to the command. For example:
python3 src/distals/distals.py --lang1 fry --lang2 dan
The distances are calculated based on information from existing databases.
This information is loaded from a pickle file (default distals-db.pickle.gz),
and a full database is included in this repo. It can also easily be updated,
by using the --update_database option and the --update_textbased option.
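If you want to inspect the database outside of the command line tool, a gzip-compressed pickle can be read with the standard library. This is a minimal sketch: the helper name `load_pickle_gz` is hypothetical, and the structure of the object stored in the file is not described here, so inspect it before relying on it.

```python
import gzip
import pickle


def load_pickle_gz(path):
    """Load a gzip-compressed pickle file and return the stored object."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```

For example, `db = load_pickle_gz("distals-db.pickle.gz")`. Only unpickle files from a source you trust, as loading a pickle can execute arbitrary code.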
The package also depends on the conversion of language codes and language names.
The database for this is included in langname-db.pickle.gz, and it can be
re-created with the --cache_langnames option.
For now, we assume all source data to be available in the data/ folder. The
data can be updated using the scripts/0.update.sh script. So to update the
database completely, one has to run:
./scripts/0.update.sh
python3 src/distals/distals.py --update_database
To update the text-based features, the LTI-LangId corpus needs to be downloaded,
which takes a substantial amount of time (weeks). The steps for doing so can be
seen in 0.get_miltale.sh.
Note that all metrics are designed to have values between 0 and 1, and they are not directional. When a metric cannot be estimated, the code returns -1.
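Because -1 marks a missing value rather than a distance, it should be filtered out before metrics are combined. The sketch below shows one way a user could average several metrics; `mean_available` is a hypothetical helper, and it is not necessarily how the tool computes the "average" rows in its own output.

```python
def mean_available(distances):
    """Average the metric values, skipping the -1 'not available' sentinel."""
    valid = [value for value in distances.values() if value != -1]
    if not valid:
        return -1  # nothing could be estimated
    return sum(valid) / len(valid)
```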
Metrics
- aes_dist: First collects the Agglomerated Endangerment Scale (AES) category for each language, and then calculates how many groups apart they are. See also https://glottolog.org/langdoc/status. These values were extracted with the scripts/getAES.py script, based on Glottolog 5.0.
- asjp_lev_dist: Calculates the LDND distance on the ASJP word lists as defined in ``Adding typology to lexicostatistics: A combined approach to language classification''. Unfortunately, there is no 1-1 mapping between the language codes in ASJP and ISO639-3 codes, so we made an automatic mapping based on the language name provided in ASJP and other sources. The script for this is scripts/complete_lists.py and the results are in `data/aspj_conv`. We use the normalized Levenshtein distance as provided by ASJP (https://asjp.clld.org/software). When multiple versions of a word are available, we use the average (this was underspecified in the original paper, and we could not find a reference implementation).
- lang2vec: Cosine distance between lang2vec vectors, only taking into account values that overlap. Note that this metric is thus hard to compare across language pairs, as different linguistic features will be included/excluded for different language pairs.
- lang2vec_knn: Cosine distance between lang2vec vectors which have been completed through KNN by the original paper.
- lang_fam: Distance in the language family tree. If the two languages are in different trees, it is always 2.0. If both languages are in the same tree, it is the number of overlapping edges divided by the total number of edges of the deeper of the two languages.
- lang_group: Distance between language groups as defined in ``The State and Fate of Linguistic Diversity and Inclusion in the NLP World''. Unfortunately, I could not obtain the language codes, but I have made an automatic mapping (scripts/complete_lang2tax.py), which is available in `data/lang2tax.txt.codes`.
- script: We use the set of scripts used for a language as collected by ``GlotScript: A Resource and Tool for Low Resource Writing System Identification''. We then calculate the percentage of overlap and take the inverse (1-overlap) to obtain a distance metric. We ignore Braille (brai) in the calculations, as the information for this script is incomplete.
- speakers: Number of speakers as reported by ASJP; these are based on numbers from an old version of Ethnologue. Transformed to a distance metric by dividing the smaller number by the larger one; the example output below (0.8657 for 740,000 vs. 5,510,600 speakers) corresponds to 1 minus this ratio.
- wiki_size: Wikipedia size, which is extracted from a download of the Wikipedia page ``List_of_Wikipedias'', downloaded on 17-04-2024. Transformed to a distance metric in the same way as speakers (0.8167 for 56,299 vs. 307,173 in the example output below).
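To make the asjp_lev_dist description more concrete, here is a minimal sketch of LDND: the Levenshtein distance is normalized by the length of the longer word (LDN), and the mean LDN over word pairs with the same meaning is divided by the mean LDN over pairs with different meanings. The function names are hypothetical, every character is treated as one symbol (ASJP's own software may tokenize differently), only concepts present in both lists are compared, and averaging over all synonym combinations is one assumed reading of "we use the average". It is not the code used in this repository.

```python
from itertools import product


def levenshtein(a, b):
    """Plain edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ldn(a, b):
    """Levenshtein distance normalized by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_ldn(words_a, words_b):
    """Average LDN over all synonym combinations (lists assumed non-empty)."""
    pairs = list(product(words_a, words_b))
    return sum(ldn(a, b) for a, b in pairs) / len(pairs)


def ldnd(list_a, list_b):
    """Mean LDN of same-concept pairs / mean LDN of different-concept pairs.

    list_a and list_b map a concept id to a non-empty list of synonyms.
    Returns -1 when the distance cannot be estimated.
    """
    shared = sorted(set(list_a) & set(list_b))
    same = [mean_ldn(list_a[c], list_b[c]) for c in shared]
    different = [mean_ldn(list_a[c], list_b[d])
                 for c in shared for d in shared if c != d]
    if not same or not different:
        return -1
    baseline = sum(different) / len(different)
    if baseline == 0:
        return -1
    return (sum(same) / len(same)) / baseline
```

Note that a ratio defined this way is not bounded by 1 in general, so an implementation that guarantees values between 0 and 1 needs an additional step that this sketch leaves out.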
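The lang2vec metric only compares positions where both languages have a value. A minimal sketch of that idea follows, assuming missing values are marked with '--' as in the example output; the function name is hypothetical, and returning -1 for vectors without usable overlap is an assumption based on the convention described above.

```python
import math

MISSING = "--"  # marker for unknown values, as seen in the example output


def overlap_cosine_distance(vec_a, vec_b):
    """Cosine distance over positions where both vectors have a value."""
    pairs = [(a, b) for a, b in zip(vec_a, vec_b)
             if a != MISSING and b != MISSING]
    dot = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(a * a for a, _ in pairs))
    norm_b = math.sqrt(sum(b * b for _, b in pairs))
    if norm_a == 0 or norm_b == 0:
        return -1  # no overlap, or an all-zero vector: not estimable
    return 1 - dot / (norm_a * norm_b)
```

Because the set of compared positions changes per language pair, two distances computed this way may be based on different features, which is why the metric is hard to compare across pairs.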
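The script, speakers and wiki_size metrics are simple enough to sketch in a few lines. The function names are hypothetical. For script, "overlap" is assumed here to mean the Jaccard overlap (shared scripts divided by all scripts), which the description above does not specify. For the count-based metrics, 1 minus the ratio reproduces the values in the example output.

```python
IGNORED_SCRIPTS = {"brai"}  # Braille is left out, as described above


def script_distance(scripts_a, scripts_b):
    """1 - overlap between two script sets; overlap is assumed to be Jaccard."""
    a = set(scripts_a) - IGNORED_SCRIPTS
    b = set(scripts_b) - IGNORED_SCRIPTS
    if not a or not b:
        return -1  # no script information left for one of the languages
    return 1 - len(a & b) / len(a | b)


def ratio_distance(count_a, count_b):
    """1 - smaller/larger, for counts such as speakers or Wikipedia size."""
    if count_a <= 0 or count_b <= 0:
        return -1  # missing or unusable counts
    return 1 - min(count_a, count_b) / max(count_a, count_b)
```

With the numbers from the example output, `ratio_distance(740000, 5510600)` rounds to 0.8657 (speakers) and `ratio_distance(56299, 307173)` rounds to 0.8167 (wiki_size).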
Citations
Please provide the correct citations when using any of these metrics. People have spent a lot of their valuable time providing us with this data. Also, I would be interested to hear about your project if you find this repository useful, so I would appreciate a link/short description e-mailed to me (robv@itu.dk).
- aes_dist:
@misc{glottolog,
title = "Glottolog 5.0.",
author = "Hammarström, Harald and Forkel, Robert and Haspelmath, Martin and Bank, Sebastian",
year = 2024,
url = "https://doi.org/10.5281/zenodo.10804357",
publisher = "Leipzig: Max Planck Institute for Evolutionary Anthropology",
misc = "Available online at http://glottolog.org, Accessed on 2024-04-24."
}
- asjp_lev_dist:
@misc{ASJP,
author = {Wichmann, Søren and Holman, Eric W. and Brown, Cecil H.},
year = {2022},
title = {The {ASJP} Database (version 20)}
}
- lang2vec:
@inproceedings{littell-etal-2017-uriel,
title = "{URIEL} and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors",
author = "Littell, Patrick and
Mortensen, David R. and
Lin, Ke and
Kairis, Katherine and
Turner, Carlisle and
Levin, Lori",
editor = "Lapata, Mirella and
Blunsom, Phil and
Koller, Alexander",
booktitle = "Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers",
month = apr,
year = "2017",
address = "Valencia, Spain",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/E17-2002",
pages = "8--14"
}
- lang_fam:
@misc{glottolog,
title = "Glottolog 5.0.",
author = "Hammarström, Harald and Forkel, Robert and Haspelmath, Martin and Bank, Sebastian",
year = 2024,
url = "https://doi.org/10.5281/zenodo.10804357",
publisher = "Leipzig: Max Planck Institute for Evolutionary Anthropology",
misc = "Available online at http://glottolog.org, Accessed on 2024-04-24."
}
- lang_group:
@inproceedings{joshi-etal-2020-state,
title = "The State and Fate of Linguistic Diversity and Inclusion in the {NLP} World",
author = "Joshi, Pratik and
Santy, Sebastin and
Budhiraja, Amar and
Bali, Kalika and
Choudhury, Monojit",
editor = "Jurafsky, Dan and
Chai, Joyce and
Schluter, Natalie and
Tetreault, Joel",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-main.560",
doi = "10.18653/v1/2020.acl-main.560",
pages = "6282--6293"
}
- script:
@article{kargaran2023glotscript,
title={GlotScript: A Resource and Tool for Low Resource Writing System Identification},
author={Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
journal={arXiv preprint arXiv:2309.13320},
year={2023}
}
- speakers:
@misc{ASJP,
author = {Wichmann, Søren and Holman, Eric W. and Brown, Cecil H.},
year = {2022},
title = {The {ASJP} Database (version 20)}
}
- wiki_size:
https://en.wikipedia.org/wiki/List_of_Wikipedias
Example output
rob@cir:/data/rob/lang_dist$ python3 src/distals/distals.py --lang1 fry --lang2 dan
loading from: ./distals-db.pickle.gz
7855 languages loaded
========================================
Information for fry
wiki_size: 56,299
nlp_state: 1. The Scraping-Bys
speakers: 740,000
AES: 5. not endangered
loc: (5.86091, 53.143)
lang2vec: [1.0, 1.0, 0.0, ..., '--', '--', '--']
lang2vec_knn: [1.0, 1.0, 0.0, ..., 1.0, 0.0, 0.0]
grambank: {'GB020': 1, 'GB021': 1, 'GB022': 1, ..., 'GB520': 0, 'GB521': 0, 'GB522': 0}
glot_tree: ["'Western Frisian [west2354][fry]-l-'", "'Westlauwers-Terschelling Frisian [west2902]'", "'Modern West Frisian [mode1264]'", ..., "'Germanic [germ1287]'", "'Classical Indo-European [clas1257]'", "'Indo-European [indo1319]'"]
scripts: {'latn'}
asjp: [['1', 'ik'], ['2', 'do, yo'], ['3', 'vEi'], ..., ['95', 'fol'], ['96', 'nEy, nEi'], ['100', 'nam3']]
whitespace: 0.160835
punctuation: 0.031726
========================================
Information for dan
wiki_size: 307,173
nlp_state: 3. The Rising Stars
speakers: 5,510,600
AES: 5. not endangered
loc: (9.36284, 54.8655)
lang2vec: [1.0, 0.0, 0.0, ..., '--', '--', '--']
lang2vec_knn: [1.0, 0.0, 0.0, ..., 1.0, 0.0, 0.0]
grambank: {'GB020': 1, 'GB021': 1, 'GB022': 1, ..., 'GB520': 0, 'GB521': 0, 'GB522': 0}
glot_tree: ["'Danish [dani1285][dan]-l-'", "'South Scandinavian [sout3248]'", "'North Germanic [nort3160]'", "'Northwest Germanic [nort3152]'", "'Germanic [germ1287]'", "'Classical Indo-European [clas1257]'", "'Indo-European [indo1319]'"]
scripts: {'latn'}
asjp: [['1', 'yoy'], ['2', 'du'], ['3', 'vi'], ..., ['98', 'ron7'], ['99', 'tE7a'], ['100', 'now7n']]
whitespace: 0.156298
punctuation: 0.028514
========================================
Distances between fry and dan (-1 if the feature is not available for both)
METADATA
wiki_size: 0.8167
nlp_state: 0.4000
speakers: 0.8657
AES: 0.0000
loc: 0.0149
average: 0.5206
TYPOLOGY
lang2vec: 0.1598
lang2vec_knn: 0.1204
grambank: 0.0280
gb_clause: 0.0269
gb_nominal_domain: 0.0267
gb_numeral: 0.0353
gb_pronoun: 0.0000
gb_verbal_domain: 0.0328
glot_tree: 0.5325
scripts: 0.0000
average: 0.0280
WORDLISTS
asjp: 0.3397
concepts: 0.0400
average: 0.1898
TEXTBASED
whitespace: 0.0282
punctuation: 0.1012
JSD: 0.1979
average: 0.1979
Coverage:
7855 language codes found.
l2v_avg 3910
l2v_knn 3910
num_wikiarticles 286
speakers 5119
asjp 5581
glot_tree 7855
scripts 7393
state_and_fate 2264
AES 7718
loc 7624
speakers_l 5536
scripts_l 6425
conceptualizer 1271
grambank 2324
textdata found for 2110 iso-codes
Update
- generate a new database
- push/upload database
- update link to db in src/distals/distals.py
- push code
- update number in setup.py
- add to pip:
rm dist/*
python3 setup.py sdist bdist_wheel
pip3 install dist/distals-0.1-py3-none-any.whl --break-system-packages --force-reinstall
twine upload dist/*