Skip to main content

No project description provided

Project description

DistaLs

DistaLs is a database and toolkit that can be used to obtain distances between languages.

An online interface is available on: https://distals.streamlit.app/

A short introduction video to DistaLs can be viewed on: https://www.youtube.com/watch?v=SSk9xbssY2o

More information about the toolkit and its features can be found in the paper

Installation

Direct use from the repository

First install the required packages: pip3 install -r requirements.txt, and then you can run DistaLs directly from the src folder: python3 src/distals/distals.py.

Install

You can install DistaLs from pip: pip3 install distals, or from the repository:

python3 setup.py  sdist bdist_wheel
pip3 install dist/distals-0.1.1-py3-none-any.whl --break-system-packages --force-reinstall

After distals has been installed, you can directly call it from the command line using distals, or import it in python.

Usage

command line

The main usage is through the --langs parameter. It takes a list of languages as input, which can be either iso639-3 codes or full names of languages. We first try to parse the input as an ISO code, and treat it as a language name only if this is not successful. It first prints information from the databases for each language, followed by the distances between the languages. If you provide only one language, you just get the database information. If you provide two languages, you get a list of distances, and if you provide more, you get confusion matrices. An example output with two languages is shown below:

> distals --langs danish fry
database_path not defined, searching for database in:
current folder
loading from: ./distals-db.pickle.gz
7856 languages loaded
========================================
Information for Danish (dan)
wiki_size: 308,911
nlp_state: 3. The Rising Stars
speakers: 5,510,600
AES: 5. not endangered
loc: (9.36284, 54.8655)
lang2vec: [1.0, 0.0, 0.0, ..., '--', '--', '--']
lang2vec_knn: [1.0, 0.0, 0.0, ..., 1.0, 0.0, 0.0]
phoible: ['0061', '0062+0325', '0062+0325+02B0', ..., '0281+031E', '028B', '028C']
grambank: {'GB020': 1, 'GB021': 1, 'GB022': 1, ..., 'GB520': 0, 'GB521': 0, 'GB522': 0}
glot_tree: ["'Danish [dani1285][dan]-l-'", "'South Scandinavian [sout3248]'", "'North Germanic [nort3160]'", "'Northwest Germanic [nort3152]'", "'Germanic [germ1287]'", "'Classical Indo-European [clas1257]'", "'Indo-European [indo1319]'"]
scripts: {'latn'}
asjp: [['1', 'yoy'], ['2', 'du'], ['3', 'vi'], ..., ['98', 'ron7'], ['99', 'tE7a'], ['100', 'now7n']]
whitespace: 0.156298
punctuation: 0.028514
char_JSD: {' ': 0.1563, 'e': 0.1249, 'r': 0.0675, ..., 'Y': 0.0000, 'Á': 0.0000, 'Q': 0.0000}
textcat: [' ', 'e', 'r', ..., 'lle ', 'J', 'e de']

========================================
Information for Western Frisian (fry)
wiki_size: 57,027
nlp_state: 1. The Scraping-Bys
speakers: 740,000
AES: 5. not endangered
loc: (5.86091, 53.143)
lang2vec: [1.0, 1.0, 0.0, ..., '--', '--', '--']
lang2vec_knn: [1.0, 1.0, 0.0, ..., 1.0, 0.0, 0.0]
phoible: ['0061', '0061+0069', '0061+0075', ..., '026A+0259', '0275', '0275+0259']
grambank: {'GB020': 1, 'GB021': 1, 'GB022': 1, ..., 'GB520': 0, 'GB521': 0, 'GB522': 0}
glot_tree: ["'Western Frisian [west2354][fry]-l-'", "'Westlauwers-Terschelling Frisian [west2902]'", "'Modern West Frisian [mode1264]'", ..., "'Germanic [germ1287]'", "'Classical Indo-European [clas1257]'", "'Indo-European [indo1319]'"]
scripts: {'latn'}
asjp: [['1', 'ik'], ['2', 'do, yo'], ['3', 'vEi'], ..., ['95', 'fol'], ['96', 'nEy, nEi'], ['100', 'nam3']]
whitespace: 0.160835
punctuation: 0.031726
char_JSD: {' ': 0.1608, 'e': 0.1195, 'n': 0.0754, ..., 'ɾ': 0.0000, 'õ': 0.0000, 'ß': 0.0000}
textcat: [' ', 'e', 'n', ..., 'ing ', ' dat ', 'n.']

========================================
Distances between Danish (dan) and Western Frisian (fry), -1 if the feature is not available for both
METADATA
wiki_size: 0.8154
nlp_state: 0.4000
speakers: 0.8657
AES: 0.0000
loc: 0.0149
average: 0.4219

TYPOLOGY
lang2vec: 0.1598
lang2vec_knn: 0.1204
phoible: 0.8148
grambank: 0.3841
gb_clause: 0.3742
gb_nominal_domain: 0.3482
gb_numeral: 0.5000
gb_pronoun: 0.0000
gb_verbal_domain: 0.4644
glot_tree: 0.5325
scripts: 0.0000
average: 0.5995

WORDLISTS
asjp: 0.3397
concepts: 0.0400
average: 0.1898

TEXTBASED
whitespace: 0.0282
punctuation: 0.1012
char_JSD: 0.1979
textcat: 0.5859
average: 0.5859

The code is dependent on a database file. If it is not specified, it will search in the current folder and in ~/.cache/distals/. If it is not found, it will automatically download a recent version of the database to the .cache folder and use that.

To rebuild or update the database, you first need to scrape the relevant data sources, this can be done with the scripts/0.update.sh script, and 0.get_text_data.sh for the textbased features (note that this takes very long to download). After this has been done, you can run distals with the --database_path option, and one of the three update commands: --update_langnames for language lookup information, --update_textbased for updating the textbased features, and --update_databases for all other features.

python

First load a DistaLs models based on a database:

>>> from distals import distals
>>> model = distals.Distals()

Now you can query the model for distances between two languages with the get_dists() function:

>>> model.get_dists('nld', 'cmn')
{'metadata': {'AES': 0.0,
              'average': 0.39377015597113374,
              'loc': 0.39121043192100247,
              'nlp_state': 0.2,
              'speakers': 0.9813104679134012,
              'wiki_size': 0.9937779949135439},
 'textbased': {'average': 0.87235,
               'char_JSD': 0.5440088529474519,
               'punctuation': 0.6785480160817546,
               'textcat': 0.87235,
               'whitespace': 0.2124426381618435},
 'typology': {'average': 0.8025219634870027,
              'gb_clause': 0.5547001962252291,
              'gb_nominal_domain': 0.5976143046671968,
              'gb_numeral': 0.0,
              'gb_pronoun': 0.6454972243679029,
              'gb_verbal_domain': 0.6030226891555274,
              'glot_tree': 1.0,
              'grambank': 0.584781080334426,
              'lang2vec': 0.3165422989941302,
              'lang2vec_knn': 0.33795071460896176,
              'phoible': 0.8227848101265822,
              'scripts': 0.6666666666666667},
 'wordlists': {'asjp': 0.49635837471403577,
               'average': 0.28817918735701786,
               'concepts': 0.07999999999999996}}

By default, it returns the features as a hierarchy of dictionaries, including the averages over selected features. If you set aslist to True, you will get the features as a list (without averages). If you want to align them to the feature names you can obtain those from distals.classes:

>>> model.get_dists('nld', 'cmn', aslist=True)
([0.9937790230796005, 0.2, 0.9813104679134012, 0.0, 0.39121043192100247, 0.3165422989941302, 0.33795071460896176, 0.8227848101265822, 0.584781080334426, 0.5547001962252291, 0.5976143046671968, 0.0, 0.6454972243679029, 0.6030226891555274, 1.0, 0.6666666666666667, 0.4968771719863224, 0.07999999999999996, 0.2124426381618435, 0.6785480160817546, 0.5440088529474518, 0.87235], [0.5437723727482504, 0.7037829452305041, 0.2884385859931612, 0.7081794264737259])
>>> [x[1] for x in distals.classes]
['wiki_size', 'nlp_state', 'speakers', 'AES', 'loc', 'lang2vec', 'lang2vec_knn', 'phoible', 'grambank', 'gb_clause', 'gb_nominal_domain', 'gb_numeral', 'gb_pronoun', 'gb_verbal_domain', 'glot_tree', 'scripts', 'asjp', 'concepts', 'whitespace', 'punctuation', 'char_JSD', 'textcat']

For the distance measures based on ASJP, lang2vec and Grambank, DistaLs uses a threshold: distances are only calculated if at least 25% of the features have values for both languages. You can change this threshold with the threshold argument:

>>> model.get_dists('fry', 'jap')
{
  ...
 'typology': {'lang2vec': -1.0,
  'lang2vec_knn': 0.37192315798103215,
  'phoible': 0.8135593220338984,
  'grambank': 0.6093090226067065,
  'gb_clause': 0.6282808624375433,
  'gb_nominal_domain': 0.5679618342470648,
  'gb_numeral': -1.0,
  'gb_pronoun': 0.6454972243679029,
  'gb_verbal_domain': 0.6324555320336759,
  ...
}

>>> model.get_dists('fry', 'jap', threshold=0.9)
{
  ...
 'typology': {'lang2vec': -1.0,
  'lang2vec_knn': 0.37192315798103215,
  'phoible': 0.8135593220338984,
  'grambank': -1.0,
  'gb_clause': -1.0,
  'gb_nominal_domain': -1.0,
  'gb_numeral': -1.0,
  'gb_pronoun': 0.6454972243679029,
  'gb_verbal_domain': 0.6324555320336759,
  ...
}

Using only the name to iso639-3 conversion functionality is also possible:

>>> print(model.langname_utils.toISO('nederlands'))
nld

If you want direct access to the features of a language you can use model.all_data, it is a dictionary with languages as keys, and all their information stored in their values (also as a dictionary). Note that they are in a variety of formats.

web interface

The DistaLs web interface is hosted online at https://distals.streamlit.app. Running the interface locally can be done using the following command:

>>> python -m streamlit run gui/gui.py

Data

You can also use the precalculated distance metrics directly from csv files, but the files are quite large (as the matrices can be 7856x7856). You can download the csv files from https://itu.dk/people/robv/data/distals-precalculated.tar.gz

References

Please provide the correct citations when using any of these metrics. People have spend a lot of their valuable time providing us with this data. If you use all of them, please use:

We used DistaLs~\cite{van-der-goot-etal-2025-distals,ritchie-etal-2024-linguameta,kargaran-etal-2024-glotscript-resource,joshi-etal-2020-state,glottolog,littell-etal-2017-uriel,skirgardGrambankRevealsImportance2023,glottolog,phoible,ASJP,liu-etal-2023-crosslingual,brown-2014-non}

with the bib data in data/papers.bib in this repository.

@inproceedings{ritchie-etal-2024-linguameta,
    title = "{L}ingua{M}eta: Unified Metadata for Thousands of Languages",
    author = "Ritchie, Sandy  and
      van Esch, Daan  and
      Okonkwo, Uche  and
      Vashishth, Shikhar  and
      Drummond, Emily",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.921/",
    pages = "10530--10538",
}

Complemented with script information from:

@inproceedings{kargaran-etal-2024-glotscript-resource,
    title = "{G}lot{S}cript: A Resource and Tool for Low Resource Writing System Identification",
    author = {Kargaran, Amir Hossein  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.687",
    pages = "7774--7784"
}
  • State and Fate
@inproceedings{joshi-etal-2020-state,
    title = "The State and Fate of Linguistic Diversity and Inclusion in the {NLP} World",
    author = "Joshi, Pratik  and
      Santy, Sebastin  and
      Budhiraja, Amar  and
      Bali, Kalika  and
      Choudhury, Monojit",
    editor = "Jurafsky, Dan  and
      Chai, Joyce  and
      Schluter, Natalie  and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.560",
    doi = "10.18653/v1/2020.acl-main.560",
    pages = "6282--6293"
}
  • Glottolog
@misc{glottolog,
    title = "Glottolog 5.0.",
    author = "Hammarström, Harald and Forkel, Robert and Haspelmath, Martin and Bank, Sebastian",
    year = 2024,
    url = "https://doi.org/10.5281/zenodo.10804357",
    publisher = "Leipzig: Max Planck Institute for Evolutionary Anthropology",
    misc = "Available online at http://glottolog.org, Accessed on 2024-04-24."
}
  • lang2vec:
@inproceedings{littell-etal-2017-uriel,
    title = "{URIEL} and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors",
    author = "Littell, Patrick  and
      Mortensen, David R.  and
      Lin, Ke  and
      Kairis, Katherine  and
      Turner, Carlisle  and
      Levin, Lori",
    editor = "Lapata, Mirella  and
      Blunsom, Phil  and
      Koller, Alexander",
    booktitle = "Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers",
    month = apr,
    year = "2017",
    address = "Valencia, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/E17-2002",
    pages = "8--14"
}
  • Grambank
@article{skirgardGrambankRevealsImportance2023,
  title = {Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss},
  author = {Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, 
    Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. and Skilton, Amalia and Smith, Wikaliler Daniel and de Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D.},
  journal = {Science Advances},
  volume = {9},
  number = {16},
  doi = {10.1126/sciadv.adg6175},
  year = {2023}
}
  • Glottolog:
@misc{glottolog,
    title = "Glottolog 5.0.",
    author = "Hammarström, Harald and Forkel, Robert and Haspelmath, Martin and Bank, Sebastian",
    year = 2024,
    url = "https://doi.org/10.5281/zenodo.10804357",
    publisher = "Leipzig: Max Planck Institute for Evolutionary Anthropology",
    misc = "Available online at http://glottolog.org, Accessed on 2024-04-24."
}
  • PHOIBLE
@book{phoible,
  address   = {Jena},
  editor    = {Steven Moran and Daniel McCloy},
  publisher = {Max Planck Institute for the Science of Human History},
  title     = {PHOIBLE 2.0},
  url       = {https://phoible.org/},
  year      = {2019}
}
  • asjp_lev_dist:
@misc{ASJP,
author = {Wichmann and Søren and Holman, Eric W. and Brown, Cecil H.},
year = {2022},
title = {The {ASJP} Database (version 20)}
}
  • Conceptualizer
@inproceedings{liu-etal-2023-crosslingual,
    title = "A Crosslingual Investigation of Conceptualization in 1335 Languages",
    author = {Liu, Yihong  and
      Ye, Haotian  and
      Weissweiler, Leonie  and
      Wicke, Philipp  and
      Pei, Renhao  and
      Zangenfeind, Robert  and
      Sch{\"u}tze, Hinrich},
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.726/",
    doi = "10.18653/v1/2023.acl-long.726",
    pages = "12969--13000"
}
  • Text-based
@inproceedings{brown-2014-non,
    title = "Non-linear Mapping for Improved Identification of 1300+ Languages",
    author = "Brown, Ralf",
    editor = "Moschitti, Alessandro  and
      Pang, Bo  and
      Daelemans, Walter",
    booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
    month = oct,
    year = "2014",
    address = "Doha, Qatar",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D14-1069/",
    doi = "10.3115/v1/D14-1069",
    pages = "627--632"
}
  • DistaLs
@inproceedings{van-der-goot-etal-2025-distals,
    title= "{DistaLs}: a Comprehensive Collection of Language Distance Measures",
    author = "van der Goot, Rob and Ploeger, Esther and Blaschke, Verena and Samardzic, Tanja"
    year = "2025",
    booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations}
}

How to reproduce the paper

We follow the reproducability strategy proposed on: https://robvanderg.github.io/evaluation/repro/ . This means that we have a scripts folder that contains the code for all experiments. The experiments are numbered, and the exact commands to run to reproduce all results can be found in scripts/runall.sh. Note that these steps take very long, so I would discourage you to run runall.sh directly, but instead parallelize some steps. If you want to run the scripts, you need to install the dependencies from requirements-dev.txt.

How to add new languages

By design, DistaLs focuses only on ISO639-3 languages. The information from ISO639-3 is simply collected from the file data/data/iso-639-3.tab, so if you use a file in the same format you could also add other languages. However, most data sources use ISO639-3, so you will not be able to automatically extract the features. If you would just like to update a feature(set) for a language, this can be done through the original sources (and running scripts/0.update.sh), editing the downloaded files in the data/ folder, or directly editing the python dictionary that contains the database.

How to add new features

  • You can add the commands to download the new data source to the data folder in scripts/0.update.sh
  • You have to create a new python file for each new data source that you include.
  • In this file there have to be two functions: collect and distance_metric. The first one converts the raw data into the DistaLs database, the 2nd one uses this data to calculate a distance metric. Below we'll describe how they can/should be made:

collect: there is an all_data variable, which is a dictionary containing all metadata per language. So for each language, you should first find the language in the all_data dictionary, then add the new information with a new, unique name. Note that you can (and probably should) use the functions langname_utils.name_to_iso or langname_utiles.toISO to ensure that names and language codes are converted into valid ISO639-3 codes (or None if not available). Finally, the all_data dictionary should be returned.

distance_metric: This will receive the data that was stored in the collect function for 2 languages. This data can then be used to calculate a distance metric which should be returned.

  • You can see many examples of how features are implemented in src/distals/. We would recommend to look at src/distals/state_and_fate.py, as the data collection is relatively straightforward.
  • Import the file in src/distal/distals.py (twice)
  • Add to the classes variable in src/distals/distals.py, note that the string with the name should exactly match the one used in the collect function to store the data in the database.

How to update DistaLs (for devs)

  • generate a new database
  • push/upload database
  • update link to db in src/distals/distals.py
  • push code
  • update version number in setup.py
  • add to pip:
rm dist/*
python3 setup.py  sdist bdist_wheel
pip3 install dist/distals-0.1-py3-none-any.whl  --break-system-packages --force-reinstall
twine upload dist/*

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

distals-0.1.4.tar.gz (39.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

distals-0.1.4-py3-none-any.whl (35.4 kB view details)

Uploaded Python 3

File details

Details for the file distals-0.1.4.tar.gz.

File metadata

  • Download URL: distals-0.1.4.tar.gz
  • Upload date:
  • Size: 39.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for distals-0.1.4.tar.gz
Algorithm Hash digest
SHA256 1eabb73640d4981e6aab19df25b1eebc45dcab6870b4b3f366e8a59e10d1fabd
MD5 9c42e14f19c253a6acc44c46f771a5f8
BLAKE2b-256 4eb71c0ee89d553152c31610b47330c23c898cee44190efab8274b84b374cabb

See more details on using hashes here.

File details

Details for the file distals-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: distals-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 35.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for distals-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 b88f4eb928d8c07c11dfdf861f4b06cf94648abf413a1e3923bbca1e65765ea1
MD5 5b8c6f26a759137f33e1957fcea78d2a
BLAKE2b-256 4a5eb9c152dd473ce73b30c7c80452e5409ee7b87235ade19ee8a186ec005494

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page