
URIEL+: Knowledge base for natural language processing


URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base


URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.

For more information, see our full paper.

Citation

If you use this code for your research, please cite the following work:

@inproceedings{khan-etal-2025-uriel,
    title = "{URIEL}+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base",
    author = {Khan, Aditya  and
      Shipton, Mason  and
      Anugraha, David  and
      Duan, Kaiyao  and
      Hoang, Phuong H.  and
      Khiu, Eric  and
      Do{\u{g}}ru{\"o}z, A. Seza  and
      Lee, En-Shiun Annie},
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.463/",
    pages = "6937--6952",
    abstract = "URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies."
}

If you have any questions, you can open a GitHub Issue or send us an email.

Contributors: Aditya Khan, Mason Shipton, York Hay Ng, David Anugraha, Kaiyao Duan, Phuong H. Hoang, Eric Khiu, Xiang Lu, A. Seza Doğruöz, En-Shiun Annie Lee


Environment

Requires Python 3.10 or later. If you use the MIDASpy extra dependencies, the Python version must be earlier than 3.11. Dependencies are listed in setup.py. NOTE: There are known issues with the MIDASpy extra dependencies; for the time being, please use a Python version from 3.10 up to, but not including, 3.11.
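
The version range stated above can be checked before installing the MIDASpy extras. The sketch below uses only the standard library; the helper name midaspy_python_ok is hypothetical and is not part of URIEL+.

```python
import sys


def midaspy_python_ok(version=None):
    """Return True if the version satisfies 3.10 <= version < 3.11.

    Hypothetical helper, not part of URIEL+. It only encodes the range
    this README states for the MIDASpy extra dependencies.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return (major, minor) == (3, 10)


if not midaspy_python_ok():
    # Informational only; plain URIEL+ needs just Python 3.10 or later.
    print("MIDASpy extras need Python 3.10.x; found", sys.version.split()[0])
```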

Setup Instructions

  • To get started with URIEL+, install the package from the command line:

    pip install urielplus

  • Then create a URIEL+ instance in Python:

    from urielplus import urielplus

    u = urielplus.URIELPlus()
    

Configuration Options Examples

  • URIEL+ offers various configurations that you can adjust:

    • Caching: Enable or disable caching (True or False).
    • Aggregation Method: Choose the method for aggregating data across sources ('U' for union, 'A' for average).
    • Fill Missing Data: Decide whether to fill missing data using parent language data (True or False).
    • Distance Metric: Specify the distance metric to be used ("angular" or "cosine").
  • Changing A Configuration:

    u.set_{configuration}({option})
    
  • Checking A Configuration:

    u.get_{configuration}()
    
  • Replace {configuration} with cache, aggregation, fill_with_base_lang, or distance_metric.

  • Replace {option} with your desired value for the selected configuration.

  • Note: the default configurations are cache=False, aggregation='U', fill_with_base_lang=True, and distance_metric="angular".
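
As a minimal sketch of the set_{configuration} pattern above, the helper below validates each option against the values listed in this section and then calls the matching setter. The names apply_config and ALLOWED are hypothetical and not part of URIEL+; only the setter naming pattern and the option values come from this README.

```python
# Allowed values, copied from the configuration list above.
ALLOWED = {
    "cache": {True, False},
    "aggregation": {"U", "A"},
    "fill_with_base_lang": {True, False},
    "distance_metric": {"angular", "cosine"},
}


def apply_config(u, **options):
    """Validate options, then call u.set_<name>(value) for each one.

    Hypothetical helper: `u` is any object exposing the set_{configuration}
    methods described above, such as a URIELPlus instance.
    """
    for name, value in options.items():
        if name not in ALLOWED:
            raise ValueError(f"unknown configuration: {name!r}")
        if value not in ALLOWED[name]:
            raise ValueError(f"invalid value {value!r} for {name!r}")
    # Apply only after every option has passed validation.
    for name, value in options.items():
        getattr(u, f"set_{name}")(value)


# Usage with a real instance (requires `pip install urielplus`):
# from urielplus import urielplus
# u = urielplus.URIELPlus()
# apply_config(u, cache=True, distance_metric="cosine")
```

Validating everything before calling any setter means a typo in one option does not leave the instance half-configured.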

Retrieving Loaded Features Examples

  • Retrieving A Loaded Feature:

    u.get_{vector_type}_{feature_type}_array()
    
  • Replace {vector_type} with phylogeny, typological, geography, or scriptural.

  • Replace {feature_type} with features, languages, data, or sources.

  • Example:

    u.get_typological_languages_array()
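
The {vector_type} and {feature_type} placeholders combine into 16 possible method names. The sketch below enumerates them and, given an instance, reports the length of each returned array. The helpers getter_names and summarize are hypothetical; the sketch assumes each getter takes no arguments and returns something that supports len(), and it skips any combination the instance does not provide.

```python
from itertools import product

VECTOR_TYPES = ("phylogeny", "typological", "geography", "scriptural")
FEATURE_TYPES = ("features", "languages", "data", "sources")


def getter_names():
    """Build every get_{vector_type}_{feature_type}_array name."""
    return [f"get_{v}_{f}_array" for v, f in product(VECTOR_TYPES, FEATURE_TYPES)]


def summarize(u):
    """Return {method name: len(result)} for each getter found on u."""
    sizes = {}
    for name in getter_names():
        method = getattr(u, name, None)  # None if u lacks this combination
        if callable(method):
            sizes[name] = len(method())
    return sizes


# Usage with a real instance (requires `pip install urielplus`):
# from urielplus import urielplus
# print(summarize(urielplus.URIELPlus()))
```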
    

Database Integration Examples

  • Integrating One Database:

    u.integrate_{database}()
    
  • Integrating Some Databases:

    u.integrate_custom_databases({databases})
    
  • Integrating All Databases:

    u.integrate_databases()
    
  • Set Language Codes to Glottocodes:

    u.set_glottocodes()
    
  • Reset all changes:

    u.reset()
    
  • Import (and replace all existing) data from a custom CSV file:

    u.import_csv({file_path}, {index})
    
  • Replace {database} with saphon, bdproto, grambank, apics, ewave, or glottolog.

  • Replace {databases} with arguments "UPDATED_SAPHON", "BDPROTO", "GRAMBANK", "APICS", "EWAVE", and/or "GLOTTOLOG" (e.g., "UPDATED_SAPHON", "BDPROTO", "EWAVE").

  • Replace {index} with 0 for genetic data, 1 for typological data, 2 for geographic data, or 3 for scriptural data.
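
A misspelled database name is easy to miss, so it can help to validate names before integrating. The sketch below checks the requested names against the list above and then forwards them to integrate_custom_databases as separate positional arguments, which is how this README describes the call. The helper integrate_selected is hypothetical and not part of URIEL+.

```python
# Database names accepted by integrate_custom_databases, per the list above.
DATABASES = ("UPDATED_SAPHON", "BDPROTO", "GRAMBANK", "APICS", "EWAVE", "GLOTTOLOG")


def integrate_selected(u, *names):
    """Validate database names, then call u.integrate_custom_databases(*names).

    Hypothetical helper: `u` is any object exposing
    integrate_custom_databases, such as a URIELPlus instance.
    """
    if not names:
        raise ValueError("pass at least one database name")
    unknown = [n for n in names if n not in DATABASES]
    if unknown:
        raise ValueError(f"unknown database(s): {unknown}; choose from {DATABASES}")
    u.integrate_custom_databases(*names)


# Usage with a real instance (requires `pip install urielplus`):
# from urielplus import urielplus
# u = urielplus.URIELPlus()
# integrate_selected(u, "UPDATED_SAPHON", "BDPROTO", "EWAVE")
```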

Imputation Examples

  • Aggregate Typological and Scriptural Data:

    u.set_aggregation({aggregation}) 
    u.aggregate()
    
  • Impute Missing Values:

    u.{imputation_strategy}_imputation()
    
  • Replace {aggregation} with 'U' (union) or 'A' (average).

  • Replace {imputation_strategy} with midaspy, knn, softimpute, or mean.
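
To illustrate what union and average aggregation and mean imputation do conceptually, here is a standalone sketch in plain Python. It is one plausible reading of those terms, not URIEL+'s implementation: the function names are hypothetical, and None is used here to mark missing data, which may differ from the encoding URIEL+ uses internally.

```python
def aggregate(values, method="U"):
    """Combine one feature's 0/1 values reported by several sources.

    None marks a source with no data for this feature.
    'U' (union): 1.0 if any source reports 1, else 0.0.
    'A' (average): mean of the reported values.
    Returns None when no source has data.
    """
    known = [v for v in values if v is not None]
    if not known:
        return None
    if method == "U":
        return float(max(known))
    if method == "A":
        return sum(known) / len(known)
    raise ValueError("method must be 'U' or 'A'")


def mean_impute(column):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in column if v is not None]
    if not known:
        return list(column)  # nothing to estimate the mean from
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


# Two sources disagree and a third has no data:
print(aggregate([1, 0, None], "U"))
print(aggregate([1, 0, None], "A"))
# One feature across three languages, with the middle value missing:
print(mean_impute([1.0, None, 0.0]))
```

Union keeps a feature as present if any source attests it, while average preserves disagreement between sources as a fractional value.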

Language Distance Calculation Examples

  • Calculate a Specific Distance:

    print(u.new_distance({distance_type}, {languages}))
    
  • Calculate Distance Using Specific Features:

    print(u.new_custom_distance({features}, {languages}, {source}))
    
  • Retrieve Language Vectors:

    u.get_vector({distance_type}, {languages})
    
  • View URIEL+ Feature Coverage:

    u.feature_coverage()
    
  • Calculate Confidence Scores for Distances:

    print(u.confidence_score({language 1}, {language 2}, {distance_type}))
    
  • Replace {distance_type} with a distance type (e.g., "featural") or a list of distance types (e.g., ["syntactic", "phonological"]). When retrieving language vectors, {distance_type} must be a single distance type.

  • Replace {features} with a list of features (e.g., ["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]).

  • Replace {languages}, {language 1}, and {language 2} with language codes (e.g., "stan1293", "hind1269").

  • Replace {source} with one database (e.g., "WALS") or all databases ('A').

  • Note: the default {source} is all databases.
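
The distance_metric configuration accepts "angular" or "cosine". The sketch below shows the standard textbook definitions of these two distances on plain Python lists, to make the difference concrete. It is an assumption that URIEL+ follows these exact formulas and normalization; the function names are hypothetical, and no URIEL+ code is called.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def cosine_distance(a, b):
    """1 - cosine similarity; ranges over [0, 2]."""
    return 1.0 - cosine_similarity(a, b)


def angular_distance(a, b):
    """Angle between the vectors divided by pi; ranges over [0, 1]."""
    return math.acos(cosine_similarity(a, b)) / math.pi


# Orthogonal toy vectors: no shared features.
print(cosine_distance([1, 0], [0, 1]), angular_distance([1, 0], [0, 1]))
```

Under these definitions, angular distance satisfies the triangle inequality, while cosine distance does not.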
