URIEL+: Knowledge base for natural language processing

Project description

URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base

URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.

For more information, see our full paper.

Environment

Python 3.10 or later. If you use the MIDASpy extra dependencies, the Python version must be earlier than 3.11 (that is, Python 3.10.x). Dependency details are in setup.py.
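
The version constraints above can be checked before installing. A minimal sketch; the helper name `python_is_supported` and its `use_midaspy` flag are illustrative and not part of URIEL+:

```python
import sys


def python_is_supported(version=None, use_midaspy=False):
    """Return True if `version` meets the constraints stated above.

    `version` is a (major, minor, ...) tuple; it defaults to the
    interpreter that is running this code. Hypothetical helper, not
    part of the urielplus package.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < (3, 10):
        return False  # URIEL+ needs Python 3.10 or later
    if use_midaspy and major_minor >= (3, 11):
        return False  # the MIDASpy extras need Python earlier than 3.11
    return True
```

Calling `python_is_supported(use_midaspy=True)` with no version checks the current interpreter.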

Setup Instructions

  • To get started with URIEL+:

    # Run in a shell to install the package:
    pip install urielplus
    
    # Then, in Python, create the knowledge-base object:
    from urielplus import urielplus
    
    u = urielplus.URIELPlus()
    

Configuration Options Examples

  • URIEL+ offers various configurations that you can adjust:

    • Caching: Enable or disable caching (True or False).
    • Aggregation Method: Choose the method for aggregating data across sources ('U' for union, 'A' for average).
    • Fill Missing Data: Decide whether to fill missing data using parent language data (True or False).
    • Distance Metric: Specify the distance metric to be used ("angular" or "cosine").
  • Changing A Configuration:

    u.set_{configuration}({option})
    
  • Checking A Configuration:

    u.get_{configuration}()
    
  • Replace {configuration} with cache, aggregation, fill_with_base_lang, or distance_metric.

  • Replace {option} with your desired value for the selected configuration.

  • Note: the default configurations are cache=False, aggregation='U', fill_with_base_lang=True, and distance_metric="angular".
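
The `set_`/`get_` naming pattern makes it easy to apply several options at once. A minimal sketch, assuming only the method names and option values listed above and that the getters take no argument. `apply_config` and the `FakeURIELPlus` stand-in are hypothetical; the stand-in exists only so the example runs without the package installed. With the real library you would pass the `u` object created during setup.

```python
# Option values as documented above.
VALID_OPTIONS = {
    "cache": (True, False),
    "aggregation": ("U", "A"),
    "fill_with_base_lang": (True, False),
    "distance_metric": ("angular", "cosine"),
}


def apply_config(u, **options):
    """Call u.set_<name>(value) for each option, then read the values back."""
    for name, value in options.items():
        if name not in VALID_OPTIONS:
            raise ValueError(f"unknown configuration: {name!r}")
        if value not in VALID_OPTIONS[name]:
            raise ValueError(f"invalid value for {name}: {value!r}")
        getattr(u, f"set_{name}")(value)
    return {name: getattr(u, f"get_{name}")() for name in options}


class FakeURIELPlus:
    """Hypothetical stand-in: mimics the setter/getter names and documented defaults."""

    def __init__(self):
        self._config = {
            "cache": False,
            "aggregation": "U",
            "fill_with_base_lang": True,
            "distance_metric": "angular",
        }

    def __getattr__(self, attr):
        # Only reached for names not found normally, e.g. set_cache / get_cache.
        kind, _, name = attr.partition("_")
        if kind == "set" and name in self._config:
            return lambda value: self._config.__setitem__(name, value)
        if kind == "get" and name in self._config:
            return lambda: self._config[name]
        raise AttributeError(attr)
```

Validating against the documented values before calling the setter keeps a typo such as `distance_metric="cosin"` from reaching the library.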

Retrieving Loaded Features Examples

  • Retrieving A Loaded Feature:

    u.get_{vector_type}_{feature_type}_array()
    
  • Replace {vector_type} with phylogeny, typological, or geography.

  • Replace {feature_type} with features, languages, data, or sources.

  • Example:

    u.get_typological_languages_array()
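
The template above expands to twelve accessor names. A minimal sketch that enumerates them and calls each one on a given object; `accessor_names` and `fetch_all` are illustrative helpers, and whether every combination exists on the real `u` object is an assumption drawn from the template.

```python
from itertools import product

VECTOR_TYPES = ("phylogeny", "typological", "geography")
FEATURE_TYPES = ("features", "languages", "data", "sources")


def accessor_names():
    """Return every get_<vector_type>_<feature_type>_array name the template allows."""
    return [f"get_{v}_{f}_array" for v, f in product(VECTOR_TYPES, FEATURE_TYPES)]


def fetch_all(u):
    """Call each accessor on `u`; results are keyed by (vector_type, feature_type).

    Assumes `u` exposes all twelve methods and that they take no arguments.
    """
    return {
        (v, f): getattr(u, f"get_{v}_{f}_array")()
        for v, f in product(VECTOR_TYPES, FEATURE_TYPES)
    }
```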
    

Database Integration Examples

  • Integrating One Database:

    u.integrate_{database}()
    
  • Integrating Some Databases:

    u.integrate_custom_databases({databases})
    
  • Integrating All Databases:

    u.integrate_databases()
    
  • Set Language Codes to Glottocodes:

    u.set_glottocodes()
    
  • Reset all changes:

    u.reset()
    
  • Replace {database} with saphon, bdproto, grambank, apics, or ewave.

  • Replace {databases} with arguments "UPDATED_SAPHON", "BDPROTO", "GRAMBANK", "APICS", and/or "EWAVE" (e.g., "UPDATED_SAPHON", "BDPROTO", "EWAVE").
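
The short method names and the upper-case argument strings refer to the same five databases. A minimal sketch that translates one form into the other; the pairing (in particular `saphon` with "UPDATED_SAPHON") is inferred from the two lists above, and `database_arguments` is a hypothetical helper, not a URIEL+ function.

```python
# Pairing inferred from the two lists above (an assumption).
DATABASE_ARGUMENTS = {
    "saphon": "UPDATED_SAPHON",
    "bdproto": "BDPROTO",
    "grambank": "GRAMBANK",
    "apics": "APICS",
    "ewave": "EWAVE",
}


def database_arguments(names):
    """Map short database names to argument strings, keeping order and dropping repeats."""
    args = []
    for name in names:
        key = name.strip().lower()
        if key not in DATABASE_ARGUMENTS:
            raise ValueError(f"unknown database: {name!r}")
        arg = DATABASE_ARGUMENTS[key]
        if arg not in args:
            args.append(arg)
    return args
```

If `integrate_custom_databases` accepts the names as separate positional arguments, as the example above suggests, the result could be unpacked with `u.integrate_custom_databases(*database_arguments(["saphon", "ewave"]))`.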

Imputation Examples

  • Aggregate Typological Data:

    u.set_aggregation({aggregation}) 
    u.aggregate()
    
  • Impute Missing Values:

    u.{imputation_strategy}_imputation()
    
  • Replace {aggregation} with 'U' (union) or 'A' (average).

  • Replace {imputation_strategy} with midaspy, knn, softimpute, or mean.
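
To make the two aggregation options and the simplest imputation strategy concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not URIEL+'s implementation: missing values are represented as `None`, union is read as "1 if any source reports 1", average as the mean over sources that have data, and mean imputation fills a gap with the feature's mean across languages. The function names are illustrative.

```python
def aggregate(source_values, method="U"):
    """Combine one feature's values from several sources.

    `source_values` holds 0.0 or 1.0 per source, or None where a source has no data.
    'U' (union) keeps the largest known value; 'A' (average) takes the mean.
    """
    known = [v for v in source_values if v is not None]
    if not known:
        return None  # no source covers this feature
    if method == "U":
        return max(known)
    if method == "A":
        return sum(known) / len(known)
    raise ValueError(f"unknown aggregation method: {method!r}")


def mean_impute(matrix):
    """Return a copy of `matrix` (languages x features) with None cells filled.

    Each gap is replaced by the mean of the known values in the same column.
    A column with no known values is filled with 0.0 (an arbitrary choice here).
    """
    filled = [list(row) for row in matrix]
    for col in range(len(matrix[0])):
        known = [row[col] for row in matrix if row[col] is not None]
        col_mean = sum(known) / len(known) if known else 0.0
        for row in filled:
            if row[col] is None:
                row[col] = col_mean
    return filled
```

For example, `aggregate([1.0, 0.0, None], "U")` gives 1.0 while `aggregate([1.0, 0.0, None], "A")` gives 0.5, which shows how source disagreement is resolved differently by the two options.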

Language Distance Calculation Examples

  • Calculate a Specific Distance:

    print(u.new_distance({distance_type}, {languages}))
    
  • Calculate Distance Using Specific Features:

    print(u.new_custom_distance({features}, {languages}, {source}))
    
  • Retrieve Language Vectors:

    u.get_vector({distance_type}, {languages})
    
  • View URIEL+ Feature Coverage:

    u.feature_coverage()
    
  • Calculate Confidence Scores for Distances:

    print(u.confidence_score({language 1}, {language 2}, {distance_type}))
    
  • Replace {distance_type} with a distance type (e.g., "featural") or a list of distance types (e.g., ["syntactic", "phonological"]). When retrieving language vectors, pass a single distance type.

  • Replace {features} with a list of features (e.g., ["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]).

  • Replace {languages}, {language 1}, and {language 2} with language codes (e.g., "stan1293", "hind1269").

  • Replace {source} with one database (e.g., "WALS") or all databases ('A').

  • Note: the default {source} is all databases.
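
The two metric names offered by the distance_metric configuration ("angular" and "cosine") correspond to standard vector distances. A minimal pure-Python sketch of the textbook definitions, for intuition only: the exact normalisation URIEL+ applies internally is not stated here, so dividing the angle by pi is an assumption, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def cosine_distance(a, b):
    """1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def angular_distance(a, b):
    """Angle between the vectors, scaled to [0, 1] by dividing by pi (assumed scaling)."""
    return math.acos(cosine_similarity(a, b)) / math.pi
```

Unlike cosine distance, angular distance grows linearly with the angle between the vectors, which is one reason it is a common choice when distances are compared or averaged.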

Citation

If you use this code for your research, please cite the following work:

@article{khan2024urielplus,
  title={URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base},
  author={Khan, Aditya and Shipton, Mason and Anugraha, David and Duan, Kaiyao and Hoang, Phuong H. and Khiu, Eric and Doğruöz, A. Seza and Lee, En-Shiun Annie},
  journal={arXiv preprint arXiv:2409.18472},
  year={2024}
}

If you have any questions, you can open a GitHub Issue or send us an email.

Download files

Source Distribution

urielplus-1.1.tar.gz (7.5 MB)

Uploaded Source

Built Distribution

urielplus-1.1-py3-none-any.whl (7.5 MB)

Uploaded Python 3

File details

Details for the file urielplus-1.1.tar.gz.

File metadata

  • Download URL: urielplus-1.1.tar.gz
  • Upload date:
  • Size: 7.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for urielplus-1.1.tar.gz
Algorithm    Hash digest
SHA256       6c67e78d22b9db58e0d4bb64ed463d64ba49beb13e9a63e5f4b6fad3eecf8f89
MD5          b034156584a13cc82c5aac0ff883d907
BLAKE2b-256  413b55f3a1db9fc0ed92059c684981e11fc32889f95e976483f9bc5180b74697

Provenance

The following attestation bundles were made for urielplus-1.1.tar.gz:

Publisher: python-publish.yml on Masonshipton25/URIELPlus

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file urielplus-1.1-py3-none-any.whl.

File metadata

  • Download URL: urielplus-1.1-py3-none-any.whl
  • Upload date:
  • Size: 7.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for urielplus-1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       b186cc73c4c767fe88f16f56637375428def1116aee374b096e157ad6165e198
MD5          872114d83709de871c3e34f3e718730a
BLAKE2b-256  ee3b427500bce9482ff6d7aec4d287aa29d41ca1ab1052c29b4bdd80fb7726e4

Provenance

The following attestation bundles were made for urielplus-1.1-py3-none-any.whl:

Publisher: python-publish.yml on Masonshipton25/URIELPlus

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
