URIEL+: Knowledge base for natural language processing

URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base

URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.

For more information, see our full paper.

Environment

Requires Python 3.10.4 or higher. Dependencies are listed in requirements.txt.

Setup Instructions

  • To get started with URIEL+, install the package:

    pip install urielplus

  • Then import it in Python and create a URIELPlus object:

    from urielplus.urielplus import URIELPlus

    u = URIELPlus()
    

Configuration Options Examples

  • URIEL+ offers various configurations that you can adjust:

    • Caching: Enable or disable caching (True or False).
    • Aggregation Method: Choose the method for aggregating data across sources ('U' for union, 'A' for average).
    • Fill Missing Data: Decide whether to fill missing data using parent language data (True or False).
    • Distance Metric: Specify the distance metric to be used ("angular" or "cosine").
  • Changing A Configuration:

    u.set_{configuration}({option})
    
  • Checking A Configuration:

    u.get_{configuration}()
    
  • Replace {configuration} with cache, aggregation, fill_with_base_lang, or distance_metric.

  • Replace {option} with your desired value for the selected configuration.

  • Note: the default configurations are cache=False, aggregation='U', fill_with_base_lang=True, and distance_metric="angular".
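The set/get pattern above can be wrapped in a small helper when several options change at once. The sketch below is an illustration only: `apply_config` and `CONFIG_NAMES` are hypothetical names that are not part of URIEL+, and the helper assumes each setter takes one value and each getter takes no arguments.

```python
# Hypothetical helper (not part of URIEL+): applies several configuration
# options through the u.set_{configuration}(...) pattern described above.
CONFIG_NAMES = ("cache", "aggregation", "fill_with_base_lang", "distance_metric")


def apply_config(u, **options):
    """Call u.set_<name>(value) for each option and return the values read back."""
    for name, value in options.items():
        if name not in CONFIG_NAMES:
            raise ValueError(f"unknown configuration: {name!r}")
        getattr(u, f"set_{name}")(value)
    # Assumption: each getter takes no arguments.
    return {name: getattr(u, f"get_{name}")() for name in options}
```

With the `u` object from the setup step, `apply_config(u, cache=True, distance_metric="cosine")` would issue `u.set_cache(True)` and `u.set_distance_metric("cosine")`, then read both values back.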

Retrieving Loaded Features Examples

  • Retrieving A Loaded Feature:

    u.get_{vector_type}_{feature_type}_array()
    
  • Replace {vector_type} with phylogeny, typological, or geography.

  • Replace {feature_type} with features, languages, data, or sources.

  • Example:

    u.get_typological_languages_array()
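To see every getter name the template above can produce, the placeholders can be expanded programmatically. This is a minimal sketch that only builds the names as strings; it assumes every combination of vector type and feature type exists as a method, which the template implies but which is not verified here. `getter_names` is a hypothetical helper, not part of URIEL+.

```python
from itertools import product

# Placeholder values taken from the lists above.
VECTOR_TYPES = ("phylogeny", "typological", "geography")
FEATURE_TYPES = ("features", "languages", "data", "sources")


def getter_names():
    """Return every get_{vector_type}_{feature_type}_array name from the template."""
    return [f"get_{v}_{f}_array" for v, f in product(VECTOR_TYPES, FEATURE_TYPES)]


for name in getter_names():
    print(name)
```

A name from this list can then be looked up on the object with `getattr(u, name)()`.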
    

Database Integration Examples

  • Integrating One Database:

    u.integrate_{database}()
    
  • Integrating Some Databases:

    u.integrate_custom_databases({databases})
    
  • Integrating All Databases:

    u.integrate_databases()
    
  • Set Language Codes to Glottocodes:

    u.set_glottocodes()
    
  • Reset all changes:

    u.reset()
    
  • Replace {database} with saphon, bdproto, grambank, apics, or ewave.

  • Replace {databases} with arguments "UPDATED_SAPHON", "BDPROTO", "GRAMBANK", "APICS", and/or "EWAVE" (e.g., "UPDATED_SAPHON", "BDPROTO", "EWAVE").
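Note that the method suffix `saphon` corresponds to the argument "UPDATED_SAPHON", so the two naming schemes do not map by simple upper-casing. The sketch below shows one way to translate between them; `DATABASE_ARGS` and `database_args` are hypothetical names introduced for illustration and are not part of URIEL+.

```python
# Mapping from the integrate_{database} suffixes to the string arguments
# listed above. "saphon" is the one name that is not a plain upper-casing.
DATABASE_ARGS = {
    "saphon": "UPDATED_SAPHON",
    "bdproto": "BDPROTO",
    "grambank": "GRAMBANK",
    "apics": "APICS",
    "ewave": "EWAVE",
}


def database_args(*databases):
    """Translate lowercase database names into integrate_custom_databases arguments."""
    unknown = [d for d in databases if d.lower() not in DATABASE_ARGS]
    if unknown:
        raise ValueError(f"unsupported database(s): {unknown}")
    return [DATABASE_ARGS[d.lower()] for d in databases]
```

Assuming `integrate_custom_databases` takes the names as separate positional arguments, as the example above suggests, a call could look like `u.integrate_custom_databases(*database_args("saphon", "bdproto", "ewave"))`.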

Imputation Examples

  • Aggregate Typological Data:

    u.set_aggregation({aggregation}) 
    u.aggregate()
    
  • Impute Missing Values:

    u.{imputation_strategy}_imputation()
    
  • Replace {aggregation} with 'U' (union) or 'A' (average).

  • Replace {imputation_strategy} with midaspy, knn, softimpute, or mean.
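To make the two aggregation options and the simplest imputation strategy concrete, here is a plain-Python sketch of the underlying ideas. It is not URIEL+'s implementation: it assumes binary feature values, uses `None` to mark a missing value (URIEL+ has its own internal representation), and treats union as "1 if any source reports 1". The function names are hypothetical.

```python
# Illustrative only: None marks a missing value here; URIEL+ has its own
# internal representation and implementation.

def aggregate(values, method="U"):
    """Combine one binary feature's values reported by several sources.

    'U' (union): 1 if any source reports 1.
    'A' (average): mean of the reported values.
    Returns None when no source reports a value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return None
    if method == "U":
        return max(observed)
    if method == "A":
        return sum(observed) / len(observed)
    raise ValueError(f"unknown aggregation method: {method!r}")


def mean_impute(column):
    """Replace missing entries of one feature column with the observed mean."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


print(aggregate([1, 0, None], "U"))  # 1
print(aggregate([1, 0, None], "A"))  # 0.5
print(mean_impute([1, None, 0]))     # [1, 0.5, 0]
```

Union keeps a feature whenever any source attests it, while average reflects disagreement between sources as a fractional value.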

Language Distance Calculation Examples

  • Calculate a Specific Distance:

    print(u.new_distance({distance_type}, {languages}))
    
  • Calculate Distance Using Specific Features:

    print(u.new_custom_distance({features}, {languages}, {source}))
    
  • Retrieve Language Vectors:

    u.get_vector({distance_type}, {languages})
    
  • View URIEL+ Feature Coverage:

    u.feature_coverage()
    
  • Calculate Confidence Scores for Distances:

    print(u.confidence_score({language 1}, {language 2}, {distance_type}))
    
  • Replace {distance_type} with a distance type (e.g., "featural") or a list of distance types (e.g., ["syntactic", "phonological"]). Retrieving language vectors requires a single distance type.

  • Replace {features} with a list of features (e.g., ["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]).

  • Replace {languages}, {language 1}, and {language 2} with language codes (e.g., "stan1293", "hind1269").

  • Replace {source} with one database (e.g., "WALS") or all databases ('A').

  • Note: the default {source} is all databases.
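The "angular" and "cosine" options of the distance_metric configuration both derive from the cosine similarity of two language vectors. The sketch below uses common textbook definitions in plain Python; it is not URIEL+'s code, and the scaling URIEL+ applies to the angle may differ from the division by pi assumed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def angular_distance(a, b):
    """Angle between a and b, scaled to [0, 1] by dividing by pi (an assumption)."""
    return math.acos(cosine_similarity(a, b)) / math.pi
```

For two orthogonal vectors such as `[1, 0]` and `[0, 1]`, the cosine distance under these definitions is 1 and the angular distance is 0.5. Unlike cosine distance, the angular form satisfies the triangle inequality, which is one reason it is often preferred as a metric.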

Citation

If you use this code for your research, please cite the following work:

@article{khan2024urielplus,
  title={URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base},
  author={Khan, Aditya and Shipton, Mason and Anugraha, David and Duan, Kaiyao and Hoang, Phuong H. and Khiu, Eric and Doğruöz, A. Seza and Lee, En-Shiun Annie},
  journal={arXiv preprint arXiv:2409.18472},
  year={2024}
}

If you have any questions, you can open a GitHub Issue or send us an email.

