URIEL+: Knowledge base for natural language processing
URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
For more information, see our full paper.
Contents
- Environment
- Setup Instruction
- Configuration Options Examples
- Retrieving Loaded Features Examples
- Database Integration Examples
- Imputation Examples
- Language Distance Calculation Examples
- Citation
Environment
Python 3.10.4 or higher. Dependencies are listed in requirements.txt.
Setup Instruction
To get started with URIEL+:
pip install urielplus
from urielplus.urielplus import URIELPlus
u = URIELPlus()
Configuration Options Examples
URIEL+ offers various configurations that you can adjust:
- Caching: Enable or disable caching (True or False).
- Aggregation Method: Choose the method for aggregating data across sources ('U' for union, 'A' for average).
- Fill Missing Data: Decide whether to fill missing data using parent language data (True or False).
- Distance Metric: Specify the distance metric to be used ("angular" or "cosine").
- Changing A Configuration:
u.set_{configuration}({option})
- Checking A Configuration:
u.get_{configuration}({option})
- Replace {configuration} with cache, aggregation, fill_with_base_lang, or distance_metric.
- Replace {option} with your desired value for the selected configuration.
- Note: the default configurations are cache=False, aggregation='U', fill_with_base_lang=True, and distance_metric="angular".
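As an illustration of how the setter template expands, the sketch below switches each setting away from its default. It uses only the method names obtained by substituting the four configuration names into u.set_{configuration}; the helper name apply_non_default_config is hypothetical, and the argument is assumed to be an existing URIELPlus instance.

```python
def apply_non_default_config(u):
    """Switch each URIEL+ setting away from its documented default.

    `u` is assumed to be an already-created URIELPlus instance. This helper
    is illustrative only and is not part of the urielplus package.
    """
    u.set_cache(True)                 # default: cache=False
    u.set_aggregation('A')            # default: aggregation='U'
    u.set_fill_with_base_lang(False)  # default: fill_with_base_lang=True
    u.set_distance_metric("cosine")   # default: distance_metric="angular"
    return u
```

Calling u.reset() afterwards is described in the database section as undoing all changes; whether that also restores these settings is not stated here, so check before relying on it.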
Retrieving Loaded Features Examples
- Retrieving A Loaded Feature:
u.get_{vector_type}_{feature_type}_array()
- Replace {vector_type} with phylogeny, typological, or geography.
- Replace {feature_type} with features, languages, data, or sources.
- Example:
u.get_typological_languages_array()
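The getter template can also be filled in programmatically. The following sketch builds the method name from the two placeholders and calls it on an existing instance. The helper names getter_names and load_array are hypothetical, and the code assumes that every combination of vector type and feature type has a matching getter, as the template suggests.

```python
from itertools import product

# Placeholder values listed in this section.
VECTOR_TYPES = ("phylogeny", "typological", "geography")
FEATURE_TYPES = ("features", "languages", "data", "sources")


def getter_names():
    """Return the getter names formed from every placeholder combination."""
    return [f"get_{v}_{f}_array" for v, f in product(VECTOR_TYPES, FEATURE_TYPES)]


def load_array(u, vector_type, feature_type):
    """Call the matching getter on `u` (assumed to be a URIELPlus instance)."""
    if vector_type not in VECTOR_TYPES or feature_type not in FEATURE_TYPES:
        raise ValueError(f"unknown combination: {vector_type!r}, {feature_type!r}")
    return getattr(u, f"get_{vector_type}_{feature_type}_array")()
```

With this helper, load_array(u, "typological", "languages") is equivalent to the example call above.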
Database Integration Examples
- Integrating One Database:
u.integrate_{database}()
- Integrating Some Databases:
u.integrate_custom_databases({databases})
- Integrating All Databases:
u.integrate_databases()
- Set Language Codes to Glottocodes:
u.set_glottocodes()
- Reset all changes:
u.reset()
- Replace {database} with saphon, bdproto, grambank, apics, or ewave.
- Replace {databases} with arguments "UPDATED_SAPHON", "BDPROTO", "GRAMBANK", "APICS", and/or "EWAVE" (e.g., "UPDATED_SAPHON", "BDPROTO", "EWAVE").
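A small wrapper can catch misspelled database names before any integration work starts. This is a sketch under two assumptions: the names are passed to integrate_custom_databases as separate positional strings, as the example above suggests, and the helper integrate_selected is hypothetical rather than part of urielplus.

```python
# Database identifiers listed in this section.
KNOWN_DATABASES = ("UPDATED_SAPHON", "BDPROTO", "GRAMBANK", "APICS", "EWAVE")


def integrate_selected(u, *databases):
    """Validate the names, then integrate them into `u` in a single call.

    `u` is assumed to be an existing URIELPlus instance.
    """
    unknown = [name for name in databases if name not in KNOWN_DATABASES]
    if unknown:
        raise ValueError(f"unknown database name(s): {unknown}")
    # Assumption: names are forwarded as separate positional arguments.
    u.integrate_custom_databases(*databases)
    return u
```

For example, integrate_selected(u, "UPDATED_SAPHON", "BDPROTO", "EWAVE") mirrors the argument list shown above.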
Imputation Examples
- Aggregate Typological Data:
u.set_aggregation({aggregation})
u.aggregate()
- Impute Missing Values:
u.{imputation_strategy}_imputation()
- Replace {aggregation} with 'U' (union) or 'A' (average).
- Replace {imputation_strategy} with midaspy, knn, softimpute, or mean.
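For intuition about the simplest strategy in the list, the sketch below shows generic column-wise mean imputation on a toy language-by-feature matrix. It is not URIEL+'s implementation: the function mean_impute, the use of None to mark missing values, and the fallback of 0.0 for a fully unobserved feature are all assumptions made for this illustration.

```python
def mean_impute(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: fall back to 0.0 when no language has a value for a feature.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


# Rows are languages, columns are binary typological features.
features = [
    [1.0, None, 0.0],
    [0.0, 1.0, None],
    [1.0, 0.0, 0.0],
]
imputed = mean_impute(features)
print(imputed[0][1])  # 0.5, the mean of the observed values 1.0 and 0.0
```

The other strategies (midaspy, knn, softimpute) use information from related rows or a learned model instead of a single column average.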
Language Distance Calculation Examples
- Calculate a Specific Distance:
print(u.new_distance({distance_type}, {languages}))
- Calculate Distance Using Specific Features:
print(u.new_custom_distance({features}, {languages}, {source}))
- Retrieve Language Vectors:
u.get_vector({distance_type}, {languages})
- View URIEL+ Feature Coverage:
u.feature_coverage()
- Calculate Confidence Scores for Distances:
print(u.confidence_score({language 1}, {language 2}, {distance_type}))
- Replace {distance_type} with a distance type (e.g., "featural") or a list (e.g., ["syntactic", "phonological"]). It must be a single distance type when retrieving language vectors.
- Replace {features} with a list of features (e.g., ["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]).
- Replace {languages}, {language 1}, and {language 2} with language codes (e.g., "stan1293", "hind1269").
- Replace {source} with one database (e.g., "WALS") or all databases ('A').
- Note: the default {source} is all databases.
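To make the two distance_metric options more concrete, the following sketch computes textbook cosine and angular distances between two feature vectors using only the standard library. The function names are hypothetical, and dividing the angle by pi is one common normalization; URIEL+ may scale or preprocess vectors differently, so treat this as intuition rather than a reproduction of its results.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def angular_distance(a, b):
    """Angle between the vectors divided by pi, so the result lies in [0, 1]."""
    return math.acos(cosine_similarity(a, b)) / math.pi
```

Under these definitions, two vectors with no shared active features, such as [1, 0] and [0, 1], have a cosine distance of 1.0 and an angular distance of 0.5.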
Citation
If you use this code for your research, please cite the following work:
@article{khan2024urielplus,
title={URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base},
author={Khan, Aditya and Shipton, Mason and Anugraha, David and Duan, Kaiyao and Hoang, Phuong H. and Khiu, Eric and Doğruöz, A. Seza and Lee, En-Shiun Annie},
journal={arXiv preprint arXiv:2409.18472},
year={2024}
}
If you have any questions, you can open a GitHub Issue or send us an email.