Returns language vectors
lang2vec
Author: Patrick Littell
Last updated: July 15, 2016
Usage: ./lang2vec (-m) (-f) (-r)
The language argument is a space-separated string of ISO 639-3 codes (e.g., "deu eng fra"). Any two-letter ISO 639-1 codes will be mapped to their corresponding ISO 639-3 codes.
The feature-set argument is a named feature set (e.g., syntax_wals or phonology_knn), or an elementwise union A|B of two feature sets, or a concatenation A+B of two feature sets. The union binds more tightly than the concatenation, so "id+syntax_wals|syntax_sswl" gives the id vector concatenated with the elementwise union of the WALS and SSWL syntax feature sets.
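As a rough illustration of how such an expression decomposes, the following Python sketch splits on "+" first and then on "|", which reproduces the grouping in the example above. It is not the code lang2vec itself uses. The function names (parse_feature_expr, union, evaluate), the toy vectors, and in particular the union rule (keep whichever value is present, take the larger when both are) are assumptions made for illustration.

```python
from typing import Dict, List, Optional

Value = Optional[float]  # None stands in for a missing ("null") value


def parse_feature_expr(expr: str) -> List[List[str]]:
    """Split 'A+B|C' into [['A'], ['B', 'C']].

    '+' separates concatenated blocks; '|' joins the sets that are
    unioned inside a single block.
    """
    return [block.split("|") for block in expr.split("+")]


def union(a: List[Value], b: List[Value]) -> List[Value]:
    """Elementwise union under an ASSUMED rule: keep whichever value is
    present; if both are present, take the larger one."""
    if len(a) != len(b):
        raise ValueError("union requires vectors of equal length")
    out: List[Value] = []
    for x, y in zip(a, b):
        if x is None:
            out.append(y)
        elif y is None:
            out.append(x)
        else:
            out.append(max(x, y))
    return out


def evaluate(expr: str, vectors: Dict[str, List[Value]]) -> List[Value]:
    """Union the sets inside each block, then concatenate the blocks."""
    result: List[Value] = []
    for block in parse_feature_expr(expr):
        merged = vectors[block[0]]
        for name in block[1:]:
            merged = union(merged, vectors[name])
        result.extend(merged)
    return result


# Toy vectors for one language (hypothetical values, not real data).
toy = {
    "id": [1, 0],
    "syntax_wals": [1, None, 0],
    "syntax_sswl": [None, None, 1],
}
print(parse_feature_expr("id+syntax_wals|syntax_sswl"))
print(evaluate("id+syntax_wals|syntax_sswl", toy))
```

Splitting on "+" before "|" is what makes the union apply only to the two syntax sets, leaving the id vector as a separate block.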
The named sets are:
Sets from feature and inventory databases:
"syntax_wals",
"phonology_wals",
"syntax_sswl",
"syntax_ethnologue",
"phonology_ethnologue",
"inventory_ethnologue",
"inventory_phoible_aa",
"inventory_phoible_gm",
"inventory_phoible_saphon",
"inventory_phoible_spa",
"inventory_phoible_ph",
"inventory_phoible_ra",
"inventory_phoible_upsid",
Averages of sets:
"syntax_average",
"phonology_average",
"inventory_average",
KNN predictions of feature values:
"syntax_knn",
"phonology_knn",
"inventory_knn",
Membership in language families and subfamilies:
"fam",
Distance from fixed points on Earth's surface:
"geo",
One-hot identity vector:
"id",
OPTIONS:
-m, --minimal: Suppresses columns that contain only zeros, only ones, or only nulls.
-f, --fields: Display field names as the first row.
-r, --random: Randomize the values (as, for example, a control).
The "minimal" transformation applies after any union or concatenation. (If it did not, sets in the same group, like the syntax_* sets, would not be the same dimensionality for comparison.) The "random" transformation applies after the "minimal" transformation. (So if you're doing an experiment with a minimized set and using a randomized set as a control, the randomized set will be the same dimensionality as the original.)
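The ordering described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not lang2vec's own implementation: the function names (minimal, randomize), the use of None for nulls, and the choice to randomize by drawing each value as 0 or 1 are all hypothetical.

```python
import random
from typing import List, Optional

Row = List[Optional[int]]  # None stands in for a null value (assumption)


def minimal(rows: List[Row]) -> List[Row]:
    """Drop every column whose values are all 0, all 1, or all None."""
    if not rows:
        return []
    keep = []
    for col in range(len(rows[0])):
        values = {row[col] for row in rows}
        if values not in ({0}, {1}, {None}):
            keep.append(col)
    return [[row[c] for c in keep] for row in rows]


def randomize(rows: List[Row], seed: Optional[int] = None) -> List[List[int]]:
    """Replace each value with a random 0 or 1 (ASSUMED scheme),
    keeping the number of rows and columns unchanged."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in row] for row in rows]


# One row per language, one column per feature (hypothetical values).
rows = [
    [1, 0, 1, None],
    [1, 1, 0, None],
    [1, 0, 0, None],
]
reduced = minimal(rows)                # columns 0 and 3 are dropped
control = randomize(reduced, seed=0)   # same shape as `reduced`
print(reduced)
print(len(control), len(control[0]))
```

Because randomization runs on the already-minimized rows, the control has the same dimensionality as the minimized set it is compared against.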
REFERENCES:
The different sets above are derived from many sources:
*_wals -- Features derived from the World Atlas of Language Structures.
*_sswl -- Features derived from Syntactic Structures of the World's Languages.
*_ethnologue -- Features derived from (shallowly) parsing the prose typological descriptions in Ethnologue (Lewis et al. 2015).
*_phoible_aa -- AA = Alphabets of Africa. Features derived from PHOIBLE's normalization of Systèmes alphabétiques des langues africaines (Hartell 1993, Chanard 2006).
*_phoible_gm -- GM = Green and Moran. Features derived from PHOIBLE's normalization of Christopher Green and Steven Moran's pan-African inventory database.
*_phoible_ph -- PH = PHOIBLE. Features derived from PHOIBLE proper, by Moran, McCloy, and Wright (2012).
*_phoible_ra -- RA = Ramaswami. Features derived from PHOIBLE's normalization of Common Linguistic Features in Indian Languages: Phonetics (Ramaswami 1999).
*_phoible_saphon -- SAPHON = South American Phonological Inventory Database. Features derived from PHOIBLE's normalization of SAPHON (Lev et al. 2012).
*_phoible_spa -- SPA = Stanford Phonology Archive. Features derived from PHOIBLE's normalization of SPA (Crothers et al. 1979).
*_phoible_upsid -- UPSID = UCLA Phonological Segment Inventory Database. Features derived from PHOIBLE's normalization of UPSID (Maddieson 1984, Maddieson and Precoda 1990).