
Returns language vectors


lang2vec
Author: Patrick Littell
Last updated: July 15, 2016

Usage: ./lang2vec (-m) (-f) (-r)

The languages argument is a space-separated string of ISO 639-3 codes (e.g., "deu eng fra"). Any two-letter ISO 639-1 codes will be mapped to their corresponding ISO 639-3 codes.
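As a rough illustration of the code mapping described above, here is a minimal sketch in Python. The helper name normalize_codes and the three-entry table are hypothetical and are not part of lang2vec; the real tool presumably uses a complete ISO 639-1 to ISO 639-3 table.

```python
# Hypothetical helper, not part of lang2vec. The table is a tiny hand-written
# excerpt of the ISO 639-1 -> ISO 639-3 mapping, for illustration only.
ISO_639_1_TO_3 = {"de": "deu", "en": "eng", "fr": "fra"}


def normalize_codes(langs):
    """Split a space-separated code string and map two-letter codes to three-letter ones."""
    codes = []
    for code in langs.split():
        if len(code) == 2:
            # Raises KeyError for two-letter codes missing from the toy table.
            code = ISO_639_1_TO_3[code]
        codes.append(code)
    return codes


print(normalize_codes("de eng fr"))  # ['deu', 'eng', 'fra']
```

Three-letter codes pass through unchanged, so mixed input such as "de eng fr" yields a uniform list of ISO 639-3 codes.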

The feature-set argument is a named feature set (e.g., syntax_wals or phonology_knn), an elementwise union A|B of two feature sets, or a concatenation A+B of two feature sets. So "id+syntax_wals|syntax_sswl" gives the id vector concatenated with the elementwise union of the WALS and SSWL syntax feature sets; that is, | binds more tightly than +.
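The way such an expression decomposes can be sketched as follows. This is not lang2vec's own parser: the function names and toy vectors are invented for illustration, the precedence (| before +) is inferred from the "id+syntax_wals|syntax_sswl" example, and the union rule (keep the first non-missing value) is an assumption about what "elementwise union" means.

```python
def parse_feature_expr(expr):
    """Split on '+' first, then on '|', so that '|' binds more tightly than '+'."""
    return [part.split("|") for part in expr.split("+")]


def union(a, b):
    # Assumption: keep a's value unless it is missing (None), then fall back to b's.
    return [x if x is not None else y for x, y in zip(a, b)]


def evaluate(expr, sets):
    """Union the sets within each '|' group, then concatenate the groups."""
    vector = []
    for group in parse_feature_expr(expr):
        merged = sets[group[0]]
        for name in group[1:]:
            merged = union(merged, sets[name])
        vector.extend(merged)
    return vector


# Toy vectors for a single language; the values are made up.
toy_sets = {
    "id": [1, 0],
    "syntax_wals": [1, None, 0],
    "syntax_sswl": [None, 0, 1],
}

print(parse_feature_expr("id+syntax_wals|syntax_sswl"))
# [['id'], ['syntax_wals', 'syntax_sswl']]
print(evaluate("id+syntax_wals|syntax_sswl", toy_sets))
# [1, 0, 1, 0, 0]
```

Under these assumptions, a union keeps the dimensionality of its operands, while a concatenation adds the dimensionalities together.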

The named sets are:

Sets from feature and inventory databases:
    "syntax_wals",
    "phonology_wals",
    "syntax_sswl",
    "syntax_ethnologue",
    "phonology_ethnologue",
    "inventory_ethnologue",
    "inventory_phoible_aa",
    "inventory_phoible_gm",
    "inventory_phoible_saphon",
    "inventory_phoible_spa",
    "inventory_phoible_ph",
    "inventory_phoible_ra",
    "inventory_phoible_upsid",

Averages of sets:
    "syntax_average",
    "phonology_average",
    "inventory_average",

KNN predictions of feature values:
    "syntax_knn",
    "phonology_knn",
    "inventory_knn",

Membership in language families and subfamilies:
    "fam",

Distance from fixed points on Earth's surface:
    "geo",

One-hot identity vector:
    "id",

OPTIONS:

-m, --minimal: Suppresses columns that contain only zeros, only ones, or only nulls.
-f, --fields: Display field names as the first row.
-r, --random: Randomize the values (as, for example, a control).

The "minimal" transformation applies after any union or concatenation. (If it did not, sets in the same group, like the syntax_* sets, would not have the same dimensionality for comparison.) The "random" transformation applies after the "minimal" transformation. (So if you're doing an experiment with a minimized set and using a randomized set as a control, the randomized set will have the same dimensionality as the minimized original.)
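The ordering of the two transformations can be sketched in Python. Both functions are hypothetical stand-ins, not lang2vec's code. In particular, the randomization scheme here (each non-null value replaced by a random 0 or 1) is an assumption, since the text does not say how values are randomized.

```python
import random


def minimal(rows):
    """Drop columns whose values are all 0, all 1, or all None across the rows."""
    keep = [
        j
        for j in range(len(rows[0]))
        if {row[j] for row in rows} not in ({0}, {1}, {None})
    ]
    return [[row[j] for j in keep] for row in rows]


def randomize(rows, seed=0):
    # Assumption: replace every non-null value with a random 0 or 1.
    rng = random.Random(seed)
    return [[None if v is None else rng.randint(0, 1) for v in row] for row in rows]


# Two languages, five made-up feature columns.
rows = [
    [0, 1, 1, None, 0],
    [0, 1, 0, None, 1],
]

reduced = minimal(rows)       # minimal first ...
control = randomize(reduced)  # ... then random, so the shapes match

print(reduced)  # [[1, 0], [0, 1]]
print(len(control[0]) == len(reduced[0]))  # True
```

Because the randomization runs on the already-minimized matrix, the control has exactly as many columns as the minimized set it is compared against.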

REFERENCES:

The different sets above are derived from many sources:

*_wals -- Features derived from the World Atlas of Language Structures.
*_sswl -- Features derived from Syntactic Structures of the World's Languages.
*_ethnologue -- Features derived from (shallowly) parsing the prose typological descriptions in Ethnologue (Lewis et al. 2015).
*_phoible_aa -- AA = Alphabets of Africa. Features derived from PHOIBLE's normalization of Systèmes alphabétiques des langues africaines (Hartell 1993, Chanard 2006).
*_phoible_gm -- GM = Green and Moran. Features derived from PHOIBLE's normalization of Christopher Green and Steven Moran's pan-African inventory database.
*_phoible_ph -- PH = PHOIBLE. Features derived from PHOIBLE proper, by Moran, McCloy, and Wright (2012).
*_phoible_ra -- RA = Ramaswami. Features derived from PHOIBLE's normalization of Common Linguistic Features in Indian Languages: Phonetics (Ramaswami 1999).
*_phoible_saphon -- SAPHON = South American Phonological Inventory Database. Features derived from PHOIBLE's normalization of SAPHON (Lev et al. 2012).
*_phoible_spa -- SPA = Stanford Phonology Archive. Features derived from PHOIBLE's normalization of SPA (Crothers et al. 1979).
*_phoible_upsid -- UPSID = UCLA Phonological Segment Inventory Database. Features derived from PHOIBLE's normalization of UPSID (Maddieson 1984, Maddieson and Precoda 1990).
