Geographically-informed language identification

geoLid

This Python package carries out language identification with geographic priors to increase performance for low-resource and under-represented languages.

A description and evaluation of this approach can be found here: https://jdunn.name/2024/03/13/geographically-informed-language-identification/

A complete list of language codes and names per regional model can be found in the language_names directory.

Downloading models

geoLid contains a baseline non-geographic model as well as models for 16 specific regions, as shown below:

baseline (916 languages)
africa_north (44 languages)
africa_southern (58 languages)
africa_sub (166 languages)
america_brazil (88 languages)
america_central (188 languages)
america_north (68 languages)
america_south (129 languages)
asia_central (54 languages)
asia_east (46 languages)
asia_south (60 languages)
asia_southeast (325 languages)
europe_east (65 languages)
europe_russia (65 languages)
europe_west (108 languages)
middle_east (53 languages)
oceania (49 languages)
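
For scripts that pick a model programmatically, the list above can be kept as a small lookup table. The sketch below only restates the model names and language counts given here; the names REGION_MODELS and check_region are illustrative and are not part of geoLid.

```python
# Model names and language counts copied from the list above.
# REGION_MODELS and check_region are illustrative, not part of geoLid.
REGION_MODELS = {
    "baseline": 916,
    "africa_north": 44,
    "africa_southern": 58,
    "africa_sub": 166,
    "america_brazil": 88,
    "america_central": 188,
    "america_north": 68,
    "america_south": 129,
    "asia_central": 54,
    "asia_east": 46,
    "asia_south": 60,
    "asia_southeast": 325,
    "europe_east": 65,
    "europe_russia": 65,
    "europe_west": 108,
    "middle_east": 53,
    "oceania": 49,
}


def check_region(name):
    """Return name unchanged if it is a listed model, else raise ValueError."""
    if name not in REGION_MODELS:
        valid = ", ".join(sorted(REGION_MODELS))
        raise ValueError(f"Unknown model {name!r}; expected one of: {valid}")
    return name
```

Validating a name this way before downloading or predicting catches a misspelled region early, without depending on how the package itself reports an unknown model.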

To download models, use this command:

from geoLid import download_model
download_model("baseline")

The model name "all" will download all region-specific models.

Usage

Language identification can be used as shown below:

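from geoLid import geoLid
# data is a list of one or more text strings (example text only)
data = ["This is an example sentence."]
lid = geoLid(model_location = "models")
labels = lid.predict(data = data, region = "baseline")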

The model_location argument, set during initialization, points to the directory containing the LID models.

The data argument is a list of one or more strings, each a text to be identified.

The region argument selects which region-specific model to use. The default is the non-geographic baseline model.
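
A common follow-up step is to pair each input text with its predicted label. The sketch below assumes that predict returns one label per input string, in input order; that is an assumption about the return value, not something stated in this description. The helper label_texts is hypothetical, and it accepts any object with a compatible predict method, so it can be tried without downloaded models.

```python
def label_texts(lid, texts, region="baseline"):
    """Pair each text with the label predicted for it.

    Assumes lid.predict(data=..., region=...) returns one label per
    input string, in order (an assumption, not documented behaviour).
    """
    if isinstance(texts, str):
        # predict expects a list, so wrap a single string
        texts = [texts]
    if not texts:
        raise ValueError("data must contain at least one string")
    labels = list(lid.predict(data=texts, region=region))
    if len(labels) != len(texts):
        raise ValueError("expected one label per input text")
    return list(zip(texts, labels))
```

With the package installed and a model downloaded, lid would be the object returned by geoLid(model_location = "models"), and region could be any of the model names listed under Downloading models.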

Project details

Version 1.0 is available as a source distribution, geoLid-1.0.tar.gz (16.2 kB), and as a built distribution for Python 3, geoLid-1.0-py3-none-any.whl (16.3 kB).
