Tetun Language Identification Model

Tetun LID

The Tetun Language Identification Model (Tetun LID) is an advanced machine learning model that automatically identifies the language of a given text. It has been specifically designed to recognize four languages commonly spoken in Timor-Leste: Tetun, Portuguese, Indonesian, and English.

Tetun LID uses state-of-the-art algorithms and linguistic features to learn the characteristic patterns of each language, which lets it identify the language of a given text with high accuracy and reliability. This makes it a valuable tool for many natural language processing applications, such as language-specific search engines, sentiment analysis, and machine translation.

Overall, the Tetun Language Identification Model represents a significant advancement in language identification technology, and its ability to accurately identify multiple languages makes it an indispensable tool for anyone working with multilingual text data in Timor-Leste.

Installation

pip install tetun-lid

Dependencies

The Tetun LID package depends on the following packages:

  • joblib
  • scikit-learn
  • Unidecode

To install the dependencies, run the following commands:

pip install joblib
pip install scikit-learn
pip install Unidecode
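
To confirm that the dependencies are importable before using the package, a minimal check along these lines may help. It assumes the usual import names for these distributions (`joblib`, `sklearn`, `unidecode`); the helper `missing_modules` is hypothetical and is not part of Tetun LID.

```python
from importlib.util import find_spec

# Import names differ from the pip names in two cases:
# scikit-learn is imported as "sklearn" and Unidecode as "unidecode".
REQUIRED_MODULES = ("joblib", "sklearn", "unidecode")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```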

Usage

To use Tetun LID, import lid from the tetunlid package as follows:

  1. To predict the language of a single input text:
from tetunlid import lid

input_text = "Sé mak hamriik iha ne'ebá?"
output = lid.predict_language(input_text)

print(output)

The output will be:

Tetun
  2. To see why the text was predicted as Tetun, use the predict_detail() function:
from tetunlid import lid

input_list_of_str = ["Sé mak hamriik iha ne'ebá?"]
output_detail = lid.predict_detail(input_list_of_str)
print('\n'.join(output_detail))

The output will be:

Input text: "Sé mak hamriik iha ne'ebá?"
Probability:
        English: 0.0007
        Indonesian: 0.0007
        Portuguese: 0.0006
        Tetun: 0.9980
Therefore, the probability of being "Tetun" is 99.80%.

Note: predict_detail() takes a List[str] (a list of strings) and also returns a List[str], so use a for loop or join(), as in the example above, to print the result.
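
Because predict_detail() returns text rather than numbers, it can be handy to turn the report back into a dictionary of probabilities. The sketch below assumes the report has exactly the layout shown in the example output above; `parse_detail` is a hypothetical helper, not a function of the tetunlid package, and it works on the joined text so it does not depend on how the list is split.

```python
import re

# Matches lines such as "        Tetun: 0.9980".
_PROB_LINE = re.compile(r"^\s*([A-Za-z]+):\s*([0-9]*\.?[0-9]+)\s*$")
# Matches lines such as 'Input text: "Vou sair daqui"'.
_INPUT_LINE = re.compile(r'^Input text:\s*"(.*)"\s*$')


def parse_detail(report):
    """Parse joined predict_detail() text into a list of result dicts."""
    results = []
    for line in report.splitlines():
        m_input = _INPUT_LINE.match(line)
        if m_input:
            # A new "Input text" line starts a new result.
            results.append({"text": m_input.group(1), "probabilities": {}})
            continue
        m_prob = _PROB_LINE.match(line)
        if m_prob and results:
            results[-1]["probabilities"][m_prob.group(1)] = float(m_prob.group(2))
    return results


# Sample copied from the example output above.
SAMPLE = """Input text: "Sé mak hamriik iha ne'ebá?"
Probability:
        English: 0.0007
        Indonesian: 0.0007
        Portuguese: 0.0006
        Tetun: 0.9980
Therefore, the probability of being "Tetun" is 99.80%.
"""

for result in parse_detail(SAMPLE):
    print(result["text"], result["probabilities"])
```

With the package installed, the same helper could presumably be applied to '\n'.join(lid.predict_detail(texts)).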

  3. You can pass texts in several languages. Observe the following example:
from tetunlid import lid

multiple_langs = ["Ha'u ema baibain", "I am quite busy",
                  "Kamu malas sekali", "Vou sair daqui"]

output = [(ml, lid.predict_language(ml)) for ml in multiple_langs]
print(output)

The output will be:

[("Ha'u ema baibain", 'Tetun'), ('I am quite busy', 'English'), ('Kamu malas sekali', 'Indonesian'), ('Vou sair daqui', 'Portuguese')]

Note: The list comprehension above keeps the code compact. Alternatively, you can use a for loop (or something similar) to print one result per line:

from tetunlid import lid

input_texts = ["Ha'u ema baibain", "I am quite busy",
               "Kamu malas sekali", "Vou sair daqui"]

for input_text in input_texts:
    lang = lid.predict_language(input_text)
    print(f"{input_text} ({lang})")

The output will be:

Ha'u ema baibain (Tetun)
I am quite busy (English)
Kamu malas sekali (Indonesian)
Vou sair daqui (Portuguese)
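
When the goal is to split a mixed collection by language, the same loop can collect texts into per-language buckets. This is a minimal sketch: `group_by_language` is a hypothetical helper, and it takes the predictor as an argument so it works with any function that maps a string to a label, such as lid.predict_language.

```python
from collections import defaultdict


def group_by_language(texts, predict):
    """Group texts under the label that `predict` returns for each one."""
    groups = defaultdict(list)
    for text in texts:
        groups[predict(text)].append(text)
    return dict(groups)


# With the package installed, the call would look like this:
#     from tetunlid import lid
#     groups = group_by_language(input_texts, lid.predict_language)
```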

To see the details for each input, pass the whole list to predict_detail():

from tetunlid import lid

input_texts = ["Ha'u ema baibain", "I am quite busy",
               "Kamu malas sekali", "Vou sair daqui"]

output_multiple_detail = lid.predict_detail(input_texts)
print('\n'.join(output_multiple_detail))

The output will be:

Input text: "Ha'u ema baibain"
Probability:
        English: 0.0027
        Indonesian: 0.0028
        Portuguese: 0.0024
        Tetun: 0.9920
Therefore, the probability of being "Tetun" is 99.20%.


Input text: "I am quite busy"
Probability:
        English: 0.9974
        Indonesian: 0.0007
        Portuguese: 0.0015
        Tetun: 0.0004
Therefore, the probability of being "English" is 99.74%.


Input text: "Kamu malas sekali"
Probability:
        English: 0.0001
        Indonesian: 0.9997
        Portuguese: 0.0001
        Tetun: 0.0001
Therefore, the probability of being "Indonesian" is 99.97%.


Input text: "Vou sair daqui"
Probability:
        English: 0.0034
        Indonesian: 0.0030
        Portuguese: 0.9912
        Tetun: 0.0023
Therefore, the probability of being "Portuguese" is 99.12%.
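
The probabilities above can also drive a simple confidence check, for example to flag inputs the model is unsure about. The sketch below assumes the caller has already collected the four probabilities into a dictionary; `top_language` and the 0.9 threshold are illustrative choices, not part of Tetun LID.

```python
def top_language(probabilities, threshold=0.9):
    """Return (label, probability), or ("unknown", probability) below threshold."""
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    if prob < threshold:
        return "unknown", prob
    return label, prob


# Values copied from the "Vou sair daqui" example above.
probs = {"English": 0.0034, "Indonesian": 0.0030,
         "Portuguese": 0.9912, "Tetun": 0.0023}
print(top_language(probs))  # ('Portuguese', 0.9912)
```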
  4. You can also use Tetun LID to predict the language of each line in a file containing various languages. Here is an example:
from pathlib import Path
from tetunlid import lid


file_path = Path("myfile/example.txt")

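try:
    with file_path.open('r', encoding='utf-8') as f:
        # Read one text per line, without surrounding whitespace.
        contents = [line.strip() for line in f]
except FileNotFoundError:
    print(f"File not found at: {file_path}")
    contents = []  # Fall back to an empty list so the code below still runs.

# Pair each line with its predicted language.
output = [(content, lid.predict_language(content)) for content in contents]
print(output)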

There are a few more ways to read file contents that you can use to achieve the same output.
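
One such alternative, sketched here, reads the whole file with Path.read_text() and skips blank lines so that empty strings are never sent to the model. The helper name `read_nonempty_lines` is hypothetical.

```python
from pathlib import Path


def read_nonempty_lines(file_path):
    """Return stripped, non-empty lines from a UTF-8 text file."""
    text = Path(file_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# With the package installed, the lines could then be classified like this:
#     from tetunlid import lid
#     contents = read_nonempty_lines("myfile/example.txt")
#     output = [(c, lid.predict_language(c)) for c in contents]
```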

