Tetun Language Identification Model
Tetun LID
The Tetun Language Identification (Tetun LID) model is a machine learning model that automatically identifies the language of a given text. It was specifically designed to recognize four languages commonly spoken in Timor-Leste: Tetun, Portuguese, Indonesian, and English.
Tetun LID combines machine learning algorithms with linguistic features and was trained on a large corpus of text to recognize the characteristics and linguistic patterns of each language. Because it can distinguish between these languages, it is a useful tool for anyone working with multilingual text from Timor-Leste in natural language processing (NLP) and information retrieval (IR) tasks such as language-specific search engines, sentiment analysis, and machine translation.
Installation
To install Tetun LID, run the following command in your console:
pip install tetun-lid
Dependencies
The Tetun LID package depends on the following packages:
- joblib
- scikit-learn
- Unidecode
To install the dependencies, use the following commands:
pip install joblib
pip install scikit-learn
pip install Unidecode
Usage
To use Tetun LID, import lid from the tetunlid package as follows:
- To predict the language of a single sentence:
from tetunlid import lid
input_text = "Sé mak hamriik iha ne'ebá?"
output = lid.predict_language(input_text)
print(output)
The output will be:
Tetun
- To see the details of why the text was predicted as Tetun, use the predict_detail() function:
from tetunlid import lid
input_list_of_str = ["Sé mak hamriik iha ne'ebá?"]
output_detail = lid.predict_detail(input_list_of_str)
print('\n'.join(output_detail))
The output will be:
Input text: "Sé mak hamriik iha ne'ebá?"
Probability:
English: 0.0007
Indonesian: 0.0007
Portuguese: 0.0006
Tetun: 0.9980
Thus, the input text is "Tetun" with a confidence level of 99.80%.
Note: both the input parameter and the output of predict_detail() are a List[str], that is, a list of strings. To view the result in the console, use a for loop or join(), as in the example above.
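Because predict_detail() returns text rather than numbers, it can be handy to turn a report back into a dictionary of probabilities. The following is a minimal sketch, assuming the report has exactly the layout shown in the output above; parse_probabilities and PROB_LINE are hypothetical helper names, not part of the tetunlid package, and the sample text is copied from the output above so that the sketch runs without the model installed.

```python
import re

# Matches lines such as "Tetun: 0.9980" in the detail report shown above.
PROB_LINE = re.compile(r"^\s*([A-Za-z]+):\s*([0-9]*\.?[0-9]+)\s*$")


def parse_probabilities(detail_text):
    """Return {language: probability} parsed from one detail report."""
    probs = {}
    for line in detail_text.splitlines():
        match = PROB_LINE.match(line)
        if match:
            probs[match.group(1)] = float(match.group(2))
    return probs


# Sample report copied from the output above. In practice this text would be
# '\n'.join(lid.predict_detail([...])) for a single input text (assumption:
# the report layout stays the same across versions).
sample = """Input text: "Sé mak hamriik iha ne'ebá?"
Probability:
English: 0.0007
Indonesian: 0.0007
Portuguese: 0.0006
Tetun: 0.9980
Thus, the input text is "Tetun" with a confidence level of 99.80%."""

probs = parse_probabilities(sample)
best = max(probs, key=probs.get)  # language with the highest probability
print(best, probs[best])
```

With the probabilities in a dictionary, you can apply your own confidence threshold before accepting a prediction.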
- You can pass texts in multiple languages as input. Observe the following example:
from tetunlid import lid
multiple_langs = ["Ha'u ema baibain", "I am quite busy",
                  "Kamu malas sekali", "Vou sair daqui"]
output = [(ml, lid.predict_language(ml)) for ml in multiple_langs]
print(output)
The output will be:
[("Ha'u ema baibain", 'Tetun'), ('I am quite busy', 'English'), ('Kamu malas sekali', 'Indonesian'), ('Vou sair daqui', 'Portuguese')]
You can use a for loop, or any similar approach, to print the output line by line in the console:
from tetunlid import lid
input_texts = ["Ha'u ema baibain", "I am quite busy",
               "Kamu malas sekali", "Vou sair daqui"]
# Predict and print the language of each text on its own line.
for input_text in input_texts:
    lang = lid.predict_language(input_text)
    print(f"{input_text} ({lang})")
The output will be:
Ha'u ema baibain (Tetun)
I am quite busy (English)
Kamu malas sekali (Indonesian)
Vou sair daqui (Portuguese)
To see the details for each input, use the predict_detail() function as illustrated earlier:
from tetunlid import lid
input_texts = ["Ha'u ema baibain", "I am quite busy",
               "Kamu malas sekali", "Vou sair daqui"]
output_multiple_detail = lid.predict_detail(input_texts)
print('\n'.join(output_multiple_detail))
The output will be:
Input text: "Ha'u ema baibain"
Probability:
English: 0.0027
Indonesian: 0.0028
Portuguese: 0.0024
Tetun: 0.9920
Thus, the input text is "Tetun" with a confidence level of 99.20%.
Input text: "I am quite busy"
Probability:
English: 0.9974
Indonesian: 0.0007
Portuguese: 0.0015
Tetun: 0.0004
Thus, the input text is "English" with a confidence level of 99.74%.
Input text: "Kamu malas sekali"
Probability:
English: 0.0001
Indonesian: 0.9997
Portuguese: 0.0001
Tetun: 0.0001
Thus, the input text is "Indonesian" with a confidence level of 99.97%.
Input text: "Vou sair daqui"
Probability:
English: 0.0034
Indonesian: 0.0030
Portuguese: 0.9912
Tetun: 0.0023
Thus, the input text is "Portuguese" with a confidence level of 99.12%.
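When many texts are processed at once, it can help to group them by predicted language, for example to route each group to a language-specific pipeline. Below is a minimal sketch of that idea. group_by_language is a hypothetical helper, not part of the package. It takes the prediction function as a parameter, and replay_predict is a stand-in that simply replays the labels shown in the examples above so that the sketch runs without the model installed.

```python
from collections import defaultdict


def group_by_language(texts, predict):
    """Group texts under the label returned by predict(text)."""
    groups = defaultdict(list)
    for text in texts:
        groups[predict(text)].append(text)
    return dict(groups)


# Labels copied from the example outputs above (not new predictions).
KNOWN_LABELS = {
    "Sé mak hamriik iha ne'ebá?": "Tetun",
    "Ha'u ema baibain": "Tetun",
    "I am quite busy": "English",
    "Kamu malas sekali": "Indonesian",
    "Vou sair daqui": "Portuguese",
}


def replay_predict(text):
    # Hypothetical stand-in for lid.predict_language.
    return KNOWN_LABELS[text]


groups = group_by_language(list(KNOWN_LABELS), replay_predict)
for lang, items in groups.items():
    print(f"{lang}: {items}")
```

In real use, you would pass lid.predict_language as the second argument instead of the stand-in.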
- You can also use Tetun LID to predict the language of each line in a file containing various languages. Here is an example:
from pathlib import Path
from tetunlid import lid
file_path = Path("myfile/example.txt")
try:
    # Read the file line by line, stripping surrounding whitespace.
    with file_path.open('r', encoding='utf-8') as f:
        contents = [line.strip() for line in f]
except FileNotFoundError:
    print(f"File not found at: {file_path}")
else:
    # Predict only when the file was read successfully.
    output = [(content, lid.predict_language(content)) for content in contents]
    print(output)
There are other ways to read the file contents that achieve the same output.
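As one such alternative, the sketch below reads the whole file with Path.read_text() and skips blank lines, so that empty strings are never passed to the model. read_nonempty_lines is a hypothetical helper name, and the temporary file exists only to make the example runnable on its own.

```python
import tempfile
from pathlib import Path


def read_nonempty_lines(path):
    """Return stripped, non-empty lines of a UTF-8 file, or [] if it is missing."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return []
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demonstration with a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "example.txt"
    demo_file.write_text("Ha'u ema baibain\n\nVou sair daqui\n", encoding="utf-8")
    lines = read_nonempty_lines(demo_file)
    missing = read_nonempty_lines(Path(tmp) / "no_such_file.txt")

print(lines)
print(missing)
```

Each returned line can then be passed to lid.predict_language() in the same way as in the earlier examples.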
Additional notes
- Please follow the instructions as written and try to understand how they work. All the dependencies need to be installed.
- If you encounter AttributeError: 'list' object has no attribute 'predict_proba', something may have gone wrong while installing the package. Please send me an email, and I will guide you on how to handle the error.
- Please make sure that you use the latest version of Tetun LID. To get the latest version, run this command in your console:
pip install --upgrade tetun-lid
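To check which versions of Tetun LID and its dependencies are currently installed, you can query the package metadata from Python. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8 or later). installed_version is a hypothetical helper name. It only reports what is installed locally and does not check whether a newer release exists.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip commands in this document.
for name in ("tetun-lid", "joblib", "scikit-learn", "Unidecode"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```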