Tetun Language Identification Model
Tetun LID
The Tetun Language Identification (Tetun LID) model is a machine learning model that automatically identifies the language of a given text. It is designed to recognize four languages commonly spoken in Timor-Leste: Tetun, Portuguese, English, and Indonesian.
Installation
With pip:
pip install tetun-lid joblib scikit-learn
Usage
The following examples show how to use it:
- To predict the language of an input text, use the predict_language() function.
from tetunlid import lid
input_text = "Sé mak toba iha ne'ebá?"
output = lid.predict_language(input_text)
print(output)
This will be the output. (Note: The initial call may take a few minutes to complete as it will load the LID model.)
Tetun
- To print the predicted probability for each language, use the predict_detail() function, which takes a list of strings.
from tetunlid import lid
input_list_of_str = ["Sé mak toba iha ne'ebá?"]
output_detail = lid.predict_detail(input_list_of_str)
print('\n'.join(output_detail))
This will be the output:
Input text: "Sé mak toba iha ne'ebá?"
Probability:
English: 0.0010
Indonesian: 0.0014
Portuguese: 0.0082
Tetun: 0.9967
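If the probabilities are needed as numbers rather than printed text, the report can be parsed back into a dictionary. The following is a minimal sketch, assuming the report has exactly the layout shown above (an "Input text:" line followed by one "Language: value" line per language). The helper name parse_detail_report is hypothetical and not part of the tetunlid package.

```python
import re
from typing import Dict, List

# Matches lines such as "Tetun: 0.9967". This pattern is an assumption based
# on the sample output above, not on a documented format guarantee.
_PROB_LINE = re.compile(r"^([A-Za-z]+):\s*([0-9]*\.?[0-9]+)$")
_INPUT_PREFIX = "Input text:"


def parse_detail_report(report: str) -> List[Dict]:
    """Turn the printed predict_detail() report into a list of records."""
    records: List[Dict] = []
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if line.startswith(_INPUT_PREFIX):
            # Start a new record; drop the surrounding double quotes.
            text = line[len(_INPUT_PREFIX):].strip().strip('"')
            records.append({"text": text, "probabilities": {}})
            continue
        match = _PROB_LINE.match(line)
        if match and records:
            language, value = match.groups()
            records[-1]["probabilities"][language] = float(value)
    return records


# Sample report copied from the output shown above.
sample = """Input text: "Sé mak toba iha ne'ebá?"
Probability:
English: 0.0010
Indonesian: 0.0014
Portuguese: 0.0082
Tetun: 0.9967"""

records = parse_detail_report(sample)
probabilities = records[0]["probabilities"]
best = max(probabilities, key=probabilities.get)
print(best, probabilities[best])
```

With the package installed, the same helper could be applied to '\n'.join(lid.predict_detail(input_list_of_str)).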
- We can feed a mixed corpus containing multiple languages into the LID model. Observe the following example:
from tetunlid import lid
multiple_langs = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
output = [(ml, lid.predict_language(ml)) for ml in multiple_langs]
print(output)
This will be the output:
[("Ha'u ema baibain", 'Tetun'), ('I am not available', 'English'), ('Apa kabar kawan?', 'Indonesian'), ('Estou a estudar', 'Portuguese')]
You can print each result to the console as follows:
from tetunlid import lid
import warnings
warnings.filterwarnings('ignore')
input_texts = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
for input_text in input_texts:
    lang = lid.predict_language(input_text)
    print(f"{input_text} ({lang})")
This will be the output:
Ha'u ema baibain (Tetun)
I am not available (English)
Apa kabar kawan? (Indonesian)
Estou a estudar (Portuguese)
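For a larger corpus, a per-language count is often more useful than a line-by-line listing. The sketch below takes the prediction function as a parameter, so it runs here with a stand-in predictor. The names language_distribution and stub_predict are hypothetical, and the stub simply returns the labels shown in the output above.

```python
from collections import Counter
from typing import Callable, Iterable


def language_distribution(texts: Iterable[str],
                          predict: Callable[[str], str]) -> Counter:
    """Count how many texts are assigned to each language label."""
    return Counter(predict(text) for text in texts)


# Stand-in predictor for illustration only: it returns the labels from the
# example output above instead of calling the real model.
_STUB_LABELS = {
    "Ha'u ema baibain": "Tetun",
    "I am not available": "English",
    "Apa kabar kawan?": "Indonesian",
    "Estou a estudar": "Portuguese",
}


def stub_predict(text: str) -> str:
    return _STUB_LABELS[text]


input_texts = ["Ha'u ema baibain", "I am not available",
               "Apa kabar kawan?", "Estou a estudar"]
counts = language_distribution(input_texts, stub_predict)
for language, count in sorted(counts.items()):
    print(f"{language}: {count}")
```

With the package installed, pass lid.predict_language in place of stub_predict.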
To print the probability details for each input, pass the whole list to predict_detail(). Here is an example:
from tetunlid import lid
import warnings
warnings.filterwarnings('ignore')
input_texts = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
output_multiple_details = lid.predict_detail(input_texts)
print('\n'.join(output_multiple_details))
This will be the output:
Input text: "Ha'u ema baibain"
Probability:
English: 0.0032
Indonesian: 0.0032
Portuguese: 0.0028
Tetun: 0.9907
Input text: "I am not available"
Probability:
English: 0.9999
Indonesian: 0.00001
Portuguese: 0.00001
Tetun: 0.00001
Input text: "Apa kabar kawan?"
Probability:
English: 0.0011
Indonesian: 0.9961
Portuguese: 0.0015
Tetun: 0.0184
Input text: "Estou a estudar"
Probability:
English: 0.0003
Indonesian: 0.002
Portuguese: 0.9810
Tetun: 0.0184
- We can keep only the Tetun texts from a mixed corpus containing multiple languages using the predict_language() function.
from tetunlid import lid
import warnings
warnings.filterwarnings('ignore')
input_texts = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
output = [text for text in input_texts if lid.predict_language(text) == 'Tetun']
print(output)
This will be the output:
["Ha'u ema baibain"]
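When the corpus is read from a file, the same filter can be written as a generator so that lines are processed one at a time and blank lines are skipped. This is a minimal sketch: keep_language and stub_predict are hypothetical names, and the stub stands in for the real model so the example runs without the package.

```python
from typing import Callable, Iterable, Iterator


def keep_language(lines: Iterable[str],
                  predict: Callable[[str], str],
                  target: str = "Tetun") -> Iterator[str]:
    """Yield the stripped lines whose predicted language equals `target`."""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of sending them to the model
        if predict(text) == target:
            yield text


# Stand-in predictor for illustration only: it returns the labels from the
# examples above instead of calling the real model.
_STUB_LABELS = {
    "Ha'u ema baibain": "Tetun",
    "I am not available": "English",
    "Apa kabar kawan?": "Indonesian",
    "Estou a estudar": "Portuguese",
}


def stub_predict(text: str) -> str:
    return _STUB_LABELS[text]


raw_lines = ["Ha'u ema baibain\n", "\n", "I am not available\n",
             "  Apa kabar kawan?  ", "Estou a estudar"]
tetun_only = list(keep_language(raw_lines, stub_predict))
print(tetun_only)
```

With the package installed, pass lid.predict_language in place of stub_predict and an open file object in place of raw_lines.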
Additional notes
- If you encounter AttributeError: 'list' object has no attribute 'predict_proba', the package may not have installed correctly. Please send me an email, and I will guide you on how to handle the error.
- Please make sure that you use the latest version of Tetun LID by running this command in your console: pip install --upgrade tetun-lid
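When reporting an installation problem, it may help to include the installed versions of the packages named in the installation command. The sketch below uses only the standard library. The helper name installed_versions is hypothetical, and the snippet does not diagnose the error itself.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip command in the Installation section.
report = installed_versions(["tetun-lid", "joblib", "scikit-learn"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```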
Citation
If you use this repository or any of its contents for your research, academic work, or publication, we kindly request that you cite it as follows:
@inproceedings{de-jesus-nunes-2024-labadain-crawler,
title = "Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus",
author = "de Jesus, Gabriel and
Nunes, S{\'e}rgio Sobral",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.390",
pages = "4368--4380"
}
Acknowledgement
This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia under the PhD scholarship grant number SFRH/BD/151437/2021 (DOI 10.54499/SFRH/BD/151437/2021).
License
Contact Information
If you have any questions or feedback, please feel free to contact me at mestregabrieldejesus[at]gmail.com