Tetun Language Identification Model
Tetun LID
The Tetun Language Identification (Tetun LID) model is a machine learning model that automatically identifies the language of a given text. It is designed to recognize four languages commonly spoken in Timor-Leste: Tetun, Portuguese, English, and Indonesian.
Installation
With pip:
pip install tetun-lid
Dependencies
The Tetun LID model depends on the following packages:
- joblib
- scikit-learn
Install the dependency packages with pip:
pip install joblib
pip install scikit-learn
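Before importing the model, it can help to confirm that both dependencies are visible to the current Python environment. The following is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical and not part of Tetun LID. Note that scikit-learn is imported under the module name sklearn.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above. scikit-learn is
# imported as "sklearn", which differs from its pip package name.
REQUIRED_MODULES = ("joblib", "sklearn")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment.

    Hypothetical helper for illustration; not part of the tetun-lid package.
    """
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If the script reports a missing module, rerun the matching pip install command above in the same environment.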
Usage
The following examples show how to use the model:
- To predict the language of an input text, use the predict_language() function.
from tetunlid import lid
input_text = "Sé mak toba iha ne'ebá?"
output = lid.predict_language(input_text)
print(output)
This will be the output. Note that the initial call loads the LID model, so it may take a few minutes to complete:
Tetun
- To print the predicted probability of each language for an input text, use the predict_detail() function.
from tetunlid import lid
input_list_of_str = ["Sé mak toba iha ne'ebá?"]
output_detail = lid.predict_detail(input_list_of_str)
print('\n'.join(output_detail))
This will be the output:
Input text: "Sé mak toba iha ne'ebá?"
Probability:
English: 0.0010
Indonesian: 0.0014
Portuguese: 0.0082
Tetun: 0.9967
Note: The output of predict_detail() is a list of strings, so to print the result in the console, use a for loop or join() as in the previous example.
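Because predict_detail() returns text rather than numbers, it may be useful to convert that text back into probabilities for further processing. The sketch below assumes the lines follow exactly the layout printed above ("Input text:", then "Probability:", then one "Language: value" line per language). The function parse_detail is a hypothetical helper, not part of Tetun LID, and the sample list is copied from the output above rather than produced by the model.

```python
def parse_detail(detail_lines):
    """Turn predict_detail()-style lines into (text, {language: probability}) pairs.

    Hypothetical helper; it assumes the line layout shown in this README.
    """
    results = []
    # Joining and re-splitting handles both one line per string and
    # multi-line strings, since the exact chunking is an assumption here.
    for line in "\n".join(detail_lines).splitlines():
        line = line.strip()
        if line.startswith("Input text:"):
            text = line[len("Input text:"):].strip().strip('"')
            results.append((text, {}))
        elif ":" in line and results:
            label, _, value = line.rpartition(":")
            try:
                results[-1][1][label.strip()] = float(value)
            except ValueError:
                pass  # e.g. the "Probability:" header, which has no number
    return results


# Sample lines copied from the output shown above.
sample = [
    'Input text: "Sé mak toba iha ne\'ebá?"',
    "Probability:",
    "English: 0.0010",
    "Indonesian: 0.0014",
    "Portuguese: 0.0082",
    "Tetun: 0.9967",
]

for text, probabilities in parse_detail(sample):
    best = max(probabilities, key=probabilities.get)
    print(f"{text} -> {best} ({probabilities[best]:.4f})")
```

With the real model, the same helper could be called as parse_detail(lid.predict_detail(input_list_of_str)), provided the output format matches the assumption above.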
- We can feed a mixed corpus containing multiple languages into the LID model as the input list. Observe the following example:
from tetunlid import lid
multiple_langs = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
output = [(ml, lid.predict_language(ml)) for ml in multiple_langs]
print(output)
This will be the output:
[("Ha'u ema baibain", 'Tetun'), ('I am not available', 'English'), ('Apa kabar kawan?', 'Indonesian'), ('Estou a estudar', 'Portuguese')]
You can print the output in the console as follows:
from tetunlid import lid
import warnings
warnings.filterwarnings('ignore')
input_texts = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
for input_text in input_texts:
    lang = lid.predict_language(input_text)
    print(f"{input_text} ({lang})")
This will be the output:
Ha'u ema baibain (Tetun)
I am not available (English)
Apa kabar kawan? (Indonesian)
Estou a estudar (Portuguese)
To print the details of each input, use predict_detail() as explained previously. Here is an example:
from tetunlid import lid
import warnings
warnings.filterwarnings('ignore')
input_texts = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
output_multiple_detail = lid.predict_detail(input_texts)
print('\n'.join(output_multiple_detail))
This will be the output:
Input text: "Ha'u ema baibain"
Probability:
English: 0.0032
Indonesian: 0.0032
Portuguese: 0.0028
Tetun: 0.9907
Input text: "I am not available"
Probability:
English: 0.9999
Indonesian: 0.00001
Portuguese: 0.00001
Tetun: 0.00001
Input text: "Apa kabar kawan?"
Probability:
English: 0.0011
Indonesian: 0.9961
Portuguese: 0.0015
Tetun: 0.0184
Input text: "Estou a estudar"
Probability:
English: 0.0003
Indonesian: 0.002
Portuguese: 0.9810
Tetun: 0.0184
- We can keep only the Tetun text from a mixed corpus containing multiple languages by using the predict_language() function.
from tetunlid import lid
import warnings
warnings.filterwarnings('ignore')
input_texts = ["Ha'u ema baibain", "I am not available",
"Apa kabar kawan?", "Estou a estudar"]
output = [text for text in input_texts if lid.predict_language(text) == 'Tetun']
print(output)
This will be the output:
["Ha'u ema baibain"]
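The same idea extends from keeping one language to splitting a corpus into all of its languages at once. The sketch below groups texts by predicted label. The names group_by_language and toy_predict are hypothetical: toy_predict is a lookup-table stand-in that mimics the labels shown in this README so the example runs without loading the model. In real use, pass lid.predict_language as the predict argument.

```python
from collections import defaultdict


def group_by_language(texts, predict):
    """Group texts by the label that predict(text) returns.

    Hypothetical helper; `predict` is any callable mapping a string to a
    language label, for example lid.predict_language.
    """
    groups = defaultdict(list)
    for text in texts:
        groups[predict(text)].append(text)
    return dict(groups)


def toy_predict(text):
    """Stand-in predictor for illustration only; it does NOT run the LID model."""
    lookup = {
        "Ha'u ema baibain": "Tetun",
        "I am not available": "English",
        "Apa kabar kawan?": "Indonesian",
        "Estou a estudar": "Portuguese",
    }
    return lookup.get(text, "Unknown")


input_texts = ["Ha'u ema baibain", "I am not available",
               "Apa kabar kawan?", "Estou a estudar"]

# Replace toy_predict with lid.predict_language to use the real model.
for language, texts in group_by_language(input_texts, toy_predict).items():
    print(language, texts)
```

Passing the predictor as an argument keeps the grouping logic independent of the model, which also makes it easy to test.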
- We can also use Tetun LID to predict texts from a file containing various languages or texts extracted from the web. Here is an example:
from pathlib import Path
from tetunlid import lid
import warnings
warnings.filterwarnings('ignore')
file_path = Path("myfile/example.txt")
try:
    with file_path.open('r', encoding='utf-8') as f:
        contents = [line.strip() for line in f]
except FileNotFoundError:
    print(f"File not found at: {file_path}")
    contents = []  # nothing to classify if the file is missing

output = [(content, lid.predict_language(content)) for content in contents]
print(output)
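For larger files, one possible variation is to process the file line by line and skip blank lines, which would otherwise be sent to the model as empty strings. This is a sketch under assumptions, not the project's own method: label_lines is a hypothetical helper, and its predict parameter is expected to be a callable such as lid.predict_language.

```python
from pathlib import Path


def label_lines(file_path, predict):
    """Yield (line, label) for every non-empty line in a UTF-8 text file.

    Hypothetical helper; `predict` is any callable mapping a string to a
    language label, for example lid.predict_language.
    """
    with Path(file_path).open("r", encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if line:  # blank lines carry no language signal, so skip them
                yield line, predict(line)


# Example call with the real model (requires tetun-lid to be installed):
#   from tetunlid import lid
#   for line, lang in label_lines("myfile/example.txt", lid.predict_language):
#       print(f"{line} ({lang})")
```

Because the function is a generator, only one line is held in memory at a time, and the file is opened when iteration starts.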
Additional notes
- All dependencies must be installed before using the model.
- If you encounter an AttributeError: 'list' object has no attribute 'predict_proba', something may have gone wrong while installing the package. Please send me an email, and I will guide you on how to handle the error.
- Please make sure that you use the latest version of Tetun LID by running this command in your console:
pip install --upgrade tetun-lid
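To see which version is currently installed before upgrading, the standard library can report it. This is a minimal sketch: installed_version is a hypothetical helper, and the distribution name tetun-lid is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="tetun-lid"):
    """Return the installed version string, or None if it is not installed.

    Hypothetical helper for illustration; not part of the tetun-lid package.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("tetun-lid is not installed in this environment.")
    else:
        print(f"Installed tetun-lid version: {found}")
```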
To get the source code, visit the GitHub repository for this project.
Citation
If you use this repository or any of its contents for your research, academic work, or publication, we kindly request that you cite it as follows:
@misc{jesus-nunes-2024,
author = {Gabriel de Jesus and Sérgio Nunes},
title = {Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus},
year = {2024},
note = {Accepted at LREC-COOLING, 2024},
}
Acknowledgement
This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia under the PhD scholarship grant number SFRH/BD/151437/2021.
License
Contact Information
If you have any questions or feedback, please feel free to contact mestregabrieldejesus[at]gmail.com
Hashes for tetun_lid-1.1.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7de2c68a5c414be99c3a9f51e978276a48c8e87c7005046aed758e34f537bbbc
MD5 | 3d5635c57ae886f75b663f2e5406a9e7
BLAKE2b-256 | 2bc1a0ab8e62b1f4d4b871feafcf084f27a93c62f4ee6d076aa87a3d574b2c42