# 🦖 Turkish LM Tuner

## Overview
Turkish LM Tuner is a library for fine-tuning Turkish language models on various NLP tasks. Built on top of the Hugging Face Transformers library, it supports fine-tuning for both conditional generation and sequence classification tasks. The library is designed to be modular and extensible, so new tasks and models are easy to add, and it ships with data loaders for a range of Turkish NLP datasets.
## Installation

You can install `turkish-lm-tuner` via PyPI:

```bash
pip install turkish-lm-tuner
```

Alternatively, install the latest version directly from GitHub:

```bash
pip install git+https://github.com/boun-tabi-LMG/turkish-lm-tuner.git
```
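A quick import confirms the installation (the class names are those used in the examples below):

```python
# Verify that the package and its main entry points are importable
from turkish_lm_tuner import DatasetProcessor, TrainerForConditionalGeneration

print("turkish-lm-tuner imported successfully")
```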
## Model Support

Any encoder or conditional generation model compatible with the Hugging Face Transformers library can be used with Turkish LM Tuner. Tested models include TURNA (`boun-tabi-LMG/TURNA`), which is used in the examples below.
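Because `model_name` is just a Hugging Face Hub identifier, swapping in another seq2seq checkpoint only changes one line. A minimal sketch, using `google/mt5-base` as an illustrative stand-in (its presence on the library's tested-model list is an assumption):

```python
from turkish_lm_tuner import DatasetProcessor

# `google/mt5-base` is an illustrative multilingual seq2seq checkpoint,
# not necessarily on the library's tested-model list.
model_name = "google/mt5-base"

dataset_processor = DatasetProcessor(
    dataset_name="tr_news", task="summarization",
    task_format="conditional_generation", task_mode='',
    tokenizer_name=model_name, max_input_length=764, max_target_length=128
)
```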
## Task and Dataset Support

| Task | Datasets |
|---|---|
| Text Classification | Product Reviews, TTC4900, Tweet Sentiment |
| Natural Language Inference | NLI_TR, SNLI_TR, MultiNLI_TR |
| Semantic Textual Similarity | STSb_TR |
| Named Entity Recognition | WikiANN, Milliyet NER |
| Part-of-Speech Tagging | BOUN, IMST |
| Text Summarization | TR News, MLSUM, Combined TR News and MLSUM |
| Title Generation | TR News, MLSUM, Combined TR News and MLSUM |
| Paraphrase Generation | OpenSubtitles, Tatoeba, TED Talks |
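The examples below focus on conditional generation; sequence classification tasks go through the same `DatasetProcessor`. A minimal sketch, assuming `"classification"` as the task/format strings and `"ttc4900"` as the dataset identifier (all three strings are assumptions, not verified against the library):

```python
from turkish_lm_tuner import DatasetProcessor

# The identifier strings below are assumptions for illustration; check the
# documentation for the exact dataset and task names the library expects.
dataset_processor = DatasetProcessor(
    dataset_name="ttc4900", task="classification",
    task_format="classification", task_mode='',
    tokenizer_name="boun-tabi-LMG/TURNA",
    max_input_length=764, max_target_length=8  # short targets: class labels
)
train_dataset = dataset_processor.load_and_preprocess_data(split='train')
```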
## Usage

The tutorials in the documentation can help you get started with `turkish-lm-tuner`.

### Examples
#### Fine-tune and evaluate a conditional generation model

```python
from turkish_lm_tuner import DatasetProcessor, TrainerForConditionalGeneration

# Task and model configuration
dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
max_input_length = 764
max_target_length = 128
model_save_path = "turna_summarization_tr_news"

# Load and tokenize the dataset splits
dataset_processor = DatasetProcessor(
    dataset_name=dataset_name, task=task, task_format=task_format, task_mode='',
    tokenizer_name=model_name, max_input_length=max_input_length,
    max_target_length=max_target_length
)

train_dataset = dataset_processor.load_and_preprocess_data(split='train')
eval_dataset = dataset_processor.load_and_preprocess_data(split='validation')
test_dataset = dataset_processor.load_and_preprocess_data(split='test')

# Training arguments passed through to the Hugging Face trainer
training_params = {
    'num_train_epochs': 10,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'output_dir': './',
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'predict_with_generate': True
}

# Adafactor without a learning-rate scheduler
optimizer_params = {
    'optimizer_type': 'adafactor',
    'scheduler': False,
}

model_trainer = TrainerForConditionalGeneration(
    model_name=model_name, task=task,
    optimizer_params=optimizer_params,
    training_params=training_params,
    model_save_path=model_save_path,
    max_input_length=max_input_length,
    max_target_length=max_target_length,
    postprocess_fn=dataset_processor.dataset.postprocess_data
)

trainer, model = model_trainer.train_and_evaluate(train_dataset, eval_dataset, test_dataset)

# Persist the fine-tuned model and tokenizer
model.save_pretrained(model_save_path)
dataset_processor.tokenizer.save_pretrained(model_save_path)
```
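Continuing from the example above, the fine-tuned artifacts can also be shared on the Hugging Face Hub. The repository id below is a hypothetical placeholder, and pushing requires an authenticated `huggingface-cli login`:

```python
# `your-username/turna-summarization-tr-news` is a hypothetical repo id
model.push_to_hub("your-username/turna-summarization-tr-news")
dataset_processor.tokenizer.push_to_hub("your-username/turna-summarization-tr-news")
```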
#### Evaluate a conditional generation model with a custom generation config

```python
from turkish_lm_tuner import DatasetProcessor, EvaluatorForConditionalGeneration

# Task and model configuration
dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
task_mode = ''
max_input_length = 764
max_target_length = 128

dataset_processor = DatasetProcessor(
    dataset_name, task, task_format, task_mode,
    model_name, max_input_length, max_target_length
)

test_dataset = dataset_processor.load_and_preprocess_data(split='test')

# Evaluation arguments passed through to the Hugging Face trainer
test_params = {
    'per_device_eval_batch_size': 4,
    'output_dir': './',
    'predict_with_generate': True
}

# Path to the fine-tuned checkpoint to evaluate
model_path = "turna_tr_news_summarization"

# Decoding configuration used during generation
generation_params = {
    'num_beams': 4,
    'length_penalty': 2.0,
    'no_repeat_ngram_size': 3,
    'early_stopping': True,
    'max_length': 128,
    'min_length': 30,
}

evaluator = EvaluatorForConditionalGeneration(
    model_path, model_name, task, max_input_length, max_target_length, test_params,
    generation_params, dataset_processor.dataset.postprocess_data
)

results = evaluator.evaluate_model(test_dataset)
print(results)
```
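Since the keys in `generation_params` are standard Hugging Face `generate()` arguments, the same decoding configuration can be tried directly on the fine-tuned checkpoint with plain Transformers. A minimal sketch, assuming the checkpoint loads as a seq2seq model (the input text is a placeholder):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "turna_tr_news_summarization"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Placeholder article text; truncate to the input length used during training
inputs = tokenizer("Haber metni buraya...", return_tensors="pt",
                   truncation=True, max_length=764)

# The dict's keys are standard `generate()` arguments, so it unpacks directly
generation_params = {
    'num_beams': 4,
    'length_penalty': 2.0,
    'no_repeat_ngram_size': 3,
    'early_stopping': True,
    'max_length': 128,
    'min_length': 30,
}
outputs = model.generate(**inputs, **generation_params)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```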
## Reference

If you use this repository, please cite the following paper:
```bibtex
@inproceedings{uludogan-etal-2024-turna,
    title = "{TURNA}: A {T}urkish Encoder-Decoder Language Model for Enhanced Understanding and Generation",
    author = {Uludo{\u{g}}an, G{\"o}k{\c{c}}e and
      Balal, Zeynep and
      Akkurt, Furkan and
      Turker, Meliksah and
      Gungor, Onur and
      {\"U}sk{\"u}darl{\i}, Susan},
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.600",
    doi = "10.18653/v1/2024.findings-acl.600",
    pages = "10103--10117",
}
```
## License

Note that all datasets belong to their respective owners; if you use the datasets provided by this library, please cite their original sources. The code base is licensed under the MIT License. See `LICENSE` for details.