Implementation of the Turkish LM Tuner

Project description

🦖 Turkish LM Tuner



Overview

Turkish LM Tuner is a library for fine-tuning Turkish language models on various NLP tasks. Built on top of the Hugging Face Transformers library, it supports fine-tuning for both conditional generation and sequence classification tasks. The library is designed to be modular and extensible, so new tasks and models are easy to add, and it also provides data loaders for a range of Turkish NLP datasets.

Installation

You can install turkish-lm-tuner via PyPI:

pip install turkish-lm-tuner

Alternatively, you can install the latest version directly from GitHub:

pip install git+https://github.com/boun-tabi-LMG/turkish-lm-tuner.git
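After installation, a quick import check confirms the package is available (a plain sanity check, not a documented turkish-lm-tuner command):

python -c "import turkish_lm_tuner"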

Model Support

Any encoder or conditional generation model that is compatible with the Hugging Face Transformers library can be used with Turkish LM Tuner. Tested and supported models include TURNA (boun-tabi-LMG/TURNA), which is used in the examples below.
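Because model loading goes through Transformers, checking whether a checkpoint is compatible is just the standard Transformers flow. A minimal sketch with the TURNA checkpoint used in the examples below (plain transformers code, not a turkish-lm-tuner API):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)      # fetches the tokenizer from the Hugging Face Hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # encoder-decoder model usable for conditional generation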

Task and Dataset Support

Task                          Datasets
Text Classification           Product Reviews, TTC4900, Tweet Sentiment
Natural Language Inference    NLI_TR, SNLI_TR, MultiNLI_TR
Semantic Textual Similarity   STSb_TR
Named Entity Recognition      WikiANN, Milliyet NER
Part-of-Speech Tagging        BOUN, IMST
Text Summarization            TR News, MLSUM, Combined TR News and MLSUM
Title Generation              TR News, MLSUM, Combined TR News and MLSUM
Paraphrase Generation         OpenSubtitles, Tatoeba, TED Talks
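The same DatasetProcessor shown in the examples below also covers these tasks; only the task arguments change. A hypothetical sketch for a text classification dataset (the dataset_name and task_format strings here are assumptions, not verified identifiers; consult the library's documentation for the exact values):

from turkish_lm_tuner import DatasetProcessor

# "ttc4900" and "classification" are assumed identifiers, used for illustration only
dataset_processor = DatasetProcessor(
    dataset_name="ttc4900", task="classification", task_format="classification",
    task_mode='', tokenizer_name="boun-tabi-LMG/TURNA",
    max_input_length=764, max_target_length=8
)
train_dataset = dataset_processor.load_and_preprocess_data(split='train')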

Usage

The tutorials in the documentation can help you get started with turkish-lm-tuner.

Examples

Fine-tune and evaluate a conditional generation model

from turkish_lm_tuner import DatasetProcessor, TrainerForConditionalGeneration

dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
max_input_length = 764
max_target_length = 128
dataset_processor = DatasetProcessor(
    dataset_name=dataset_name, task=task, task_format=task_format, task_mode='',
    tokenizer_name=model_name, max_input_length=max_input_length, max_target_length=max_target_length
)

train_dataset = dataset_processor.load_and_preprocess_data(split='train')
eval_dataset = dataset_processor.load_and_preprocess_data(split='validation')
test_dataset = dataset_processor.load_and_preprocess_data(split="test")

training_params = {
    'num_train_epochs': 10,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'output_dir': './', 
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'predict_with_generate': True    
}
optimizer_params = {
    'optimizer_type': 'adafactor',
    'scheduler': False,
}

model_save_path = "turna_summarization_tr_news"
model_trainer = TrainerForConditionalGeneration(
    model_name=model_name, task=task,
    optimizer_params=optimizer_params,
    training_params=training_params,
    model_save_path=model_save_path,
    max_input_length=max_input_length,
    max_target_length=max_target_length,
    postprocess_fn=dataset_processor.dataset.postprocess_data
)

trainer, model = model_trainer.train_and_evaluate(train_dataset, eval_dataset, test_dataset)

# Save the fine-tuned model and tokenizer to the same directory
model.save_pretrained(model_save_path)
dataset_processor.tokenizer.save_pretrained(model_save_path)
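After saving, the fine-tuned checkpoint can be loaded for inference with plain Transformers. A minimal sketch (the generation settings here are illustrative, not defaults prescribed by the library):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("turna_summarization_tr_news")
model = AutoModelForSeq2SeqLM.from_pretrained("turna_summarization_tr_news")

text = "Özetlenecek haber metni buraya gelir."  # placeholder article text to summarize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=764)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))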

Evaluate a conditional generation model with custom generation config

from turkish_lm_tuner import DatasetProcessor, EvaluatorForConditionalGeneration

dataset_name = "tr_news"
task = "summarization"
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
task_mode = ''
max_input_length = 764
max_target_length = 128
dataset_processor = DatasetProcessor(
    dataset_name, task, task_format, task_mode,
    model_name, max_input_length, max_target_length
)

test_dataset = dataset_processor.load_and_preprocess_data(split="test")

test_params = {
    'per_device_eval_batch_size': 4,
    'output_dir': './',
    'predict_with_generate': True
}

model_path = "turna_summarization_tr_news"  # directory saved by the fine-tuning example above
generation_params = {
    'num_beams': 4,
    'length_penalty': 2.0,
    'no_repeat_ngram_size': 3,
    'early_stopping': True,
    'max_length': 128,
    'min_length': 30,
}
evaluator = EvaluatorForConditionalGeneration(
    model_path, model_name, task, max_input_length, max_target_length, test_params,
    generation_params, dataset_processor.dataset.postprocess_data
)
results = evaluator.evaluate_model(test_dataset)
print(results)
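As an independent cross-check of the evaluator's output, summarization predictions can also be scored directly with the Hugging Face evaluate package (a sketch assuming predictions and references are lists of strings; evaluate and rouge_score are separate dependencies, installed with pip install evaluate rouge_score):

import evaluate

rouge = evaluate.load("rouge")
predictions = ["model output summary"]  # placeholder: decoded model outputs
references = ["reference summary"]      # placeholder: gold summaries
print(rouge.compute(predictions=predictions, references=references))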

Reference

If you use this repository, please cite the following related paper:

@inproceedings{uludogan-etal-2024-turna,
    title = "{TURNA}: A {T}urkish Encoder-Decoder Language Model for Enhanced Understanding and Generation",
    author = {Uludo{\u{g}}an, G{\"o}k{\c{c}}e  and
      Balal, Zeynep  and
      Akkurt, Furkan  and
      Turker, Meliksah  and
      Gungor, Onur  and
      {\"U}sk{\"u}darl{\i}, Susan},
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.600",
    doi = "10.18653/v1/2024.findings-acl.600",
    pages = "10103--10117",
}

License

Note that all datasets belong to their respective owners. If you use the datasets provided by this library, please cite the original source.

This codebase is licensed under the MIT License. See LICENSE for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turkish_lm_tuner-0.1.4.tar.gz (43.1 kB)


Built Distribution

turkish_lm_tuner-0.1.4-py3-none-any.whl (23.4 kB)


File details

Details for the file turkish_lm_tuner-0.1.4.tar.gz.

File metadata

  • Download URL: turkish_lm_tuner-0.1.4.tar.gz
  • Upload date:
  • Size: 43.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.7

File hashes

Hashes for turkish_lm_tuner-0.1.4.tar.gz
Algorithm     Hash digest
SHA256        ea5a55e67f723c1fbbe58a9e5e27b42a62c7b9c8b4e290e086d668b7d42e30ac
MD5           288b22e27dc08717cf89fda275eed80a
BLAKE2b-256   47c2c0e28219f3ed8b293c3f314ca5cb228945492cb7ab49282365416643df61


File details

Details for the file turkish_lm_tuner-0.1.4-py3-none-any.whl.

File hashes

Hashes for turkish_lm_tuner-0.1.4-py3-none-any.whl
Algorithm     Hash digest
SHA256        f48b3351635bca275bec12f0e82cd5a6d31244d6f3f53b3608d08532174b6d24
MD5           be44f16137b3ad6b3c1c9f6dbf20d5c8
BLAKE2b-256   67f7e7a31bfb1dfb5350ec0251b313376b762e07041c7216dd083940dd2ec1bd

