EduSenti: Education Review Sentiment in Albanian

Supports Python 3.10 and 3.11.

Pretraining and sentiment corpora and analysis of student-to-instructor reviews in Albanian. This repository contains the code base used for the paper RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian. To reproduce the results, see the paper reproduction repository. If you use our model or API, please cite our paper.

Obtaining

The library can be installed with pip from the PyPI repository:

pip3 install zensols.edusenti

The models are downloaded on the first use of the command-line or API.

Usage

Command line:

$ edusenti predict sq.txt
(+): <Për shkak të gjendjes së krijuar si pasojë e pandemisë edhe ne sikur [...]>
(-): <Fillimisht isha e shqetësuar se si do ti mbanim kuizet, si do [...]>
(+): <Kjo gjendje ka vazhduar edhe në kohën e provimeve>
...

Use the csv action to write all predictions to a comma-delimited file (see edusenti --help).
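If you need more control over the output than the csv action provides, the standard library can write predictions to a comma-delimited file directly. The sketch below uses hard-coded sample values matching the API example in this README; in practice the values would come from app.predict(...), and the column layout here is illustrative rather than what the csv action itself emits:

```python
import csv

# Sample prediction values (in practice, taken from the doc.text,
# doc.pred and doc.softmax_logit attributes of predicted documents).
rows = [
    {'text': 'Kjo gjendje ka vazhduar edhe në kohën e provimeve',
     'prediction': '+',
     'p(+)': 0.70292175, 'p(-)': 0.17432323, 'p(n)': 0.12275504},
]

with open('predictions.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```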

API

>>> from zensols.edusenti import (
...     ApplicationFactory, Application, SentimentFeatureDocument)
>>> app: Application = ApplicationFactory.get_application()
>>> doc: SentimentFeatureDocument
>>> for doc in app.predict(['Kjo gjendje ka vazhduar edhe në kohën e provimeve']):
...     print(f'sentence: {doc.text}')
...     print(f'prediction: {doc.pred}')
...     print(f'logits: {doc.softmax_logit}')

sentence: Kjo gjendje ka vazhduar edhe në kohën e provimeve
prediction: +
logits: {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
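The softmax_logit mapping makes the class probabilities explicit ('+' positive, '-' negative, and 'n' presumably neutral); the predicted label is simply the highest-probability key. A minimal sketch using the values above:

```python
# Softmax probabilities as returned in doc.softmax_logit
# ('+' positive, '-' negative, 'n' presumably neutral).
logits = {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}

# The predicted label (doc.pred) is the argmax over the classes.
pred = max(logits, key=logits.get)
print(pred)  # prints: +
```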

Models

The models are downloaded the first time the API is used. To change the model (by default xlm-roberta-base is used) on the command-line, use --override esi_default.model_namel=xlm-roberta-large. You can also create a ~/.edusentirc file with the following:

[esi_default]
model_namel = xlm-roberta-large
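The same stanza can also be generated programmatically with the standard library. A sketch that writes it to a local example file (point the path at ~/.edusentirc for it to take effect):

```python
from configparser import ConfigParser

# Build the [esi_default] stanza shown above.
config = ConfigParser()
config['esi_default'] = {'model_namel': 'xlm-roberta-large'}

# Written to an example file here; use ~/.edusentirc in practice.
with open('edusentirc.example', 'w') as f:
    config.write(f)
```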

The performance of the models on the test set, when trained and validated on their respective splits, is shown below.

Model               F1    Precision  Recall
xlm-roberta-base    78.1  80.7       79.7
xlm-roberta-large   83.5  84.9       84.7

However, the distributed models were trained on the training and test sets combined. The validation metrics of those trained models are available on the command line with edusenti info.

Differences from the Paper Repository

The paper reproduction repository has quite a few differences, mostly around reproducibility; this repository, by contrast, is designed as a package for research that applies the model. To reproduce the results of the paper, refer to the reproduction repository. To use the best performing model from that paper (XLM-RoBERTa Large), use this repository.

The primary difference is that this repository has significantly better performance in Albanian, with F1 climbing from 71.9 to 83.5 (see Models). However, this repository has no English sentiment model, since English was only used for comparing methods.

Changes include:

  • Python was upgraded from 3.9.9 to 3.11.6
  • PyTorch was upgraded from 1.12.1 to 2.1.1
  • HuggingFace transformers was upgraded from 4.19 to 4.35
  • zensols.deepnlp was upgraded from 1.8 to 1.13
  • The dataset was re-split and stratified.

Documentation

See the full documentation. The API reference is also available.

Changelog

An extensive changelog is available here.

Citation

If you use this project in your research, please use the following BibTeX entry:

@inproceedings{nuci-etal-2024-roberta-low,
    title = "{R}o{BERT}a Low Resource Fine Tuning for Sentiment Analysis in {A}lbanian",
    author = "Nuci, Krenare Pireva  and
      Landes, Paul  and
      Di Eugenio, Barbara",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1233",
    pages = "14146--14151"
}

License

MIT License

Copyright (c) 2023 - 2024 Paul Landes and Krenare Pireva Nuci
