Skip to main content

A Transformer-based library for SocialNLP tasks

Project description

pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

Tests

Test it in Colab

A Transformer-based library for SocialNLP tasks.

Currently supports:

Task Languages
Sentiment Analysis es, en, it, pt
Hate Speech Detection es, en, it, pt
Irony Detection es, en, it, pt
Emotion Analysis es, en, it
NER & POS tagging es, en

Just do pip install pysentimiento and start using it:

Getting Started

from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns AnalyzerOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns AnalyzerOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns AnalyzerOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})

analyzer.predict("jejeje no te creo mucho")
# AnalyzerOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})
"""
Emotion Analysis in English
"""

emotion_analyzer = create_analyzer(task="emotion", lang="en")

emotion_analyzer.predict("yayyy")
# returns AnalyzerOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns AnalyzerOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})

"""
Hate Speech (misogyny & racism)
"""
hate_speech_analyzer = create_analyzer(task="hate_speech", lang="es")

hate_speech_analyzer.predict("Esto es una mierda pero no es odio")
# returns AnalyzerOutput(output=[], probas={hateful: 0.022, targeted: 0.009, aggressive: 0.018})
hate_speech_analyzer.predict("Esto es odio porque los inmigrantes deben ser aniquilados")
# returns AnalyzerOutput(output=['hateful'], probas={hateful: 0.835, targeted: 0.008, aggressive: 0.476})

hate_speech_analyzer.predict("Vaya guarra barata y de poca monta es XXXX!")
# returns AnalyzerOutput(output=['hateful', 'targeted', 'aggressive'], probas={hateful: 0.987, targeted: 0.978, aggressive: 0.969})

Also, you might use pretrained models directly with transformers library.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("pysentimiento/robertuito-sentiment-analysis")

model = AutoModelForSequenceClassification.from_pretrained("pysentimiento/robertuito-sentiment-analysis")

Preprocessing

pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.

from pysentimiento.preprocessing import preprocess_tweet

# Replaces user handles and URLs by special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "@usuario debería cambiar esto url"

# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"

# Normalizes laughters
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"

# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"

# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# 'emoji party popper emoji emoji party popper emoji'

Trained models so far

Check CLASSIFIERS.md for details on the reported performances of each model.

Instructions for developers

  1. Clone and install
git clone https://github.com/pysentimiento/pysentimiento
pip install poetry
poetry shell
poetry install
  1. Get the data and put it under data/

Open an issue or email us if you are not able to get the it.

  1. Run script to train models

Check TRAIN.md for further information on how to train your models

  1. Upload models to Huggingface's Model Hub

Check "Model sharing and upload" instructions in huggingface docs.

License

pysentimiento is an open-source library. However, please be aware that models are trained with third-party datasets and are subject to their respective licenses, many of which are for non-commercial use

  1. TASS Dataset license (License for Sentiment Analysis in Spanish, Emotion Analysis in Spanish & English)

  2. SEMEval 2017 Dataset license (Sentiment Analysis in English)

  3. LinCE Datasets (License for NER & POS tagging)

Suggestions and bugfixes

Please use the repository issue tracker to point out bugs and make suggestions (new models, use another datasets, some other languages, etc)

Citation

If you use pysentimiento in your work, please cite this paper

@misc{perez2021pysentimiento,
      title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
      author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
      year={2021},
      eprint={2106.09462},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Also, pleace cite related pre-trained models and datasets for the specific models you use:

%%%%%%%%%%%%%%%%%%%%%%%%%%
% Pretrained models      %
%%%%%%%%%%%%%%%%%%%%%%%%%%
% RoBERTuito
@inproceedings{perez2022robertuito,
  title={RoBERTuito: a pre-trained language model for social media text in Spanish},
  author={P{\'e}rez, Juan Manuel and Furman, Dami{\'a}n Ariel and Alemany, Laura Alonso and Luque, Franco M},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={7235--7243},
  year={2022}
}
% BETO
@article{canete2020spanish,
  title={Spanish pre-trained bert model and evaluation data},
  author={Canete, Jos{\'e} and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and P{\'e}rez, Jorge},
  journal={Pml4dc at iclr},
  volume={2020},
  pages={2020},
  year={2020}
}
% BERTweet
@inproceedings{nguyen2020bertweet,
  title={BERTweet: A pre-trained language model for English Tweets},
  author={Nguyen, Dat Quoc and Vu, Thanh and Nguyen, Anh Tuan},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  pages={9--14},
  year={2020}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%
% Datasets               %
%%%%%%%%%%%%%%%%%%%%%%%%%%
% TASS 2020 (sentiment in Spanish)

@article{garcia2020overview,
  title={Overview of TASS 2020: introducing emotion detection},
  author={Garc{\'\i}a-Vegaa, Manuel and D{\'\i}az-Galianoa, Manuel Carlos and Garc{\'\i}a-Cumbrerasa, Miguel {\'A} and del Arcoa, Flor Miriam Plaza and Montejo-R{\'a}eza, Arturo and Jim{\'e}nez-Zafraa, Salud Mar{\'\i}a and C{\'a}marab, Eugenio Mart{\'\i}nez and Aguilarc, C{\'e}sar Antonio and Antonio, Marco and Cabezudod, Sobrevilla and others},
  year={2020}
}

% EmoEvent (Emotion Analysis Spanish & English)

@inproceedings{del2020emoevent,
  title={EmoEvent: A multilingual emotion corpus based on different events},
  author={del Arco, Flor Miriam Plaza and Strapparava, Carlo and Lopez, L Alfonso Urena and Mart{\'\i}n-Valdivia, M Teresa},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
  pages={1492--1498},
  year={2020}
}

% Hate Speech Detection (Spanish & English)


@inproceedings{hateval2019semeval,
  title={SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter},
  author={Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and Rangel, Francisco and Rosso, Paolo and Sanguinetti, Manuela},
  booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019)},
  year={2019},
  publisher= {Association for Computational Linguistics}
}
% Sentiment Analysis in English

@article{nakov2019semeval,
  title={SemEval-2016 task 4: Sentiment analysis in Twitter},
  author={Nakov, Preslav and Ritter, Alan and Rosenthal, Sara and Sebastiani, Fabrizio and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1912.01973},
  year={2019}
}

% LinCE (NER & POS Tagging)

@inproceedings{aguilar2020lince,
  title={LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation},
  author={Aguilar, Gustavo and Kar, Sudipta and Solorio, Thamar},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
  pages={1803--1813},
  year={2020}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pysentimiento-0.6.1rc2.tar.gz (30.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pysentimiento-0.6.1rc2-py3-none-any.whl (36.4 kB view details)

Uploaded Python 3

File details

Details for the file pysentimiento-0.6.1rc2.tar.gz.

File metadata

  • Download URL: pysentimiento-0.6.1rc2.tar.gz
  • Upload date:
  • Size: 30.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.28.2 requests-toolbelt/0.10.1 urllib3/1.26.14 tqdm/4.64.1 importlib-metadata/6.0.0 keyring/23.13.1 rfc3986/2.0.0 colorama/0.4.6 CPython/3.8.16

File hashes

Hashes for pysentimiento-0.6.1rc2.tar.gz
Algorithm Hash digest
SHA256 7e2858e6024b035a57a7813b2ccb60a5fd072bbc275d7499dfe9c0f22cf7f78b
MD5 72acf6a7713907ac51be4ab2c511e3ef
BLAKE2b-256 d894e1f1e9f8c171b201e263be19b8ef3cf4fd00757af1e4f7ab584dc08f2410

See more details on using hashes here.

File details

Details for the file pysentimiento-0.6.1rc2-py3-none-any.whl.

File metadata

  • Download URL: pysentimiento-0.6.1rc2-py3-none-any.whl
  • Upload date:
  • Size: 36.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.28.2 requests-toolbelt/0.10.1 urllib3/1.26.14 tqdm/4.64.1 importlib-metadata/6.0.0 keyring/23.13.1 rfc3986/2.0.0 colorama/0.4.6 CPython/3.8.16

File hashes

Hashes for pysentimiento-0.6.1rc2-py3-none-any.whl
Algorithm Hash digest
SHA256 1d8332fa68be242cebd1f076e83f172bfdb09e18d262f766dbfbae511ac31a85
MD5 96246a4bedcfab13124cd45e2bbd179a
BLAKE2b-256 a1c7b546160844d9f6c44cd92284c5d492c3fca17e819239fe904651cccc763f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page