A Transformer-based library for Sentiment Analysis in Spanish
pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks
A Transformer-based library for SocialNLP classification tasks.
Currently supports:
- Sentiment Analysis (Spanish, English)
- Emotion Analysis (Spanish, English)
Just do
pip install pysentimiento
and start using it:
from pysentimiento import SentimentAnalyzer, EmotionAnalyzer
analyzer = SentimentAnalyzer(lang="es")
analyzer.predict("Qué gran jugador es Messi")
# returns SentimentOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns SentimentOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns SentimentOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})
analyzer.predict("jejeje no te creo mucho")
# returns SentimentOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})
# Emotion Analysis in English
emotion_analyzer = EmotionAnalyzer(lang="en")
emotion_analyzer.predict("yayyy")
# returns EmotionOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns EmotionOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})
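The outputs above carry a predicted label and one probability per class. When the two highest probabilities are close, as in the "jejeje no te creo mucho" example (NEG at 0.587 against NEU at 0.408), it may be worth treating the prediction as uncertain. The following is a minimal sketch that works on a plain dict shaped like the probas shown above; the helper name `top_label` and the `min_margin` threshold are hypothetical and not part of pysentimiento.

```python
def top_label(probas, min_margin=0.25):
    """Return (label, is_confident) for a dict mapping labels to probabilities.

    Hypothetical helper: a prediction counts as confident when the best class
    beats the runner-up by at least min_margin. Expects at least two classes.
    """
    ranked = sorted(probas.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_label, (best_p - second_p) >= min_margin


# Probabilities copied from the examples above
print(top_label({"POS": 0.998, "NEG": 0.002, "NEU": 0.000}))  # ('POS', True)
print(top_label({"NEG": 0.587, "NEU": 0.408, "POS": 0.005}))  # ('NEG', False)
```

With the library itself, the dict would presumably come from the probas field of the object returned by predict, assuming the attribute name matches the printed representation.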
You can also use the pretrained models directly with the transformers library.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")
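A model loaded this way returns raw logits rather than probabilities. In transformers, the logits are available as `model(**tokenizer(text, return_tensors="pt")).logits` and the label names as `model.config.id2label`. The sketch below shows the remaining step, a softmax over the logits, in plain Python so it runs without downloading the model. The logit values and the label mapping are made up for illustration and are not real output of this model.

```python
import math


def logits_to_probas(logits, id2label):
    """Apply a softmax to raw logits and key the result by label name."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}


# Hypothetical values: in practice, read the mapping from model.config.id2label
# and the logits from the model output for one input text.
id2label = {0: "NEG", 1: "NEU", 2: "POS"}
probas = logits_to_probas([-1.0, 0.5, 3.0], id2label)
print(max(probas, key=probas.get))  # POS
```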
Preprocessing
pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.
from pysentimiento.preprocessing import preprocess_tweet
# Replaces user handles and URLs by special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "[USER] debería cambiar esto [URL]"
# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"
# Normalizes laughter
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"
# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"
# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# '[EMOJI] party popper [EMOJI][EMOJI] party popper [EMOJI]'
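To make the steps above concrete, here is a simplified regex-based sketch of three of them: replacing handles and URLs, splitting camel-case hashtags, and shortening repeated characters. It is not the library's implementation. The function names, the patterns, and the default of `shorten=3` are assumptions chosen for illustration, and laughter and emoji handling are left out.

```python
import re


def _split_hashtag(match):
    # "#UnaGenialidad" -> "una genialidad" (ASCII camel case only)
    words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
    return words.lower()


def simple_preprocess(text, shorten=3):
    """Illustrative approximation of a tweet preprocessor."""
    text = re.sub(r"@\w+", "[USER]", text)          # user handles
    text = re.sub(r"https?://\S+", "[URL]", text)   # URLs
    text = re.sub(r"#(\w+)", _split_hashtag, text)  # hashtags
    # Collapse runs longer than `shorten` down to `shorten` characters
    text = re.sub(r"(.)\1{%d,}" % shorten,
                  lambda m: m.group(1) * shorten, text)
    return text


print(simple_preprocess("@perezjotaeme debería cambiar esto http://bit.ly/sarasa"))
# [USER] debería cambiar esto [URL]
print(simple_preprocess("no entiendo naaaaaaaadaaaaaaaa", shorten=2))
# no entiendo naadaa
```

For real use, preprocess_tweet is the better choice, since the models were presumably trained on text normalized by it.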
Trained models so far
Check CLASSIFIERS.md for details on the reported performances of each model.
Spanish models
English models
Instructions for developers
- First, download the TASS 2020 data to data/tass2020 (you have to register to download the dataset). Labels must be placed under data/tass2020/test1.1/labels
- Run the script to train the models. Check TRAIN_EVALUATE.md
- Upload the models to Huggingface's Model Hub. Check the "Model sharing and upload" instructions in the huggingface docs.
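Before running the training script, it can help to confirm that the data layout from the first step is in place. A minimal sketch, assuming both paths are directories relative to the repository root; the function name `missing_data_dirs` is hypothetical and not part of the project.

```python
from pathlib import Path

# Paths taken from the instructions above; treating "labels" as a directory
# is an assumption.
REQUIRED_DIRS = ["data/tass2020", "data/tass2020/test1.1/labels"]


def missing_data_dirs(root="."):
    """Return the required directories that do not exist under root."""
    return [d for d in REQUIRED_DIRS if not (Path(root) / d).is_dir()]


missing = missing_data_dirs()
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("TASS 2020 data layout looks complete")
```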
Citation
If you use pysentimiento in your work, please cite this paper
@misc{perez2021pysentimiento,
title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
year={2021},
eprint={2106.09462},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
TODO:
- Upload some other models
- Train in other languages
Suggestions and bugfixes
Please use the repository issue tracker to report bugs and make suggestions (new models, other datasets, other languages, etc.).