
A sentiment analysis classifier for Spanish.


Author : Elliot Hofman

This package performs sentiment analysis on Spanish text.

## THE DATA

The model is trained on data crawled from various websites:
Trip Advisor, PedidosYa, Apestan, QuejasOnline, MercadoLibre, SensaCine, OpenCine, TASS, Twitter
(see the files under /crawlers if interested).

## THE MODEL

The model is a pipeline that includes:
- A vectorizer: converts the text/string representation of the comment into a vectorized representation.
This is done with a TfidfVectorizer.
- A feature selector: the vectorizer outputs a very sparse n_samples * n_features matrix (scikit-learn algorithms already work with SciPy sparse matrices). The selector reduces n_features by checking whether each feature is relevant or not.
- A classifier: the model used is Multinomial Naive Bayes, which performs well for text
classification.

The parameters and hyper-parameters of this pipeline are found with a grid search using K-fold cross-validation with K = 3.
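
The pipeline and grid search described above could be sketched with scikit-learn roughly as follows. This is only an illustration: the step names, the `SelectPercentile(chi2)` selector, the parameter grid, and the toy comments are all assumptions, since the README does not say which selector or which grid values the package uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_search() -> GridSearchCV:
    """Vectorizer -> feature selector -> Multinomial Naive Bayes, tuned with 3-fold CV."""
    pipeline = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        # Assumption: chi-squared percentile selection stands in for the
        # package's unspecified feature selector.
        ("selector", SelectPercentile(chi2)),
        ("classifier", MultinomialNB()),
    ])
    # Hypothetical grid; the values actually searched are not documented.
    param_grid = {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "selector__percentile": [50, 100],
        "classifier__alpha": [0.1, 1.0],
    }
    return GridSearchCV(pipeline, param_grid, cv=3)


# Toy data made up for this example: 1 = positive, 0 = negative.
comments = [
    "la comida estaba muy rica",
    "excelente servicio y muy buena atencion",
    "me encanto la pelicula",
    "muy recomendable todo perfecto",
    "la comida estaba fria y fea",
    "pesimo servicio muy mala atencion",
    "la pelicula fue aburrida",
    "no lo recomiendo para nada horrible",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = build_search().fit(comments, labels)
best_pipeline = search.best_estimator_  # refitted on all the toy data
```

After fitting, `search.best_params_` holds the selected combination and `search.predict([...])` classifies new comments with the refitted pipeline.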

## THE PREPROCESSING

All the comments are preprocessed before training:
- They are set to lowercase
- Accents are removed and replaced (í ==> i, etc.)
- 're' and '100%' are replaced by 'muy', which most of the time has the same meaning in Spanish
- ' x ' is replaced by ' por ', and ' q ' and ' k ' by 'que'
- Regex is used to replace all possible forms of 'jajajajaja', 'ajajaaaajjaj', 'jjjjaajj', 'ajaja', 'jejejej', etc. by the normalized form 'jaja'
- Regex is used to remove duplicated characters ('Que buenoooooo' -> 'Que bueno', etc.), paying attention to the special case of 'l' (it is normal in Spanish to have words with a doubled 'll')
- Regex is used to clean spaces ('No me gusto la comida.Vos que opinas?Sisi estuvo mala estoy de acuerdo' -> 'No me gusto la comida. Vos que opinas? Sisi estuvo mala estoy de acuerdo')
- Regex is used to clean 'k' (askeroso -> asqueroso)
- Regex is used to clean numbers (all of them are removed, except the '100%' that was already replaced earlier)
- A dictionary of Spanish expressions is used to join multi-word expressions in the comments (por supuesto -> por_supuesto, poco a poco -> poco_a_poco, etc.)
- A dictionary crawled from the web is used to set a list of chosen verbs to their infinitive form
('me cayo mal la comida' -> 'me caer mal la comida', etc.)
- A function adds the prefix 'not_' to the words that follow a negation term, up to the third following word or the next stopNeg term, whichever comes first. (Las papas no son ricas -> Las papas no not_son not_ricas)
The model will then learn that the artificial word 'not_ricas' is associated with a bad sentiment.
- A list of custom neutral words is used to remove useless words from the comment before the prediction is made.
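
Several of the steps above can be illustrated with a short sketch. It is not the package's code: the function names, the exact regular expressions, the `NEGATIONS` and `STOP_NEG` word lists, and the choice to collapse only runs of three or more characters are assumptions made for the example. The expression, infinitive, and neutral-word dictionaries are left out.

```python
import re

# Assumption: only these accented vowels are mapped; 'ñ' is left untouched.
ACCENTS = str.maketrans("áéíóúü", "aeiouu")

# Hypothetical word lists; the package's own lists are not documented.
NEGATIONS = {"no", "nunca", "ni", "sin"}
STOP_NEG = {"pero", "y", "aunque"}


def normalize(comment: str) -> str:
    """Apply the lowercase, accent, abbreviation and regex clean-up steps."""
    text = comment.lower().translate(ACCENTS)
    text = re.sub(r"\b100\s*%", "muy", text)         # '100%' -> 'muy'
    text = re.sub(r"\bre\b", "muy", text)            # 're' -> 'muy'
    text = re.sub(r"\bx\b", "por", text)             # ' x ' -> ' por '
    text = re.sub(r"\b[qk]\b", "que", text)          # ' q ', ' k ' -> 'que'
    # Laughter: a word made only of j/a/e that contains at least two 'j'.
    text = re.sub(r"\b(?=[aej]*j[aej]*j)[aej]+\b", "jaja", text)
    text = re.sub(r"l{3,}", "ll", text)              # keep the legitimate 'll'
    text = re.sub(r"([^\W\dl])\1{2,}", r"\1", text)  # 'buenoooo' -> 'bueno'
    text = re.sub(r"k(?=[ei])", "qu", text)          # 'askeroso' -> 'asqueroso'
    text = re.sub(r"\d+", "", text)                  # drop remaining numbers
    text = re.sub(r"([.,;:!?])(?=\w)", r"\1 ", text) # space after punctuation
    return re.sub(r"\s+", " ", text).strip()


def mark_negation(text: str, window: int = 3) -> str:
    """Prefix 'not_' to at most `window` words after a negation term."""
    marked, remaining = [], 0
    for token in text.split():
        if token in NEGATIONS:
            remaining = window
            marked.append(token)
        elif token in STOP_NEG:
            remaining = 0            # a stop term closes the negation scope
            marked.append(token)
        elif remaining > 0:
            marked.append("not_" + token)
            # Assumption: trailing punctuation also closes the scope.
            remaining = 0 if token[-1] in ".,;:!?" else remaining - 1
        else:
            marked.append(token)
    return " ".join(marked)


print(mark_negation(normalize("Las papas no son ricas")))
# las papas no not_son not_ricas
```

The order of the substitutions matters in this sketch: laughter is normalized before repeated characters are collapsed, and '100%' is replaced before the remaining numbers are removed.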

## THE PREDICTION

The prediction is calculated with a few rules:
- If 'pero' is found in the comment to classify,
preScore = prediction(sentence before the 'pero')
postScore = prediction(sentence after the 'pero')

and a barycenter of those two quantities is calculated, ((decayRate-t)*preScore + t*postScore)/decayRate,
so that the score stays on the same side of 0.5 as postScore.
This is because a comment might say something fairly good, then continue with 'pero ...' and say something fairly bad. In this case, the global sentiment of the comment is usually carried by the second part of the sentence.
- If 'muy' is found in the comment to classify:
importantScore = prediction(the word just after 'muy', if that word is an adjective)
score = globalScore of the sentence

and the same barycenter method is used, so that the final prediction agrees with the prediction for the important word placed after 'muy'.
This is used because the comment might say a lot of things ('bla balab blalal') and the classifier could get confused (by some irony, by too many unknown words, ...), but if the comment started with 'Muy recomendable', it will still know that the comment is good.
- The comments are processed the same way the training data was prepared, and the words that are not in the vocabulary are removed, to reduce the noise that they bring to the comment.
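
The barycenter rules above can be sketched as follows. The function names, the `predict` callable (standing in for the trained pipeline and returning a score between 0 and 1), and the default values of `t` and `decay_rate` are hypothetical. The adjective check for the word after 'muy' is omitted, because the README does not say how it is done.

```python
import re
from typing import Callable

Predict = Callable[[str], float]  # stand-in for the trained model's scoring function


def barycenter(first: float, second: float, t: float, decay_rate: float) -> float:
    """((decayRate - t) * first + t * second) / decayRate, as in the rules above."""
    return ((decay_rate - t) * first + t * second) / decay_rate


def score_with_pero(comment: str, predict: Predict,
                    t: float = 3.0, decay_rate: float = 4.0) -> float:
    """Weight the part after 'pero' more heavily than the part before it."""
    parts = re.split(r"\bpero\b", comment, maxsplit=1)
    if len(parts) == 1:
        return predict(comment)
    pre_score, post_score = predict(parts[0]), predict(parts[1])
    return barycenter(pre_score, post_score, t, decay_rate)


def score_with_muy(comment: str, predict: Predict,
                   t: float = 3.0, decay_rate: float = 4.0) -> float:
    """Pull the global score toward the score of the word following 'muy'."""
    global_score = predict(comment)
    match = re.search(r"\bmuy\s+(\w+)", comment)
    if match is None:
        return global_score
    # Assumption: no part-of-speech check; any word after 'muy' is used.
    important_score = predict(match.group(1))
    return barycenter(global_score, important_score, t, decay_rate)


def toy_predict(text: str) -> float:
    """Made-up scorer for the example: high score if a positive word appears."""
    return 0.9 if ("buena" in text or "recomendable" in text) else 0.2


pero_score = score_with_pero("comida buena pero servicio lento", toy_predict)
muy_score = score_with_muy("muy recomendable bla bla", toy_predict)
```

With these toy values `pero_score` falls below 0.5, following the second part of the sentence. The larger `t` is relative to `decay_rate`, the more the second score dominates; the defaults here do not guarantee the same-side property when the second score is close to 0.5.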

## MORE DOCUMENTATION ON USAGE COMING SOON ...

Source distribution: sondeos-sentiment-1.0.0.tar.gz (15.7 MB)
