Baselines for Multilingual Sentiment Analysis
A Baseline for Multilingual Sentiment Analysis (B4MSA)
B4MSA is a Python sentiment-analysis classifier for Twitter-like short texts. It can be used to create a first approximation to a sentiment classifier for any given language. It is almost language independent, but it can also take advantage of the particularities of a specific language.
It is written in Python and uses NLTK, scikit-learn, and gensim to create simple but effective sentiment classifiers.
Performance
To assess the performance of B4MSA, we compared it against several commercial sentiment analyzers: Google, HP Haven, and Sentiment140. The comparison was performed on July 28, 2016.
The performance measures used in the comparison are taken from three different sentiment analysis competitions: TASS, SemEval, and SENTIPOLC. These measures are accuracy, F1-positive-negative, and F0F1-positive-negative.
F1-positive-negative is defined as the average of the F1 obtained for the positive class and the F1 obtained for the negative class.
F0F1-positive-negative is defined as the average of F0-positive-negative and F1-positive-negative, where F0-positive-negative is the average of the F0 of the positive class and the F0 of the negative class, and F1-positive-negative is defined as above. For more information on this measure, we refer the reader to the overview of the SENTIPOLC challenge.
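To make the first measure concrete, the following is a minimal sketch (not part of B4MSA) that computes F1-positive-negative with scikit-learn; the class names and toy predictions are placeholders:

from sklearn.metrics import f1_score

# Toy labels; 'positive', 'negative', and 'neutral' are placeholder class names.
y_true = ['positive', 'negative', 'neutral', 'positive', 'negative']
y_pred = ['positive', 'negative', 'negative', 'neutral', 'negative']

# F1 computed separately for the positive class and for the negative class.
f1_pos = f1_score(y_true, y_pred, labels=['positive'], average='macro')
f1_neg = f1_score(y_true, y_pred, labels=['negative'], average='macro')

# F1-positive-negative is their average.
print((f1_pos + f1_neg) / 2)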
SENTIPOLC 2014 (Italian)
| Algorithm | Accuracy | F1-positive-negative | F0F1-positive-negative |
|---|---|---|---|
|  | 0.4693 | 0.3921 | 0.4788 |
| B4MSA | 0.6072 | 0.6041 | 0.6354 |
TASS 2015-small (Spanish)
| Algorithm | Accuracy | F1-positive-negative | F0F1-positive-negative |
|---|---|---|---|
|  | 0.5350 | 0.4564 | 0.5775 |
|  | 0.0590 | 0.1001 | 0.3225 |
| B4MSA | 0.6330 | 0.6949 | 0.6529 |
SemEval 2015 (English)
| Algorithm | Accuracy | F1-positive-negative | F0F1-positive-negative |
|---|---|---|---|
|  | 0.6241 | 0.5724 | 0.6120 |
|  | 0.4218 | 0.4967 | 0.4624 |
|  | 0.5553 | 0.4790 | 0.5368 |
| B4MSA | 0.6281 | 0.5281 | 0.6578 |
SemEval 2016 (English)
| Algorithm | Accuracy | F1-positive-negative | F0F1-positive-negative |
|---|---|---|---|
|  | 0.5986 | 0.5173 | 0.5337 |
|  | 0.3952 | 0.4728 | 0.4303 |
|  | 0.5562 | 0.4135 | 0.4560 |
| B4MSA | 0.5106 | 0.4082 | 0.4866 |
Note regarding Google:
Google's sentiment analysis service does not return classes; instead, it returns a continuous value between -1 and 1. However, the API documentation mentions the following rule, which we used to map the returned polarity to classes:
def get_class(pol):
    if pol < -0.75:
        return 'negative'
    elif pol > 0.25:
        return 'positive'
    return 'neutral'
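For example, under this rule get_class(0.6) returns 'positive', get_class(-0.9) returns 'negative', and get_class(0.0) returns 'neutral'.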
Installing B4MSA
B4MSA can be installed using pip:
pip install b4msa
or by cloning the b4msa repository from GitHub, e.g.,
git clone https://github.com/INGEOTEC/b4msa.git
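When installing from the clone, a standard pip source install usually works; this assumes the repository ships the usual setup script, as pip-installable projects do:

cd b4msa
pip install .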
Predict a training set using B4MSA
Suppose you have a workload of classified tweets, tweets.json.gz, with which to model your problem. Assuming b4msa is already installed, the stratified k-fold can be computed as follows:
b4msa-params -k5 -s24 -n24 tweets.json.gz -o tweets.json
The parameters mean the following:
-k5 five folds
-s24 b4msa optimizes the model's parameters for you; -s24 specifies that the parameter space should be sampled at 24 points, keeping the best among them
-n24 specifies the number of workers to launch; it is a good idea to set -s to a multiple of -n
-o tweets.json specifies the file in which to store the configurations found by the parameter-selection process, in best-first order; several metrics are reported, and the list is sorted in descending order by _score
The tweets.json file looks like this (for a four-class problem):
[ { "_accuracy": 0.7773561997268175, "_macro_f1": 0.5703751933361809, "_score": 0.5703751933361809, "_time": 36.73965764045715, "_weighted_f1": 0.7467834129359526, "del_dup1": false, "lc": true, "num_option": "group", "strip_diac": true, "token_list": [ 1, 2, 3, 6 ], "url_option": "none", "usr_option": "group" }, ...
Each entry specifies a configuration; please check the code (a manual is coming soon) to learn about each parameter. Since the first configurations show how the best setups are composed, it is possible to learn something about your dataset by analyzing them.
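Because the output is plain JSON sorted by _score, the top configuration can be inspected with a few lines of standard Python; this is a minimal sketch, not part of b4msa itself:

import json

# Load the configurations produced by b4msa-params (best-first order).
with open('tweets.json') as fh:
    configs = json.load(fh)

best = configs[0]
print('best score:', best['_score'])
print('token_list:', best['token_list'])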
There are other useful flags, such as:
-H makes b4msa perform a final hill-climbing search for the parameter selection; in many cases, this produces much better configurations (never worse, guaranteed)
--lang spanish|english|german|italian specifies the language of the dataset, allowing b4msa to apply language-dependent techniques in the parameter-selection procedure; currently, only spanish is supported.
b4msa-params -H -k5 -s48 -n24 tweets.json.gz -o tweets-spanish.json --lang spanish
The tweets-spanish.json file looks as follows:
[ { "_accuracy": 0.7750796782516315, "_macro_f1": 0.5736270120411987, "_score": 0.5736270120411987, "_time": 36.68731508255005, "_weighted_f1": 0.7472079134492694, "del_dup1": true, "lc": true, "negation": false, "num_option": "group", "stemming": true, "stopwords": "delete", "strip_diac": true, "token_list": [ 1, 2, 3, 5 ], "url_option": "delete", "usr_option": "none" }, ...
Here we can see that the negation, stemming, and stopwords parameters were also considered.
Using the models to create a sentiment classifier
Testing a sentiment classifier against a workload
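Neither step is spelled out in this description. Purely as an illustration (this is not B4MSA's own API: the vectorizer and classifier below are generic scikit-learn stand-ins, and the file layout with 'text' and 'klass' fields per line is an assumption), a baseline classifier could be built from the labeled tweets and tested against a second workload like this:

import gzip
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

def read_corpus(filename):
    # One JSON object per line, with 'text' and 'klass' fields; this layout is
    # an assumption for the sketch, not something mandated by b4msa.
    texts, labels = [], []
    with gzip.open(filename, 'rt') as fh:
        for line in fh:
            tweet = json.loads(line)
            texts.append(tweet['text'])
            labels.append(tweet['klass'])
    return texts, labels

train_texts, train_labels = read_corpus('tweets.json.gz')
test_texts, test_labels = read_corpus('test-tweets.json.gz')

# Generic character n-gram features plus a linear SVM as a stand-in classifier.
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3))
clf = LinearSVC()
clf.fit(vec.fit_transform(train_texts), train_labels)

pred = clf.predict(vec.transform(test_texts))
print('macro F1:', f1_score(test_labels, pred, average='macro'))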
Minimum requirements
In the modeling stage, the minimum requirements depend on the knowledge database being processed; make sure you have enough memory for it. Take into account that b4msa can take advantage of multicore architectures using Python's multiprocessing module, which means that the memory requirements are multiplied by the number of processes you run.
It is recommended to use as many cores as are available in order to obtain good results in short running times; for example:
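On a Linux machine, the number of available cores can be queried with nproc, so the sampling size and the number of workers can be tied to the hardware. The invocation below is only illustrative; it reuses the flags described above and keeps -s a multiple of -n:

b4msa-params -k5 -s$((2 * $(nproc))) -n$(nproc) tweets.json.gz -o tweets.json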
In the training and testing stages only one core is used and there are no extra memory needs; no multicore support is provided for these stages.
Installing dependencies
Let us download Python (from the Conda distribution), install it, and include Python in the PATH.
wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
chmod 755 miniconda.sh
./miniconda.sh -b
export PATH=/home/$USER/miniconda3/bin:$PATH
B4MSA needs the following dependencies.
pip install coverage
pip install numpy
pip install scipy
pip install scikit-learn
pip install gensim
pip install nose
pip install nltk
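All of these can also be installed with a single pip command. Depending on the language-dependent options chosen (e.g., stopwords), the corresponding NLTK corpora may need to be downloaded as well; the stopwords download shown here is an assumption about what a given configuration needs, not a strict requirement:

pip install coverage numpy scipy scikit-learn gensim nose nltk
python -c "import nltk; nltk.download('stopwords')"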
For the eager, it is recommended to install the tqdm package to see progress bars:
pip install tqdm
However, it is better to prepare a coffee and a sandwich :)