Python package for text mining of time-series data

Project description

Arabica

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include social media conversations, product reviews, research metadata, central bank communication, and newspaper headlines. Arabica makes exploratory analysis of these datasets simple by providing:

Descriptive n-gram analysis: n-gram frequencies
Time-series n-gram analysis: n-gram frequencies over a period
Text visualization: n-gram heatmap, line plot, word cloud
Sentiment analysis: VADER sentiment classifier
Financial sentiment analysis: with FinVADER
Structural breaks identification: Jenks Optimization Method
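
As a rough illustration of what descriptive n-gram analysis means, the sketch below counts n-grams with the standard library only. It is not Arabica's implementation; the function name `ngram_counts` and the sample headlines are made up for this example.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count n-grams (as space-joined strings) across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample data
headlines = [
    "central bank raises rates",
    "central bank holds rates",
    "markets react as central bank raises rates",
]
bigrams = ngram_counts(headlines, 2)
print(bigrams.most_common(3))
```

Whitespace tokenization keeps the sketch short; a real pipeline would usually clean the text first.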

It automatically strips punctuation from the input text. It can also apply all, or any combination, of the following cleaning operations:

Remove digits from the text
Remove the standard list(s) of stopwords
Remove an additional list of stop words
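
The cleaning steps listed above can be sketched in plain Python as follows. This is only an approximation under simple assumptions (whitespace tokenization, ASCII punctuation); Arabica itself relies on cleantext and NLTK, and the helper name `clean_text` and its parameters are hypothetical.

```python
import string


def clean_text(text, stopwords=(), remove_digits=False, lower_case=False):
    """Strip punctuation, then optionally digits and stop words."""
    if lower_case:
        text = text.lower()
    # Punctuation is always removed, mirroring the default behaviour described above.
    text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_digits:
        text = text.translate(str.maketrans("", "", string.digits))
    blocked = {word.lower() for word in stopwords}
    tokens = [tok for tok in text.split() if tok.lower() not in blocked]
    return " ".join(tokens)


print(clean_text("The 2 banks raised rates, again!",
                 stopwords=["the", "again"],
                 remove_digits=True,
                 lower_case=True))
# prints: banks raised rates
```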

Arabica works with texts of languages based on the Latin alphabet, uses cleantext for punctuation cleaning, and enables stop words removal for languages in the NLTK corpus of stopwords.

It reads date and datetime values in two styles:

US-style: MM/DD/YYYY (2013-12-31, Feb-09-2009, 2013-12-31 11:46:17, etc.)
European-style: DD/MM/YYYY (2013-31-12, 09-Feb-2009, 2013-31-12 11:46:17, etc.)
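
To make the difference between the two styles concrete, here is a minimal sketch that tries a few `strptime` patterns per style. The pattern lists and the function name `parse_date` are assumptions for illustration and do not describe how Arabica parses dates internally; the style codes 'us' and 'eur' are the ones used by the `date_format` parameter.

```python
from datetime import datetime

# Illustrative pattern lists; month comes before day in the US style.
US_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%b-%d-%Y", "%m/%d/%Y"]
EUR_FORMATS = ["%Y-%d-%m %H:%M:%S", "%Y-%d-%m", "%d-%b-%Y", "%d/%m/%Y"]


def parse_date(value, date_format):
    """Parse a date string as 'us' or 'eur' style, trying each pattern in turn."""
    if date_format not in ("us", "eur"):
        raise ValueError("date_format must be 'us' or 'eur'")
    formats = US_FORMATS if date_format == "us" else EUR_FORMATS
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # pattern did not match, try the next one
    raise ValueError(f"could not parse {value!r} as {date_format} style")


print(parse_date("2013-12-31", "us"))
print(parse_date("2013-31-12 11:46:17", "eur"))
```

Passing the wrong style for an unambiguous value such as 2013-12-31 fails in this sketch, but an ambiguous value such as 2013-05-06 would silently parse either way, which is why the style has to be stated explicitly.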

Installation

Arabica requires Python 3.8 - 3.10 and the following packages: NLTK - stop words removal, cleantext - text cleaning, wordcloud - word cloud visualization, plotnine - heatmaps and line graphs, matplotlib - word clouds and graphical operations, vaderSentiment - sentiment analysis, finvader - financial sentiment analysis, and jenkspy - breakpoint identification.

To install using pip, use:

pip install arabica

Usage

Import the library:

from arabica import arabica_freq
from arabica import cappuccino
from arabica import coffee_break

Choose a method:

arabica_freq enables a specific set of cleaning operations (lower casing, numbers, common stop words, and additional stop words removal) and returns a dataframe with unigram, bigram, and trigram frequencies aggregated over a period.

def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 date_format: str,         # Date format: 'eur' - European, 'us' - American
                 time_freq: str,           # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 max_words: int,           # Maximum of most frequent n-grams displayed for each period
                 stopwords: [],            # Languages for stop words
                 stopwords_ext: [],        # Languages for extended stop words list
                 skip: [],                 # Remove additional stop words
                 numbers: bool = False,    # Remove numbers
                 lower_case: bool = False  # Lowercase text
)
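
The idea behind the aggregation can be sketched without Arabica: group texts by period, count n-grams per period, and keep the most frequent ones. The code below is a standard-library approximation, not the package's code; `top_ngrams_by_period` and the sample records are hypothetical, while the period codes 'Y', 'M', and 'D' follow the `time_freq` parameter above.

```python
from collections import Counter, defaultdict
from datetime import date


def top_ngrams_by_period(records, n, time_freq, max_words):
    """records: iterable of (date, text). Returns {period: [(ngram, count), ...]}."""
    key_formats = {"Y": "%Y", "M": "%Y-%m", "D": "%Y-%m-%d"}
    fmt = key_formats[time_freq]
    per_period = defaultdict(Counter)
    for day, text in records:
        tokens = text.lower().split()
        period = day.strftime(fmt)
        for i in range(len(tokens) - n + 1):
            per_period[period][" ".join(tokens[i:i + n])] += 1
    # Keep only the max_words most frequent n-grams in each period.
    return {p: c.most_common(max_words) for p, c in sorted(per_period.items())}


# Hypothetical sample data
records = [
    (date(2013, 1, 5), "bank raises rates"),
    (date(2013, 6, 1), "bank raises taxes"),
    (date(2014, 2, 1), "bank cuts rates"),
]
for period, ngrams in top_ngrams_by_period(records, 2, "Y", 2).items():
    print(period, ngrams)
```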

cappuccino enables cleaning operations (lower casing, numbers, common stop words, and additional stop words removal) and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization.

def cappuccino(text: str,                # Text
               time: str,                # Time
               date_format: str,         # Date format: 'eur' - European, 'us' - American
               plot: str,                # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int,               # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str,           # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int,           # Maximum of most frequent n-grams displayed for each period
               stopwords: [],            # Languages for stop words
               stopwords_ext: [],        # Languages for extended stop words list
               skip: [],                 # Remove additional stop words               
               numbers: bool = False,    # Remove numbers
               lower_case: bool = False  # Lowercase text
)
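
A heatmap of n-gram frequencies is essentially a grid with one row per n-gram and one column per period. The sketch below shows one way such a grid could be assembled from per-period counts before plotting; it does not use plotnine and is not taken from Arabica. The name `heatmap_matrix` and the sample counts are assumptions.

```python
def heatmap_matrix(period_counts):
    """Pivot {period: {ngram: count}} into (periods, ngrams, rows).

    Each row holds one n-gram's counts across all periods, with 0 where
    the n-gram does not occur in that period.
    """
    periods = sorted(period_counts)
    ngrams = sorted({g for counts in period_counts.values() for g in counts})
    rows = [[period_counts[p].get(g, 0) for p in periods] for g in ngrams]
    return periods, ngrams, rows


# Hypothetical per-period bigram counts
counts = {
    "2013": {"bank raises": 2, "raises rates": 1},
    "2014": {"bank cuts": 1, "raises rates": 3},
}
periods, ngrams, rows = heatmap_matrix(counts)
print("periods:", periods)
for ngram, row in zip(ngrams, rows):
    print(f"{ngram:<14}", row)
```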

coffee_break provides sentiment analysis and breakpoint identification in aggregated time series of sentiment. The implemented models are:

VADER is a lexicon and rule-based sentiment classifier attuned explicitly to general language expressed in social media
FinVADER improves VADER's classification accuracy on financial texts by incorporating two financial lexicons
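
To show what an aggregated time series of sentiment looks like, the sketch below scores texts with a tiny made-up lexicon and averages the scores per period. This is far simpler than VADER or FinVADER, which also handle negation, intensifiers, and punctuation; the lexicon, function names, and sample records are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy lexicon for illustration only; real classifiers use thousands of entries.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def score(text):
    """Mean lexicon value of the matched words, 0.0 if nothing matches."""
    hits = [LEXICON[tok] for tok in text.lower().split() if tok in LEXICON]
    return mean(hits) if hits else 0.0


def sentiment_by_period(records):
    """records: iterable of (period, text). Returns {period: mean score}."""
    grouped = defaultdict(list)
    for period, text in records:
        grouped[period].append(score(text))
    return {p: mean(scores) for p, scores in sorted(grouped.items())}


# Hypothetical sample data
records = [
    ("2023-01", "great product"),
    ("2023-01", "bad service"),
    ("2023-02", "terrible terrible"),
]
print(sentiment_by_period(records))
```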

Break points in the time series are identified with the Fisher-Jenks algorithm (Jenks, 1977. Optimal data classification for choropleth maps).

def coffee_break(text: str,                 # Text
                 time: str,                 # Time
                 date_format: str,          # Date format: 'eur' - European, 'us' - American
                 model: str,                # Sentiment classifier, 'vader' - general language, 'finvader' - financial text                
                 skip: [],                  # Remove additional stop words
                 preprocess: bool = False,  # Clean data from numbers and punctuation
                 time_freq: str,            # Aggregation period: 'Y'/'M'
                 n_breaks: int              # Number of breakpoints: min. 2
)
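
The Fisher-Jenks idea is to split values into classes so that the squared deviation within each class is as small as possible. The sketch below is a small dynamic-programming version written for illustration; it is not the jenkspy implementation that Arabica depends on, and the function name `jenks_classes` and the sample values are assumptions.

```python
def jenks_classes(values, n_classes):
    """Split values into n_classes groups minimizing within-group squared deviation."""
    xs = sorted(values)
    m = len(xs)
    if not 1 <= n_classes <= m:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        """Sum of squared deviations from the mean of xs[i..j], inclusive."""
        n = j - i + 1
        total = s1[j + 1] - s1[i]
        return (s2[j + 1] - s2[i]) - total * total / n

    inf = float("inf")
    best = [[inf] * m for _ in range(n_classes + 1)]   # best[c][j]: xs[0..j] in c classes
    start = [[0] * m for _ in range(n_classes + 1)]    # first index of the last class
    for j in range(m):
        best[1][j] = cost(0, j)
    for c in range(2, n_classes + 1):
        for j in range(c - 1, m):
            for i in range(c - 1, j + 1):
                candidate = best[c - 1][i - 1] + cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    start[c][j] = i

    # Walk back from the last class to recover the groups.
    classes = []
    j = m - 1
    for c in range(n_classes, 0, -1):
        i = start[c][j] if c > 1 else 0
        classes.append(xs[i:j + 1])
        j = i - 1
    return classes[::-1]


# Hypothetical sentiment-like values
print(jenks_classes([1, 2, 3, 10, 11, 12, 50, 51], 3))
```

This version runs in O(k * n^2) time, which is fine for a few hundred aggregated periods; a dedicated library is the better choice for larger inputs.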

Documentation, examples and tutorials

Read the documentation

For more coding examples, read these tutorials:

General use:

Sentiment Analysis and Structural Breaks in Time-Series Text Data
Visualization Module in Arabica Speeds Up Text Data Exploration
Text as Time Series: Arabica 1.0 Brings New Features for Exploratory Text Data Analysis

Applications:

Business Intelligence: Customer Satisfaction Measurement with N-gram and Sentiment Analysis
Research meta-data analysis: Research Article Meta-data Description Made Quick and Easy
Media coverage text mining
Social media analysis

💬 Please visit here for any questions, issues, bugs, and suggestions.

Citation

Using arabica in a paper or thesis? Please cite this paper:

@article{Koráb:2024,
  author   = {{Koráb}, P., and {Poměnková}, J.},
  title    = {Arabica: A Python package for exploratory analysis of text data},
  journal  = {Journal of Open Source Software},
  volume   = {9},
  number   = {97},
  pages    = {6186},
  year     = {2024},
  doi      = {10.21105/joss.06186},
}

Project details

Release history

1.8.2: Nov 23, 2024 (this version)
1.8.1: Jul 27, 2024
1.8.0: Jul 27, 2024
1.7.9: Jul 26, 2024
1.7.8: Jul 26, 2024
1.7.7: Dec 15, 2023
1.7.6: Oct 23, 2023
1.7.4: Oct 3, 2023
1.7.2: Sep 10, 2023
1.7.1: Aug 20, 2023
1.7.0: Aug 16, 2023
1.6.9: Aug 2, 2023
1.6.8: Jul 5, 2023
1.6.7: Jun 29, 2023
1.6.6: Jun 29, 2023
1.6.5: Jun 28, 2023
1.6.4: Jun 24, 2023
1.6.3: Jun 22, 2023
1.6.2: Jun 17, 2023
1.6.1: Jun 17, 2023
1.6.0: Jun 15, 2023
1.5.2: May 20, 2023
1.5.0: May 18, 2023
1.4.9: Apr 29, 2023
1.4.8: Apr 29, 2023
1.4.7: Apr 21, 2023
1.4.6: Apr 17, 2023
1.4.5: Apr 17, 2023
1.4.4: Apr 17, 2023
1.4.3: Apr 16, 2023
1.4.2: Apr 16, 2023
1.4.1: Mar 21, 2023
1.4.0: Mar 20, 2023
1.3.9: Mar 19, 2023
1.3.8: Mar 14, 2023
1.3.6: Mar 10, 2023
1.3.5: Mar 4, 2023
1.2.2: Feb 17, 2023
1.2.1: Jan 20, 2023
1.2.0: Jan 20, 2023
1.1.9: Jan 3, 2023
1.1.8: Jan 2, 2023
1.1.7: Dec 26, 2022
1.1.6: Dec 24, 2022
1.1.5: Dec 22, 2022
1.1.4: Dec 20, 2022
1.1.3: Dec 19, 2022
1.1.2: Dec 19, 2022
1.1.1: Dec 16, 2022
1.0.5: Dec 10, 2022
1.0.4: Nov 28, 2022
1.0.3: Nov 12, 2022
1.0.2: Oct 20, 2022
1.0.1: Oct 18, 2022
1.0.0: Oct 17, 2022
0.0.5: Sep 11, 2022
0.0.4: Sep 9, 2022
0.0.3: Sep 8, 2022
0.0.2: Sep 8, 2022
0.0.1: Sep 8, 2022
