Python package for exploratory text data analysis

Arabica

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Arabica provides functions to make the exploratory analysis of such datasets simple.

Arabica provides these methods:

  • arabica_freq: calculates unigram, bigram, and trigram frequencies over a period (year, month, day)

  • cappuccino: provides plots for descriptive (word cloud) and time-series (heatmap, line plot) text data visualization

It can apply all or a selected combination of the following cleaning operations:

  • Remove digits from the text
  • Remove punctuation from the text
  • Remove a standard list of stop words
  • Remove an additional specific list of words

Arabica works with texts in languages that use the Latin alphabet. It uses clean-text for punctuation cleaning and supports stop word removal for the languages available in the NLTK corpus of stopwords.
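The cleaning operations listed above can be illustrated with a minimal standard-library sketch. This is not Arabica's implementation: the helper clean_text, its stop_words argument, and the tiny STOP_WORDS set are hypothetical stand-ins (Arabica itself relies on clean-text and the NLTK stop word lists). The flags numbers, punct, and skip only mirror the parameter names used by the package.

```python
import string

# Tiny illustrative stop word list (an assumption); Arabica takes its lists from NLTK.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def clean_text(text, numbers=False, punct=False, stop_words=(), skip=()):
    """Apply the selected cleaning operations to one text (hypothetical helper)."""
    if numbers:
        # Drop every digit character.
        text = "".join(ch for ch in text if not ch.isdigit())
    if punct:
        # Drop ASCII punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words and any additional unwanted words (case-insensitive match).
    unwanted = {w.lower() for w in stop_words} | {w.lower() for w in skip}
    tokens = [tok for tok in text.split() if tok.lower() not in unwanted]
    return " ".join(tokens)

cleaned = clean_text(
    "The price of coffee is 25% higher, and rising!",
    numbers=True, punct=True, stop_words=STOP_WORDS, skip=["rising"],
)
print(cleaned)  # price coffee higher
```

Each operation is switched on independently, which matches the idea of applying all or only a selected combination of the cleaning steps.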

It reads dates in standard date and datetime formats (e.g., 2013-12-31, 2013/12/31, Feb-09-2009, 2013-12-31 11:46:17, 09/02/2009 09:26). It is preferable to use US-style dates (MM/DD/YYYY) rather than European-style dates (DD/MM/YYYY).
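The reason the date style matters is that a string such as 09/02/2009 is ambiguous. The following sketch uses only the standard library's datetime module (not Arabica's own parser) to show how the same string yields two different dates depending on the assumed format:

```python
from datetime import datetime

raw = "09/02/2009 09:26"

# Read as US-style MM/DD/YYYY: September 2, 2009.
us = datetime.strptime(raw, "%m/%d/%Y %H:%M")

# Read as European-style DD/MM/YYYY: February 9, 2009.
eu = datetime.strptime(raw, "%d/%m/%Y %H:%M")

print(us.date())  # 2009-09-02
print(eu.date())  # 2009-02-09
```

Converting dates to an unambiguous form such as YYYY-MM-DD before analysis avoids this kind of silent mix-up.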

Installation

Arabica requires Python 3.8 to 3.10 and the following packages: NLTK (stop words removal), clean-text (text cleaning), wordcloud (word cloud visualization), plotnine (heatmaps and line graphs), and matplotlib (graphical operations).

To install using pip, use:

pip install arabica

Usage

  • Import the library:
from arabica import arabica_freq
from arabica import cappuccino
  • Choose a method:

arabica_freq returns a dataframe with aggregated unigram, bigram, and trigram frequencies over a period. Its parameters select the aggregation period, the stop words to remove, and the cleaning operations to apply:

def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 time_freq: str ='',       # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 max_words: int ='',       # Max number for most frequent n-grams displayed for each period
                 stopwords: [],            # Languages for stop words
                 skip: [],                 # Remove additional strings
                 numbers: bool = False,    # Remove all digits
                 punct: bool = False,      # Remove all punctuation
                 lower_case: bool = False  # Lowercase text before cleaning and frequency analysis
) 
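To make the idea of n-gram frequencies aggregated over a period concrete, here is a minimal standard-library sketch. It is not how Arabica is implemented and does not call the package. The function ngram_freq_by_year and the sample records are hypothetical, and the sketch assumes ISO-formatted dates (YYYY-MM-DD) and yearly aggregation only.

```python
from collections import Counter, defaultdict

def ngram_freq_by_year(records, n=2, max_words=2):
    """Return {year: [(ngram, count), ...]} with the top max_words n-grams per year."""
    counts = defaultdict(Counter)
    for date, text in records:
        year = date[:4]                      # assumes ISO dates: YYYY-MM-DD
        tokens = text.lower().split()
        # Slide a window of n tokens over the text to form n-grams.
        ngrams = zip(*(tokens[i:] for i in range(n)))
        counts[year].update(" ".join(g) for g in ngrams)
    return {year: c.most_common(max_words) for year, c in sorted(counts.items())}

records = [
    ("2013-01-05", "great coffee great price"),
    ("2013-06-11", "great coffee again"),
    ("2014-02-20", "bad service bad service"),
]
result = ngram_freq_by_year(records, n=2, max_words=1)
print(result)
# {'2013': [('great coffee', 2)], '2014': [('bad service', 2)]}
```

Here max_words plays the same role as in the signature above: it caps the number of most frequent n-grams reported for each period.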

cappuccino enables standard cleaning operations (stop words, numbers, and punctuation removal) and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) text data visualization.

def cappuccino(text: str,                # Text
               time: str,                # Time
               plot: str ='',            # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int ='',           # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str ='',       # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int ='',       # Max number for most frequent n-grams displayed for each period
               stopwords = [],           # Languages for stop words
               skip: [ ],                # Remove additional strings
               numbers: bool = False,    # Remove numbers
               punct: bool = False,      # Remove punctuation
               lower_case: bool = False  # Lowercase text before cleaning and frequency analysis
)
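A heatmap or line plot of text over time is drawn from a table of term counts per period. The sketch below builds such a table with the standard library only. It is an illustration of the underlying data, not cappuccino's implementation, and the helper term_period_matrix and the sample records are hypothetical. It assumes ISO-formatted dates and monthly aggregation.

```python
from collections import Counter

def term_period_matrix(records, terms):
    """Count how often each term occurs in each month (YYYY-MM)."""
    periods = sorted({date[:7] for date, _ in records})
    matrix = {term: {p: 0 for p in periods} for term in terms}
    for date, text in records:
        word_counts = Counter(text.lower().split())
        for term in terms:
            matrix[term][date[:7]] += word_counts[term]
    return periods, matrix

records = [
    ("2013-01-05", "coffee price up"),
    ("2013-01-20", "coffee coffee everywhere"),
    ("2013-02-03", "price down"),
]
periods, matrix = term_period_matrix(records, ["coffee", "price"])
for term, row in matrix.items():
    print(term, [row[p] for p in periods])
# coffee [3, 0]
# price [1, 1]
```

Each row corresponds to one term and each column to one period, which is the shape a heatmap needs; a line plot reads the same rows as series over time.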

A list of available languages for stopwords is printed with:

from nltk.corpus import stopwords
print(stopwords.fileids())

To remove several sets of stop words at once, pass a list of languages: stopwords = ['language 1', 'language 2', 'etc.']

Documentation, examples and tutorials

  • Read the documentation.

  • For more coding examples, read this tutorial:

Text as Time Series: Arabica 1.0 Brings New Features for Exploratory Text Data Analysis


Please visit here for any questions, issues, bugs, and suggestions.
