
A Python package for exploratory analysis of text data

Project description



Text data is often recorded as a time series and can vary significantly over time. Examples of time-series text data include tweets, product reviews, and newspaper headlines. Arabica provides functions that simplify the exploratory analysis of such datasets.

Arabica provides the following method:

  • arabica_freq: calculates unigram, bigram, and trigram frequencies over a period (year, month)
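To make the terminology concrete, here is a minimal sketch of unigram, bigram, and trigram counting using only the standard library. It is not Arabica's implementation; the helper names `ngrams` and `ngram_frequencies` are hypothetical, and lowercasing plus whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(text, n):
    """Count n-grams in a lowercased, whitespace-tokenized text (illustrative assumption)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


text = "really good service really good price"
unigrams = ngram_frequencies(text, 1)
bigrams = ngram_frequencies(text, 2)
trigrams = ngram_frequencies(text, 3)

print(bigrams.most_common(1))  # [('really good', 2)]
```

An n-gram is simply a run of n consecutive tokens, so a text with k tokens yields k - n + 1 n-grams, and none at all when it is shorter than n.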

It can apply all or a selected combination of the following cleaning operations:

  • Remove digits from the text
  • Remove punctuation from the text
  • Remove a standard list of stopwords

Arabica uses clean-text for punctuation cleaning and the NLTK corpus of stopwords.
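The three cleaning operations can be sketched with the standard library alone. This is an illustration, not Arabica's code: it swaps in `re` and `string` for clean-text, and a tiny hypothetical stopword set (`DEMO_STOPWORDS`) for the NLTK corpus. The function name `clean_text` and the lowercasing step are also assumptions.

```python
import re
import string

# Hypothetical, tiny stopword set for illustration only (Arabica uses the NLTK corpus).
DEMO_STOPWORDS = {"the", "was", "and", "to", "be", "for", "so", "you"}


def clean_text(text, numbers=False, punct=False, stopwords=None):
    """Apply any combination of the three cleaning steps to one text."""
    text = text.lower()  # assumption: case-insensitive matching
    if numbers:
        text = re.sub(r"\d+", "", text)  # remove digits
    if punct:
        # remove ASCII punctuation characters
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stopwords:
        tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)


raw = "The 2 orders arrived, and the service was great!!!"
cleaned = clean_text(raw, numbers=True, punct=True, stopwords=DEMO_STOPWORDS)
print(cleaned)  # orders arrived service great
```

Each step is switched on independently, mirroring the "all or a selected combination" behaviour described above.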


Arabica requires Python 3, NLTK, and clean-text to run. To install it with pip, use:

pip install arabica


  • Import the library:
from arabica import arabica_freq
  • Choose a method:

Arabica returns a dataframe with unigram, bigram, and trigram frequencies aggregated over a period. To remove stopwords, select the aggregation period, and choose a specific set of cleaning operations, use these parameters:

def arabica_freq(text: str,              # Text
                 time: str,              # Time
                 stopwords: str,         # Language for stop words
                 punct: bool = False,    # Remove all punctuation
                 max_words: int = '',    # Max number of unigrams, bigrams, and trigrams displayed
                 time_freq: str = '',    # Aggregation period, 'Y'/'M'
                 numbers: bool = False   # Remove all digits
                 )


import pandas as pd
from arabica import arabica_freq

data = pd.DataFrame({'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
                              'So far seems to be the wrong product for me :-/',
                              'Excellent, service, thank you really, really, really much!!!'],
                     'time': ['2013-08-8', '2013-09-8', '2014-10-8']})

# Monthly frequencies, top 2 n-grams per period; English stopwords, digits, and punctuation removed
arabica_freq(text=data['text'], time=data['time'], time_freq='M', max_words=2,
             stopwords='english', numbers=True, punct=True)
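To illustrate what aggregating by period with a `max_words` cap means, the sketch below groups texts by month or year and keeps the most frequent words in each group. It uses only the standard library and is not how Arabica is implemented; `period_key` and `top_words_by_period` are hypothetical helpers, and the short texts are made-up sample data.

```python
from collections import Counter, defaultdict
from datetime import datetime


def period_key(date_string, time_freq):
    """Map a 'YYYY-MM-DD' date to a year ('Y') or year-month ('M') label."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    if time_freq == "Y":
        return parsed.strftime("%Y")
    if time_freq == "M":
        return parsed.strftime("%Y-%m")
    raise ValueError("time_freq must be 'Y' or 'M'")


def top_words_by_period(texts, dates, time_freq="M", max_words=2):
    """Count unigrams per period and keep the max_words most frequent ones."""
    counts = defaultdict(Counter)
    for text, date_string in zip(texts, dates):
        counts[period_key(date_string, time_freq)].update(text.lower().split())
    return {period: counter.most_common(max_words)
            for period, counter in sorted(counts.items())}


texts = ["great service great price", "wrong product", "great product"]
dates = ["2013-08-8", "2013-09-8", "2014-10-8"]

monthly = top_words_by_period(texts, dates, time_freq="M", max_words=1)
yearly = top_words_by_period(texts, dates, time_freq="Y", max_words=1)
print(monthly)
print(yearly)
```

With `time_freq="M"` each of the three sample texts lands in its own period, while `time_freq="Y"` merges the two 2013 texts into a single group before counting.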


For more coding examples, read the tutorial here.



For questions, issues, bug reports, and suggestions, please visit here.

Download files

Source distribution: arabica-0.0.5.tar.gz (5.7 kB)

Built distribution: arabica-0.0.5-py3-none-any.whl (6.5 kB, py3)
