Arabica

A Python package for exploratory analysis of text data

Text data is often recorded as a time series with significant variability over time. Examples of time-series text data include tweets, product reviews, and newspaper headlines. Arabica provides functions that simplify the exploratory analysis of such datasets.

Arabica provides the following method:

  • arabica_freq: calculates unigram, bigram, and trigram frequencies over a period (year, month)
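To make the idea of n-gram frequencies over a period concrete, here is a minimal standard-library sketch. It is not Arabica's implementation: the helper names `ngrams` and `ngram_freq_by_period` are hypothetical, and the sample texts are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq_by_period(texts, dates, n=1, period="Y"):
    """Count n-grams per year ('Y') or per month ('M').

    Hypothetical helper for illustration only; not part of Arabica.
    """
    fmt = "%Y" if period == "Y" else "%Y-%m"
    counts = defaultdict(Counter)
    for text, date in zip(texts, dates):
        # Map each record to its aggregation key, e.g. '2013' or '2013-08'.
        key = datetime.strptime(date, "%Y-%m-%d").strftime(fmt)
        counts[key].update(ngrams(text.lower().split(), n))
    return dict(counts)


texts = ["great service great price", "wrong product", "great service again"]
dates = ["2013-08-08", "2013-09-08", "2014-10-08"]

yearly = ngram_freq_by_period(texts, dates, n=2, period="Y")
monthly = ngram_freq_by_period(texts, dates, n=1, period="M")
```

Grouping by a formatted date string keeps the sketch dependency-free; a dataframe-based approach would more likely group on a datetime column.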

It can apply all or a selected combination of the following cleaning operations:

  • Remove digits from the text
  • Remove punctuation from the text
  • Remove a standard list of stopwords

Arabica uses clean-text for punctuation cleaning and the NLTK corpus of stopwords.
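The three cleaning operations can be sketched with the standard library alone. This is an illustration under stated assumptions, not how Arabica does it: the `clean` function is hypothetical, its flag names merely mirror the `arabica_freq` parameters, and the tiny `STOPWORDS` set stands in for the NLTK corpus.

```python
import re
import string

# Tiny illustrative stopword set (assumption); Arabica draws on the NLTK corpus.
STOPWORDS = {"the", "a", "an", "and", "was", "to", "for", "so"}


def clean(text, numbers=False, punct=False, stopwords=None):
    """Apply any combination of the three cleaning steps to one string."""
    if numbers:
        # Remove all digits.
        text = re.sub(r"\d+", "", text)
    if punct:
        # Remove ASCII punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stopwords:
        # Drop stopwords, comparing case-insensitively.
        tokens = [t for t in tokens if t.lower() not in stopwords]
    return " ".join(tokens)


cleaned = clean("The order arrived in 3 days, and it was great!",
                numbers=True, punct=True, stopwords=STOPWORDS)
print(cleaned)  # order arrived in days it great
```

Each step is optional, so any subset of the operations can be applied, as the list above describes.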

Installation

Arabica requires Python 3, NLTK, and clean-text. To install it with pip, run:

pip install arabica

Usage

  • Import the library:
from arabica import arabica_freq
  • Choose a method:

arabica_freq returns a dataframe of aggregated unigram, bigram, and trigram frequencies over a period. Its parameters set the stopword language, the aggregation period, and the cleaning operations to apply:

def arabica_freq(text: str, # Text
                 time: str, # Time
                 stopwords: str, # Language for stop words
                 punct: bool = False, # Remove all punctuation
                 max_words: int='', # Max number of unigrams, bigrams, and trigrams displayed
                 time_freq: str='', # Aggregation period, 'Y'/'M'
                 numbers: bool = False # Remove all digits
) 
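As a rough illustration of what a cap such as max_words does, the sketch below keeps only the most frequent n-grams from a frequency table. The `top_ngrams` helper and the sample counts are hypothetical, and the "term: count" output format is an assumption made for readability, not Arabica's actual output.

```python
from collections import Counter


def top_ngrams(counter, max_words):
    """Return the max_words most frequent items as 'term: count' strings."""
    return [f"{term}: {count}" for term, count in counter.most_common(max_words)]


# Made-up bigram counts for a single period.
counts = Counter({"great service": 3, "fast delivery": 2, "wrong product": 1})
top2 = top_ngrams(counts, 2)
print(top2)  # ['great service: 3', 'fast delivery: 2']
```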

Example

import pandas as pd
from arabica import arabica_freq
data = pd.DataFrame({'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
                              'So far seems to be the wrong product for me :-/',
                              'Excellent, service, thank you really, really, really much!!!'],
                     'time': ['2013-08-8', '2013-09-8','2014-10-8']})
arabica_freq(text=data['text'],
             time=data['time'],
             time_freq='M',
             max_words=2,
             stopwords='english',
             numbers=True,
             punct=True)

Tutorial

For more coding examples, read the tutorial here.

License

MIT

For questions, issues, bug reports, and suggestions, please visit here.

Download files

Source Distribution

arabica-0.0.4.tar.gz (5.7 kB)

Built Distribution

arabica-0.0.4-py3-none-any.whl (6.5 kB, Python 3)
