Arabica
Python package for exploratory text data analysis
Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Arabica provides functions to make the exploratory analysis of such datasets simple.
Arabica provides these methods:
- arabica_freq: calculates unigram, bigram, and trigram frequencies over a period (year, month, day)
It can apply all or a selected combination of the following cleaning operations:
- Remove digits from the text
- Remove punctuation from the text
- Remove standard list of stopwords
- Remove an additional specific list of words
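The four cleaning operations above can be sketched in plain Python. This is only an illustration of what each switch does, not Arabica's actual implementation: the helper name clean_text and the tiny stopword set are hypothetical, and the real package relies on clean-text and the NLTK stopword corpus instead.

```python
import re
import string

def clean_text(text, stopwords=(), skip=(), numbers=False, punct=False, lower_case=False):
    """Illustrative cleaner (hypothetical helper, not part of Arabica)."""
    if lower_case:
        text = text.lower()
    if numbers:
        # Replace every run of digits with a space
        text = re.sub(r"\d+", " ", text)
    if punct:
        # Map each ASCII punctuation character to a space
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        text = text.translate(table)
    # Drop stopwords and any additional words the caller wants excluded
    excluded = set(stopwords) | set(skip)
    tokens = [t for t in text.split() if t.lower() not in excluded]
    return " ".join(tokens)

# A tiny made-up stopword set, used here only for demonstration
demo_stopwords = {"so", "far", "to", "be", "the", "for", "me"}
cleaned = clean_text("So far seems to be the wrong product for me :-/ grrrrr...",
                     stopwords=demo_stopwords, skip=["grrrrr"],
                     numbers=True, punct=True, lower_case=True)
print(cleaned)  # seems wrong product
```

Each switch is independent, so any combination of the operations can be applied to the same text.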
Arabica works with texts in languages based on the Latin alphabet, uses clean-text
for punctuation cleaning, and supports stopword removal for the languages in the NLTK
corpus of stopwords.
It reads dates in standard date and datetime formats (e.g., 2013-12-31, 2013/12/31, 09-Feb-2009, 2013-12-31 11:46:17, 09/02/2009 09:26). It is preferable to use US-style dates (MM/DD/YYYY) rather than the European-style format (DD/MM/YYYY).
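A likely reason for the date-format recommendation is that a string such as 09/02/2009 is ambiguous. The sketch below uses only the standard library to show the two readings. It illustrates the ambiguity in general and makes no claim about how Arabica parses dates internally.

```python
from datetime import datetime

raw = "09/02/2009"

# The same string read month-first (US style) and day-first (European style)
us_style = datetime.strptime(raw, "%m/%d/%Y")
eu_style = datetime.strptime(raw, "%d/%m/%Y")

print(us_style.date())  # 2009-09-02
print(eu_style.date())  # 2009-02-09
```

Converting dates to an unambiguous form such as YYYY-MM-DD before the analysis avoids the problem altogether.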
Installation
Arabica requires Python >=3.7, NLTK, clean-text, and numpy to execute. To install using pip, use:
pip install arabica
Usage
- Import the library:
from arabica import arabica_freq
- Choose a method:
arabica_freq returns a dataframe with aggregated unigram, bigram, and trigram frequencies over a period. Its parameters select the stopword languages, the aggregation period, and the cleaning operations to apply:
def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 time_freq: str ='',       # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 max_words: int ='',       # Max number for unigrams, bigrams and trigrams displayed
                 stopwords: [],            # Languages for stop words
                 skip: [],                 # Remove additional strings
                 numbers: bool = False,    # Remove all digits
                 punct: bool = False,      # Remove all punctuation
                 lower_case: bool = False, # Lowercase text before cleaning and frequency analysis
                 )
A list of available languages for stopwords is printed with:
from nltk.corpus import stopwords
print(stopwords.fileids())
To remove several sets of stopwords at once, pass a list of languages: stopwords = ['language1', 'language2', 'etc.']
Examples
Time-series n-gram analysis
Returns a table with unigram, bigram, and trigram frequencies over a period of time.
import pandas as pd
from arabica import arabica_freq
data = pd.DataFrame({'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
'So far seems to be the wrong product for me :-/ grrrrr...',
'Excellent, service, thank you really, really, really much!!!'],
'time': ['2013-08-8', '2013-09-8','2014-10-8']})
arabica_freq(text = data['text'],
time = data['time'],
time_freq = 'M', # Calculates monthly n-gram frequencies
max_words = 2, # Displays two most frequent unigrams, bigrams, and trigrams
stopwords = ['english'], # Removes English set of stopwords
skip = ['grrrrr'], # Excludes string from n-gram calculation
numbers = True, # Removes numbers
punct = True, # Removes punctuation
lower_case = True) # Lowercase text before cleaning and n-gram calculation
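To make the aggregation step concrete, here is a minimal standard-library sketch of counting n-grams per period and keeping the most frequent ones. The function names ngrams and ngram_freq_by_period, the sample texts, and the precomputed 'YYYY-MM' period keys are all assumptions made for illustration. Arabica's own implementation and output format may differ.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_freq_by_period(texts, periods, n, max_words):
    """Count n-grams per period and keep the max_words most frequent ones."""
    counts = defaultdict(Counter)
    for text, period in zip(texts, periods):
        counts[period].update(ngrams(text.split(), n))
    return {p: c.most_common(max_words) for p, c in counts.items()}

# Made-up, already cleaned sample texts with monthly period keys
texts = ["great service great price", "great service again", "wrong product"]
periods = ["2013-08", "2013-08", "2013-09"]

result = ngram_freq_by_period(texts, periods, n=2, max_words=2)
print(result["2013-08"][0])  # ('great service', 2)
print(result["2013-09"])     # [('wrong product', 1)]
```

Using a single period key for every text gives the ungrouped case, where frequencies are computed over the whole dataset.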
Descriptive n-gram analysis
Returns unigram, bigram, and trigram frequencies without period aggregation.
arabica_freq(text = data['text'],
time = data['time'],
time_freq = 'ungroup', # No aggregation made
max_words = 2,
stopwords = ['english'],
skip = ['grrrrr'],
numbers = True,
punct = True,
lower_case = True)
Documentation and tutorials
See the project documentation for details. For more coding examples, read the tutorial "Text as Time Series: Arabica 1.0.0 Brings New Features for Exploratory Text Data Analysis".
License
MIT
For any questions, issues, bugs, and suggestions, please visit here.