Arabica
A Python package for exploratory analysis of text data
Text data is often recorded as a time series with significant variability over time. Examples of time-series text data include tweets, product reviews, and newspaper headlines. Arabica provides functions that simplify exploratory analysis of such datasets.
Arabica provides these methods:
- arabica_freq: calculates unigram, bigram, and trigram frequencies over a period (year, month)
It can apply all or a selected combination of the following cleaning operations:
- Remove digits from the text
- Remove punctuation from the text
- Remove a standard list of stopwords
Arabica uses clean-text for punctuation cleaning and the NLTK corpus of stopwords.
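To make the three cleaning operations concrete, here is a minimal standard-library sketch. It is not Arabica's implementation (which relies on clean-text and NLTK); the function name `clean_text_sketch` and the small `DEMO_STOPWORDS` set are hypothetical and exist only for illustration.

```python
import re
import string

# Hypothetical mini stopword list for illustration only; Arabica itself
# relies on the NLTK stopword corpus.
DEMO_STOPWORDS = {"the", "was", "and", "to", "be", "for", "so", "very",
                  "they", "have", "any", "out", "you", "me"}

# Translation table that maps every ASCII punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_sketch(text, numbers=False, punct=False, stopwords=None):
    """Apply the selected cleaning steps and return lowercase tokens."""
    if numbers:
        text = re.sub(r"\d+", "", text)       # remove digits
    if punct:
        text = text.translate(PUNCT_TABLE)    # remove punctuation
    tokens = text.lower().split()
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]  # remove stopwords
    return tokens


print(clean_text_sketch("So far seems to be the wrong product for me :-/",
                        punct=True, stopwords=DEMO_STOPWORDS))
```

Each step is optional, mirroring the idea that all or only a selected combination of the operations can be applied.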
Installation
Arabica requires Python 3, NLTK, and clean-text. To install it with pip, run:
pip install arabica
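After installing, one way to confirm that the package and its stated requirements are present is to query the installed distribution metadata. This is a small sketch using only the standard library (Python 3.8+); the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the PyPI names given above.

```python
from importlib import metadata


def installed_versions(packages=("arabica", "nltk", "clean-text")):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # distribution is not installed
    return versions


print(installed_versions())
```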
Usage
- Import the library:
from arabica import arabica_freq
- Choose a method:
arabica_freq returns a dataframe of unigram, bigram, and trigram frequencies aggregated over a period. Its parameters select the stopword language, the aggregation period, and the cleaning operations to apply:
def arabica_freq(text: str,              # Text column, e.g. data['text']
                 time: str,              # Time column, e.g. data['time']
                 stopwords: str,         # Language for stop words
                 punct: bool = False,    # Remove all punctuation
                 max_words: int = '',    # Max number of unigrams, bigrams, and trigrams displayed
                 time_freq: str = '',    # Aggregation period, 'Y'/'M'
                 numbers: bool = False   # Remove all digits
                 )
Example
import pandas as pd
from arabica import arabica_freq

# Three sample reviews, each paired with the date it was recorded
data = pd.DataFrame({
    'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
             'So far seems to be the wrong product for me :-/',
             'Excellent, service, thank you really, really, really much!!!'],
    'time': ['2013-08-8', '2013-09-8', '2014-10-8']})

# Monthly aggregation, top 2 n-grams, with English stopwords, digits, and punctuation removed
arabica_freq(text=data['text'], time=data['time'], time_freq='M', max_words=2,
             stopwords='english', numbers=True, punct=True)
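For readers who want to see what aggregating n-gram frequencies over a period involves, the following standard-library sketch groups texts by year or month and counts unigrams, bigrams, and trigrams in each group. It is an illustration only and not Arabica's implementation: the names `ngrams` and `ngram_freq_sketch`, the comma-joined n-gram format, and the nested-dictionary output are all assumptions, and Arabica's actual output is a dataframe whose layout may differ.

```python
from collections import Counter, defaultdict
from datetime import datetime


def ngrams(tokens, n):
    """Return the n-grams of a token list as comma-joined strings."""
    return [",".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq_sketch(texts, times, time_freq="M", max_words=2):
    """Count unigrams, bigrams, and trigrams per period ('Y' or 'M')."""
    fmt = "%Y" if time_freq == "Y" else "%Y-%m"
    counts = defaultdict(lambda: {1: Counter(), 2: Counter(), 3: Counter()})
    for text, stamp in zip(texts, times):
        # Turn the date string into a period key such as '2013' or '2013-08'
        period = datetime.strptime(stamp, "%Y-%m-%d").strftime(fmt)
        tokens = text.lower().split()
        for n in (1, 2, 3):
            counts[period][n].update(ngrams(tokens, n))
    # Keep only the max_words most frequent n-grams in each period
    return {
        period: {n: counter.most_common(max_words) for n, counter in by_n.items()}
        for period, by_n in sorted(counts.items())
    }


texts = ["great service great price", "wrong product", "great service again"]
times = ["2013-08-8", "2013-08-20", "2014-10-8"]
print(ngram_freq_sketch(texts, times, time_freq="Y", max_words=1))
```

The sketch skips cleaning entirely; in practice, digits, punctuation, and stopwords would be removed before counting.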
Tutorial
For more coding examples, read the tutorial here.
License
MIT
For questions, bug reports, and suggestions, please visit here.