A Python package for exploratory analysis of text data
Arabica
Text data is often recorded as a time series with significant variability over time. Examples of time-series text data include tweets, product reviews, and newspaper headlines. Arabica provides functions that make exploratory analysis of such datasets simple.
Arabica provides the following method:
- arabica_freq: calculates unigram, bigram, and trigram frequencies over a period (year, month)
It can apply any combination of the following cleaning operations:
- Remove digits from the text
- Remove punctuation from the text
- Remove a standard list of stopwords
Arabica uses clean-text for punctuation cleaning and the NLTK stopwords corpus for stopword removal.
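To make the three cleaning operations concrete, here is a minimal standard-library sketch. It is not Arabica's implementation (which relies on clean-text and NLTK): the function name `demo_clean`, the tiny `DEMO_STOPWORDS` set, and the case-insensitive stopword matching are assumptions made for illustration only.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# Arabica itself draws on the NLTK stopwords corpus.
DEMO_STOPWORDS = {"the", "was", "and", "to", "be", "for", "me", "so", "you"}

# Translation table that maps every ASCII punctuation character to a space,
# so words on either side of a punctuation mark stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def demo_clean(text, numbers=False, punct=False, stopwords=None):
    """Apply the selected cleaning steps and return the cleaned string."""
    if numbers:
        text = re.sub(r"\d+", "", text)  # remove digits
    if punct:
        text = text.translate(_PUNCT_TABLE)  # remove punctuation
    tokens = text.split()
    if stopwords:
        # Assumption: stopwords are matched case-insensitively.
        tokens = [t for t in tokens if t.lower() not in stopwords]
    return " ".join(tokens)


print(demo_clean("So far seems to be the wrong product for me :-/ 100%",
                 numbers=True, punct=True, stopwords=DEMO_STOPWORDS))
# prints: far seems wrong product
```

Each step is opt-in, mirroring the idea that any combination of the operations can be selected.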
Installation
Arabica requires Python 3, NLTK, and clean-text to run. To install it with pip, use:
pip install arabica
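After installing, a quick check along these lines can confirm that the package and its stated dependencies are present. This is a sketch using only the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is hypothetical, and the distribution names `nltk` and `clean-text` are assumed to match the dependency names given above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report the status of Arabica and the dependencies named in this README.
for name in ("arabica", "nltk", "clean-text"):
    print(name, installed_version(name) or "not installed")
```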
Usage
- Import the library:
from arabica import arabica_freq
- Choose a method:
arabica_freq returns a dataframe of aggregated unigram, bigram, and trigram frequencies over a period. Use its parameters to select the stopword language, the aggregation period, and the cleaning operations to apply:
def arabica_freq(text: str,             # Text
                 time: str,             # Time
                 stopwords: str,        # Language for stopwords
                 punct: bool = False,   # Remove all punctuation
                 max_words: int = '',   # Max number of unigrams, bigrams, and trigrams displayed
                 time_freq: str = '',   # Aggregation period, 'Y'/'M'
                 numbers: bool = False  # Remove all digits
                 )
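The idea behind the frequency calculation can be sketched with the standard library alone. The code below is an illustration, not Arabica's implementation: the names `ngrams` and `ngram_freq_by_period`, the whitespace tokenization, the lowercasing, the `YYYY-MM-DD` date format, and the nested-dictionary return shape are all assumptions (Arabica itself returns a dataframe).

```python
from collections import Counter, defaultdict
from datetime import datetime


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq_by_period(texts, times, time_freq="M", max_words=2):
    """Count unigrams, bigrams, and trigrams per year ('Y') or month ('M').

    Returns {period: {n: [(ngram, count), ...]}}, keeping at most
    max_words of the most frequent n-grams for each n.
    """
    fmt = "%Y" if time_freq == "Y" else "%Y-%m"
    counts = defaultdict(lambda: {1: Counter(), 2: Counter(), 3: Counter()})
    for text, time in zip(texts, times):
        # Assumption: timestamps look like '2013-08-8' (year-month-day).
        period = datetime.strptime(time, "%Y-%m-%d").strftime(fmt)
        tokens = text.lower().split()  # assumption: naive whitespace tokens
        for n in (1, 2, 3):
            counts[period][n].update(ngrams(tokens, n))
    return {period: {n: c.most_common(max_words) for n, c in by_n.items()}
            for period, by_n in sorted(counts.items())}


texts = ["great service great service", "wrong product", "great product"]
times = ["2013-08-8", "2013-08-20", "2014-10-8"]
for period, by_n in ngram_freq_by_period(texts, times, time_freq="Y").items():
    print(period, by_n)
```

Switching `time_freq` from `'Y'` to `'M'` only changes the key used to bucket the texts, which is why yearly and monthly aggregation can share one code path.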
Example
import pandas as pd
from arabica import arabica_freq
data = pd.DataFrame({
    'text': [
        'The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
        'So far seems to be the wrong product for me :-/',
        'Excellent, service, thank you really, really, really much!!!',
    ],
    'time': ['2013-08-8', '2013-09-8', '2014-10-8'],
})
arabica_freq(text=data['text'], time=data['time'], time_freq='M', max_words=2,
             stopwords='english', numbers=True, punct=True)
Tutorial
For more coding examples, read the tutorial here.
License
MIT
For questions, issues, bugs, and suggestions, please visit here.