A Python package for exploratory analysis of text data

Arabica

Text data is often recorded as a time series with significant variability over time. Examples of time-series text data include tweets, product reviews, and newspaper headlines. Arabica provides functions that simplify the exploratory analysis of such datasets.

Arabica provides the following method:

  • arabica_freq: calculates unigram, bigram, and trigram frequencies per period (year or month)

It can apply all, or any selected combination, of the following cleaning operations:

  • Remove digits from the text
  • Remove punctuation from the text
  • Remove a standard list of stop words

Arabica uses clean-text for punctuation cleaning and the NLTK stop-word corpus for stop-word removal.
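To make the three cleaning operations concrete, here is a minimal standard-library sketch. It is not Arabica's implementation: the function name clean_text and the tiny STOPWORDS set are hypothetical, chosen only for illustration, whereas Arabica itself relies on clean-text and the NLTK corpus.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# Arabica itself draws on the NLTK stop-word corpus.
STOPWORDS = {"the", "was", "and", "to", "be", "for", "so", "you", "me"}


def clean_text(text, numbers=False, punct=False, stopwords=None):
    """Apply the selected cleaning steps and return lowercase tokens."""
    if numbers:
        # Drop every run of digits.
        text = re.sub(r"\d+", "", text)
    if punct:
        # Drop every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    if stopwords:
        # Keep only tokens that are not in the stop-word set.
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


print(clean_text("So far seems to be the wrong product for me :-/",
                 punct=True, stopwords=STOPWORDS))
```

Each step is switched on independently, which mirrors the idea of applying all or only some of the operations.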

Installation

Arabica requires Python 3, NLTK, and clean-text. To install it with pip, run:

pip install arabica

Usage

  • Import the library:
from arabica import arabica_freq
  • Choose a method:

arabica_freq returns a dataframe of unigram, bigram, and trigram frequencies aggregated over a period. Its parameters select the stop-word language, the aggregation period, and the cleaning operations to apply:

def arabica_freq(text: str,             # Text
                 time: str,             # Time
                 stopwords: str,        # Language of the stop-word list
                 punct: bool = False,   # Remove all punctuation
                 max_words: int = '',   # Maximum number of unigrams, bigrams, and trigrams displayed
                 time_freq: str = '',   # Aggregation period: 'Y' (year) or 'M' (month)
                 numbers: bool = False  # Remove all digits
)

Example

import pandas as pd
from arabica import arabica_freq

# Three sample reviews with their dates
data = pd.DataFrame({'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
                              'So far seems to be the wrong product for me :-/',
                              'Excellent, service, thank you really, really, really much!!!'],
                     'time': ['2013-08-8', '2013-09-8', '2014-10-8']})

# Monthly frequencies, at most 2 n-grams each, English stop words, digits and punctuation removed
arabica_freq(text=data['text'], time=data['time'], time_freq='M', max_words=2,
             stopwords='english', numbers=True, punct=True)
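To illustrate what aggregating n-gram frequencies over a period means, the following standard-library sketch counts n-grams per month or per year. It is an illustration under stated assumptions, not Arabica's code: the helper names ngrams and ngram_freq are hypothetical, the input is assumed to be plain lists of strings with dates in year-month-day form, and the result is a dictionary rather than a dataframe.

```python
from collections import Counter, defaultdict
from datetime import datetime


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq(texts, times, n=2, time_freq="M", max_words=2):
    """Count n-grams per period and keep the max_words most frequent ones."""
    # 'M' groups by year and month; anything else groups by year.
    fmt = "%Y-%m" if time_freq == "M" else "%Y"
    counts = defaultdict(Counter)
    for text, stamp in zip(texts, times):
        period = datetime.strptime(stamp, "%Y-%m-%d").strftime(fmt)
        counts[period].update(ngrams(text.lower().split(), n))
    # Periods are returned in chronological order.
    return {period: counter.most_common(max_words)
            for period, counter in sorted(counts.items())}


texts = ["great service great service", "great service again", "fast delivery"]
times = ["2013-08-8", "2013-08-20", "2014-10-8"]
print(ngram_freq(texts, times, n=2, time_freq="M", max_words=1))
```

Switching time_freq to 'Y' merges all texts from the same year into one period, and n selects unigrams (1), bigrams (2), or trigrams (3).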

Tutorial

For more code examples, read the tutorial here.

License

MIT

For questions, issues, bug reports, and suggestions, please visit here.
