Arabica

Python package for exploratory text data analysis

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Arabica provides functions to make the exploratory analysis of such datasets simple.

Arabica provides these methods:

arabica_freq: calculates unigram, bigram, and trigram frequencies over a period (year, month, day)

It can apply all or a selected combination of the following cleaning operations:

Remove digits from the text
Remove punctuation from the text
Remove a standard list of stopwords
Remove an additional specific list of words

Arabica uses clean-text for punctuation cleaning and the NLTK corpus of stopwords.
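To make the cleaning operations listed above concrete, here is a minimal standard-library sketch of what each step does to a string. It is not Arabica's implementation: the function name sketch_clean and the tiny EXAMPLE_STOPWORDS set are hypothetical stand-ins for clean-text and the NLTK stopword corpus, and the keyword names merely mirror the arabica_freq parameters.

```python
import re
import string

# Hypothetical, hand-picked stopwords for illustration only;
# Arabica itself relies on the NLTK corpus of stopwords.
EXAMPLE_STOPWORDS = {"the", "was", "and", "to", "be", "for", "me", "so", "you"}


def sketch_clean(text, stopwords=(), skip=(), numbers=False,
                 punct=False, lower_case=False):
    """Apply the selected cleaning steps and return the cleaned string."""
    if lower_case:
        text = text.lower()
    if numbers:
        # Drop every run of digits.
        text = re.sub(r"\d+", "", text)
    if punct:
        # Drop all ASCII punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Stopwords and additional skipped strings are matched case-insensitively.
    blocked = {w.lower() for w in stopwords} | {w.lower() for w in skip}
    tokens = [t for t in text.split() if t.lower() not in blocked]
    return " ".join(tokens)


cleaned = sketch_clean(
    "So far seems to be the wrong product for me :-/ grrrrr...",
    stopwords=EXAMPLE_STOPWORDS,
    skip=["grrrrr"],
    numbers=True,
    punct=True,
    lower_case=True,
)
print(cleaned)  # far seems wrong product
```

Each flag is independent, so any combination of the steps can be switched on, which is the behaviour the list above describes.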

Arabica works with texts in languages based on the Latin alphabet and enables stopword removal for languages in the NLTK corpus of stopwords.

It reads dates in standard date and datetime formats (e.g., 2013-12-31, 2013/12/31, 09-Feb-2009, 2013-12-31 11:46:17, 09/02/2009 09:26). US-style dates (MM/DD/YYYY) are preferable to European-style dates (DD/MM/YYYY), since day-first dates can lead to months and days being swapped, particularly in small datasets.
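The month/day mismatch mentioned above can be illustrated with the standard library alone. The sketch below does not show how Arabica parses dates; it only demonstrates, with datetime.strptime, that the same string yields two different dates depending on whether the day or the month is assumed to come first.

```python
from datetime import datetime

raw = "09/02/2009"

# The same string read under the two conventions.
us_style = datetime.strptime(raw, "%m/%d/%Y")  # month first
eu_style = datetime.strptime(raw, "%d/%m/%Y")  # day first

print(us_style.date())  # 2009-09-02
print(eu_style.date())  # 2009-02-09

# A year-first date cannot be misread in this way.
iso = datetime.strptime("2009-02-09", "%Y-%m-%d")
print(iso == eu_style)  # True
```

When a dataset allows it, converting the time column to a year-first format before analysis avoids the ambiguity altogether.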

Installation

Arabica requires Python 3, NLTK, clean-text, and numpy to execute. To install using pip, use:

pip install arabica

Usage

Import the library:

from arabica import arabica_freq

Choose a method:

arabica_freq returns a dataframe with unigram, bigram, and trigram frequencies aggregated over a period. Its parameters control stopword removal, the aggregation period, and the set of cleaning operations to apply:

def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 stopwords: [],            # Languages for stop words
                 skip: [],                 # Strings to be skipped
                 punct: bool = False,      # Remove all punctuation
                 lower_case: bool = False, # Make all text lowercase before n-gram calculation
                 max_words: int ='',       # Max number for unigrams, bigrams and trigrams displayed
                 time_freq: str ='',       # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 numbers: bool = False     # Remove all digits
)

A list of available languages for stopwords is printed with:

from nltk.corpus import stopwords
print(stopwords.fileids())

To remove several sets of stopwords at once, pass a list of languages: stopwords = ['language 1', 'language 2', ...]

Examples

Time-series n-gram analysis

Returns a table with unigram, bigram, and trigram frequencies over a period of time.

import pandas as pd
from arabica import arabica_freq

data = pd.DataFrame({'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
                              'So far seems to be the wrong product for me :-/ grrrrr...',
                              'Excellent, service, thank you really, really, really much!!!'],
                     'time': ['2013-08-8', '2013-09-8','2014-10-8']})

arabica_freq(text = data['text'],
             time = data['time'],
             time_freq = 'M',           # Calculates monthly n-gram frequencies
             max_words = 2,             # Displays only the first two most frequent unigrams, bigrams, and trigrams
             stopwords = ['english'],   # Removes English set of stopwords
             skip = ['grrrrr'],         # Excludes string from n-gram calculation
             numbers = True,            # Removes numbers
             punct = True,              # Removes punctuation
             lower_case = True)         # Makes all text lowercase before n-gram calculation
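For readers who want to see what aggregating n-gram frequencies over a period involves, the following standard-library sketch counts n-grams per month and keeps the most frequent ones. It is an illustration under stated assumptions, not Arabica's code: the helper names ngrams and ngram_freq_by_period are hypothetical, the input texts are assumed to be already cleaned, and dates are assumed to be in year-month-day form.

```python
from collections import Counter, defaultdict
from datetime import datetime


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq_by_period(texts, dates, n=1, period_format="%Y-%m", max_words=2):
    """Count n-grams per period and keep the max_words most frequent ones."""
    counts = defaultdict(Counter)
    for text, date in zip(texts, dates):
        # "%Y-%m" groups by month; "%Y" would group by year instead.
        period = datetime.strptime(date, "%Y-%m-%d").strftime(period_format)
        counts[period].update(ngrams(text.split(), n))
    return {period: counter.most_common(max_words)
            for period, counter in sorted(counts.items())}


# Already-cleaned versions of two of the reviews used above.
texts = ["far seems wrong product",
         "excellent service thank really really really much"]
dates = ["2013-09-8", "2014-10-8"]

monthly = ngram_freq_by_period(texts, dates, n=1)
monthly_bigrams = ngram_freq_by_period(texts, dates, n=2)

print(monthly["2014-10"][0])          # ('really', 3)
print(monthly_bigrams["2014-10"][0])  # ('really really', 2)
```

Keeping one Counter per period makes the aggregation level a matter of changing the period format string, which loosely corresponds to the choice between 'Y', 'M', and 'D' in time_freq.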

Descriptive n-gram analysis

Returns unigram, bigram, and trigram frequencies without period aggregation.

arabica_freq(text = data['text'],
             time = data['time'],
             time_freq = 'ungroup',     # No aggregation made
             max_words = 2,
             stopwords = ['english'],
             skip = ['grrrrr'],
             numbers = True,
             punct = True,
             lower_case = True)

Tutorial

For more coding examples, read these tutorials:

Text as Time Series: Arabica 1.0.0 Brings New Features for Exploratory Text Data Analysis

Arabica: A Python Package for Exploratory Analysis of Text Data

License

MIT

For any questions, issues, bugs, and suggestions, please visit here.
