This module helps you extract key terms and topics from a corpus using a comparative approach.
pip install --upgrade comparativeExtraction
from comparativeExtraction import comparative_keyword_extraction
Load sample data
import pandas as pd
import numpy as np

PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)

# True for reviews rated 3 stars or fewer
label = [x <= 3 for x in data['stars']]
Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')
Here we use online Amazon reviews of the Nintendo Switch to illustrate how the module is used.
The module requires a corpus and a set of binary labels as inputs. The labels should be created according to the question we are trying to answer, and there must be exactly one label per document in the corpus.
Here, let's assume that we want to know why people dislike this product and find the relevant keywords. To answer this question, we create a binary label indicating whether a reviewer gave 3 stars or fewer.
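As a minimal sketch of this labelling step, assuming the ratings are integers from 1 to 5. The helper name `make_labels` and the toy data are hypothetical and not part of the module:

```python
def make_labels(stars, threshold=3):
    """Return one boolean per rating: True when the rating is <= threshold."""
    return [s <= threshold for s in stars]

# Hypothetical toy data standing in for data['stars'] and data['reviews'].
sample_stars = [5, 1, 3, 4, 2]
sample_reviews = ["great", "broken", "meh", "good", "bad wifi"]

sample_labels = make_labels(sample_stars)

# The module expects exactly one label per document.
assert len(sample_labels) == len(sample_reviews)
print(sample_labels)
```

Changing `threshold` changes the question being asked: for example, a threshold of 1 would isolate only the most negative reviews.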
Initialize the module with the review corpus and labels
kw_init = comparative_keyword_extraction(corpus = data['reviews'], labels = label)
Extract the keywords
kw = kw_init.get_distinguishing_terms(ngram_range = (1,3),top_n = 10)
# Get the keywords that are mentioned significantly more in reviews with 3 stars or fewer
kw.incre_df
# Get the keywords that are mentioned significantly less in reviews with 3 stars or fewer
kw.decline_df
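The statistical test the module uses internally is not described here. As an illustration of what "mentioned significantly more" can mean, the sketch below compares a term's document frequency in two groups with a two-proportion z-test. This is an assumption chosen for illustration, not the module's implementation, and the function names and toy reviews are hypothetical:

```python
import math

def doc_frequency(term, docs):
    """Count the documents whose whitespace tokens include the term."""
    return sum(term in doc.lower().split() for doc in docs)

def two_proportion_z(term, target_docs, other_docs):
    """z-score for the difference in document proportions between two groups."""
    n1, n2 = len(target_docs), len(other_docs)
    x1, x2 = doc_frequency(term, target_docs), doc_frequency(term, other_docs)
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # se is zero when the term appears in no document or in every document.
    return 0.0 if se == 0 else (p1 - p2) / se

# Hypothetical toy groups: low-star reviews versus the rest.
low_star = ["does not work", "wifi does not work", "stopped working", "screen broke"]
high_star = ["love it", "great console", "works great", "fun games"]

z_work = two_proportion_z("work", low_star, high_star)
z_great = two_proportion_z("great", low_star, high_star)
print(round(z_work, 3), round(z_great, 3))
```

A positive score means the term appears in a larger share of the target group, and a negative score means a smaller share. With groups this small the scores are only illustrative.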
If we need more context on a given word, or we need more interpretable topics, we can:
- Output the reviews that contain the term
- Change the ngram_range
- Use the supplement functions
Output the reviews
Say we want to know more about the significant term "work". We can directly output all the reviews containing the term.
The output object "kw" contains a binary (one-hot encoded) document-term matrix covering all the terms found in the corpus. We can use it to find the reviews that correspond to each term.
# The binary_dtm provides a convenient way to extract reviews with specific terms
print(kw.binary_dtm[['work','not']])
      work  not
0        0    0
1        0    0
2        0    0
3        0    0
4        0    0
...    ...  ...
4995     1    0
4996     0    1
4997     0    0
4998     0    0
4999     0    0
[5000 rows x 2 columns]
reviews_contain_term_work = data['reviews'][[x == 1 for x in kw.binary_dtm['work']]]
len(reviews_contain_term_work)
# Print one randomly sampled review that contains the term
for x in pd.Series(reviews_contain_term_work).sample(1):
    print(x)
It's alright, only got it to give Nintendo another chance. It's a neat concept. Overall, it's aggressively mediocre, good for casual stuff, but will never get as much use as my ps4.Wi-Fi is god awful though. The worst I've dealt with. It's connection capabilities are atrocious compared with any other wireless device. Don't expect it to just work. Honestly, this singular problem is enough for me to rate it 1 star. I suppose they had to cut corners somewhere.
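To make the idea of a binary document-term matrix concrete, here is a small pure-Python sketch. It assumes simple whitespace tokenization, which may differ from how the module tokenizes, and `build_binary_dtm` is a hypothetical helper rather than part of the package:

```python
def build_binary_dtm(docs):
    """Map each term to a list of 0/1 flags, one flag per document."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*tokenized))
    return {
        term: [int(term in tokens) for tokens in tokenized]
        for term in vocabulary
    }

# Hypothetical toy reviews.
toy_reviews = ["does not work", "love it", "wifi does not work well"]
toy_dtm = build_binary_dtm(toy_reviews)

# Keep only the reviews whose flag for "work" is 1.
matches = [doc for doc, flag in zip(toy_reviews, toy_dtm["work"]) if flag == 1]
print(toy_dtm["work"])
print(matches)
```

Each flag records only whether a term occurs in a document, not how often, which is why filtering on `== 1` is enough to recover the matching reviews.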
Change the n-gram range to exclude uni-grams
kw = kw_init.get_distinguishing_terms(ngram_range = (2,4),top_n = 10)
kw.incre_df
Using the supplement functions
When we want to drill down into one specific term, we can use the built-in supplement functions to find related n-grams that contain the term.
from comparativeExtraction.supplement_funcs import get_ngrams_on_term
target_term = "work"
reviews_contain_term_work = data['reviews'][[x == 1 for x in kw.binary_dtm['work']]]
related_ngrams = get_ngrams_on_term(target_term,reviews_contain_term_work,filter_by_extreme=False)
Here, the count is also a document frequency: the number of reviews in which each n-gram appears.
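The following sketch shows what a document-frequency count of n-grams around a term can look like. It is not the implementation of `get_ngrams_on_term`; the function `ngrams_containing`, its whitespace tokenization, and the toy reviews are assumptions made for illustration:

```python
from collections import Counter

def ngrams_containing(term, docs, n=2):
    """Document frequency of each n-gram that includes the term as a token."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        grams = {
            " ".join(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)
            if term in tokens[i:i + n]
        }
        # grams is a set, so each n-gram counts at most once per document.
        counts.update(grams)
    return counts

# Hypothetical toy reviews.
toy = ["does not work", "not work at all", "work work work"]
bigram_df = ngrams_containing("work", toy)
print(bigram_df.most_common())
```

Because repeats within a single review are collapsed, "work work" is counted once here even though it occurs twice in the last review.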