Extract keywords via comparison of corpora

Project description

Introduction

This module helps you extract key terms and topics from a corpus using a comparative approach: it looks for terms whose frequency differs between two labeled groups of documents.

Installation

pip install compExtract

Usage

Import packages

from compExtract import ComparativeExtraction

Load sample data

import pandas as pd
import numpy as np
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
label = [x <= 3 for x in data['stars']]
data
      stars  titles                         reviews                                            dates
0       5.0  Worth It\n                     Definitely worth the money!\n                      September 21, 2019
1       2.0  Nintendo Swich gris joy con\n  Con este producto no he sentido mucha satisfac...  September 20, 2019
2       5.0  My kid wont put it down\n      Couldnt of been happier, came early. I was th...   September 20, 2019
3       3.0  Happy\n                        Happy\n                                            September 20, 2019
4       5.0  Great\n                        Great product\n                                    September 19, 2019
...     ...  ...                            ...                                                ...
4995    1.0  One Star\n                     it is no good, it suck, no work, plz hlp amazon\n  December 12, 2017
4996    5.0  A must have gaming system\n    The Nintendo Switch is a versatile hybrid game...  December 12, 2017
4997    5.0  Switch\n                       This purchase save me from looking for one.\n      December 11, 2017
4998    5.0  Five Stars\n                   Best babysitter ever!\n                            December 11, 2017
4999    5.0  Five Stars\n                   Its a great game console.\n                        December 11, 2017

5000 rows × 4 columns

data.columns
Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')

Here we use Amazon reviews of the Nintendo Switch to illustrate how the module is used.

The module requires a corpus and a set of binary labels as inputs. The labels should be created according to the question we are trying to answer, and there must be exactly one label per document in the corpus.

Here, let's assume that we want to know why people dislike this product and find the relevant keywords. To answer this question, we create a binary label indicating whether a reviewer gave 3 stars or fewer.
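The labeling step can be sketched in plain Python. The star ratings below are the first five rows of the sample data; the review texts are placeholders, and check_labels is a hypothetical helper (not part of compExtract) that enforces the two requirements stated above: one label per document, and both classes present.

```python
# Star ratings of the first five reviews in the sample data shown above.
stars = [5.0, 2.0, 5.0, 3.0, 5.0]
# Placeholder texts standing in for the real review column.
reviews = [f"review {i}" for i in range(len(stars))]


def make_labels(ratings, threshold=3):
    """Return True for ratings at or below the threshold (the group of interest)."""
    return [rating <= threshold for rating in ratings]


def check_labels(corpus, labels):
    """Hypothetical helper: validate the inputs before running the extraction."""
    if len(corpus) != len(labels):
        raise ValueError("corpus and labels must have the same length")
    if len(set(labels)) != 2:
        raise ValueError("labels must contain both True and False")


labels = make_labels(stars)
check_labels(reviews, labels)
print(labels)
```

Changing the threshold, or inverting the comparison, changes the question being asked: for example, labeling 5-star reviews as True would surface the terms that characterize the most satisfied reviewers.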

Initialize the module with the review corpus and labels

ce = ComparativeExtraction(corpus = data['reviews'], labels = label)

Extract the keywords

ce.get_distinguish_terms(ngram_range = (1,3),top_n = 10)
<compExtract.ComparativeExtraction at 0x7ff96f84b588>
# Get the keywords that are mentioned significantly more in the less than or equal to 3 star reviews
ce.increased_terms_df
   feature                diff  pos_prop  pos_count  neg_prop  neg_count
0  work               0.194976  0.278426        191  0.083449        360
1  switch             0.176764  0.351312        241  0.174548        753
2  buy                0.174520  0.297376        204  0.122856        530
3  month              0.143129  0.158892        109  0.015763         68
4  nintendo           0.134316  0.290087        199  0.155772        672
5  charge             0.122855  0.141399         97  0.018544         80
6  use                0.118448  0.206997        142  0.088549        382
7  new                0.113989  0.160350        110  0.046361        200
8  would              0.106540  0.164723        113  0.058183        251
9  get                0.104055  0.231778        159  0.127724        551
# Get the keywords that are mentioned significantly less in the less than or equal to 3 star reviews
ce.decreased_terms_df
   feature                diff  pos_prop  pos_count  neg_prop  neg_count
0  love              -0.216997  0.080175         55  0.297172       1282
1  great             -0.122247  0.099125         68  0.221372        955
2  fun               -0.048160  0.046647         32  0.094808        409
3  best              -0.042638  0.030612         21  0.073250        316
4  amaze             -0.038011  0.010204          7  0.048215        208
5  awesome           -0.035827  0.007289          5  0.043115        186
6  son love          -0.035564  0.002915          2  0.038479        166
7  perfect           -0.032515  0.008746          6  0.041261        178
8  easy              -0.026282  0.023324         16  0.049606        214
9  kid love          -0.024370  0.004373          3  0.028744        124
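The columns in these tables are consistent with diff = pos_prop - neg_prop (for "work", 0.278426 - 0.083449 ≈ 0.195), where the pos_ columns appear to describe the documents labeled True and the neg_ columns those labeled False. The sketch below reproduces that arithmetic on a toy corpus. It is not the module's implementation: the whitespace tokenizer is a stand-in for whatever preprocessing compExtract applies, no significance test is performed, and term_row is a hypothetical name.

```python
def term_row(corpus, labels, term):
    """Compute document-level counts and proportions of `term` per label group.

    Assumption: a document "contains" a term if the term is one of its
    lowercase, whitespace-separated tokens.
    """
    pos_docs = [doc for doc, flag in zip(corpus, labels) if flag]
    neg_docs = [doc for doc, flag in zip(corpus, labels) if not flag]

    def count(docs):
        # Number of documents in which the term appears at least once.
        return sum(term in doc.lower().split() for doc in docs)

    pos_count, neg_count = count(pos_docs), count(neg_docs)
    pos_prop = pos_count / len(pos_docs)
    neg_prop = neg_count / len(neg_docs)
    return {
        "feature": term,
        "diff": pos_prop - neg_prop,
        "pos_prop": pos_prop,
        "pos_count": pos_count,
        "neg_prop": neg_prop,
        "neg_count": neg_count,
    }


# Toy corpus: the first two documents play the role of low-star reviews.
corpus = [
    "does not work",
    "work stopped after a month",
    "love it",
    "great fun",
    "love this console",
    "it does work",
]
labels = [True, True, False, False, False, False]

print(term_row(corpus, labels, "work"))
print(term_row(corpus, labels, "love"))
```

A positive diff means the term is relatively more common in the True group, which is why "work" and "charge" rank high for low-star reviews while "love" and "great" have negative values.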

If we need more context on a given word, or we need more interpretable topics, we can:

  1. Output the reviews that contain the term
  2. Change the ngram_range

Output the reviews

Say we want to know more about the significant term "work". We can directly output all the reviews containing it.

The fitted object "ce" exposes a binary (one-hot encoded) document-term matrix, binary_dtm, with one row per review and one column per term found in the corpus. We can use it to look up the reviews that contain each term.

# The binary_dtm provides a convenient way to extract reviews with specific terms
print(ce.binary_dtm[['work']])
      work
0        0
1        0
2        0
3        0
4        0
...    ...
4995     1
4996     0
4997     0
4998     0
4999     0

[5000 rows x 1 columns]
# Keep the reviews whose "work" column is 1 (positional boolean mask)
reviews_contain_term_work = data['reviews'][ce.binary_dtm['work'].to_numpy() == 1]
len(reviews_contain_term_work)
551
# Print one randomly chosen review that contains the term
print(reviews_contain_term_work.sample(1).iloc[0])
I bought this as a Christmas present for my son.  After about a month and half of using it.  The switch stopped working.  It wont charge.  The product is an expensive piece of junk.
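To make the lookup concrete without installing the module, here is a minimal plain-Python sketch of the same idea. The dictionary built by build_binary_dtm is only a toy stand-in for ce.binary_dtm (which is a DataFrame in the example above), both function names are hypothetical, and the whitespace tokenizer is an assumption rather than the module's preprocessing.

```python
def build_binary_dtm(corpus):
    """Map each term to a list of 0/1 flags, one flag per document."""
    tokenized = [set(doc.lower().split()) for doc in corpus]
    vocabulary = sorted(set().union(*tokenized))
    return {
        term: [int(term in tokens) for tokens in tokenized]
        for term in vocabulary
    }


def docs_with_term(corpus, binary_dtm, term):
    """Return the documents whose flag for `term` is 1 (empty if the term is unknown)."""
    flags = binary_dtm.get(term, [0] * len(corpus))
    return [doc for doc, flag in zip(corpus, flags) if flag == 1]


corpus = [
    "It does not work",
    "Love it",
    "Stopped work after a month",
]
dtm = build_binary_dtm(corpus)

print(dtm["work"])
for doc in docs_with_term(corpus, dtm, "work"):
    print(doc)
```

Because the matrix is binary, a review that mentions a term several times still counts once, which matches the document-level counts reported in the tables.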

Change the n-gram range to exclude uni-grams

ce_ngram = ComparativeExtraction(corpus = data['reviews'], labels = label).get_distinguish_terms(ngram_range=(2,4), top_n=10)
<compExtract.ComparativeExtraction at 0x7ff955f23cf8>
ce_ngram.increased_terms_df
   feature                diff  pos_prop  pos_count  neg_prop  neg_count
0  joy con            0.040857  0.056851         39  0.015994         69
1  brand new          0.020511  0.027697         19  0.007186         31
2  nintendo switch    0.019638  0.074344         51  0.054706        236
3  buy switch         0.018888  0.027697         19  0.008809         38
4  play game          0.014092  0.039359         27  0.025267        109
5  game play          0.009812  0.021866         15  0.012054         52
6  year old           0.005243  0.023324         16  0.018081         78
7  christmas gift     0.003682  0.014577         10  0.010895         47
8  battery life       0.001833  0.024781         17  0.022949         99
9  wii u              0.000504  0.016035         11  0.015531         67
ce_ngram.decreased_terms_df
   feature                diff  pos_prop  pos_count  neg_prop  neg_count
0  son love          -0.035564  0.002915          2  0.038479        166
1  kid love          -0.024370  0.004373          3  0.028744        124
2  great game        -0.018442  0.007289          5  0.025730        111
3  great product     -0.014171  0.004373          3  0.018544         80
4  great console     -0.013641  0.005831          4  0.019471         84
5  best console      -0.013609  0.001458          1  0.015067         65
6  highly recommend  -0.012615  0.002915          2  0.015531         67
7  absolutely love   -0.011987  0.001458          1  0.013445         58
8  game system       -0.011746  0.021866         15  0.033611        145
9  love switch       -0.011452  0.013120          9  0.024571        106
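For readers unfamiliar with n-gram ranges, the sketch below shows which candidate terms a range such as (2, 4) generates from one tokenized review. It assumes that ngram_range follows scikit-learn's inclusive (min_n, max_n) convention, which seems likely given that the module relies on scikit-learn's text vectorizer but is not confirmed here. The ngrams function and the token list are illustrative only.

```python
def ngrams(tokens, ngram_range):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Hypothetical tokens after preprocessing (lowercased, lemmatized, stop words removed).
tokens = ["joy", "con", "stop", "work"]

print(ngrams(tokens, (1, 1)))  # uni-grams only
print(ngrams(tokens, (2, 4)))  # bi-grams up to 4-grams, no uni-grams
```

Raising the lower bound to 2 removes generic single words such as "switch" or "get" from the candidates, at the cost of much smaller counts per term, as the tables above show.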

Download files

Source distribution: compExtract-0.1.2.tar.gz (8.1 kB)
