Extract keywords via comparison of corpora
Introduction
This module helps you extract key terms and topics from a corpus using a comparative approach: it contrasts how often terms appear in two labeled groups of documents.
Installation
Usage
Import packages
from compExtract import ComparativeExtraction
Load sample data
import pandas as pd
import numpy as np
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
label = [x <= 3 for x in data['stars']]
data
index | stars | titles | reviews | dates |
---|---|---|---|---|
0 | 5.0 | Worth It\n | Definitely worth the money!\n | September 21, 2019 |
1 | 2.0 | Nintendo Swich gris joy con\n | Con este producto no he sentido mucha satisfac... | September 20, 2019 |
2 | 5.0 | My kid wont put it down\n | Couldnt of been happier, came early. I was th... | September 20, 2019 |
3 | 3.0 | Happy\n | Happy\n | September 20, 2019 |
4 | 5.0 | Great\n | Great product\n | September 19, 2019 |
... | ... | ... | ... | ... |
4995 | 1.0 | One Star\n | it is no good, it suck, no work, plz hlp amazon\n | December 12, 2017 |
4996 | 5.0 | A must have gaming system\n | The Nintendo Switch is a versatile hybrid game... | December 12, 2017 |
4997 | 5.0 | Switch\n | This purchase save me from looking for one.\n | December 11, 2017 |
4998 | 5.0 | Five Stars\n | Best babysitter ever!\n | December 11, 2017 |
4999 | 5.0 | Five Stars\n | Its a great game console.\n | December 11, 2017 |
5000 rows × 4 columns
data.columns
Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')
Here we use Amazon reviews of the Nintendo Switch to illustrate how the module is used.
The module takes a corpus and a set of binary labels as inputs. The labels should be defined according to the question you want to answer, and there must be exactly one label per document in the corpus.
Here, let's assume we want to know why people dislike this product and to find the relevant keywords. To answer this question, we define the label as a binary variable indicating whether a reviewer gave 3 stars or fewer.
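As a minimal sketch of the labeling step described above, the snippet below builds labels from a handful of star ratings and checks that they line up with the corpus. The ratings and review texts are toy values made up for illustration; they are not taken from the sample file.

```python
# Toy star ratings and review texts (illustrative values, not the real data)
stars = [5.0, 2.0, 5.0, 3.0, 1.0]
reviews = ["Worth it", "Stopped working", "Great", "Happy", "No good"]

# True marks the group of interest: reviews with 3 stars or fewer
labels = [s <= 3 for s in stars]

# The module expects exactly one label per document in the corpus
assert len(labels) == len(reviews)

print(labels)
```

A different question would call for a different rule, for example labeling only 1-star reviews as True.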
Initialize the module with the review corpus and labels
ce = ComparativeExtraction(corpus = data['reviews'], labels = label)
Extract the keywords
ce.get_distinguish_terms(ngram_range = (1,3),top_n = 10)
<compExtract.ComparativeExtraction at 0x7ff96f84b588>
# Get the keywords that are mentioned significantly more often in reviews with 3 stars or fewer
ce.increased_terms_df
index | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | work | 0.194976 | 0.278426 | 191 | 0.083449 | 360 |
1 | switch | 0.176764 | 0.351312 | 241 | 0.174548 | 753 |
2 | buy | 0.174520 | 0.297376 | 204 | 0.122856 | 530 |
3 | month | 0.143129 | 0.158892 | 109 | 0.015763 | 68 |
4 | nintendo | 0.134316 | 0.290087 | 199 | 0.155772 | 672 |
5 | charge | 0.122855 | 0.141399 | 97 | 0.018544 | 80 |
6 | use | 0.118448 | 0.206997 | 142 | 0.088549 | 382 |
7 | new | 0.113989 | 0.160350 | 110 | 0.046361 | 200 |
8 | would | 0.106540 | 0.164723 | 113 | 0.058183 | 251 |
9 | get | 0.104055 | 0.231778 | 159 | 0.127724 | 551 |
# Get the keywords that are mentioned significantly less often in reviews with 3 stars or fewer
ce.decreased_terms_df
index | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | love | -0.216997 | 0.080175 | 55 | 0.297172 | 1282 |
1 | great | -0.122247 | 0.099125 | 68 | 0.221372 | 955 |
2 | fun | -0.048160 | 0.046647 | 32 | 0.094808 | 409 |
3 | best | -0.042638 | 0.030612 | 21 | 0.073250 | 316 |
4 | amaze | -0.038011 | 0.010204 | 7 | 0.048215 | 208 |
5 | awesome | -0.035827 | 0.007289 | 5 | 0.043115 | 186 |
6 | son love | -0.035564 | 0.002915 | 2 | 0.038479 | 166 |
7 | perfect | -0.032515 | 0.008746 | 6 | 0.041261 | 178 |
8 | easy | -0.026282 | 0.023324 | 16 | 0.049606 | 214 |
9 | kid love | -0.024370 | 0.004373 | 3 | 0.028744 | 124 |
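The numbers in the tables above are consistent with diff being pos_prop minus neg_prop, where "pos" refers to documents labeled True (here, 3 stars or fewer) and "neg" to the rest. The sketch below reproduces that arithmetic on a toy tokenized corpus. It is not the module's implementation: the function name term_proportion_diff and the toy documents are assumptions for illustration, and any significance testing the module may perform is not modeled.

```python
from collections import Counter

def term_proportion_diff(docs, labels):
    """Per-term document proportions in each group and their difference.

    docs: list of token lists; labels: list of bools of the same length.
    Returns {term: (diff, pos_prop, pos_count, neg_prop, neg_count)}.
    Hypothetical helper, written only to illustrate the table columns.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both label groups must contain at least one document")

    pos_counts, neg_counts = Counter(), Counter()
    for tokens, is_pos in zip(docs, labels):
        # set(): each term counts once per document (binary presence)
        (pos_counts if is_pos else neg_counts).update(set(tokens))

    rows = {}
    for term in set(pos_counts) | set(neg_counts):
        pos_prop = pos_counts[term] / n_pos
        neg_prop = neg_counts[term] / n_neg
        rows[term] = (pos_prop - neg_prop, pos_prop, pos_counts[term],
                      neg_prop, neg_counts[term])
    return rows

# Toy, already-tokenized documents (made up for illustration)
docs = [["stop", "work"], ["love", "game"], ["work", "charge"], ["love", "work"]]
labels = [True, False, True, False]

rows = term_proportion_diff(docs, labels)
# Largest positive diff first, mirroring increased_terms_df
for term, row in sorted(rows.items(), key=lambda kv: -kv[1][0]):
    print(term, row)
```

Terms with a large positive diff are over-represented in the True group, and terms with a large negative diff are over-represented in the False group.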
If we need more context on a given term, or more interpretable topics, we can:
- Output the reviews that contain the term
- Change the ngram_range
Output the reviews
Say we want to know more about the significant term "work". We can directly output all the reviews containing the term.
The output object "ce" contains a binary (one-hot encoded) document-term matrix, binary_dtm, that covers all the terms found in the corpus. We can use it to find the reviews that correspond to each term.
# The binary_dtm provides a convenient way to extract reviews with specific terms
print(ce.binary_dtm[['work']])
work
0 0
1 0
2 0
3 0
4 0
... ...
4995 1
4996 0
4997 0
4998 0
4999 0
[5000 rows x 1 columns]
reviews_contain_term_work = data['reviews'][[x == 1 for x in ce.binary_dtm['work']]]
len(reviews_contain_term_work)
551
for x in pd.Series(reviews_contain_term_work).sample(1):
    print(x)
I bought this as a Christmas present for my son. After about a month and half of using it. The switch stopped working. It wont charge. The product is an expensive piece of junk.
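To make the idea of a binary document-term matrix concrete, here is a small library-free sketch: each term maps to a list of 0/1 flags, one per review, and the flags are then used to pull out the matching reviews. The reviews, the contains_term helper, and the whitespace tokenization are assumptions for illustration; the module's own preprocessing is likely richer (the terms in the tables above look lemmatized).

```python
# Toy reviews (made up for illustration)
reviews = [
    "The switch stopped working after a month",
    "My son loves it",
    "It does not work and will not charge",
]

def contains_term(text, term):
    # Naive lowercase whitespace tokenization with no lemmatization,
    # so "working" does NOT match "work" in this sketch
    return int(term in text.lower().split())

terms = ["work", "charge"]

# One column of 0/1 flags per term, one flag per review
binary_dtm = {t: [contains_term(r, t) for r in reviews] for t in terms}

# Keep only the reviews whose flag for "work" is 1
matching = [r for r, flag in zip(reviews, binary_dtm["work"]) if flag == 1]

print(binary_dtm)
print(matching)
```

The list comprehension used earlier on ce.binary_dtm['work'] follows the same pattern: the 0/1 column acts as a filter over the reviews.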
Change the n-gram range to exclude uni-grams
ce_ngram = ComparativeExtraction(corpus = data['reviews'], labels = label).get_distinguish_terms(ngram_range=(2,4), top_n=10)
/Users/xiaoma/envs/compExtract/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn("The parameter 'token_pattern' will not be used"
<compExtract.ComparativeExtraction at 0x7ff955f23cf8>
ce_ngram.increased_terms_df
index | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | joy con | 0.040857 | 0.056851 | 39 | 0.015994 | 69 |
1 | brand new | 0.020511 | 0.027697 | 19 | 0.007186 | 31 |
2 | nintendo switch | 0.019638 | 0.074344 | 51 | 0.054706 | 236 |
3 | buy switch | 0.018888 | 0.027697 | 19 | 0.008809 | 38 |
4 | play game | 0.014092 | 0.039359 | 27 | 0.025267 | 109 |
5 | game play | 0.009812 | 0.021866 | 15 | 0.012054 | 52 |
6 | year old | 0.005243 | 0.023324 | 16 | 0.018081 | 78 |
7 | christmas gift | 0.003682 | 0.014577 | 10 | 0.010895 | 47 |
8 | battery life | 0.001833 | 0.024781 | 17 | 0.022949 | 99 |
9 | wii u | 0.000504 | 0.016035 | 11 | 0.015531 | 67 |
ce_ngram.decreased_terms_df
index | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | son love | -0.035564 | 0.002915 | 2 | 0.038479 | 166 |
1 | kid love | -0.024370 | 0.004373 | 3 | 0.028744 | 124 |
2 | great game | -0.018442 | 0.007289 | 5 | 0.025730 | 111 |
3 | great product | -0.014171 | 0.004373 | 3 | 0.018544 | 80 |
4 | great console | -0.013641 | 0.005831 | 4 | 0.019471 | 84 |
5 | best console | -0.013609 | 0.001458 | 1 | 0.015067 | 65 |
6 | highly recommend | -0.012615 | 0.002915 | 2 | 0.015531 | 67 |
7 | absolutely love | -0.011987 | 0.001458 | 1 | 0.013445 | 58 |
8 | game system | -0.011746 | 0.021866 | 15 | 0.033611 | 145 |
9 | love switch | -0.011452 | 0.013120 | 9 | 0.024571 | 106 |
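For readers unfamiliar with ngram_range, the sketch below shows how word n-grams are enumerated for an inclusive (min_n, max_n) range, which is the convention scikit-learn's text vectorizers use. The word_ngrams function and the token list are hypothetical and only illustrate the idea; they are not the module's code.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

# Toy token list (made up for illustration)
tokens = ["joy", "con", "stop", "work"]

# (2, 4) drops single words and keeps 2-, 3- and 4-word phrases
print(word_ngrams(tokens, (2, 4)))
```

With (1, 3), single words would be included as well, which is why the first set of results is dominated by uni-grams such as "work" and "love".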