keyword extraction

Project description


This module helps you extract key terms and topics from a corpus using a comparative approach.


pip install --upgrade comparativeExtraction


Import packages

from comparativeExtraction import comparative_keyword_extraction

Load sample data

import pandas as pd
import numpy as np
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
# Binary label: True for reviews with 3 stars or fewer
label = [x <= 3 for x in data['stars']]
# Inspect the available columns
data.columns
# Output: Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')

Here we use online Amazon reviews of the Nintendo Switch to illustrate how to use the module.

The module requires a corpus and a set of binary labels as inputs. The labels should be created according to the question we are trying to answer, and there must be exactly one label per document in the corpus.

Here, let's assume that we want to know why people dislike this product and find the relevant keywords. To answer this question, we create a binary label indicating whether a reviewer gave 3 stars or fewer.
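As a minimal sketch of the labeling step, the snippet below uses hypothetical toy lists in place of the `reviews` and `stars` columns. It only illustrates how the binary label is derived and checked against the corpus length; it does not call the module.

```python
# Hypothetical toy data standing in for data['reviews'] and data['stars'].
reviews = [
    "Love it, great console",
    "Joy-con stopped working after a month",
    "Wi-Fi does not work",
    "Fun for the whole family",
]
stars = [5, 2, 1, 4]

# True marks a review with 3 stars or fewer (the group we want to explain).
labels = [s <= 3 for s in stars]

# The module expects exactly one label per document.
assert len(labels) == len(reviews)
print(labels)
```

A different question calls for a different label. For example, `[s == 5 for s in stars]` would instead contrast 5-star reviews with all the others.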

Initialize the module with the review corpus and labels

kw_init = comparative_keyword_extraction(corpus = data['reviews'], labels = label)

Extract the keywords

kw = kw_init.get_distinguishing_terms(ngram_range = (1,3),top_n = 10)
# Get the keywords that are mentioned significantly more in the less than or equal to 3 star reviews
# Get the keywords that are mentioned significantly less in the less than or equal to 3 star reviews
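To make the idea of a comparative approach concrete, here is a rough sketch that ranks terms by the difference in document frequency between the two label groups. It is an illustration under simplifying assumptions (whitespace tokenization, no significance test) and not the module's actual implementation; the helper names `doc_freq` and `rate_differences` are hypothetical.

```python
from collections import Counter

def doc_freq(docs):
    """Count how many documents each lowercase token appears in."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts

def rate_differences(corpus, labels):
    """Map each term to (share of labeled docs) - (share of unlabeled docs)."""
    pos = [d for d, flag in zip(corpus, labels) if flag]
    neg = [d for d, flag in zip(corpus, labels) if not flag]
    pos_df, neg_df = doc_freq(pos), doc_freq(neg)
    terms = set(pos_df) | set(neg_df)
    # Assumes both groups are non-empty; otherwise this divides by zero.
    return {t: pos_df[t] / len(pos) - neg_df[t] / len(neg) for t in terms}

corpus = [
    "wifi does not work",
    "controller does not work",
    "great games great fun",
    "fun games for kids",
]
labels = [True, True, False, False]

diffs = rate_differences(corpus, labels)
# Terms over-represented in the labeled group, ties broken alphabetically.
more = sorted(diffs, key=lambda t: (-diffs[t], t))[:3]
# Terms under-represented in the labeled group.
less = sorted(diffs, key=lambda t: (diffs[t], t))[:3]
print(more)
print(less)
```

A raw difference in rates is easy to read but sensitive to rare terms, which is one reason a statistical test is usually preferred on real data.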

If we need more context on a given word, or we need more interpretable topics, we can:

  1. Output the reviews that contain the term
  2. Change the ngram_range
  3. Use the supplement functions module

Output the reviews

Say we want to know more about the significant term "work". We can directly output all the reviews containing the term.

The output object "kw" contains a one-hot encoded document-term matrix covering all the terms found in the corpus. We can use it to find the reviews that correspond to each term.

# The binary_dtm provides a convenient way to extract reviews with specific terms
kw.binary_dtm[['work', 'not']]
      work  not
0        0    0
1        0    0
2        0    0
3        0    0
4        0    0
...    ...  ...
4995     1    0
4996     0    1
4997     0    0
4998     0    0
4999     0    0

[5000 rows x 2 columns]
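# Keep only the reviews whose "work" column equals 1
reviews_contain_term_work = data['reviews'][[x == 1 for x in kw.binary_dtm['work']]]
# Print one randomly sampled review that contains the term
for x in pd.Series(reviews_contain_term_work).sample(1):
    print(x)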
It's alright, only got it to give Nintendo another chance. It's a neat concept. Overall, it's aggressively mediocre, good for casual stuff, but will never get as much use as my ps4.Wi-Fi is god awful though. The worst I've dealt with. It's connection capabilities are atrocious compared with any other wireless device. Don't expect it to just work. Honestly, this singular problem is enough for me to rate it 1 star. I suppose they had to cut corners somewhere.
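The lookup above relies on a binary document-term matrix. As a rough sketch of that data structure, assuming simple whitespace tokenization and a plain dictionary instead of a DataFrame, it could be built and queried as follows. The helper name `binary_dtm` is hypothetical and this is not the module's code.

```python
def binary_dtm(docs):
    """Build {term: [0 or 1 per document]} from whitespace-tokenized documents."""
    token_sets = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*token_sets))
    return {term: [int(term in tokens) for tokens in token_sets] for term in vocab}

docs = ["does not work", "works great", "work in progress"]
dtm = binary_dtm(docs)

# Select the documents whose "work" column is 1.
matches = [doc for doc, flag in zip(docs, dtm["work"]) if flag == 1]
print(dtm["work"])
print(matches)
```

Note that exact token matching treats "works" and "work" as different terms, so the second document is not selected.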

Change the n-gram range to exclude uni-grams

kw = kw_init.get_distinguishing_terms(ngram_range = (2,4),top_n = 10)
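To see what a range of (2, 4) produces, here is a small sketch that lists the word n-grams of one sentence. It assumes the bounds are inclusive, as in scikit-learn's `CountVectorizer`, and uses whitespace tokenization; the helper `ngrams_in_range` is hypothetical and not part of the module.

```python
def ngrams_in_range(text, ngram_range):
    """Return all word n-grams of `text` with n between the two bounds, inclusive."""
    tokens = text.lower().split()
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

grams = ngrams_in_range("does not work well", (2, 4))
print(grams)
```

Dropping uni-grams removes ambiguous single words such as "work" and keeps phrases such as "does not work", which are usually easier to interpret.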

Use the supplement functions

When we want to drill down into one specific term, we can use the built-in supplement functions to find related n-grams that contain the term.

from comparativeExtraction.supplement_funcs import get_ngrams_on_term
target_term = "work"
reviews_contain_term_work = data['reviews'][[x == 1 for x in kw.binary_dtm['work']]]

related_ngrams = get_ngrams_on_term(target_term,reviews_contain_term_work,filter_by_extreme=False)

Here, the count is also a document frequency, that is, the number of reviews in which each n-gram appears.
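As an illustration of counting n-grams around a target term by document frequency, consider the sketch below. It is not the implementation of `get_ngrams_on_term`; the helper `ngrams_with_term`, its parameters, and the toy documents are assumptions made for the example.

```python
from collections import Counter

def ngrams_with_term(docs, term, n=2):
    """For each n-gram containing `term`, count the documents it occurs in."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # A set ensures each n-gram is counted at most once per document.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(" ".join(g) for g in grams if term in g)
    return counts

docs = [
    "does not work and does not work again",
    "wifi does not work",
    "work fine",
]
counts = ngrams_with_term(docs, "work")
print(counts.most_common(1))
```

Although "not work" occurs twice in the first document, it contributes only once there, so its count of 2 reflects two documents rather than three occurrences.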

Download files

Files for comparativeExtraction, version 0.0.7

comparativeExtraction-0.0.7.tar.gz (12.8 kB), file type: Source
