Search the context where a token appears

Introduction

Given a word token and a corpus in which it appears, this package helps you find and analyze the context around each occurrence of the word. You can use it to add context to a bag-of-words analysis.

Installation

pip install contextSearching

Usage

To illustrate the usage, we take the term "break" and a corpus of Amazon reviews of the Nintendo Switch in which reviewers used that term.

From a simple bag-of-words analysis, we know that whenever people mention "break", the product is likely to receive a low star rating. But we do not know what breaks or any other context around "break."

Preparation

"""
Preparation
"""
import pandas as pd
import numpy as np
# read in corpus
corpus = pd.read_csv("data/switch_w_break.csv")
# define the target token
target = "break"

corpus.head()
   stars  titles                               reviews                                            dates
0  1.0    Already broken parts\n               Only 3 months later and parts are breaking. Th...  September 13, 2019
1  1.0    Dock is broken\n                     Hey. This was supposed to work. Dock is broken...  September 11, 2019
2  5.0    Dependable seller\n                  Arrived on time, well packed for the trip. N...    August 10, 2019
3  1.0    Nintendo Does Not Honor Warranty\n   My son used this unit for 7 months. At which ...   August 8, 2019
4  4.0    Great product, Joycons need work.\n  Everyone knows the switch is great. I waited a...  August 5, 2019

Load the package and initialize the class

"""
Load the package and initialize the class
"""
from contextSearching import context_searching

cs = context_searching(target_token=target, doc=corpus['reviews'], left_window=5, right_window=5, padding_token="_empty_")

In addition to the target token and the corpus, the class takes three more arguments: left_window, right_window and padding_token.

The algorithm finds the target token and collects every word within the specified window around it.

For example, when left_window is set to 10, it finds the target token within each document of the corpus, then collects the ten words to the left of the target, recording the relative position of each. If there are fewer than 10 words to the left, the algorithm fills the remaining positions with the padding token.
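The windowing step described above can be sketched in a few lines. This is only an illustration under our own assumptions, not the package's implementation: the helper name collect_window is hypothetical, tokens are split on whitespace, and only the first occurrence of the target is used.

```python
def collect_window(tokens, target, left_window, right_window, padding_token="_empty_"):
    """Return the tokens around the first occurrence of `target`.

    The result has left_window + 1 + right_window entries. Positions that
    fall outside the document are filled with `padding_token`.
    Returns None when the target does not occur in `tokens`.
    """
    if target not in tokens:
        return None
    center = tokens.index(target)  # first occurrence only (assumption)
    window = []
    for offset in range(-left_window, right_window + 1):
        i = center + offset
        window.append(tokens[i] if 0 <= i < len(tokens) else padding_token)
    return window


# Toy document, split on whitespace (assumption: no stop-word removal or lemmatization)
tokens = "the left joycon will break after one month".split()
window = collect_window(tokens, "break", left_window=5, right_window=5)
print(window)
# ['_empty_', 'the', 'left', 'joycon', 'will', 'break', 'after', 'one', 'month', '_empty_', '_empty_']
```

Padding goes on the far side of the window, so the target always sits in the middle entry regardless of how short the document is.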

Get the Context Probing Matrix

"""
Get the Context Probing Matrix
"""
# Get a list of stopwords
from gensim.parsing.preprocessing import STOPWORDS
stopwords = list(STOPWORDS)
contextPMat = cs.get_context_prob_matrix(stop_words=stopwords, lemmatize=True, stem=False)

Assuming the corpus has N documents and left_window and right_window are both set to 5, the Context Probing Matrix (CPM) is an N by 11 matrix (5 positions to the left, the target itself, and 5 positions to the right), like the one below:

# We can examine the actual CPM like this:
cpm_df = pd.DataFrame(np.array(contextPMat.context_prob_matrix))
cpm_df.columns = [str(x) for x in contextPMat.position_idx]
cpm_df.head()
   -5       -4       -3         -2        -1        0      1        2            3             4        5
0  _empty_  _empty_  _empty_    month     later     break  joy      button       work          replace  _empty_
1  _empty_  hey      suppose    work      dock      break  replace  dock         pretty        angry    _empty_
2  _empty_  arrive   time       pack      trip      break  work     great        _empty_       _empty_  _empty_
3  gadget   year     expensive  relative  function  break  quickly  support      manufacturer  beware   nintendo
4  issue    real     button     leave     joycon    break  month    fortunately  nintendo      replace  warranty

The column index indicates the relative position. For example, in the first document, the word "button" appears two words to the right of the target term "break".
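To make the column convention concrete, here is a small sketch that looks up a token by its relative position. The row is copied from the first row of the CPM output above. The helper token_at is hypothetical and only illustrates the indexing, not how the package stores the matrix.

```python
left_window, right_window = 5, 5

# Relative positions, one per CPM column: [-5, ..., 0, ..., 5]
position_idx = list(range(-left_window, right_window + 1))

# First row of the CPM shown above
row = ["_empty_", "_empty_", "_empty_", "month", "later", "break",
       "joy", "button", "work", "replace", "_empty_"]


def token_at(row, position_idx, relative_position):
    """Return the token stored at the given relative position (hypothetical helper)."""
    return row[position_idx.index(relative_position)]


print(len(position_idx))               # 11
print(token_at(row, position_idx, 0))  # break
print(token_at(row, position_idx, 2))  # button
```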

Get the vocabs dictionary

"""
Get the vocabs dictionary
"""
contextPMat.vocabs['joycon']
[-1, -1, -4, -2, -2, -1, -2, 1]

The .vocabs attribute is a dictionary whose keys are the unique tokens collected while constructing the CPM, and whose values are lists of the relative positions at which each token was recorded.

In the output above, we see that the term "joycon" appears 8 times in total within the +-5 window of the target term. Seven of those eight occurrences are to the left of the target term.
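A dictionary of this shape can be derived from the matrix rows as sketched below. This is an assumption about the general idea rather than the package's code: build_vocabs is a hypothetical helper, and we assume padding tokens are skipped. The two rows are the first two rows of the CPM output shown earlier.

```python
from collections import defaultdict


def build_vocabs(rows, position_idx, padding_token="_empty_"):
    """Map each token to the list of relative positions at which it was seen."""
    vocabs = defaultdict(list)
    for row in rows:
        for position, token in zip(position_idx, row):
            if token != padding_token:  # assumption: padding is not recorded
                vocabs[token].append(position)
    return dict(vocabs)


position_idx = list(range(-5, 6))
rows = [
    ["_empty_", "_empty_", "_empty_", "month", "later", "break",
     "joy", "button", "work", "replace", "_empty_"],
    ["_empty_", "hey", "suppose", "work", "dock", "break",
     "replace", "dock", "pretty", "angry", "_empty_"],
]

vocabs = build_vocabs(rows, position_idx)
print(vocabs["dock"])     # [-1, 2]
print(vocabs["replace"])  # [4, 1]
print(vocabs["work"])     # [3, -2]
```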

Get the statistics table for each term

"""
Get the statistics table for each term
"""
cpm_stats = contextPMat.get_cpm_stats_tb()
cpm_stats.cpm_stats_tb.head()
   tokens       mean  variance  abs_mean  count  median
0   month   0.142857  3.979592  1.857143     14     1.0
1   later   0.000000  2.000000  1.333333      3    -1.0
2   break   0.186441  0.643206  0.186441    118     0.0
3     joy  -1.444444  6.024691  2.555556      9    -2.0
4  button  -0.500000  2.583333  1.500000      6    -1.0

To understand the context, we can look at the statistics of relative positions for each term collected above.

For example:

When a term's count is high, we know that it often appears near the target token.

When the variance of a term's relative positions is low, we know that it tends to appear at the same relative location.
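As a rough illustration, the same five statistics can be computed for a single term with the standard library. The positions are the "joycon" list from the vocabs output above, and position_stats is a hypothetical helper. We use the population variance, which matches the "later" row of the table (positions -1, -1, 2 give a variance of 2.0), and we take abs_mean to be the mean of the absolute positions, which is consistent with the same row.

```python
from statistics import mean, median, pvariance


def position_stats(positions):
    """Summarize the relative positions recorded for one term."""
    return {
        "mean": mean(positions),
        "variance": pvariance(positions),              # population variance
        "abs_mean": mean([abs(p) for p in positions]),  # mean of absolute positions
        "count": len(positions),
        "median": median(positions),
    }


joycon_positions = [-1, -1, -4, -2, -2, -1, -2, 1]
stats = position_stats(joycon_positions)
print(stats)
# {'mean': -1.5, 'variance': 1.75, 'abs_mean': 1.75, 'count': 8, 'median': -1.5}
```

A negative mean with a modest variance, as here, suggests that "joycon" usually sits one or two words to the left of "break".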

Infer potential N-grams containing the target term

"""
Infer potential N-grams containing the term
"""
cpm_stats.guess_ngram(n = 5)
   ngram_candidates                         total_scores
0  look expensive joycon controller break   0.327352
1  expensive joycon controller break month  0.330129
2  joycon controller break month handle     0.337807
3  controller break month handle inside     0.379144
4  break month handle inside item           0.447479

Based on the statistics table, the algorithm can infer the most likely n-grams containing the target term.

For example, to infer which word most likely appears immediately to the left of "break" (i.e. at relative position -1), the algorithm goes through the following steps:

  1. Start with a word collected while constructing the CPM above (e.g. "controller").
  2. For the word, take the mean of its observed relative positions, subtract -1 from it, and take the absolute value.
  3. For the word, take the median of its observed relative positions, subtract -1 from it, and take the absolute value.
  4. Calculate 1/count.
  5. Calculate the variance of the word's relative positions.
  6. Repeat the above for all collected words to obtain 4 lists of metrics (absolute mean difference, absolute median difference, 1/count, variance).
  7. Normalize the 4 lists.
  8. For each collected word, multiply its 4 metrics by user-defined weights and sum the results to get a final score.

The best candidate words at location -1 will have the smallest final score.
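The scoring steps can be sketched as follows. Several details are our own assumptions, since they are not specified above: min-max normalization, equal default weights, population variance, and the toy position lists. The helper names min_max and score_candidates are hypothetical and not part of the package.

```python
from statistics import mean, median, pvariance


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_candidates(vocabs, position, weights=(1.0, 1.0, 1.0, 1.0)):
    """Rank tokens for one relative position; a smaller score is a better candidate."""
    tokens = list(vocabs)
    metrics = [
        [abs(mean(vocabs[t]) - position) for t in tokens],    # abs mean difference
        [abs(median(vocabs[t]) - position) for t in tokens],  # abs median difference
        [1 / len(vocabs[t]) for t in tokens],                 # 1/count
        [pvariance(vocabs[t]) for t in tokens],               # variance
    ]
    normalized = [min_max(m) for m in metrics]
    scores = {
        t: sum(w * column[i] for w, column in zip(weights, normalized))
        for i, t in enumerate(tokens)
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Toy position lists (made up for illustration, not taken from the corpus)
toy_vocabs = {
    "controller": [-1, -1, -1, -2],
    "month": [1, 2, 1, 3, -2],
    "nintendo": [5, -4, 3],
}
ranking = score_candidates(toy_vocabs, position=-1)
print(ranking[0][0])  # controller
```

With these toy lists, "controller" ranks first for position -1 because its positions cluster tightly around -1, while "nintendo" ranks last because it is rare and scattered.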

When we want to find the most likely n-grams, the algorithm considers an n-gram with the target token in each possible position. Thus in the example output above (n = 5), the target term "break" appears as the 5th, 4th, 3rd, 2nd and 1st term of the n-gram, respectively.
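The possible placements of the target inside an n-gram can be enumerated as lists of relative positions. The helper ngram_offsets below is hypothetical and only illustrates the idea: each list contains position 0 (the target) and the neighbouring positions that would each need a best-scoring word.

```python
def ngram_offsets(n):
    """For each placement of the target in an n-gram, list the relative positions of its slots.

    Placements are ordered from target-last to target-first, matching the
    order of the example output above.
    """
    return [list(range(-k, n - k)) for k in range(n - 1, -1, -1)]


for offsets in ngram_offsets(5):
    print(offsets)
# [-4, -3, -2, -1, 0]
# [-3, -2, -1, 0, 1]
# [-2, -1, 0, 1, 2]
# [-1, 0, 1, 2, 3]
# [0, 1, 2, 3, 4]
```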

Now we get more context around "break":

The problem seems to be that the expensive Joycon controller breaks within months.

Some Notes

  1. Currently, when the target term appears more than once in a single document, the CPM only takes the first occurrence into consideration. I will try to improve this in the near future.

  2. This method works better on a large number of short documents. It will not work well on, for example, a collection of news articles.
