Search the context where a token appears
Given a word token and a corpus where this word appears, this package helps you find and analyze the context in which the word appears. It can be easily leveraged to improve your bag-of-words based analysis.
pip install contextSearching
To illustrate the usage, we take the term "break" and a corpus of Amazon reviews for the Nintendo Switch in which reviewers used that term.
From a simple bag-of-words analysis, we know that whenever people mention "break", the product is likely to receive a low star rating. But we do not know what breaks or any other context around "break."
    """ Preparation """
    import pandas as pd
    import numpy as np

    # read in corpus
    corpus = pd.read_csv("data/switch_w_break.csv")

    # define the target token
    target = "break"

    corpus.head()
|0||1.0||Already broken parts\n||Only 3 months later and parts are breaking. Th...||September 13, 2019|
|1||1.0||Dock is broken\n||Hey. This was supposed to work. Dock is broken...||September 11, 2019|
|2||5.0||Dependable seller\n||Arrived on time, well packed for the trip. N...||August 10, 2019|
|3||1.0||Nintendo Does Not Honor Warranty\n||My son used this unit for 7 months. At which ...||August 8, 2019|
|4||4.0||Great product, Joycons need work.\n||Everyone knows the switch is great. I waited a...||August 5, 2019|
Load the package and initialize the class
    """ Load the package and initialize the class """
    from contextSearching import context_searching

    cs = context_searching(
        target_token=target,
        doc=corpus['reviews'],
        left_window=5,
        right_window=5,
        padding_token="_empty_",
    )
In addition to the target token and the corpus, the class requires three more inputs: left_window, right_window and padding_token.
The algorithm takes in the target token and aggressively collects all the words within the specified window.
For example, when left_window is set to 10, the algorithm finds the target token within each document of the corpus, then collects the ten words to the left of the target, recording each word's relative position. If there are fewer than 10 words to the left, the algorithm pads the word list with the padding token.
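The windowing step described above can be sketched in a few lines of plain Python. This is only an illustration, not the package's own code: the function name `collect_window` is hypothetical, tokens are matched exactly (no lemmatizing or stemming), and padding is assumed to go on the outer edges so that every row has the same width.

```python
def collect_window(tokens, target, left_window, right_window, padding_token="_empty_"):
    """Return left context + target + right context around the first match, padded to a fixed width."""
    if target not in tokens:
        return []  # the target does not occur in this document
    idx = tokens.index(target)  # first occurrence only
    left = tokens[max(0, idx - left_window):idx]
    right = tokens[idx + 1:idx + 1 + right_window]
    # Assumption: pad on the outer edges so column i always maps to position i - left_window.
    left = [padding_token] * (left_window - len(left)) + left
    right = right + [padding_token] * (right_window - len(right))
    return left + [target] + right


doc = "the left joycon started to break after three months".split()
row = collect_window(doc, "break", left_window=5, right_window=5)
print(row)
```

Each returned row has left_window + right_window + 1 entries, so a word's relative position can be read off its column index.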
Get the Context Probing Matrix
    """ Get the Context Probing Matrix """
    # Get a list of stopwords
    from gensim.parsing.preprocessing import STOPWORDS

    stopwords = list(STOPWORDS)
    contextPMat = cs.get_context_prob_matrix(stop_words=stopwords, lemmatize=True, stem=False)
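Assuming we have N documents in the corpus and both left_window and right_window are set to 5, the Context Probing Matrix (CPM) is an N by 11 matrix: one column for each of the 5 positions to the left, one for the target itself, and one for each of the 5 positions to the right.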
    # We can examine the actual CPM like this:
    cpm_df = pd.DataFrame(np.array(contextPMat.context_prob_matrix))
    cpm_df.columns = [str(x) for x in contextPMat.position_idx]
    cpm_df.head()
The column index indicates the relative position. For example, in the first document, the word "button" appears two words to the right of the target term "break".
Get the vocabs dictionary
    """ Get the vocabs dictionary """
    contextPMat.vocabs['joycon']
[-1, -1, -4, -2, -2, -1, -2, 1]
.vocabs is a dictionary whose keys are the unique tokens collected while constructing the CPM, and whose values are lists of the recorded positions relative to the target token.
In the output above, we see the term "joycon" appears 8 times in total within the ±5 window of the target term. In 7 of those 8 occurrences it appears on the left side of the target term.
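To make the structure concrete, here is a minimal sketch of how such a dictionary could be derived from CPM rows. The helper `build_vocabs` and the toy rows are hypothetical, and the sketch assumes that padding tokens and the target itself are not recorded, which the package may or may not do.

```python
from collections import defaultdict


def build_vocabs(rows, left_window, padding_token="_empty_"):
    """Map each context token to the list of relative positions at which it was seen."""
    vocabs = defaultdict(list)
    for row in rows:
        for col, token in enumerate(row):
            position = col - left_window  # column 0 is the leftmost position
            # Assumption: skip the target column and any padding.
            if position == 0 or token == padding_token:
                continue
            vocabs[token].append(position)
    return dict(vocabs)


# Toy CPM rows with left_window = right_window = 2 and target "break".
rows = [
    ["_empty_", "joycon", "break", "after", "month"],
    ["left", "joycon", "break", "_empty_", "_empty_"],
    ["joycon", "controller", "break", "month", "_empty_"],
]
vocabs = build_vocabs(rows, left_window=2)
print(vocabs["joycon"])
```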
Get the statistics table for each term
    """ Get the statistics table for each term """
    cpm_stats = contextPMat.get_cpm_stats_tb()
    cpm_stats.cpm_stats_tb.head()
To understand the context, we can look at the statistics of relative positions for each term collected above.
When the occurrence count of a term is high, we know that it frequently appears around the target token.
When the variance of a term's relative position is low, we know that it tends to appear at the same relative location.
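As a rough illustration of these statistics, the sketch below computes the count, mean, median and variance for the "joycon" positions shown earlier. The function `position_stats` is hypothetical, and the use of population variance is an assumption; the package may compute its table differently.

```python
import statistics


def position_stats(vocabs):
    """Summarize the relative positions recorded for each term."""
    table = {}
    for term, positions in vocabs.items():
        table[term] = {
            "count": len(positions),
            "mean": statistics.mean(positions),
            "median": statistics.median(positions),
            # Assumption: population variance, so a term seen once gets 0 instead of an error.
            "variance": statistics.pvariance(positions),
        }
    return table


stats = position_stats({"joycon": [-1, -1, -4, -2, -2, -1, -2, 1]})
print(stats["joycon"])
```

For "joycon", the mean and median are both -1.5, which matches the observation that the term mostly sits one or two words to the left of "break".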
Infer potential N-grams containing the target term
    """ Infer potential N-grams containing the term """
    cpm_stats.guess_ngram(n=5)
|0||look expensive joycon controller break||0.327352|
|1||expensive joycon controller break month||0.330129|
|2||joycon controller break month handle||0.337807|
|3||controller break month handle inside||0.379144|
|4||break month handle inside item||0.447479|
Based on the statistics table, the algorithm can infer the most likely n-grams containing the target term.
For example, when we want to infer which word most likely appears immediately to the left of "break" (i.e. at relative location -1), we go through the following steps:
- Start with a word collected while constructing the CPM above (e.g. "controller")
- For the word, take the mean of its observed relative positions, subtract -1 (the location of interest) from it, and take the absolute value, i.e. |mean - (-1)|
- For the word, take the median of its observed relative positions, subtract -1 from it, and take the absolute value, i.e. |median - (-1)|
- Calculate 1/count, where count is the number of times the word was observed
- Calculate the variance of the relative positions of the word
- Repeat the above for all the collected words to obtain 4 lists of metrics (abs median difference, abs mean difference, 1/count, variance)
- Normalize the 4 lists
- For each collected word, multiply its 4 metrics by user-defined weights and take the sum to get a final score
The best candidate words at location -1 will have the smallest final score.
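The scoring steps above can be sketched as follows. This is not the package's implementation: the names `score_candidates` and `min_max` are hypothetical, min-max scaling is assumed for the normalization step (the method is not specified above), population variance is assumed, and equal weights are used as a placeholder for the user-defined ones.

```python
import statistics


def min_max(values):
    """Scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_candidates(vocabs, position, weights=(1.0, 1.0, 1.0, 1.0)):
    """Rank words for one relative position; a smaller score means a better candidate."""
    words = list(vocabs)
    mean_diff = [abs(statistics.mean(vocabs[w]) - position) for w in words]
    median_diff = [abs(statistics.median(vocabs[w]) - position) for w in words]
    inv_count = [1 / len(vocabs[w]) for w in words]
    variance = [statistics.pvariance(vocabs[w]) for w in words]
    # Assumption: min-max normalization of each metric list.
    columns = [min_max(c) for c in (mean_diff, median_diff, inv_count, variance)]
    scores = {
        w: sum(weight * col[i] for weight, col in zip(weights, columns))
        for i, w in enumerate(words)
    }
    return dict(sorted(scores.items(), key=lambda item: item[1]))


# Toy position lists, made up for illustration only.
toy_vocabs = {
    "controller": [-1, -1, -1, -2],
    "joycon": [-2, -2, -1, -3],
    "month": [1, 1, 2],
    "handle": [2, 3],
}
ranking = score_candidates(toy_vocabs, position=-1)
best = next(iter(ranking))
print(best)
```

In this toy data "controller" ranks first for location -1 because it is frequent, close to -1 on average, and has the lowest variance of the four words.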
When we want to find the most likely n-grams, the algorithm considers an n-gram with the target token in each possible position. Thus, in the example output above (n = 5), the target term "break" appears as the 5th, 4th, 3rd, 2nd and 1st term of the n-gram, respectively.
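The sliding of the target through every slot can be illustrated with a short sketch. Both helper names are hypothetical, and `best_word_at` simply hard-codes the words seen at each location in the example output above; how the package scores each assembled n-gram is not covered here.

```python
def ngram_position_windows(n):
    """List the relative-position windows that put the target (position 0) in each slot."""
    return [list(range(start, start + n)) for start in range(-(n - 1), 1)]


def assemble_ngrams(n, target, best_word_at):
    """Fill each window with the chosen word per position, keeping the target at position 0."""
    ngrams = []
    for window in ngram_position_windows(n):
        words = [target if pos == 0 else best_word_at[pos] for pos in window]
        ngrams.append(" ".join(words))
    return ngrams


# Words read off the example output above, keyed by relative location.
best_word_at = {
    -4: "look", -3: "expensive", -2: "joycon", -1: "controller",
    1: "month", 2: "handle", 3: "inside", 4: "item",
}
for ngram in assemble_ngrams(5, "break", best_word_at):
    print(ngram)
```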
Now we get more context around "break":
Expensive Joycon controllers breaking within months seems to be the problem.
Currently, when the target term appears more than once in a single document, the CPM only takes the first occurrence into consideration. I will try to improve this in the near future.
This method works better on a large number of short documents. It will not work well on, for example, a collection of news articles.