Search the context where a token appears


Introduction

Given a word token and a corpus in which this word appears, this package helps you find and analyze the contexts in which the word appears. It can be easily leveraged to improve a bag-of-words-based analysis.

Installation

pip install contextSearching

Usage

As an example to illustrate the usage, we choose the term "break" and a corpus of Amazon reviews for the Nintendo Switch in which people used the term "break".

From a simple bag-of-words analysis, we know that whenever people mention "break", the product is likely to receive a low star rating. But we do not know what breaks or any other context around "break."

Preparation

"""
Preparation
"""
import pandas as pd
import numpy as np
# read in corpus
corpus = pd.read_csv("data/switch_w_break.csv")
# define the target token
target = "break"

corpus.head()
   stars  titles                             reviews                                             dates
0    1.0  Already broken parts               Only 3 months later and parts are breaking. Th...  September 13, 2019
1    1.0  Dock is broken                     Hey. This was supposed to work. Dock is broken...  September 11, 2019
2    5.0  Dependable seller                  Arrived on time, well packed for the trip. N...    August 10, 2019
3    1.0  Nintendo Does Not Honor Warranty   My son used this unit for 7 months. At which ...   August 8, 2019
4    4.0  Great product, Joycons need work.  Everyone knows the switch is great. I waited a...   August 5, 2019

Loading the package and initializing the class

"""
Loading the package and initialize the class
"""
from contextSearching import context_searching

cs = context_searching(target_token=target,doc=corpus['reviews'],left_window=5,right_window=5,padding_token="_empty_")

In addition to the target token and the corpus, the class requires three more inputs: the left and right window sizes and the padding token.

The algorithm takes in the target token and aggressively collects all the words within the specified window.

For example, when left_window is set to 10, it will find the target token within each document of the corpus, then collect the ten words to its left, recording each word's relative position. If there are fewer than 10 words to the left, the algorithm pads the word list with the padding token.
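To make the windowing step concrete, here is a minimal sketch of the idea, assuming simple whitespace tokenization; collect_window is an illustrative helper, not part of the package's API, and the package's own preprocessing (stopword removal, lemmatization) is omitted:

# Illustrative sketch of the windowing step (not the package's internals)
def collect_window(tokens, target, left_window, right_window, padding_token="_empty_"):
    i = tokens.index(target)                    # first occurrence of the target
    left = tokens[max(0, i - left_window):i]    # up to left_window words before it
    right = tokens[i + 1:i + 1 + right_window]  # up to right_window words after it
    # pad so that every window has the same fixed length
    left = [padding_token] * (left_window - len(left)) + left
    right = right + [padding_token] * (right_window - len(right))
    return left + [target] + right

collect_window("hey suppose work dock break replace dock".split(), "break", 5, 5)
# ['_empty_', 'hey', 'suppose', 'work', 'dock', 'break', 'replace', 'dock', '_empty_', '_empty_', '_empty_']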

Get the Context Probing Matrix

"""
Get the Context Probing Matrix
"""
# Get a list of stopwords
from gensim.parsing.preprocessing import STOPWORDS
stopwords = list(STOPWORDS)
contextPMat = cs.get_context_prob_matrix(stop_words=stopwords, lemmatize=True, stem=False)

Assuming we have N documents in the corpus, and left_window and right_window are both set to 5, the Context Probing Matrix (CPM) is an N-by-11 matrix like the one below:

# We can examine the actual CPM like this:
cpm_df = pd.DataFrame(np.array(contextPMat.context_prob_matrix))
cpm_df.columns = [str(x) for x in contextPMat.position_idx]
cpm_df.head()
   -5       -4       -3         -2        -1        0      1        2            3             4        5
0  _empty_  _empty_  _empty_    month     later     break  joy      button       work          replace  _empty_
1  _empty_  hey      suppose    work      dock      break  replace  dock         pretty        angry    _empty_
2  _empty_  arrive   time       pack      trip      break  work     great        _empty_       _empty_  _empty_
3  gadget   year     expensive  relative  function  break  quickly  support      manufacturer  beware   nintendo
4  issue    real     button     leave     joycon    break  month    fortunately  nintendo      replace  warranty

The column index indicates the relative position. For example, in the first document, the word "button" appears two words to the right of the target term "break".
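Since the columns are labeled with the (stringified) relative positions, individual cells can be looked up directly. For instance, using the cpm_df built above:

# the word two places to the right of "break" in the first document
cpm_df.loc[0, "2"]
# 'button'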

Get the vocabs dictionary

"""
Get the vocabs dictionary
"""
contextPMat.vocabs['joycon']
[-1, -1, -4, -2, -2, -1, -2, 1]

The .vocabs attribute is a dictionary whose keys are the unique tokens collected while constructing the CPM, and whose values are lists of the recorded relative positions to the target token.

In the output above, we see that the term "joycon" appears 8 times in total within the ±5 window of the target term, most often to its left.
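The per-term statistics tabulated in the next section can be previewed directly from these position lists. For instance, for "joycon":

positions = np.array(contextPMat.vocabs['joycon'])
positions.mean()      # average relative position; negative means left of the target
np.median(positions)  # median relative position
positions.var()       # spread of the positions around their mean
len(positions)        # how often the term occurs within the window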

Get the statistics table for each term

"""
Get the statistics table for each term
"""
cpm_stats = contextPMat.get_cpm_stats_tb()
cpm_stats.cpm_stats_tb.head()
   tokens  mean       variance  abs_mean  count  median
0  month    0.142857  3.979592  1.857143     14     1.0
1  later    0.000000  2.000000  1.333333      3    -1.0
2  break    0.186441  0.643206  0.186441    118     0.0
3  joy     -1.444444  6.024691  2.555556      9    -2.0
4  button  -0.500000  2.583333  1.500000      6    -1.0

To understand the context, we can look at the statistics of relative positions for each term collected above.

For example,

When a term's count is high, we know that it frequently appears near the target token;

When the variance of a term's relative positions is low, we know that it tends to appear at the same relative location;
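For example, to surface terms that appear often and at a consistent position, one can sort the statistics table accordingly. A rough sketch using the columns shown above:

stats = cpm_stats.cpm_stats_tb
# frequent terms first, and among those, the most positionally stable ones
stats[stats["tokens"] != target].sort_values(
    ["count", "variance"], ascending=[False, True]
).head(10)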

Infer potential N-grams containing the target term

"""
Infer potential N-grams containing the term
"""
cpm_stats.guess_ngram(n = 5)
   ngram_candidates                         total_scores
0  look expensive joycon controller break       0.327352
1  expensive joycon controller break month      0.330129
2  joycon controller break month handle         0.337807
3  controller break month handle inside         0.379144
4  break month handle inside item               0.447479

Based on the statistics table, the algorithm can infer the most likely n-grams containing the target term.

For example, when we want to infer which word most likely appears immediately to the left of "break" (i.e., at relative position -1), we go through the following steps:

  1. Start with a word collected during the CPM construction above (e.g. "controller")
  2. For the word, take the mean of its observed relative positions, subtract -1 from it, and take the absolute value
  3. For the word, take the median of its observed relative positions, subtract -1 from it, and take the absolute value
  4. Calculate 1/count
  5. Calculate the variance of the word's relative positions
  6. Repeat the above for all the collected words to acquire 4 lists of metrics (absolute median difference, absolute mean difference, 1/count, variance)
  7. Normalize the 4 lists
  8. For each collected word, multiply its 4 metrics by user-defined weights and sum them to get a final score

The best candidate words at location -1 will have the smallest final score.
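Below is a sketch of that scoring recipe, assuming equal weights and min-max normalization; score_candidates is an illustrative helper, and the package's own implementation (and its default weights) may differ:

def score_candidates(stats, position, weights=(0.25, 0.25, 0.25, 0.25)):
    # the four metrics from steps 2-5, computed from the statistics table
    metrics = pd.DataFrame({
        "abs_mean_diff": (stats["mean"] - position).abs(),
        "abs_median_diff": (stats["median"] - position).abs(),
        "inv_count": 1.0 / stats["count"],
        "variance": stats["variance"],
    })
    # step 7: min-max normalize each metric to [0, 1]
    normed = (metrics - metrics.min()) / (metrics.max() - metrics.min())
    # step 8: weighted sum; smaller scores indicate better candidates
    scores = normed.mul(list(weights)).sum(axis=1)
    return stats.assign(score=scores).sort_values("score")

# Candidate words immediately to the left of "break" (relative position -1):
score_candidates(cpm_stats.cpm_stats_tb, position=-1).head()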

When we want to find the most likely n-grams, the algorithm considers an n-gram with the target token in each possible position. Thus, in the example output above (n = 5), the target term "break" appears as the 5th, 4th, 3rd, 2nd, and 1st term of the n-gram, respectively.

Now we get more context around "break":

Expensive Joycon controllers breaking within months seems to be the problem.

Some Notes

  1. Currently, when the target term appears more than once in a single document, the CPM only takes the first occurrence into consideration. I will try to improve this in the near future.

  2. This method works better when we have many documents and each document is short. It will not work well on, for example, a collection of news articles.
