A text analysis/feature generation tool for English text or text corpora
TGNLP:
A Three-Graph Natural Language Processing Library
Authors:
Jesse Coulson: Linkedin, Github
Primrose Johns: Linkedin, Github
Introduction
This library allows users to create graphs that represent multiple kinds of word relationships in a provided corpus in Python 3. Outputs are NetworkX graphs, so users can take advantage of that powerful library to further investigate the output. We also provide some functions of our own that produce subgraphs and metrics that tell the user more about a particular graph. Our work was originally inspired by TensorGCN.
Quick Start Guide
Making a Corpus
To start, you'll need to build a TGNLP.Corpus
object, which is what all graph generators take as input. The Corpus can be built from a Pandas Series of strings, a list of strings, or just one large string. A user can also provide a tokenized corpus as a list of lists of strings.
import TGNLP as tgnlp
#data could be a pd.Series, a list, or one large string
data = "Some text."
corpus = tgnlp.Corpus(data)
TGNLP.Corpus
objects are never modified by any functions that use them, so the same Corpus object can be reused to create as many graphs as are needed.
Generating a Graph
Once you have a TGNLP.Corpus
object, you can use it to generate a graph. There are three types of graphs TGNLP can make, and you can generate them using get_sequential_graph()
, get_semantic_graph()
, and get_syntactic_graph()
. The output is a weighted, undirected NetworkX graph.
G = tgnlp.get_sequential_graph(corpus)
print(type(G))
Output:
<class 'networkx.classes.graph.Graph'>
Graph Processing
You can also trim the graph down, using trim_norm_graph()
.
tgnlp.trim_norm_graph(G, inplace=True)
This is done by trimming a percentage of the total edge weight, not a fixed percentage of all edges. This means that the trimming process may remove far more than 10% of the edges if a large portion of the graph's edges have very small weights. We recommend trimming at least 10% of edge weight on all graphs, which is what trim_norm_graph()
does by default. This function returns a trimmed copy of the provided graph by default, but you can use inplace=True
to avoid the extra memory usage that entails.
Graph Analysis
You can get a PDF report summarizing some of the important metrics about your corpus using generate_graph_report()
, with your graph as input. It will appear in the directory the script is called from with the name tgnlp_report.pdf
.
#This will show up in the directory the python script is called from
tgnlp.generate_graph_report(G)
The report is two pages long; here's an example of one of our word subgraphs.
The report also features visualizations of linear and logarithmic degree distributions, as well as overall graph metrics like average degree, and specific details on the highest-degree and lowest-degree nodes in the graph.
Documentation
Raw Text Parsing
We provide two methods of parsing raw text (or a raw text corpus) that a user would like to analyze. One is the TGNLP.Corpus
class, which turns the user's raw text/text corpus into an object that our graph generators can use. The other is dataframe_to_tokens_labels()
, which turns the raw text/text corpus into a tokenized format compatible with common feature extraction methods (bag of words, TF-IDF, word embeddings). The output of dataframe_to_tokens_labels()
is a valid input to the TGNLP.Corpus
class constructor.
Corpus()
Corpus(data)
Parameters
- data: Pandas series, list, list of lists, or string
The data to be parsed. Any list, Series, or inner list passed in must contain only strings. Each string is expected to be at least as big as a single sentence, but can be as large as an entire document or collection of documents.
Returns
- corpus: TGNLP Corpus object
A corpus representing the data parsed, which can be used for graph generation
Attributes
- sentence_corpus: List of strings
Every sentence in the corpus in order of appearance. Each sentence is an item in the list. All punctuation and whitespace (except for one space between each word) has been removed.
- word_corpus: List of strings
Every word in the corpus in order of appearance. Each word is an item in the list. All punctuation and whitespace has been removed.
- word_counts: Dict of string:int
Every word in the corpus is a key in the dict. Values are the number of appearances that word makes over the whole corpus.
Errors
- TypeError: Load_data requires one of type: Pandas Series, list of strings, list of list of tokens, string.
Raised when the user provides a data input of incorrect type.
- ValueError: series must only have elements of type: string
Raised when the user provides a Pandas Series that is not populated with exclusively strings.
- ValueError: list must only have elements of type: string
Raised when the user provides a list that is not populated with exclusively strings.
This is the object that all graph generation functions take as input. It stores a word corpus, a sentence corpus, and a word frequency dictionary. Corpus objects are never modified or altered by any TGNLP functions after they are created, so they can be reused to create multiple graphs as needed.
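The shape of these three attributes can be illustrated with a minimal sketch in plain Python. This is not TGNLP's actual implementation; the sentence-splitting and punctuation-stripping rules here are simplified stand-ins.

```python
import re
from collections import Counter

def build_toy_corpus(text):
    """Mimic the shape of Corpus attributes: sentence_corpus,
    word_corpus, and word_counts (simplified splitting rules)."""
    # Split on sentence-ending punctuation, then drop remaining punctuation.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    sentence_corpus = [" ".join(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    word_corpus = [w for s in sentence_corpus for w in s.split()]
    word_counts = dict(Counter(word_corpus))
    return sentence_corpus, word_corpus, word_counts

sents, words, counts = build_toy_corpus("The cat sat. The cat ran!")
print(sents)   # ['the cat sat', 'the cat ran']
print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```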
dataframe_to_tokens_labels()
dataframe_to_tokens_labels(df, text_column_name, label_column_name, lower_case=True, remove_stopwords=True, lemmatization = False)
Parameters
- df: Pandas DataFrame
A DataFrame containing the text to be tokenized and its labels.
- text_column_name: scalar object
The name of the column that contains the text to be tokenized.
- label_column_name: scalar object
The name of the column that contains the text labels.
- lower_case: bool, default=True
Flag that determines whether or not to convert all words to lower case.
- remove_stopwords: bool, default=True
Flag that determines whether or not to remove stopwords.
- lemmatization: bool, default=False
Flag that determines whether or not to lemmatize words.
Returns
- document_list: List of list of strings
The text/corpus as lists of tokenized words.
- label_list: List of *
The provided labels column, as a list.
Errors
- TypeError: Inputted data is not of type Pandas DataFrame
Raised when the user-provided object is not a Pandas DataFrame.
- ValueError: [text_column_name] is not a valid column in the DataFrame
Raised when the user-provided text column name is not found in the provided DataFrame.
- ValueError: [label_column_name] is not a valid column in the DataFrame
Raised when the user-provided label column name is not found in the provided DataFrame.
This function turns two columns from a DataFrame into a preprocessed list of documents, with each document being a list of words. It optionally supports lemmatization, which slows runtime significantly but produces more powerful results. The first output of this function, document_list, is a valid input type for the TGNLP.Corpus
class constructor.
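The row-wise transformation can be sketched without pandas over (text, label) pairs. This is a rough stand-in, not the library's implementation, and the stopword list here is a toy assumption rather than the one TGNLP actually uses.

```python
# Toy stopword set; TGNLP's real stopword list will differ.
TOY_STOPWORDS = {"the", "a", "an", "is", "of"}

def rows_to_tokens_labels(rows, lower_case=True, remove_stopwords=True):
    """Turn (text, label) pairs into a list of token lists plus a
    parallel list of labels, mirroring the function's two outputs."""
    document_list, label_list = [], []
    for text, label in rows:
        tokens = text.split()
        if lower_case:
            tokens = [t.lower() for t in tokens]
        if remove_stopwords:
            tokens = [t for t in tokens if t not in TOY_STOPWORDS]
        document_list.append(tokens)
        label_list.append(label)
    return document_list, label_list

docs, labels = rows_to_tokens_labels([("The cat is fast", 1), ("A slow dog", 0)])
print(docs)    # [['cat', 'fast'], ['slow', 'dog']]
print(labels)  # [1, 0]
```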
The Three Graphs
TGNLP has three different graphs that it can generate, each of which represents different kinds of word relationships. All of the graphs are undirected, weighted graphs where nodes are represented by words and edges represent relationships between words. The weight of an edge represents how strong that relationship is, although what "strong" means depends on which graph is being worked with. There is the sequential graph, the semantic graph, and the syntactic graph. These graphs were originally inspired by a methodology proposed by Xien Liu et al., and one implementation of their approach can be found in their TensorGCN GitHub.
get_sequential_graph()
get_sequential_graph(corpus, window_size=5)
Parameters
- corpus: TGNLP Corpus object
A corpus generated from the data to be analyzed - window_size: int, default=5
The size of the sliding window to be used for observing word co-occurrences
Returns
- G: NetworkX Graph
A graph representing sequential relationships between words in the corpus
Errors
- TypeError : Inputted data is not of type Corpus
Raised when the user provides a non-Corpus input
Nodes in the sequential graph represent individual words in the corpus. The weighted edges represent how frequently two words appear near one another. This "nearness" is observed using a sliding window whose size the user can specify. Every time two words appear in the same window, that counts as a co-occurrence of those two words. In an untrimmed graph, every pair of words that ever appears together in a window will have an edge. The edge weight is calculated as $W_{i,j} = \frac{freq_{i,j}}{min\{freq_{i}, freq_{j}\}}$, where $freq_{i,j}$ is the number of co-occurrences of the two words and $min\{freq_{i}, freq_{j}\}$ is the overall frequency of the less frequent word.
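The sliding-window counting and the weight formula can be sketched as follows. This is a simplified illustration, not TGNLP's implementation; in particular, how overlapping windows are counted may differ from the library's exact behavior.

```python
from collections import Counter
from itertools import combinations

def sequential_weights(words, window_size=5):
    """Count pairwise co-occurrences inside a sliding window, then
    normalize each pair's count by the frequency of its rarer word:
    W[i,j] = freq_ij / min(freq_i, freq_j)."""
    freq = Counter(words)
    co = Counter()
    for start in range(len(words)):
        window = words[start:start + window_size]
        for wi, wj in combinations(set(window), 2):
            co[tuple(sorted((wi, wj)))] += 1
    return {pair: c / min(freq[pair[0]], freq[pair[1]]) for pair, c in co.items()}

# 'cat' and 'sat' co-occur in two windows, but 'sat' appears only once,
# so the raw weight exceeds 1 until the graph is normalized and trimmed.
print(sequential_weights(["cat", "sat", "cat", "mat"], window_size=3))
```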
get_semantic_graph()
get_semantic_graph(corpus)
Parameters
- corpus: TGNLP Corpus object
A corpus generated from the data to be analyzed
Returns
- G: NetworkX Graph
A graph representing semantic relationships between words in the corpus
Errors
- TypeError: Inputted data is not of type Corpus
Raised when the user provides a non-Corpus input
This function uses word2vec to generate an embedding for each word. Once the embeddings are generated, we utilize the most_similar()
function in Word2Vec to find the 20 most similar words to each word. These "most similar" word pairs each become an edge in the graph, with each node being a word. The cosine similarity between the two words becomes the weight of the edge between them.
get_syntactic_graph()
get_syntactic_graph(corpus)
Parameters
- corpus: TGNLP Corpus object
A corpus generated from the data to be analyzed
Returns
- G: NetworkX Graph
A graph representing syntactic relationships between words in the corpus
Errors
- TypeError: Inputted data is not of type Corpus
Raised when the user provides a non-Corpus input
This function uses the spaCy library's dependency parser to identify all syntactic dependencies between words in each sentence. While there are different types of syntactic dependencies in English, we treat all dependency types as equal in terms of how much they contribute to edge weight. The edge weight is calculated as $W_{i,j} = \frac{freq_{i,j}}{min\{freq_{i}, freq_{j}\}}$, where $freq_{i,j}$ is the number of times the two words share a dependency in a sentence and $min\{freq_{i}, freq_{j}\}$ is the number of occurrences of the less frequent word.
Graph Processing
We offer a tool which normalizes edge weights and trims smaller edges.
trim_norm_graph()
trim_norm_graph(G_full, trim = 0.1, inplace = False)
Parameters
- G_full: NetworkX Graph
A graph generated by TGNLP
- trim: float, default=0.1
The amount of total edge weight to be trimmed. Must be a positive value less than 1.
- inplace: bool, default=False
Indicates whether to perform the trimming on the provided graph (True) or on a copy of the graph (False).
Returns
- G_proc: NetworkX Graph
A version of the provided graph with edges trimmed and normalized.
Errors
- TypeError: Provided value for trim is too large. Trim value must be <= 1
Raised when the user provides a trim value that is not a positive float less than 1
This function normalizes the edge weight values in a provided graph to be between 0 and 1. It also trims the graph's edges to reduce the total weight of all edges. The latter is done by iterating through the edges in ascending order of weight and removing them until the requested amount of total edge weight has been removed. It will not remove an edge if doing so would remove more than the requested amount. We recommend removing at least 10% (trim=0.1) of edge weight in all of the graphs TGNLP generates.
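The normalize-then-trim procedure can be sketched over a plain edge-weight dict standing in for the NetworkX graph. This is an illustration, not the library's code; in particular, dividing by the maximum weight is an assumed normalization scheme.

```python
def trim_norm(edge_weights, trim=0.1):
    """Normalize weights into [0, 1] (assumed: divide by the max), then
    drop the lightest edges in ascending order, stopping before the
    removed weight would exceed `trim` of the total."""
    max_w = max(edge_weights.values())
    norm = {e: w / max_w for e, w in edge_weights.items()}
    budget = trim * sum(norm.values())
    removed = 0.0
    result = dict(norm)
    for edge, w in sorted(norm.items(), key=lambda kv: kv[1]):
        if removed + w > budget:
            break  # removing this edge would overshoot the requested amount
        removed += w
        del result[edge]
    return result

g = {("a", "b"): 10.0, ("b", "c"): 1.0, ("c", "d"): 0.5}
# Only the lightest edge fits inside the 10% budget here.
print(trim_norm(g, trim=0.1))  # {('a', 'b'): 1.0, ('b', 'c'): 0.1}
```

Note how the lightest edge ('c', 'd') is removed, while ('b', 'c') survives because dropping it too would exceed 10% of the total normalized weight.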
Graph Reporting
You can use our library to generate a report describing some important features of the graph you generated, and what it says about your text/corpus.
generate_graph_report()
generate_graph_report(G)
Parameters
- G: NetworkX Graph
A graph generated by TGNLP
This function generates a PDF report (called tgnlp_report.pdf) in the directory from which the program is run. The report includes a visualization of a word subgraph (the word is chosen semi-randomly, and must have a degree greater than 10 and less than 15), as well as some basic metrics about the graph in question. These metrics include the lowest-degree word, the highest-degree word, the number of nodes, the number of edges, the average degree, the average centrality, the assortativity coefficient, and both linear and logarithmic degree distributions.
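The degree-based metrics in the report are standard graph quantities; a sketch computing per-node degree, average degree, and the degree distribution from an undirected edge list (plain Python rather than NetworkX, whose built-ins the report presumably uses):

```python
from collections import Counter

def degree_metrics(edges):
    """Compute per-node degree, average degree, and the degree
    distribution from an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    avg = sum(deg.values()) / len(deg)
    dist = Counter(deg.values())  # degree -> number of nodes with that degree
    return deg, avg, dist

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
deg, avg, dist = degree_metrics(edges)
print(dict(deg))  # {'a': 3, 'b': 2, 'c': 2, 'd': 1}
print(avg)        # 2.0
```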