A text analysis and feature generation tool for English text or text corpora

TGNLP:
A Three-Graph Natural Language Processing Library

Authors:
Jesse Coulson: Linkedin, Github
Primrose Johns: Linkedin, Github

Introduction

This Python 3 library allows users to create graphs that represent multiple kinds of word relationships in a provided corpus. Outputs are NetworkX graphs, so users can apply NetworkX's own tools to investigate the graph output further. We also provide some functions of our own that produce subgraphs and metrics that tell the user more about a particular graph.
Our work was originally inspired by TensorGCN.

Quick Start Guide

Installation

You can install this library from PyPI using pip. Just run

pip3 install TGNLP

to get the latest version!

Making a Corpus

To start, you'll need to build a TGNLP.Corpus object, which is what all graph generators take as input. The Corpus can be built from a Pandas Series of strings, a list of strings, or just one large string. A user can also provide a tokenized corpus as a list of lists of strings.

import TGNLP as tgnlp

#data could be a pd.Series, a list, or one large string
data = "Some text."
corpus = tgnlp.Corpus(data)

TGNLP.Corpus objects are never modified by any functions that use them, so the same Corpus object can be reused to create as many graphs as are needed.

Generating a Graph

Once you have a TGNLP.Corpus object, you can use it to generate a graph. There are three types of graphs TGNLP can make, and you can generate them using get_sequential_graph(), get_semantic_graph(), and get_syntactic_graph(). The output is a weighted, undirected NetworkX graph.

G = tgnlp.get_sequential_graph(corpus)
print(type(G))

Output:

<class 'networkx.classes.graph.Graph'>

Graph Processing

You can also trim the graph down, using trim_norm_graph().

tgnlp.trim_norm_graph(G, inplace=True)

Trimming removes a percentage of the total edge weight, not a percentage of the edges. This means that the trimming process may remove far more than 10% of the edges if a large portion of graph edges have very small weights. We recommend trimming at least 10% of the total edge weight on all graphs, which is what trim_norm_graph() does by default. This function returns a trimmed copy of the provided graph by default, but you can use inplace=True to avoid the extra memory usage that entails.

Graph Analysis

You can get a PDF report summarizing some of the important metrics about your corpus using generate_graph_report(), with your graph as input. It will appear in the directory the script is called from with the name tgnlp_report.pdf.

#This will show up in the directory the python script is called from 
tgnlp.generate_graph_report(G)

The report is two pages long and includes a visualization of a word subgraph.

The report also features visualizations of linear and logarithmic degree distributions, as well as overall graph metrics like average degree, and specific details on the highest-degree and lowest-degree nodes in the graph.

Documentation

Raw Text Parsing

We provide two methods of parsing raw text (or a raw text corpus) that a user would like to analyze. One is the TGNLP.Corpus class, which turns the user's raw text/text corpus into an object that our graph generators can use. The other is dataframe_to_tokens_labels(), which turns the raw text/text corpus into a tokenized format compatible with common feature extraction methods (bag of words, TF-IDF, word embeddings). The output of dataframe_to_tokens_labels() is a valid input to the TGNLP.Corpus class constructor.

Corpus()

Corpus(data)

Parameters

data: Pandas series, list, list of lists, or string
The data to be parsed. Lists, series, or the lists within lists passed in must contain only strings. Each string is expected to be at least as big as a single sentence, but can be as large as entire documents/collections of documents.

Returns

corpus: TGNLP Corpus object
A corpus representing the data parsed, which can be used for graph generation

Attributes

sentence_corpus: List of strings
Every sentence in the corpus in order of appearance. Each sentence is an item in the list. All punctuation and whitespace (except for one space between each word) have been removed.
word_corpus: List of strings
Every word in the corpus in order of appearance. Each word is an item in the list. All punctuation and whitespace have been removed.
word_counts: Dict of string:int
Every word in the corpus is a key in the dict. Values are the number of appearances that word makes over the whole corpus.

Errors

TypeError: Load_data requires one of type: Pandas Series, list of strings, list of list of tokens, string.
Raised when the user provides a data input of incorrect type.
ValueError: series must only have elements of type: string
Raised when the user provides a Pandas Series that is not populated with exclusively strings.
ValueError: list must only have elements of type: string
Raised when the user provides a list that is not populated with exclusively strings.

This is the object that all graph generation functions take as input. It stores a word corpus, a sentence corpus, and a word frequency dictionary. Corpus objects are never modified or altered by any TGNLP functions after they are created, so they can be reused to create multiple graphs as needed.
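To make the three attributes concrete, here is a minimal sketch of how they relate to one another for a single input string. It is not TGNLP's implementation: the helper name build_corpus_views, the sentence-splitting rule, and the decision to leave letter case unchanged are all assumptions made for illustration.

```python
import re
from collections import Counter

def build_corpus_views(text):
    """Hypothetical helper: derive the three Corpus-style views from one string."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    raw_sentences = re.split(r"[.!?]+", text)
    sentence_corpus = []
    for raw in raw_sentences:
        # Keep only alphanumeric runs, which drops punctuation and extra whitespace.
        # (Simplification: an apostrophe splits a word in two.)
        words = re.findall(r"[A-Za-z0-9]+", raw)
        if words:
            sentence_corpus.append(" ".join(words))
    # Flatten the sentences into one ordered list of words.
    word_corpus = [w for s in sentence_corpus for w in s.split()]
    # Count how often each word appears over the whole corpus.
    word_counts = dict(Counter(word_corpus))
    return sentence_corpus, word_corpus, word_counts

sentences, words, counts = build_corpus_views("The cat sat.  The cat ran!")
print(sentences)
print(counts)
```

The real Corpus class may split sentences and normalize words differently, so treat this only as a picture of what each attribute holds.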

dataframe_to_tokens_labels()

dataframe_to_tokens_labels(df, text_column_name, label_column_name, lower_case=True, remove_stopwords=True, lemmatization = False)

Parameters

df: Pandas DataFrame
A DataFrame containing the text to be tokenized and its labels.
text_column_name: scalar object
The name of the column that contains the text to be tokenized.
label_column_name: scalar object
The name of the column that contains the text labels.
lower_case: bool, default=True
Flag that determines whether or not to convert all words to lower case.
remove_stopwords: bool, default=True
Flag that determines whether or not to remove stopwords.
lemmatization: bool, default=False
Flag that determines whether or not to lemmatize words.

Returns

document_list: List of list of strings
The text/corpus as lists of tokenized words.
label_list: List of *
The provided labels column, as a list.

Errors

TypeError : Inputted data is not of type Pandas DataFrame
Raised when the user-provided object is not a Pandas DataFrame.
ValueError : [text_column_name] is not a valid column in the DataFrame
Raised when the user-provided text column name is not found in the provided DataFrame.
ValueError : [label_column_name] is not a valid column in the DataFrame
Raised when the user-provided label column name is not found in the provided DataFrame.

This function turns two columns from a DataFrame into a preprocessed list of documents, with each document represented as a list of words, and a matching list of labels. It optionally supports lemmatization, which slows runtime significantly but reduces each word to its base form, which can improve results. The first output of this function, document_list, is a valid input type for the TGNLP.Corpus class constructor.
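The preprocessing described above can be pictured with a small pure-Python sketch. It is not the library's code: it takes (text, label) pairs instead of a DataFrame, the helper name rows_to_tokens_labels and the tiny stopword set are invented for illustration, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword set; the list TGNLP actually uses is not stated here.
STOPWORDS = {"a", "an", "the", "is", "on", "of", "and"}

def rows_to_tokens_labels(rows, lower_case=True, remove_stopwords=True):
    """Hypothetical helper: turn (text, label) pairs into token lists and labels."""
    document_list, label_list = [], []
    for text, label in rows:
        if lower_case:
            text = text.lower()
        # Split into word tokens, discarding punctuation.
        tokens = re.findall(r"[A-Za-z']+", text)
        if remove_stopwords:
            tokens = [t for t in tokens if t.lower() not in STOPWORDS]
        document_list.append(tokens)
        label_list.append(label)
    return document_list, label_list

rows = [("The cat sat on the mat.", "pets"), ("Stocks fell on Monday.", "finance")]
documents, labels = rows_to_tokens_labels(rows)
print(documents)
print(labels)
```

The resulting list of token lists has the same shape as document_list, which is the form the Corpus constructor accepts as a tokenized corpus.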

The Three Graphs

TGNLP can generate three different graphs, each of which represents a different kind of word relationship: the sequential graph, the semantic graph, and the syntactic graph. All of them are undirected, weighted graphs where nodes represent words and edges represent relationships between words. The weight of an edge represents how strong that relationship is, although what "strong" means depends on which graph is being worked with.
These graphs were originally inspired by a methodology proposed by Xien Liu et al., and one implementation of their approach can be found in their TensorGCN GitHub repository.

get_sequential_graph()

get_sequential_graph(corpus, window_size=5)

Parameters

corpus: TGNLP Corpus object
A corpus generated from the data to be analyzed
window_size: int, default=5
The size of the sliding window to be used for observing word co-occurrences

Returns

G: NetworkX Graph
A graph representing sequential relationships between words in the corpus

Errors

TypeError : Inputted data is not of type Corpus
Raised when the user provides a non-Corpus input

Nodes in the sequential graph represent individual words in the corpus. The weighted edges in the sequential graph represent how frequently two words appear near one another. This "nearness" is observed using a sliding window whose size the user can specify. Every time two words appear in the same window, that counts as a co-occurrence of those two words. In an untrimmed graph, every pair of words that appear together in a window will have an edge. The edge weight is calculated as $W_{i,j} = \frac{freq_{i,j}}{min\{freq_{i}, freq_{j}\}}$, where $freq_{i,j}$ is the number of co-occurrences of the two words and $min\{freq_{i}, freq_{j}\}$ is the overall frequency of the less frequent word.
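The sliding-window weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than TGNLP's code: the function name sequential_weights is hypothetical, the window advances one word at a time, and each unordered pair of distinct words is counted once per window.

```python
from collections import Counter
from itertools import combinations

def sequential_weights(words, window_size=5):
    """Sketch of sliding-window co-occurrence weights (not TGNLP's actual code)."""
    word_freq = Counter(words)
    pair_freq = Counter()
    # Slide a fixed-size window over the word list; a short corpus yields one window.
    n_windows = max(len(words) - window_size + 1, 1)
    for start in range(n_windows):
        window = set(words[start:start + window_size])
        # Assumption: count each unordered pair of distinct words once per window.
        for a, b in combinations(sorted(window), 2):
            pair_freq[(a, b)] += 1
    # W_ij = freq_ij / min(freq_i, freq_j)
    return {
        pair: count / min(word_freq[pair[0]], word_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

weights = sequential_weights("the cat sat on the mat".split(), window_size=3)
for pair, weight in sorted(weights.items()):
    print(pair, weight)
```

Under this counting convention a pair can appear in several overlapping windows, so a raw weight can exceed 1.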

get_semantic_graph()

get_semantic_graph(corpus)

Parameters

corpus: TGNLP Corpus object
A corpus generated from the data to be analyzed

Returns

G: NetworkX Graph
A graph representing semantic relationships between words in the corpus

Errors

TypeError: Inputted data is not of type Corpus
Raised when the user provides a non-Corpus input

This function uses word2vec to generate an embedding for each word. Once the embeddings are generated, we use the most_similar(n) function in Word2Vec to find the 20 most similar words to each word. Each of these "most similar" word pairs becomes an edge in the graph, with each node being a word. The cosine similarity between the two words becomes the weight of the edge between them.
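The edge-building step can be illustrated without a trained model. The sketch below is an assumption-laden stand-in: the two-dimensional embeddings are made up, the function names cosine and semantic_edges are hypothetical, and a brute-force nearest-neighbour search replaces Word2Vec's own similarity lookup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_edges(embeddings, top_n=20):
    """Link each word to its top_n most similar words, weighted by cosine similarity."""
    edges = {}
    for word, vec in embeddings.items():
        sims = [(cosine(vec, other_vec), other)
                for other, other_vec in embeddings.items() if other != word]
        for sim, other in sorted(sims, reverse=True)[:top_n]:
            # The graph is undirected, so store each pair under a sorted key.
            edges[tuple(sorted((word, other)))] = sim
    return edges

# Made-up embeddings; a real run would take them from a trained word2vec model.
toy_embeddings = {"apple": [1.0, 0.0], "king": [4.0, 3.0], "queen": [3.0, 4.0]}
edges = semantic_edges(toy_embeddings, top_n=1)
print(edges)
```

With top_n=1 the toy data yields two edges, because "king" and "queen" pick each other and that pair is stored only once.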

get_syntactic_graph()

get_syntactic_graph(corpus)

Parameters

corpus: TGNLP Corpus object
A corpus generated from the data to be analyzed

Returns

G: NetworkX Graph
A graph representing syntactic relationships between words in the corpus

Errors

TypeError: Inputted data is not of type Corpus
Raised when the user provides a non-Corpus input

This function uses the spaCy library’s dependency parser to identify all syntactic dependencies between words in each sentence. While there are different types of syntactic dependencies in English, we treat all dependency types as equal in terms of how much they contribute to edge weight. The edge weight is calculated as $W_{i,j} = \frac{freq_{i,j}}{min\{freq_{i}, freq_{j}\}}$, where $freq_{i,j}$ is the number of times two words share a dependency in a sentence and $min\{freq_{i}, freq_{j}\}$ is the number of occurrences of the less frequent word.
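As a worked example with made-up counts: suppose word $i$ appears 4 times in the corpus, word $j$ appears 2 times, and the two words share a dependency in 1 sentence. Then $freq_{i,j} = 1$, $freq_{i} = 4$, and $freq_{j} = 2$, so

$$W_{i,j} = \frac{freq_{i,j}}{min\{freq_{i}, freq_{j}\}} = \frac{1}{min\{4, 2\}} = \frac{1}{2} = 0.5$$

Dividing by the frequency of the rarer word means that a rare word that is almost always attached to the same partner still receives a high weight.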

Graph Processing

We offer a tool which normalizes edge weights and trims smaller edges.

trim_norm_graph()

trim_norm_graph(G_full, trim = 0.1, inplace = False)

Parameters

G_full: NetworkX Graph
A graph generated by TGNLP
trim: float, default=0.1
The fraction of total edge weight to be trimmed. Must be a positive value less than 1.
inplace: bool, default=False
Indicates whether to perform the trimming on the provided graph (True) or on a copy of the graph (False).

Returns

G_proc: NetworkX Graph
A version of the provided graph with edges trimmed and normalized.

Errors

TypeError: Provided value for trim is too large. Trim value must be <= 1
Raised when the user provides a trim value that is not a positive float less than 1

This function normalizes edge weight values in a provided graph to be between 0 and 1. It also trims a graph's edges to reduce the total weight of all edges. The latter process is done by iterating through each edge in ascending order of weight, and removing edges until the requested amount of the total edge weight has been removed. It will not remove an edge if doing so would remove more than the requested amount. We recommend removing at least 10% (trim = 0.1) of edge weight in all of the graphs TGNLP generates.
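The trimming rule can be sketched on a plain dictionary of edge weights. This is not the library's implementation: the function name trim_norm is hypothetical, normalization is assumed to divide every weight by the largest weight, and normalization is assumed to happen before trimming.

```python
def trim_norm(edges, trim=0.1):
    """Sketch: normalize weights, then trim the lightest edges by total weight."""
    if not 0 < trim < 1:
        raise ValueError("trim must be a positive value less than 1")
    max_w = max(edges.values())
    # Assumption: normalization divides every weight by the largest weight.
    normed = {edge: w / max_w for edge, w in edges.items()}
    # The amount of weight we are allowed to remove.
    budget = trim * sum(normed.values())
    removed = 0.0
    kept = dict(normed)
    # Visit edges from lightest to heaviest.
    for edge, w in sorted(normed.items(), key=lambda kv: kv[1]):
        if removed + w > budget:
            break  # removing this edge would exceed the requested amount
        removed += w
        del kept[edge]
    return kept

edges = {("a", "b"): 10.0, ("a", "c"): 1.0, ("b", "c"): 0.5, ("c", "d"): 0.5}
kept = trim_norm(edges, trim=0.1)
print(kept)
```

In this toy graph, trimming 10% of the total weight removes two of the four edges, which shows how the share of edges removed can be much larger than the share of weight removed.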

Graph Reporting

You can use our library to generate a report describing some important features of the graph you generated, and what it says about your text/corpus.

generate_graph_report()

generate_graph_report(G)

Parameters

G: NetworkX Graph
A graph generated by TGNLP

This function generates a PDF report (called tgnlp_report.pdf) that will appear in the directory from which the program is run. This report includes a visualization of a word subgraph (the word is chosen semi-randomly and must have a degree greater than 10 and less than 15), as well as some basic metrics about the graph in question. These metrics include the lowest-degree word, the highest-degree word, the number of nodes, the number of edges, the average degree, the average centrality, the assortativity coefficient, and both linear and logarithmic degree distributions.
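Several of the degree-based metrics listed above are simple to compute by hand, which can help when reading the report. The sketch below works on a plain list of edges; the function name degree_summary is hypothetical, it ignores isolated nodes, and it does not attempt the centrality or assortativity figures.

```python
from collections import Counter

def degree_summary(edges):
    """Sketch of degree-based metrics for an undirected graph given as word pairs."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {
        "nodes": len(degree),
        "edges": len(edges),
        # Each edge contributes to the degree of two nodes.
        "average_degree": 2 * len(edges) / len(degree),
        "highest_degree_word": max(degree, key=degree.get),
        "lowest_degree_word": min(degree, key=degree.get),
        # Maps a degree value to the number of nodes with that degree.
        "degree_distribution": dict(Counter(degree.values())),
    }

summary = degree_summary([("cat", "sat"), ("cat", "mat"), ("cat", "dog"), ("sat", "mat")])
print(summary)
```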

Release history

1.0.1 (this version): May 28, 2024
1.0.0: May 28, 2024
0.0.1: May 27, 2024
