Use this library to transform raw text into different graph representations.
text2graph Library
text2graphapi is a Python library for text-to-graph transformations. To use it, install the library and its dependencies in your application, then load and read the corpus of text documents to be transformed into graphs.
The text-to-graph transformation pipeline consists of three main modules:
- Text Preprocessing and Normalization. This component receives the input text (in a specific format/structure) and performs all the cleaning and normalization steps for the data. It applies text cleaning methods such as removing stop words, handling contractions, and handling ASCII characters. It also applies NLP techniques such as POS tagging, tokenization, and lemmatization, using third-party libraries such as spaCy and NLTK.
- Graph Model. This second component defines and constructs the entities/nodes and their relationships/edges from the corpus texts to generate the specified graph representation. Currently, this library supports three text-to-graph transformations: Word Co-Occurrence Graph, Heterogeneous Graph, and Integrated Syntactic Graph (ISG). We will see each of them in detail in the following sections.
- Graph Transformation. This final module receives the generated graph (a set of nodes and edges) as input and applies vector transformations to obtain the final graph representation as output. The output format is specified in the input parameters; supported formats include adjacency list, adjacency matrix, dense matrix, and networkx object.
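The three stages above can be pictured with a minimal, standard-library-only sketch. It is not the library's implementation: the stop-word list, the contraction table, the "link adjacent tokens" graph model, and all function names below are assumptions made purely for illustration.

```python
import re

# Tiny illustrative resources (assumptions); the library relies on spaCy/NLTK instead.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it"}
CONTRACTIONS = {"it's": "it is", "can't": "cannot"}


def preprocess(text):
    """Stage 1: lowercase, expand contractions, tokenize, drop stop words."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic tokens only
    return [tok for tok in tokens if tok not in STOP_WORDS]


def build_edges(tokens):
    """Stage 2 (hypothetical graph model): link each token to the next one."""
    return list(zip(tokens, tokens[1:]))


def to_adjacency_matrix(edges):
    """Stage 3: turn an edge list into (sorted node list, dense 0/1 matrix)."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return nodes, matrix


tokens = preprocess("The sun was shining, and it's bright.")
nodes, matrix = to_adjacency_matrix(build_edges(tokens))
print(tokens)  # ['sun', 'shining', 'bright']
print(nodes)   # ['bright', 'shining', 'sun']
print(matrix)  # [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
```

Rows and columns of the matrix follow the sorted node list, so row `sun` has a 1 in column `shining`.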
Installation from PyPI
From your project's command line, run the following command to install the latest version of the library:
pip install text2graphapi
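To confirm that the installation succeeded, one option is to query the installed distribution from Python. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of text2graphapi.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("text2graphapi"))
```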
Types of graph representation available:
Currently, this library supports three types of graph representation: Word Co-Occurrence Graph, Heterogeneous Graph, and Integrated Syntactic Graph (ISG). The Word Co-Occurrence transformation is classified as a document-level graph because it produces one output graph per input document (one graph for each document in the corpus). The Heterogeneous and ISG transformations are classified as corpus-level graphs because they produce a single output graph that represents the whole corpus (one graph for all documents in the corpus).
The following code snippet shows a basic example using the text2graphapi library for these three representations.
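To make the document-level idea concrete, here is a minimal sketch that builds one co-occurrence edge map per document. It is an illustration under stated assumptions, not the library's code: the function name, the pre-tokenized toy corpus, and the exact window semantics (a word is linked to the words that follow it inside the same window) are all hypothetical.

```python
from collections import Counter


def cooccurrence_edges(tokens, window_size=3):
    """Count directed co-occurrences inside a sliding window.

    A pair (a, b) is counted each time b follows a and both fall
    inside the same window of `window_size` consecutive tokens.
    """
    edges = Counter()
    for i, source in enumerate(tokens):
        for target in tokens[i + 1 : i + window_size]:
            if source != target:  # skip self-loops
                edges[(source, target)] += 1
    return edges


# Toy corpus, already tokenized and normalized (assumption).
corpus = {
    1: ["sun", "shine", "river", "bright"],
    2: ["rain", "sun", "river", "shine"],
}

# Document-level: one graph (edge -> weight map) per input document.
graphs = {doc_id: cooccurrence_edges(tokens) for doc_id, tokens in corpus.items()}
print(len(graphs))                   # 2
print(graphs[1][("sun", "shine")])   # 1
```

A corpus-level transformation would instead merge all documents into a single node and edge set.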
# The input must be a list of dictionaries, where each dict contains an 'id' and the 'doc' text data
from text2graphapi.src.Cooccurrence import Cooccurrence
from text2graphapi.src.Heterogeneous import Heterogeneous
from text2graphapi.src.IntegratedSyntacticGraph import ISG

corpus_docs = [
    {'id': 1, 'doc': "The sun was shining, making the river look bright and happy."},
    {'id': 2, 'doc': "Even with the rain, the sun came out a bit, making the wet river shine."},
]

# Document-level transformation: one directed graph per document
to_word_coocc_graph = Cooccurrence(graph_type='DiGraph',
                                   language='en', apply_preprocessing=True,
                                   window_size=3, output_format='adj_matrix')

# Corpus-level transformations: one graph for the whole corpus
to_hetero_graph = Heterogeneous(graph_type='Graph',
                                window_size=20, apply_preprocessing=True,
                                language='en', output_format='networkx')
to_isg_graph = ISG(graph_type='DiGraph', language='en',
                   apply_preprocessing=True, output_format='networkx')

to_hetero_graph.transform(corpus_docs)
to_word_coocc_graph.transform(corpus_docs)
to_isg_graph.transform(corpus_docs)
In the next section, we go through each of the graph representations generated by this code and explain in detail how they are built.
- Word Co-Occurrence Graph: In this graph, each word is represented as a node, and the co-occurrence of two words within the document text is represented as an edge between the corresponding nodes. As attributes/weights, nodes carry the POS tag, and edges carry the number of co-occurrences between the two words in the text document. As output, we obtain one graph representation for each text document in the corpus.
- Heterogeneous Graph: In this graph, words and documents are represented as nodes, and the word-to-word and word-to-document relations are represented as edges. As attributes/weights, the word-to-word relation carries the point-wise mutual information (PMI) measure, and the word-to-document relation carries the Term Frequency-Inverse Document Frequency (TF-IDF) measure. As output, we obtain a single graph representation for all the text documents in the corpus.
- Integrated Syntactic Graph: This representation integrates multiple linguistic levels in a single data structure. These levels are: the Lexical level (lexical items such as words), the Morphological level (the identification, analysis, and description of the structure of the language's morphemes, such as POS tags, roots, and stems), the Syntactic level (sentence structure, such as dependency trees), and the Semantic level (the meaning of the sentences, which can include antonymy, synonymy, etc.).
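The edge weights of the Heterogeneous Graph described above can be sketched in plain Python. This is only an approximation under stated assumptions: the function names, the `doc_<id>` node naming, the windowing scheme, the use of unsmoothed `log(N / df)` for IDF, and the choice to keep only positive PMI values are all hypothetical and may differ from what text2graphapi actually does.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_edges(corpus):
    """Word-to-document edges weighted by tf * log(N / df)."""
    n_docs = len(corpus)
    doc_freq = Counter(w for tokens in corpus.values() for w in set(tokens))
    edges = {}
    for doc_id, tokens in corpus.items():
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            edges[(word, f"doc_{doc_id}")] = tf * math.log(n_docs / doc_freq[word])
    return edges


def pmi_edges(corpus, window_size=20):
    """Word-to-word edges weighted by PMI over sliding windows (positive only)."""
    windows = []
    for tokens in corpus.values():
        if len(tokens) <= window_size:
            windows.append(tokens)  # short document: a single window
        else:
            windows.extend(tokens[i : i + window_size]
                           for i in range(len(tokens) - window_size + 1))
    n_windows = len(windows)
    word_count = Counter(w for win in windows for w in set(win))
    pair_count = Counter(pair for win in windows
                         for pair in combinations(sorted(set(win)), 2))
    edges = {}
    for (a, b), n_ab in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ), with probabilities over windows
        pmi = math.log(n_ab * n_windows / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges


# Toy corpus, already tokenized and normalized (assumption).
corpus = {1: ["sun", "river", "bright"], 2: ["rain", "sun", "river"], 3: ["rain", "cloud"]}

# Corpus-level: a single edge map holding both edge types.
graph = {**tfidf_edges(corpus), **pmi_edges(corpus)}
print(len(graph))  # 12
```

In this toy corpus, `("river", "sun")` is kept because the two words share windows more often than chance would predict, while `("rain", "river")` has negative PMI and is dropped.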