Use this library to transform raw text into different graph representations.

text2graph Library

text2graphapi is a Python library for text-to-graph transformations. To use it, install the library and its dependencies in your application, then load and read the corpus of text documents that you want to transform into graphs.

The text-to-graph transformation pipeline consists of three main modules, as depicted in the following diagram:

Text-to-graph pipeline

  • Text Preprocessing and Normalization. This component receives the input text (in a specific format/structure) and performs all the cleaning and normalization steps on the data. It applies text cleaning methods such as removing stop words, handling contractions, and handling ASCII characters. It also applies NLP techniques such as POS tagging, tokenization, and lemmatization, using third-party libraries such as spaCy and NLTK.

  • Graph Model. This second component defines and constructs the entities/nodes and their relationships/edges from the corpus texts to generate the specified graph representation. Currently, this library supports three text-to-graph transformations: Word Co-Occurrence Graph, Heterogeneous Graph, and Integrated Syntactic Graph (ISG). Each of them is described in detail in the following sections.

  • Graph Transformation. This final module receives the generated graph (a set of nodes and edges) as input and applies vector transformations to obtain the final graph representation as output. The output format is specified in the input parameters; supported formats include adjacency list, adjacency matrix, dense matrix, and networkx object, among others.
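To make the three stages concrete, here is a minimal standard-library sketch of the same flow: clean the text, build edges, and convert the result into two output formats. It is not the library's implementation. The function names, the tiny stop-word list, and the "link each token to the next one" graph model are all assumptions chosen for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption; real pipelines use the
# much larger lists shipped with spaCy or NLTK).
STOP_WORDS = {"the", "a", "an", "and", "was", "with", "out"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def to_adjacency_list(edges):
    """Turn undirected (u, v) pairs into a node -> sorted-neighbours dict."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return {node: sorted(neigh) for node, neigh in sorted(adjacency.items())}


def to_adjacency_matrix(adjacency):
    """Turn an adjacency list into a dense 0/1 matrix plus its node order."""
    nodes = list(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for node, neighbours in adjacency.items():
        for other in neighbours:
            matrix[index[node]][index[other]] = 1
    return nodes, matrix


# Stage 1: preprocessing and normalization.
tokens = preprocess("The sun was shining, making the river look bright.")
# Stage 2: stand-in graph model that links each token to the next one.
edges = list(zip(tokens, tokens[1:]))
# Stage 3: the same graph expressed in two output formats.
adjacency = to_adjacency_list(edges)
nodes, matrix = to_adjacency_matrix(adjacency)
```

The adjacency list and the matrix carry the same information; the matrix additionally needs the `nodes` list to say which row belongs to which word.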

Installation from PyPI

Inside your project, run the following command from your CLI to install the latest version of the library:

pip install text2graphapi

Types of graph representation available:

Currently, this library supports three types of graph representation: Word Co-Occurrence Graph, Heterogeneous Graph, and Integrated Syntactic Graph (ISG). The Word Co-Occurrence transformation is classified as a document-level graph because it produces one output graph per input document (one graph for each document in the corpus). The Heterogeneous and ISG transformations are classified as corpus-level graphs because they produce a single output graph that represents the whole corpus (one graph for all documents in the corpus).

The following code snippet shows a basic example of using the text2graphapi library for these three representations.

# The input has to be a list of dictionaries, where each dict contains an 'id' and the 'doc' text data

from text2graphapi.src.Cooccurrence import Cooccurrence
from text2graphapi.src.Heterogeneous import Heterogeneous
from text2graphapi.src.IntegratedSyntacticGraph import ISG

corpus_docs = [
    {'id': 1, 'doc': "The sun was shining, making the river look bright and happy."},
    {'id': 2, 'doc': "Even with the rain, the sun came out a bit, making the wet river shine."}]

to_word_coocc_graph = Cooccurrence(graph_type = 'DiGraph', 
        language = 'en', apply_preprocessing = True, 
        window_size = 3, output_format = 'adj_matrix')

to_hetero_graph = Heterogeneous(graph_type = 'Graph', 
        window_size = 20, apply_preprocessing = True, 
        language = 'en', output_format = 'networkx')

to_isg_graph = ISG(graph_type = 'DiGraph',  language = 'en', 
        apply_preprocessing = True, output_format = 'networkx')

to_hetero_graph.transform(corpus_docs)
to_word_coocc_graph.transform(corpus_docs)
to_isg_graph.transform(corpus_docs)
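If your documents start out as plain strings, the input structure described above (a list of dicts with 'id' and 'doc' keys) can be built with a small helper. The sketch below is not part of text2graphapi; `build_corpus` and `check_corpus` are hypothetical names used only for illustration.

```python
def build_corpus(texts, start_id=1):
    """Wrap plain strings as {'id': ..., 'doc': ...} dicts with sequential ids."""
    return [
        {"id": doc_id, "doc": text}
        for doc_id, text in enumerate(texts, start=start_id)
    ]


def check_corpus(corpus_docs):
    """Raise ValueError if an entry lacks 'id' or 'doc', or 'doc' is not a string."""
    for position, entry in enumerate(corpus_docs):
        if not isinstance(entry, dict) or not {"id", "doc"} <= entry.keys():
            raise ValueError(f"entry {position} needs 'id' and 'doc' keys")
        if not isinstance(entry["doc"], str):
            raise ValueError(f"entry {position}: 'doc' must be a string")


corpus_docs = build_corpus([
    "The sun was shining, making the river look bright and happy.",
    "Even with the rain, the sun came out a bit, making the wet river shine.",
])
check_corpus(corpus_docs)  # passes silently when the structure is valid
```

Checking the structure before calling `transform` makes a malformed entry fail early with a clear message.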

In the next section, we will see some illustrative examples generated by this code. We will show each of the graph representations and explain in detail how they are built.

  • Word Co-Occurrence Graph: In this graph, words are represented as nodes, and the co-occurrence of two words within the document text is represented as an edge between the corresponding nodes. As attributes/weights, nodes have the POS tag, and edges have the number of co-occurrences between the words in the text document. As output, we will have one graph representation for each text document in the corpus.

Cooc Graph
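The edge weights of a co-occurrence graph can be sketched with a sliding window over the tokens of one document. The code below is a minimal illustration rather than the library's implementation. It assumes that two words co-occur when the second follows the first within `window_size` consecutive tokens, which may differ from how text2graphapi interprets its `window_size` parameter, and it omits the POS tag node attribute.

```python
from collections import Counter


def cooccurrence_edges(tokens, window_size=3):
    """Count ordered pairs (left, right) where `right` follows `left`
    within a span of `window_size` consecutive tokens."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window_size]:
            if left != right:  # skip self-loops
                counts[(left, right)] += 1
    return counts


# Already-preprocessed tokens of a single document (illustrative data).
tokens = "sun make river shine sun make river bright".split()
edges = cooccurrence_edges(tokens, window_size=3)
for (left, right), weight in sorted(edges.items()):
    print(f"{left} -> {right}: {weight}")
```

Because the pairs are ordered, the result corresponds to a directed graph; for an undirected graph, the counts of (a, b) and (b, a) would be merged.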

  • Heterogeneous Graph: In this graph, words and documents are represented as nodes, and the word-to-word and word-to-document relations are represented as edges. As attributes/weights, the word-to-word relation has the point-wise mutual information (PMI) measure, and the word-to-document relation has the Term Frequency-Inverse Document Frequency (TF-IDF) measure. As output, we will have only one graph representation for all the text documents in the corpus.

Hetero Graph
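The two edge weights of the heterogeneous graph can be illustrated with textbook definitions of TF-IDF and PMI. This is a sketch under stated assumptions, not the library's code: it uses tf × log(N / df) for TF-IDF, estimates PMI from sliding windows over each document, and keeps only word pairs with positive PMI. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_edges(docs):
    """Weight word-document edges with tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in docs.values() for word in set(tokens))
    edges = {}
    for doc_id, tokens in docs.items():
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            edges[(word, doc_id)] = tf * math.log(n_docs / doc_freq[word])
    return edges


def pmi_edges(docs, window_size=3):
    """Weight word-word edges with PMI over sliding windows; keep PMI > 0."""
    windows = []
    for tokens in docs.values():
        last_start = max(len(tokens) - window_size, 0)
        for start in range(last_start + 1):
            windows.append(set(tokens[start : start + window_size]))
    word_count = Counter(w for win in windows for w in win)
    pair_count = Counter(
        pair for win in windows for pair in combinations(sorted(win), 2)
    )
    n = len(windows)
    edges = {}
    for (a, b), count in pair_count.items():
        pmi = math.log((count / n) / ((word_count[a] / n) * (word_count[b] / n)))
        if pmi > 0:  # assumption: only positively associated pairs become edges
            edges[(a, b)] = pmi
    return edges


# Already-preprocessed tokens per document (illustrative data).
docs = {
    "doc_1": ["sun", "shine", "river", "bright"],
    "doc_2": ["rain", "sun", "wet", "river", "shine"],
}
word_doc = tfidf_edges(docs)              # keys are (word, doc_id)
word_word = pmi_edges(docs, window_size=3)  # keys are (word_a, word_b), sorted
```

With this TF-IDF variant, a word that appears in every document gets weight 0, so its word-to-document edges carry no information; smoothed variants avoid that.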

  • Integrated Syntactic Graph (ISG): This representation integrates multiple linguistic levels in a single data structure. These levels are: the lexical level (lexical items such as words), the morphological level (the identification, analysis, and description of the structure of the language's morphemes, such as POS tags, roots, and stems), the syntactic level (sentence structure, such as dependency trees), and the semantic level (the meaning of sentences, which can include relations such as antonymy and synonymy).

ISG Graph
