Incremental Knowledge Graphs Constructor Using Large Language Models

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
Operating System
Programming Language
- Python :: 3

Project description

iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models

Overview

iText2KG is a Python package designed to construct knowledge graphs incrementally by leveraging large language models for entity and relation extraction from text documents. It features zero-shot capability, allowing for knowledge extraction across various domains without specific training. The package includes modules for document distillation, entity extraction, and relation extraction, ensuring resolved and unique entities and relationships. It continuously updates the KG with new documents and integrates them into Neo4j for visual representation

Installation

To install iText2KG, ensure you have Python installed, then use pip to install

pip install itext2kg

The Overall Architecture

The iText2KG package consists of four main modules that work together to construct and visualize knowledge graphs from unstructured text. An overview of the overall architecture:

Document Distiller: This module processes raw documents and reformulates them into semantic blocks based on a user-defined schema. It improves the signal-to-noise ratio by focusing on relevant information and structuring it in a predefined format.
Incremental Entity Extractor: This module extracts unique entities from the semantic blocks and resolves ambiguities to ensure each entity is clearly defined. It uses cosine similarity measures to match local entities with global entities.
Incremental Relation Extractor: This module identifies relationships between the extracted entities. It can operate in two modes: using global entities to enrich the graph with potential information or using local entities for more precise relationships.
Graph Integrator and Visualization: This module integrates the extracted entities and relationships into a Neo4j database, providing a visual representation of the knowledge graph. It allows for interactive exploration and analysis of the structured data.

itext2kg

The LLM is prompted to extract entities representing one unique concept to avoid semantically mixed entities. The following figure presents the entity and relation extraction prompts using the Langchain JSON Parser. They are categorized as follows: Blue - prompts automatically formatted by Langchain; Regular - prompts we have designed; and Italic - specifically designed prompts for entity and relation extraction. (a) prompts for relation extraction and (b) prompts for entity extraction.

prompts

Modules and Examples

All the examples are provided in the following jupyter notebook example.

The Document Distiller module reformulates raw documents into predefined and semantic blocks using LLMs. It utilizes a schema to guide the extraction of specific information from each document.

The `DocumentDistiller`

Example

from itext2kg import DocumentDistiller
# You can define a schema or upload some predefined ones.
from itext2kg.utils import Article

# Define your OpenAI API key.
OPENAI_API_KEY = "####"

# Initialize the DocumentDistiller with the OpenAI API key.
document_distiller = DocumentDistiller(openai_api_key=OPENAI_API_KEY)

# List of documents to be distilled.
documents = ["doc1", "doc2", "doc3"]

# Information extraction query.
IE_query = '''
# DIRECTIVES : 
- Act like an experienced information extractor. 
- You have a chunk of a scientific paper.
- If you do not find the right information, keep its place empty.
'''

# Distill the documents using the defined query and output data structure.
distilled_doc = document_distiller.distill(documents=documents, IE_query=IE_query, output_data_structure=Article)

You can define a custom schema using pydantic_v1 of langchain_core.

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

# Define an Author model with name and affiliation fields.
class Author(BaseModel):
    name: str = Field(description="The name of the author")
    affiliation: str = Field(description="The affiliation of the author")
    
# Define an Article model with various fields describing a scientific article.
class Article(BaseModel):
    title: str = Field(description="The title of the scientific article")
    authors: List[Author] = Field(description="The list of the article's authors and their affiliation")
    abstract: str = Field(description="The article's abstract")
    key_findings: str = Field(description="The key findings of the article")
    limitation_of_sota: str = Field(description="limitation of the existing work")
    proposed_solution: str = Field(description="The proposed solution in details")
    paper_limitations: str = Field(description="The limitations of the proposed solution of the paper")

The `iText2KG`

The iText2KG module is the core component of the package, responsible for integrating various functionalities to construct the knowledge graph. It uses the distilled semantic sections from documents to extract entities and relationships, and then builds the knowledge graph incrementally.

# Initialize iText2KG with the OpenAI API key.
itext2kg = iText2KG(openai_api_key=OPENAI_API_KEY)

# Format the distilled document into semantic sections.
semantic_blocks = [f"{key} - {value}".replace("{", "[").replace("}", "]") for key, value in distilled_doc.items()]

# Build the knowledge graph using the semantic sections.
global_ent, global_rel = itext2kg.build_graph(sections=semantic_blocks)

The Arguments of iText2KG:

openai_api_key (str): The API key for accessing OpenAI services. This key is used to authenticate and make requests to OpenAI's language models for entity and relation extraction.
embeddings_model_name (str, optional): The name of the embeddings model to be used for generating text embeddings. Default is "text-embedding-3-large".
model_name (str, optional): The name of the language model to be used for entity and relation extraction. Default is "gpt-4-turbo".
temperature (float, optional): The temperature setting for the language model, controlling the randomness of the output. Default is 0.
sleep_time (int, optional): The sleep time between API requests to avoid rate limiting. Default is 5 seconds.

The `GraphIntegrator`

It integrates the extracted entities and relationships into a Neo4j graph database and provides a visualization of the knowledge graph. This module allows users to easily explore and analyze the structured data using Neo4j's graph capabilities.

from itext2kg.graph_integration import GraphIntegrator

URI = "bolt://localhost:####"
USERNAME = "####"
PASSWORD = "####"


new_graph = {}
new_graph["nodes"] = global_ent
new_graph["relationships"] = global_rel

GraphIntegrator(uri=URI, username=USERNAME, password=PASSWORD).visualize_graph(json_graph=new_graph)

Some `iText2KG` use-cases

In the figure below, we have constructed a KG for the article seasonal and for the company company, with its permission to publish it publicly. Additionally, the Curriculum Vitae (CV) KG is based on the following generated CV.

text2kg

Dataset

The dataset consists of five generated CVs using GPT-4, five randomly selected scientific articles representing various domains of study with diverse structures, and five company websites from different industries of varying sizes. Additionally, we have included distilled versions of the CVs and scientific articles based on predefined schemas.

Public Collaboration

We welcome contributions from the community to improve iText2KG.

Project details

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
Operating System
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.0.5

Sep 20, 2024

0.0.4

Sep 17, 2024

0.0.3

Sep 6, 2024

This version

0.0.1

Jul 16, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

itext2kg-0.0.1.tar.gz (27.2 kB view hashes)

Uploaded Jul 16, 2024 Source

Built Distribution

itext2kg-0.0.1-py3-none-any.whl (30.6 kB view hashes)

Uploaded Jul 16, 2024 Python 3

Hashes for itext2kg-0.0.1.tar.gz

Hashes for itext2kg-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`2da3d06c7353973dfcc70565bb38e07345123d03b7e23c7309efe219237cdb0f`
MD5	`463781c7b1dfa00cddd2f8f8a6ae10f3`
BLAKE2b-256	`0f0a1784e4d1ad8a0e3e8cf4d0356b194994a84f5e54b57281a6c89b38af2ba6`

Hashes for itext2kg-0.0.1-py3-none-any.whl

Hashes for itext2kg-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0efc526e327f096a4349efe1a8044551f720fbe4fe6281479adb2c84e2c68a4d`
MD5	`6529cd710b9e03c6b3a1fc2a977dab86`
BLAKE2b-256	`701d7eb931606ecc92fb69b5a124896308c028039115fe76e7c46e9f23e4e072`

itext2kg 0.0.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models

Overview

Installation

The Overall Architecture

Modules and Examples

The `DocumentDistiller`

The `iText2KG`

The `GraphIntegrator`

Some `iText2KG` use-cases

Dataset

Public Collaboration

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

itext2kg 0.0.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models

Overview

Installation

The Overall Architecture

Modules and Examples

The DocumentDistiller

The iText2KG

The GraphIntegrator

Some iText2KG use-cases

Dataset

Public Collaboration

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

The `DocumentDistiller`

The `iText2KG`

The `GraphIntegrator`

Some `iText2KG` use-cases