
Open Source Framework: Docs2KG

Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models


Installation

We have published the package to PyPI as Docs2KG.

You can install it via:

pip install Docs2KG

Tutorial

We have a demonstration that walks through the components of Docs2KG.

Downstream usage examples are also included.

The demo video is available at Demo Docs2KG.

The tutorial details are available at Tutorial Docs2KG.

We also provide example code at Example Codes Docs2KG.

The source code documentation is available at Docs2KG Documentation.


Motivation

In our opinion, LLM applications rest on three pillars:

  • Data
  • RAG
  • LLM

Most of the tools on the market today focus on Retrieval Augmented Generation (RAG) pipelines or on running Large Language Models (LLMs) locally.

Typical tools include Ollama, LangChain, LlamaIndex, etc.

However, to make sure the wider community can benefit from the latest research, we need to first solve the data problem.

The wider community includes individual users, small businesses, and even large enterprises. Some of them may have well-developed databases, but most have a lot of data that is unstructured and distributed across different places.

So the first challenges are:

  • How can we easily process the unstructured data into a centralized place?
  • What is the best way to organize the data within the centralized place?

Proposed Solution

This package is a proposed solution to the above challenges.

  • We developed this tool so that the wider community can easily process unstructured data into a centralized place.
  • We propose a way to organize the data within that centralized place: a unified multimodal knowledge graph together with semi-structured data.

Given the nature of unstructured and heterogeneous data, information extraction and knowledge representation pose significant challenges. In this package, we introduce Docs2KG, a novel framework designed to extract multi-modal information from diverse and heterogeneous unstructured data sources, including emails, web pages, PDF files, and Excel files. Docs2KG dynamically generates a unified knowledge graph that represents the extracted information, enabling efficient querying and exploration of the data. Unlike existing approaches that focus on specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework not only simplifies data processing but also improves the interpretability of models across diverse domains.

Overall Architecture

The overall architecture design is shown below:

[Figure: overall architecture design of Docs2KG]

Data from multiple sources is processed by the Dual-Path Data Processing module. Some data, for example exported PDF files and Excel files, can be handled by programmatic parsers: it is generally converted into Markdown and then transformed into the unified knowledge graph. For data such as scanned PDFs and images, we need document layout analysis and OCR to extract the information; the extracted information is then converted into Markdown and transformed into the unified knowledge graph.
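
The routing idea can be pictured with a small Python sketch. The two parser functions below are hypothetical placeholders, not the actual Docs2KG API; only the dual-path decision logic is the point here.

from pathlib import Path

def parse_with_programmatic_parser(path: Path) -> str:
    """Hypothetical stand-in: extract Markdown from a digitally born PDF or Excel file."""
    return f"# Parsed content of {path.name}\n"

def parse_with_layout_analysis_and_ocr(path: Path) -> str:
    """Hypothetical stand-in: run document layout analysis + OCR on a scanned file or image."""
    return f"# OCR output for {path.name}\n"

def to_markdown(path: Path, is_scanned: bool = False) -> str:
    # Route each document down one of the two processing paths.
    if is_scanned or path.suffix.lower() in {".png", ".jpg", ".tiff"}:
        return parse_with_layout_analysis_and_ocr(path)
    return parse_with_programmatic_parser(path)

print(to_markdown(Path("exported_report.pdf")))
print(to_markdown(Path("scanned_report.pdf"), is_scanned=True))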

The unified multimodal knowledge graph is then generated from these outputs:

  • Text
    • Markdown
    • Text2KG Output
  • Table CSV
  • Table Image
  • Image

The unified multimodal knowledge graph has two main aspects:

  • Layout Knowledge Graph
    • The layout of a document helps us understand its structure.
    • So it is necessary and important to represent it within the unified multimodal knowledge graph.
  • Semantic Knowledge Graph
    • The semantic connections are what our brains focus on when we read documents.
    • So, with the help of an LLM, we can try to extract the semantic connections from the documents.
    • This helps humans understand the documents better from the semantic perspective (see the sketch after this list).
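
To make the two aspects concrete, here is a minimal sketch of how layout and semantic information can coexist in a single graph. The node labels, relationship types, and example text are illustrative assumptions, not the actual Docs2KG schema; networkx is used only for illustration.

import networkx as nx

g = nx.MultiDiGraph()

# Layout aspect: the structural hierarchy of a document.
g.add_node("doc:report.pdf", label="Document")
g.add_node("sec:intro", label="Section", title="Introduction")
g.add_node("para:1", label="Paragraph", text="Production increased in 2023 ...")
g.add_edge("doc:report.pdf", "sec:intro", type="HAS_SECTION")
g.add_edge("sec:intro", "para:1", type="HAS_PARAGRAPH")

# Semantic aspect: entities and relations an LLM could extract from the text.
g.add_node("ent:production", label="Entity")
g.add_node("ent:2023", label="Entity")
g.add_edge("para:1", "ent:production", type="MENTIONS")
g.add_edge("ent:production", "ent:2023", type="INCREASED_IN")

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")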

Implemented System Architecture

[Figure: implemented system architecture of Docs2KG]

The overall steps include:

  • Data Processing
    • Dual-Path Data Processing
    • Get the documents from diverse sources with diverse formats into Markdown, CSV, JSON, etc.
  • Unified Multimodal Knowledge Graph Construction
  • GraphDB Loader
    • Load the unified multimodal knowledge graph into the GraphDB
    • We use Neo4j as the GraphDB in this project (a minimal loading sketch follows this list)
  • Further Enhancement
    • The KG schema is generated dynamically and will not be perfect at the beginning.
    • So we need to further enhance the KG schema
      • Via automatic schema merging: node-label-frequency-based merging and label-semantic-similarity-based merging
      • Via human in the loop: human review to further enhance the KG schema
  • Downstream Applications
    • Traditional Cypher query: natural language query translated to a Cypher query (optionally with help from an LLM)
    • Vector-based RAG (see the sketch after this list):
      • Get the embedding of each node first.
      • Then use the embedding of the query to do a similarity search and extract the anchor nodes within the graph.
      • Use these anchor nodes to do multi-hop information extraction and augment the query.
      • Use an LLM to do the final generation based on the augmented query.
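
For the GraphDB Loader step, loading extracted triples into Neo4j with the official Python driver might look like the following minimal sketch. The connection details and the Entity/REL node and relationship shapes are assumptions for illustration, not the exact Docs2KG schema.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_triple(tx, head, rel, tail):
    # MERGE keeps the load idempotent when the same entity appears in several documents.
    tx.run(
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[:REL {type: $rel}]->(t)",
        head=head, rel=rel, tail=tail,
    )

with driver.session() as session:
    session.execute_write(load_triple, "gold production", "INCREASED_IN", "2023")

driver.close()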
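
The vector-based RAG flow listed above can also be sketched end to end in a few lines of Python. Here embed() and generate() are hypothetical stand-ins for an embedding model and an LLM, and a tiny in-memory graph replaces the real Neo4j store.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding model; a real system would call an embedding API."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy graph: node id -> text, plus one edge used for multi-hop expansion.
nodes = {"n1": "Gold production increased in 2023", "n2": "Exploration licences were renewed"}
edges = [("n1", "n2")]
node_vecs = {nid: embed(text) for nid, text in nodes.items()}

query = "How did gold production change in 2023?"
q_vec = embed(query)

# 1) Similarity search to pick the anchor node.
anchor = max(node_vecs, key=lambda nid: cosine(q_vec, node_vecs[nid]))

# 2) Multi-hop expansion around the anchor to augment the query context.
context = [nodes[anchor]] + [nodes[t] for h, t in edges if h == anchor]

# 3) Final generation with an LLM (hypothetical stand-in here).
def generate(prompt: str) -> str:
    return f"(LLM answer based on: {prompt})"

print(generate(query + "\nContext: " + "; ".join(context)))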

Setup and Development

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements.dev.txt

pip install -e .

Citation

If you find this package useful, please consider citing our work:

@misc{sun2024docs2kg,
      title={Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models}, 
      author={Qiang Sun and Yuanyi Luo and Wenxiao Zhang and Sirui Li and Jichunyang Li and Kai Niu and Xiangrui Kong and Wei Liu},
      year={2024},
      eprint={2406.02962},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

