Strwythura

Strwythura tutorial, based on a presentation about GraphRAG for GraphGeeks on 2024-08-14

How to construct a knowledge graph (KG) from unstructured data sources using state of the art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.

Motivation for this tutorial comes from the stark fact that the term "GraphRAG" means many things, based on multiple conflicting definitions. Several popular implementations reveal a relatively cursory understanding about either natural language processing (NLP) or graph algorithms, plus a vendor bias toward their own query language.

See this article for more details and history: "Unbundling the Graph in GraphRAG".

Instead of delegating KG construction to a large language model (LLM), this tutorial shows the use of sophisticated NLP pipelines based on spaCy, GLiNER, TextRank, and related libraries. Results are better/faster/cheaper, plus this provides more control and oversight for intentional arrangement of the KG. Then for downstream usage in a question/answer chat bot, an enhanced GraphRAG approach leverages graph algorithms (e.g., semantic random walk) to optimize retrieval of text chunks which ultimately get presented to an LLM for summarization to produce responses.

For more detailed discussions, see:

A few key issues regarding KG construction with LLMs still have not been addressed by the graph community in general:

  1. LLMs tend to mangle cross-domain semantics when used for building graphs; see Mai2024 referenced in the "GraphRAG to enhance LLM-based apps" talk above.
  2. You need to introduce a semantic layer for representing the domain context, which follows more of a neurosymbolic AI approach.
  3. Almost all LLMs perform question rewriting in ways which cannot be disabled, even when the temperature parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
  4. Any model used for prediction introduces reasoning based on generalization, even more so when the model uses a loss function for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
  5. The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.

Of course, YMMV.

Overall, this approach leverages neurosymbolic AI methods, combining best practices from:

  • natural language processing
  • graph data science
  • ontology pipeline
  • context engineering
  • human-in-the-loop

to illustrate a reference implementation for entity-resolved retrieval-augmented generation (ER-RAG).

Set up

Caveat: this code runs with Python 3.11, though the range of versions may be extended soon.

poetry update
poetry run python3 -m spacy download en_core_web_md

Note: if you're working with text documents in another language, change the spaCy model downloaded here, and also the model setting in the config.toml file. It's a shame this cannot be done programmatically in a more fluent Pythonic way, for a variety of complex reasons.

Usage

Caveat: this repo provides the source code and notebooks which accompany an instructional tutorial; it is not intended as a packaged library or maintained product.

That said, if you want to use this code to build an application, it may help to copy the settings in config.toml into a custom configuration file, then instantiate new Strwythura and GraphRAG objects using it.

Part 1: Build assets

Given as input:

  • a list of URLs from which to scrape content
  • domain.ttl -- semantics for the domain context

Note: the domain.ttl file provides an ontology pipeline for the given domain, used as the human-in-the-loop basis for constructing a semantic layer. Along with the curate.py script described below, this illustrates human-in-the-loop approaches in KG construction.

The build.py script scrapes text sources and constructs a knowledge graph plus entity embeddings, with nodes linked to chunks in a vector store:

poetry run python3 build.py

Demo data used in this case includes articles about the linkage between eating processed red meat frequently and the risks of dementia later in life, based on long-term studies.

The approach in this tutorial iterates through multiple steps to produce the assets needed for GraphRAG downstream:

  1. Scrape each URL using requests and BeautifulSoup
  2. Split the text into chunks
  3. Build vector embeddings for each chunk, in LanceDB
  4. Parse each text chunk using spaCy, iterating per sentence
  5. Extract entities from each sentence using GLiNER
  6. Build a lexical graph from the parse trees in NetworkX
  7. Run a TextRank algorithm to rank important entities
  8. Build an embedding model for entities using gensim.Word2Vec
  9. Generate an interactive visualization using PyVis
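
To make steps 6 and 7 more concrete, here is a minimal sketch of the idea behind TextRank: build a graph where terms which occur near each other are linked, then run a PageRank-style iteration so that well-connected terms rise to the top. This is a simplified stand-in written in plain Python; it uses a regex tokenizer, a tiny stopword list, and made-up sentences in place of the spaCy parse trees, GLiNER entities, and NetworkX graph used by build.py.

```python
import re
from collections import defaultdict

# ASSUMPTION: a tiny stopword list and regex tokenizer stand in for a
# real spaCy parse; the sentences below are made-up demo data.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "with", "may"}

SENTENCES = [
    "Eating processed red meat is linked to dementia risk.",
    "Processed red meat may raise dementia risk later in life.",
    "Long-term studies tracked diet and dementia.",
]


def build_lexical_graph(sentences, window=2):
    """Link each term to the terms which follow it within `window` positions."""
    graph = defaultdict(set)

    for sent in sentences:
        tokens = [
            tok for tok in re.findall(r"[a-z]+", sent.lower())
            if tok not in STOPWORDS
        ]

        for i, tok in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other != tok:
                    # undirected edge, stored in both directions
                    graph[tok].add(other)
                    graph[other].add(tok)

    return graph


def textrank(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration over an undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}

    for _ in range(iterations):
        new_rank = {}

        for node in nodes:
            # each neighbor shares its rank equally among its own neighbors
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1.0 - damping) / len(nodes) + damping * incoming

        rank = new_rank

    return rank


graph = build_lexical_graph(SENTENCES)
ranks = textrank(graph)

for term, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{score:.4f}  {term}")
```

In the actual pipeline the graph nodes come from lemmatized parse results rather than raw tokens, which is what lets the ranked items line up with extracted entities.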

Note: processing may take a few extra minutes the first time it runs since PyTorch must download a large (~2GB) file.

If you look at the performance statistics, it takes almost twice as long to generate an interactive graph visualization as it does to perform everything else.

The assets get serialized into these files:

  • data/lancedb -- vector database tables in LanceDB
  • data/kg.json -- serialization of NetworkX graph
  • data/sem.csv -- entity semantics from curate.py
  • data/entity.w2v -- entity embeddings in Gensim
  • data/url_cache.sqlite -- URL cache in SQLite
  • kg.html -- interactive graph visualization in PyVis

Part 2: GraphRAG chat bot

A good downstream use case for exploring a newly constructed KG is GraphRAG, used for grounding the responses by an LLM in a question/answer chat.

This implementation uses BAML https://docs.boundaryml.com/home and leverages the KG using semantic random walks.
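
The general shape of retrieval by random walk can be sketched as follows: start from an entity mentioned in the question, take many short random walks through the KG, and count how often the text chunks linked to each visited entity are reached. This is only an illustration under stated assumptions; the toy graph, the entity-to-chunk mapping, and the function name are hypothetical, and the walk here is unweighted, whereas a "semantic" walk would presumably bias the steps using the semantic layer.

```python
import random
from collections import Counter

# ASSUMPTION: toy data standing in for the KG and its links to chunks.
# entity -> neighboring entities
KG = {
    "red meat": ["dementia", "processed food"],
    "dementia": ["red meat", "cognitive decline"],
    "processed food": ["red meat"],
    "cognitive decline": ["dementia"],
}

# entity -> ids of the text chunks which mention it
ENTITY_CHUNKS = {
    "red meat": [0, 1],
    "dementia": [1, 2],
    "processed food": [3],
    "cognitive decline": [4],
}


def random_walk_chunks(start, num_walks=200, walk_len=3, top_k=3, seed=42):
    """Rank chunk ids by how often short random walks from `start` reach them."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    visits = Counter()

    for _ in range(num_walks):
        node = start

        for _ in range(walk_len):
            # credit every chunk linked to the entity currently visited
            for chunk_id in ENTITY_CHUNKS[node]:
                visits[chunk_id] += 1

            # step to a uniformly chosen neighbor
            node = rng.choice(KG[node])

    return [chunk_id for chunk_id, _ in visits.most_common(top_k)]


top_chunks = random_walk_chunks("red meat")
print(top_chunks)
```

The chunks ranked this way would then be passed to the LLM as grounding context for summarization, with chunks near the question's entities favored over distant ones.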

To set up, first download/install Ollama https://ollama.com/ and pull the Gemma3 model https://huggingface.co/google/gemma-3-12b-it

ollama pull gemma3:12b

Then run the errag.py script for an interactive GraphRAG example:

poetry run python3 errag.py

Part 3: Semantics curation (WIP)

This code uses a semantic layer -- in other words, a "backbone" for the KG -- to organize the entities and relations which get abstracted from the lexical graph.

If you had previously run entity resolution from structured data sources, which tend to be more reliable than unstructured content, this approach could integrate those results as well.

For now, run the curate.py script to generate a view of the ranked NER results, serialized as the data/sem.csv file. This can be viewed in a spreadsheet to understand how to iterate on the semantic definitions for more effective graph organization in the domain of the scraped documents.

poetry run python3 curate.py

Generalized, Unbundled Process

Objective:

Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.

These steps define a generalized process, where this tutorial picks up at the lexical graph (without the EL part yet):

Semantic overlay:

  1. Load any pre-defined controlled vocabularies directly into the KG.

Data graph:

  1. Load the structured data sources or updates into a data graph.
  2. Perform entity resolution (ER) on PII extracted from the data graph.
  3. Use ER results to generate a semantic overlay as a "backbone" for the KG.

Lexical graph:

  1. Parse the text chunks, using lemmatization to normalize token spans.
  2. Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
  3. Analyze named entity recognition (NER) to extract candidate entities from NP spans.
  4. Analyze relation extraction (RE) to extract relations between pairwise entities.
  5. Perform entity linking (EL) leveraging the ER results.
  6. Promote the extracted entities and relations up to the semantic overlay.
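
As a sketch of steps 5 and 6 in the lexical graph list, the example below links extracted spans to previously resolved entities by matching normalized names, and sets aside anything unmatched for human review rather than guessing. All names and data here are hypothetical, and exact-match lookup is a deliberate simplification: this tutorial does not yet include the EL part, so treat this as one possible shape for it, not the project's method.

```python
from dataclasses import dataclass, field


@dataclass
class ResolvedEntity:
    """An entity produced by entity resolution (ER) on structured data."""
    uid: str
    label: str
    aliases: set[str] = field(default_factory=set)


def normalize(span: str) -> str:
    """Lowercase and collapse whitespace; a crude stand-in for lemmatization."""
    return " ".join(span.lower().split())


def link_entities(spans, resolved):
    """Map extracted spans to ER entity ids; return (linked, unlinked)."""
    lookup = {}

    for ent in resolved:
        for name in {ent.label, *ent.aliases}:
            lookup[normalize(name)] = ent.uid

    linked = {}
    unlinked = []

    for span in spans:
        uid = lookup.get(normalize(span))

        if uid is None:
            # no match: leave for human-in-the-loop curation
            unlinked.append(span)
        else:
            linked[span] = uid

    return linked, unlinked


# ASSUMPTION: made-up ER results and NER spans for illustration
resolved = [
    ResolvedEntity("org:1", "World Health Organization", {"WHO"}),
]

spans = ["world health  organization", "WHO", "red meat"]

linked, unlinked = link_entities(spans, resolved)
print(linked)
print(unlinked)
```

Spans which link to an ER entity can be promoted to the semantic overlay with some confidence, while the unlinked remainder is exactly the material that benefits from curation.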

Of course many vendors suggest using a large language model (LLM) as a one-size-fits-all (OSFA) "black box" approach for extracting entities and generating an entire graph automagically.

However, the business process of resolution -- for both entities and relations -- requires judgements. If the entities getting resolved are low-risk, low-effort in nature, then yeah knock yourself out. If the entities represent people or organizations, these have agency and may take actions with real consequences when applications misrepresent them.

Whenever judgements get delegated to model-based approaches, generalization becomes a form of reasoning employed. When the technology within the model is based on loss functions, then generalization becomes dominant -- regardless of any marketing claims about "AI reasoning" made by tech firms.

Fortunately, decisions can be made without models, even in AI applications. Shock, horror!!! Please, say it isn't so!?! Brace yourselves, using models is a thing, but not the only thing. For more detailed discussion, see:

Also keep in mind that black box approaches don't work especially well for regulated environments, where audits, explanations, evidence, data provenance, etc., are required.

Moreover, KGs used in mission-critical apps, such as investigations, generally require periodic data updates, so construction isn't a one-step process. By producing a KG based on the approach sketched above, updates can be handled more effectively. Any downstream use cases, such as AI applications, also benefit from improved quality of semantics and representation.

Experiment: Relation Extraction library evals

Current Python libraries for relation extraction (RE) are probably best characterized as "experimental research projects".

Their tokenization approaches tend to make the mistake of "throwing the baby out with the bath water" by not leveraging other available information, e.g., what we have in the textgraph representation of the parsed documents. Also, they tend to ignore the semantic constraints of the domain context, while computationally boiling the ocean.

RE libraries which have been evaluated:

This project had used GLiREL although its results were quite sparse. RE will be replaced by BAML or DSPy workflows in the near future.

There is some experimental code which illustrates OpenNRE evaluation. Use the archive/nre.sh script to load OpenNRE pre-trained models before running the archive/opennre.ipynb notebook.

This may not work in many environments, depending on how well the OpenNRE library is being maintained.

Tutorial notebooks

There is a collection of Jupyter notebooks (now archived) which were used to prototype code. These help illustrate important intermediate steps within these workflows:

.venv/bin/jupyter-lab
  • Part 1: archive/construct.ipynb -- detailed KG construction using a lexical graph
  • Part 2: archive/chunk.ipynb -- simple example of how to scrape and chunk text
  • Part 3: archive/vector.ipynb -- query LanceDB table for text chunk embeddings (after running build.py)
  • Part 4: archive/embed.ipynb -- query the entity embedding model (after running build.py)

Developer notes

After each BAML release update, some committer needs to regenerate its Python client source:

poetry run baml-cli generate --from strwythura/baml_src

Kudos to @prrao87, @hellovai, @louisguitton, @cj2001

FAQ

Q: "Have you tried this with langextract yet?"
A: "I'll take 'How does an instructor know a student ignored the README?' from the FAFO category, for $200" ... but yes of course, it's an interesting package, building on other interesting work used here.

Q: "What the hell is the name of this repo about?"
A: "As you may have noticed, many open source projects by Derwen are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word strwythura translates as the verb 'structure' in English."

Q: "Why aren't you using an LLM instead to build the graph?"
A: "I promise to visit you in jail."

License and Copyright

Source code, documentation, and examples have an MIT license which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2024-2025 Senzing, Inc.
