Strwythura
Strwythura tutorial, based on a presentation about GraphRAG for GraphGeeks on 2024-08-14
How to construct a knowledge graph (KG) from unstructured data sources using state of the art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.
- videos: https://youtu.be/B6_NfvQL-BE, https://senzing.com/gph-graph-rag-llm-knowledge-graphs/
- slides: https://derwen.ai/s/2njz#1
Motivation for this tutorial comes from the stark fact that the term "GraphRAG" means many things, based on multiple conflicting definitions. Several popular implementations reveal a relatively cursory understanding about either natural language processing (NLP) or graph algorithms, plus a vendor bias toward their own query language.
See this article for more details and history: "Unbundling the Graph in GraphRAG".
Instead of delegating KG construction to a large language model
(LLM), this tutorial shows the use of sophisticated NLP pipelines
based on spaCy, GLiNER, TextRank, and related libraries.
Results are better/faster/cheaper, plus this provides more control
and oversight for intentional arrangement of the KG. Then for
downstream usage in a question/answer chat bot, an enhanced GraphRAG
approach leverages graph algorithms (e.g., semantic random walk)
to optimize retrieval of text chunks which ultimately get presented
to an LLM for summarization to produce responses.
For more detailed discussions, see:
- enhanced GraphRAG: "GraphRAG to enhance LLM-based apps"
- ontology pipeline: "Intentional Arrangement" by Jessica Talisman
- spaCy: https://spacy.io/
- GLiNER: https://huggingface.co/urchade/gliner_base
- TextRank: https://www.derwen.ai/docs/ptr/explain_algo/
A few key issues regarding KG construction with LLMs still have not been addressed by the graph community in general:
- LLMs tend to mangle cross-domain semantics when used for building graphs; see Mai2024 referenced in the "GraphRAG to enhance LLM-based apps" talk above.
- You need to introduce a semantic layer for representing the domain context, which follows more of a neurosymbolic AI approach.
- Most all LLMs perform question rewriting in ways which cannot be disabled, even when the temperature parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
- Any model used for prediction introduces reasoning based on generalization, even more so when the model uses a loss function for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
- The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.
Of course, YMMV.
Overall, this approach leverages neurosymbolic AI methods, combining best practices from:
- natural language processing
- graph data science
- ontology pipeline
- context engineering
- human-in-the-loop
to illustrate a reference implementation for entity-resolved retrieval-augmented generation (ER-RAG).
Set up
Caveat: this code runs with Python 3.11, though the range of versions may be extended soon.
poetry update
poetry run python3 -m spacy download en_core_web_md
Note: if you're working with text documents in another language,
change the spaCy model downloaded here, and also the model setting
in the config.toml file. It's a shame this cannot be done
programmatically in a more fluent Pythonic way, for a variety of
complex reasons.
Usage
Caveat: this repo provides the source code and notebooks which accompany an instructional tutorial; it is not intended as a packaged library or maintained product.
That said, if you want to use this code to build an application it may
help to copy settings in config.toml into a custom configuration
file, then instantiate new Strwythura and GraphRAG objects using
it.
Part 1: Build assets
Given as input:
- a list of URLs from which to scrape content
- domain.ttl -- semantics for the domain context
Note: the domain.ttl file provides an ontology pipeline for the
given domain, used as the human-in-the-loop basis for constructing a
semantic layer. Along with the curate.py script described below,
this illustrates human-in-the-loop approaches in KG construction.
The build.py script scrapes text sources and constructs a
knowledge graph plus entity embeddings, with nodes linked to
chunks in a vector store:
poetry run python3 build.py
Demo data used in this case includes articles about the linkage between eating processed red meat frequently and the risks of dementia later in life, based on long-term studies.
The approach in this tutorial iterates through multiple steps to produce the assets needed for GraphRAG downstream:
- Scrape each URL using requests and BeautifulSoup
- Split the text into chunks
- Build vector embeddings for each chunk, in LanceDB
- Parse each text chunk using spaCy, iterating per sentence
- Extract entities from each sentence using GLiNER
- Build a lexical graph from the parse trees in NetworkX
- Run a textrank algorithm to rank important entities
- Build an embedding model for entities using gensim.Word2Vec
- Generate an interactive visualization using PyVis
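To give a feel for the lexical graph and ranking steps above, here is a minimal TextRank-style sketch using only the Python standard library. It is a simplification under stated assumptions, not the code in build.py: the tutorial builds its graph from spaCy parse trees in NetworkX, while this sketch links tokens that co-occur within a small window and then runs a plain PageRank power iteration. The function names and the toy sentences are hypothetical.

```python
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Link each token to the next `window` tokens in the same sentence."""
    graph = defaultdict(set)
    for tokens in sentences:
        for i, src in enumerate(tokens):
            for dst in tokens[i + 1 : i + 1 + window]:
                if src != dst:
                    # undirected edge, stored in both directions
                    graph[src].add(dst)
                    graph[dst].add(src)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over an undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # each neighbor shares its rank equally among its own neighbors
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            updated[node] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank


# Toy input (assumption): already tokenized and lemmatized sentences
sentences = [
    ["processed", "meat", "dementia"],
    ["meat", "risk"],
    ["meat", "study"],
    ["dementia", "risk"],
]

graph = build_cooccurrence_graph(sentences)
rank = pagerank(graph)

for token, score in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {token}")
```

In this toy graph the token with the most connections ranks first, which is the intuition behind using a ranking like this to surface important entities.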
Note: processing may take a few extra minutes the first time it runs
since PyTorch must download a large (~2GB) file.
If you look at the performance statistics, it takes almost twice as long to generate an interactive graph visualization as it does to perform everything else.
The assets get serialized into these files:
- data/lancedb -- vector database tables in LanceDB
- data/kg.json -- serialization of NetworkX graph
- data/sem.csv -- entity semantics from curate.py
- data/entity.w2v -- entity embeddings in Gensim
- data/url_cache.sqlite -- URL cache in SQLite
- kg.html -- interactive graph visualization in PyVis
Part 2: GraphRAG chat bot
A good downstream use case for exploring a newly constructed KG is GraphRAG, used for grounding the responses by an LLM in a question/answer chat.
This implementation uses BAML https://docs.boundaryml.com/home
and leverages the KG using semantic random walks.
To set up, first download/install Ollama https://ollama.com/
and pull the Gemma3 model https://huggingface.co/google/gemma-3-12b-it
ollama pull gemma3:12b
Then run the errag.py script for an interactive GraphRAG example:
poetry run python3 errag.py
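The text here does not spell out how the semantic random walk works, so the following is only a sketch of the general idea, assuming a random walk with restart as a stand-in: start from entities matched in the question, wander through the KG, and rank text chunks by how often their linked entities get visited. The toy graph, the entity-to-chunk mapping, and all function names are hypothetical and not taken from errag.py.

```python
import random
from collections import Counter


def random_walk_with_restart(graph, seeds, steps=2000, restart=0.15, rng_seed=42):
    """Count node visits for a walk which periodically jumps back to a seed."""
    rng = random.Random(rng_seed)
    visits = Counter()
    current = rng.choice(seeds)
    for _ in range(steps):
        visits[current] += 1
        neighbors = graph.get(current, [])
        if not neighbors or rng.random() < restart:
            current = rng.choice(seeds)  # restart at a seed entity
        else:
            current = rng.choice(neighbors)  # follow a random edge
    return visits


def rank_chunks(visits, entity_chunks):
    """Score each chunk by the visit counts of the entities linked to it."""
    scores = Counter()
    for entity, count in visits.items():
        for chunk_id in entity_chunks.get(entity, []):
            scores[chunk_id] += count
    return [chunk_id for chunk_id, _ in scores.most_common()]


# Toy KG as adjacency lists (assumption), with one unrelated component
KG = {
    "red meat": ["dementia", "nitrites"],
    "dementia": ["red meat", "cognition"],
    "nitrites": ["red meat"],
    "cognition": ["dementia"],
    "weather": ["rain"],
    "rain": ["weather"],
}

# Hypothetical links from entities to chunk ids in the vector store
ENTITY_CHUNKS = {
    "red meat": [0, 1],
    "dementia": [1, 2],
    "nitrites": [3],
    "cognition": [2],
    "weather": [4],
    "rain": [4],
}

visits = random_walk_with_restart(KG, seeds=["red meat"])
ranked = rank_chunks(visits, ENTITY_CHUNKS)
print(ranked)
```

Chunks linked only to entities outside the neighborhood of the seeds never get scored, which is the point: the graph structure filters what gets presented to the LLM.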
Part 3: Semantics curation (WIP)
This code uses a semantic layer -- in other words, a "backbone" for the KG -- to organize the entities and relations which get abstracted from the lexical graph.
If you had previously run entity resolution from structured data sources, which tend to be more reliable than unstructured content, this approach could integrate those results as well.
For now, run the curate.py script to generate a view of the ranked
NER results, serialized as the data/sem.csv file. This can be
viewed in a spreadsheet to understand how to iterate on the semantic
definitions for more effective graph organization in the domain of the
scraped documents.
poetry run python3 curate.py
Generalized, Unbundled Process
Objective:
Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.
These steps define a generalized process, where this tutorial picks up at the lexical graph, without the entity linking (EL) part yet:
Semantic overlay:
- Load any pre-defined controlled vocabularies directly into the KG.
Data graph:
- Load the structured data sources or updates into a data graph.
- Perform entity resolution (ER) on PII extracted from the data graph.
- Use ER results to generate a semantic overlay as a "backbone" for the KG.
Lexical graph:
- Parse the text chunks, using lemmatization to normalize token spans.
- Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
- Analyze named entity recognition (NER) to extract candidate entities from NP spans.
- Analyze relation extraction (RE) to extract relations between pairwise entities.
- Perform entity linking (EL) leveraging the ER results.
- Promote the extracted entities and relations up to the semantic overlay.
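As a very rough sketch of the last two steps, assuming a controlled vocabulary keyed by normalized text: extracted spans which match the vocabulary get linked to a concept in the semantic overlay, while the rest stay in the lexical graph for later curation. Real entity linking would leverage ER results and proper lemmatization; the `normalize` helper, the `ex:` identifiers, and the sample spans are hypothetical.

```python
def normalize(span: str) -> str:
    """Crude stand-in for lemmatization: lowercase and collapse whitespace."""
    return " ".join(span.lower().split())


def link_entities(extracted, vocabulary):
    """Split extracted spans into those linked to a concept and those left over."""
    linked = {}
    unlinked = []
    for span in extracted:
        key = normalize(span)
        if key in vocabulary:
            linked[span] = vocabulary[key]  # promote to the semantic overlay
        else:
            unlinked.append(span)  # stays in the lexical graph for curation
    return linked, unlinked


# Hypothetical controlled vocabulary: normalized label -> concept identifier
VOCABULARY = {
    "dementia": "ex:Dementia",
    "processed red meat": "ex:ProcessedRedMeat",
}

extracted = ["Dementia", "processed  red meat", "hot dogs"]
linked, unlinked = link_entities(extracted, VOCABULARY)

print(linked)
print(unlinked)
```

The unlinked spans are the ones a human curator would review when iterating on the semantic definitions.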
Of course many vendors suggest using a large language model (LLM) as a one-size-fits-all (OSFA) "black box" approach for extracting entities and generating an entire graph automagically.
However, the business process of resolution -- for both entities and relations -- requires judgements. If the entities getting resolved are low-risk, low-effort in nature, then yeah knock yourself out. If the entities represent people or organizations, these have agency and may take actions when misrepresented in applications which have consequences.
Whenever judgements get delegated to model-based approaches, generalization becomes a form of reasoning employed. When the technology within the model is based on loss functions, then generalization becomes dominant -- regardless of any marketing claims about "AI reasoning" made by tech firms.
Fortunately, decisions can be made without models, even in AI applications. Shock, horror!!! Please, say it isn't so!?! Brace yourselves, using models is a thing, but not the only thing. For more detailed discussion, see:
- Part 1: Let's talk about "Today's AI" https://www.linkedin.com/pulse/lets-talk-todays-ai-paco-nathan-co60c/
- Part 2: Let's talk about "Resolution" https://www.linkedin.com/pulse/lets-talk-resolution-paco-nathan-ryjhc/
Also keep in mind that black box approaches don't work especially well for regulated environments, where audits, explanations, evidence, data provenance, etc., are required.
Moreover, KGs used in mission-critical apps, such as investigations, generally require periodic data updates, so construction isn't a one-step process. By producing a KG based on the approach sketched above, updates can be handled more effectively. Any downstream use cases, such as AI applications, also benefit from improved quality of semantics and representation.
Experiment: Relation Extraction library evals
Current Python libraries for relation extraction (RE) are probably best characterized as "experimental research projects".
Their tokenization approaches tend to make the mistake of "throwing the baby out with the bath water" by not leveraging other available information, e.g., what we have in the textgraph representation of the parsed documents. Also, they tend to ignore the semantic constraints of the domain context, while computationally boiling the ocean.
RE libraries which have been evaluated:
- GLiREL: https://github.com/jackboyla/GLiREL
- ReLIK: https://github.com/SapienzaNLP/relik
- OpenNRE: https://github.com/thunlp/OpenNRE
- mREBEL: https://github.com/Babelscape/rebel
This project had used GLiREL although its results were quite sparse.
RE will be replaced by BAML or DSPy workflows in the near future.
There is some experimental code which illustrates OpenNRE evaluation.
Use the archive/nre.sh script to load OpenNRE pre-trained models
before running the archive/opennre.ipynb notebook.
This may not work in many environments, depending on how well the
OpenNRE library is being maintained.
Tutorial notebooks
There is a collection of Jupyter notebooks (now archived) which were used to prototype code. These help illustrate important intermediate steps within these workflows:
.venv/bin/jupyter-lab
- Part 1: archive/construct.ipynb -- detailed KG construction using a lexical graph
- Part 2: archive/chunk.ipynb -- simple example of how to scrape and chunk text
- Part 3: archive/vector.ipynb -- query LanceDB table for text chunk embeddings (after running build.py)
- Part 4: archive/embed.ipynb -- query the entity embedding model (after running build.py)
Developer notes
After each BAML release update, some committer needs to regenerate
its Python client source:
poetry run baml-cli generate --from strwythura/baml_src
Kudos to @prrao87, @hellovai, @louisguitton, @cj2001
FAQ
Q: "Have you tried this with langextract yet?"
A: "I'll take 'How does an instructor know a student ignored the README?' from the FAFO category, for $200" ... but yes of course, it's an interesting package, building on other interesting work used here.
Q: "What the hell is the name of this repo about?"
A: "As you may have noticed, many open source projects by Derwen are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word strwythura translates as the verb 'structure' in English."
Q: "Why aren't you using an LLM instead to build the graph?"
A: "I promise to visit you in jail."
License and Copyright
Source code, documentation, and examples have an MIT license, which is succinct and simplifies use in commercial applications.
All materials herein are Copyright © 2024-2025 Senzing, Inc.