Construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain.
Strwythura
Strwythura library/tutorial, based on a presentation about GraphRAG for GraphGeeks on 2024-08-14
Overview
How to construct a knowledge graph (KG) from unstructured data sources using state of the art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.
- videos: https://youtu.be/B6_NfvQL-BE, https://senzing.com/gph-graph-rag-llm-knowledge-graphs/
- slides: https://derwen.ai/s/2njz#1
Motivation for this tutorial comes from the stark fact that the term "GraphRAG" means many things, based on multiple conflicting definitions. Several popular implementations reveal a relatively cursory understanding about either natural language processing (NLP) or graph algorithms, plus a vendor bias toward their own query language.
See this article for more details and history: "Unbundling the Graph in GraphRAG".
Instead of delegating KG construction to a large language model
(LLM), this tutorial shows the use of sophisticated NLP pipelines
based on spaCy, GLiNER, TextRank, and related libraries.
Results are better/faster/cheaper, plus this provides more control
and oversight for intentional arrangement of the KG. Then for
downstream usage in a question/answer chat bot, an enhanced GraphRAG
approach leverages graph algorithms (e.g., semantic random walk)
to optimize retrieval of text chunks which ultimately get presented
to an LLM for summarization to produce responses.
For more detailed discussions, see:
- enhanced GraphRAG: "GraphRAG to enhance LLM-based apps"
- ontology pipeline: "Intentional Arrangement" by Jessica Talisman
- spaCy: https://spacy.io/
- GLiNER: https://huggingface.co/urchade/gliner_base
- TextRank: https://www.derwen.ai/docs/ptr/explain_algo/
Some key issues regarding KG construction with LLMs which don't get addressed much by the graph community and AI community in general:
- LLMs tend to mangle cross-domain semantics when used for building graphs; see Mai2024 referenced in the "GraphRAG to enhance LLM-based apps" talk above.
- You need to introduce a semantic layer for representing the domain context, which follows more of a neurosymbolic AI approach.
- Most all LLMs perform question rewriting in ways which cannot be disabled, even when the `temperature` parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
- Any model used for prediction introduces reasoning based on generalization, even more so when the model uses a loss function for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
- The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.
Of course, YMMV.
This approach leverages neurosymbolic AI methods, combining practices from:
- natural language processing
- graph data science
- entity resolution
- ontology pipeline
- context engineering
- human-in-the-loop
Overall, this illustrates a reference implementation for entity-resolved retrieval-augmented generation (ER-RAG).
Usage in applications
This runs with Python 3.11, though the range of versions may be extended soon.
To pip install from PyPI:
python3 -m pip install strwythura
python3 -m spacy download en_core_web_md
Then to integrate this library within an application:
- Copy settings in `config.toml` into a custom configuration file.
- Subclass `DomainContext` to extend it for the use case.
- Define semantics in `domain.ttl` for the domain context.
- Run entity resolution on your structured data.
- Run `Ollama`, with the Gemma3 LLM already downloaded as described below.
- Instantiate new `DomainContext`, `Strwythura`, `VisHTML`, and `GraphRAG` objects or their subclassed extensions.
- ...
- Profit
Follow the patterns in the build.py and errag.py example scripts.
If you're working with documents in a language other than English, well first that's absolutely fantastic, though next you need to:
- Update model settings in the `config.toml` file.
- Change the `spaCy` model downloaded here.
- Also change the language tags used in `domain.ttl` as needed.
Set up for demo or development
This library uses poetry for
package management, and first you need to install it. Then run:
poetry update
poetry run python3 -m spacy download en_core_web_md
Demo Part 1: Entity Resolution
Run entity resolution (ER) to produce entities and relations from structured data sources, which tend to be more reliable than those extracted from unstructured content.
What does this ER step buy us? ER allows us to merge multiple structured data sets, even without consistent foreign keys being available, producing an overlay of entities and relations among them -- which is useful as a "backbone" for constructing a KG. Moreover, when judgements are being made from the KG about people or organizations, ER provides accountability for the merge decisions.
This is especially important in public sector, healthcare, banking, insurance -- i.e., in use cases where you might need to "send flowers" when automated judgements go wrong. For example, someone gets denied a loan, has a medical insurance claim blocked, gets a tax audit, has their voter registration voided, becomes the subject of an arrest warrant, and so on. In other words, people and organizations tend to take legal actions when someone else causes them harm. You'll want an audit trail of decisions based on evidence, when your software systems are making these kinds of judgements.
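To make the idea of "merging without foreign keys, with an audit trail" concrete, here is a deliberately naive toy in plain Python. It is not how Senzing performs ER: the `Record` fields, the name-plus-postcode matching rule, and the sample records are all hypothetical, chosen only to show that each merge decision can be logged alongside the evidence that justified it.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    source: str      # data source name, e.g. "ACME_BIZ" (hypothetical sample)
    record_id: str   # identifier local to that source, not shared across sources
    name: str
    postcode: str


@dataclass
class Entity:
    records: list[Record] = field(default_factory=list)
    audit: list[str] = field(default_factory=list)


def match_key(rec: Record) -> tuple[str, str]:
    """Toy matching rule: alphanumeric case-folded name, plus compacted postcode."""
    name = "".join(ch for ch in rec.name.casefold() if ch.isalnum())
    return name, rec.postcode.replace(" ", "").upper()


def resolve(records: list[Record]) -> list[Entity]:
    """Group records that share a match key, logging the evidence for each merge."""
    entities: dict[tuple[str, str], Entity] = {}
    for rec in records:
        key = match_key(rec)
        ent = entities.setdefault(key, Entity())
        if ent.records:
            # a merge is happening: record why, so it can be reviewed later
            ent.audit.append(
                f"merged {rec.source}:{rec.record_id} on name+postcode {key}"
            )
        ent.records.append(rec)
    return list(entities.values())


# Hypothetical records from two sources with no common key.
records = [
    Record("ACME_BIZ", "a1", "Foo Widgets Ltd.", "SW1A 1AA"),
    Record("CORP_HOME", "c7", "FOO WIDGETS LTD", "sw1a1aa"),
    Record("CORP_HOME", "c9", "Bar Holdings", "EC1A 1BB"),
]
entities = resolve(records)
```

A production ER engine weighs many more features and handles ambiguity far more carefully; the point of the sketch is only that the merge log travels with the resolved entity.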
For the domain context in this tutorial, say we have two hypothetical datasets which provide business directory listings:
- `sz_er/acme_biz.json` -- "ACME Business Directory"
- `sz_er/corp_home.json` -- "Corporates Home UK"
Plus we have slices from datasets which provide listings about researchers and scientific authors:
- `sz_er/orcid.json` -- ORCID
- `sz_er/scopus.json` -- Scopus
These four datasets can be merged using ER, with the results being a domain-specific thesaurus that generates graph elements: entities, relations, properties. We'll blend this into our semantic layer used for organizing the KG later.
The following steps are optional, since these ER results have already
been pre-computed and provided in the sz_er/export.json file.
If you'd like to run Senzing
to reproduce these ER results, use the following steps -- otherwise
continue to the "Part 2" of this tutorial.
Senzing SDK runs in Python or Java, though ER can also be run in batch with a container from DockerHub:
docker pull senzing/demo-senzing
Once this container is available, run:
docker run -it --rm --volume ./sz_er:/tmp/data senzing/demo-senzing
This brings up a Linux command line prompt `I have no name!` and the
local subdirectory `sz_er` will be mapped to the `/tmp/data` directory.
Type the following commands for batch ER into the command line prompt.
First, set up the Senzing configuration for merging these datasets:
G2ConfigTool.py
Within the configuration tool, register the names of the data sources being used:
addDataSource ACME_BIZ
addDataSource CORP_HOME
addDataSource ORCID
addDataSource SCOPUS
save
exit
Load each file and run ER on its data records:
G2Loader.py -f /tmp/data/acme_biz.json
G2Loader.py -f /tmp/data/corp_home.json
G2Loader.py -f /tmp/data/orcid.json
G2Loader.py -f /tmp/data/scopus.json
Export the ER results to the sz_er/export.json file, then exit the
container:
G2Export.py -F JSON -o /tmp/data/export.json
exit
WIP:
Finally, run the parser.py script to represent the Senzing ER
results as a SKOS-based thesaurus:
pushd sz_er
poetry run python3 parser.py
popd
This produces the sz_er/er.ttl file (RDF in "Turtle" format) which
gets used in the next part of the demo to augment the semantic layer.
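As a rough illustration of what "ER results as a SKOS-based thesaurus" can look like, the sketch below turns resolved entities into Turtle text. It is not the logic of `parser.py`: the input record shape (`entity_id`, `names`), the `https://example.org/er/` base IRI, and the choice of the first name as preferred label are assumptions made for this example, and the real Senzing export has a richer structure.

```python
def escape(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_skos(entities: list[dict], base: str = "https://example.org/er/") -> str:
    """Render each resolved entity as a skos:Concept with labels.

    Assumes each entity is a dict with `entity_id` and a non-empty
    `names` list (a hypothetical, simplified shape).
    """
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
    ]
    for ent in entities:
        names = ent["names"]
        lines.append(f"<{base}{ent['entity_id']}> a skos:Concept ;")
        # first name becomes the preferred label, the rest become synonyms
        props = [f'    skos:prefLabel "{escape(names[0])}"@en']
        props += [f'    skos:altLabel "{escape(n)}"@en' for n in names[1:]]
        lines.append(" ;\n".join(props) + " .")
        lines.append("")
    return "\n".join(lines)


ttl = to_skos([
    {"entity_id": 1, "names": ["Foo Widgets Ltd", "FOO WIDGETS LTD"]},
    {"entity_id": 2, "names": ["Bar Holdings"]},
])
```

Name variants merged by ER map naturally onto `skos:altLabel`, which is what lets the thesaurus later recognize different surface forms of the same entity in text.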
Demo Part 2: Build Assets
Given as input:
- `domain.ttl` -- semantics for the domain context
- `sz_er/er.ttl` -- a domain-specific thesaurus based on entity resolution
- a list of URLs from which to scrape content
The domain.ttl file provides a basis for iterating with an ontology
pipeline process, to represent the semantics for the given domain.
It specifies metadata in terms of vocabulary, taxonomy, and
thesaurus -- to use in representing the core entities and relations
in the KG.
The curate.py script described below then will introduce the
human-in-the-loop part of this process, where you can review
entities extracted from documents. Based on this analysis, decide
where to refine the domain context to be able to extract,
classify, and connect more of what gets extracted from
unstructured data sources and linked into the KG. Overall, this
process distills elements of the lexical graph, linking them with
elements from the data graph, to produce a more abstracted (i.e.,
less noisy) semantic layer as the resulting KG.
Meanwhile, let's get started. The build.py script scrapes text
sources and constructs a knowledge graph plus entity embeddings,
with nodes linked to chunks in a vector store:
poetry run python3 build.py
Demo data used in this case includes articles about the linkage between eating processed red meat frequently and the risks of dementia later in life, based on long-term studies.
The approach in this tutorial iterates through multiple steps to produce the assets needed for GraphRAG downstream:
- Scrape each URL using `requests` and `BeautifulSoup`
- Split the text into chunks
- Build vector embeddings for each chunk, in `LanceDB`
- Parse each text chunk using `spaCy`, iterating per sentence
- Extract entities from each sentence using `GLiNER`
- Build a lexical graph from the parse trees in `NetworkX`
- Run a textrank algorithm to rank important entities
- Build an embedding model for entities using `gensim.Word2Vec`
- Generate an interactive visualization using `PyVis`
Note: processing may take a few extra minutes the first time it runs
since PyTorch must download a large (~2GB) file.
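To give a feel for the chunking, lexical graph, and ranking steps listed above, here is a minimal standard-library sketch. It is a simplification rather than what `build.py` does: a regex tokenizer and a small stopword list stand in for `spaCy` parsing, a co-occurrence window stands in for parse-tree edges, and a hand-written PageRank loop stands in for the TextRank implementation. All function names and the sample text are hypothetical.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "between", "later", "with", "for", "from", "that"}

SAMPLE = (
    "Articles report a link between eating processed red meat frequently "
    "and the risk of dementia. Long-term studies examine processed meat "
    "and dementia risk later in life."
)


def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Greedily pack whole sentences into chunks of at most ~max_chars."""
    chunks, current = [], ""
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks


def build_graph(chunks: list[str], window: int = 3) -> dict[str, dict[str, float]]:
    """Undirected co-occurrence graph: words within `window` tokens get an edge."""
    graph: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for chunk in chunks:
        tokens = [
            tok for tok in re.findall(r"[a-z]+", chunk.lower())
            if len(tok) > 2 and tok not in STOPWORDS
        ]
        for i, tok in enumerate(tokens):
            for other in tokens[i + 1 : i + window]:
                if other != tok:
                    graph[tok][other] += 1.0
                    graph[other][tok] += 1.0
    return graph


def textrank(graph, damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Weighted PageRank by power iteration over the co-occurrence graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # each neighbor passes on rank in proportion to the edge weight
            incoming = sum(
                rank[nbr] * weight / sum(graph[nbr].values())
                for nbr, weight in graph[node].items()
            )
            updated[node] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank


chunks = chunk_text(SAMPLE)
ranks = textrank(build_graph(chunks))
top_terms = sorted(ranks, key=ranks.get, reverse=True)[:5]
```

In the full pipeline the graph nodes are lemmatized spans and entities rather than raw words, which is what makes the resulting ranks useful for deciding which entities to promote into the KG.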
The assets get serialized into these files:
- `data/lancedb` -- vector database tables in `LanceDB`
- `data/kg.json` -- serialization of `NetworkX` graph
- `data/sem.csv` -- entity semantics from `curate.py`
- `data/entity.w2v` -- entity embeddings in `Gensim`
- `data/url_cache.sqlite` -- URL cache in `SQLite`
- `kg.html` -- interactive graph visualization in `PyVis`
Demo Part 3: Enhanced GraphRAG chat bot
A good downstream use case for exploring a newly constructed KG is GraphRAG, used for grounding the responses by an LLM in a question/answer chat.
This implementation uses BAML https://docs.boundaryml.com/home
and leverages the KG using semantic random walks.
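The term "semantic random walk" is not spelled out here, so the following is only one plausible reading, sketched with the standard library: start from entities mentioned in the question, take a random walk with restart over the KG, and favor text chunks linked to the most-visited entities. The graph, the entity-to-chunk mapping, and every function name are hypothetical; the actual implementation in this library may differ, e.g., by weighting edges with embedding similarity.

```python
import random
from collections import Counter


def random_walk_rank(graph, seeds, steps=2000, restart=0.15, rng=None):
    """Count node visits for a random walk that restarts at the seed entities."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    visits = Counter()
    node = rng.choice(seeds)
    for _ in range(steps):
        visits[node] += 1
        neighbors = graph.get(node, [])
        if not neighbors or rng.random() < restart:
            node = rng.choice(seeds)      # jump back to a question entity
        else:
            node = rng.choice(neighbors)  # follow an edge in the KG
    return visits


def rank_chunks(visits, entity_chunks, top_k=2):
    """Score each chunk by the visit counts of the entities linked to it."""
    scores = Counter()
    for entity, count in visits.items():
        for chunk_id in entity_chunks.get(entity, []):
            scores[chunk_id] += count
    return [chunk_id for chunk_id, _ in scores.most_common(top_k)]


# Hypothetical KG as adjacency lists, plus entity -> chunk links.
graph = {
    "dementia": ["red meat", "cognitive decline"],
    "red meat": ["dementia", "processed food"],
    "cognitive decline": ["dementia"],
    "processed food": ["red meat"],
    "exercise": ["sleep"],
    "sleep": ["exercise"],
}
entity_chunks = {"dementia": [0, 1], "red meat": [1], "exercise": [2]}

visits = random_walk_rank(graph, seeds=["dementia"])
top_chunks = rank_chunks(visits, entity_chunks)
```

Chunks reachable only through entities far from the question (here, the `exercise` cluster) never get visited, which is the property that helps keep unrelated text out of the prompt.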
To set up, first download/install Ollama https://ollama.com/
and pull the Gemma3 model https://huggingface.co/google/gemma-3-12b-it
ollama pull gemma3:12b
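Before running the chat bot, it can help to confirm that the local Ollama server responds with the pulled model. The sketch below uses Ollama's REST endpoint `/api/generate` on its default port 11434 through the standard library; this is only a smoke test, separate from the BAML-based calls this library makes, and the helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(prompt: str, model: str = "gemma3:12b",
                  url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, once `ollama serve` is running and the model has been pulled:
#   print(ask("Reply with the single word: ready"))
```

If the call raises a connection error, the server is not running; if it reports an unknown model, the `ollama pull` step above has not completed.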
Then run the errag.py script for an interactive GraphRAG example:
poetry run python3 errag.py
Demo Part 4: Curating an Ontology Pipeline
This code uses a semantic layer -- in other words, a "backbone" for the KG -- to organize the entities and relations which get abstracted from the lexical graph.
For now, run the curate.py script to generate a view of the ranked
NER results, serialized as the data/sem.csv file. This can be
viewed in a spreadsheet to understand how to iterate on the semantic
definitions for more effective graph organization in the domain of the
scraped documents.
poetry run python3 curate.py
Unbundling GraphRAG
Objective:
Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.
These steps define a generalized process, where this tutorial picks up at the lexical graph, without the entity linking (EL) part yet:
Semantic layer:
- Load any semantics for domain context from pre-defined controlled vocabularies, taxonomies, thesauri, ontologies, etc., directly into the KG.
Data graph:
- Load the structured data sources or updates into a data graph.
- Perform entity resolution (ER) on PII extracted from the data graph.
- Blend the ER results into the semantic layer as a "backbone" for structuring the KG.
Lexical graph:
- Parse the text chunks, using lemmatization to normalize token spans.
- Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
- Analyze named entity recognition (NER) to extract candidate entities from noun phrase spans.
- Analyze relation extraction (RE) to extract relations between pairwise entities.
- Perform entity linking (EL) leveraging the ER results.
- Promote the extracted entities and relations up to the semantic layer.
Of course many vendors suggest using a large language model (LLM) as a one-size-fits-all (OSFA) "black box" approach for extracting entities and generating an entire graph automagically.
However, the business process of resolution -- for both entities and relations -- requires judgements. If the entities getting resolved are low-risk, low-effort in nature, then yeah knock yourself out. If the entities represent people or organizations, these have agency and may take actions when misrepresented in applications which have consequences.
Whenever judgements get delegated to model-based approaches, generalization becomes a form of reasoning employed. When the technology within the model is based on loss functions, then generalization becomes dominant -- regardless of any marketing claims about "AI reasoning" made by tech firms.
Fortunately, decisions can be made without models, even in AI applications. Shock, horror!!! Please, say it isn't so!?! Brace yourselves, using models is a thing, but not the only thing. For more detailed discussion, see:
- Part 1: Let's talk about "Today's AI" https://www.linkedin.com/pulse/lets-talk-todays-ai-paco-nathan-co60c/
- Part 2: Let's talk about "Resolution" https://www.linkedin.com/pulse/lets-talk-resolution-paco-nathan-ryjhc/
Also keep in mind that black box approaches don't work especially well for regulated environments, where audits, explanations, evidence, data provenance, etc., are required.
Moreover, KGs used in mission-critical apps, such as investigations, generally require periodic data updates, so construction isn't a one-step process. By producing a KG based on the approach sketched above, updates can be handled more effectively. Any downstream use cases, such as AI applications, also benefit from improved quality of semantics and representation.
FAQ
- Q:
- "Have you tried this with `langextract` yet?"
- A:
- "I'll take `How does an instructor know a student ignored the README?` from the What is FAFO? category, for $200 ... but yes of course, it's an interesting package, which builds on other interesting work used here. Except that key parts of it miss the point entirely, in ways that only a hyperscaler could possibly fuck up so badly."
- Q:
- "What the hell is the name of this repo about?"
- A:
- "As you may have noticed, many open source projects published in this GitHub organization are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word `strwythura` translates as the verb structure in English."
- Q:
- "Why aren't you using an LLM to build the graph instead?"
- A:
- "I promise to visit you in jail."
- Q:
- "Um, yeah, like, didn't Karpathy say to use vibe coding, or something? #justsayin"
- A:
- "Piss the eff off tech bro. Srsly, like yesterday -- you're embarrassing our entire industry with your overly exuberant ignorance."
Developer Notes
After each BAML release update, some committer needs to regenerate
its Python client source:
poetry run baml-cli generate --from strwythura/baml_src
Experimental: Relation Extraction evaluation
Current Python libraries for relation extraction (RE) are probably best characterized as "experimental research projects".
Their tokenization approaches tend to make the mistake of "throwing the baby out with the bath water" by not leveraging other available information, e.g., what we have in the textgraph representation of the parsed documents. Also, they tend to ignore the semantic constraints of the domain context, while computationally boiling the ocean.
RE libraries which have been evaluated:
- GLiREL: https://github.com/jackboyla/GLiREL
- ReLIK: https://github.com/SapienzaNLP/relik
- OpenNRE: https://github.com/thunlp/OpenNRE
- mREBEL: https://github.com/Babelscape/rebel
This project had used GLiREL although its results were quite sparse.
RE will be replaced by BAML or DSPy workflows in the near future.
There is some experimental code which illustrates OpenNRE evaluation.
Use the archive/nre.sh script to load OpenNRE pre-trained models
before running the archive/opennre.ipynb notebook.
This may not work in many environments, depending on how well the
OpenNRE library is being maintained.
Experimental: Tutorial notebooks
A collection of Jupyter notebooks were used to prototype code. These help illustrate important intermediate steps within these workflows:
.venv/bin/jupyter-lab
- `archive/construct.ipynb` -- detailed KG construction using a lexical graph
- `archive/chunk.ipynb` -- simple example of how to scrape and chunk text
- `archive/vector.ipynb` -- query LanceDB table for text chunk embeddings (after running `build.py`)
- `archive/embed.ipynb` -- query the entity embedding model (after running `build.py`)
These are now archived, though kept available for study.
License and Copyright
Source code for Strwythura plus its logo, documentation, and examples have an MIT license which is succinct and simplifies use in commercial applications.
All materials herein are Copyright © 2024-2025 Senzing, Inc.
Kudos and Attribution
Please use the following BibTeX entry for citing Strwythura if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.
@software{strwythura,
author = {Paco Nathan},
title = {{Strwythura: construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain}},
year = 2024,
publisher = {Senzing},
doi = {10.5281/zenodo.16934079},
url = {https://github.com/DerwenAI/strwythura}
}
Kudos to
@louisguitton,
@cj2001,
@prrao87,
@hellovai,
@docktermj,
@jbutcher21,
and the kind folks at GraphGeeks for their support.