
Construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain.


Strwythura


Strwythura library/tutorial, based on a presentation about GraphRAG for GraphGeeks on 2024-08-14

Overview

How to construct a knowledge graph (KG) from unstructured data sources using state of the art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.

Motivation for this tutorial comes from the stark fact that the term "GraphRAG" means many things, based on multiple conflicting definitions. Several popular implementations reveal a relatively cursory understanding about either natural language processing (NLP) or graph algorithms, plus a vendor bias toward their own query language.

See this article for more details and history: "Unbundling the Graph in GraphRAG".

Instead of delegating KG construction to a large language model (LLM), this tutorial shows the use of sophisticated NLP pipelines based on spaCy, GLiNER, TextRank, and related libraries. Results are better/faster/cheaper, plus this provides more control and oversight for intentional arrangement of the KG. Then for downstream usage in a question/answer chat bot, an enhanced GraphRAG approach leverages graph algorithms (e.g., semantic random walk) to optimize retrieval of text chunks which ultimately get presented to an LLM for summarization to produce responses.

For more detailed discussions, see:

Some key issues regarding KG construction with LLMs which don't get addressed much by the graph community and AI community in general:

  1. LLMs tend to mangle cross-domain semantics when used for building graphs; see Mai2024 referenced in the "GraphRAG to enhance LLM-based apps" talk above.
  2. You need to introduce a semantic layer for representing the domain context, which follows more of a neurosymbolic AI approach.
  3. Almost all LLMs perform question rewriting in ways which cannot be disabled, even when the temperature parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
  4. Any model used for prediction introduces reasoning based on generalization, even more so when the model uses a loss function for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
  5. The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.

Of course, YMMV.

This approach leverages neurosymbolic AI methods, combining practices from:

  • natural language processing
  • graph data science
  • entity resolution
  • ontology pipeline
  • context engineering
  • human-in-the-loop

Overall, this illustrates a reference implementation for entity-resolved retrieval-augmented generation (ER-RAG).

Usage in applications

This runs with Python 3.11, though the range of versions may be extended soon.

To pip install from PyPI:

python3 -m pip install strwythura
python3 -m spacy download en_core_web_md

Then to integrate this library within an application:

  1. Copy settings in config.toml into a custom configuration file.
  2. Subclass DomainContext to extend it for the use case.
  3. Define semantics in domain.ttl for the domain context.
  4. Run entity resolution on your structured data.
  5. Run Ollama, having already downloaded the Gemma3 LLM as described below.
  6. Instantiate new DomainContext, Strwythura, VisHTML, and GraphRAG objects or their subclassed extensions.
  7. ...
  8. Profit

Follow the patterns in the build.py and errag.py example scripts.

If you're working with documents in a language other than English, well first that's absolutely fantastic, though next you need to:

  • Update model settings in the config.toml file.
  • Change the spaCy model downloaded here.
  • Also change the language tags used in domain.ttl as needed.

Set up for demo or development

This library uses Poetry for package management, so install Poetry first. Then run:

poetry update
poetry run python3 -m spacy download en_core_web_md

Demo Part 1: Entity Resolution

Run entity resolution (ER) to produce entities and relations from structured data sources, which tend to be more reliable than those extracted from unstructured content.

What does this ER step buy us? ER allows us to merge multiple structured data sets, even without consistent foreign keys being available, producing an overlay of entities and relations among them -- which is useful as a "backbone" for constructing a KG. Moreover, when judgements are being made from the KG about people or organizations, ER provides accountability for the merge decisions.

This is especially important in public sector, healthcare, banking, insurance -- i.e., in use cases where you might need to "send flowers" when automated judgements go wrong. For example, someone gets denied a loan, has a medical insurance claim blocked, gets a tax audit, has their voter registration voided, becomes the subject of an arrest warrant, and so on. In other words, people and organizations tend to take legal actions when someone else causes them harm. You'll want an audit trail of decisions based on evidence, when your software systems are making these kinds of judgements.

For the domain context in this tutorial, say we have two hypothetical datasets which provide business directory listings:

  • sz_er/acme_biz.json -- "ACME Business Directory"
  • sz_er/corp_home.json -- "Corporates Home UK"

Plus we have slices from datasets which provide listings about researchers and scientific authors:

  • sz_er/orcid.json -- ORCID
  • sz_er/scopus.json -- Scopus

These four datasets can be merged using ER, with the results being a domain-specific thesaurus that generates graph elements: entities, relations, properties. We'll blend this into our semantic layer used for organizing the KG later.
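To make the idea of merging records without shared keys more concrete, here is a minimal sketch in plain Python. It is not how Senzing works internally: the record fields (`id`, `name`), the sample records, the suffix list, and the similarity threshold are all hypothetical, and a real ER engine weighs many more features than names.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of company suffixes to ignore when comparing names.
SUFFIXES = {"ltd", "limited", "inc", "llc", "co"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop common suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(tok for tok in cleaned.split() if tok not in SUFFIXES)


def resolve(records: list[dict], threshold: float = 0.85) -> list[dict]:
    """Greedy clustering: a record joins the first entity with a similar name."""
    entities: list[dict] = []
    for rec in records:
        key = normalize(rec["name"])
        for ent in entities:
            score = SequenceMatcher(None, key, ent["key"]).ratio()
            if score >= threshold:
                ent["records"].append(rec["id"])
                # Keep the score as evidence for the merge decision.
                ent["evidence"].append((rec["id"], round(score, 2)))
                break
        else:
            # No existing entity was similar enough, so start a new one.
            entities.append({"key": key, "records": [rec["id"]], "evidence": []})
    return entities


# Made-up records from two of the data sources, with no shared keys.
records = [
    {"id": "ACME_BIZ:1", "name": "Foo Widgets Ltd."},
    {"id": "CORP_HOME:77", "name": "FOO WIDGETS LIMITED"},
    {"id": "ACME_BIZ:2", "name": "Bar Analytics Inc"},
]
entities = resolve(records)
```

In this toy run the two "Foo Widgets" listings land in one entity while "Bar Analytics" stays separate, and the `evidence` list is a stand-in for the audit trail discussed above.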

The following steps are optional, since these ER results have already been pre-computed and provided in the sz_er/export.json file. If you'd like to run Senzing to reproduce these ER results, use the following steps -- otherwise continue to "Part 2" of this tutorial.

Senzing SDK runs in Python or Java, though ER can also be run in batch with a container from DockerHub:

docker pull senzing/demo-senzing

Once this container is available, run:

docker run -it --rm --volume ./sz_er:/tmp/data senzing/demo-senzing

This brings up a Linux command line prompt "I have no name!" and the local subdirectory sz_er will be mapped to the /tmp/data directory. Type the following commands for batch ER into the command line prompt.

First, set up the Senzing configuration for merging these datasets:

G2ConfigTool.py

Within the configuration tool, register the names of the data sources being used:

addDataSource ACME_BIZ
addDataSource CORP_HOME
addDataSource ORCID
addDataSource SCOPUS
save
exit

Load each file and run ER on its data records:

G2Loader.py -f /tmp/data/acme_biz.json
G2Loader.py -f /tmp/data/corp_home.json
G2Loader.py -f /tmp/data/orcid.json
G2Loader.py -f /tmp/data/scopus.json

Export the ER results to the sz_er/export.json file, then exit the container:

G2Export.py -F JSON -o /tmp/data/export.json
exit

WIP:

Finally, run the parser.py script to represent the Senzing ER results as a SKOS-based thesaurus:

pushd sz_er
poetry run python3 parser.py
popd

This produces the sz_er/er.ttl file (RDF in "Turtle" format) which gets used in the next part of the demo to augment the semantic layer.
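For readers unfamiliar with SKOS, the following sketch shows roughly what "a SKOS-based thesaurus" means for a resolved entity. It does not reproduce parser.py: the input dictionary shape (`entity_id`, `names`), the `ex:` namespace, and the `@en` language tag are assumptions made for illustration, and the Turtle is assembled by hand rather than with an RDF library.

```python
import json

# Prefix declarations; the `ex:` namespace is a made-up placeholder.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/er/> .\n"
)


def entity_to_turtle(entity: dict) -> str:
    """Render one resolved entity as a skos:Concept in Turtle syntax."""
    preferred, *alternates = entity["names"]
    # json.dumps gives a double-quoted, escaped string usable as a literal.
    pairs = [
        ("a", "skos:Concept"),
        ("skos:prefLabel", f"{json.dumps(preferred)}@en"),
    ]
    pairs += [("skos:altLabel", f"{json.dumps(alt)}@en") for alt in alternates]
    body = " ;\n    ".join(f"{pred} {obj}" for pred, obj in pairs)
    return f"ex:entity_{entity['entity_id']} {body} ."


# Hypothetical resolved entity: the first name is treated as preferred.
entity = {"entity_id": 1, "names": ["Foo Widgets Ltd", "FOO WIDGETS LIMITED"]}
ttl = entity_to_turtle(entity)
document = PREFIXES + "\n" + ttl + "\n"
```

The preferred label and alternate labels are what later let extracted text spans be matched back to a resolved entity.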

Demo Part 2: Build Assets

Given as input:

  • domain.ttl -- semantics for the domain context
  • sz_er/er.ttl -- a domain-specific thesaurus based on entity resolution
  • a list of URLs from which to scrape content

The domain.ttl file provides a basis for iterating with an ontology pipeline process, to represent the semantics for the given domain. It specifies metadata in terms of vocabulary, taxonomy, and thesaurus -- to use in representing the core entities and relations in the KG.

The curate.py script described below then will introduce the human-in-the-loop part of this process, where you can review entities extracted from documents. Based on this analysis, decide where to refine the domain context to be able to extract, classify, and connect more of what gets extracted from unstructured data sources and linked into the KG. Overall, this process distills elements of the lexical graph, linking them with elements from the data graph, to produce a more abstracted (i.e., less noisy) semantic layer as the resulting KG.

Meanwhile, let's get started. The build.py script scrapes text sources and constructs a knowledge graph plus entity embeddings, with nodes linked to chunks in a vector store:

poetry run python3 build.py

Demo data used in this case includes articles about the linkage between eating processed red meat frequently and the risks of dementia later in life, based on long-term studies.

The approach in this tutorial iterates through multiple steps to produce the assets needed for GraphRAG downstream:

  1. Scrape each URL using requests and BeautifulSoup
  2. Split the text into chunks
  3. Build vector embeddings for each chunk, in LanceDB
  4. Parse each text chunk using spaCy, iterating per sentence
  5. Extract entities from each sentence using GLiNER
  6. Build a lexical graph from the parse trees in NetworkX
  7. Run a TextRank algorithm to rank important entities
  8. Build an embedding model for entities using gensim.Word2Vec
  9. Generate an interactive visualization using PyVis

Note: processing may take a few extra minutes the first time it runs since PyTorch must download a large (~2GB) file.

The assets get serialized into these files:

  • data/lancedb -- vector database tables in LanceDB
  • data/kg.json -- serialization of NetworkX graph
  • data/sem.csv -- entity semantics from curate.py
  • data/entity.w2v -- entity embeddings in Gensim
  • data/url_cache.sqlite -- URL cache in SQLite
  • kg.html -- interactive graph visualization in PyVis

Demo Part 3: Enhanced GraphRAG chat bot

A good downstream use case for exploring a newly constructed KG is GraphRAG, used for grounding the responses by an LLM in a question/answer chat.

This implementation uses BAML https://docs.boundaryml.com/home and leverages the KG using semantic random walks.
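The details of the library's semantic random walk are not spelled out here, so the following is only a generic sketch of the idea: a random walk with restart over the entity graph, starting from entities found in the question, with visit counts used to score the text chunks linked to each entity. The graph, the entity-to-chunk mapping, and all parameter values are made up for illustration.

```python
import random
from collections import Counter


def random_walk_scores(
    graph: dict[str, list[str]],
    seeds: list[str],
    steps: int = 2000,
    restart: float = 0.2,
    seed: int = 42,
) -> Counter:
    """Count node visits for a random walk which restarts at the seed entities."""
    rng = random.Random(seed)
    visits: Counter = Counter()
    node = rng.choice(seeds)
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < restart or not graph[node]:
            node = rng.choice(seeds)  # jump back to a question entity
        else:
            node = rng.choice(graph[node])  # follow an edge in the KG
    return visits


def rank_chunks(
    visits: Counter, entity_chunks: dict[str, list[str]], top_k: int = 2
) -> list[str]:
    """Score each chunk by the visit counts of the entities it mentions."""
    scores: Counter = Counter()
    for entity, count in visits.items():
        for chunk_id in entity_chunks.get(entity, []):
            scores[chunk_id] += count
    return [chunk_id for chunk_id, _ in scores.most_common(top_k)]


# Hypothetical entity graph (adjacency lists) and entity-to-chunk links.
graph = {
    "dementia": ["processed meat", "cognitive decline"],
    "processed meat": ["dementia", "nitrites"],
    "cognitive decline": ["dementia"],
    "nitrites": ["processed meat"],
    "olive oil": ["mediterranean diet"],
    "mediterranean diet": ["olive oil"],
}
entity_chunks = {
    "dementia": ["chunk-1", "chunk-2"],
    "processed meat": ["chunk-1"],
    "nitrites": ["chunk-3"],
    "olive oil": ["chunk-4"],
}
visits = random_walk_scores(graph, seeds=["processed meat"])
top_chunks = rank_chunks(visits, entity_chunks)
```

Chunks attached to entities near the question in the graph are preferred, and chunks attached only to unconnected entities (here, "olive oil") never get scored at all. The selected chunks are what would then be passed to the LLM for summarization.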

To set up, first download/install Ollama https://ollama.com/ and pull the Gemma3 model https://huggingface.co/google/gemma-3-12b-it

ollama pull gemma3:12b

Then run the errag.py script for an interactive GraphRAG example:

poetry run python3 errag.py

Demo Part 4: Curating an Ontology Pipeline

This code uses a semantic layer -- in other words, a "backbone" for the KG -- to organize the entities and relations which get abstracted from the lexical graph.

For now, run the curate.py script to generate a view of the ranked NER results, serialized as the data/sem.csv file. This can be viewed in a spreadsheet to understand how to iterate on the semantic definitions for more effective graph organization in the domain of the scraped documents.

poetry run python3 curate.py

Unbundling GraphRAG

Objective:

Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.

These steps define a generalized process, where this tutorial picks up at the lexical graph, without the entity linking (EL) part yet:

Semantic layer:

  1. Load any semantics for domain context from pre-defined controlled vocabularies, taxonomies, thesauri, ontologies, etc., directly into the KG.

Data graph:

  1. Load the structured data sources or updates into a data graph.
  2. Perform entity resolution (ER) on PII extracted from the data graph.
  3. Blend the ER results into the semantic layer as a "backbone" for structuring the KG.

Lexical graph:

  1. Parse the text chunks, using lemmatization to normalize token spans.
  2. Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
  3. Analyze named entity recognition (NER) to extract candidate entities from noun phrase spans.
  4. Analyze relation extraction (RE) to extract relations between pairwise entities.
  5. Perform entity linking (EL) leveraging the ER results.
  6. Promote the extracted entities and relations up to the semantic layer.
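As a rough illustration of step 5 in the lexical graph list, entity linking can be thought of as a lookup from normalized text spans to the labels produced by ER. The sketch below is an assumption-laden toy, not the library's implementation: the thesaurus contents and entity identifiers are invented, and a simple lowercase-and-strip-punctuation function stands in for lemmatization.

```python
import re
from typing import Optional


def normalize(span: str) -> str:
    """Stand-in for lemmatization: lowercase and strip punctuation."""
    return " ".join(re.sub(r"[^\w\s]", " ", span.lower()).split())


def build_index(thesaurus: dict[str, list[str]]) -> dict[str, str]:
    """Map every normalized label (preferred or alternate) to its entity ID."""
    index: dict[str, str] = {}
    for entity_id, labels in thesaurus.items():
        for label in labels:
            index[normalize(label)] = entity_id
    return index


def link_entities(spans: list[str], index: dict[str, str]) -> dict[str, Optional[str]]:
    """Link each extracted span to an ER entity, or None when nothing matches."""
    return {span: index.get(normalize(span)) for span in spans}


# Hypothetical ER-derived thesaurus: entity ID -> known labels.
thesaurus = {
    "er:1": ["Foo Widgets Ltd.", "FOO WIDGETS LTD"],
    "er:2": ["J. Smith", "Jane Smith"],
}
index = build_index(thesaurus)
links = link_entities(["foo widgets ltd", "Jane Smith", "Unknown Org"], index)
```

Spans that come back as `None` are candidates for human review: either they are noise, or the domain context needs another label.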

Of course many vendors suggest using a large language model (LLM) as a one-size-fits-all (OSFA) "black box" approach for extracting entities and generating an entire graph automagically.

However, the business process of resolution -- for both entities and relations -- requires judgements. If the entities getting resolved are low-risk, low-effort in nature, then yeah knock yourself out. If the entities represent people or organizations, these have agency and may take actions when misrepresented in applications which have consequences.

Whenever judgements get delegated to model-based approaches, generalization becomes a form of reasoning employed. When the technology within the model is based on loss functions, then generalization becomes dominant -- regardless of any marketing claims about "AI reasoning" made by tech firms.

Fortunately, decisions can be made without models, even in AI applications. Shock, horror!!! Please, say it isn't so!?! Brace yourselves, using models is a thing, but not the only thing. For more detailed discussion, see:

Also keep in mind that black box approaches don't work especially well for regulated environments, where audits, explanations, evidence, data provenance, etc., are required.

Moreover, KGs used in mission-critical apps, such as investigations, generally require periodic data updates, so construction isn't a one-step process. By producing a KG based on the approach sketched above, updates can be handled more effectively. Any downstream use cases, such as AI applications, also benefit from improved quality of semantics and representation.

FAQ

Q:
"Have you tried this with langextract yet?"
A:
"I'll take How does an instructor know a student ignored the README? from the What is FAFO? category, for $200 ... but yes of course, it's an interesting package, which builds on other interesting work used here. Except that key parts of it miss the point entirely, in ways that only a hyperscaler could possibly fuck up so badly."
Q:
"What the hell is the name of this repo about?"
A:
"As you may have noticed, many open source projects published in this GitHub organization are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word strwythura translates as the verb structure in English."
Q:
"Why aren't you using an LLM to build the graph instead?"
A:
"I promise to visit you in jail."
Q:
"Um, yeah, like, didn't Karpathy say to use vibe coding, or something? #justsayin"
A:
"Piss the eff off tech bro. Srsly, like yesterday -- you're embarrassing our entire industry with your overly exuberant ignorance."

Developer Notes

After each BAML release update, some committer needs to regenerate its Python client source:

poetry run baml-cli generate --from strwythura/baml_src

Experimental: Relation Extraction evaluation

Current Python libraries for relation extraction (RE) are probably best characterized as "experimental research projects".

Their tokenization approaches tend to make the mistake of "throwing the baby out with the bath water" by not leveraging other available information, e.g., what we have in the textgraph representation of the parsed documents. Also, they tend to ignore the semantic constraints of the domain context, while computationally boiling the ocean.

RE libraries which have been evaluated:

This project had used GLiREL although its results were quite sparse. RE will be replaced by BAML or DSPy workflows in the near future.

There is some experimental code which illustrates OpenNRE evaluation. Use the archive/nre.sh script to load OpenNRE pre-trained models before running the archive/opennre.ipynb notebook.

This may not work in many environments, depending on how well the OpenNRE library is being maintained.

Experimental: Tutorial notebooks

A collection of Jupyter notebooks was used to prototype code. These help illustrate important intermediate steps within these workflows:

.venv/bin/jupyter-lab

  • archive/construct.ipynb -- detailed KG construction using a lexical graph
  • archive/chunk.ipynb -- simple example of how to scrape and chunk text
  • archive/vector.ipynb -- query LanceDB table for text chunk embeddings (after running build.py)
  • archive/embed.ipynb -- query the entity embedding model (after running build.py)

These are now archived, though kept available for study.

License and Copyright

Source code for Strwythura plus its logo, documentation, and examples have an MIT license which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2024-2025 Senzing, Inc.

Kudos and Attribution

Please use the following BibTeX entry for citing Strwythura if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.

@software{strwythura,
  author = {Paco Nathan},
  title = {{Strwythura: construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain}},
  year = 2024,
  publisher = {Senzing},
  doi = {10.5281/zenodo.16934079},
  url = {https://github.com/DerwenAI/strwythura}
}

Kudos to @louisguitton, @cj2001, @prrao87, @hellovai, @docktermj, @jbutcher21,
and the kind folks at GraphGeeks for their support.

