Construct a knowledge graph (KG) from unstructured data sources using state of the art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.

Strwythura

Strwythura tutorial, based on a presentation about GraphRAG for GraphGeeks on 2024-08-14

How to construct a knowledge graph (KG) from unstructured data sources using state of the art (SOTA) models for named entity recognition (NER), then implement an enhanced GraphRAG approach, and curate semantics for optimizing AI app outcomes within a specific domain.

videos: https://youtu.be/B6_NfvQL-BE, https://senzing.com/gph-graph-rag-llm-knowledge-graphs/
slides: https://derwen.ai/s/2njz#1

Motivation for this tutorial comes from the stark fact that the term "GraphRAG" means many things, based on multiple conflicting definitions. Several popular implementations reveal a relatively cursory understanding about either natural language processing (NLP) or graph algorithms, plus a vendor bias toward their own query language.

See this article for more details and history: "Unbundling the Graph in GraphRAG".

Instead of delegating KG construction to a large language model (LLM), this tutorial shows the use of sophisticated NLP pipelines based on spaCy, GLiNER, TextRank, and related libraries. Results are better/faster/cheaper, plus this provides more control and oversight for intentional arrangement of the KG. Then for downstream usage in a question/answer chat bot, an enhanced GraphRAG approach leverages graph algorithms (e.g., semantic random walk) to optimize retrieval of text chunks which ultimately get presented to an LLM for summarization to produce responses.

For more detailed discussions, see:

enhanced GraphRAG: "GraphRAG to enhance LLM-based apps"
ontology pipeline: "Intentional Arrangement" by Jessica Talisman
spaCy: https://spacy.io/
GLiNER: https://huggingface.co/urchade/gliner_base
TextRank: https://www.derwen.ai/docs/ptr/explain_algo/

A few key issues regarding KG construction with LLMs still have not been addressed by the graph community in general:

LLMs tend to mangle cross-domain semantics when used for building graphs; see Mai2024 referenced in the "GraphRAG to enhance LLM-based apps" talk above.
You need to introduce a semantic layer for representing the domain context, which follows more of a neurosymbolic AI approach.
Nearly all LLMs perform question rewriting in ways which cannot be disabled, even when the temperature parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
Any model used for prediction introduces reasoning based on generalization, even more so when the model uses a loss function for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.

Of course, YMMV.

Overall, this approach leverages neurosymbolic AI methods, combining best practices from:

natural language processing
graph data science
ontology pipeline
context engineering
human-in-the-loop

to illustrate a reference implementation for entity-resolved retrieval-augmented generation (ER-RAG).

Set up

Caveat: this code runs with Python 3.11, though the range of versions may be extended soon.

poetry update
poetry run python3 -m spacy download en_core_web_md

Note: if you're working with text documents in another language, change the spaCy model downloaded here, and also the model setting in the config.toml file. It's a shame this cannot be done programmatically in a more fluent Pythonic way, for a variety of complex reasons.

Usage

Caveat: this repo provides the source code and notebooks which accompany an instructional tutorial; it is not intended as a packaged library or maintained product.

That said, if you want to use this code to build an application it may help to copy settings in config.toml into a custom configuration file, then instantiate new Strwythura and GraphRAG objects using it.

Part 1: Build assets

Given as input:

a list of URLs from which to scrape content
domain.ttl -- semantics for the domain context

Note: the domain.ttl file provides an ontology pipeline for the given domain, used as the human-in-the-loop basis for constructing a semantic layer. Along with the curate.py script described below, this illustrates human-in-the-loop approaches in KG construction.

The build.py script scrapes text sources and constructs a knowledge graph plus entity embeddings, with nodes linked to chunks in a vector store:

poetry run python3 build.py

Demo data used in this case includes articles about the linkage between eating processed red meat frequently and the risks of dementia later in life, based on long-term studies.

The approach in this tutorial iterates through multiple steps to produce the assets needed for GraphRAG downstream:

Scrape each URL using requests and BeautifulSoup
Split the text into chunks
Build vector embeddings for each chunk, in LanceDB
Parse each text chunk using spaCy, iterating per sentence
Extract entities from each sentence using GLiNER
Build a lexical graph from the parse trees in NetworkX
Run a textrank algorithm to rank important entities
Build an embedding model for entities using gensim.Word2Vec
Generate an interactive visualization using PyVis
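To make the middle of that list more concrete, here is a toy sketch of three of the steps (chunking, building a lexical graph, ranking) using only the Python standard library. It is not the code in build.py: regular expressions stand in for spaCy, same-sentence word co-occurrence stands in for parse trees, a plain dict stands in for NetworkX, and a basic PageRank loop stands in for the textrank step. The sample text and stopword list are made up for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text: str, max_sentences: int = 2) -> list[str]:
    """Split text into chunks of at most `max_sentences` sentences."""
    sentences = [s.strip() for s in SENTENCE_BREAK.split(text) if s.strip()]
    return [
        " ".join(sentences[i : i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


def build_cooccurrence_graph(chunks: list[str], stopwords: set[str]) -> dict[str, set[str]]:
    """Link words which share a sentence (a crude stand-in for parse trees)."""
    graph: dict[str, set[str]] = defaultdict(set)
    for chunk in chunks:
        for sentence in SENTENCE_BREAK.split(chunk):
            words = {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in stopwords}
            for a, b in combinations(sorted(words), 2):
                graph[a].add(b)
                graph[b].add(a)
    return dict(graph)


def pagerank(graph: dict[str, set[str]], damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Basic PageRank over an undirected graph in which every node has a neighbor."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Made-up sample text and stopwords, for illustration only.
STOPWORDS = {"is", "to", "and", "for", "may"}
TEXT = (
    "Processed red meat is linked to dementia risk. "
    "Researchers tracked diet and dementia for decades. "
    "Diet changes may lower risk."
)

chunks = split_into_chunks(TEXT, max_sentences=2)
graph = build_cooccurrence_graph(chunks, STOPWORDS)
ranks = pagerank(graph)

# Show the five highest-ranked words.
for word, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{word}: {score:.3f}")
```

Words which bridge several sentences collect more links and therefore rank higher, which is the intuition behind ranking entities in a lexical graph.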

Note: processing may take a few extra minutes the first time it runs since PyTorch must download a large (~2GB) file.

If you look at the performance statistics, it takes almost twice as long to generate an interactive graph visualization as it does to perform everything else.

The assets get serialized into these files:

data/lancedb -- vector database tables in LanceDB
data/kg.json -- serialization of NetworkX graph
data/sem.csv -- entity semantics from curate.py
data/entity.w2v -- entity embeddings in Gensim
data/url_cache.sqlite -- URL cache in SQLite
kg.html -- interactive graph visualization in PyVis

Part 2: GraphRAG chat bot

A good downstream use case for exploring a newly constructed KG is GraphRAG, used for grounding the responses by an LLM in a question/answer chat.

This implementation uses BAML https://docs.boundaryml.com/home and leverages the KG using semantic random walks.

To set up, first download/install Ollama https://ollama.com/ and pull the Gemma3 model https://huggingface.co/google/gemma-3-12b-it

ollama pull gemma3:12b

Then run the errag.py script for an interactive GraphRAG example:

poetry run python3 errag.py
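The semantic random walk mentioned above can be illustrated with a small, hedged sketch. This is not the implementation in errag.py: it is a random walk with restart over a hand-written entity graph, where the edge weights, the entity-to-chunk links, and the function name `random_walk_chunks` are all hypothetical. Entities visited more often lend more weight to the text chunks they link to, and the top-scoring chunks would be the ones handed to the LLM.

```python
import random
from collections import Counter

# Hypothetical toy data: weighted entity graph, plus links from entities to chunk IDs.
GRAPH = {
    "red meat": {"dementia": 0.9, "diet": 0.6},
    "dementia": {"red meat": 0.9, "cognition": 0.8},
    "diet": {"red meat": 0.6, "nuts": 0.5},
    "cognition": {"dementia": 0.8},
    "nuts": {"diet": 0.5},
}
ENTITY_CHUNKS = {
    "red meat": [0, 1],
    "dementia": [1, 2],
    "diet": [3],
    "cognition": [2],
    "nuts": [4],
}


def random_walk_chunks(start, graph, entity_chunks, steps=200, restart=0.2, top_k=3, seed=42):
    """Rank chunk IDs by how often a random walk from `start` visits their entities."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same ranking
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < restart:
            node = start  # jump back to the entity found in the question
            continue
        neighbors = list(graph[node])
        weights = [graph[node][n] for n in neighbors]
        node = rng.choices(neighbors, weights=weights, k=1)[0]

    # Propagate entity visit counts to the chunks linked from each entity.
    scores = Counter()
    for entity, count in visits.items():
        for chunk_id in entity_chunks.get(entity, []):
            scores[chunk_id] += count
    return [chunk_id for chunk_id, _ in scores.most_common(top_k)]


top_chunks = random_walk_chunks("red meat", GRAPH, ENTITY_CHUNKS)
print(top_chunks)
```

The restart probability keeps the walk close to the entities found in the question, so nearby chunks dominate while more distant but related chunks can still surface.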

Part 3: Semantics curation (WIP)

This code uses a semantic layer -- in other words, a "backbone" for the KG -- to organize the entities and relations which get abstracted from the lexical graph.

If you had previously run entity resolution from structured data sources, which tend to be more reliable than unstructured content, this approach could integrate those results as well.

For now, run the curate.py script to generate a view of the ranked NER results, serialized as the data/sem.csv file. This can be viewed in a spreadsheet to understand how to iterate on the semantic definitions for more effective graph organization in the domain of the scraped documents.

poetry run python3 curate.py
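If a spreadsheet is not handy, a few lines of standard-library Python can preview the file instead. This sketch makes no assumption about the columns in data/sem.csv: it reads whatever header is present. The sample rows below are invented stand-ins, and the real columns may differ.

```python
import csv
import io
from itertools import islice


def preview_rows(csv_file, limit=5):
    """Return the header and the first `limit` rows of an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = list(islice(reader, limit))
    header = list(reader.fieldnames or [])
    return header, rows


# Invented sample standing in for data/sem.csv; the real columns may differ.
SAMPLE = io.StringIO(
    "rank,label,text\n"
    "0.12,Condition,dementia\n"
    "0.09,Food,red meat\n"
    "0.04,Food,nuts\n"
)

header, rows = preview_rows(SAMPLE, limit=2)
print(header)
for row in rows:
    print(row)
```

For the real file, pass `open("data/sem.csv", newline="", encoding="utf-8")` in place of the sample.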

Generalized, Unbundled Process

Objective:

Construct a knowledge graph (KG) using open source libraries where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.

These steps define a generalized process, where this tutorial picks up at the lexical graph (without the EL part yet):

Semantic overlay:

Load any pre-defined controlled vocabularies directly into the KG.

Data graph:

Load the structured data sources or updates into a data graph.
Perform entity resolution (ER) on PII extracted from the data graph.
Use ER results to generate a semantic overlay as a "backbone" for the KG.

Lexical graph:

Parse the text chunks, using lemmatization to normalize token spans.
Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
Apply named entity recognition (NER) to extract candidate entities from NP spans.
Apply relation extraction (RE) to extract relations between pairwise entities.
Perform entity linking (EL) leveraging the ER results.
Promote the extracted entities and relations up to the semantic overlay.
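The last two steps of the lexical graph stage (entity linking, then promotion to the semantic overlay) can be sketched as below. This is a simplified illustration, not code from this repo, which does not yet include the EL part: the `OverlayEntity` class, the alias lookup, and the sample names are hypothetical, and lowercasing with whitespace cleanup stands in for real lemmatization.

```python
from dataclasses import dataclass, field


@dataclass
class OverlayEntity:
    """A resolved entity in the semantic overlay (hypothetical structure)."""
    entity_id: str
    label: str
    aliases: set[str] = field(default_factory=set)
    mentions: list[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace; a stand-in for lemmatization."""
    return " ".join(text.lower().split())


def link_entities(candidates: list[str], overlay: list[OverlayEntity]) -> list[str]:
    """Attach candidate spans to overlay entities; return the spans left unlinked."""
    alias_index = {normalize(alias): ent for ent in overlay for alias in ent.aliases}
    unlinked = []
    for span in candidates:
        ent = alias_index.get(normalize(span))
        if ent is None:
            unlinked.append(span)  # left for human review, not guessed at
        else:
            ent.mentions.append(span)  # promoted up to the overlay
    return unlinked


# Made-up ER results and NER candidates, for illustration only.
overlay = [OverlayEntity("org:001", "Organization", {"Acme Corp", "Acme Corporation"})]
candidates = ["ACME  Corp", "acme corporation", "Globex"]

unlinked = link_entities(candidates, overlay)
print(overlay[0].mentions)
print(unlinked)
```

Spans which fail to link are returned rather than forced into a match, which leaves the judgement call to a person.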

Of course many vendors suggest using a large language model (LLM) as a one-size-fits-all (OSFA) "black box" approach for extracting entities and generating an entire graph automagically.

However, the business process of resolution -- for both entities and relations -- requires judgements. If the entities getting resolved are low-risk, low-effort in nature, then yeah knock yourself out. If the entities represent people or organizations, these have agency and may take action when applications misrepresent them, which has consequences.

Whenever judgements get delegated to model-based approaches, generalization becomes a form of reasoning employed. When the technology within the model is based on loss functions, then generalization becomes dominant -- regardless of any marketing claims about "AI reasoning" made by tech firms.

Fortunately, decisions can be made without models, even in AI applications. Shock, horror!!! Please, say it isn't so!?! Brace yourselves, using models is a thing, but not the only thing. For more detailed discussion, see:

Part 1: Let's talk about "Today's AI" https://www.linkedin.com/pulse/lets-talk-todays-ai-paco-nathan-co60c/
Part 2: Let's talk about "Resolution" https://www.linkedin.com/pulse/lets-talk-resolution-paco-nathan-ryjhc/

Also keep in mind that black box approaches don't work especially well for regulated environments, where audits, explanations, evidence, data provenance, etc., are required.

Moreover, KGs used in mission-critical apps, such as investigations, generally require periodic data updates, so construction isn't a one-step process. By producing a KG based on the approach sketched above, updates can be handled more effectively. Any downstream use cases, such as AI applications, also benefit from improved quality of semantics and representation.

Experiment: Relation Extraction library evals

Current Python libraries for relation extraction (RE) are probably best characterized as "experimental research projects".

Their tokenization approaches tend to make the mistake of "throwing the baby out with the bath water" by not leveraging other available information, e.g., what we have in the textgraph representation of the parsed documents. Also, they tend to ignore the semantic constraints of the domain context, while computationally boiling the ocean.

RE libraries which have been evaluated:

GLiREL: https://github.com/jackboyla/GLiREL
ReLIK: https://github.com/SapienzaNLP/relik
OpenNRE: https://github.com/thunlp/OpenNRE
mREBEL: https://github.com/Babelscape/rebel

This project previously used GLiREL, although its results were quite sparse. RE will be replaced by BAML or DSPy workflows in the near future.

There is some experimental code which illustrates OpenNRE evaluation. Use the archive/nre.sh script to load OpenNRE pre-trained models before running the archive/opennre.ipynb notebook.

This may not work in many environments, depending on how well the OpenNRE library is being maintained.

Tutorial notebooks

There is a collection of Jupyter notebooks (now archived) which were used to prototype code. These help illustrate important intermediate steps within these workflows:

.venv/bin/jupyter-lab

Part 1: archive/construct.ipynb -- detailed KG construction using a lexical graph
Part 2: archive/chunk.ipynb -- simple example of how to scrape and chunk text
Part 3: archive/vector.ipynb -- query LanceDB table for text chunk embeddings (after running build.py)
Part 4: archive/embed.ipynb -- query the entity embedding model (after running build.py)

Developer notes

After each BAML release update, some committer needs to regenerate its Python client source:

poetry run baml-cli generate --from strwythura/baml_src

Kudos to @prrao87, @hellovai, @louisguitton, @cj2001

FAQ

Q: "Have you tried this with langextract yet?"
A: "I'll take 'How does an instructor know a student ignored the README?' from the FAFO category, for $200" ... but yes of course, it's an interesting package, building on other interesting work used here.

Q: "What the hell is the name of this repo about?"
A: "As you may have noticed, many open source projects by Derwen are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word strwythura translates as the verb 'structure' in English."

Q: "Why aren't you using an LLM instead to build the graph?"
A: "I promise to visit you in jail."

License and Copyright

Source code, documentation, and examples have an MIT license which is succinct and simplifies use in commercial applications.
