A complete workflow for generating, normalizing, and visualizing Knowledge Graphs from unstructured Hebrew text
SimpleKG
SimpleKG is a Python library for generating Knowledge Graphs from unstructured text using LLMs. It is designed for humanities and digital scholarship research — particularly for multi-lingual, domain-specific corpora such as rabbinic Hebrew, ancient Greek patristic literature, and legal documents. The primary use case is cross-text comparison: extracting KGs from multiple source texts under a shared ontology, then comparing their structure to detect text reuse, semantic proximity, or shared tradition.
Table of Contents
- Installation
- Environment Setup
- Command Line Usage
- Python API — Basic Usage
- Post-KG: ACT Format and Visualization
- Pipeline Configuration Reference
- Signature Modules (Domains)
- Implementation
Installation
# From PyPI
pip install simplekg
# From source (recommended for development)
git clone https://gitlab.com/millerhadar/simplekg.git
cd simplekg
uv sync
# With optional stanza NLP support
uv sync --extra stanza
Environment Setup
Create a .env file in your project root:
# LLM
OPENAI_API_KEY=sk-...
# Elasticsearch (optional — only needed for ACT graph storage)
ELASTIC_HOST=https://your-es-host
ELASTIC_USER=your-user
ELASTIC_PASSWORD=your-password
Load it in your script or notebook:
from dotenv import load_dotenv
load_dotenv(".env")
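A missing key otherwise tends to surface only when the first LLM call fails, so it can help to check the environment right after loading it. The following is a minimal sketch using only the standard library; the variable names come from the .env example above, while `missing_env` is a hypothetical helper and not part of SimpleKG.

```python
import os

# Variable names taken from the .env example above.
REQUIRED = ["OPENAI_API_KEY"]
ELASTIC = ["ELASTIC_HOST", "ELASTIC_USER", "ELASTIC_PASSWORD"]  # only for ACT graph storage


def missing_env(names, env=None):
    """Return the names that are unset or empty (hypothetical helper, not a SimpleKG API)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demonstration with an explicit mapping instead of the real environment.
example = {"OPENAI_API_KEY": "sk-...", "ELASTIC_HOST": "https://your-es-host"}
print(missing_env(REQUIRED, example))  # []
print(missing_env(ELASTIC, example))   # ['ELASTIC_USER', 'ELASTIC_PASSWORD']
```

Calling `missing_env(REQUIRED)` without a second argument checks the real process environment after `load_dotenv` has run.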
Command Line Usage
kg_gen.py is the main script for batch pipeline execution from the command line.
# Basic usage
uv run kg_gen.py -f <file_name> -s <signature_module>
# Examples
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic -c 1500
uv run kg_gen.py -f IbnShuib19_2 -s signatures_ramban -c -1 -x Ramban19_2
uv run kg_gen.py -f Greek_tlg0526_tlg004_6_201_210 -s signatures_ancient_greek -c 1500 &
Arguments:
| Flag | Description |
|---|---|
| -f | Text file name (without .txt extension), looked up in the configured source path |
| -s | Signature module (domain): signatures_rabbinic, signatures_ramban, signatures_legal, signatures_wa, signatures_ancient_greek |
| -c | Chunk size in characters (default: 1100). Use -1 to chunk by sentences |
| -x | Optional context file name; provides background document context to the LLM |
| -p | Enable text preprocessing step before extraction |
| -r | Override source path for the input file |
| -o | Override output directory |
Output is written to a structured directory under base_output_path:
kg_d<file_name>/
C0_O0/singleStepRelations/
final_knowledge_graph.json
final_knowledge_graph_visualization.html
step_1_processed_subgraphs.json
...
logs/
nkg_pipeline.log
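After a batch run it is often convenient to collect every final graph below the output root. The sketch below searches recursively, so it does not depend on the exact nesting of the run directories; it assumes only the file name `final_knowledge_graph.json` shown above. `find_final_graphs` and `load_graphs` are hypothetical helpers, and no assumption is made about the JSON schema.

```python
import json
from pathlib import Path


def find_final_graphs(base_output_path):
    """Return every final_knowledge_graph.json below base_output_path, sorted by path."""
    return sorted(Path(base_output_path).rglob("final_knowledge_graph.json"))


def load_graphs(base_output_path):
    """Map each graph file path to its parsed JSON content."""
    graphs = {}
    for path in find_final_graphs(base_output_path):
        with open(path, encoding="utf-8") as f:
            graphs[path] = json.load(f)
    return graphs
```

For example, `load_graphs("/tmp/kg_output/")` would return one entry per document processed into that directory.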
Python API — Basic Usage
Rabbinic / Hebrew Text
import os
from dotenv import load_dotenv
from simplekg import NKGGenerator
load_dotenv(".env")
generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
)
text = """מֵאֵימָתַי קוֹרִין אֶת שְׁמַע בְּעַרְבִית. מִשָּׁעָה שֶׁהַכֹּהֲנִים נִכְנָסִים
לֶאֱכֹל בִּתְרוּמָתָן, עַד סוֹף הָאַשְׁמוּרָה הָרִאשׁוֹנָה, דִּבְרֵי רַבִּי אֱלִיעֶזֶר."""
pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_output/",
}
generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,  # 0 = no chunking, process as one unit
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)
Ancient Greek with Ontology Normalization
Ancient Greek texts require a text normalization function for matching LLM output back to the source (removing diacritics, normalizing sigma variants, etc.). Ontology normalization aligns all extracted predicates and entity types to the AncientGreekOntology — enabling meaningful cross-text comparison.
import os, unicodedata, re, regex
from dotenv import load_dotenv
from simplekg import NKGGenerator
load_dotenv(".env")
def normalize_greek(text, filter_non_greek=True):
    """Strip diacritics and normalize Greek letter variants."""
    custom_mapping = {
        '\u03c2': 'σ',  # final sigma → regular sigma
        '\u03f2': 'σ',  # lunate sigma → regular sigma
    }
    # Decompose so that accents and breathings become separate combining marks
    normalized = unicodedata.normalize('NFD', text)
    text = re.sub(r'[\u0300-\u036F]', '', normalized)
    # Rejoin Greek words that were hyphenated across whitespace or a line break
    text = regex.sub(r'(\p{Script=Greek})[-—]\s+(\p{Script=Greek})', r'\1\2', text)
    if filter_non_greek:
        text = re.sub('[^\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F ]+', '', text)
    # Lowercase before unifying sigma: str.lower() turns a word-final Σ into ς
    text = text.lower()
    for char, unified in custom_mapping.items():
        text = text.replace(char, unified)
    return text
generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_ancient_greek",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
    normalize_text_for_matching=normalize_greek,  # domain-specific normalization
)
text = ("ωστε συναγεσθαι απο πρωτου ετουσ κυρου και περσων βασιλειασ επι το τελοσ "
"τησ των μακκαβαιων γραφησ και επι την σιμωνοσ του αρχιερεωσ τελευτην ετη "
"τετρακοσια εικοσιπεντε")
pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "normalizeOntology": True,  # map predicates + entity types to AncientGreekOntology
    "twoStepRelations": False,
    "forceOrphanRelation": True,  # attempt to connect isolated concepts
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_greek/",
}
generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)
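Matching works only if the source text and the LLM output pass through the same normalization, so it is worth checking the function on a few known words before a long run. The block below is a stdlib-only sketch of the core idea (decompose, drop combining marks, lowercase, unify sigma forms); it leaves out the hyphen rejoining and non-Greek filtering, and `strip_and_fold` is a hypothetical name, not a SimpleKG function.

```python
import re
import unicodedata

SIGMA_VARIANTS = {"\u03c2": "σ", "\u03f2": "σ"}  # final sigma and lunate sigma


def strip_and_fold(text):
    """Drop diacritics, lowercase, then unify sigma forms (illustrative sketch)."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = re.sub(r"[\u0300-\u036F]", "", decomposed)
    lowered = bare.lower()  # lowercase first: str.lower() can emit a final sigma
    for variant, unified in SIGMA_VARIANTS.items():
        lowered = lowered.replace(variant, unified)
    return lowered


print(strip_and_fold("Ἀριστοτέλης"))  # αριστοτελησ
print(strip_and_fold("ΛΟΓΟΣ"))        # λογοσ
```

Lowercasing before the sigma mapping matters for capitalized input, because Python applies the final-sigma rule when lowercasing a word-final Σ.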
Post-KG: ACT Format and Visualization
After pipeline execution, the graph can be converted to ACT format (Annotated Concept Tree) — a JSON structure suitable for graph databases and network analysis — and then visualized as an interactive HTML graph.
import json
from simplekg.utilities import utils, networkXutils

nxu = networkXutils.NXUtils()

# Convert subgraphs to ACT format
ret = utils.kg2ACT(
    generator.graph.subgraphs,
    location="my_document_id",
    categories=["private", "my_project"],
    optin_entity_type=[],  # empty = include all entity types
    clean_overlapping=False,
    normalizers={},
    additional_attrs=["prefLabel_en"],
    additional_edge_attrs=["predicate", "evidence_text"],
)

# Optionally store ACT graph as JSON
with open("/tmp/kg_greek/final_knowledge_graph_ACT.json", "w") as f:
    json.dump(ret, f, indent=4, ensure_ascii=False)

# Convert to NetworkX graph for visualization
actnx = nxu.graphACT2nx(
    ret,
    title_node_attrs=["conceptDescription_en"],
    node_label_attr="prefLabel_en",
    edge_label="weight",
    edge_hover=["predicate"],
)

# Render interactive HTML visualization
nxu.visualize_graph(
    actnx,
    output_file="/tmp/kg_greek/visualization.html",
    open_browser=True,
    show_legend=True,
)
The resulting HTML file contains a fully interactive graph (powered by pyvis) with hover tooltips, legend, and drag-and-drop layout.
Pipeline Configuration Reference
All pipeline flags are passed as a dictionary to execute_pipeline(). Missing flags fall back to defaults.
| Flag | Type | Default | Description |
|---|---|---|---|
| preProcessText | bool | False | Run domain-specific text preprocessing before extraction |
| processSubGraphs | bool | True | Extract concepts and relations from each chunk |
| tagDefinitions | bool | False | Mark definition concepts (e.g., "X is defined as...") |
| twoStepRelations | bool | False | Two-step relation extraction: candidates first, then structured relations |
| forceOrphanRelation | bool | False | Attempt to attach isolated concepts (no relations) to the graph |
| normalizeOntology | bool | False | Enforce domain ontology on entity types and predicates (see below) |
| consolidateSubGraphs | bool | True | Merge per-chunk subgraphs into one consolidated graph |
| mergeConcepts | bool | True | Cluster and merge semantically equivalent concepts |
| pruneConcepts | bool | False | Remove low-confidence or isolated concepts |
| storeGraph | bool | False | Save final graph as JSON |
| storeEmbeddings | bool | False | Save concept embeddings alongside the graph |
| storeVisualization | bool | False | Render and save an HTML visualization |
| storeGraphSteps | bool | False | Save intermediate pipeline stages as JSON snapshots |
| outputPath | str | None | Base output directory; subdirectory structure is auto-created |
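Because missing flags fall back to defaults, a misspelled flag name could go unnoticed. The sketch below resolves a partial pipeline dictionary against the defaults listed in the table and rejects unknown keys. `DEFAULT_PIPELINE` and `resolve_pipeline` are hypothetical names for illustration; whether the library itself rejects unknown flags is not stated here.

```python
# Defaults copied from the table above; these names are illustrative, not SimpleKG APIs.
DEFAULT_PIPELINE = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "forceOrphanRelation": False,
    "normalizeOntology": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": False,
    "storeEmbeddings": False,
    "storeVisualization": False,
    "storeGraphSteps": False,
    "outputPath": None,
}


def resolve_pipeline(overrides):
    """Return the full flag set: defaults overlaid with the caller's overrides."""
    unknown = sorted(set(overrides) - set(DEFAULT_PIPELINE))
    if unknown:
        raise KeyError(f"Unknown pipeline flags: {unknown}")
    return {**DEFAULT_PIPELINE, **overrides}


resolved = resolve_pipeline({"normalizeOntology": True, "outputPath": "/tmp/kg_greek/"})
print(resolved["normalizeOntology"], resolved["mergeConcepts"], resolved["storeGraph"])
# True True False
```

Printing the resolved dictionary before a long run makes it explicit which stages will execute and which outputs will be stored.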
Chunking parameters (passed directly to execute_pipeline, not inside the pipeline dict):
| Parameter | Default | Description |
|---|---|---|
| chunk_size | 0 | Characters per chunk. 0 = no chunking. -1 = chunk by sentences |
| chunk_by_sentences | False | Automatically set to True when chunk_size=-1 |
| chunk_overlap_sentences | 2 | Sentence overlap between adjacent chunks (for context continuity) |
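To illustrate what sentence overlap means in practice, here is a minimal sliding-window sketch. It is not the library's chunker: the naive regex splitter, the `sentences_per_chunk` parameter, and both function names are assumptions made for the example.

```python
import re


def split_sentences(text):
    """Naive splitter on '.', '!', '?' and the Hebrew sof pasuq; a stand-in for a real NLP splitter."""
    parts = re.split(r"(?<=[.!?\u05C3])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, sentences_per_chunk, overlap=2):
    """Group sentences into windows that share `overlap` sentences with the previous window."""
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than sentences_per_chunk")
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_sentences(split_sentences("A. B. C. D. E."), 3, overlap=1))
# ['A. B. C.', 'C. D. E.']
```

The repeated sentence ("C." in the example) gives the LLM the context needed to resolve references that cross a chunk boundary, at the cost of extracting some concepts twice; those duplicates are what the later merge stage is meant to reconcile.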
Signature Modules (Domains)
Each domain has a dedicated SignatureRegistry under simplekg/signatures/ that encapsulates the DSPy extraction prompts (signatures) for that domain's language and conventions.
| Module | Class | Domain |
|---|---|---|
| signatures_rabbinic | RabbinicSignatureRegistry | Mishnah, Talmud, and rabbinic Hebrew literature |
| signatures_ramban | RambanSignatureRegistry | Ramban biblical commentary (medieval Hebrew) |
| signatures_legal | LegalSignatureRegistry | Hebrew legal documents and work agreements |
| signatures_wa | WorkAgreementsSignatureRegistry | Structured work agreement analysis |
| signatures_ancient_greek | AncientGreekSignatureRegistry | Ancient Greek historical, patristic, and classical texts |
Pass the module name as a string to NKGGenerator:
generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
)
Implementation
Pipeline Architecture
NKGGenerator.execute_pipeline() orchestrates a sequential set of stages. Each stage is independently gated by a flag in the pipeline dictionary and can be disabled without affecting the others.
execute_pipeline(text)
│
├── init_graph() Split text into chunks → DocGraph with subgraphs
│
├── processSubGraphs() Per chunk (parallel):
│ ├── _extractChunkConcepts() LLM: terms → concepts
│ ├── _enforce_entity_types() [if normalizeOntology] validate against ontology
│ ├── _extractChunkRelations() LLM: concepts → relations
│ ├── _normalize_predicates() [if normalizeOntology] map predicates to ontology
│ └── _forceOrphanRelation() [if forceOrphanRelation] connect isolated concepts
│
├── consolidateSubGraphs() Merge all subgraph concepts + relations with id remapping
│
├── mergeConcepts() Cluster semantically equivalent concepts; pick canonical form
│
├── pruneConcepts() Remove low-value nodes
│
└── storeGraph / storeVisualization / storeEmbeddings
Each chunk is processed as a BaseGraph object (the subgraph). After consolidation the full document is a DocGraph containing the consolidated graph and all subgraphs.
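Id remapping is needed because each chunk numbers its concepts independently, so the same local id can appear in several subgraphs. The sketch below shows the idea with plain dictionaries; the `g<gid>_<id>` naming scheme and the `consolidate` function are assumptions for illustration and may differ from what consolidateSubGraphs() actually does.

```python
def consolidate(subgraphs):
    """Merge subgraphs into one graph, giving every concept a document-wide id."""
    concepts, relations = [], []
    for sub in subgraphs:
        remap = {}  # local concept id -> document-wide id, scoped to this subgraph
        for concept in sub["concepts"]:
            new_id = f"g{sub['gid']}_{concept['id']}"
            remap[concept["id"]] = new_id
            concepts.append({**concept, "id": new_id})
        for rel in sub["relations"]:
            relations.append({
                **rel,
                "subject_id": remap[rel["subject_id"]],
                "object_id": remap[rel["object_id"]],
            })
    return {"concepts": concepts, "relations": relations}


subgraphs = [
    {"gid": 0,
     "concepts": [{"id": "c1", "prefLabel": "Cyrus"}, {"id": "c2", "prefLabel": "Persia"}],
     "relations": [{"subject_id": "c1", "predicate": "ruledOver", "object_id": "c2"}]},
    {"gid": 1,
     "concepts": [{"id": "c1", "prefLabel": "Simon"}, {"id": "c2", "prefLabel": "high priest"}],
     "relations": [{"subject_id": "c1", "predicate": "heldRole", "object_id": "c2"}]},
]
merged = consolidate(subgraphs)
print([c["id"] for c in merged["concepts"]])
# ['g0_c1', 'g0_c2', 'g1_c1', 'g1_c2']
```

After this step the two `c1` concepts no longer collide, and deciding whether concepts from different chunks denote the same entity is left to the mergeConcepts stage.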
Signature Registry System
The SignatureRegistry (in simplekg/signatures/signature_registry.py) is an abstract base class that defines the interface every domain must implement. Each method returns a DSPy Signature class used by the pipeline.
class SignatureRegistry(ABC):
    # Interface outline; method bodies are elided
    def get_harvest_terms_signature(self): ...  # text chunk → term candidates
    def get_candidates_to_concepts_signature(self): ...  # candidates → Concept objects
    def get_harvest_definitions_signature(self): ...  # identify definition concepts
    def get_harvest_relation_candidates_signature(self): ...  # two-step: raw candidates
    def get_candidates_to_relations_signature(self): ...  # two-step: structured relations
    def get_harvest_relation_signature(self): ...  # one-step relation extraction
    def get_propose_concept_merges_signature(self): ...  # merge proposals across subgraphs
    def get_resolve_orphan_signature(self): ...  # connect isolated concepts (default impl.)
    def get_ontology(self): ...  # returns None by default
    def get_predicate_mapping_signature(self): ...  # maps free predicates to ontology (default impl.)
Creating a new domain requires subclassing SignatureRegistry and implementing the abstract methods with domain-specific DSPy signatures. The pipeline discovers the registry class by convention (class name ending in SignatureRegistry) via importlib.
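Discovery by naming convention can be sketched with `importlib` and `inspect` as follows. This is an illustration of the technique rather than the library's actual lookup code: `find_registry_class` is a hypothetical name, and the fully qualified module path a real caller would pass (for example something under `simplekg.signatures`) is an assumption.

```python
import importlib
import inspect


def find_registry_class(module_name, suffix="SignatureRegistry"):
    """Import a module and return the first class defined in it whose name ends with `suffix`."""
    module = importlib.import_module(module_name)
    for name, obj in inspect.getmembers(module, inspect.isclass):
        # Skip the abstract base itself and classes that were only imported into the module.
        if name.endswith(suffix) and name != suffix and obj.__module__ == module.__name__:
            return obj
    raise LookupError(f"No class ending in {suffix!r} is defined in {module_name}")
```

A consequence of this convention is that a domain module should define exactly one class with the `SignatureRegistry` suffix; with several, which one is picked depends on the iteration order (alphabetical in this sketch).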
Ontology Normalization
By default, the LLM extracts relations using free-form predicates — maximizing recall but producing varied vocabulary that makes cross-text comparison unreliable. Ontology normalization is a post-extraction step that maps every extracted predicate and entity type to a fixed, domain-specific controlled vocabulary.
Enabling normalization:
pipeline = {
    # ... other flags
    "normalizeOntology": True,
}
Normalization is active only when the pipeline flag is True and the signature registry returns an ontology from get_ontology(). Registries that do not define an ontology are unaffected, even if the flag is set.
How it works:
Entity type enforcement (deterministic, no LLM call):
After concept extraction, each concept.entity_type is validated against the ontology's entity type list. Invalid types are silently replaced with the generic fallback (e.g., "Entity"). The original LLM output is preserved in concept.entity_type_raw.
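The enforcement rule described above can be sketched in a few lines. The example uses a plain dictionary instead of the Concept model, and `enforce_entity_type` is a hypothetical function written for illustration; only the behavior (keep valid types, fall back otherwise, preserve the raw value) is taken from the description.

```python
VALID_TYPES = {"Person", "Place", "Polity"}  # a small subset, for illustration only


def enforce_entity_type(concept, valid_types=VALID_TYPES, fallback="Entity"):
    """Keep a valid entity type, replace an invalid one with the fallback, and record the raw value."""
    raw = concept.get("entity_type")
    enforced = dict(concept)  # leave the caller's dictionary untouched
    enforced["entity_type_raw"] = raw
    enforced["entity_type"] = raw if raw in valid_types else fallback
    return enforced


print(enforce_entity_type({"prefLabel": "Κῦρος", "entity_type": "Person"})["entity_type"])
# Person
print(enforce_entity_type({"prefLabel": "Κῦρος", "entity_type": "King"})["entity_type"])
# Entity
```

Since the replacement is silent, comparing `entity_type` with `entity_type_raw` across a finished graph is a quick way to see which types the LLM proposed that the ontology does not cover.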
Predicate normalization (pre-filter + LLM + post-validate):
Relations are grouped by (subject_type, object_type) pair. For each group:
- 0 candidates after domain/range filtering → assign generic_predicate directly (no LLM call)
- 1 candidate → assign directly (no LLM call)
- N candidates → single batched LLM call mapping all relations in the group
The original predicate is preserved in relation.predicate_raw.
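The grouping and pre-filter logic above can be sketched as follows. The example works on plain dictionaries and assumes the subject and object types have already been looked up from the concepts. `ruledOver` comes from the ontology example in this README; `bornIn`, `candidates_for`, and `plan_normalization` are hypothetical and serve only the illustration. The LLM call itself is not shown.

```python
from collections import defaultdict

# Toy predicate table: "bornIn" is a made-up second predicate for illustration.
PREDICATES = [
    {"id": "ruledOver", "domain": ["Person", "Polity"], "range": ["Place", "Polity"]},
    {"id": "bornIn", "domain": ["Person"], "range": ["Place"]},
]


def candidates_for(subject_type, object_type, predicates=PREDICATES):
    """Domain/range pre-filter: predicate ids admissible for this type pair."""
    return [
        p["id"] for p in predicates
        if subject_type in p["domain"] and object_type in p["range"]
    ]


def plan_normalization(relations, generic_predicate="relatedTo"):
    """Group relations by (subject_type, object_type) and decide how each group is resolved."""
    groups = defaultdict(list)
    for rel in relations:
        groups[(rel["subject_type"], rel["object_type"])].append(rel)

    plan = {}
    for type_pair, members in groups.items():
        candidates = candidates_for(*type_pair)
        if not candidates:
            plan[type_pair] = ("assign", generic_predicate, len(members))
        elif len(candidates) == 1:
            plan[type_pair] = ("assign", candidates[0], len(members))
        else:
            plan[type_pair] = ("llm", candidates, len(members))  # one batched call per group
    return plan


relations = [
    {"subject_type": "Person", "object_type": "Place", "predicate": "reigned in"},
    {"subject_type": "Person", "object_type": "Place", "predicate": "was born at"},
    {"subject_type": "Polity", "object_type": "Polity", "predicate": "conquered"},
    {"subject_type": "Work", "object_type": "Person", "predicate": "mentions"},
]
for pair, decision in plan_normalization(relations).items():
    print(pair, decision)
```

In this toy run only the (Person, Place) group would need an LLM call; the other two groups are resolved deterministically, which is where the pre-filter saves cost.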
Defining a domain ontology:
from simplekg.ontologies.base import Ontology, OntologyEntityType, OntologyPredicate

class MyOntology(Ontology):
    def __init__(self):
        super().__init__(
            id="my-domain-v1",
            description="Ontology for my domain.",
            generic_entity_type="Entity",
            generic_predicate="relatedTo",
            entity_types=[
                OntologyEntityType(
                    id="Person",
                    description="An individual.",
                    examples=["Aristotle", "Plato"],
                    aliases=["Individual", "Author"],
                ),
                # ... more types
            ],
            predicates=[
                OntologyPredicate(
                    id="ruledOver",
                    label="ruled over",
                    description="A person or polity exercised political authority over a place.",
                    domain=["Person", "Polity"],
                    range=["Place", "Polity"],
                    examples=["Caesar ruledOver Rome"],
                    aliases=["governed", "controlled", "founded", "established", "led"],
                ),
                # ... more predicates
            ],
        )
Inject it via the registry:
class MySignatureRegistry(SignatureRegistry):
    def get_ontology(self):
        return MyOntology()

    # ... implement other abstract methods
AncientGreekOntology (simplekg/ontologies/ancient_greek.py) is the reference implementation, defining 11 entity types (Person, Place, Polity, Role, Event, TimePeriod, Work, Abstraction, Ethnonym, Practice, Artifact) and 28 predicates with full domain/range constraints and alias lists. See OntologyBasedKGImplementation.md for the full design rationale.
Data Models
Core objects are Pydantic models defined in simplekg/models.py.
Concept — an extracted entity:
class Concept(BaseModel):
    id: str                          # unique identifier within the graph
    prefLabel: str                   # canonical label in source language
    prefLabel_en: str                # English translation
    altLabels: List[str]             # alternative surface forms
    conceptDescription: str          # description in source language
    conceptDescription_en: str       # English description
    entity_type: str                 # ontology-enforced type (e.g., "Person")
    entity_type_raw: Optional[str]   # original LLM output before enforcement
    concept_position: int            # character offset in source text
    definition: bool                 # True if this concept is a definition
Relation — an extracted relationship:
class Relation(BaseModel):
    subject_id: str                  # id of the subject Concept
    predicate: str                   # ontology-normalized predicate
    predicate_raw: Optional[str]     # original LLM predicate before mapping
    object_id: str                   # id of the object Concept
    evidence_text: Optional[str]     # text span supporting this relation
BaseGraph — a single chunk's subgraph:
class BaseGraph:
    gid: int
    text: str
    concepts: List[Concept]
    relation_objects: List[Relation]
DocGraph — the full document:
class DocGraph:
    consolidated_graph: BaseGraph
    subgraphs: List[BaseGraph]
Project Structure
simplekg/
    kg.py                            NKGGenerator (main pipeline class)
    models.py                        Pydantic data models (Concept, Relation, BaseGraph, DocGraph)
    signatures/
        signature_registry.py        Abstract base class for all signature registries
        signatures_rabbinic.py       Rabbinic Hebrew signatures
        signatures_ramban.py         Ramban commentary signatures
        signatures_legal.py          Legal document signatures
        signatures_wa.py             Work agreement signatures
        signatures_ancient_greek.py  Ancient Greek signatures
    ontologies/
        base.py                      OntologyEntityType, OntologyPredicate, Ontology base classes
        ancient_greek.py             AncientGreekOntology (reference implementation)
    utilities/
        utils.py                     kg2ACT conversion, text utilities
        networkXutils.py             NetworkX graph construction and visualization
        ElasticUtils.py              Elasticsearch storage and retrieval
tests/                               Applicative / research notebooks and ES test scripts
test_sys/                            System and pipeline regression tests
kg_gen.py                            Command-line pipeline runner
Citation
If you use SimpleKG in your research, please cite:
@software{simplekg,
  author = {Hadar Miller},
  title = {SimpleKG: Knowledge Graph Generation for Humanities Research},
  url = {https://gitlab.com/millerhadar/simplekg},
  version = {0.1.3},
  year = {2025}
}