A complete workflow for generating, normalizing, and visualizing Knowledge Graphs from unstructured Hebrew text

SimpleKG

SimpleKG is a Python library for generating Knowledge Graphs from unstructured text using LLMs. It is designed for humanities and digital scholarship research — particularly for multi-lingual, domain-specific corpora such as rabbinic Hebrew, ancient Greek patristic literature, and legal documents. The primary use case is cross-text comparison: extracting KGs from multiple source texts under a shared ontology, then comparing their structure to detect text reuse, semantic proximity, or shared tradition.

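Table of Contents

Installation
Environment Setup
Command Line Usage
Python API — Basic Usage
- Rabbinic / Hebrew Text
- Ancient Greek with Ontology Normalization
Post-KG: ACT Format and Visualization
Pipeline Configuration Reference
Signature Modules (Domains)
Implementation
- Pipeline Architecture
- Signature Registry System
- Ontology Normalization
- Data Models
- Project Structure
Citation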

Installation

# From PyPI
pip install simplekg

# From source (recommended for development)
git clone https://gitlab.com/millerhadar/simplekg.git
cd simplekg
uv sync

# With optional stanza NLP support
uv sync --extra stanza

Environment Setup

Create a .env file in your project root:

# LLM
OPENAI_API_KEY=sk-...

# Elasticsearch (optional — only needed for ACT graph storage)
ELASTIC_HOST=https://your-es-host
ELASTIC_USER=your-user
ELASTIC_PASSWORD=your-password

Load it in your script or notebook:

from dotenv import load_dotenv
load_dotenv(".env")

Command Line Usage

kg_gen.py is the main script for batch pipeline execution from the command line.

# Basic usage
uv run kg_gen.py -f <file_name> -s <signature_module>

# Examples
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic -c 1500
uv run kg_gen.py -f IbnShuib19_2 -s signatures_ramban -c -1 -x Ramban19_2
uv run kg_gen.py -f Greek_tlg0526_tlg004_6_201_210 -s signatures_ancient_greek -c 1500 &

Arguments:

Flag	Description
`-f`	Text file name (without `.txt` extension), looked up in the configured source path
`-s`	Signature module (domain): `signatures_rabbinic`, `signatures_ramban`, `signatures_legal`, `signatures_wa`, `signatures_ancient_greek`
`-c`	Chunk size in characters (default: 1100). Use `-1` to chunk by sentences
`-x`	Optional context file name — provides background document context to the LLM
`-p`	Enable text preprocessing step before extraction
`-r`	Override source path for the input file
`-o`	Override output directory
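
The flag table maps naturally onto an argparse parser. The sketch below is not the actual kg_gen.py source: the long option names (`--file`, `--signature`, and so on), the `build_parser` function, and the choice to treat `-p` as a boolean switch are assumptions made for illustration. Only the short flags and the default chunk size of 1100 come from the table above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented kg_gen.py flags."""
    parser = argparse.ArgumentParser(prog="kg_gen.py")
    parser.add_argument("-f", "--file", required=True,
                        help="text file name, without the .txt extension")
    parser.add_argument("-s", "--signature", required=True,
                        help="signature module (domain)")
    parser.add_argument("-c", "--chunk-size", type=int, default=1100,
                        help="chunk size in characters; -1 chunks by sentences")
    parser.add_argument("-x", "--context", default=None,
                        help="optional context file name")
    parser.add_argument("-p", "--preprocess", action="store_true",
                        help="enable text preprocessing before extraction")
    parser.add_argument("-r", "--source-path", default=None,
                        help="override the source path for the input file")
    parser.add_argument("-o", "--output-dir", default=None,
                        help="override the output directory")
    return parser


# Equivalent to: kg_gen.py -f IbnShuib19_2 -s signatures_ramban -c -1 -x Ramban19_2
args = build_parser().parse_args(
    ["-f", "IbnShuib19_2", "-s", "signatures_ramban", "-c", "-1", "-x", "Ramban19_2"]
)
print(args.file, args.chunk_size, args.context, args.preprocess)
```

Because the parser defines no options that look like negative numbers, argparse reads `-1` as the value of `-c` rather than as a separate flag.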

Output is written to a structured directory under base_output_path:

kg_d<file_name>/
  C0_O0/singleStepRelations/
    final_knowledge_graph.json
    final_knowledge_graph_visualization.html
    step_1_processed_subgraphs.json
    ...
  logs/
    nkg_pipeline.log
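
To pick up the final graph from a script, the path can be assembled from the layout listed above. This is a minimal sketch: `final_graph_path` is a hypothetical helper, and the `C0_O0` and `singleStepRelations` components are copied verbatim from the listing, so they may differ for other run configurations.

```python
from pathlib import Path


def final_graph_path(base_output_path, file_name):
    """Build the path to final_knowledge_graph.json for one input file.

    Assumes the directory layout shown in the listing; the middle
    components are taken literally from it and are not computed.
    """
    return (
        Path(base_output_path)
        / f"kg_d{file_name}"
        / "C0_O0"
        / "singleStepRelations"
        / "final_knowledge_graph.json"
    )


path = final_graph_path("/tmp/kg_output", "Ramban19_2")
print(path)
```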

Python API — Basic Usage

Rabbinic / Hebrew Text

import os
from dotenv import load_dotenv
from simplekg import NKGGenerator

load_dotenv(".env")

generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
)

text = """מֵאֵימָתַי קוֹרִין אֶת שְׁמַע בְּעַרְבִית. מִשָּׁעָה שֶׁהַכֹּהֲנִים נִכְנָסִים
לֶאֱכֹל בִּתְרוּמָתָן, עַד סוֹף הָאַשְׁמוּרָה הָרִאשׁוֹנָה, דִּבְרֵי רַבִּי אֱלִיעֶזֶר."""

pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_output/",
}

generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,           # 0 = no chunking, process as one unit
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)

Ancient Greek with Ontology Normalization

Ancient Greek texts require a text normalization function for matching LLM output back to the source (removing diacritics, normalizing sigma variants, etc.). Ontology normalization aligns all extracted predicates and entity types to the AncientGreekOntology — enabling meaningful cross-text comparison.

import os, unicodedata, re, regex
from dotenv import load_dotenv
from simplekg import NKGGenerator

load_dotenv(".env")

def normalize_greek(text, filter_non_greek=True):
    """Strip diacritics and normalize Greek letter variants."""
    custom_mapping = {
        '\u03c2': 'σ',  # final sigma → regular sigma
        '\u03f2': 'σ',  # lunate sigma → regular sigma
    }
    # Decompose to NFD so accents and breathings become combining marks, then drop them
    normalized = unicodedata.normalize('NFD', text)
    text = re.sub(r'[\u0300-\u036F]', '', normalized)
    # Rejoin Greek words that were hyphenated across a line break
    text = regex.sub(r'(\p{Script=Greek})[-—]\s+(\p{Script=Greek})', r'\1\2', text)
    if filter_non_greek:
        text = re.sub('[^\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F ]+', '', text)
    # Lowercase before unifying sigma: str.lower() turns a word-final Σ into ς
    text = text.lower()
    for char, unified in custom_mapping.items():
        text = text.replace(char, unified)
    return text


generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_ancient_greek",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
    normalize_text_for_matching=normalize_greek,   # domain-specific normalization
)

text = ("ωστε συναγεσθαι απο πρωτου ετουσ κυρου και περσων βασιλειασ επι το τελοσ "
        "τησ των μακκαβαιων γραφησ και επι την σιμωνοσ του αρχιερεωσ τελευτην ετη "
        "τετρακοσια εικοσιπεντε")

pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "normalizeOntology": True,        # map predicates + entity types to AncientGreekOntology
    "twoStepRelations": False,
    "forceOrphanRelation": True,      # attempt to connect isolated concepts
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_greek/",
}

generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)
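
To see what this kind of matching normalization does to a single word, here is a stdlib-only sketch. `strip_and_unify` is a hypothetical helper, not part of SimpleKG; it covers only the diacritic stripping, lowercasing, and sigma unification, and leaves out the hyphen rejoining and non-Greek filtering shown earlier.

```python
import unicodedata


def strip_and_unify(word):
    """Drop combining marks, lowercase, and map sigma variants to 'σ'."""
    decomposed = unicodedata.normalize("NFD", word)
    # After NFD, accents and breathings are separate combining characters
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_marks.lower()
    # U+03C2 is final sigma, U+03F2 is lunate sigma
    return lowered.replace("\u03c2", "σ").replace("\u03f2", "σ")


print(strip_and_unify("Ἀρχιερέως"))  # αρχιερεωσ
```

The result matches the form used in the sample text above, where every sigma is written as σ regardless of position.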

Post-KG: ACT Format and Visualization

After pipeline execution, the graph can be converted to ACT format (Annotated Concept Tree) — a JSON structure suitable for graph databases and network analysis — and then visualized as an interactive HTML graph.

import json
from simplekg.utilities import utils, networkXutils

nxu = networkXutils.NXUtils()

# Convert subgraphs to ACT format
ret = utils.kg2ACT(
    generator.graph.subgraphs,
    location="my_document_id",
    categories=["private", "my_project"],
    optin_entity_type=[],               # empty = include all entity types
    clean_overlapping=False,
    normalizers={},
    additional_attrs=["prefLabel_en"],
    additional_edge_attrs=["predicate", "evidence_text"],
)

# Optionally store ACT graph as JSON (UTF-8, because ensure_ascii=False writes Greek/Hebrew characters as-is)
with open("/tmp/kg_greek/final_knowledge_graph_ACT.json", "w", encoding="utf-8") as f:
    json.dump(ret, f, indent=4, ensure_ascii=False)

# Convert to NetworkX graph for visualization
actnx = nxu.graphACT2nx(
    ret,
    title_node_attrs=["conceptDescription_en"],
    node_label_attr="prefLabel_en",
    edge_label="weight",
    edge_hover=["predicate"],
)

# Render interactive HTML visualization
nxu.visualize_graph(
    actnx,
    output_file="/tmp/kg_greek/visualization.html",
    open_browser=True,
    show_legend=True,
)

The resulting HTML file contains a fully interactive graph (powered by pyvis) with hover tooltips, legend, and drag-and-drop layout.

Pipeline Configuration Reference

All pipeline flags are passed as a dictionary to execute_pipeline(). Missing flags fall back to defaults.

Flag	Type	Default	Description
`preProcessText`	bool	False	Run domain-specific text preprocessing before extraction
`processSubGraphs`	bool	True	Extract concepts and relations from each chunk
`tagDefinitions`	bool	False	Mark definition concepts (e.g., "X is defined as...")
`twoStepRelations`	bool	False	Two-step relation extraction: candidates first, then structured relations
`forceOrphanRelation`	bool	False	Attempt to attach isolated concepts (no relations) to the graph
`normalizeOntology`	bool	False	Enforce domain ontology on entity types and predicates (see below)
`consolidateSubGraphs`	bool	True	Merge per-chunk subgraphs into one consolidated graph
`mergeConcepts`	bool	True	Cluster and merge semantically equivalent concepts
`pruneConcepts`	bool	False	Remove low-confidence or isolated concepts
`storeGraph`	bool	False	Save final graph as JSON
`storeEmbeddings`	bool	False	Save concept embeddings alongside the graph
`storeVisualization`	bool	False	Render and save an HTML visualization
`storeGraphSteps`	bool	False	Save intermediate pipeline stages as JSON snapshots
`outputPath`	str	None	Base output directory; subdirectory structure is auto-created
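
The fallback to defaults can be pictured as a dictionary merge. The sketch below copies the defaults from the table above; `PIPELINE_DEFAULTS` and `resolve_pipeline` are hypothetical names, and rejecting unknown keys is a design choice of this example (it catches typos such as `storeGraphs`), not documented library behavior.

```python
# Defaults copied from the reference table above
PIPELINE_DEFAULTS = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "forceOrphanRelation": False,
    "normalizeOntology": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": False,
    "storeEmbeddings": False,
    "storeVisualization": False,
    "storeGraphSteps": False,
    "outputPath": None,
}


def resolve_pipeline(user_pipeline):
    """Overlay user-supplied flags on the defaults; reject unknown flag names."""
    unknown = set(user_pipeline) - set(PIPELINE_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown pipeline flags: {sorted(unknown)}")
    return {**PIPELINE_DEFAULTS, **user_pipeline}


resolved = resolve_pipeline({"normalizeOntology": True, "outputPath": "/tmp/kg_greek/"})
print(resolved["normalizeOntology"], resolved["mergeConcepts"], resolved["storeGraph"])
```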

Chunking parameters (passed directly to execute_pipeline, not inside the pipeline dict):

Parameter	Default	Description
`chunk_size`	0	Characters per chunk. `0` = no chunking. `-1` = chunk by sentences
`chunk_by_sentences`	False	Automatically set to True when `chunk_size=-1`
`chunk_overlap_sentences`	2	Sentence overlap between adjacent chunks (for context continuity)
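
The three `chunk_size` modes can be illustrated with a small stand-in chunker. This is a sketch of the documented semantics only: `sketch_chunks`, the naive sentence-splitting regex, and the way overlap is applied (each sentence carries the preceding sentences as context) are all assumptions, and the library's actual chunker may behave differently.

```python
import re


def sketch_chunks(text, chunk_size=0, chunk_overlap_sentences=2):
    """Illustrative chunker: 0 = one chunk, >0 = fixed-width slices, -1 = by sentence."""
    if chunk_size == 0:
        return [text]
    if chunk_size > 0:
        # Plain character slices; a real chunker would likely respect word boundaries
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # chunk_size == -1: split after ., ! or ? followed by whitespace (naive)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for i in range(len(sentences)):
        start = max(0, i - chunk_overlap_sentences)
        chunks.append(" ".join(sentences[start:i + 1]))
    return chunks


sample = "A one. B two. C three."
print(sketch_chunks(sample, chunk_size=0))
print(sketch_chunks(sample, chunk_size=10))
print(sketch_chunks(sample, chunk_size=-1, chunk_overlap_sentences=1))
```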

Signature Modules (Domains)

Each domain has a dedicated SignatureRegistry under simplekg/signatures/ that encapsulates the DSPy extraction prompts (signatures) for that domain's language and conventions.

Module	Class	Domain
`signatures_rabbinic`	`RabbinicSignatureRegistry`	Mishnah, Talmud, and rabbinic Hebrew literature
`signatures_ramban`	`RambanSignatureRegistry`	Ramban biblical commentary (medieval Hebrew)
`signatures_legal`	`LegalSignatureRegistry`	Hebrew legal documents and work agreements
`signatures_wa`	`WorkAgreementsSignatureRegistry`	Structured work agreement analysis
`signatures_ancient_greek`	`AncientGreekSignatureRegistry`	Ancient Greek historical, patristic, and classical texts

Pass the module name as a string to NKGGenerator:

generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
)

Implementation

Pipeline Architecture

NKGGenerator.execute_pipeline() orchestrates a sequential set of stages. Each stage is independently gated by a flag in the pipeline dictionary and can be disabled without affecting the others.

execute_pipeline(text)
  │
  ├── init_graph()             Split text into chunks → DocGraph with subgraphs
  │
  ├── processSubGraphs()       Per chunk (parallel):
  │     ├── _extractChunkConcepts()     LLM: terms → concepts
  │     ├── _enforce_entity_types()     [if normalizeOntology] validate against ontology
  │     ├── _extractChunkRelations()    LLM: concepts → relations
  │     ├── _normalize_predicates()     [if normalizeOntology] map predicates to ontology
  │     └── _forceOrphanRelation()      [if forceOrphanRelation] connect isolated concepts
  │
  ├── consolidateSubGraphs()   Merge all subgraph concepts + relations with id remapping
  │
  ├── mergeConcepts()          Cluster semantically equivalent concepts; pick canonical form
  │
  ├── pruneConcepts()          Remove low-value nodes
  │
  └── storeGraph / storeVisualization / storeEmbeddings

Each chunk is processed as a BaseGraph object (the subgraph). After consolidation the full document is a DocGraph containing the consolidated graph and all subgraphs.
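
The flag-gated sequencing can be sketched as a loop over (flag, stage) pairs. This is an illustration of the pattern, not the code in kg.py: `run_stages` is a hypothetical helper and the lambdas stand in for the real stage methods.

```python
def run_stages(state, pipeline, stages):
    """Run each stage in order, skipping those whose flag is missing or falsy."""
    executed = []
    for flag, stage in stages:
        if pipeline.get(flag, False):
            state = stage(state)
            executed.append(flag)
    return state, executed


# Placeholder stages that just record what ran
stages = [
    ("processSubGraphs", lambda s: s + ["extracted"]),
    ("consolidateSubGraphs", lambda s: s + ["consolidated"]),
    ("mergeConcepts", lambda s: s + ["merged"]),
    ("pruneConcepts", lambda s: s + ["pruned"]),
]

pipeline = {
    "processSubGraphs": True,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
}

state, executed = run_stages([], pipeline, stages)
print(executed)  # ['processSubGraphs', 'consolidateSubGraphs', 'mergeConcepts']
```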

Signature Registry System

The SignatureRegistry (in simplekg/signatures/signature_registry.py) is an abstract base class that defines the interface every domain must implement. Each method returns a DSPy Signature class used by the pipeline.

from abc import ABC

class SignatureRegistry(ABC):
    # Interface outline only: bodies are elided with "..."
    def get_harvest_terms_signature(self): ...                 # text chunk → term candidates
    def get_candidates_to_concepts_signature(self): ...        # candidates → Concept objects
    def get_harvest_definitions_signature(self): ...           # identify definition concepts
    def get_harvest_relation_candidates_signature(self): ...   # two-step: raw candidates
    def get_candidates_to_relations_signature(self): ...       # two-step: structured relations
    def get_harvest_relation_signature(self): ...              # one-step relation extraction
    def get_propose_concept_merges_signature(self): ...        # merge proposals across subgraphs
    def get_resolve_orphan_signature(self): ...                # connect isolated concepts (default impl.)
    def get_ontology(self): ...                                # returns None by default
    def get_predicate_mapping_signature(self): ...             # maps free predicates to ontology (default impl.)

Creating a new domain requires subclassing SignatureRegistry and implementing the abstract methods with domain-specific DSPy signatures. The pipeline discovers the registry class by convention (class name ending in SignatureRegistry) via importlib.
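
Discovery by naming convention can be sketched with importlib and inspect. The code below is an assumption about how such a lookup might work, not the library's implementation: `find_registry_class` and `load_registry` are hypothetical helpers, and the demo builds a throwaway module in memory instead of importing a real signature module.

```python
import importlib
import inspect
import types


def find_registry_class(module):
    """Return the class defined in `module` whose name ends in 'SignatureRegistry'."""
    for name, obj in inspect.getmembers(module, inspect.isclass):
        is_candidate = name.endswith("SignatureRegistry") and name != "SignatureRegistry"
        # Skip classes merely imported into the module (such as the base class)
        if is_candidate and obj.__module__ == module.__name__:
            return obj
    raise LookupError(f"No *SignatureRegistry class found in {module.__name__}")


def load_registry(module_path):
    """Import a module by dotted path and instantiate its registry class."""
    return find_registry_class(importlib.import_module(module_path))()


# Demo with an in-memory module standing in for a real signature module
demo = types.ModuleType("signatures_demo")
demo.DemoSignatureRegistry = type(
    "DemoSignatureRegistry", (), {"__module__": "signatures_demo"}
)
print(find_registry_class(demo).__name__)  # DemoSignatureRegistry
```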

Ontology Normalization

By default, the LLM extracts relations using free-form predicates — maximizing recall but producing varied vocabulary that makes cross-text comparison unreliable. Ontology normalization is a post-extraction step that maps every extracted predicate and entity type to a fixed, domain-specific controlled vocabulary.
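
A toy example shows why free-form predicates hurt comparison. The sketch assumes Jaccard overlap of (subject, predicate, object) triples as the comparison measure, which is an illustrative choice rather than SimpleKG's method, and it uses a small alias dictionary as a stand-in for the mapping step.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Stand-in for predicate normalization: free-form predicate -> controlled predicate
ALIASES = {"governed": "ruledOver", "controlled": "ruledOver"}


def normalize(triples):
    """Rewrite each triple's predicate through the alias table."""
    return {(s, ALIASES.get(p, p), o) for s, p, o in triples}


text_a = {("Caesar", "governed", "Rome")}
text_b = {("Caesar", "controlled", "Rome")}

print(jaccard(text_a, text_b))                        # 0.0
print(jaccard(normalize(text_a), normalize(text_b)))  # 1.0
```

The two triples express the same relation, but they only register as a match once both predicates are mapped to the same controlled term.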

Enabling normalization:

pipeline = {
    ...
    "normalizeOntology": True,
}

Normalization runs only when the pipeline flag is True and the signature registry returns an ontology from get_ontology(). Registries that do not define an ontology are unaffected, even if the flag is set.

How it works:

Entity type enforcement (deterministic, no LLM call): After concept extraction, each concept.entity_type is validated against the ontology's entity type list. Invalid types are silently replaced with the generic fallback (e.g., "Entity"). The original LLM output is preserved in concept.entity_type_raw.

Predicate normalization (pre-filter + LLM + post-validate): Relations are grouped by (subject_type, object_type) pair. For each group:

- 0 candidates after domain/range filtering → assign generic_predicate directly (no LLM call)
- 1 candidate → assign directly (no LLM call)
- N candidates → single batched LLM call mapping all relations in the group

The original predicate is preserved in relation.predicate_raw.
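
The two deterministic parts of this process, entity type enforcement and the domain/range pre-filter, can be sketched without any LLM call. The names below (`Predicate`, `enforce_entity_type`, `decide`) are hypothetical and simplified; `bornIn` is an invented predicate added only so that one type pair yields more than one candidate.

```python
from dataclasses import dataclass


@dataclass
class Predicate:
    id: str
    domain: list  # allowed subject entity types
    range: list   # allowed object entity types


def enforce_entity_type(raw_type, valid_types, fallback="Entity"):
    """Return (entity_type, entity_type_raw): keep valid types, else use the fallback."""
    return (raw_type if raw_type in valid_types else fallback, raw_type)


def decide(subject_type, object_type, predicates, generic="relatedTo"):
    """Apply the 0 / 1 / N candidate rule for one (subject_type, object_type) pair."""
    candidates = [
        p.id for p in predicates
        if subject_type in p.domain and object_type in p.range
    ]
    if not candidates:
        return ("assign", generic)         # no LLM call
    if len(candidates) == 1:
        return ("assign", candidates[0])   # no LLM call
    return ("ask_llm", candidates)         # one batched LLM call for the group


predicates = [
    Predicate("ruledOver", ["Person", "Polity"], ["Place", "Polity"]),
    Predicate("bornIn", ["Person"], ["Place"]),  # hypothetical predicate
]

print(enforce_entity_type("Deity", {"Person", "Place", "Polity"}))
print(decide("Work", "Person", predicates))
print(decide("Polity", "Polity", predicates))
print(decide("Person", "Place", predicates))
```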

Defining a domain ontology:

from simplekg.ontologies.base import Ontology, OntologyEntityType, OntologyPredicate

class MyOntology(Ontology):
    def __init__(self):
        super().__init__(
            id="my-domain-v1",
            description="Ontology for my domain.",
            generic_entity_type="Entity",
            generic_predicate="relatedTo",
            entity_types=[
                OntologyEntityType(
                    id="Person",
                    description="An individual.",
                    examples=["Aristotle", "Plato"],
                    aliases=["Individual", "Author"],
                ),
                # ... more types
            ],
            predicates=[
                OntologyPredicate(
                    id="ruledOver",
                    label="ruled over",
                    description="A person or polity exercised political authority over a place.",
                    domain=["Person", "Polity"],
                    range=["Place", "Polity"],
                    examples=["Caesar ruledOver Rome"],
                    aliases=["governed", "controlled", "founded", "established", "led"],
                ),
                # ... more predicates
            ],
        )

Inject it via the registry:

class MySignatureRegistry(SignatureRegistry):
    def get_ontology(self):
        return MyOntology()
    # ... implement other abstract methods

AncientGreekOntology (simplekg/ontologies/ancient_greek.py) is the reference implementation, defining 11 entity types (Person, Place, Polity, Role, Event, TimePeriod, Work, Abstraction, Ethnonym, Practice, Artifact) and 28 predicates with full domain/range constraints and alias lists. See OntologyBasedKGImplementation.md for the full design rationale.

Data Models

Core objects are Pydantic models defined in simplekg/models.py.

Concept — an extracted entity:

class Concept(BaseModel):
    id: str                          # unique identifier within the graph
    prefLabel: str                   # canonical label in source language
    prefLabel_en: str                # English translation
    altLabels: List[str]             # alternative surface forms
    conceptDescription: str          # description in source language
    conceptDescription_en: str       # English description
    entity_type: str                 # ontology-enforced type (e.g., "Person")
    entity_type_raw: Optional[str]   # original LLM output before enforcement
    concept_position: int            # character offset in source text
    definition: bool                 # True if this concept is a definition

Relation — an extracted relationship:

class Relation(BaseModel):
    subject_id: str                  # id of the subject Concept
    predicate: str                   # ontology-normalized predicate
    predicate_raw: Optional[str]     # original LLM predicate before mapping
    object_id: str                   # id of the object Concept
    evidence_text: Optional[str]     # text span supporting this relation

BaseGraph — a single chunk's subgraph:

class BaseGraph:
    gid: int
    text: str
    concepts: List[Concept]
    relation_objects: List[Relation]

DocGraph — the full document:

class DocGraph:
    consolidated_graph: BaseGraph
    subgraphs: List[BaseGraph]
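
The relationship between subgraphs and the consolidated graph can be illustrated with plain dataclasses standing in for the Pydantic models. This is a sketch under stated assumptions: `MiniConcept`, `MiniRelation`, and `consolidate` are hypothetical, and prefixing ids with the subgraph's gid is just one possible remapping scheme, not necessarily the one SimpleKG uses.

```python
from dataclasses import dataclass


@dataclass
class MiniConcept:
    id: str
    prefLabel: str


@dataclass
class MiniRelation:
    subject_id: str
    predicate: str
    object_id: str


def consolidate(subgraphs):
    """Merge (gid, concepts, relations) tuples, remapping ids so they stay unique."""
    concepts, relations = [], []
    for gid, sub_concepts, sub_relations in subgraphs:
        # Chunk-local ids may collide across chunks, so qualify them with the gid
        remap = {c.id: f"g{gid}_{c.id}" for c in sub_concepts}
        concepts += [MiniConcept(remap[c.id], c.prefLabel) for c in sub_concepts]
        relations += [
            MiniRelation(remap[r.subject_id], r.predicate, remap[r.object_id])
            for r in sub_relations
        ]
    return concepts, relations


sub0 = (0, [MiniConcept("c1", "Caesar"), MiniConcept("c2", "Rome")],
        [MiniRelation("c1", "ruledOver", "c2")])
sub1 = (1, [MiniConcept("c1", "Plato"), MiniConcept("c2", "Aristotle")],
        [MiniRelation("c1", "relatedTo", "c2")])

all_concepts, all_relations = consolidate([sub0, sub1])
print([c.id for c in all_concepts])
```

Merging equivalent concepts across chunks is a separate, later stage; this sketch only keeps ids from colliding.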

Project Structure

simplekg/
  kg.py                     NKGGenerator — main pipeline class
  models.py                 Pydantic data models (Concept, Relation, BaseGraph, DocGraph)
  signatures/
    signature_registry.py   Abstract base class for all signature registries
    signatures_rabbinic.py  Rabbinic Hebrew signatures
    signatures_ramban.py    Ramban commentary signatures
    signatures_legal.py     Legal document signatures
    signatures_wa.py        Work agreement signatures
    signatures_ancient_greek.py  Ancient Greek signatures
  ontologies/
    base.py                 OntologyEntityType, OntologyPredicate, Ontology base classes
    ancient_greek.py        AncientGreekOntology (reference implementation)
  utilities/
    utils.py                kg2ACT conversion, text utilities
    networkXutils.py        NetworkX graph construction and visualization
    ElasticUtils.py         Elasticsearch storage and retrieval

tests/                      Application and research notebooks, plus ES test scripts
test_sys/                   System and pipeline regression tests
kg_gen.py                   Command-line pipeline runner

Citation

If you use SimpleKG in your research, please cite:

@software{simplekg,
  author  = {Hadar Miller},
  title   = {SimpleKG: Knowledge Graph Generation for Humanities Research},
  url     = {https://gitlab.com/millerhadar/simplekg},
  version = {0.1.3},
  year    = {2025}
}

Release history

- 0.1.3: Apr 2, 2026 (this version)
- 0.1.2: Aug 20, 2025
- 0.1.0: Aug 19, 2025

Download files

Source Distribution

simplekg-0.1.3.tar.gz (280.8 kB)

Uploaded Apr 2, 2026 Source

Built Distribution

simplekg-0.1.3-py3-none-any.whl (157.6 kB)

Uploaded Apr 2, 2026 Python 3

File details

Details for the file simplekg-0.1.3.tar.gz.

File metadata

Download URL: simplekg-0.1.3.tar.gz
Upload date: Apr 2, 2026
Size: 280.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for simplekg-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`f7a60da9b630d65dddfaac537a358d78b99de3e3eb16149f42195b351a926f4a`
MD5	`addcafd7e810d1add16cd2ba59b0c18c`
BLAKE2b-256	`b5688d921d68dc09364e46370745295823ace4e451fd475d488a1e44bff2e060`

File details

Details for the file simplekg-0.1.3-py3-none-any.whl.

File metadata

Download URL: simplekg-0.1.3-py3-none-any.whl
Upload date: Apr 2, 2026
Size: 157.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for simplekg-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bb1164f58e120febbe9015c50c81bacfe2476805407d84794532d6adfa9e7d03`
MD5	`069513a74f2dbfe9217e6cbe89607359`
BLAKE2b-256	`d826a3df5c8f4bd26b6a34a0fbb5b23908449d276f979566fe1e3b2b5f8c74b7`
