A complete workflow for generating, normalizing, and visualizing Knowledge Graphs from unstructured Hebrew text

SimpleKG

Requires Python 3.9+. License: MIT.

SimpleKG is a Python library for generating Knowledge Graphs (KGs) from unstructured text using LLMs. It is designed for humanities and digital scholarship research, particularly for multilingual, domain-specific corpora such as rabbinic Hebrew, ancient Greek patristic literature, and legal documents. The primary use case is cross-text comparison: extracting KGs from multiple source texts under a shared ontology, then comparing their structure to detect text reuse, semantic proximity, or shared tradition.


Table of Contents

  1. Installation
  2. Environment Setup
  3. Command Line Usage
  4. Python API — Basic Usage
  5. Post-KG: ACT Format and Visualization
  6. Pipeline Configuration Reference
  7. Signature Modules (Domains)
  8. Implementation

Installation

# From PyPI
pip install simplekg

# From source (recommended for development)
git clone https://gitlab.com/millerhadar/simplekg.git
cd simplekg
uv sync

# With optional stanza NLP support
uv sync --extra stanza

Environment Setup

Create a .env file in your project root:

# LLM
OPENAI_API_KEY=sk-...

# Elasticsearch (optional — only needed for ACT graph storage)
ELASTIC_HOST=https://your-es-host
ELASTIC_USER=your-user
ELASTIC_PASSWORD=your-password

Load it in your script or notebook:

from dotenv import load_dotenv
load_dotenv(".env")
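
A missing key tends to surface only when the first LLM call fails, so it can help to check the environment right after loading it. The sketch below is not part of SimpleKG: missing_env_vars is a hypothetical helper, and the variable names are simply the ones listed in the .env example above.

```python
import os

# Names taken from the .env example above; the Elasticsearch ones are optional.
REQUIRED_VARS = ["OPENAI_API_KEY"]
OPTIONAL_ES_VARS = ["ELASTIC_HOST", "ELASTIC_USER", "ELASTIC_PASSWORD"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling missing_env_vars(REQUIRED_VARS) after load_dotenv(".env") returns an empty list when the key is present; a non-empty result names what still needs to be set.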

Command Line Usage

kg_gen.py is the main script for batch pipeline execution from the command line.

# Basic usage
uv run kg_gen.py -f <file_name> -s <signature_module>

# Examples
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic -c 1500
uv run kg_gen.py -f IbnShuib19_2 -s signatures_ramban -c -1 -x Ramban19_2
uv run kg_gen.py -f Greek_tlg0526_tlg004_6_201_210 -s signatures_ancient_greek -c 1500 &

Arguments:

Flag  Description
-f    Text file name (without .txt extension), looked up in the configured source path
-s    Signature module (domain): signatures_rabbinic, signatures_ramban, signatures_legal, signatures_wa, signatures_ancient_greek
-c    Chunk size in characters (default: 1100). Use -1 to chunk by sentences
-x    Optional context file name; provides background document context to the LLM
-p    Enable text preprocessing step before extraction
-r    Override source path for the input file
-o    Override output directory
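
To make the flag semantics concrete, here is a minimal argparse sketch that mirrors the table above. It is an illustration only and not the actual parser in kg_gen.py: the dest names are hypothetical, and treating -f and -s as required is an assumption based on the basic usage line.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative, not kg_gen.py itself)."""
    parser = argparse.ArgumentParser(prog="kg_gen.py")
    parser.add_argument("-f", dest="file_name", required=True,
                        help="text file name without the .txt extension")
    parser.add_argument("-s", dest="signature_module", required=True,
                        help="signature module (domain)")
    parser.add_argument("-c", dest="chunk_size", type=int, default=1100,
                        help="chunk size in characters; -1 chunks by sentences")
    parser.add_argument("-x", dest="context_file", default=None,
                        help="optional context file name")
    parser.add_argument("-p", dest="preprocess", action="store_true",
                        help="enable text preprocessing")
    parser.add_argument("-r", dest="source_path", default=None,
                        help="override source path")
    parser.add_argument("-o", dest="output_dir", default=None,
                        help="override output directory")
    return parser


# Same arguments as the third example above.
args = build_parser().parse_args(
    ["-f", "IbnShuib19_2", "-s", "signatures_ramban", "-c", "-1", "-x", "Ramban19_2"]
)
chunk_by_sentences = args.chunk_size == -1
```

Because no option in this parser looks like a negative number, argparse reads the -1 after -c as a value rather than as a flag.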

Output is written to a structured directory under base_output_path:

kg_d<file_name>/
  C0_O0/singleStepRelations/
    final_knowledge_graph.json
    final_knowledge_graph_visualization.html
    step_1_processed_subgraphs.json
    ...
  logs/
    nkg_pipeline.log
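
For downstream scripts, the layout above can be turned into a small path helper. This is a sketch under assumptions: final_graph_path and load_final_graph are hypothetical names, the kg_d prefix is taken literally from the listing, and the C0_O0 and singleStepRelations directory names are treated as parameters because they presumably vary with the run configuration.

```python
import json
from pathlib import Path


def final_graph_path(base_output_path, file_name,
                     config_dir="C0_O0", relations_dir="singleStepRelations"):
    """Build the path to final_knowledge_graph.json following the listed layout.

    config_dir and relations_dir default to the names shown in the listing;
    they are assumed to depend on the run configuration.
    """
    return (Path(base_output_path) / f"kg_d{file_name}" / config_dir
            / relations_dir / "final_knowledge_graph.json")


def load_final_graph(base_output_path, file_name):
    """Return the parsed JSON graph, or None if no output file exists."""
    path = final_graph_path(base_output_path, file_name)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```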

Python API — Basic Usage

Rabbinic / Hebrew Text

import os
from dotenv import load_dotenv
from simplekg import NKGGenerator

load_dotenv(".env")

generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
)

text = """מֵאֵימָתַי קוֹרִין אֶת שְׁמַע בְּעַרְבִית. מִשָּׁעָה שֶׁהַכֹּהֲנִים נִכְנָסִים
לֶאֱכֹל בִּתְרוּמָתָן, עַד סוֹף הָאַשְׁמוּרָה הָרִאשׁוֹנָה, דִּבְרֵי רַבִּי אֱלִיעֶזֶר."""

pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_output/",
}

generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,           # 0 = no chunking, process as one unit
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)

Ancient Greek with Ontology Normalization

Ancient Greek texts require a text normalization function so that LLM output can be matched back to the source (removing diacritics, normalizing sigma variants, etc.). Ontology normalization aligns all extracted predicates and entity types to the AncientGreekOntology, which makes cross-text comparison meaningful.

import os
import re
import unicodedata

import regex  # third-party package (pip install regex); needed for \p{Script=Greek}
from dotenv import load_dotenv
from simplekg import NKGGenerator

load_dotenv(".env")

def normalize_greek(text, filter_non_greek=True):
    """Strip diacritics and normalize Greek letter variants."""
    custom_mapping = {
        '\u03c2': 'σ',  # final sigma → regular sigma
        '\u03f2': 'σ',  # lunate sigma → regular sigma
    }
    # Decompose so diacritics become separate combining marks, then drop them
    normalized = unicodedata.normalize('NFD', text)
    text = re.sub(r'[\u0300-\u036F]', '', normalized)
    # Rejoin Greek words hyphenated across a line break
    text = regex.sub(r'(\p{Script=Greek})[-—]\s+(\p{Script=Greek})', r'\1\2', text)
    if filter_non_greek:
        text = re.sub('[^\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F ]+', '', text)
    # Lowercase before mapping: str.lower() can itself produce a final sigma (ς)
    text = text.lower()
    for char, unified in custom_mapping.items():
        text = text.replace(char, unified)
    return text


generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_ancient_greek",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
    normalize_text_for_matching=normalize_greek,   # domain-specific normalization
)

text = ("ωστε συναγεσθαι απο πρωτου ετουσ κυρου και περσων βασιλειασ επι το τελοσ "
        "τησ των μακκαβαιων γραφησ και επι την σιμωνοσ του αρχιερεωσ τελευτην ετη "
        "τετρακοσια εικοσιπεντε")

pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "normalizeOntology": True,        # map predicates + entity types to AncientGreekOntology
    "twoStepRelations": False,
    "forceOrphanRelation": True,      # attempt to connect isolated concepts
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_greek/",
}

generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)
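
The matching idea behind the normalization function can be seen in isolation with the standard library alone. The sketch below is a simplified stand-in, not the function above: fold_greek is a hypothetical name, it skips the hyphen rejoining and the non-Greek filter, and it lowercases before unifying sigma variants because Python's str.lower() can itself produce a final sigma.

```python
import unicodedata


def fold_greek(text):
    """Strip combining diacritics, lowercase, and unify sigma variants."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    # Final sigma (U+03C2) and lunate sigma (U+03F2) both become regular sigma.
    return lowered.replace("\u03c2", "\u03c3").replace("\u03f2", "\u03c3")


# A polytonic source passage and a diacritic-free span as an LLM might return it.
source = "Ὥστε συνάγεσθαι ἀπὸ πρώτου ἔτους Κύρου"
llm_span = "απο πρωτου ετουσ"

span_found = fold_greek(llm_span) in fold_greek(source)
```

With both sides folded the same way, a plain substring test is enough to locate the span in the source, which is the role the normalize_text_for_matching argument plays in the example above.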

Post-KG: ACT Format and Visualization

After pipeline execution, the graph can be converted to ACT (Annotated Concept Tree) format, a JSON structure suitable for graph databases and network analysis, and then visualized as an interactive HTML graph.

import json
from simplekg.utilities import utils, networkXutils

nxu = networkXutils.NXUtils()

# Convert subgraphs to ACT format
ret = utils.kg2ACT(
    generator.graph.subgraphs,
    location="my_document_id",
    categories=["private", "my_project"],
    optin_entity_type=[],               # empty = include all entity types
    clean_overlapping=False,
    normalizers={},
    additional_attrs=["prefLabel_en"],
    additional_edge_attrs=["predicate", "evidence_text"],
)

# Optionally store ACT graph as JSON
with open("/tmp/kg_greek/final_knowledge_graph_ACT.json", "w") as f:
    json.dump(ret, f, indent=4, ensure_ascii=False)

# Convert to NetworkX graph for visualization
actnx = nxu.graphACT2nx(
    ret,
    title_node_attrs=["conceptDescription_en"],
    node_label_attr="prefLabel_en",
    edge_label="weight",
    edge_hover=["predicate"],
)

# Render interactive HTML visualization
nxu.visualize_graph(
    actnx,
    output_file="/tmp/kg_greek/visualization.html",
    open_browser=True,
    show_legend=True,
)

The resulting HTML file contains a fully interactive graph (powered by pyvis) with hover tooltips, legend, and drag-and-drop layout.


Pipeline Configuration Reference

All pipeline flags are passed as a dictionary to execute_pipeline(). Missing flags fall back to defaults.

Flag                  Type  Default  Description
preProcessText        bool  False    Run domain-specific text preprocessing before extraction
processSubGraphs      bool  True     Extract concepts and relations from each chunk
tagDefinitions        bool  False    Mark definition concepts (e.g., "X is defined as...")
twoStepRelations      bool  False    Two-step relation extraction: candidates first, then structured relations
forceOrphanRelation   bool  False    Attempt to attach isolated concepts (no relations) to the graph
normalizeOntology     bool  False    Enforce domain ontology on entity types and predicates (see below)
consolidateSubGraphs  bool  True     Merge per-chunk subgraphs into one consolidated graph
mergeConcepts         bool  True     Cluster and merge semantically equivalent concepts
pruneConcepts         bool  False    Remove low-confidence or isolated concepts
storeGraph            bool  False    Save final graph as JSON
storeEmbeddings       bool  False    Save concept embeddings alongside the graph
storeVisualization    bool  False    Render and save an HTML visualization
storeGraphSteps       bool  False    Save intermediate pipeline stages as JSON snapshots
outputPath            str   None     Base output directory; subdirectory structure is auto-created
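
The fallback behaviour can be pictured as a dictionary merge over the defaults in the table. This is a sketch and not SimpleKG's own code: DEFAULT_PIPELINE and resolve_pipeline are hypothetical names, and rejecting unknown keys is a choice made here to catch typos, not documented library behaviour.

```python
# Defaults copied from the table above.
DEFAULT_PIPELINE = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "forceOrphanRelation": False,
    "normalizeOntology": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": False,
    "storeEmbeddings": False,
    "storeVisualization": False,
    "storeGraphSteps": False,
    "outputPath": None,
}


def resolve_pipeline(user_flags):
    """Overlay user-supplied flags on the defaults; reject misspelled flag names."""
    unknown = sorted(set(user_flags) - set(DEFAULT_PIPELINE))
    if unknown:
        raise KeyError(f"Unknown pipeline flags: {unknown}")
    return {**DEFAULT_PIPELINE, **user_flags}


resolved = resolve_pipeline({"storeGraph": True, "outputPath": "/tmp/kg_output/"})
```

Here resolved keeps mergeConcepts at its default of True while storeGraph and outputPath take the supplied values.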

Chunking parameters (passed directly to execute_pipeline, not inside the pipeline dict):

Parameter                Default  Description
chunk_size               0        Characters per chunk. 0 = no chunking. -1 = chunk by sentences
chunk_by_sentences       False    Automatically set to True when chunk_size=-1
chunk_overlap_sentences  2        Sentence overlap between adjacent chunks (for context continuity)
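
The two chunking modes can be sketched as follows. This illustrates the general technique and is not the library's implementation: all three function names are hypothetical, the sentence splitter is deliberately naive, and the size argument (sentences per chunk) is an assumption, since the table does not say how many sentences a sentence-based chunk holds.

```python
import re


def chunk_by_chars(text, chunk_size):
    """Fixed-size character chunks; chunk_size=0 returns the text as a single unit."""
    if chunk_size == 0:
        return [text]
    if chunk_size < 0:
        raise ValueError("use sentence_windows() for sentence-based chunking")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_sentences(text):
    """Naive splitter: break after '.', '!', '?' or Hebrew sof pasuq plus whitespace."""
    parts = re.split(r"(?<=[.!?\u05C3])\s+", text.strip())
    return [part for part in parts if part]


def sentence_windows(sentences, size, overlap):
    """Windows of `size` sentences in which adjacent windows share `overlap` sentences."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last window already reaches the end of the text
    return windows
```

With five sentences, size=3 and overlap=1, the windows cover sentences 1 to 3 and 3 to 5, so the shared sentence carries context across the chunk boundary.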

Signature Modules (Domains)

Each domain has a dedicated SignatureRegistry under simplekg/signatures/ that encapsulates the DSPy extraction prompts (signatures) for that domain's language and conventions.

Module                    Class                            Domain
signatures_rabbinic       RabbinicSignatureRegistry        Mishnah, Talmud, and rabbinic Hebrew literature
signatures_ramban         RambanSignatureRegistry          Ramban biblical commentary (medieval Hebrew)
signatures_legal          LegalSignatureRegistry           Hebrew legal documents and work agreements
signatures_wa             WorkAgreementsSignatureRegistry  Structured work agreement analysis
signatures_ancient_greek  AncientGreekSignatureRegistry    Ancient Greek historical, patristic, and classical texts

Pass the module name as a string to NKGGenerator:

generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
)

Implementation

Pipeline Architecture

NKGGenerator.execute_pipeline() orchestrates a sequential set of stages. Each stage is independently gated by a flag in the pipeline dictionary and can be disabled without affecting the others.

execute_pipeline(text)
  │
  ├── init_graph()             Split text into chunks → DocGraph with subgraphs
  │
  ├── processSubGraphs()       Per chunk (parallel):
  │     ├── _extractChunkConcepts()     LLM: terms → concepts
  │     ├── _enforce_entity_types()     [if normalizeOntology] validate against ontology
  │     ├── _extractChunkRelations()    LLM: concepts → relations
  │     ├── _normalize_predicates()     [if normalizeOntology] map predicates to ontology
  │     └── _forceOrphanRelation()      [if forceOrphanRelation] connect isolated concepts
  │
  ├── consolidateSubGraphs()   Merge all subgraph concepts + relations with id remapping
  │
  ├── mergeConcepts()          Cluster semantically equivalent concepts; pick canonical form
  │
  ├── pruneConcepts()          Remove low-value nodes
  │
  └── storeGraph / storeVisualization / storeEmbeddings

Each chunk is processed as a BaseGraph object (the subgraph). After consolidation the full document is a DocGraph containing the consolidated graph and all subgraphs.
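
Consolidation needs id remapping because two chunks can each assign the same local id to different concepts. The sketch below shows that step only. It uses small dataclass stand-ins, not the real models, and the "g<gid>:<id>" naming scheme is a hypothetical choice, not the scheme SimpleKG uses.

```python
from dataclasses import dataclass, field


@dataclass
class MiniConcept:
    id: str
    label: str


@dataclass
class MiniRelation:
    subject_id: str
    predicate: str
    object_id: str


@dataclass
class MiniGraph:
    gid: int
    concepts: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def consolidate(subgraphs):
    """Merge subgraphs into one graph, prefixing ids so they stay unique."""
    merged = MiniGraph(gid=-1)
    for sub in subgraphs:
        # Map each chunk-local id to a document-wide id.
        remap = {c.id: f"g{sub.gid}:{c.id}" for c in sub.concepts}
        merged.concepts.extend(MiniConcept(remap[c.id], c.label) for c in sub.concepts)
        merged.relations.extend(
            MiniRelation(remap[r.subject_id], r.predicate, remap[r.object_id])
            for r in sub.relations
        )
    return merged


# Two chunks that both use the local ids "c1" and "c2".
first = MiniGraph(0, [MiniConcept("c1", "A"), MiniConcept("c2", "B")],
                  [MiniRelation("c1", "relatedTo", "c2")])
second = MiniGraph(1, [MiniConcept("c1", "C"), MiniConcept("c2", "A")],
                   [MiniRelation("c2", "relatedTo", "c1")])
merged = consolidate([first, second])
```

After this step the label "A" still appears under two ids (g0:c1 and g1:c2); collapsing such duplicates is the job of the mergeConcepts stage that follows.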

Signature Registry System

The SignatureRegistry (in simplekg/signatures/signature_registry.py) is an abstract base class that defines the interface every domain must implement. Each method returns a DSPy Signature class used by the pipeline.

from abc import ABC

class SignatureRegistry(ABC):
    # Interface outline only; method bodies are omitted here.
    def get_harvest_terms_signature(self): ...                # text chunk → term candidates
    def get_candidates_to_concepts_signature(self): ...       # candidates → Concept objects
    def get_harvest_definitions_signature(self): ...          # identify definition concepts
    def get_harvest_relation_candidates_signature(self): ...  # two-step: raw candidates
    def get_candidates_to_relations_signature(self): ...      # two-step: structured relations
    def get_harvest_relation_signature(self): ...             # one-step relation extraction
    def get_propose_concept_merges_signature(self): ...       # merge proposals across subgraphs
    def get_resolve_orphan_signature(self): ...               # connect isolated concepts (default impl.)
    def get_ontology(self): ...                               # returns None by default
    def get_predicate_mapping_signature(self): ...            # maps free predicates to ontology (default impl.)

Creating a new domain requires subclassing SignatureRegistry and implementing the abstract methods with domain-specific DSPy signatures. The pipeline discovers the registry class by convention (class name ending in SignatureRegistry) via importlib.
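
Discovery by naming convention can be sketched with importlib and inspect. This is an illustration of the technique and not the pipeline's actual lookup code: find_registry_class is a hypothetical name, the SignatureRegistry class here is an empty stand-in, and the in-memory signatures_demo module exists only to keep the example self-contained.

```python
import importlib
import inspect
import sys
import types
from abc import ABC


class SignatureRegistry(ABC):
    """Empty stand-in for the real abstract base class."""


def find_registry_class(module_name, base=SignatureRegistry):
    """Import a module and return its class whose name ends in 'SignatureRegistry'."""
    module = importlib.import_module(module_name)
    for name, obj in inspect.getmembers(module, inspect.isclass):
        if (name.endswith("SignatureRegistry")
                and obj is not base
                and issubclass(obj, base)):
            return obj
    raise LookupError(f"No SignatureRegistry subclass found in {module_name!r}")


# Demo: register a throwaway module so import_module can find it.
class DemoSignatureRegistry(SignatureRegistry):
    pass


demo_module = types.ModuleType("signatures_demo")
demo_module.SignatureRegistry = SignatureRegistry      # the base class is skipped
demo_module.DemoSignatureRegistry = DemoSignatureRegistry
sys.modules["signatures_demo"] = demo_module

registry_cls = find_registry_class("signatures_demo")
```

Excluding the base class matters because a domain module typically imports SignatureRegistry, whose own name also matches the suffix.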

Ontology Normalization

By default, the LLM extracts relations using free-form predicates. This maximizes recall but produces a varied vocabulary that makes cross-text comparison unreliable. Ontology normalization is a post-extraction step that maps every extracted predicate and entity type to a fixed, domain-specific controlled vocabulary.

Enabling normalization:

pipeline = {
    # ... other flags ...
    "normalizeOntology": True,
}

Normalization is active only when the pipeline flag is True and the signature registry provides an ontology via get_ontology(). Registries without an ontology are unaffected.

How it works:

Entity type enforcement (deterministic, no LLM call): After concept extraction, each concept.entity_type is validated against the ontology's entity type list. Invalid types are silently replaced with the generic fallback (e.g., "Entity"). The original LLM output is preserved in concept.entity_type_raw.

Predicate normalization (pre-filter + LLM + post-validate): Relations are grouped by (subject_type, object_type) pair. For each group:

  • 0 candidates after domain/range filtering → assign generic_predicate directly (no LLM call)
  • 1 candidate → assign directly (no LLM call)
  • N candidates → single batched LLM call mapping all relations in the group

The original predicate is preserved in relation.predicate_raw.
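
The two steps above can be sketched with plain dictionaries. This is a simplified illustration and not the library's code: the function names are hypothetical, choose_batch stands in for the batched LLM call, and falling back to the generic predicate when post-validation fails is an assumption.

```python
from collections import defaultdict


def enforce_entity_type(concept, allowed_types, generic_type="Entity"):
    """Keep the raw LLM type, then fall back to the generic type if it is not allowed."""
    concept["entity_type_raw"] = concept["entity_type"]
    if concept["entity_type"] not in allowed_types:
        concept["entity_type"] = generic_type
    return concept


def candidate_predicates(predicates, subject_type, object_type):
    """Pre-filter: predicate ids whose domain and range admit this type pair."""
    return [
        p["id"] for p in predicates
        if subject_type in p["domain"] and object_type in p["range"]
    ]


def normalize_predicates(relations, predicates, generic_predicate, choose_batch):
    """Map free-form predicates to ontology ids, one batched call per ambiguous group.

    choose_batch(group, candidates) stands in for the LLM call and must return
    one predicate id per relation in the group.
    """
    groups = defaultdict(list)
    for rel in relations:
        groups[(rel["subject_type"], rel["object_type"])].append(rel)

    for (subject_type, object_type), group in groups.items():
        candidates = candidate_predicates(predicates, subject_type, object_type)
        if not candidates:
            chosen = [generic_predicate] * len(group)   # 0 candidates: no LLM call
        elif len(candidates) == 1:
            chosen = [candidates[0]] * len(group)       # 1 candidate: no LLM call
        else:
            chosen = choose_batch(group, candidates)    # N candidates: one batched call
            # Post-validate: ids outside the candidate list fall back (assumption).
            chosen = [c if c in candidates else generic_predicate for c in chosen]
        for rel, predicate in zip(group, chosen):
            rel["predicate_raw"] = rel["predicate"]
            rel["predicate"] = predicate
    return relations
```

Grouping by type pair keeps the number of LLM calls bounded by the number of ambiguous (subject_type, object_type) combinations, not by the number of relations.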

Defining a domain ontology:

from simplekg.ontologies.base import Ontology, OntologyEntityType, OntologyPredicate

class MyOntology(Ontology):
    def __init__(self):
        super().__init__(
            id="my-domain-v1",
            description="Ontology for my domain.",
            generic_entity_type="Entity",
            generic_predicate="relatedTo",
            entity_types=[
                OntologyEntityType(
                    id="Person",
                    description="An individual.",
                    examples=["Aristotle", "Plato"],
                    aliases=["Individual", "Author"],
                ),
                # ... more types
            ],
            predicates=[
                OntologyPredicate(
                    id="ruledOver",
                    label="ruled over",
                    description="A person or polity exercised political authority over a place.",
                    domain=["Person", "Polity"],
                    range=["Place", "Polity"],
                    examples=["Caesar ruledOver Rome"],
                    aliases=["governed", "controlled", "founded", "established", "led"],
                ),
                # ... more predicates
            ],
        )

Inject it via the registry:

class MySignatureRegistry(SignatureRegistry):
    def get_ontology(self):
        return MyOntology()
    # ... implement other abstract methods

AncientGreekOntology (simplekg/ontologies/ancient_greek.py) is the reference implementation, defining 11 entity types (Person, Place, Polity, Role, Event, TimePeriod, Work, Abstraction, Ethnonym, Practice, Artifact) and 28 predicates with full domain/range constraints and alias lists. See OntologyBasedKGImplementation.md for the full design rationale.

Data Models

Core objects are Pydantic models defined in simplekg/models.py.

Concept — an extracted entity:

class Concept(BaseModel):
    id: str                          # unique identifier within the graph
    prefLabel: str                   # canonical label in source language
    prefLabel_en: str                # English translation
    altLabels: List[str]             # alternative surface forms
    conceptDescription: str          # description in source language
    conceptDescription_en: str       # English description
    entity_type: str                 # ontology-enforced type (e.g., "Person")
    entity_type_raw: Optional[str]   # original LLM output before enforcement
    concept_position: int            # character offset in source text
    definition: bool                 # True if this concept is a definition

Relation — an extracted relationship:

class Relation(BaseModel):
    subject_id: str                  # id of the subject Concept
    predicate: str                   # ontology-normalized predicate
    predicate_raw: Optional[str]     # original LLM predicate before mapping
    object_id: str                   # id of the object Concept
    evidence_text: Optional[str]     # text span supporting this relation

BaseGraph — a single chunk's subgraph:

class BaseGraph:
    gid: int
    text: str
    concepts: List[Concept]
    relation_objects: List[Relation]

DocGraph — the full document:

class DocGraph:
    consolidated_graph: BaseGraph
    subgraphs: List[BaseGraph]
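
Because relations refer to concepts by id, reading a graph usually starts by resolving those ids to labels. The sketch below uses dataclass stand-ins that carry only a subset of the fields listed above, so it runs without SimpleKG or Pydantic; readable_triples is a hypothetical helper and the sample graph is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    id: str
    prefLabel_en: str


@dataclass
class Relation:
    subject_id: str
    predicate: str
    object_id: str
    evidence_text: Optional[str] = None


@dataclass
class BaseGraph:
    gid: int
    text: str
    concepts: List[Concept] = field(default_factory=list)
    relation_objects: List[Relation] = field(default_factory=list)


def readable_triples(graph):
    """Resolve subject and object ids to English labels; unknown ids are kept as-is."""
    labels = {c.id: c.prefLabel_en for c in graph.concepts}
    return [
        (labels.get(r.subject_id, r.subject_id), r.predicate,
         labels.get(r.object_id, r.object_id))
        for r in graph.relation_objects
    ]


# Made-up sample data, not pipeline output.
graph = BaseGraph(
    gid=0,
    text="...",
    concepts=[Concept("c1", "Cyrus"), Concept("c2", "Persia")],
    relation_objects=[Relation("c1", "ruledOver", "c2")],
)
triples = readable_triples(graph)
```

The same helper applies unchanged to a DocGraph's consolidated_graph or to any entry in its subgraphs list, since both are BaseGraph objects.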

Project Structure

simplekg/
  kg.py                     NKGGenerator — main pipeline class
  models.py                 Pydantic data models (Concept, Relation, BaseGraph, DocGraph)
  signatures/
    signature_registry.py   Abstract base class for all signature registries
    signatures_rabbinic.py  Rabbinic Hebrew signatures
    signatures_ramban.py    Ramban commentary signatures
    signatures_legal.py     Legal document signatures
    signatures_wa.py        Work agreement signatures
    signatures_ancient_greek.py  Ancient Greek signatures
  ontologies/
    base.py                 OntologyEntityType, OntologyPredicate, Ontology base classes
    ancient_greek.py        AncientGreekOntology (reference implementation)
  utilities/
    utils.py                kg2ACT conversion, text utilities
    networkXutils.py        NetworkX graph construction and visualization
    ElasticUtils.py         Elasticsearch storage and retrieval

tests/                      Applied research notebooks and ES test scripts
test_sys/                   System and pipeline regression tests
kg_gen.py                   Command-line pipeline runner

Citation

If you use SimpleKG in your research, please cite:

@software{simplekg,
  author  = {Hadar Miller},
  title   = {SimpleKG: Knowledge Graph Generation for Humanities Research},
  url     = {https://gitlab.com/millerhadar/simplekg},
  version = {0.1.3},
  year    = {2025}
}
