Knowledge Graph Foundry - data science-informed knowledge graph construction with adaptive ontology evolution, Bayesian entity resolution, and statistical convergence detection
A CLI tool that reads your documents and builds a knowledge graph in Neo4j. Point it at a folder of PDFs, manuals, or data files - it extracts entities, relationships, and specifications, resolves duplicates across documents, and loads a queryable graph. No predefined schema required - the tool discovers the ontology from your data, or you can seed one to guide extraction.
Built as a simpler, CLI-driven alternative to Neo4j LLM Graph Builder.
What It Does
Input: one or more files or directories (PDF, DOCX, TXT, MD, JSON, JSONL, CSV, XLSX)
Output: a Neo4j knowledge graph with entities, relationships, specifications, and provenance - queryable via Cypher
The pipeline:
- Parses documents into text (PDF via pymupdf4llm, DOCX via python-docx)
- Chunks text into overlapping token windows
- Extracts entities and relationships from each chunk using an LLM (Claude, GPT-4, or any litellm-supported model)
- Resolves duplicates - fuzzy name matching, embedding similarity, and Bayesian cross-type deduplication merge entities that refer to the same thing across documents
- Loads the graph into Neo4j with APOC-based merge, vector and fulltext indexes, and provenance linking every entity back to its source chunk and document
The tool handles multi-document corpora where the same entities appear across files. A product mentioned in a datasheet, a user manual, and a brochure gets consolidated into one graph node with merged properties and multiple source references.
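The pipeline stages above can be sketched in a few lines. This is a minimal illustration with invented function names, not the actual KGF API; the real tool chunks by tokens via tiktoken and extracts entities with an LLM, while this sketch chunks by characters and only records provenance:

```python
# Minimal sketch of the ingest flow: parse -> chunk -> extract -> load.
# Hypothetical names; KGF's real modules differ.

def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows (character-based stand-in
    for KGF's token-based chunking)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def run_pipeline(docs: dict[str, str]) -> list[dict]:
    """Walk each document's chunks, keeping provenance per chunk.
    (KGF would call an LLM here to extract entities and relations.)"""
    rows = []
    for path, text in docs.items():
        for idx, chunk in enumerate(chunk_text(text)):
            rows.append({"source": path, "chunk": idx, "text": chunk})
    return rows
```

The overlap means the tail of one chunk reappears at the head of the next, so entities straddling a chunk boundary are still seen whole by at least one extraction call.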
Quick Start
# Install
make install
# Configure (Neo4j connection + LLM provider)
kgf init
# Edit .kgf/config.yml with your Neo4j and LLM credentials
# Ingest unstructured documents - PDF, TXT, MD, DOCX (default, --unstructured is implicit)
kgf ingest data/raw/ --batch --fluid
# Ingest structured data - JSON, JSONL, CSV, XLSX (automatic file type filtering)
kgf ingest data/raw/ --structured --batch --fluid
# Multiple inputs with repeatable --input / -i option
kgf ingest --input data/raw/ --input /other/docs/ --batch --fluid
kgf ingest -i file1.pdf -i file2.pdf -i data/raw/ --batch --fluid
# Query the graph
# Open Neo4j Browser at http://localhost:7474
# MATCH (p:Product)-[:HAS_SPECIFICATION]->(s:Specification) RETURN p.name, s.name, s.value, s.unit
How Ontology Discovery Works
By default, KGF runs in free extraction mode - the LLM discovers entity types from your documents without constraints. As documents are processed, the tool tracks type frequencies, detects convergence, and builds an ontology as a side effect. After processing enough documents for the type distribution to stabilize, the schema cures (freezes) and remaining documents are extracted with type enforcement. If post-cure drift is detected, the system can re-cure - creating a full lifecycle: fluid -> curing -> cured -> drift detection -> recuring -> cured.
As the system processes documents, it also builds adaptive resolution guides - learned heuristics for handling ambiguous type pairs. These guides evolve from accumulated evidence rather than requiring manual rule authoring.
You can also seed an ontology in any format (OWL, YAML, markdown, plain text) to guide extraction from the start. The LLM normalizes whatever format you provide into a canonical schema.
A resolution_intent in the config tells the LLM what the knowledge graph is for - "compare medical devices across manufacturers" or "map software architecture dependencies" - which dramatically improves extraction relevance from the first document.
Configuration
Configuration uses .kgf/config.yml with ${ENV_VAR:default} interpolation from .env:
neo4j:
  uri: ${NEO4J_URI:bolt://localhost:7687}
  user: ${NEO4J_USERNAME:neo4j}
  password: ${NEO4J_PASSWORD:}

llm:
  provider: bedrock                 # bedrock | openai | anthropic
  model: eu.anthropic.claude-sonnet-4-20250514-v1:0
  temperature: 0.0                  # deterministic extraction
  timeout: 120                      # seconds per LLM call

extract:
  chunk_size: 2000                  # tokens per chunk
  chunk_overlap: 200                # overlap between consecutive chunks
  concurrency: 4                    # parallel extraction threads
  use_embeddings: true              # embedding-based entity resolution
  embedding_model: amazon.titan-embed-text-v2:0
  bayesian_resolution: true         # Bayesian type inference
  deferred_dedup: true              # accumulate cross-type evidence across docs

ontology_buffer:
  resolution_intent: "describe your use case here"
  flush_on_complete: true           # write discovered ontology to disk

curing:
  enabled: true                     # fluid -> cured lifecycle
  min_documents: 3                  # docs before curing can trigger
  max_fluid_documents: 20           # force-cure safety net
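The `${ENV_VAR:default}` syntax can be sketched with a small substitution function. This is an illustration of the interpolation rule, not the actual loader in `kg_builder_cli/settings/`:

```python
# Sketch of ${ENV_VAR:default} interpolation: use the environment
# variable if set, otherwise fall back to the inline default.
import os
import re

_PATTERN = re.compile(r"\$\{([A-Z0-9_]+):([^}]*)\}")

def interpolate(value: str) -> str:
    """Replace each ${VAR:default} with os.environ['VAR'] or default."""
    return _PATTERN.sub(lambda m: os.environ.get(m.group(1), m.group(2)), value)
```

With `NEO4J_URI` unset, `interpolate("${NEO4J_URI:bolt://localhost:7687}")` yields the default URI; defaults may contain colons (as in the bolt URL) because only the first colon separates the variable name. An empty default, as in `${NEO4J_PASSWORD:}`, resolves to an empty string.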
Technical Details
Entity Resolution
Entities are resolved across documents through multi-channel evidence fusion that produces a Bayesian posterior probability rather than an ad-hoc weighted score:
- Levenshtein fuzzy matching within the same type (configurable threshold)
- Embedding cosine similarity via FAISS for semantic matching
- Bayesian cross-type deduplication combining name identity prior, description similarity, embedding similarity, and co-occurrence likelihood ratios into a posterior
- Hierarchy-boosted resolution where sibling types under a shared parent get elevated merge priors
- Deferred dedup accumulates positive evidence for ambiguous pairs across documents, resolving at curing time when entity profiles are richer
- LLM escalation only at decision boundaries where statistical signals are inconclusive - the system uses agents selectively, not on every resolution decision
The posterior supports three-zone decision logic: high confidence triggers automatic merge, low confidence triggers automatic block, and the gray zone triggers two-layer adjudication where an LLM reasoning model acts as a conservative veto over the statistical signal.
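The three-zone logic can be sketched as log-odds fusion followed by thresholding. The fusion form and the thresholds below are illustrative assumptions, not KGF's calibrated values:

```python
# Illustrative three-zone decision over a fused posterior.
# Channels contribute likelihood ratios (LR > 1 favors "same entity").
import math

def fuse_log_odds(prior: float, likelihood_ratios: list[float]) -> float:
    """Naive-Bayes-style fusion: posterior log-odds = prior log-odds
    plus the sum of per-channel log likelihood ratios; return probability."""
    log_odds = math.log(prior / (1 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1 / (1 + math.exp(-log_odds))

def decide(posterior: float, merge_at: float = 0.95, block_at: float = 0.20) -> str:
    if posterior >= merge_at:
        return "merge"     # high confidence: automatic merge
    if posterior <= block_at:
        return "block"     # low confidence: automatic block
    return "escalate"      # gray zone: LLM adjudicator with conservative veto
```

For example, a neutral 0.5 prior with name and embedding channels contributing ratios of 4.0 and 3.0 lands at roughly 0.92 - inside the gray zone, so the decision escalates rather than auto-merging.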
Pipeline Lifecycle
The graph has a formal lifecycle governed by a finite state machine tracking ontological maturity. Six states cover the graph's progression from creation to mature knowledge base:
EMPTY -> INITIALIZING -> CURING -> STABLE
                           ^          |
                           |          v
                           +------ RECURING
- Curing - ontology establishment and calibration. In fluid mode, types emerge freely and convergence metrics drive stabilization. In direct mode (strict seed), types are prescribed but Bayesian posteriors calibrate and resolution guides build. First run always enters CURING
- Stable - well-calibrated ontology with type-enforced extraction and drift monitoring. Serves as both the active extraction state and the resting state between runs
- Recuring - drift deliberation when sustained remap rates are detected. The system evaluates whether to revise the ontology (re-enter CURING) or dismiss the drift
The control plane lives in the graph itself as a (:KGFControl) metanode - recovery requires only a config file and a graph connection. The lifecycle is documented in KGF_DESIGN.md Section 14 with the full state machine definition, entry scenario matrix, and ontology conflict rules.
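The FSM can be expressed as an explicit transition table. The sketch below covers only the states named above, with simplified guards (the full definition, including entry scenarios, is in KGF_DESIGN.md Section 14):

```python
# Lifecycle states from the README, with a simplified transition table.
from enum import Enum, auto

class GraphState(Enum):
    EMPTY = auto()
    INITIALIZING = auto()
    CURING = auto()
    STABLE = auto()
    RECURING = auto()

ALLOWED = {
    GraphState.EMPTY: {GraphState.INITIALIZING},
    GraphState.INITIALIZING: {GraphState.CURING},   # first run always cures
    GraphState.CURING: {GraphState.STABLE},
    GraphState.STABLE: {GraphState.RECURING},       # sustained drift detected
    GraphState.RECURING: {GraphState.CURING,        # revise the ontology
                          GraphState.STABLE},       # or dismiss the drift
}

def transition(current: GraphState, target: GraphState) -> GraphState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

In KGF the current state would be read from and written back to the `(:KGFControl)` metanode, which is why recovery needs only the config file and a graph connection.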
Curing and Convergence
The fluid-to-cured transition uses statistical convergence detection:
- Jensen-Shannon divergence between consecutive type distributions
- Shannon entropy delta tracking
- Chao1 species richness estimation for type coverage
- Optional generative curing advisory with graph query tool for ambiguous decisions
- Post-cure drift detection with remap rate monitoring
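The first of these signals, Jensen-Shannon divergence between consecutive type distributions, can be sketched directly from its definition. The convergence threshold below is an invented illustration, not KGF's calibrated value:

```python
# Jensen-Shannon divergence between two type-frequency distributions:
# JSD(P||Q) = H(M) - (H(P) + H(Q)) / 2, with M the midpoint mixture.
import math

def _entropy(p: list[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p: list[float], q: list[float]) -> float:
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return _entropy(m) - (_entropy(p) + _entropy(q)) / 2

def converged(p: list[float], q: list[float], threshold: float = 0.01) -> bool:
    # Small divergence between consecutive snapshots suggests the
    # type distribution has stabilized (threshold is illustrative).
    return js_divergence(p, q) < threshold
```

In base 2 the divergence is bounded in [0, 1]: identical distributions score 0, disjoint ones score 1, so a near-zero value between consecutive snapshots is a natural curing trigger.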
Event System
All pipeline decisions emit blinker signals to a JSONL event log (41 signal types across extraction, resolution, curing, loading) for post-run analysis and debugging.
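The signal-to-JSONL pattern looks roughly like the following. KGF uses the blinker library; this dependency-free sketch mimics its connect/send API, and the signal name and payload fields are invented for illustration:

```python
# Dependency-free sketch of the signal -> JSONL event pattern.
# (KGF uses blinker; this mirrors its connect/send shape.)
import json

class Signal:
    def __init__(self, name: str):
        self.name = name
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)
        return receiver  # returning the receiver allows decorator use

    def send(self, sender, **payload):
        for receiver in self._receivers:
            receiver(sender, **payload)

entity_resolved = Signal("entity-resolved")
event_log: list[str] = []

@entity_resolved.connect
def to_jsonl(sender, **payload):
    # KGF appends one JSON object per event to a .jsonl file;
    # here we collect the serialized lines in memory.
    event_log.append(json.dumps({"signal": entity_resolved.name, **payload}))

entity_resolved.send("resolver", entity="Acme Widget", action="merge")
```

Because every decision point emits a structured event, a post-run script can replay the JSONL log to audit why any given merge, block, or escalation happened.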
Supported Formats
Use --unstructured (default) or --structured to select the pipeline - the flags are mutually exclusive and file type filtering is automatic based on the chosen mode.
- Unstructured (--unstructured, default): PDF, TXT, MD, DOCX
- Structured (--structured): JSON, JSONL, CSV, XLSX with automatic schema inference
- Ontology seeds: OWL, YAML, JSON, markdown, plain text
Technology Stack
- Python 3.12, uv package manager
- LLM: litellm + instructor (structured output with validation retry)
- CLI: typer
- Parsing: pymupdf4llm (PDF), python-docx (DOCX)
- Chunking: tiktoken
- Graph: neo4j driver, APOC procedures
- Resolution: python-Levenshtein, faiss-cpu, numpy, boto3 (embeddings)
- Ontology: owlready2 (OWL/RDF), pydantic (schema validation)
- Events: blinker (signal dispatch)
Makefile Targets
- make install - create environment and install package
- make test - run tests (352 tests)
- make lint / make format - check / fix code style
- make build - build distributable wheel
- make clean - remove compiled files and caches
Project Organization
├── kg_builder_cli/
│ ├── cli.py <- CLI entry points (typer)
│ ├── config.py <- Central module config, logger, paths
│ ├── settings/ <- YAML loading, defaults, env interpolation
│ ├── curing/ <- Fluid schema curing, metrics, drift detection
│ ├── extraction/ <- Parsing, chunking, LLM extraction, entity resolution
│ ├── loading/ <- Batch Cypher loading, indexes, validation
│ ├── ontology/ <- Ontology buffer, OWL import, hierarchy evolution
│ ├── events/ <- Blinker signals, event types, handlers
│ └── types/ <- Pydantic data models
├── tests/ <- pytest test suite + benchmark scorecard
├── docs/
│ ├── KGF_DESIGN.md <- Canonical design document
│ ├── research/ <- Foundational concept documents
│ └── benchmarks/ <- Versioned benchmark results with forensics
├── data/
│ ├── raw/ <- Immutable source data
│ ├── interim/ <- Intermediate transforms
│ └── processed/ <- Final datasets
└── .kgf/ <- Runtime config, evolved ontology, event logs
Research Concepts
The docs/research/ folder documents the foundational concepts underlying KGF - both implemented and proposed:
- Ontology Lifecycle - curing/stabilization phases (implemented) and adaptive resolution guides that evolve from accumulated evidence (implemented)
- Bayesian Resolution - posterior-based entity resolution (implemented), multi-channel evidence fusion (implemented), positive evidence accumulation via deferred dedup (implemented)
- Decision Architecture - two-layer adjudication with LLM escalation (implemented), Bayesian decision calibration (proposed), contextual adjudication triggers (proposed), epistemic loop prevention (design principle)
- Pipeline Governance - formal FSM lifecycle (formalized in Section 14), graph metanode control plane (formalized), lease-style locking (proposed)
- Architectural Positioning - innovation classification, agent escalation boundaries (implemented), design signature
References
- Neo4j LLM Graph Builder - reference implementation for LLM-powered knowledge graph construction from unstructured data
- CodeGraphContext - code indexing and graph analysis platform using tree-sitter AST parsing with Neo4j/KuzuDB/FalkorDB backends