Training-free, knowledge-graph-driven question answering using LLMs

kgnode

Training-Free Subgraph Extraction for Knowledge-Grounded Question Answering

Overview

kgnode is a Python library that extracts relevant subgraphs from large knowledge graphs using a path-aware Markov chain traversal algorithm for question answering tasks. Unlike traditional approaches that require KG-specific training (entity linkers, KG embeddings), kgnode achieves competitive performance through:

Hybrid Seed Discovery: Combines semantic search (ChromaDB) and keyword search (SPARQL) with type-aware filtering
Path-Aware Traversal: Priority-queue BFS with exponential probability scoring: P ∝ exp(cos(path, template))
Adaptive Stopping: Quality-based termination that monitors probability distribution
Training-Free: No KG-specific fine-tuning required - works across different knowledge graphs
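
The scoring rule P ∝ exp(cos(path, template)) can be illustrated with a minimal sketch. This is not kgnode's implementation: the helper names and the 2-d toy vectors are assumptions chosen for illustration, and real path and template embeddings would come from the sentence-transformer model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def path_probabilities(path_vecs, template_vec):
    """Normalize exp(cos(path, template)) over the candidate paths."""
    weights = [math.exp(cosine(p, template_vec)) for p in path_vecs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy 2-d embeddings (hypothetical values, not real model output).
template = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # cos = 1, 0, -1
probs = path_probabilities(candidates, template)
print([round(p, 3) for p in probs])
```

Because cosine similarity lies in [-1, 1], the unnormalized weights lie between exp(-1) and exp(1), so the best-aligned candidate can outweigh the worst by a factor of at most exp(2).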

Performance

Evaluated on two benchmarks without domain-specific training:

Dataset	Model	End-to-End Accuracy	Seed Discovery	Entity Coverage	Relation Coverage
DBLP-QuAD (252M triples)	gpt-4o-mini	85.8% (400Q)	92.5%	85.3%	98.0%
DBLP-QuAD	gpt-5-mini	72.5%	95.2%	91.9%	60.3%
QALD-10 (Wikidata, 1.65B triples)	gpt-4o-mini	53.0%	87.3%	78.1%	44.9%
QALD-10	gpt-5-mini	66.5%	90.0%	85.0%	50.0%

DBLP-QuAD gpt-4o-mini accuracy (85.8%) is measured on all 400 questions, running fresh end-to-end with no pre-computed pipeline stages.

Cross-domain transferability: 7 percentage point performance drop (DBLP→QALD) with consistent degradation across stages, validating the training-free approach.

Installation

pip install kgnode

Quick Start

from kgnode import KGConfig, generate_answer

# Configure for your knowledge graph
config = KGConfig(
    sparql_endpoint="http://localhost:7878/query",
    embedding_model="all-MiniLM-L6-v2",
    openai_model="gpt-4o-mini"
)

# End-to-end question answering
answer = generate_answer(
    query="Did Kamil Zbikowski and Michal Ostapowicz co-author a paper?",
    config=config
)
print(answer)

Step-by-Step Pipeline

from kgnode import get_seed_nodes, get_subgraphs, kg_retrieve

# 1. Find seed nodes (hybrid search: semantic + keyword)
seed_nodes, extracted_entities = get_seed_nodes(
    query="What papers did John Smith publish?",
    config=config
)
# Returns: Tuple of (seed_nodes list, extracted_entities list)

# 2. Extract relevant subgraphs using path-aware traversal
# Note: get_subgraphs processes one seed at a time, so we loop through all seeds
all_subgraphs = []
for seed_node in seed_nodes:
    subgraphs, template_text = get_subgraphs(
        seed_node=seed_node['entity_uri'],
        query="What papers did John Smith publish?",
        config=config,
        seed_nodes=seed_nodes  # Optional: provides context for template generation
    )
    all_subgraphs.extend(subgraphs)
# Returns: List of all subgraphs from all seeds with probability scores

# 3. Full pipeline: query → seed discovery → subgraph → SPARQL → answer
result = kg_retrieve(query="What papers did John Smith publish?", config=config)

Key Features

1. Hybrid Seed Discovery

Semantic search: ChromaDB vector database with all-MiniLM-L6-v2 embeddings
Keyword search: SPARQL text matching with name variations
Type-aware filtering: Fuzzy string matching with adaptive thresholds
Performance: 92-95% seed discovery accuracy on DBLP-QuAD, 87-90% on QALD-10
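
As a rough illustration of how two candidate lists might be merged and then filtered by fuzzy string matching, here is a sketch using only the standard library. Everything in it is hypothetical: the function names, the dict shape with `uri` and `label` keys, and the threshold values are assumptions, not kgnode's internals.

```python
from difflib import SequenceMatcher

def adaptive_threshold(mention):
    """Hypothetical rule: short mentions need a closer match than long ones."""
    return 0.9 if len(mention) <= 5 else 0.75

def merge_seeds(mention, semantic_hits, keyword_hits):
    """Union both candidate lists by URI, then keep labels close to the mention."""
    merged = {}
    for hit in semantic_hits + keyword_hits:
        merged.setdefault(hit["uri"], hit)  # first occurrence wins
    threshold = adaptive_threshold(mention)
    kept = []
    for hit in merged.values():
        score = SequenceMatcher(None, mention.lower(), hit["label"].lower()).ratio()
        if score >= threshold:
            kept.append({**hit, "match": round(score, 3)})
    return sorted(kept, key=lambda h: h["match"], reverse=True)

# Toy candidates (hypothetical URIs and labels).
semantic = [{"uri": "ex:p1", "label": "John Smith"}, {"uri": "ex:p2", "label": "Jane Doe"}]
keyword = [{"uri": "ex:p1", "label": "John Smith"}, {"uri": "ex:p3", "label": "John A. Smith"}]
seeds = merge_seeds("John Smith", semantic, keyword)
print([s["uri"] for s in seeds])
```

The duplicate `ex:p1` is collapsed, the name variation `John A. Smith` survives the filter, and the unrelated label is dropped.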

2. Path-Aware Markov Chain Traversal

Exponential probability scoring: Amplifies differences between paths by up to about 7× (e² ≈ 7.4, since cosine similarity lies in [-1, 1])
Template-guided: LLM generates query templates for semantic alignment
Cycle detection: Prevents infinite loops
Deduplication: Removes redundant subgraphs
Adaptive stopping: Quality-based termination (default: min 3, max 15 subgraphs)
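
The traversal ideas listed above (priority queue, cycle detection, deduplication) can be sketched on a toy graph. This is an illustrative best-first search under stated assumptions, not the library's algorithm: the graph, the `relevance` table, and the scoring function are invented, and a real scorer would compare path embeddings against the LLM-generated template.

```python
import heapq
import math

def best_first_paths(graph, seed, score_path, max_paths=3, max_hops=2):
    """Expand the highest-scoring path first; never revisit a node within a path."""
    heap = [(-score_path((seed,)), (seed,))]    # max-heap via negated score
    results, seen = [], set()
    while heap and len(results) < max_paths:
        neg_score, path = heapq.heappop(heap)
        if len(path) > 1 and path not in seen:  # deduplicate emitted paths
            seen.add(path)
            results.append((path, -neg_score))
        if len(path) - 1 >= max_hops:
            continue                            # hop limit reached
        for neighbor in graph.get(path[-1], ()):
            if neighbor in path:                # cycle detection
                continue
            new_path = path + (neighbor,)
            heapq.heappush(heap, (-score_path(new_path), new_path))
    return results

# Toy graph and relevance values (hypothetical).
graph = {
    "author": ["paper1", "paper2", "venue"],
    "paper1": ["author", "venue"],
    "paper2": ["author"],
}
relevance = {"author": 0.2, "paper1": 0.9, "paper2": 0.7, "venue": 0.1}

def score(path):
    """Stand-in for exp(cos(path, template)): uses the last node's relevance."""
    return math.exp(relevance[path[-1]])

paths = best_first_paths(graph, "author", score)
for path, value in paths:
    print(" -> ".join(path), round(value, 3))
```

Paths are emitted in descending score order, and the edge from `paper1` back to `author` is never followed because `author` is already on the path.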

3. Training-Free Architecture

Works across different knowledge graphs (DBLP, Wikidata, etc.)
No entity linking training required
No KG embedding training required
Leverages pre-trained sentence transformers

4. Performance Optimizations

LRU cache: 5000-entry path embedding cache (3-5× speedup)
Parallel queries: Concurrent neighbor retrieval
Batch encoding: Efficient embedding computation
Schema-aware: Grounds templates in actual KG vocabulary
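
A bounded LRU cache of this kind can be sketched with `functools.lru_cache`. The `embed` function below is a stand-in (an assumption) for a real sentence-transformer call, and `cached_path_embedding` is a hypothetical name; only the 5000-entry size comes from the list above.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the expensive function really runs

def embed(text):
    """Stand-in for a sentence-transformer encode call (hypothetical)."""
    calls["n"] += 1
    return tuple(float(ord(c) % 7) for c in text[:8])

@lru_cache(maxsize=5000)
def cached_path_embedding(path_text):
    """Memoize embeddings keyed by the verbalized path string."""
    return embed(path_text)

for text in ["author -> paper", "author -> venue", "author -> paper"]:
    cached_path_embedding(text)

info = cached_path_embedding.cache_info()
print(info.hits, info.misses, calls["n"])  # prints: 1 2 2
```

The third lookup repeats the first string, so it is served from the cache and the underlying function runs only twice.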

Folder Structure

kgnode/
├── src/kgnode/
│   ├── __init__.py              # Public API exports
│   ├── seed_finder.py           # Hybrid seed discovery
│   ├── subgraph_extraction.py   # Path-aware Markov chain algorithm
│   ├── generator.py             # SPARQL and answer generation
│   ├── validator.py             # Subgraph validation
│   ├── keyword_search.py        # SPARQL keyword search
│   ├── chroma_db.py             # ChromaDB vector operations
│   └── core/
│       ├── kg_config.py         # Configuration class
│       ├── sparql_query.py      # SPARQL endpoint communication
│       ├── schema_extractor.py  # Schema extraction
│       ├── schema_chromadb.py   # Schema vector DB
│       └── schema_selector.py   # Query-aware schema selection
├── tests/                        # Unit tests
└── eval/                         # Evaluation scripts

Prerequisites

1. SPARQL Endpoint (Required)

kgnode requires a SPARQL endpoint. We recommend Oxigraph:

# Install Oxigraph
# macOS: brew install oxigraph
# Linux: cargo install oxigraph_server
# Windows: Download from https://github.com/oxigraph/oxigraph/releases

# Start server (read-write mode)
oxigraph_server serve -l ./oxigraph_db --cors

# Start server (read-only mode)
oxigraph_server serve-read-only -l ./oxigraph_db --cors

# Load dataset (one-time setup)
oxigraph_server load -l ./oxigraph_db -f _data/dblp.nt

# Custom bind address
oxigraph_server serve -l ~/oxigraph_db --bind 127.0.0.1:7878

Default endpoint: http://localhost:7878/query

2. OpenAI API Key (Required for LLM operations)

export OPENAI_API_KEY="your-api-key-here"

3. ChromaDB (Auto-created on first run)

Vector database for entity and schema embeddings is created automatically.

Configuration

from kgnode import KGConfig, execute_sparql_query

# Create configuration with custom parameters
config = KGConfig(
    sparql_endpoint="http://localhost:7878/query",
    embedding_model="all-MiniLM-L6-v2",
    openai_model="gpt-4o-mini",
    min_subgraphs=3,              # Adaptive stopping: minimum
    max_subgraphs=15,             # Adaptive stopping: maximum
    quality_threshold_ratio=0.65, # Adaptive stopping: quality threshold
    absolute_prob_threshold=1.5,  # Minimum probability cutoff
)

# Execute SPARQL queries directly
results = execute_sparql_query(
    query="SELECT * WHERE { ?s ?p ?o } LIMIT 10",
    config=config
)
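
For readers who want to see what a direct query looks like on the wire, the sketch below builds a SPARQL 1.1 Protocol request with the standard library and flattens a SPARQL JSON results document. It does not describe how `execute_sparql_query` works internally; `build_sparql_request` and `rows_from_results` are hypothetical helper names, and the sample payload is hand-written.

```python
import json
import urllib.request

def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol POST request asking for JSON results."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )

def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]

request = build_sparql_request(
    "http://localhost:7878/query",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
)
# With a server running: payload = urllib.request.urlopen(request).read()
sample = (
    '{"head": {"vars": ["s"]}, "results": {"bindings": '
    '[{"s": {"type": "uri", "value": "http://example.org/a"}}]}}'
)
print(rows_from_results(sample))
```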

Configuration Options

Adaptive Stopping Parameters

Default (Balanced):

config = KGConfig(
    min_subgraphs=3,              # Minimum subgraphs to collect
    max_subgraphs=15,             # Maximum hard cap
    quality_threshold_ratio=0.65, # Stop if next prob < 0.65 × median
    absolute_prob_threshold=1.5   # Stop if next prob < 1.5 (≈ exp(0.4); scores are unnormalized)
)

Aggressive (Fewer subgraphs, faster):

config = KGConfig(
    min_subgraphs=2,
    max_subgraphs=10,
    quality_threshold_ratio=0.75,
    absolute_prob_threshold=2.0
)

Conservative (More subgraphs, higher coverage):

config = KGConfig(
    min_subgraphs=5,
    max_subgraphs=20,
    quality_threshold_ratio=0.4,
    absolute_prob_threshold=1.0
)

Disable Adaptive Stopping:

config = KGConfig(
    min_subgraphs=25,
    max_subgraphs=25  # Just use hard limit
)
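
The interaction of the four parameters can be sketched as follows. This is one plausible reading of the comments in the default configuration, not the library's code: it assumes the median is taken over the scores collected so far, and `select_subgraphs` is a hypothetical name.

```python
from statistics import median

def select_subgraphs(scores, min_subgraphs=3, max_subgraphs=15,
                     quality_threshold_ratio=0.65, absolute_prob_threshold=1.5):
    """Take scores in descending order until a stopping condition fires."""
    kept = []
    for score in sorted(scores, reverse=True):
        if len(kept) >= max_subgraphs:
            break                                   # hard cap
        if len(kept) >= min_subgraphs:
            if score < absolute_prob_threshold:
                break                               # below absolute cutoff
            if score < quality_threshold_ratio * median(kept):
                break                               # sharp drop versus kept set
        kept.append(score)
    return kept

scores = [2.6, 2.5, 2.4, 2.3, 1.55, 1.2, 1.1]
print(select_subgraphs(scores))  # prints: [2.6, 2.5, 2.4, 2.3]
```

Here 1.55 clears the absolute cutoff of 1.5 but falls below 0.65 × 2.45 (the median of the four kept scores), so collection stops there. Setting `min_subgraphs` equal to `max_subgraphs` makes both quality checks unreachable, which matches the "Disable Adaptive Stopping" configuration.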

Logging Configuration

Option 1: Environment Variable

# Show debug messages
export KGNODE_LOG_LEVEL=DEBUG
python your_script.py

# Only warnings and errors
export KGNODE_LOG_LEVEL=WARNING
python your_script.py

# Completely silent
export KGNODE_LOG_LEVEL=CRITICAL
python your_script.py

Option 2: In Code

from kgnode.core import set_log_level, disable_logging

# Show debug messages
set_log_level("DEBUG")

# Only warnings and errors
set_log_level("WARNING")

# Completely silent
disable_logging()

Datasets

DBLP-QuAD (Primary benchmark)

Domain: Academic publications knowledge graph
Source: https://dblp.org/rdf/
Download: https://zenodo.org/records/7638511
Paper: DBLP-QuAD (ECIR 2023)
Stats: 252M triples, 92M entities, 62 relations
Evaluation: 400 questions across 10 query types

QALD-10 (Cross-domain benchmark)

Domain: General knowledge (Wikidata)
Stats: 1.65B triples, 120M entities
Evaluation: 394 test questions
Paper: QALD-10 (Semantic Web 2024)

Testing

Run All Tests

python tests/test_runner.py

Run Specific Tests

# Run single test file
python tests/test_runner.py chromadb

# Run multiple test files
python tests/test_runner.py chromadb seed_finder subgraph_extraction

# List available tests
python tests/test_runner.py --list

# Run standalone test file
python tests/test_chromadb.py

Prerequisites for Testing

Oxigraph SPARQL server running at http://localhost:7878/query
OPENAI_API_KEY environment variable set
ChromaDB created (happens automatically on first run)

Development

Building from Source

# Clone the repository
git clone <repository-url>
cd kgnode

# Install dependencies using uv
uv sync

# Run tests
python tests/test_runner.py

Building Package for PyPI

# Clean previous builds
rm -rf dist/ build/ *.egg-info

# Build distribution packages (source + wheel)
uv build

# Check the built packages
twine check dist/*

The uv build command creates:

dist/kgnode-{version}.tar.gz - Source distribution
dist/kgnode-{version}-py3-none-any.whl - Wheel distribution

Documentation

For detailed usage, API reference, and examples:

Usage Guide: docs/USAGE.md
Research Paper: See paper/ directory for the academic paper with full methodology

Supported Technologies

Vector Databases

ChromaDB ✅ (implemented)
Pinecone (planned)
Qdrant (planned)

Embedding Models

all-MiniLM-L6-v2 ✅ (default, 384 dimensions)
google/embeddinggemma-300m (alternative)

LLM Backends

GPT-4o-mini ✅ (default, cost-effective)
GPT-5-mini ✅ (highest performance)

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Submit a pull request

Acknowledgments

We acknowledge support from NHR Verein for this work.

Release history

0.3.0: May 19, 2026
0.2.0: Mar 17, 2026
0.1.1: Nov 28, 2025
0.1.0: Nov 22, 2025

