OntoCast: Agentic ontology and knowledge graph co-generation


Agentic ontology-assisted framework for semantic triple extraction

Overview

OntoCast is a framework for extracting semantic triples from documents, building a knowledge graph through an agentic, ontology-driven approach. It combines ontology management, natural language processing, and knowledge graph serialization to turn unstructured text into structured, queryable data.


Key Features

  • Ontology-Guided Extraction: Ensures semantic consistency and co-evolves ontologies
  • Entity Disambiguation: Resolves references across document chunks
  • Multi-Format Support: Handles text, JSON, PDF, and Markdown
  • Semantic Chunking: Splits text based on semantic similarity
  • MCP Compatibility: Implements Model Context Protocol endpoints
  • RDF Output: Produces standardized RDF/Turtle
  • Triple Store Integration: Supports Neo4j (n10s) and Apache Fuseki
  • Hierarchical Configuration: Type-safe configuration system with environment variable support
  • CLI Parameters: Flexible command-line interface with options such as --skip-ontology-critique and --head-chunks
  • Automatic LLM Caching: Built-in response caching for improved performance and cost reduction
  • GraphUpdate Operations: Token-efficient SPARQL-based updates instead of full graph regeneration
  • Budget Tracking: Comprehensive tracking of LLM usage and triple generation metrics
  • Ontology Versioning: Automatic semantic versioning with hash-based lineage tracking

Applications

OntoCast can be used for:

  • Knowledge Graph Construction: Build domain-specific or general-purpose knowledge graphs from documents
  • Semantic Search: Power search and retrieval with structured triples
  • GraphRAG: Enable retrieval-augmented generation over knowledge graphs (e.g., with LLMs)
  • Ontology Management: Automate ontology creation, validation, and refinement
  • Data Integration: Unify data from diverse sources into a semantic graph

Installation

uv add ontocast 
# or
pip install ontocast

Quick Start

1. Configuration

Create a .env file with your configuration:

# LLM Configuration
LLM_PROVIDER=openai
LLM_API_KEY=your-api-key-here
LLM_MODEL_NAME=gpt-4o-mini
LLM_TEMPERATURE=0.1

# Server Configuration
PORT=8999
MAX_VISITS=3
RECURSION_LIMIT=1000
ESTIMATED_CHUNKS=30

# Path Configuration
ONTOCAST_WORKING_DIRECTORY=/path/to/working
ONTOCAST_ONTOLOGY_DIRECTORY=/path/to/ontologies
ONTOCAST_CACHE_DIR=/path/to/cache

# Optional: Triple Store Configuration
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=ontocast

# Optional: Skip ontology critique
SKIP_ONTOLOGY_DEVELOPMENT=false
# Optional: Maximum triples allowed in ontology graph (set empty for unlimited)
ONTOLOGY_MAX_TRIPLES=10000

2. Start Server

ontocast \
    --env-path .env \
    --working-directory /path/to/working \
    --ontology-directory /path/to/ontologies

3. Process Documents

curl -X POST http://localhost:8999/process -F "file=@document.pdf"
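
To script the same call from Python, here is a minimal sketch using the requests library (the port and endpoint are the defaults configured above):

import requests

# POST a document to a locally running OntoCast server.
with open("document.pdf", "rb") as f:
    response = requests.post("http://localhost:8999/process", files={"file": f})

response.raise_for_status()
print(response.text)  # serialized extraction result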

4. API Endpoints

The OntoCast server provides the following endpoints:

  • POST /process: Process documents and extract semantic triples

    curl -X POST http://localhost:8999/process -F "file=@document.pdf"
    
  • POST /flush: Flush/clean triple store data

    # Clean all datasets (Fuseki) or entire database (Neo4j)
    curl -X POST http://localhost:8999/flush
    
    # Clean specific Fuseki dataset
    curl -X POST "http://localhost:8999/flush?dataset=my_dataset"
    

    Note: For Fuseki, you can specify a dataset query parameter to clean a specific dataset. If omitted, all datasets are cleaned. For Neo4j, the dataset parameter is ignored and all data is deleted.

  • GET /health: Health check endpoint

  • GET /info: Service information endpoint
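
For scripted checks, a short sketch against the default port (the flush semantics follow the note above; the dataset name is a placeholder):

import requests

base = "http://localhost:8999"

# Verify the server is up before submitting documents.
assert requests.get(f"{base}/health").ok

# Clean one Fuseki dataset; omit params to clean everything.
requests.post(f"{base}/flush", params={"dataset": "my_dataset"})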


LLM Caching

OntoCast includes automatic LLM response caching to improve performance and reduce API costs. Caching is enabled by default and requires no configuration.

Cache Locations

  • Tests: .test_cache/llm/ in the current working directory
  • Windows: %USERPROFILE%\AppData\Local\ontocast\llm\
  • Unix/Linux: ~/.cache/ontocast/llm/ (or $XDG_CACHE_HOME/ontocast/llm/)

Benefits

  • Faster Execution: Repeated queries return cached responses instantly
  • Cost Reduction: Identical requests don't hit the LLM API
  • Offline Capability: Tests can run without API access if responses are cached
  • Transparent: No configuration required - works automatically

Custom Cache Directory

If you need a custom cache directory, set the ONTOCAST_CACHE_DIR environment variable (see the configuration table below); in code, the cache location is resolved automatically:

from ontocast.tool.llm import LLMTool

# The cache directory is managed automatically by Cacher,
# honoring ONTOCAST_CACHE_DIR when set.
llm_tool = LLMTool.create(
    config=llm_config  # the LLM section of your configuration
)

Configuration System

OntoCast uses a hierarchical configuration system built on Pydantic BaseSettings:
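
As an illustration of the pattern only (the real ToolConfig and ServerConfig differ), a hierarchical model built on pydantic-settings looks roughly like this; the field names below are simplified assumptions mirroring the variables in the table that follows:

from pydantic_settings import BaseSettings, SettingsConfigDict

class LLMSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="LLM_", env_file=".env", extra="ignore")

    api_key: str                     # LLM_API_KEY, required
    provider: str = "openai"         # LLM_PROVIDER
    model_name: str = "gpt-4o-mini"  # LLM_MODEL_NAME
    temperature: float = 0.1         # LLM_TEMPERATURE

class ServerSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    port: int = 8999     # PORT
    max_visits: int = 3  # MAX_VISITS

# Values are read from the environment / .env file at instantiation time.
llm = LLMSettings()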

Environment Variables

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| LLM_API_KEY | API key for LLM provider | - | Yes |
| LLM_PROVIDER | LLM provider (openai, ollama) | openai | No |
| LLM_MODEL_NAME | Model name | gpt-4o-mini | No |
| LLM_TEMPERATURE | Temperature setting | 0.1 | No |
| ONTOCAST_WORKING_DIRECTORY | Working directory path | - | Yes |
| ONTOCAST_ONTOLOGY_DIRECTORY | Ontology files directory | - | No |
| PORT | Server port | 8999 | No |
| MAX_VISITS | Maximum visits per node | 3 | No |
| SKIP_ONTOLOGY_DEVELOPMENT | Skip ontology critique | false | No |
| ONTOLOGY_MAX_TRIPLES | Maximum triples allowed in ontology graph | 10000 | No |
| SKIP_FACTS_RENDERING | Skip facts rendering and go straight to aggregation | false | No |
| ONTOCAST_CACHE_DIR | Custom cache directory for LLM responses | Platform default | No |

Triple Store Configuration

# Fuseki (Preferred)
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=dataset_name

# Neo4j (Alternative)
NEO4J_URI=bolt://localhost:7689
NEO4J_AUTH=neo4j:password

CLI Parameters

# Skip ontology critique step
ontocast --skip-ontology-critique

# Process only first N chunks (for testing)
ontocast --head-chunks 5

Triple Store Setup

OntoCast supports multiple triple store backends with automatic fallback:

  1. Apache Fuseki (Recommended) - Native RDF with SPARQL support
  2. Neo4j with n10s - Graph database with RDF capabilities
  3. Filesystem (Fallback) - Local file-based storage

When multiple triple stores are configured, Fuseki is preferred over Neo4j.
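
The preference order can be pictured as a simple chain (a sketch of the documented behavior, not OntoCast's actual code):

import os

def pick_backend() -> str:
    # Prefer Fuseki, then Neo4j, then the filesystem fallback,
    # depending on which connection settings are present.
    if os.getenv("FUSEKI_URI"):
        return "fuseki"
    if os.getenv("NEO4J_URI"):
        return "neo4j"
    return "filesystem"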

Quick Setup with Docker

Fuseki:

cd docker/fuseki
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env up -d fuseki

Neo4j:

cd docker/neo4j
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env up -d neo4j

See Triple Store Setup for detailed instructions.



Recent Changes

Ontology Management Improvements

  • Automatic Versioning: Semantic version increment based on change analysis (MAJOR/MINOR/PATCH)
  • Hash-Based Lineage: Git-style versioning with parent hashes for tracking ontology evolution
  • Multiple Version Storage: Versions stored as separate named graphs in Fuseki triple stores
  • Timestamp Tracking: updated_at field tracks when ontology was last modified
  • Smart Version Analysis: Analyzes ontology changes (classes, properties, instances) to determine appropriate version bump
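
In spirit, the lineage scheme resembles the following sketch; the hash inputs, bump rules, and function names here are illustrative assumptions, not OntoCast's internals:

import hashlib

def lineage_hash(serialized_ontology: str, parent_hash: str | None) -> str:
    # Git-style: a version's hash covers its content and its parent's hash.
    h = hashlib.sha256()
    h.update((parent_hash or "").encode())
    h.update(serialized_ontology.encode())
    return h.hexdigest()[:12]

def bump(version: str, removed_terms: bool, added_terms: bool) -> str:
    # MAJOR for breaking removals, MINOR for additions, PATCH otherwise.
    major, minor, patch = map(int, version.split("."))
    if removed_terms:
        return f"{major + 1}.0.0"
    if added_terms:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"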

GraphUpdate System

  • Token Efficiency: LLM outputs structured SPARQL operations (insert/delete) instead of full TTL graphs
  • Incremental Updates: Only changes are generated, dramatically reducing token usage
  • Structured Operations: TripleOp operations with explicit prefix declarations for precise updates
  • SPARQL Generation: Automatic conversion of operations to executable SPARQL queries
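
Conceptually, an operation-based update reduces to small INSERT DATA / DELETE DATA queries instead of a re-emitted graph. A sketch with an assumed operation shape (the real TripleOp model may differ):

from dataclasses import dataclass

@dataclass
class TripleOp:
    action: str               # "insert" or "delete"
    triples: list[str]        # N-Triples-style statements
    prefixes: dict[str, str]  # explicit prefix declarations

def to_sparql(op: TripleOp) -> str:
    # Render the prefix declarations, then the data block.
    header = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in op.prefixes.items())
    keyword = "INSERT DATA" if op.action == "insert" else "DELETE DATA"
    body = "\n".join(op.triples)
    return f"{header}\n{keyword} {{\n{body}\n}}"

op = TripleOp("insert", ["ex:OntoCast ex:outputs ex:RDF ."], {"ex": "http://example.org/"})
print(to_sparql(op))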

Budget Tracking

  • LLM Statistics: Tracks API calls, characters sent/received for cost monitoring
  • Triple Metrics: Tracks ontology and facts triples generated per operation
  • Summary Reports: Budget summaries logged at end of processing
  • Integrated Tracking: Budget tracker integrated into AgentState for clean dependency injection
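
Such a tracker boils down to a few counters carried through the agent state; a minimal sketch (field and method names are assumptions):

from dataclasses import dataclass

@dataclass
class BudgetTracker:
    llm_calls: int = 0
    chars_sent: int = 0
    chars_received: int = 0
    triples_generated: int = 0

    def record_call(self, prompt: str, completion: str) -> None:
        self.llm_calls += 1
        self.chars_sent += len(prompt)
        self.chars_received += len(completion)

    def summary(self) -> str:
        return (
            f"{self.llm_calls} LLM calls, {self.chars_sent} chars sent, "
            f"{self.chars_received} chars received, {self.triples_generated} triples"
        )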

Configuration System Overhaul

  • Hierarchical Configuration: New ToolConfig and ServerConfig structure
  • Environment Variables: Support for .env files and environment variables
  • Type Safety: Full type safety with Python 3.12 union syntax
  • API Key: Changed from OPENAI_API_KEY to LLM_API_KEY for consistency
  • Dependency Injection: Removed global variables, implemented proper DI

Enhanced Features

  • CLI Parameters: New --skip-ontology-critique and --skip-facts-rendering parameters
  • RDFGraph Operations: Improved __iadd__ method with proper prefix binding (see the sketch after this list)
  • Triple Store Management: Better separation between filesystem and external stores
  • Serialization Interface: Unified serialize() method for storing Ontology and RDFGraph objects
  • Error Handling: Improved error handling and validation
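
The prefix-binding point can be reproduced with plain rdflib: += merges triples but not namespace bindings, so an augmented merge re-binds them explicitly. A sketch (RDFGraph's actual implementation may differ):

from rdflib import Graph

def merge_with_prefixes(target: Graph, source: Graph) -> Graph:
    # rdflib's += copies triples; prefix bindings must be re-bound.
    target += source
    for prefix, ns in source.namespaces():
        target.bind(prefix, ns)
    return target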

See CHANGELOG.md for complete details.


Examples

Basic Usage

from ontocast.config import Config
from ontocast.toolbox import ToolBox

# Load configuration
config = Config()

# Initialize tools
tools = ToolBox(config)

# Process documents
# ... (use tools for processing)

Server Usage

# Start server with custom configuration
ontocast \
    --env-path .env \
    --working-directory /data/working \
    --ontology-directory /data/ontologies \
    --skip-ontology-critique \
    --head-chunks 10

Contributing

We welcome contributions! Please see our Contributing Guide for details.


License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

