🧠 Semantica - An Open Source Framework for building Semantic Layers and Knowledge Engineering

🧠 Semantica

A Framework for Building Context Graphs and Decision Intelligence Layers for AI

Python 3.8+ • License: MIT • PyPI Version • Total Downloads • CI • Discord • X

⭐ Give us a Star • 🍴 Fork us • 💬 Join our Discord • 🐦 Follow on X

Transform Chaos into Intelligence. Build AI systems with context graphs, decision tracking, and advanced knowledge engineering that are explainable, traceable, and trustworthy, not black boxes.


The Problem

AI agents today are capable but not trustworthy:

  • No memory structure: agents store embeddings, not meaning. Retrieval is fuzzy; there's no way to ask why something was recalled.
  • No decision trail: agents make decisions continuously but record nothing. When something goes wrong, there's no history to debug or audit.
  • No provenance: outputs cannot be traced back to source facts. In regulated industries, this is a compliance blocker.
  • No reasoning transparency: black-box answers with no explanation of how a conclusion was reached.
  • No conflict detection: contradictory facts silently coexist in vector stores, producing unpredictable answers.

These aren't edge cases. They are the reason AI cannot be deployed in healthcare, finance, legal, and government without custom guardrails built from scratch.

The Solution

Semantica is the context and intelligence layer you add to your AI stack:

  • Context Graphs: a structured graph of entities, relationships, and decisions that your agent builds as it works. Queryable, traceable, persistent.
  • Decision Intelligence: every decision is a first-class object, recorded, linked causally, searchable by precedent, and analyzable for downstream impact.
  • Provenance: every fact links to its source. W3C PROV-O compliant. Full lineage from ingestion to inference.
  • Reasoning engines: forward chaining, Rete networks, deductive, abductive, and SPARQL reasoning. Explainable inference paths, not black-box answers.
  • Deduplication & QA: conflict detection, entity resolution, and validation built into the pipeline.

Works alongside LangChain, LlamaIndex, AutoGen, CrewAI, and any LLM provider. Semantica is not a replacement; it is the accountability layer on top.

⚡ Quick Installation

pip install semantica

What's New in v0.3.0

First stable release, promoted to Production/Stable on PyPI. This is a full summary of everything shipped across 0.3.0-alpha → 0.3.0-beta → 0.3.0 stable.

Context Graph: Feature Completeness

  • Temporal validity windows: ContextNode and ContextEdge now carry valid_from / valid_until ISO datetime fields. Call node.is_active(at_time=None) to check whether a node is live at a given point in time, or graph.find_active_nodes(node_type, at_time) to filter an entire graph by validity. Fields survive full serialisation round-trips via save_to_file() / load_from_file() and the to_dict() / from_dict() path.
  • Weighted multi-hop BFS: get_neighbors(hops, min_weight=0.0) now accepts a minimum edge weight so you can confine traversal to high-confidence causal links and ignore noisy or low-trust relationships. Fully backward-compatible: the default 0.0 passes all edges.
  • Cross-graph navigation: link_graph(other_graph, source_node, target_node) creates a navigable bridge between two separate ContextGraph instances and returns a link_id. Call navigate_to(link_id) to jump to the target graph and entry node. Links now survive save/load: save_to_file() writes a links section, and resolve_links({graph_id: instance}) reconnects them after reload. Each graph carries a stable graph_id UUID for this purpose.
  • Bug fixes: is_active() now normalises tz-aware datetime inputs to tz-naive UTC (prevents TypeError); cross-graph marker nodes are correctly typed "cross_graph_link" instead of polluting the graph with phantom "entity" nodes; 14 new dedicated tests in tests/context/test_cross_graph_navigation.py.
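The validity-window check and the tz-aware normalisation described in these notes can be sketched in plain Python. This is an illustrative stand-in under stated assumptions, not Semantica's implementation: the Node dataclass, its is_active method, and the to_naive_utc helper are hypothetical names that only mirror the valid_from / valid_until fields mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def to_naive_utc(dt: datetime) -> datetime:
    """Convert tz-aware datetimes to tz-naive UTC; leave naive ones untouched."""
    if dt.tzinfo is not None:
        return dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt


@dataclass
class Node:  # hypothetical stand-in for a context node
    node_id: str
    valid_from: Optional[str] = None   # ISO datetime string; None means open start
    valid_until: Optional[str] = None  # ISO datetime string; None means open end

    def is_active(self, at_time: Optional[datetime] = None) -> bool:
        # Default to "now", expressed as tz-naive UTC like every other value here
        if at_time is None:
            at_time = datetime.now(timezone.utc)
        moment = to_naive_utc(at_time)
        if self.valid_from and moment < to_naive_utc(datetime.fromisoformat(self.valid_from)):
            return False
        if self.valid_until and moment > to_naive_utc(datetime.fromisoformat(self.valid_until)):
            return False
        return True


policy = Node("policy_v1", valid_from="2024-01-01T00:00:00", valid_until="2024-12-31T23:59:59")
aware = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)

print(policy.is_active(aware))                 # tz-aware input inside the window
print(policy.is_active(datetime(2025, 1, 1)))  # naive input after the window
```

Normalising both sides to tz-naive UTC before comparing is what avoids the TypeError that Python raises when an aware datetime is compared with a naive one.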

Decision Intelligence & Agent Context (0.3.0-alpha / beta)

  • Complete decision lifecycle: record_decision(), add_decision(), add_causal_relationship(), trace_decision_chain(), analyze_decision_impact(), analyze_decision_influence(), and find_similar_decisions() all work end-to-end with full audit trails.
  • Precedent search: hybrid similarity search over past decisions combining vector, structural, and category similarity with configurable weights; find_precedents() and retrieve_decision_precedents() fixed for correct entity extraction behaviour.
  • PolicyEngine: versioned policy nodes, compliance checking, check_decision_rules(), exception handling with PolicyException; falls back gracefully when no graph store is present.
  • AgentContext: unified wrapper with granular feature flags (decision_tracking, kg_algorithms, graph_expansion), store(), retrieve(), get_conversation_history(), get_statistics(); capture_cross_system_inputs() for multi-agent pipelines.
  • AgentMemory: working, conversation, and long-term memory tiers with statistics.
  • Multi-hop context assembly: expand_context(), dynamic_context_traversal(), and multi_hop_context_assembly() all fixed for correct BFS and decision-query behaviour.
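Multi-hop traversal with a minimum edge weight, as described for get_neighbors(hops, min_weight=0.0), amounts to a depth-limited breadth-first search that skips weak edges. The sketch below is a minimal illustration, assuming a plain adjacency-list graph; the neighbors_within function and the sample node names are hypothetical and are not part of the Semantica API.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical adjacency list: node -> [(neighbor, edge_weight), ...]
Graph = Dict[str, List[Tuple[str, float]]]


def neighbors_within(graph: Graph, start: str, hops: int, min_weight: float = 0.0) -> Set[str]:
    """Breadth-first search up to `hops` edges, skipping edges below min_weight."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbor, weight in graph.get(node, []):
            if weight >= min_weight and neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


graph: Graph = {
    "loan_approved": [("rate_set", 0.9), ("marketing_email", 0.2)],
    "rate_set": [("contract_issued", 0.8)],
    "marketing_email": [("newsletter_signup", 0.9)],
}

all_links = neighbors_within(graph, "loan_approved", hops=2)
strong_links = neighbors_within(graph, "loan_approved", hops=2, min_weight=0.5)

print(sorted(all_links))
print(sorted(strong_links))
```

With the default min_weight of 0.0 every edge passes, so the traversal behaves like an ordinary BFS. Raising the threshold prunes the low-weight edge and everything reachable only through it.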

Knowledge Graph Algorithms (0.3.0-alpha)

  • Advanced analytics: PageRank centrality (calculate_pagerank), betweenness centrality, clustering coefficient, community detection via Louvain; all return structured dicts.
  • Node embeddings: Node2Vec via NodeEmbedder; compute_embeddings(graph, node_labels, relationship_types).
  • Link prediction: LinkPredictor.score_link(graph, n1, n2, method=) for scoring potential new edges.
  • Similarity: SimilarityCalculator.cosine_similarity(v1, v2).
  • Provenance: ProvenanceTracker, GraphBuilderWithProvenance, AlgorithmTrackerWithProvenance with 9 domain-specific tracking methods; now correctly exported from semantica.kg.
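To make link scoring concrete, here is a small sketch of two standard neighborhood-based scores, common neighbors and the Jaccard coefficient, over an undirected adjacency map. It is an assumption-laden illustration rather than LinkPredictor's code: the adjacency dictionary and both function names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical undirected graph stored as adjacency sets
adjacency: Dict[str, Set[str]] = {
    "transformer": {"bert", "gpt4"},
    "bert": {"transformer"},
    "gpt4": {"transformer"},
    "lstm": set(),
}


def common_neighbors_score(adj: Dict[str, Set[str]], n1: str, n2: str) -> int:
    """Number of nodes adjacent to both n1 and n2."""
    return len(adj.get(n1, set()) & adj.get(n2, set()))


def jaccard_score(adj: Dict[str, Set[str]], n1: str, n2: str) -> float:
    """Shared neighbors divided by all neighbors of either node."""
    a, b = adj.get(n1, set()), adj.get(n2, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(common_neighbors_score(adjacency, "gpt4", "bert"))  # both link to "transformer"
print(jaccard_score(adjacency, "gpt4", "lstm"))           # no shared neighbors
```

A higher score suggests that an edge between the two nodes is more plausible. Neither score uses edge types or weights, which a production predictor would likely take into account.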

Semantic Extraction (0.3.0-beta)

  • Multi-founder LLM extraction fix: unmatched subjects/objects now produce a synthetic UNKNOWN entity instead of silently dropping the relation; all co-founders from LLM responses are preserved.
  • Reasoner inference fix: _match_pattern rewritten to split on ?var placeholders first; pre-bound variables resolve to exact literals, repeated variables use backreferences, non-greedy matching prevents over-consumption.
  • Duplicate relation fix: an orphaned legacy block in _parse_relation_result that appended every relation twice has been removed.
  • LLM-typed extraction: the extraction_method parameter correctly sets "llm_typed" metadata on typed extraction paths.

Export & Storage (0.3.0-beta)

  • RDF export aliases: RDFExporter now accepts "ttl", "nt", "xml", "rdf", and "json-ld" as format aliases; no API changes for existing callers.
  • ArangoDB AQL export: full AQL INSERT statement generation for vertices and edges; batch processing; export_arango() convenience function; auto-detected from .aql extension.
  • Apache Parquet export: columnar export with configurable compression (snappy, gzip, brotli, zstd, lz4); explicit Arrow schemas; export_parquet() convenience function; analytics-ready for Spark, Snowflake, BigQuery, Databricks.

Deduplication v2 (0.3.0-beta)

  • Candidate generation v2: blocking_v2 and hybrid_v2 strategies replace O(N²) pair enumeration with multi-key blocking, phonetic Soundex matching, and deterministic max_candidates_per_entity budgeting; 63.6% faster in worst-case scenarios.
  • Two-stage scoring prefilter: fast type-mismatch, name-length-ratio, and token-overlap gates skip expensive semantic scoring for obvious non-matches; 18–25% faster batch processing; configurable thresholds.
  • Semantic relationship deduplication v2: canonicalisation engine with predicate synonym mapping (works_for → employed_by), O(1) hash matching for exact canonical signatures, weighted scoring (60% predicate + 40% object); 6.98x faster than legacy mode.
  • dedup_triplets() fix: critical infinite recursion bug fixed; the function is now a first-class API in methods.py.
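The idea behind canonical-signature matching can be shown with a short sketch: map predicate variants to one canonical form, normalise the subject and object, and use the resulting tuple as a dictionary key so that exact duplicates are found with an average O(1) lookup. The synonym table and the canonicalize and dedup_exact helpers are hypothetical; this is not the library's dedup_triplets() implementation and it omits the weighted scoring step.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical synonym table: variant predicate -> canonical predicate
PREDICATE_SYNONYMS: Dict[str, str] = {
    "works_for": "employed_by",
    "works_at": "employed_by",
}


def canonicalize(triple: Triple) -> Triple:
    """Lowercase and trim each part, then map the predicate to its canonical form."""
    subject, predicate, obj = triple
    predicate = predicate.strip().lower()
    predicate = PREDICATE_SYNONYMS.get(predicate, predicate)
    return (subject.strip().lower(), predicate, obj.strip().lower())


def dedup_exact(triples: List[Triple]) -> List[Triple]:
    """Keep the first triple seen for each canonical signature."""
    seen: Dict[Triple, Triple] = {}
    for triple in triples:
        seen.setdefault(canonicalize(triple), triple)
    return list(seen.values())


triples = [
    ("Alice", "works_for", "OpenAI"),
    ("alice", "employed_by", "openai"),
    ("Alice", "works_at", "OpenAI "),
    ("Bob", "works_for", "Anthropic"),
]
unique = dedup_exact(triples)
print(unique)
```

Only triples whose signatures differ after canonicalisation would need the more expensive similarity scoring, which is where most of the speed-up of such a scheme comes from.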

Incremental / Delta Processing (0.3.0-beta)

  • Delta computation: native diff between graph snapshots using SPARQL; only changed data flows through the pipeline.
  • Version snapshot management: graph URI tracking, metadata storage, snapshot retention with prune_versions().
  • Delta-aware pipelines: delta_mode configuration in PipelineBuilder; processes only changes for near-real-time workloads.
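Conceptually, a delta between two graph snapshots is the pair of triple sets that were added and removed. The release notes say Semantica computes this with SPARQL; the sketch below swaps in plain Python set arithmetic to show the same idea, and the Delta type and compute_delta function are hypothetical names.

```python
from typing import NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]


class Delta(NamedTuple):
    added: Set[Triple]
    removed: Set[Triple]


def compute_delta(old: Set[Triple], new: Set[Triple]) -> Delta:
    """Triples present only in the new snapshot, and only in the old one."""
    return Delta(added=new - old, removed=old - new)


snapshot_v1 = {
    ("bert", "based_on", "transformer"),
    ("gpt35", "based_on", "transformer"),
}
snapshot_v2 = {
    ("bert", "based_on", "transformer"),
    ("gpt4", "based_on", "transformer"),
}

delta = compute_delta(snapshot_v1, snapshot_v2)
print("added:", delta.added)
print("removed:", delta.removed)
```

A delta-aware pipeline would then pass only delta.added and delta.removed to downstream stages instead of reprocessing the unchanged triple.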

Pipeline & Production (0.3.0-alpha / beta)

  • FailureHandler: handle_failure(error, policy, retry_count) with LINEAR, EXPONENTIAL, and FIXED backoff strategies via RetryPolicy / RetryStrategy.
  • PipelineValidator: validate(builder) returns ValidationResult(valid, errors, warnings); does not raise exceptions.
  • add_step() fix: correctly returns the created PipelineStep object (return type annotation corrected to match).
  • Retry loop fix: the execution engine now iterates up to max_retries correctly.
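The three backoff strategies named above differ only in how the wait time grows with the retry count. The following sketch shows one common way to compute the delays; the Strategy enum, the backoff_delay function, and the base_delay / max_delay parameters are assumptions made for illustration and do not reflect the actual RetryPolicy fields.

```python
from enum import Enum


class Strategy(Enum):  # hypothetical mirror of the three named strategies
    FIXED = "fixed"
    LINEAR = "linear"
    EXPONENTIAL = "exponential"


def backoff_delay(strategy: Strategy, retry_count: int,
                  base_delay: float = 1.0, max_delay: float = 30.0) -> float:
    """Seconds to wait before retry number `retry_count` (1-based), capped at max_delay."""
    if strategy is Strategy.FIXED:
        delay = base_delay
    elif strategy is Strategy.LINEAR:
        delay = base_delay * retry_count
    else:
        delay = base_delay * 2 ** (retry_count - 1)
    return min(delay, max_delay)


for strategy in Strategy:
    print(strategy.name, [backoff_delay(strategy, n) for n in range(1, 5)])
```

The cap keeps exponential growth from producing unbounded waits. Real retry loops often add random jitter as well, which is left out here to keep the output deterministic.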

Graph Database Backends (0.3.0-alpha)

  • Apache AGE: PostgreSQL graph extension with openCypher via SQL; SQL injection vulnerabilities fixed; input validation added.
  • AWS Neptune: Amazon Neptune with IAM authentication.
  • FalkorDB: DecisionQuery and CausalChainAnalyzer work directly with FalkorDB row/header shapes.

Test Coverage

  • 886+ tests passing, 0 failures across all modules: context (335), KG (~430), semantic extraction (70), reasoning (19), pipeline, export, deduplication.
  • Added 85 real-world comprehensive tests (test_030_realworld_comprehensive.py) covering tech companies, CEOs, investment chains, and healthcare scenarios end-to-end.

See the full CHANGELOG for the complete diff.


Features

Context & Decision Intelligence

  • Context Graphs: structured graph of entities, relationships, and decisions; queryable, causal, persistent
  • Decision tracking: record, link, and analyze every agent decision with add_decision(), record_decision()
  • Causal chains: link decisions with add_causal_relationship(), trace lineage with trace_decision_chain()
  • Precedent search: hybrid similarity search over past decisions with find_similar_decisions()
  • Influence analysis: analyze_decision_impact() and analyze_decision_influence() to understand downstream effects
  • Policy engine: enforce business rules with check_decision_rules(); automated compliance validation
  • Agent memory: AgentMemory with short/long-term storage, conversation history, and statistics
  • Cross-system context capture: capture_cross_system_inputs() for multi-agent pipelines

Knowledge Graphs

  • Knowledge graph construction: entities, relationships, properties, typed edges
  • Graph algorithms: PageRank, betweenness centrality, clustering coefficient, community detection
  • Node embeddings: Node2Vec embeddings via NodeEmbedder
  • Similarity: cosine similarity via SimilarityCalculator
  • Link prediction: score potential new edges via LinkPredictor
  • Temporal graphs: time-aware nodes and edges
  • Incremental / delta processing: update graphs without full recompute

Semantic Extraction

  • Entity extraction: named entity recognition, normalization, classification
  • Relation extraction: triplet generation from raw text using LLMs or rule-based methods
  • LLM-typed extraction: extraction with typed relation metadata
  • Deduplication v1: Jaro-Winkler similarity, basic blocking
  • Deduplication v2: blocking_v2, hybrid_v2, semantic_v2 strategies with max_candidates_per_entity
  • Triplet deduplication: dedup_triplets() for removing duplicate (subject, predicate, object) triples

Reasoning Engines

  • Forward chaining: Reasoner with IF/THEN string rules and dict facts
  • Rete network: ReteEngine for high-throughput production rule matching
  • Deductive reasoning: DeductiveReasoner for classical inference
  • Abductive reasoning: AbductiveReasoner for hypothesis generation from observations
  • SPARQL reasoning: SPARQLReasoner for query-based inference over RDF graphs
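Forward chaining repeatedly applies IF/THEN rules to the known facts until nothing new can be derived. The sketch below is a minimal, naive version over tuple facts, written to illustrate the technique; it does not use the Reasoner class, and the Fact and Rule encodings plus the unify, match_all, and forward_chain functions are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Fact = Tuple[str, ...]          # e.g. ("Person", "Socrates")
Rule = Tuple[List[Fact], Fact]  # (premises, conclusion); "?x" marks a variable


def unify(pattern: Fact, fact: Fact, bindings: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend bindings so that pattern matches fact, or return None."""
    if len(pattern) != len(fact):
        return None
    result = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if result.setdefault(p, f) != f:
                return None  # variable already bound to a different value
        elif p != f:
            return None
    return result


def match_all(premises: List[Fact], facts: Set[Fact],
              bindings: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Yield every set of bindings that satisfies all premises."""
    if not premises:
        yield bindings
        return
    for fact in facts:
        extended = unify(premises[0], fact, bindings)
        if extended is not None:
            yield from match_all(premises[1:], facts, extended)


def forward_chain(facts: Set[Fact], rules: List[Rule]) -> Set[Fact]:
    """Apply rules until a fixed point is reached; return only the derived facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Materialise matches first so `known` is not mutated while iterating
            for bindings in list(match_all(premises, known, {})):
                derived = tuple(bindings.get(term, term) for term in conclusion)
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known - facts


rules: List[Rule] = [
    ([("Person", "?x")], ("Mortal", "?x")),
    ([("Employee", "?x"), ("WorksAt", "?x", "?y")], ("HasEmployer", "?x", "?y")),
]
facts: Set[Fact] = {
    ("Person", "Socrates"),
    ("Employee", "Alice"),
    ("WorksAt", "Alice", "OpenAI"),
}

derived = forward_chain(facts, rules)
print(sorted(derived))
```

This naive loop re-matches every rule against every fact on each pass. A Rete network avoids that repeated work by caching partial matches, which is why it suits high-throughput rule sets.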

Provenance & Auditability

  • Entity provenance: ProvenanceTracker.track_entity(id, source_url, metadata)
  • Algorithm provenance: AlgorithmTrackerWithProvenance tracks computation lineage
  • Graph builder provenance: GraphBuilderWithProvenance records entity source lineage from URLs
  • W3C PROV-O compliant: lineage tracking across all modules
  • Change management: version control with checksums, audit trails, compliance support

Vector Store

  • Backends: FAISS, Pinecone, Weaviate, Qdrant, Milvus, PgVector, in-memory
  • Semantic search: top-k retrieval by embedding similarity
  • Hybrid search: vector + keyword with configurable weights
  • Filtered search: metadata-based filtering on any field
  • Custom similarity weights: tune retrieval per use case
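Hybrid search blends an embedding similarity with a keyword score using configurable weights, optionally after a metadata filter. The sketch below illustrates that scoring scheme with toy two-dimensional vectors. It is not the VectorStore implementation: the document layout, the keyword_overlap measure, and the hybrid_search signature (including metadata_filter) are assumptions.

```python
import math
from typing import Any, Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hybrid_search(query: str, query_vec: List[float], docs: List[Dict[str, Any]],
                  top_k: int = 5, vector_weight: float = 0.6, keyword_weight: float = 0.4,
                  metadata_filter: Optional[Dict[str, Any]] = None) -> List[Tuple[str, float]]:
    scored = []
    for doc in docs:
        # Skip documents whose metadata does not match every filter field
        if metadata_filter and any(doc["metadata"].get(k) != v for k, v in metadata_filter.items()):
            continue
        score = (vector_weight * cosine(query_vec, doc["vector"])
                 + keyword_weight * keyword_overlap(query, doc["text"]))
        scored.append((doc["id"], score))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


docs = [
    {"id": "doc_001", "text": "The Transformer architecture revolutionized NLP",
     "vector": [1.0, 0.0], "metadata": {"year": 2017}},
    {"id": "doc_002", "text": "BERT introduced bidirectional pre-training for language understanding",
     "vector": [0.6, 0.8], "metadata": {"year": 2018}},
]

ranked = hybrid_search("transformer pre-training", [0.6, 0.8], docs)
filtered = hybrid_search("transformer pre-training", [0.6, 0.8], docs, metadata_filter={"year": 2017})
print([doc_id for doc_id, _ in ranked])
print([doc_id for doc_id, _ in filtered])
```

Shifting weight from vector_weight to keyword_weight favours literal term matches over semantic closeness, which is the knob the "custom similarity weights" bullet refers to.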

🌐 Graph Database Support

  • AWS Neptune: Amazon Neptune graph database with IAM authentication
  • Apache AGE: PostgreSQL graph extension with openCypher via SQL
  • FalkorDB: native support; DecisionQuery and CausalChainAnalyzer work directly with FalkorDB row/header shapes

Data Ingestion

  • File formats: PDF, DOCX, HTML, JSON, CSV, Excel, PPTX, archives
  • Web crawl: WebIngestor with configurable depth
  • Databases: DBIngestor with SQL query support
  • Snowflake: SnowflakeIngestor with table/query ingestion, pagination, and key-pair/OAuth auth
  • Docling: advanced document parsing with table and layout extraction (PDF, DOCX, PPTX, XLSX)
  • Media: image OCR, audio/video metadata extraction

Export Formats

  • RDF: Turtle (.ttl), JSON-LD, N-Triples (.nt), XML via RDFExporter
  • Parquet: ParquetExporter for entities, relationships, and full KG export
  • ArangoDB AQL: ready-to-run INSERT statements via ArangoAQLExporter
  • OWL ontologies: export generated ontologies in Turtle or RDF/XML

Pipeline & Production

  • Pipeline builder: PipelineBuilder with stage chaining and parallel workers
  • Validation: PipelineValidator returns ValidationResult(valid, errors, warnings) before execution
  • Failure handling: FailureHandler with RetryPolicy and RetryStrategy (exponential backoff, fixed, etc.)
  • Parallel processing: configurable worker count per pipeline stage
  • LLM providers: 100+ models via LiteLLM (OpenAI, Anthropic, Cohere, Mistral, Ollama, and more)

Ontology

  • Auto-generation: derive OWL ontologies from knowledge graphs via OntologyGenerator
  • Import: load existing OWL, RDF, Turtle, JSON-LD ontologies via OntologyImporter
  • Validation: HermiT/Pellet compatible consistency checking

Modules

Module What it provides
semantica.context Context graphs, agent memory, decision tracking, causal analysis, precedent search, policy engine
semantica.kg Knowledge graph construction, graph algorithms, centrality, community detection, embeddings, link prediction, provenance
semantica.semantic_extract NER, relation extraction, event extraction, coreference, triplet generation, LLM-enhanced extraction
semantica.reasoning Forward chaining, Rete network, deductive, abductive, SPARQL reasoning, explanation generation
semantica.vector_store FAISS, Pinecone, Weaviate, Qdrant, Milvus, PgVector, in-memory; hybrid & filtered search
semantica.export RDF (Turtle/JSON-LD/N-Triples/XML), Parquet, ArangoDB AQL, CSV, YAML, OWL, graph formats
semantica.ingest Files (PDF, DOCX, CSV, HTML), web crawl, feeds, databases, Snowflake, MCP, email, repositories
semantica.ontology Auto-generation (6-stage pipeline), OWL/RDF export, import (OWL/RDF/Turtle/JSON-LD), validation, versioning
semantica.pipeline Pipeline DSL, parallel workers, validation, retry policies, failure handling, resource scheduling
semantica.graph_store Graph database backends: Neo4j, FalkorDB, Apache AGE, Amazon Neptune; Cypher queries
semantica.embeddings Text embedding generation: Sentence-Transformers, FastEmbed, OpenAI, BGE; similarity calculation
semantica.deduplication Entity deduplication, similarity scoring, merging, clustering; blocking and semantic strategies
semantica.provenance W3C PROV-O compliant end-to-end lineage tracking, source attribution, audit trails
semantica.parse Document parsing: PDF, DOCX, PPTX, HTML, code, email, structured data, media with OCR
semantica.split Document chunking: recursive, semantic, entity-aware, relation-aware, graph-based, ontology-aware
semantica.normalize Data normalization for text, entities, dates, numbers, quantities, languages, encodings
semantica.conflicts Multi-source conflict detection (value, type, relationship, temporal, logical) with resolution strategies
semantica.change_management Version storage, change tracking, checksums, audit trails, compliance support for KGs and ontologies
semantica.triplet_store RDF triplet store integration: Blazegraph, Jena, RDF4J; SPARQL queries and bulk loading
semantica.visualization Interactive and static visualization of KGs, ontologies, embeddings, analytics, and temporal graphs
semantica.seed Seed data management for initial KG construction from CSV, JSON, databases, and APIs
semantica.core Framework orchestration, configuration management, knowledge base construction, plugin system
semantica.llms LLM provider integrations: Groq, OpenAI, HuggingFace, LiteLLM
semantica.utils Shared utilities: logging, validation, exception handling, constants, types, progress tracking

⚡ Quick Start

import semantica
from semantica.context import AgentContext, ContextGraph
from semantica.vector_store import VectorStore

# Build an agent with structured context
context = AgentContext(
    vector_store=VectorStore(backend="faiss", dimension=768),
    knowledge_graph=ContextGraph(advanced_analytics=True),
    decision_tracking=True,
    kg_algorithms=True,
)

# Store memory
memory_id = context.store(
    "GPT-4 outperforms GPT-3.5 on reasoning benchmarks by 40%",
    conversation_id="research_session_1",
)

# Record a decision with full context
decision_id = context.record_decision(
    category="model_selection",
    scenario="Choose LLM for production reasoning pipeline",
    reasoning="GPT-4 benchmark advantage justifies 3x cost increase",
    outcome="selected_gpt4",
    confidence=0.91,
    entities=["gpt4", "gpt35", "reasoning_pipeline"],
)

# Find similar decisions from history
precedents = context.find_precedents("model selection reasoning", limit=5)

# Analyze downstream influence of this decision
influence = context.analyze_decision_influence(decision_id)

📖 Full Quick Start • 🍳 Cookbook Examples • 💬 Join Discord • ⭐ Star Us


Core Value Proposition

  • Trustworthy: conflict detection & validation; rule-based governance; production-grade QA
  • Explainable: transparent reasoning paths; entity relationships & ontologies; multi-hop graph reasoning
  • Auditable: complete provenance tracking; W3C PROV-O compliant lineage; source tracking & integrity verification

Key Features & Benefits

Not Just Another Agentic Framework

Semantica complements LangChain, LlamaIndex, AutoGen, CrewAI, Google ADK, Agno, and other frameworks to enhance your agents with:

  • Context Graphs: structured knowledge representation with entity relationships and semantic context
  • Decision Tracking: complete decision lifecycle management with precedent search and causal analysis
  • KG Algorithms: advanced graph analytics including centrality, community detection, and embeddings
  • Vector Store Integration: hybrid search with custom similarity weights and advanced filtering
  • Auditable: complete provenance tracking with W3C PROV-O compliance
  • Explainable: transparent reasoning paths with entity relationships
  • Provenance-Aware: end-to-end lineage from documents to responses
  • Validated: built-in conflict detection, deduplication, QA
  • Governed: rule-based validation and semantic consistency
  • Version Control: enterprise-grade change management with integrity verification

Perfect For High-Stakes Use Cases

  • 🏥 Healthcare: clinical decisions, drug interactions, patient safety
  • 💰 Finance: fraud detection, regulatory support, risk assessment
  • ⚖️ Legal: evidence-backed research, contract analysis, case law reasoning
  • 🔒 Cybersecurity: threat attribution, incident response
  • 🏛️ Government: policy decisions, classified info
  • 🏭 Infrastructure: power grids, transportation
  • 🚗 Autonomous: decision logs, safety validation

Powers Your AI Stack

  • Context Graphs: structured knowledge representation with entity relationships and semantic context
  • Decision Tracking Systems: complete decision lifecycle management with precedent search and causal analysis
  • GraphRAG Systems: retrieval with graph reasoning and hybrid search using KG algorithms
  • AI Agents: trustworthy, accountable multi-agent systems with semantic memory and decision history
  • Reasoning Models: explainable AI decisions with reasoning paths and influence analysis
  • Enterprise AI: governed, auditable platforms that support compliance and policy enforcement

Integrations

  • Docling Support: document parsing with table extraction (PDF, DOCX, PPTX, XLSX)
  • AWS Neptune: Amazon Neptune graph database support with IAM authentication
  • Apache AGE: PostgreSQL graph extension backend (openCypher via SQL)
  • Snowflake: native ingestion with SnowflakeIngestor; table/query ingestion, pagination, key-pair & OAuth auth
  • Custom Ontology Import: import existing ontologies (OWL, RDF, Turtle, JSON-LD)

Built for environments where every answer must be explainable and governed.


Context Graphs & Decision Tracking

Semantica's flagship module. It tracks every decision your agent makes as a structured graph node, with causal links, precedent search, impact analysis, and policy enforcement.

from semantica.context import ContextGraph

graph = ContextGraph(advanced_analytics=True)

# Record a loan approval decision
loan_id = graph.add_decision(
    category="loan_approval",
    scenario="Mortgage application: 780 credit score, 28% DTI",
    reasoning="Strong credit history, stable income for 8 years, low DTI",
    outcome="approved",
    confidence=0.95,
)

# Record a downstream dependent decision
rate_id = graph.add_decision(
    category="interest_rate",
    scenario="Set rate for approved mortgage",
    reasoning="Prime applicant qualifies for lowest tier rate",
    outcome="rate_set_6.2pct",
    confidence=0.98,
)

# Link the decisions causally
graph.add_causal_relationship(loan_id, rate_id, relationship_type="enables")

# Find similar past decisions using hybrid similarity
similar    = graph.find_similar_decisions("mortgage approval", max_results=5)
chain      = graph.trace_decision_chain(loan_id)
impact     = graph.analyze_decision_impact(loan_id)
compliance = graph.check_decision_rules({"category": "loan_approval", "confidence": 0.95})
insights   = graph.get_decision_insights()

# Agent-level context: memory and decision tracking on top of a ContextGraph
from semantica.context import AgentContext, AgentMemory, ContextGraph
from semantica.vector_store import VectorStore

context = AgentContext(
    vector_store=VectorStore(backend="faiss", dimension=768),
    knowledge_graph=ContextGraph(advanced_analytics=True),
    decision_tracking=True,
    graph_expansion=True,
    kg_algorithms=True,
)

context.store("Regulation EU 2024/1689 requires explainability for high-risk AI", conversation_id="compliance_review")
context.store("Our fraud model flags 0.3% of transactions", conversation_id="compliance_review")

results = context.retrieve("AI regulation explainability requirements", limit=3)
history = context.get_conversation_history("compliance_review")
stats   = context.get_statistics()

Knowledge Graphs

from semantica.kg import KnowledgeGraph, Entity, Relationship
from semantica.kg import CentralityAnalyzer, NodeEmbedder, LinkPredictor

kg = KnowledgeGraph()

kg.add_entity(Entity(id="transformer", label="Transformer", type="Architecture",
                     properties={"year": 2017, "paper": "Attention Is All You Need"}))
kg.add_entity(Entity(id="bert", label="BERT", type="Model",
                     properties={"year": 2018, "parameters": "340M"}))
kg.add_entity(Entity(id="gpt4", label="GPT-4", type="Model", properties={"year": 2023}))

kg.add_relationship(Relationship(source="bert", target="transformer", type="based_on"))
kg.add_relationship(Relationship(source="gpt4", target="transformer", type="based_on"))

# Graph algorithms
analyzer    = CentralityAnalyzer(kg)
centrality  = analyzer.compute_pagerank()
betweenness = analyzer.compute_betweenness()

# Node embeddings (Node2Vec)
embedder   = NodeEmbedder()
embeddings = embedder.compute_embeddings(kg, node_labels=["Model"], relationship_types=["based_on"])

# Link prediction
predictor = LinkPredictor()
score     = predictor.score_link(kg, "gpt4", "bert", method="common_neighbors")

models      = kg.find_nodes(type="Model")
descendants = kg.get_neighbors("transformer", direction="incoming")

Semantic Extraction

from semantica.semantic_extract import extract_entities, extract_relations, extract_triplets

text = """
OpenAI released GPT-4 in March 2023. Microsoft integrated GPT-4 into Azure OpenAI Service.
Anthropic, founded by former OpenAI researchers, released Claude as a competing model.
"""

entities = extract_entities(text)
# → [Entity(label="OpenAI", type="Organization"), Entity(label="GPT-4", type="Model"), ...]

relations = extract_relations(text)
# → [Relation(source="OpenAI", type="released", target="GPT-4"), ...]

triplets = extract_triplets(text)

# Entity deduplication
from semantica.deduplication import DuplicateDetector

entities = [
    {"id": "e1", "name": "OpenAI Inc.", "type": "Organization"},
    {"id": "e2", "name": "Open AI",    "type": "Organization"},
    {"id": "e3", "name": "Anthropic",  "type": "Organization"},
]

detector   = DuplicateDetector()
duplicates = detector.detect_duplicates(entities, threshold=0.85)
# → [("e1", "e2")]

duplicates_v2 = detector.detect_duplicates(entities, threshold=0.85, strategy="semantic_v2")

Reasoning Engines

from semantica.reasoning import Reasoner

reasoner = Reasoner()
reasoner.add_rule("IF Person(?x) THEN Mortal(?x)")
reasoner.add_rule("IF Employee(?x) AND WorksAt(?x, ?y) THEN HasEmployer(?x, ?y)")

results = reasoner.infer_facts([
    "Person(Socrates)",
    "Employee(Alice)",
    {"source_name": "Alice", "target_name": "OpenAI", "type": "WorksAt"},
])
# → ["Mortal(Socrates)", "HasEmployer(Alice, OpenAI)"]

# Rete network for production rule matching
from semantica.reasoning import ReteEngine

rete = ReteEngine()
rete.add_rule({
    "name": "flag_high_risk_transaction",
    "conditions": [
        {"field": "amount",  "operator": ">",  "value": 10000},
        {"field": "country", "operator": "in", "value": ["IR", "KP", "SY"]},
    ],
    "action": "flag_for_compliance_review",
})
matches = rete.match({"amount": 15000, "country": "IR", "id": "txn_9921"})

# Deductive and abductive reasoning
from semantica.reasoning import DeductiveReasoner, AbductiveReasoner

deductive = DeductiveReasoner()
deductive.add_axiom("All transformers use attention mechanisms")
deductive.add_fact("BERT is a transformer")
conclusion = deductive.reason("Does BERT use attention?")

abductive = AbductiveReasoner()
abductive.add_observation("The model accuracy dropped 12% after deployment")
hypotheses = abductive.generate_hypotheses()
# → ["Distribution shift in production data", "Preprocessing pipeline mismatch", ...]

Provenance Tracking

W3C PROV-O compliant lineage tracking. Every fact traces back to its origin.

from semantica.kg import ProvenanceTracker, AlgorithmTrackerWithProvenance

tracker = ProvenanceTracker()
tracker.track_entity("gpt4_benchmark",
    source_url="https://openai.com/research/gpt-4",
    metadata={"metric": "MMLU", "score": 86.4})

algo_tracker = AlgorithmTrackerWithProvenance(provenance=True)
algo_tracker.track_graph_construction(
    algorithm="node2vec",
    input_data={"nodes": 1500, "edges": 4200},
    parameters={"dimensions": 128, "walk_length": 80},
)

sources      = tracker.get_all_sources("gpt4_benchmark")
all_entities = tracker.get_all_entities()

Vector Store & Hybrid Search

from semantica.vector_store import VectorStore

vs = VectorStore(backend="faiss", dimension=768)

vs.store("The Transformer architecture revolutionized NLP",
         metadata={"source": "arxiv", "year": 2017}, id="doc_001")
vs.store("BERT introduced bidirectional pre-training for language understanding",
         metadata={"source": "arxiv", "year": 2018}, id="doc_002")

results = vs.search("attention mechanisms in language models", top_k=5)

results = vs.hybrid_search(
    query="transformer pre-training",
    top_k=10,
    vector_weight=0.6,
    keyword_weight=0.4,
)

results = vs.search("pre-training", top_k=5, filter={"year": 2018})

Data Ingestion

from semantica.ingest import FileIngestor, WebIngestor, DBIngestor

file_ingestor = FileIngestor(recursive=True)
docs = file_ingestor.ingest("./research_papers/")

web_ingestor = WebIngestor(max_depth=2)
web_docs = web_ingestor.ingest("https://arxiv.org/abs/1706.03762")

db_ingestor = DBIngestor(connection_string="postgresql://user:pass@localhost/kg_db")
db_docs = db_ingestor.ingest(query="SELECT title, abstract FROM papers WHERE year >= 2020")

all_sources = docs + web_docs + db_docs

from semantica.parse import DoclingParser

# Advanced table and layout extraction
docling = DoclingParser()
parsed  = docling.parse("financial_report.pdf")

from semantica.ingest import SnowflakeIngestor

# Connect to Snowflake and ingest a table
ingestor = SnowflakeIngestor(
    account="myorg-myaccount",
    user="analyst",
    password="...",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

# Ingest a table with optional filtering and pagination
data = ingestor.ingest_table(
    table_name="customer_events",
    where="event_date >= '2024-01-01'",
    limit=10000,
)

# Or run a custom SQL query
data = ingestor.ingest_query(
    query="SELECT id, content, tags FROM knowledge_base WHERE active = TRUE",
    batch_size=500,
)

# Convert to Semantica documents for downstream pipeline use
docs = ingestor.export_as_documents(data, id_field="id", text_fields=["content"])

# Key-pair and OAuth auth are also supported via env vars:
# SNOWFLAKE_PRIVATE_KEY_PATH, SNOWFLAKE_TOKEN, SNOWFLAKE_AUTHENTICATOR

Export

from semantica.export import RDFExporter, ParquetExporter, ArangoAQLExporter

rdf_exporter = RDFExporter()
turtle   = rdf_exporter.export_to_rdf(kg, format="turtle")
jsonld   = rdf_exporter.export_to_rdf(kg, format="json-ld")
ntriples = rdf_exporter.export_to_rdf(kg, format="nt")

parquet_exporter = ParquetExporter()
parquet_exporter.export_entities(kg,        path="output/entities.parquet")
parquet_exporter.export_relationships(kg,   path="output/relationships.parquet")
parquet_exporter.export_knowledge_graph(kg, path="output/")

aql_exporter = ArangoAQLExporter()
aql_exporter.export(kg, path="output/insert.aql")

Pipeline Orchestration

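from semantica.pipeline import PipelineBuilder, PipelineValidator, FailureHandler
from semantica.pipeline import RetryPolicy, RetryStrategy
# Stage components used below, imported as in the earlier sections
from semantica.ingest import FileIngestor
from semantica.semantic_extract import extract_triplets
from semantica.deduplication import DuplicateDetector
from semantica.kg import KnowledgeGraph
from semantica.export import RDFExporter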

builder = (
    PipelineBuilder()
    .add_stage("ingest",      FileIngestor(recursive=True))
    .add_stage("extract",     extract_triplets)
    .add_stage("deduplicate", DuplicateDetector())
    .add_stage("build_kg",    KnowledgeGraph())
    .add_stage("export",      RDFExporter())
    .with_parallel_workers(4)
)

validator = PipelineValidator()
result    = validator.validate(builder)
if result.valid:
    pipeline = builder.build()
    pipeline.run(input_path="./documents/")

retry_policy = RetryPolicy(strategy=RetryStrategy.EXPONENTIAL_BACKOFF, max_retries=3)
handler = FailureHandler()
# last_error stands for the exception caught from a failed stage
handler.handle_failure(error=last_error, policy=retry_policy, retry_count=1)

Ontology

from semantica.ontology import OntologyGenerator, OntologyImporter

generator = OntologyGenerator()
ontology  = generator.generate(kg)
generator.export(ontology, path="domain_ontology.owl", format="turtle")

importer = OntologyImporter()
ontology = importer.load("existing_ontology.owl")
ontology = importer.load("schema.ttl", format="turtle")
ontology = importer.load("context.jsonld")

Integrations

Graph Databases

  • AWS Neptune: Amazon Neptune with IAM authentication
  • Apache AGE: PostgreSQL + openCypher via SQL
  • FalkorDB: native support for decision queries and causal analysis

Vector Databases

  • FAISS: high-performance dense vector search
  • Pinecone: serverless and pod-based managed vector database (pip install "semantica[vectorstore-pinecone]")
  • Weaviate: GraphQL-based vector store with rich schema management (pip install "semantica[vectorstore-weaviate]")
  • Qdrant: collection-based store with payload filtering (pip install "semantica[vectorstore-qdrant]")
  • Milvus: scalable store with partition support and multiple index types (pip install "semantica[vectorstore-milvus]")
  • PgVector: PostgreSQL pgvector extension with JSONB metadata (pip install "semantica[vectorstore-pgvector]")
  • In-memory: lightweight, zero-dependency store for development and testing
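
As an illustration of what a zero-dependency in-memory store does, here is a brute-force cosine-similarity search in plain Python. It is a sketch only: `InMemoryVectorStore` and its methods are hypothetical names, not Semantica's API, and the three-dimensional vectors are toy data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical brute-force store: one linear scan per query."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, item_id: str, vector: Sequence[float]) -> None:
        """Insert or overwrite the vector stored under item_id."""
        self._vectors[item_id] = list(vector)

    def search(self, query: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        """Return up to top_k (item_id, score) pairs, highest similarity first."""
        scored = [
            (item_id, cosine_similarity(query, vec))
            for item_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add("cats", [1.0, 0.0, 0.0])
store.add("dogs", [0.9, 0.1, 0.0])
store.add("stocks", [0.0, 0.0, 1.0])

hits = store.search([1.0, 0.0, 0.0], top_k=2)
```

A linear scan like this is adequate for tests and small corpora; the indexed backends listed above exist because it does not scale to millions of vectors.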

Data Sources

  • Snowflake: SnowflakeIngestor for table/query ingestion, schema introspection, pagination, and multiple auth methods (password, key-pair, OAuth, SSO) (pip install "semantica[db-snowflake]")

Document Parsing

  • Docling: PDF, DOCX, PPTX, XLSX with table and layout extraction

LLM Providers

  • 100+ models via LiteLLM: OpenAI, Anthropic, Cohere, Mistral, Ollama, Azure, AWS Bedrock, and more

AI Frameworks

  • Complements LangChain, LlamaIndex, AutoGen, CrewAI, Google ADK

Export

  • RDF: Turtle, JSON-LD, N-Triples, XML · Parquet · ArangoDB AQL

Installation

# Core
pip install semantica

# With all optional dependencies
# (quotes keep shells such as zsh from treating the brackets as a glob pattern)
pip install "semantica[all]"

# Vector store backends (install only what you need)
pip install "semantica[vectorstore-pinecone]"
pip install "semantica[vectorstore-weaviate]"
pip install "semantica[vectorstore-qdrant]"
pip install "semantica[vectorstore-milvus]"
pip install "semantica[vectorstore-pgvector]"

# Snowflake ingestion
pip install "semantica[db-snowflake]"

# From source
git clone https://github.com/Hawksight-AI/semantica.git
cd semantica
pip install -e ".[dev]"

# Run tests
pytest tests/
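
After installing, it can be useful to confirm which optional backends are importable in the current environment. The sketch below uses only the standard library. The mapping from extras to import names is an assumption based on the usual client packages for each backend, not something taken from Semantica's packaging metadata, and `installed_version` and `module_available` are hypothetical helpers.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_available(module_name: str) -> bool:
    """True if the module can be located without actually importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


# Assumed import names for each optional backend (illustrative only).
OPTIONAL_MODULES: Dict[str, str] = {
    "vectorstore-pinecone": "pinecone",
    "vectorstore-weaviate": "weaviate",
    "vectorstore-qdrant": "qdrant_client",
    "vectorstore-milvus": "pymilvus",
    "vectorstore-pgvector": "pgvector",
}

backend_status = {
    extra: module_available(module) for extra, module in OPTIONAL_MODULES.items()
}

print("semantica version:", installed_version("semantica"))
for extra, available in backend_status.items():
    print(f"{extra}: {'available' if available else 'not installed'}")
```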

๐Ÿค Community & Support

Join Our Community

Channel             Purpose
Discord             Real-time help, showcases
GitHub Discussions  Q&A, feature requests

Learning Resources

Enterprise Support

Enterprise support, professional services, and commercial licensing will be available in the future. For now, we offer community support through Discord and GitHub Discussions.

Current Support:

Future Enterprise Offerings:

  • Professional support with SLA
  • Enterprise licensing
  • Custom development services
  • Priority feature requests
  • Dedicated support channels

Stay tuned for updates!

  • AI / ML engineers: GraphRAG, explainable agents, decision tracing
  • Data engineers: governed semantic pipelines with full provenance
  • Knowledge engineers: ontology management and KG construction at scale
  • High-stakes domains: healthcare, finance, legal, cybersecurity, government

Resources


Contributing

All contributions are welcome: bug fixes, new features, tests, and docs.

  1. Fork the repo and create a branch
  2. pip install -e ".[dev]"
  3. Write tests alongside your changes
  4. Open a PR and tag @KaifAhmad1 for review

See CONTRIBUTING.md for full guidelines.


MIT License · Built by Hawksight AI · ⭐ Star on GitHub

GitHub • Discord
