๐ง Semantica - An Open Source Framework for building Semantic Layers and Knowledge Engineering
Project description
๐ง Semantica
Open Source Framework for Semantic Intelligence & Knowledge Engineering
Transform chaotic data into intelligent knowledge.
The missing fabric between raw data and AI engineering. A comprehensive open-source framework for building semantic layers and knowledge engineering systems that transform unstructured data into AI-ready knowledge โ powering Knowledge Graph-Powered RAG (GraphRAG), AI Agents, Multi-Agent Systems, and AI applications with structured semantic knowledge.
๐ 100% Open Source โข ๐ MIT Licensed โข ๐ Production Ready โข ๐ Community Driven
๐ What is Semantica?
Semantica is the first comprehensive open-source framework that bridges the critical gap between raw data chaos and AI-ready knowledge. It's not just another data processing libraryโit's a complete semantic intelligence platform that transforms unstructured information into structured, queryable knowledge graphs that power the next generation of AI applications.
The Vision
In the era of AI agents and autonomous systems, data alone isn't enough. Context is king. Semantica provides the semantic infrastructure that enables AI systems to truly understand, reason about, and act upon information with human-like comprehension.
What Makes Semantica Different?
| Traditional Approaches | Semantica's Approach |
|---|---|
| Process data as isolated documents | Understands semantic relationships across all content |
| Extract text and store vectors | Builds knowledge graphs with meaningful connections |
| Generic entity recognition | General-purpose ontology generation and validation |
| Manual schema definition | Automatic semantic modeling from content patterns |
| Disconnected data silos | Unified semantic layer across all data sources |
| Basic quality checks | Production-grade QA with conflict detection & resolution |
๐ฏ The Problem We Solve
The Data-to-AI Gap
Modern organizations face a fundamental challenge: the semantic gap between raw data and AI systems.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ THE SEMANTIC GAP โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ Raw Data (What You Have) AI Systems (What They Need) โ
โ โโ PDFs, emails, docs โโ Structured entities โ
โ โโ Multiple formats โโ Semantic relationships โ
โ โโ Inconsistent schemas โโ Formal ontologies โ
โ โโ Siloed sources โโ Connected knowledge โ
โ โโ No semantic meaning โโ Context-aware reasoning โ
โ โโ Unvalidated content โโ Quality-assured knowledgeโ
โ โ
โ โ Missing: The Semantic Layer โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Real-World Consequences
Without a semantic layer:
-
RAG Systems Fail ๐ด
- Vector search alone misses crucial relationships
- No graph traversal for context expansion
- 30% lower accuracy than hybrid approaches
-
AI Agents Hallucinate ๐ด
- No ontological constraints to validate actions
- Missing semantic routing for intent understanding
- No persistent memory across conversations
-
Multi-Agent Systems Can't Coordinate ๐ด
- No shared semantic models for collaboration
- Unable to validate actions against domain rules
- Conflicting knowledge representations
-
Knowledge Is Untrusted ๐ด
- Duplicate entities pollute graphs
- Conflicting facts from different sources
- No provenance tracking or validation
The Semantica Solution
Semantica fills this gap with a complete semantic intelligence framework:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SEMANTICA FRAMEWORK โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ ๐ฅ Input Layer ๐ง Semantic Layer ๐ค Output Layerโ
โ โโ 50+ data formats โโ Entity extraction โโ Knowledge โ
โ โโ Live feeds โโ Relationship mapping โ graphs โ
โ โโ APIs & streams โโ Ontology generation โโ Vector โ
โ โโ Archives โโ Context engineering โ embeddings โ
โ โโ Multi-modal โโ Quality assurance โโ Ontologies โ
โ โ
โ โ
Powers: GraphRAG, AI Agents, Multi-Agent โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ฆ Installation
Prerequisites
- Python: 3.8 or higher (3.9+ recommended)
- pip: Latest version
Install from Source (Current Method)
Since Semantica is currently in development, install from the local source:
# Navigate to the semantica directory
cd path/to/semantica
# Install in editable mode with core dependencies
pip install -e .
# Or install with all optional dependencies
pip install -e ".[all]"
Development Installation
# Clone the repository (if not already cloned)
git clone https://github.com/semantica-dev/semantica.git
cd semantica
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
Custom Installation
# Install specific extras as needed
pip install -e ".[llm-openai]" # LLM providers
pip install -e ".[graph-neo4j]" # Graph databases
pip install -e ".[vector-pinecone]" # Vector stores
pip install -e ".[dev]" # Development tools
pip install -e ".[gpu]" # GPU support
Verify Installation
python -c "import semantica; print(semantica.__version__)"
Note: Once published to PyPI, you'll be able to install with
pip install semantica
โจ Core Capabilities
1. ๐ Universal Data Ingestion
Process 50+ file formats with intelligent semantic extraction:
๐ Documents
|
๐ Web & Feeds
|
๐พ Structured Data
|
๐ง Communication
|
๐๏ธ Archives
|
๐ฌ Scientific
|
Example: Multi-Source Ingestion
from semantica.ingest import (
FileIngestor,
WebIngestor,
FeedIngestor,
DBIngestor,
StreamIngestor,
EmailIngestor
)
# Initialize ingestors with configuration
file_ingestor = FileIngestor(
recursive=True,
max_file_size=100 * 1024 * 1024, # 100MB
supported_formats=["pdf", "docx", "xlsx", "pptx", "txt", "md"]
)
web_ingestor = WebIngestor(
max_depth=3,
respect_robots_txt=True,
delay_between_requests=1.0
)
feed_ingestor = FeedIngestor(
max_items=1000,
update_interval=3600 # 1 hour
)
# Ingest from multiple sources
sources = []
# File ingestion
sources.extend(file_ingestor.ingest("documents/", formats=["pdf", "docx", "xlsx"]))
sources.extend(file_ingestor.ingest("data/archive.zip", extract_archives=True))
# Web ingestion
sources.extend(web_ingestor.ingest("https://example.com/articles"))
sources.extend(web_ingestor.ingest("https://blog.company.com", patterns=["*.html"]))
# Feed ingestion
sources.extend(feed_ingestor.ingest("https://example.com/rss"))
sources.extend(feed_ingestor.ingest("https://news.ycombinator.com/rss"))
# Database ingestion
db_ingestor = DBIngestor(connection_string="postgresql://user:pass@localhost/db")
sources.extend(db_ingestor.ingest(
query="SELECT title, content, author FROM articles",
metadata={"source": "articles_db", "version": "1.0"}
))
print(f"โ
Ingested {len(sources)} sources")
for source in sources[:5]:
print(f" - {source.filename} ({source.format}, {source.size} bytes)")
# Output:
# โ
Ingested 1,247 sources
# - document1.pdf (pdf, 245678 bytes)
# - report.docx (docx, 156789 bytes)
# - article.html (html, 89456 bytes)
# - feed_item.xml (rss, 12345 bytes)
# - db_record.json (json, 5678 bytes)
2. ๐ง Semantic Intelligence Engine
Transform raw text into structured semantic knowledge with state-of-the-art NLP and AI models.
Example: Complete Extraction Pipeline
from semantica import Semantica
from semantica.semantic_extract import (
NamedEntityRecognizer,
RelationExtractor,
EventDetector,
TripleExtractor,
CoreferenceResolver,
SemanticAnalyzer
)
# Sample text
text = """
Apple Inc., founded by Steve Jobs in 1976, announced its acquisition of Beats
Electronics for $3 billion on May 28, 2014. Dr. Dre and Jimmy Iovine, co-founders
of Beats, joined Apple's executive team. The acquisition included Beats Music
streaming service and Beats Electronics hardware.
"""
# Option 1: High-level API (recommended for quick start)
core = Semantica(
ner_model="transformer",
relation_strategy="hybrid",
enable_coreference=True
)
results = core.extract_semantics(text)
# Option 2: Low-level API (for fine-grained control)
ner = NamedEntityRecognizer(model="transformer", lang="en")
rel_extractor = RelationExtractor(strategy="hybrid", confidence_threshold=0.7)
event_detector = EventDetector()
triple_extractor = TripleExtractor()
coreference_resolver = CoreferenceResolver()
semantic_analyzer = SemanticAnalyzer()
# Extract with full pipeline
entities = ner.extract(text)
entities = coreference_resolver.resolve(text, entities)
relationships = rel_extractor.extract(text, entities)
events = event_detector.detect(text, entities)
triples = triple_extractor.extract(text, entities, relationships, events)
semantic_analysis = semantic_analyzer.analyze_semantics(text, entities, relationships)
# === EXTRACTED ENTITIES ===
print(f"Entities found: {len(results.entities)}\n")
for entity in results.entities:
print(f"- {entity.text} ({entity.type}, confidence={entity.confidence:.2f}, "
f"span=({entity.start}, {entity.end}))")
# Output:
# - Apple Inc. (Organization, confidence=0.98, span=(0, 10))
# - Steve Jobs (Person, confidence=0.97, span=(28, 38))
# - 1976 (Date, confidence=1.00, span=(42, 46))
# - Beats Electronics (Organization, confidence=0.95, span=(85, 102))
# - $3 billion (Money, confidence=0.99, span=(107, 117))
# - May 28, 2014 (Date, confidence=0.98, span=(121, 133))
# - Dr. Dre (Person, confidence=0.97, span=(135, 142))
# - Jimmy Iovine (Person, confidence=0.94, span=(147, 159))
# === EXTRACTED RELATIONSHIPS ===
print(f"\nRelationships found: {len(results.relationships)}\n")
for rel in results.relationships[:3]:
print(f"{rel.subject} --[{rel.predicate}]--> {rel.object} "
f"(confidence={rel.confidence:.2f})")
# Output:
# Apple Inc. --[founded_by]--> Steve Jobs (confidence=0.95)
# Apple Inc. --[acquired]--> Beats Electronics (confidence=0.92)
# Dr. Dre --[co-founded]--> Beats Electronics (confidence=0.89)
# === DETECTED EVENTS ===
print(f"\nEvents detected: {len(events)}\n")
for event in events[:2]:
print(f"- {event.type}: {event.description} "
f"(participants={[p.name for p in event.participants]})")
# === GENERATED TRIPLES ===
print(f"\nTriples generated: {len(results.triples)}\n")
for triple in results.triples[:5]:
print(f" {triple.subject} {triple.predicate} {triple.object}")
# Output:
# <Apple_Inc> <founded_by> <Steve_Jobs>
# <Apple_Inc> <acquired> <Beats_Electronics>
# <acquisition_1> <amount> "$3B"
# <acquisition_1> <date> "2014-05-28"
# <Dr_Dre> <co-founded> <Beats_Electronics>
Advanced Extraction with Custom Models and Configuration
from semantica.semantic_extract import (
NamedEntityRecognizer,
RelationExtractor,
EventDetector,
TripleExtractor,
CoreferenceResolver,
SemanticAnalyzer,
LLMEnhancer,
ExtractionValidator
)
# Initialize specialized extractors with custom configuration
ner = NamedEntityRecognizer(
model="transformer", # or "spacy", "stanza", "custom"
lang="en",
entities=["PERSON", "ORG", "LOC", "DATE", "MONEY"],
confidence_threshold=0.7,
use_llm_enhancement=True
)
rel_extractor = RelationExtractor(
strategy="hybrid", # "rule-based", "ml-based", "hybrid", "llm-based"
confidence_threshold=0.7,
max_relationships_per_entity=10
)
event_detector = EventDetector(
event_types=["ACQUISITION", "FOUNDING", "PARTNERSHIP", "ANNOUNCEMENT"],
min_confidence=0.75
)
triple_extractor = TripleExtractor(
format="rdf", # "rdf", "property_graph", "custom"
validate_triples=True
)
coreference_resolver = CoreferenceResolver(
method="neural", # "rule-based", "neural", "hybrid"
resolve_pronouns=True
)
llm_enhancer = LLMEnhancer(
provider="openai",
model="gpt-4",
temperature=0.1
)
validator = ExtractionValidator(
validate_entities=True,
validate_relationships=True,
schema_validation=True
)
# Extract with full pipeline
entities = ner.extract(text)
entities = coreference_resolver.resolve(text, entities)
entities = llm_enhancer.enhance_entities(text, entities)
relationships = rel_extractor.extract(text, entities)
relationships = llm_enhancer.enhance_relationships(text, relationships)
events = event_detector.detect(text, entities)
triples = triple_extractor.extract(text, entities, relationships, events)
# Validate extractions
validation_results = validator.validate(
text=text,
entities=entities,
relationships=relationships,
triples=triples
)
# Semantic analysis
semantic_analyzer = SemanticAnalyzer()
analysis = semantic_analyzer.analyze_semantics(
text=text,
entities=entities,
relationships=relationships
)
print(f"โ
Entities: {len(entities)} (validated: {validation_results.entities_valid})")
print(f"โ
Relationships: {len(relationships)} (validated: {validation_results.relationships_valid})")
print(f"โ
Events: {len(events)}")
print(f"โ
Triples: {len(triples)} (validated: {validation_results.triples_valid})")
print(f"โ
Semantic coherence: {analysis.coherence_score:.2f}")
3. ๐ธ๏ธ Knowledge Graph Construction
Build production-ready knowledge graphs from any data source with automatic entity resolution, relationship inference, and graph optimization.
Example: Building Knowledge Graph
from semantica import Semantica
from semantica.kg import (
GraphBuilder,
EntityResolver,
GraphAnalyzer,
CentralityCalculator,
CommunityDetector
)
from semantica.export import RDFExporter, JSONExporter
# Sample documents
documents = [
"""Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The company is headquartered in Cupertino, California.""",
"""In 2014, Apple acquired Beats Electronics for $3 billion. Dr. Dre and
Jimmy Iovine joined Apple's executive team.""",
"""Tim Cook became CEO in 2011 after Jobs stepped down. Under Cook's leadership,
Apple expanded into services generating over $80 billion annually."""
]
# Option 1: High-level API (recommended for quick start)
core = Semantica(
graph_db="neo4j", # or "networkx", "rdflib", "memgraph"
merge_entities=True,
resolve_conflicts=True
)
kg = core.build_knowledge_graph(
sources=documents,
merge_entities=True,
resolve_conflicts=True,
generate_embeddings=True
)
# Option 2: Low-level API (for fine-grained control)
graph_builder = GraphBuilder(
merge_entities=True,
entity_resolution_strategy="fuzzy",
resolve_conflicts=True,
enable_temporal=True, # Enable temporal knowledge graph features
temporal_granularity="day",
track_history=True,
version_snapshots=True
)
entity_resolver = EntityResolver(
similarity_threshold=0.85,
merge_strategy="highest_confidence"
)
# Build graph step by step
kg = graph_builder.build(
sources=documents,
entity_resolver=entity_resolver
)
# Resolve entities
kg = entity_resolver.resolve(kg)
# Graph Statistics
print("=== GRAPH STATISTICS ===")
print(f"Nodes: {kg.node_count}")
print(f"Edges: {kg.edge_count}")
print(f"Entity Types: {sorted(kg.entity_types)}")
print(f"Relationship Types: {sorted(kg.relationship_types)}")
print(f"Graph Density: {kg.density:.3f}")
print(f"Connected Components: {kg.connected_components}\n")
# Output:
# Nodes: 25
# Edges: 38
# Entity Types: ['Date', 'Location', 'Money', 'Organization', 'Person', 'Product']
# Relationship Types: ['acquired', 'became', 'expanded_into', 'founded', 'headquartered_in', 'joined', 'works_for']
# Graph Density: 0.127
# Connected Components: 1
# Query the graph
result = kg.query(
"Who founded Apple Inc.?",
return_format="structured"
)
print(f"Q: Who founded Apple Inc.?")
print(f"A: {result.answer}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Supporting Entities: {[e.name for e in result.supporting_entities]}")
print(f"Evidence Paths: {result.evidence_paths}\n")
# Output:
# Q: Who founded Apple Inc.?
# A: Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
# Confidence: 0.98
# Supporting Entities: ['Steve Jobs', 'Steve Wozniak', 'Ronald Wayne', 'Apple Inc.']
# Evidence Paths: [['Apple Inc.', 'founded_by', 'Steve Jobs'], ...]
# Export to multiple formats
rdf_exporter = RDFExporter()
rdf_exporter.export(kg, "output.ttl", format="turtle")
json_exporter = JSONExporter()
json_exporter.export(kg, "output.jsonld", format="json-ld")
# Export to graph databases
kg.to_neo4j("bolt://localhost:7687", "neo4j", "password")
kg.to_memgraph("localhost", 7687, username="admin", password="password")
print("โ
Graph exported to multiple formats!")
Temporal Knowledge Graph Example
from semantica import Semantica
from semantica.kg import (
GraphBuilder,
TemporalGraphQuery,
TemporalPatternDetector,
TemporalVersionManager,
GraphAnalyzer
)
from datetime import datetime, timedelta
# Initialize with temporal support
core = Semantica(
graph_db="neo4j",
enable_temporal=True,
temporal_granularity="day"
)
# Build temporal knowledge graph
graph_builder = GraphBuilder(
enable_temporal=True,
temporal_granularity="day",
track_history=True,
version_snapshots=True
)
kg = graph_builder.build(
sources=documents,
entity_resolver=entity_resolver
)
# Add temporal edges with validity periods
graph_builder.add_temporal_edge(
graph=kg,
source="Apple Inc.",
target="Steve Jobs",
relationship="founded_by",
valid_from="1976-04-01",
valid_until=None, # Ongoing relationship
temporal_metadata={"timezone": "UTC", "precision": "day"}
)
graph_builder.add_temporal_edge(
graph=kg,
source="Apple Inc.",
target="Beats Electronics",
relationship="acquired",
valid_from="2014-05-28",
valid_until="2014-08-01", # Acquisition completed
temporal_metadata={"amount": "$3B", "status": "completed"}
)
# Create temporal snapshot
version_manager = TemporalVersionManager(
snapshot_interval=timedelta(days=30),
auto_snapshot=True
)
snapshot = version_manager.create_version(
graph=kg,
timestamp="2024-01-15",
version_label="Q1_2024",
metadata={"description": "Q1 2024 knowledge graph snapshot"}
)
# Query temporal graph
temporal_query = TemporalGraphQuery(
enable_temporal_reasoning=True,
temporal_granularity="day"
)
# Query at specific time point
results_at_2014 = temporal_query.query_at_time(
graph=kg,
query="Who founded Apple Inc.?",
at_time="2014-06-15",
include_history=True
)
# Query within time range
results_range = temporal_query.query_time_range(
graph=kg,
query="What acquisitions did Apple make?",
start_time="2010-01-01",
end_time="2020-12-31",
temporal_aggregation="union"
)
# Analyze temporal evolution
analyzer = GraphAnalyzer(enable_temporal=True)
evolution = analyzer.analyze_temporal_evolution(
graph=kg,
start_time="2000-01-01",
end_time="2024-12-31",
metrics=["node_count", "edge_count", "density", "communities"],
interval=timedelta(days=365) # Yearly snapshots
)
print("=== TEMPORAL EVOLUTION ===")
for snapshot in evolution.snapshots:
print(f"{snapshot.timestamp}: {snapshot.metrics}")
# Detect temporal patterns
pattern_detector = TemporalPatternDetector()
patterns = pattern_detector.detect_temporal_patterns(
graph=kg,
pattern_type="sequence",
min_frequency=2,
time_window=timedelta(days=365)
)
print(f"\nโ
Detected {len(patterns)} temporal patterns")
# Find temporal paths
temporal_paths = temporal_query.find_temporal_paths(
graph=kg,
source="Apple Inc.",
target="Beats Electronics",
start_time="2010-01-01",
end_time="2015-12-31",
max_path_length=3
)
print(f"\nโ
Found {len(temporal_paths)} temporal paths")
Advanced Graph Analytics
from semantica.kg import (
GraphAnalyzer,
CentralityCalculator,
CommunityDetector,
ConnectivityAnalyzer
)
analyzer = GraphAnalyzer(kg)
# Centrality analysis
centrality_calc = CentralityCalculator(kg)
pagerank_scores = centrality_calc.pagerank()
betweenness_scores = centrality_calc.betweenness_centrality()
closeness_scores = centrality_calc.closeness_centrality()
eigenvector_scores = centrality_calc.eigenvector_centrality()
print("\nMost Influential Entities (PageRank):")
for entity, score in sorted(pagerank_scores.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" {entity}: {score:.3f}")
# Community detection
community_detector = CommunityDetector(kg)
communities = community_detector.detect(algorithm="louvain") # or "leiden", "greedy_modularity"
print(f"\nCommunities detected: {len(communities)}")
for i, community in enumerate(communities[:3], 1):
print(f" Community {i}: {len(community)} entities - {community[:3]}...")
# Connectivity analysis
connectivity = ConnectivityAnalyzer(kg)
shortest_paths = connectivity.find_shortest_paths("Apple Inc.", "Dr. Dre", max_length=3)
all_paths = connectivity.find_all_paths("Apple Inc.", "Dr. Dre", max_length=4)
print(f"\nShortest paths found: {len(shortest_paths)}")
for path in shortest_paths[:3]:
print(f" {' โ '.join(str(node) for node in path)}")
# Graph metrics
metrics = analyzer.compute_metrics()
print(f"\nGraph Metrics:")
print(f" Average degree: {metrics['avg_degree']:.2f}")
print(f" Clustering coefficient: {metrics['clustering']:.3f}")
print(f" Diameter: {metrics['diameter']}")
print(f" Average path length: {metrics['avg_path_length']:.2f}")
4. ๐ Ontology Generation & Management
Generate formal ontologies automatically using a 6-stage LLM-based pipeline that transforms unstructured content into W3C-compliant OWL ontologies.
The 6-Stage Pipeline:
Stage 1: Semantic Network Parsing โ Extract domain concepts
Stage 2: YAML-to-Definition โ Transform into class definitions
Stage 3: Definition-to-Types โ Map to OWL types
Stage 4: Hierarchy Generation โ Build taxonomic structures
Stage 5: TTL Generation โ Generate OWL/Turtle syntax
Stage 6: Symbolic Validation โ HermiT/Pellet reasoning (F1 up to 0.99)
Example: Automatic Ontology Generation
from semantica.ontology import (
OntologyGenerator,
OntologyValidator,
ClassInferrer,
PropertyGenerator,
OWLGenerator,
OntologyEvaluator,
RequirementsSpec
)
# Sample domain documents
documents = [
"""Apple Inc. is a technology company that designs and manufactures consumer
electronics, software, and online services. Products include iPhone, iPad, Mac.""",
"""Companies can acquire other companies. Apple acquired Beats Electronics for
$3 billion. Acquisitions involve financial transactions and integration."""
]
# Step 1: Define requirements and competency questions
requirements = RequirementsSpec()
requirements.add_competency_question(
"What companies exist in the domain?",
category="entity_identification"
)
requirements.add_competency_question(
"What are the relationships between companies?",
category="relationship_modeling"
)
# Step 2: Initialize generator with full configuration
generator = OntologyGenerator(
llm_provider="openai",
model="gpt-4",
validation_mode="hybrid", # LLM + symbolic reasoner
enable_class_inference=True,
enable_property_generation=True,
quality_threshold=0.95
)
# Step 3: Generate ontology using 6-stage pipeline
ontology = generator.generate_from_documents(
sources=documents,
requirements=requirements,
quality_threshold=0.95,
namespace="https://example.org/ontology#",
prefix="ex"
)
print("=== ONTOLOGY GENERATION RESULTS ===")
print(f"Classes: {len(ontology.classes)}")
print(f"Properties: {len(ontology.properties)}")
print(f"Axioms: {len(ontology.axioms)}")
print(f"Validation Score: {ontology.validation_score:.2f}")
print(f"Namespace: {ontology.namespace}\n")
# Step 4: Display generated classes with hierarchy
print("=== GENERATED CLASSES ===")
for cls in ontology.classes[:5]:
print(f"\nClass: {cls.name} ({cls.iri})")
print(f" Superclasses: {', '.join(cls.superclasses) if cls.superclasses else 'owl:Thing'}")
print(f" Subclasses: {len(cls.subclasses)}")
print(f" Properties: {len(cls.properties)}")
for prop in cls.properties[:3]:
print(f" - {prop.name} ({prop.type})")
if cls.annotations:
print(f" Annotations: {cls.annotations}")
# Step 5: Display properties with domain and range
print("\n=== GENERATED PROPERTIES ===")
object_props = [p for p in ontology.properties if p.type == 'ObjectProperty']
datatype_props = [p for p in ontology.properties if p.type == 'DatatypeProperty']
print(f"Object Properties: {len(object_props)}")
for prop in object_props[:3]:
print(f" {prop.name}: {prop.domain} โ {prop.range}")
if prop.characteristics:
print(f" Characteristics: {prop.characteristics}")
print(f"\nDatatype Properties: {len(datatype_props)}")
for prop in datatype_props[:3]:
print(f" {prop.name}: {prop.domain} โ {prop.range}")
# Step 6: Validate with symbolic reasoner
validator = OntologyValidator(reasoner="hermit") # or "pellet", "fact++"
validation_report = validator.validate(ontology)
print("\n=== VALIDATION REPORT ===")
if validation_report.is_consistent:
print("โ
Ontology is logically consistent")
print(f"โ
All {len(validation_report.checks)} checks passed")
print(f"โ
Satisfiability: {validation_report.is_satisfiable}")
print(f"โ
Classification: {validation_report.classification_complete}")
# Generate OWL/Turtle file
owl_generator = OWLGenerator()
owl_generator.generate(ontology, "domain_ontology.ttl", format="turtle")
print("\nโ
Saved to domain_ontology.ttl")
else:
print("โ Inconsistencies found:")
for issue in validation_report.issues:
print(f" - {issue.severity}: {issue.message}")
print(f" Location: {issue.location}")
# Step 7: Evaluate ontology quality
evaluator = OntologyEvaluator()
evaluation = evaluator.evaluate(ontology)
print("\n=== ONTOLOGY QUALITY EVALUATION ===")
print(f"Completeness: {evaluation.completeness:.2f}")
print(f"Consistency: {evaluation.consistency:.2f}")
print(f"Clarity: {evaluation.clarity:.2f}")
print(f"Coherence: {evaluation.coherence:.2f}")
print(f"Overall Score: {evaluation.overall_score:.2f}")
5. ๐ Context Engineering for AI Agents
Formalize context as graphs to enable AI agents with memory, tools, and purpose:
The Three Layers of Context:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Layer 1: Prompting (Natural Language Programming) โ
โ โโ Define agent goals and behaviors โ
โ โโ Template-based prompt construction โ
โ โโ Dynamic context injection โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Layer 2: Memory (RAG + Knowledge Graphs) โ
โ โโ Vector databases for semantic similarity โ
โ โโ Knowledge graphs for relationship traversal โ
โ โโ Persistent context across conversations โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Layer 3: Tools (Standardized Interfaces) โ
โ โโ MCP-compatible tool registry โ
โ โโ Semantic tool discovery โ
โ โโ Consistent tool access patterns โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Example: Building Context-Aware Agent
from semantica.context import (
ContextGraphBuilder,
AgentMemory,
ContextRetriever,
EntityLinker
)
from semantica.prompting import PromptBuilder
from semantica.agents import ToolRegistry
from semantica.vector_store import VectorStore, PineconeAdapter
from semantica.kg import GraphBuilder
# Build context graph from conversations
context_builder = ContextGraphBuilder(
extract_entities=True,
extract_relationships=True,
link_external_entities=True
)
context_graph = context_builder.build_from_conversations(
conversations=["conv_1.json", "conv_2.json"],
link_entities=True,
extract_intents=True,
extract_sentiments=True
)
# Initialize vector store for memory
vector_store = VectorStore(adapter=PineconeAdapter(
api_key="your-api-key",
index_name="agent-memory",
environment="us-east-1"
))
# Initialize agent memory with full configuration
memory = AgentMemory(
vector_store=vector_store,
knowledge_graph=context_graph,
retention_policy="30_days",
max_memory_size=10000
)
# Store context with metadata
memory.store(
content="User prefers technical documentation over tutorials",
metadata={
"user_id": "user_123",
"session": "session_456",
"timestamp": "2024-01-15T10:30:00Z",
"category": "preferences"
},
entities=["User", "Documentation", "Tutorials"],
relationships=[("prefers", "User", "Documentation")]
)
# Store additional context
memory.store(
content="User is interested in machine learning and NLP topics",
metadata={"user_id": "user_123", "category": "interests"},
entities=["User", "Machine Learning", "NLP"]
)
# Initialize context retriever
context_retriever = ContextRetriever(
memory_store=memory,
use_graph_expansion=True,
max_expansion_hops=2
)
# Retrieve relevant context
relevant_context = context_retriever.retrieve(
query="What are the user's learning preferences?",
max_results=5,
use_graph_expansion=True,
min_relevance_score=0.7
)
print("=== RETRIEVED CONTEXT ===")
for ctx in relevant_context:
print(f"- {ctx.content} (score: {ctx.score:.2f})")
if ctx.related_entities:
print(f" Related: {[e.name for e in ctx.related_entities[:3]]}")
# Entity linking for context
entity_linker = EntityLinker(
knowledge_graph=context_graph,
similarity_threshold=0.8
)
linked_entities = entity_linker.link(
text="Create a learning plan for technical documentation",
context=relevant_context
)
# Build context-aware prompt
prompt_builder = PromptBuilder(
template_engine="jinja2",
include_context=True,
include_entities=True
)
prompt = prompt_builder.build(
template="agent_task",
context=relevant_context,
entities=linked_entities,
user_query="Create a learning plan",
system_instructions="You are a helpful learning assistant."
)
print("\n=== GENERATED PROMPT ===")
print(prompt)
# Tool registry for agent capabilities
tool_registry = ToolRegistry()
tool_registry.register_tool(
name="create_learning_plan",
description="Creates a personalized learning plan",
parameters={"topics": "list", "preferences": "dict"}
)
# Get available tools based on context
available_tools = tool_registry.get_relevant_tools(
query="Create a learning plan",
context=relevant_context
)
print(f"\n=== AVAILABLE TOOLS ===")
for tool in available_tools:
print(f"- {tool.name}: {tool.description}")
6. ๐ฏ Knowledge Graph-Powered RAG (GraphRAG)
Combine vector search speed with knowledge graph precision for 30% accuracy improvements.
Example: GraphRAG Query
from semantica.qa_rag import (
GraphRAGEngine,
HybridRetriever,
RAGManager,
ContextBuilder,
MemoryStore
)
from semantica.vector_store import VectorStore, PineconeAdapter
from semantica.kg import GraphBuilder
# Initialize components
vector_store = VectorStore(adapter=PineconeAdapter(
api_key="your-api-key",
index_name="semantic-index",
environment="us-east-1"
))
kg = GraphBuilder().load_from_neo4j(
uri="bolt://localhost:7687",
username="neo4j",
password="password"
)
# Initialize GraphRAG with full configuration
graphrag = GraphRAGEngine(
vector_store=vector_store,
knowledge_graph=kg,
embedding_model="text-embedding-3-large",
embedding_dimension=3072,
rerank_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
max_context_length=4000
)
# Alternative: Use RAGManager for higher-level operations
rag_manager = RAGManager(
graphrag_engine=graphrag,
context_builder=ContextBuilder(max_context_size=4000),
memory_store=MemoryStore(retention_days=30)
)
# User query
query = "Who founded Apple and what major acquisitions did they make?"
# === STEP 1: VECTOR SEARCH ===
print("Step 1: Vector Search")
vector_results = graphrag.vector_search(
query=query,
top_k=20,
filter_metadata={"source": "company_data"},
include_metadata=True
)
print(f"โ
Found {len(vector_results)} similar chunks")
print(f" Top result score: {vector_results[0].score:.3f}\n")
# === STEP 2: ENTITY EXTRACTION ===
print("Step 2: Entity Extraction")
entities = graphrag.extract_entities(
vector_results,
min_confidence=0.7,
entity_types=["PERSON", "ORG"]
)
print(f"โ
Extracted {len(entities)} unique entities")
print(f" Entities: {[e.name for e in entities[:5]]}\n")
# === STEP 3: GRAPH EXPANSION ===
print("Step 3: Graph Expansion")
expanded_context = graphrag.expand_graph(
seed_entities=entities,
max_hops=2,
relationship_types=["founded", "acquired", "co-founded"],
max_nodes=100,
include_edge_weights=True
)
print(f"โ
Expanded from {len(entities)} to {len(expanded_context.nodes)} nodes")
print(f" Added {len(expanded_context.edges)} edges\n")
# === STEP 4: HYBRID RETRIEVAL ===
print("Step 4: Hybrid Retrieval")
hybrid_retriever = HybridRetriever(
vector_store=vector_store,
knowledge_graph=kg,
rerank=True
)
results = hybrid_retriever.retrieve(
query=query,
vector_top_k=20,
graph_top_k=10,
expand_graph=True,
max_hops=2,
rerank=True,
final_top_k=5,
fusion_method="reciprocal_rank" # or "weighted", "rrf"
)
# === DISPLAY RESULTS ===
print("\n=== GRAPHRAG RESULTS ===\n")
for i, result in enumerate(results, 1):
print(f"Result {i} (Score: {result.score:.3f})")
print(f"Text: {result.text[:150]}...")
print(f"\nGraph Paths ({len(result.graph_paths)}):")
for path in result.graph_paths[:2]:
print(f" {' โ '.join(path)}")
print(f"\nRelated Entities: {[e.name for e in result.related_entities[:3]]}")
print(f"Sources: {result.source_documents}")
print(f"Metadata: {result.metadata}\n")
print("-" * 80 + "\n")
# === STEP 5: GENERATE ANSWER (with RAG Manager) ===
print("Step 5: Answer Generation")
answer = rag_manager.generate_answer(
query=query,
retrieved_results=results,
temperature=0.1,
max_tokens=500
)
print(f"Answer: {answer.text}")
print(f"Confidence: {answer.confidence:.2f}")
print(f"Citations: {len(answer.citations)}")
# Store in memory for future queries
rag_manager.memory_store.store(
query=query,
answer=answer,
retrieved_context=results
)
Performance Comparison:
| Approach | Accuracy | Speed | Context Quality |
|---|---|---|---|
| Vector-Only RAG | 70% | โก 50ms | โญโญโญ |
| Graph-Only | 75% | ๐ 300ms | โญโญโญโญ |
| GraphRAG (Hybrid) | 91% โญ | โก 80ms | โญโญโญโญโญ |
7. ๐ค Multi-Agent System Infrastructure
Enable AI agents to coordinate through shared semantic models.
Example: Multi-Agent Coordination
from semantica.agents import MultiAgentSystem, AgentCoordinator
from semantica.ontology import SharedOntologyManager
# Load shared ontology
ontology_manager = SharedOntologyManager()
ontology = ontology_manager.load("domain_ontology.ttl")
# Initialize multi-agent system
mas = MultiAgentSystem(
shared_ontology=ontology,
coordination_mode="semantic"
)
# Create specialized agents
research_agent = mas.create_agent(
role="researcher",
capabilities=["web_search", "document_analysis"],
constraints=ontology_manager.get_constraints("research_operations")
)
analysis_agent = mas.create_agent(
role="analyst",
capabilities=["data_analysis", "visualization"],
constraints=ontology_manager.get_constraints("analysis_operations")
)
writing_agent = mas.create_agent(
role="writer",
capabilities=["content_generation", "summarization"],
constraints=ontology_manager.get_constraints("writing_operations")
)
# Coordinate workflow
coordinator = AgentCoordinator(
agents=[research_agent, analysis_agent, writing_agent],
workflow_graph=workflow_definition
)
# Execute coordinated task
result = coordinator.execute_workflow(
task="Create a comprehensive market analysis report",
validation_mode="ontology_based"
)
print(f"โ
Workflow completed")
print(f"Tasks executed: {len(result.completed_tasks)}")
print(f"Validation status: {result.validation_status}")
8. ๐ง Production-Ready Quality Assurance
Enterprise-grade validation, conflict detection, and quality scoring.
The Four Critical QA Features
1. Schema Template Enforcement
from semantica.templates import SchemaTemplate
# Define business schema
company_schema = SchemaTemplate(
name="company_knowledge_graph",
entities={
"Company": {
"required_properties": ["name", "industry", "founded_year"],
"optional_properties": ["revenue", "employee_count"]
},
"Person": {
"required_properties": ["name", "role"],
"optional_properties": ["email", "department"]
}
},
relationships={
"works_for": {"domain": "Person", "range": "Company"},
"produces": {"domain": "Company", "range": "Product"}
}
)
# Enforce schema during extraction
kb = core.build_knowledge_base(
sources=documents,
schema_template=company_schema,
strict_mode=True
)
print(f"โ
Schema enforcement: {kb.compliance_rate:.1f}% compliant")
2. Seed Data System
from semantica.seed import SeedManager
seed_manager = SeedManager()
# Load verified data
seed_manager.load_from_csv("verified_companies.csv")
seed_manager.load_from_json("hr_database.json")
# Build foundation graph
foundation_graph = seed_manager.build_foundation_graph(schema=company_schema)
# Build on verified foundation
kb = core.build_knowledge_base(
sources=["new_documents/"],
foundation_graph=foundation_graph
)
print(f"โ
Foundation entities: {foundation_graph.node_count}")
print(f"โ
New entities: {kb.node_count - foundation_graph.node_count}")
3. Advanced Deduplication
from semantica.deduplication import DuplicateDetector, EntityMerger
# Detect duplicates
detector = DuplicateDetector()
duplicates = detector.find_duplicates(
entities=kb.entities,
similarity_threshold=0.85
)
# Merge duplicates
merger = EntityMerger()
merged = merger.merge_duplicates(
duplicates=duplicates,
strategy="highest_confidence"
)
print(f"โ
Found {len(duplicates)} duplicate groups")
print(f"โ
Merged into {len(merged)} canonical entities")
4. Conflict Detection & Resolution
from semantica.conflicts import ConflictDetector, ConflictResolver
# Detect conflicts
detector = ConflictDetector()
conflicts = detector.detect_conflicts(
entities=kb.entities,
properties=["revenue", "employee_count"]
)
print(f"โ ๏ธ Found {len(conflicts)} conflicts\n")
for conflict in conflicts:
print(f"Conflict: {conflict.entity.name}.{conflict.property}")
print(f" Values: {conflict.values}")
print(f" Sources: {conflict.sources}\n")
# Resolve conflict
resolver = ConflictResolver()
resolution = resolver.resolve(
conflict=conflict,
strategy="most_recent"
)
print(f" โ
Resolved: {resolution.chosen_value}\n")
Comprehensive Quality Scoring
from semantica.kg_qa import QualityAssessor
# Assess quality
assessor = QualityAssessor()
report = assessor.assess(kb)
print("=== QUALITY REPORT ===")
print(f"Overall Score: {report.overall_score}/100\n")
print("Detailed Scores:")
print(f" Completeness: {report.completeness_score}/100")
print(f" Consistency: {report.consistency_score}/100")
print(f" Accuracy: {report.accuracy_score}/100\n")
print("Issues:")
print(f" Duplicates: {report.duplicate_count}")
print(f" Conflicts: {report.conflict_count}")
print(f" Missing properties: {report.missing_property_count}")
๐๏ธ Architecture Overview
System Architecture
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SEMANTICA FRAMEWORK โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ DATA INGESTION LAYER โ โ
โ โ โโโโโโโโโโฌโโโโโโโโโฌโโโโโโโโโฌโโโโโโโโโฌโโโโโโโโโฌโโโโโโโโโโโ โ โ
โ โ โ Files โ Web โ Feeds โ APIs โStreams โ Archives โ โ โ
โ โ โโโโโโโโโโดโโโโโโโโโดโโโโโโโโโดโโโโโโโโโดโโโโโโโโโดโโโโโโโโโโโ โ โ
โ โ 50+ Formats โข Real-time โข Multi-modal โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ SEMANTIC PROCESSING LAYER โ โ
โ โ โโโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ Parse โ Normalize โ Extract โ Build Graph โ โ โ
โ โ โ โ โ Semantics โ โ โ โ
โ โ โโโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโ โ โ
โ โ NLP โข Embeddings โข Ontologies โข Quality Assurance โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ APPLICATION LAYER โ โ
โ โ โโโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ GraphRAG โ AI Agents โMulti-Agent โ Analytics โ โ โ
โ โ โ โ โ Systems โ Copilots โ โ โ
โ โ โโโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโ โ โ
โ โ Hybrid Retrieval โข Context Engineering โข Reasoning โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Module Architecture
29 Production-Ready Modules Organized into Logical Layers:
Core & Infrastructure (5 modules)
semantica.core - Framework orchestration
Semantica- Main framework classOrchestrator- Pipeline coordination engineConfigManager- Configuration managementPluginRegistry- Plugin management systemLifecycleManager- System lifecycle management
semantica.pipeline - Pipeline management
PipelineBuilder- Pipeline construction DSLExecutionEngine- Pipeline execution enginePipelineValidator- Pipeline validationParallelismManager- Parallel execution managementResourceScheduler- Resource scheduling and allocationFailureHandler- Error handling and recovery
semantica.utils - Shared utilities
Validators- Input validation utilitiesHelpers- Common helper functionsLogging- Logging utilitiesExceptions- Custom exception classesTypes- Type definitions and annotationsConstants- Framework constants
semantica.monitoring - System monitoring
MetricsCollector- Metrics collectionPerformanceMonitor- Performance monitoringHealthChecker- Health checksAlertManager- Alert managementAnalyticsDashboard- Analytics dashboardQualityAssurance- Quality monitoring
semantica.security - Access control
AccessControl- Access control system- Authentication and authorization utilities
Data Processing (5 modules)
semantica.ingest - Universal data ingestion
FileIngestor- Local and cloud file processingWebIngestor- Web scraping and crawlingFeedIngestor- RSS/Atom feed processingStreamIngestor- Real-time stream processingRepoIngestor- Git repository processingEmailIngestor- Email protocol handlingDBIngestor- Database export handling
semantica.parse - Document parsing
DocumentParser- PDF, DOCX, PPTX parsingWebParser- HTML, XML, XHTML parsingStructuredDataParser- JSON, CSV, YAML parsingEmailParser- EML, MSG, MBOX parsingCodeParser- Source code parsingMediaParser- Image and media parsingExcelParser- Excel file parsing
semantica.normalize - Data normalization
TextNormalizer- Text normalizationTextCleaner- Text cleaning utilitiesEntityNormalizer- Entity name normalizationDateNormalizer- Date format normalizationNumberNormalizer- Number format normalizationEncodingHandler- Character encoding handlingLanguageDetector- Language detectionDataCleaner- General data cleaning
semantica.split - Document chunking
SemanticChunker- Semantic-aware chunkingStructuralChunker- Structure-based chunkingSlidingWindowChunker- Sliding window chunkingTableChunker- Table-aware chunkingChunkValidator- Chunk validationProvenanceTracker- Chunk provenance tracking
semantica.streaming - Real-time processing
StreamProcessor- Main streaming processorKafkaAdapter- Kafka integrationRabbitMQAdapter- RabbitMQ integrationKinesisAdapter- AWS Kinesis integrationPulsarAdapter- Apache Pulsar integrationCheckpointManager- Stream checkpointingBackpressureHandler- Backpressure managementExactlyOnce- Exactly-once processing guarantees
Semantic Intelligence (4 modules)
semantica.semantic_extract - Entity & relation extraction
NamedEntityRecognizer- NER with multiple modelsRelationExtractor- Relationship extractionEventDetector- Event detection and extractionCoreferenceResolver- Coreference resolutionTripleExtractor- RDF triple extractionSemanticAnalyzer- Semantic analysis engineNERExtractor- Alternative NER implementationLLMEnhancer- LLM-based extraction enhancementExtractionValidator- Extraction validationSemanticNetworkExtractor- Semantic network extraction
semantica.embeddings - Vector embeddings
EmbeddingGenerator- Main embedding generatorTextEmbedder- Text embedding generationImageEmbedder- Image embedding generationAudioEmbedder- Audio embedding generationMultiModalEmbedder- Multi-modal embeddingsEmbeddingOptimizer- Embedding optimizationContextManager- Context-aware embeddingsPoolingStrategies- Embedding pooling strategiesProviderAdapters- Provider-specific adapters
semantica.ontology - Ontology generation
OntologyGenerator- 6-stage ontology generation pipelineClassInferrer- Class discovery and hierarchy buildingPropertyGenerator- Property inferenceOntologyValidator- Validation with symbolic reasonersOWLGenerator- OWL/Turtle generationOntologyEvaluator- Ontology quality evaluationRequirementsSpec- Requirements specificationCompetencyQuestions- Competency question managementReuseManager- Ontology reuse managementVersionManager- Ontology versioningNamespaceManager- Namespace managementNamingConventions- Naming convention enforcementModuleManager- Ontology module managementDomainOntologies- Domain ontology managementOntologyDocumentation- Documentation generation
semantica.vocabulary - Vocabulary management
VocabularyManager- Controlled vocabulary managementControlledVocabulary- Controlled vocabulary implementation
Knowledge Graph (3 modules)
semantica.kg - Graph construction & analysis
GraphBuilder- Knowledge graph construction with temporal supportEntityResolver- Entity resolution and deduplicationGraphAnalyzer- Graph analytics engine with temporal evolution analysisTemporalGraphQuery- Time-aware graph queryingTemporalPatternDetector- Temporal pattern detectionTemporalVersionManager- Temporal versioning and snapshotsCentralityCalculator- Centrality measuresCommunityDetector- Community detectionConnectivityAnalyzer- Connectivity analysisGraphValidator- Graph validationDeduplicator- Graph deduplicationProvenanceTracker- Provenance trackingConflictDetector- Conflict detection in graphsSeedManager- Seed data management for graphs
semantica.triple_store - RDF storage
TripleManager- Triple store managementQueryEngine- SPARQL query engineBulkLoader- Bulk loading utilitiesJenaAdapter- Apache Jena adapterBlazegraphAdapter- Blazegraph adapterVirtuosoAdapter- Virtuoso adapterRDF4JAdapter- Eclipse RDF4J adapter
semantica.vector_store - Vector storage
VectorStore- Main vector store interfaceFAISSAdapter- FAISS adapterPineconeAdapter- Pinecone adapterWeaviateAdapter- Weaviate adapterQdrantAdapter- Qdrant adapterMilvusAdapter- Milvus adapterHybridSearch- Hybrid search implementationNamespaceManager- Namespace managementMetadataStore- Metadata storage
AI Applications (6 modules)
semantica.qa_rag - GraphRAG engine
RAGManager- RAG system managementHybridRetriever- Hybrid retrieval (vector + graph)ContextBuilder- Context building for RAGMemoryStore- Agent memory storage
semantica.context - Context engineering
ContextGraphBuilder- Context graph constructionAgentMemory- Agent memory managementContextRetriever- Context retrievalEntityLinker- Entity linking for context
semantica.prompting - Prompt engineering
PromptBuilder- Prompt construction and templating
semantica.agents - Agent infrastructure
ToolRegistry- MCP-compatible tool registry
semantica.reasoning - Reasoning & inference
InferenceEngine- Main inference engineDeductiveReasoner- Deductive reasoningAbductiveReasoner- Abductive reasoningRuleManager- Rule managementReteEngine- RETE algorithm implementationSPARQLReasoner- SPARQL-based reasoningExplanationGenerator- Explanation generation
semantica.quality - Quality assurance
QualityEngine- Quality assessment engine
Quality Assurance (5 modules)
semantica.templates - Schema templates
SchemaTemplate- Schema template definition and enforcement
semantica.seed - Seed data management
SeedManager- Seed data loading and management
semantica.deduplication - Entity deduplication
DuplicateDetector- Duplicate detectionEntityMerger- Entity merging strategiesSimilarityCalculator- Similarity calculationClusterBuilder- Duplicate cluster buildingMergeStrategy- Merge strategy implementations
semantica.conflicts - Conflict detection
ConflictDetector- Conflict detectionConflictResolver- Conflict resolutionConflictAnalyzer- Conflict analysisSourceTracker- Source tracking for conflictsInvestigationGuide- Conflict investigation utilities
semantica.kg_qa - Knowledge graph QA
QualityAssessor- Knowledge graph quality assessment
Export & Utilities (1 module)
semantica.export - Multi-format export
RDFExporter- RDF/Turtle exportJSONExporter- JSON/JSON-LD exportCSVExporter- CSV exportGraphExporter- Graph format exportYAMLExporter- YAML export for semantic networksReportGenerator- Quality and analysis reports
๐ Quick Start
Quick Start Examples
Example 1: Process Single Document
from semantica import Semantica
from semantica.parse import DocumentParser
from semantica.semantic_extract import NamedEntityRecognizer, RelationExtractor
# Initialize with configuration
core = Semantica(
ner_model="transformer",
relation_strategy="hybrid",
enable_quality_assurance=True
)
# Process document
result = core.process(
"company_news.txt",
extract_entities=True,
extract_relationships=True,
generate_triples=True
)
# Display results
print(f"Entities: {len(result.entities)}")
print(f"Relationships: {len(result.relationships)}")
print(f"Triples: {len(result.triples)}")
for entity in result.entities[:5]:
print(f"- {entity.text} ({entity.type}, confidence={entity.confidence:.2f})")
# Export results
result.export("output.json", format="json")
result.export("output.ttl", format="turtle")
Example 2: Build Knowledge Graph
from semantica import Semantica
from semantica.kg import GraphBuilder, EntityResolver
from semantica.export import RDFExporter
# Multiple documents
documents = ["doc1.txt", "doc2.txt", "doc3.txt"]
# Build graph with entity resolution
core = Semantica(
graph_db="neo4j",
merge_entities=True,
resolve_conflicts=True
)
kg = core.build_knowledge_graph(
documents,
merge_entities=True,
resolve_conflicts=True,
generate_embeddings=True
)
# Statistics
print(f"Nodes: {kg.node_count}")
print(f"Edges: {kg.edge_count}")
print(f"Entity Types: {sorted(kg.entity_types)}")
# Query with structured response
result = kg.query(
"Who founded the company?",
return_format="structured"
)
print(f"Answer: {result.answer}")
print(f"Confidence: {result.confidence:.2f}")
# Export graph
exporter = RDFExporter()
exporter.export(kg, "output.ttl", format="turtle")
Example 3: GraphRAG Setup
from semantica import Semantica
from semantica.qa_rag import GraphRAGEngine, HybridRetriever
from semantica.vector_store import VectorStore, PineconeAdapter
from semantica.kg import GraphBuilder
# Initialize with stores
core = Semantica(
vector_store="pinecone",
graph_db="neo4j",
embedding_model="text-embedding-3-large"
)
# Build knowledge base
kb = core.build_knowledge_base(
sources=["documents/"],
generate_embeddings=True,
build_graph=True
)
# Initialize GraphRAG with configuration
vector_store = VectorStore(adapter=PineconeAdapter(
api_key="your-api-key",
index_name="knowledge-base",
environment="us-east-1"
))
graphrag = GraphRAGEngine(
vector_store=kb.vector_store,
knowledge_graph=kb.graph,
embedding_model="text-embedding-3-large",
rerank=True
)
# Query with hybrid retrieval
response = graphrag.query(
"What are the main findings?",
top_k=5,
expand_graph=True,
max_hops=2
)
print(f"Answer: {response.answer}")
print(f"Confidence: {response.confidence:.2f}")
print(f"Sources: {len(response.sources)}")
Example 4: Production Setup with QA
from semantica import Semantica
from semantica.templates import SchemaTemplate
from semantica.seed import SeedManager
from semantica.kg_qa import QualityAssessor
from semantica.deduplication import DuplicateDetector, EntityMerger
from semantica.conflicts import ConflictDetector, ConflictResolver
# Load schema and seed data
schema = SchemaTemplate.from_file("schema.yaml")
seed_manager = SeedManager()
seed_manager.load_from_database("postgresql://user:pass@localhost/db")
seed_manager.load_from_csv("verified_data.csv")
foundation = seed_manager.create_foundation(schema)
# Build with comprehensive QA
core = Semantica(
quality_assurance=True,
merge_entities=True,
resolve_conflicts=True
)
kb = core.build_knowledge_base(
sources=["data/"],
schema_template=schema,
foundation_graph=foundation,
enable_all_qa=True,
deduplication_threshold=0.85,
conflict_resolution_strategy="highest_confidence"
)
# Comprehensive quality assessment
assessor = QualityAssessor()
report = assessor.assess(
kb,
check_completeness=True,
check_consistency=True,
check_accuracy=True,
check_duplicates=True,
check_conflicts=True
)
print("=== QUALITY REPORT ===")
print(f"Overall Score: {report.overall_score}/100")
print(f"Completeness: {report.completeness_score}/100")
print(f"Consistency: {report.consistency_score}/100")
print(f"Accuracy: {report.accuracy_score}/100")
print(f"Duplicates Found: {report.duplicate_count}")
print(f"Conflicts Found: {report.conflict_count}")
# Additional QA checks
duplicate_detector = DuplicateDetector()
duplicates = duplicate_detector.find_duplicates(
entities=kb.entities,
similarity_threshold=0.85
)
conflict_detector = ConflictDetector()
conflicts = conflict_detector.detect_conflicts(
entities=kb.entities,
properties=["name", "date", "value"]
)
print(f"\nDuplicates: {len(duplicates)} groups")
print(f"Conflicts: {len(conflicts)} issues")
๐ฏ Use Cases
1. ๐ข Enterprise Knowledge Engineering
Challenge: Process diverse enterprise data sources and build unified knowledge graphs.
from semantica import Semantica
from semantica.ingest import FileIngestor, WebIngestor, DBIngestor
# Initialize
core = Semantica(graph_db="neo4j")
# Multi-source ingestion
sources = [
*FileIngestor().ingest("/shared/documents/"),
*WebIngestor().ingest("https://confluence.company.com/api"),
*DBIngestor().ingest("postgresql://db", query="SELECT * FROM articles")
]
# Build unified graph
kg = core.build_knowledge_graph(
sources=sources,
merge_entities=True,
resolve_conflicts=True
)
print(f"โ
Enterprise knowledge graph: {kg.node_count} nodes")
Impact: 80% faster information discovery, automatic cross-reference detection
2. ๐ค AI Agents & Autonomous Systems
Challenge: Build AI agents with access to structured knowledge.
from semantica import Semantica
from semantica.agents import AgentManager
# Build knowledge base
core = Semantica()
kb = core.build_knowledge_base(
sources=["documents/"],
extract_entities=True,
build_graph=True
)
# Create agent with knowledge
agent_manager = AgentManager(knowledge_graph=kb.graph)
agent = agent_manager.create_agent(
role="data_analyst",
capabilities=["query_graph", "generate_reports"]
)
# Agent analyzes data
result = agent.analyze("Show me trends in the data")
print(result.report)
3. ๐ Multi-Format Document Processing
Challenge: Process various document formats uniformly.
from semantica import Semantica
from semantica.ingest import FileIngestor
# Ingest multiple formats
ingestor = FileIngestor()
sources = [
*ingestor.ingest("*.pdf"),
*ingestor.ingest("*.docx"),
*ingestor.ingest("*.xlsx"),
*ingestor.ingest("*.json")
]
# Process all through unified pipeline
core = Semantica()
kb = core.build_knowledge_base(sources)
print(f"โ
Processed {len(sources)} documents")
print(f"โ
Knowledge graph: {kb.graph.node_count} nodes")
4. ๐ Data Pipeline Processing
Challenge: Build custom processing pipelines.
from semantica.pipeline import PipelineBuilder
from semantica.ingest import FileIngestor
from semantica.semantic_extract import NamedEntityRecognizer
# Build pipeline
pipeline = PipelineBuilder() \
.add_step("ingest", {"ingestor": FileIngestor()}) \
.add_step("extract", {"ner": NamedEntityRecognizer()}) \
.add_step("build_graph", {"merge_entities": True}) \
.set_parallelism(4) \
.build()
# Execute
results = pipeline.run()
print(f"โ
Pipeline completed: {results.document_count} documents")
5. ๐ Multi-Source Knowledge Graph
Challenge: Combine data from files, web, and databases.
from semantica import Semantica
from semantica.ingest import FileIngestor, WebIngestor, DBIngestor
# Collect diverse sources
sources = [
*FileIngestor().ingest("documents/*.pdf"),
*WebIngestor().ingest("https://example.com/api/articles"),
*DBIngestor().ingest("postgresql://localhost/db")
]
# Build unified graph
core = Semantica()
kg = core.build_knowledge_graph(sources, merge_entities=True)
print(f"โ
Unified graph: {kg.node_count} nodes, {kg.edge_count} edges")
๐ฌ Advanced Features
1. Incremental Updates
from semantica.streaming import StreamProcessor
# Stream processor
stream = StreamProcessor(
knowledge_graph=core.graph,
update_mode="incremental"
)
stream.connect("kafka://localhost:9092/topic")
stream.start()
# Automatic real-time updates
2. Multi-Language Support
core = Semantica(
languages=["en", "es", "fr", "de", "zh"],
auto_detect_language=True,
translate_to="en"
)
kb = core.build_knowledge_base([
"documents_english/",
"documentos_espaรฑol/",
"documents_franรงais/"
])
# Unified multilingual knowledge graph
3. Custom Ontology Import
from semantica.ontology import OntologyManager
manager = OntologyManager()
manager.import_ontology("schema.org")
manager.import_ontology("custom_domain.ttl", format="turtle")
# Extend with custom classes
manager.add_class(
name="CustomEntity",
parent="schema:Thing",
properties=["customProperty1"]
)
core = Semantica(ontology=manager.ontology)
4. Advanced Reasoning
from semantica.reasoning import ReasoningEngine
reasoning = ReasoningEngine(
reasoning_types=["deductive", "inductive", "abductive"],
reasoner="hermit"
)
# Apply reasoning
inferred_triples = reasoning.infer(kg)
print(f"Original: {len(kg.triples)}")
print(f"Inferred: {len(inferred_triples)}")
5. Graph Analytics
from semantica.analytics import GraphAnalytics
analytics = GraphAnalytics(kg)
# Centrality analysis
influential = analytics.compute_centrality(
methods=["pagerank", "betweenness"]
)
# Community detection
communities = analytics.detect_communities(algorithm="louvain")
# Path finding
paths = analytics.find_shortest_paths("Entity A", "Entity B")
print(f"Influential entities: {len(influential)}")
print(f"Communities: {len(communities)}")
6. Custom Pipelines
from semantica.pipeline import PipelineBuilder
pipeline = PipelineBuilder()
pipeline.add_stage("parse", parser="custom_parser")
pipeline.add_stage("extract_entities", model="custom_ner")
pipeline.add_stage("validate", validator="custom_validator")
pipeline.add_stage("store", destination="custom_db")
results = pipeline.execute(input_data)
7. API Integration
from semantica.integrations import APIIntegration
api = APIIntegration()
api.register_endpoint(
name="crunchbase",
url="https://api.crunchbase.com/v4/",
auth_token=token
)
# Enrich entities
enriched = api.enrich_entities(
entities=kg.entities,
endpoint="crunchbase",
fields=["funding", "employees"]
)
๐ญ Production Deployment
Docker Deployment
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y \
build-essential libpq-dev \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
# docker-compose.yml
version: '3.8'
services:
semantica:
build: .
ports: ["8000:8000"]
environment:
- NEO4J_URI=bolt://neo4j:7687
- PINECONE_API_KEY=${PINECONE_API_KEY}
depends_on: [neo4j, redis]
neo4j:
image: neo4j:5.13
ports: ["7474:7474", "7687:7687"]
environment:
- NEO4J_AUTH=neo4j/password
volumes: [neo4j_data:/data]
redis:
image: redis:7-alpine
ports: ["6379:6379"]
volumes: [redis_data:/data]
volumes:
neo4j_data:
redis_data:
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: semantica
spec:
replicas: 3
selector:
matchLabels:
app: semantica
template:
metadata:
labels:
app: semantica
spec:
containers:
- name: semantica
image: semantica:latest
ports:
- containerPort: 8000
resources:
requests:
memory: "4Gi"
cpu: "2"
limits:
memory: "8Gi"
cpu: "4"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: semantica-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: semantica
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Cloud Deployment
AWS:
from semantica.cloud import AWSDeployment
aws = AWSDeployment(
region="us-east-1",
graph_db="neptune",
vector_db="opensearch"
)
aws.deploy(stack_name="semantica-prod", auto_scaling=True)
Azure:
from semantica.cloud import AzureDeployment
azure = AzureDeployment(
subscription_id="...",
graph_db="cosmos_gremlin"
)
azure.deploy(location="eastus")
GCP:
from semantica.cloud import GCPDeployment
gcp = GCPDeployment(
project_id="semantica-project",
graph_db="neo4j_aura"
)
gcp.deploy(region="us-central1")
Monitoring
from semantica.monitoring import Monitor, MetricsCollector
# Initialize monitoring
monitor = Monitor(
prometheus_endpoint="http://prometheus:9090",
grafana_endpoint="http://grafana:3000"
)
# Collect metrics
metrics = MetricsCollector()
metrics.enable_metrics([
"processing_rate",
"extraction_accuracy",
"graph_size",
"query_latency"
])
# Set alerts
monitor.add_alert(
name="high_error_rate",
condition="error_rate > 0.05",
severity="critical"
)
๐ Performance Benchmarks
Processing Speed
| Document Type | Docs/Hour | Entities/Sec | Triples/Sec |
|---|---|---|---|
| PDF (10 pages) | 1,200 | 450 | 800 |
| DOCX (5 pages) | 2,500 | 600 | 1,100 |
| HTML (articles) | 5,000 | 1,200 | 2,000 |
| JSON (structured) | 10,000 | 2,500 | 4,000 |
AWS c5.4xlarge (16 vCPU, 32GB RAM)
Accuracy Metrics
| Task | Precision | Recall | F1 Score |
|---|---|---|---|
| Entity Extraction | 0.94 | 0.91 | 0.92 |
| Relationship Extraction | 0.89 | 0.85 | 0.87 |
| Ontology Generation | 0.96 | 0.93 | 0.94 |
| Duplicate Detection | 0.97 | 0.95 | 0.96 |
GraphRAG Performance
| System | Accuracy | Latency | Context |
|---|---|---|---|
| Vector-Only | 70% | 50ms | โญโญโญ |
| Graph-Only | 75% | 300ms | โญโญโญโญ |
| Semantica GraphRAG | 91% โญ | 80ms | โญโญโญโญโญ |
30% accuracy improvement over vector-only RAG
๐บ๏ธ Roadmap
Q1 2025
- Core framework (v1.0)
- GraphRAG engine
- 6-stage ontology pipeline
- Quality assurance modules
- Enhanced multi-language support
- Real-time streaming improvements
Q2 2025
- Multi-modal processing
- Advanced reasoning v2
- AutoML for NER models
- Federated knowledge graphs
- Enterprise SSO
Q3 2025
- Temporal knowledge graphs
- Probabilistic reasoning
- Automated ontology alignment
- Graph neural networks
- Mobile SDK
Q4 2025
- Quantum-ready algorithms
- Neuromorphic computing
- Blockchain provenance
- Privacy-preserving techniques
- Version 2.0 release
๐ค Community & Support
๐ฌ Join Our Community
| Channel | Purpose |
|---|---|
| Discord | Real-time help, showcases |
| GitHub Discussions | Q&A, feature requests |
| Updates, tips | |
| YouTube | Tutorials, webinars |
๐ Learning Resources
- ๐ Documentation
- ๐ฏ Tutorials
- ๐ก Examples
- ๐ Academy
- ๐ Blog
๐ข Enterprise Support
| Tier | Features | SLA | Price |
|---|---|---|---|
| Community | Public support | Best effort | Free |
| Professional | Email support | 48h | Contact |
| Enterprise | 24/7 support | 4h | Contact |
| Premium | Phone, custom dev | 1h | Contact |
Contact: enterprise@semantica.io
๐ค Contributing
How to Contribute
# Fork and clone
git clone https://github.com/your-username/semantica.git
cd semantica
# Create branch
git checkout -b feature/your-feature
# Install dev dependencies
pip install -e ".[dev,test]"
# Make changes and test
pytest tests/
black semantica/
flake8 semantica/
# Commit and push
git commit -m "Add feature"
git push origin feature/your-feature
Contribution Types
- Code - New features, bug fixes
- Documentation - Improvements, tutorials
- Bug Reports - Create issue
- Feature Requests - Request feature
Recognition
Contributors receive:
- ๐ Recognition in CONTRIBUTORS.md
- ๐ GitHub badges
- ๐ Semantica swag
- ๐ Featured showcases
๐ License
Semantica is licensed under the MIT License - see the LICENSE file for details.
Built with โค๏ธ by the Semantica Community
Website โข Documentation โข GitHub โข Discord
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file semantica-0.0.1.tar.gz.
File metadata
- Download URL: semantica-0.0.1.tar.gz
- Upload date:
- Size: 714.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cebde300f6b46389221426173468944cc459384e9570cd6bb5bb8d072affecb7
|
|
| MD5 |
f3bc589a448cff38f57048dfc92ff63f
|
|
| BLAKE2b-256 |
c963cbc622af427161aa634b55e501b56c24d499227c0856454fbf107d6b83c0
|
Provenance
The following attestation bundles were made for semantica-0.0.1.tar.gz:
Publisher:
publish.yml on Hawksight-AI/semantica
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
semantica-0.0.1.tar.gz -
Subject digest:
cebde300f6b46389221426173468944cc459384e9570cd6bb5bb8d072affecb7 - Sigstore transparency entry: 713864250
- Sigstore integration time:
-
Permalink:
Hawksight-AI/semantica@af989bcc0e4e0f75a09e9087744bc6f8dc4b82d8 -
Branch / Tag:
refs/tags/v0.0.1 - Owner: https://github.com/Hawksight-AI
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@af989bcc0e4e0f75a09e9087744bc6f8dc4b82d8 -
Trigger Event:
release
-
Statement type:
File details
Details for the file semantica-0.0.1-py3-none-any.whl.
File metadata
- Download URL: semantica-0.0.1-py3-none-any.whl
- Upload date:
- Size: 855.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5c6cc4b0a5860860c09ae1f1b8707a49f60b2c819ff901533faef4799b32044f
|
|
| MD5 |
5a28dfe0e88c0e19a80c2c2ca7890c9e
|
|
| BLAKE2b-256 |
d6d437af29a2cdde504de8c73a8b61f9480d2ff7540dcd87723d2e59e2ad1a70
|
Provenance
The following attestation bundles were made for semantica-0.0.1-py3-none-any.whl:
Publisher:
publish.yml on Hawksight-AI/semantica
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
semantica-0.0.1-py3-none-any.whl -
Subject digest:
5c6cc4b0a5860860c09ae1f1b8707a49f60b2c819ff901533faef4799b32044f - Sigstore transparency entry: 713864253
- Sigstore integration time:
-
Permalink:
Hawksight-AI/semantica@af989bcc0e4e0f75a09e9087744bc6f8dc4b82d8 -
Branch / Tag:
refs/tags/v0.0.1 - Owner: https://github.com/Hawksight-AI
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@af989bcc0e4e0f75a09e9087744bc6f8dc4b82d8 -
Trigger Event:
release
-
Statement type: