Semantic data model for LLM-consumable data catalog

Nomox LLM Semantic Model

Internal Python package that describes the semantic model used across Nomox.

Installation

To install the package from anywhere, run:

pip install git+https://github.com/MiraZzle/nomox-semantics-package.git

For development:

pip install -e .

Architecture

The semantic model is organized into two levels plus a set of shared components:

Level 1: Source-Scoped Semantics

Produced by the Level 1 Indexer Agent. Contains:

  • DataSource: Top-level container for a data source (Trino catalog.schema)
  • Table: Tables and views with semantic roles and temporal information
  • Column: Columns with semantic types, profiling, and sample values
  • InternalRelationship: Foreign key relationships within a source

Level 2: Cross-Source Semantics

Produced by the Level 2 Aggregator Agent. Contains:

  • SemanticEntity: Canonical business concepts (Customer, Order, Product)
  • EntityManifestation: Where entities appear across sources
  • UnifiedAttribute: Logical attributes sourced from multiple places
  • EntityRelationship: Relationships between entities with join paths
  • IdentityResolution: How to match entities across sources

Shared Components

  • GlossaryTerm: Business terminology definitions
  • ConfidenceScore: Confidence scoring for all elements
  • ExpertOverride: Human corrections and enhancements
  • IndexingState: Tracking of indexing jobs and status

Quick Start

from semantic_model import (
    SemanticModel,
    DataSource,
    Table,
    Column,
    SemanticType,
    SemanticCategory,
    SourceType,
    create_empty_model,
    save_model,
    load_model,
)

# Create an empty model
model = create_empty_model(
    model_id="my-org-model",
    organization_id="my-org",
)

# Create a data source
source = DataSource(
    id="sales-db",
    name="Sales Database",
    trino_catalog="analytics",
    trino_schema="sales",
    fully_qualified_prefix="analytics.sales",
    source_type=SourceType.ANALYTICAL,
    description="Sales transaction data warehouse",
    domain="Sales",
)

# Create a table
orders_table = Table(
    id="orders",
    name="orders",
    fully_qualified_name="analytics.sales.orders",
    description="Fact table containing one row per order",
    columns=[
        Column(
            id="order_id",
            name="order_id",
            ordinal_position=0,
            data_type="VARCHAR",
            is_primary_key=True,
            semantic_type=SemanticType.identifier(subtype="uuid"),
            description="Unique order identifier",
        ),
        Column(
            id="customer_id",
            name="customer_id",
            ordinal_position=1,
            data_type="VARCHAR",
            is_foreign_key=True,
            semantic_type=SemanticType.identifier(subtype="uuid"),
            description="ID of the customer who placed the order",
        ),
        Column(
            id="total_amount",
            name="total_amount",
            ordinal_position=2,
            data_type="DECIMAL(12,2)",
            semantic_type=SemanticType(
                category=SemanticCategory.CURRENCY,
                confidence=0.95,
            ),
            unit="USD",
            description="Total order value including tax",
        ),
    ],
)

# Add table to source
source = source.add_table(orders_table)

# Add source to model
model = model.add_source(source)

# Save the model
save_model(model, "semantic_model.json")

# Load the model
loaded_model = load_model("semantic_model.json")

# Generate prompt context for LLM
prompt_context = model.to_prompt_format(
    include_sources=True,
    include_entities=True,
    include_glossary=True,
)
print(prompt_context)
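
The Quick Start reassigns the results of `add_table` and `add_source`, which suggests that these methods return an updated object instead of mutating in place. The sketch below illustrates that pattern using only the standard library. `SketchTable` and `SketchSource` are hypothetical stand-ins, not the package's classes, and the copy-on-add behavior is an assumption inferred from the example above.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SketchTable:
    """Hypothetical stand-in for Table; not the package's class."""
    id: str


@dataclass(frozen=True)
class SketchSource:
    """Hypothetical stand-in for DataSource; not the package's class."""
    id: str
    tables: Tuple[SketchTable, ...] = ()

    def add_table(self, table: SketchTable) -> "SketchSource":
        # Return a copy with the table appended; the original is unchanged.
        return replace(self, tables=self.tables + (table,))


original = SketchSource(id="sales-db")
updated = original.add_table(SketchTable(id="orders"))

print(len(original.tables))  # the original has no tables
print(len(updated.tables))   # the returned copy has one table
```

If the package does follow this pattern, calling `source.add_table(orders_table)` without assigning the result would discard the new table.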

Working with Confidence Scores

from semantic_model import ConfidenceScore, LowConfidenceItem, ConfidenceObjectType

# Create a confidence score
confidence = ConfidenceScore(
    overall=0.75,
    threshold=0.8,
    schema_understanding=0.9,
    semantic_typing=0.7,
    description_quality=0.65,
    low_confidence_items=[
        LowConfidenceItem(
            object_type=ConfidenceObjectType.COLUMN,
            object_id="status_code",
            object_name="status_code",
            score=0.4,
            reason="Unknown categorical values",
            suggested_clarification="What do status codes 'P', 'A', 'R' mean?",
        ),
    ],
)

# Check if meets threshold
if not confidence.meets_threshold:
    print("Source needs expert review")
    for item in confidence.low_confidence_items:
        print(f"  - {item.object_name}: {item.reason}")
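
The example above sets `overall=0.75` against `threshold=0.8` and expects `meets_threshold` to be false. A minimal sketch of one rule consistent with that behavior follows. `SketchConfidence` is a hypothetical stand-in, and the comparison `overall >= threshold` is an assumption, not the package's confirmed logic.

```python
from dataclasses import dataclass


@dataclass
class SketchConfidence:
    """Hypothetical stand-in for ConfidenceScore; not the package's class."""
    overall: float
    threshold: float

    @property
    def meets_threshold(self) -> bool:
        # Assumed rule: the overall score must reach the threshold.
        return self.overall >= self.threshold


below = SketchConfidence(overall=0.75, threshold=0.8)
above = SketchConfidence(overall=0.85, threshold=0.8)

print(below.meets_threshold)
print(above.meets_threshold)
```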

Expert Overrides

from semantic_model import ExpertOverride, ReindexScope

# Create an override
override = ExpertOverride(
    id="override-001",
    created_by="domain-expert@company.com",
    field_path="description",
    original_value="Unknown table",
    override_value="Customer master data from CRM system",
    reason="Clarified based on CRM documentation",
    reindex_scope=ReindexScope.THIS_SOURCE,
)

# Apply to a table (`table` is an existing Table instance,
# for example one built as in the Quick Start)
table.expert_overrides.append(override)
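
An override pairs a `field_path` with an `override_value`. How the package resolves the path is not described here, so the following is only a sketch of one possible approach on plain dictionaries. The helper `apply_override` and its support for dotted paths are hypothetical.

```python
from typing import Any, Dict


def apply_override(record: Dict[str, Any], field_path: str, value: Any) -> Dict[str, Any]:
    """Return a copy of record with the dotted field_path set to value.

    Hypothetical helper; the package may resolve field paths differently.
    """
    keys = field_path.split(".")
    result = dict(record)
    cursor = result
    for key in keys[:-1]:
        # Copy each nested dict on the way down so the input stays untouched.
        cursor[key] = dict(cursor.get(key, {}))
        cursor = cursor[key]
    cursor[keys[-1]] = value
    return result


table_record = {"id": "accounts", "description": "Unknown table"}
patched = apply_override(
    table_record, "description", "Customer master data from CRM system"
)
print(patched["description"])
```

Returning a copy keeps the original value available, which mirrors the way `ExpertOverride` stores both `original_value` and `override_value`.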

Semantic Entities (Level 2)

from semantic_model import (
    SemanticEntity,
    EntityManifestation,
    ManifestationRole,
    UnifiedAttribute,
    EntityRelationship,
    JoinPath,
    JoinStep,
)

# Create a semantic entity
customer_entity = SemanticEntity(
    id="customer",
    name="Customer",
    description="A customer is any individual or organization with an account",
    canonical_id_name="customer_id",
    canonical_id_format="UUID",
    domain="Sales",
    manifestations=[
        EntityManifestation(
            source_id="crm-db",
            table_id="accounts",
            fully_qualified_name="crm.public.accounts",
            role=ManifestationRole.PRIMARY,
            key_column_id="account_id",
            usage_guidance="Use for real-time customer master data",
        ),
        EntityManifestation(
            source_id="analytics-db",
            table_id="customer_360",
            fully_qualified_name="analytics.customers.customer_360",
            role=ManifestationRole.DERIVED,
            key_column_id="customer_id",
            usage_guidance="Use for analytics with pre-computed metrics",
        ),
    ],
)

# Add to model
model = model.add_entity(customer_entity)
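
Each manifestation carries a role, so a consumer can choose the table that fits its task, for example the primary table for master data. The sketch below shows that lookup with stand-in types. `SketchRole`, `SketchManifestation`, and `find_by_role` are hypothetical and are not part of the package.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class SketchRole(Enum):
    """Hypothetical stand-in for ManifestationRole."""
    PRIMARY = "primary"
    DERIVED = "derived"


@dataclass(frozen=True)
class SketchManifestation:
    """Hypothetical stand-in for EntityManifestation."""
    fully_qualified_name: str
    role: SketchRole


def find_by_role(
    manifestations: Iterable[SketchManifestation], role: SketchRole
) -> Optional[SketchManifestation]:
    # Return the first manifestation with the requested role, or None.
    return next((m for m in manifestations if m.role == role), None)


customer_manifestations = [
    SketchManifestation("crm.public.accounts", SketchRole.PRIMARY),
    SketchManifestation("analytics.customers.customer_360", SketchRole.DERIVED),
]
primary = find_by_role(customer_manifestations, SketchRole.PRIMARY)
print(primary.fully_qualified_name)
```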

Serialization

from semantic_model import save_model, load_model
from semantic_model.serialization import ModelExporter, save_model_yaml

# Save as JSON
save_model(model, "model.json")

# Save as YAML (requires PyYAML)
save_model_yaml(model, "model.yaml")

# Export utilities
exporter = ModelExporter(model)

# Get prompt-ready context
context = exporter.to_prompt_context(max_tokens=4000)

# Get source summary
summary = exporter.to_source_summary()

