DDN Metadata Bootstrap

AI-powered metadata enhancement for Hasura DDN (Data Delivery Network) schema files. The tool generates descriptions and detects relationships in your YAML/HML schema definitions, with provider choice, feature flags, and quality thresholds managed through a central configuration.

🚀 Features

🤖 Multi-Provider AI Support

  • Anthropic Claude: Default provider with claude-3-haiku, claude-3-sonnet, and claude-3-opus models
  • OpenAI GPT: Support for gpt-3.5-turbo, gpt-4, gpt-4o-mini, and latest models
  • Google Gemini: Support for gemini-pro, gemini-1.5-pro, and gemini-1.5-flash models
  • Automatic Fallback: Graceful degradation between providers with configurable priorities
  • Provider-Specific Optimization: Model-specific prompting and parameter tuning
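
The fallback behavior in the list above can be pictured with a small sketch. This is not the tool's implementation: `generate_with_fallback`, `ProviderError`, and the two stub provider functions are hypothetical names used only for illustration, and a real setup would call each vendor's SDK instead of the stubs.

```python
from typing import Callable, Dict, List, Tuple


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def generate_with_fallback(
    prompt: str,
    providers: Dict[str, Callable[[str], str]],
    priority: List[str],
) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, text)."""
    errors = {}
    for name in priority:
        call = providers.get(name)
        if call is None:
            continue  # provider not configured, move to the next one
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors[name] = str(exc)  # remember why this provider was skipped
    raise ProviderError(f"all providers failed: {errors}")


# Stub providers standing in for real SDK calls (assumptions, not real APIs).
def failing_provider(prompt: str) -> str:
    raise ProviderError("rate limited")


def working_provider(prompt: str) -> str:
    return f"Description for {prompt}"


used, text = generate_with_fallback(
    "riskId",
    {"anthropic": failing_provider, "openai": working_provider},
    priority=["anthropic", "openai", "gemini"],
)
print(used, "->", text)
```

Keeping the priority list separate from the provider table means the order can change without touching the provider setup.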

🎯 Granular Feature Control

  • Individual Feature Flags: Control each processing feature independently
  • Flexible Processing Modes: Choose between all, forward-only, or none for relationships
  • Selective Enhancement: Process only descriptions, only relationships, or both
  • Rebuild Capabilities: Rebuild existing relationships from scratch when needed

🧠 Advanced AI Generation

  • Quality Assessment with Retry Logic: Multi-attempt generation with configurable scoring thresholds
  • Context-Aware Business Descriptions: Domain-specific system prompts with industry context
  • Smart Field Analysis: Automatically detects and skips self-explanatory, generic, or cryptic fields
  • Configurable Length Controls: Precise control over description length and token usage
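
As a rough illustration of multi-attempt generation against a scoring threshold, the loop below keeps the best-scoring draft and stops as soon as one reaches the minimum. The defaults mirror the `minimum_description_score: 70` and `max_description_retry_attempts: 3` settings from the sample configuration; the function name, the stub generator, and the toy scorer are assumptions, and returning `None` when no draft passes is a guess at reasonable behavior rather than documented behavior.

```python
from typing import Callable, Optional, Tuple


def generate_with_retries(
    generate: Callable[[int], str],
    score: Callable[[str], int],
    minimum_score: int = 70,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (description or None, best score seen)."""
    best_text, best_score = None, -1
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
        if candidate_score >= minimum_score:
            break  # good enough, stop spending API calls
    accepted = best_text if best_score >= minimum_score else None
    return accepted, best_score


# Stub generator and scorer standing in for the AI call and quality assessment.
drafts = [
    "Stores data.",
    "Risk assessment identifier for tracking security evaluations.",
]
text, final_score = generate_with_retries(
    generate=lambda attempt: drafts[attempt - 1],
    score=lambda draft: 85 if len(draft.split()) >= 5 else 40,
)
print(final_score, text)
```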

🧠 Intelligent Caching System

  • Similarity-Based Matching: Reuses descriptions for similar fields across entities (85% similarity threshold)
  • Performance Optimization: Reduces API calls by up to 70% on large schemas through intelligent caching
  • Cache Statistics: Real-time performance monitoring with hit rates and API cost savings tracking
  • Type-Aware Matching: Considers field types and entity context for better cache accuracy
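
To make the similarity-based, type-aware matching concrete, here is a minimal sketch built on the standard library's `difflib.SequenceMatcher`. The similarity metric the tool actually uses is not documented here, so the metric, the `DescriptionCache` class, and its method names are assumptions; only the 0.85 threshold comes from the feature list above.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


class DescriptionCache:
    """Reuse descriptions for similarly named fields of the same type."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[str, str, str]] = []  # (name, type, description)
        self.hits = 0
        self.misses = 0

    def store(self, field_name: str, field_type: str, description: str) -> None:
        self._entries.append((field_name.lower(), field_type, description))

    def lookup(self, field_name: str, field_type: str) -> Optional[str]:
        best_description, best_ratio = None, 0.0
        for name, cached_type, description in self._entries:
            if cached_type != field_type:
                continue  # type-aware: never reuse across field types
            ratio = SequenceMatcher(None, name, field_name.lower()).ratio()
            if ratio > best_ratio:
                best_description, best_ratio = description, ratio
        if best_description is not None and best_ratio >= self.threshold:
            self.hits += 1
            return best_description
        self.misses += 1
        return None

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = DescriptionCache()
cache.store("customer_email", "String", "Customer contact email address.")
print(cache.lookup("customer_emails", "String"))  # similar name, same type
print(cache.lookup("customer_email", "Int"))      # same name, different type
print(cache.lookup("order_total", "String"))      # unrelated name
print(f"hit rate: {cache.hit_rate:.0%}")
```

Every cache hit is one provider call avoided, which is where the reduction in API calls on large schemas would come from.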

🔍 WordNet-Based Linguistic Analysis

  • Generic Term Detection: Uses NLTK and WordNet for sophisticated term analysis to skip meaningless fields
  • Semantic Density Analysis: Evaluates conceptual richness and specificity of field names
  • Definition Quality Scoring: Ensures meaningful, non-circular descriptions through linguistic validation
  • Abstraction Level Calculation: Determines appropriate description depth based on semantic analysis

📝 Enhanced Acronym Expansion

  • Comprehensive Mappings: 200+ pre-configured acronyms for technology, finance, and business domains
  • Context-Aware Expansion: Industry-specific acronym interpretation based on domain context
  • Pre-Generation Enhancement: Expands acronyms BEFORE AI generation for better context
  • Custom Domain Support: Fully configurable acronym mappings via YAML configuration
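
A short sketch of what expanding acronyms before generation might look like: the field name is split into tokens and any token found in the mapping is replaced by its expansion, so the prompt sent to the AI already carries the full term. The mappings are taken from the sample configuration; `split_field_name` and `expand_acronyms` are hypothetical helpers, not the package's API.

```python
import re
from typing import Dict, List

# A few entries from the acronym_mappings section of the sample config.
ACRONYM_MAPPINGS: Dict[str, str] = {
    "api": "Application Programming Interface",
    "mfa": "Multi-Factor Authentication",
    "sso": "Single Sign-On",
    "iam": "Identity and Access Management",
}


def split_field_name(name: str) -> List[str]:
    """Split camelCase and snake_case names into tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [token for token in re.split(r"[\s_\-]+", spaced) if token]


def expand_acronyms(name: str, mappings: Dict[str, str] = ACRONYM_MAPPINGS) -> str:
    """Return a readable phrase with known acronyms expanded."""
    tokens = split_field_name(name)
    return " ".join(mappings.get(t.lower(), t.lower()) for t in tokens)


for field in ("mfaEnabled", "sso_config", "iamPolicy"):
    print(field, "->", expand_acronyms(field))
```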

🔗 Advanced Relationship Detection

  • Template-Based FK Detection: Sophisticated foreign key detection with confidence scoring and semantic validation
  • Shared Business Key Relationships: Many-to-many relationships via shared field analysis with FK-aware filtering
  • Cross-Subgraph Intelligence: Smart entity matching across different subgraphs
  • Configurable Templates: Flexible FK template patterns with placeholders for complex naming conventions
  • Advanced Relationship Blocking: Precision rule-based system to prevent inappropriate cross-connector relationships
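
One way to read the FK templates is as patterns that compile to regular expressions over field names. The sketch below handles only the foreign-key half of a `"{pk_pattern}|{fk_pattern}"` template and extracts the placeholder values from a candidate field name. The regex chosen for each placeholder, the use of the sample `generic_fields` list as the generic-id vocabulary, and both function names are assumptions; the tool's real matching (including confidence scoring and entity lookup) is not shown.

```python
import re
from typing import Dict, Optional

# Generic id names, borrowed from the generic_fields list in the sample config.
GENERIC_IDS = ("id", "key", "uid", "guid", "name")

# Assumed regex fragment for each placeholder.
PLACEHOLDER_REGEX = {
    "gi": "(?P<gi>" + "|".join(GENERIC_IDS) + ")",  # generic id
    "pt": "(?P<pt>[a-z0-9_]+?)",                     # primary table
    "pm": "(?P<pm>[a-z0-9]+)",                       # prefix modifier
    "ps": "(?P<ps>[a-z0-9]+)",                       # primary subgraph
}


def compile_fk_pattern(template: str) -> "re.Pattern[str]":
    """Compile the FK half of a 'pk|fk' template into an anchored regex."""
    _pk_pattern, fk_pattern = template.split("|", 1)
    # re.split with one capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", fk_pattern)
    regex = "".join(
        PLACEHOLDER_REGEX[part] if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    return re.compile("^" + regex + "$")


def match_fk_field(template: str, field_name: str) -> Optional[Dict[str, str]]:
    """Return extracted placeholder values, or None if the name does not fit."""
    match = compile_fk_pattern(template).match(field_name.lower())
    return match.groupdict() if match else None


print(match_fk_field("{gi}|{pt}_{gi}", "user_id"))
print(match_fk_field("{gi}|{pm}_{pt}_{gi}", "active_service_name"))
print(match_fk_field("{gi}|{pt}_{gi}", "description"))
```

The extracted `pt` value would then need to be matched against known entity names, a step left out here.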

⚙️ Comprehensive Configuration System

  • YAML-First Configuration: Central config.yaml file for all settings with full documentation
  • Waterfall Precedence: CLI args > Environment variables > config.yaml > defaults
  • Configuration Validation: Comprehensive validation with helpful error messages and source tracking
  • Feature Toggles: Granular control over processing features with clear flag names
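
The waterfall precedence can be sketched as a lookup that walks the layers from highest to lowest priority and reports where the winning value came from. The `METADATA_BOOTSTRAP_` prefix follows the environment variables shown in this README; `resolve_setting` and the `DEFAULTS` table are illustrative assumptions, not the package's API.

```python
import os
from typing import Any, Dict, Mapping, Optional, Tuple

# Assumed defaults; the README states only that anthropic is the default provider.
DEFAULTS: Dict[str, Any] = {"ai_provider": "anthropic"}


def resolve_setting(
    key: str,
    cli_args: Mapping[str, Any],
    config_file: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
    defaults: Mapping[str, Any] = DEFAULTS,
) -> Tuple[Any, str]:
    """Return (value, source) using CLI > environment > config.yaml > defaults."""
    environ = os.environ if environ is None else environ
    env_key = "METADATA_BOOTSTRAP_" + key.upper()
    layers = [
        ("cli", cli_args.get(key)),
        ("env", environ.get(env_key)),
        ("config.yaml", config_file.get(key)),
        ("default", defaults.get(key)),
    ]
    for source, value in layers:
        if value is not None:
            return value, source
    raise KeyError(f"no value configured for {key!r}")


env = {"METADATA_BOOTSTRAP_AI_PROVIDER": "openai"}
config = {"ai_provider": "gemini"}
print(resolve_setting("ai_provider", {}, config, environ=env))
print(resolve_setting("ai_provider", {"ai_provider": "gemini"}, config, environ=env))
print(resolve_setting("ai_provider", {}, {}, environ={}))
```

Returning the source alongside the value is one simple way to support the source tracking mentioned in the list above.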

🎯 Advanced Quality Controls

  • Buzzword Detection: Avoids corporate jargon and meaningless generic terms
  • Pattern-Based Filtering: Regex-based rejection of poor description formats
  • Technical Language Translation: Converts technical terms to business-friendly language
  • Length Optimization: Multiple validation layers with hard limits and target lengths
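
The checks in this list can be combined into a single filter that reports why a candidate description would be rejected. The buzzwords, patterns, and 120-character limit are copied from the sample configuration; the `rejection_reasons` function and the order of the checks are assumptions made for illustration.

```python
import re
from typing import List

# Values taken from the sample config.yaml in this README.
BUZZWORDS = {
    "synergy", "leverage", "paradigm", "ecosystem",
    "contains", "stores", "holds", "represents",
}
FORBIDDEN_PATTERNS = [
    r"this\s+field\s+represents",
    r"used\s+to\s+(track|manage|identify)",
    r"business.*information",
]
FIELD_DESC_MAX_LENGTH = 120


def rejection_reasons(description: str) -> List[str]:
    """Return a list of reasons to reject the description (empty if acceptable)."""
    reasons = []
    lowered = description.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    found = sorted(words & BUZZWORDS)
    if found:
        reasons.append("buzzwords: " + ", ".join(found))
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, lowered):
            reasons.append("forbidden pattern: " + pattern)
    if len(description) > FIELD_DESC_MAX_LENGTH:
        reasons.append(f"longer than {FIELD_DESC_MAX_LENGTH} characters")
    return reasons


print(rejection_reasons("This field represents business synergy information"))
print(rejection_reasons("Risk assessment identifier for tracking security evaluations."))
```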

🔍 Intelligent Field Selection

  • Generic Field Detection: Skips overly common fields that don't benefit from descriptions
  • Cryptic Abbreviation Handling: Configurable handling of unclear field names with vowel analysis
  • Self-Explanatory Pattern Recognition: Automatically identifies fields that don't need descriptions
  • Value Assessment: Only generates descriptions that add meaningful business value
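
A minimal sketch of how field selection could be decided before any AI call is made. The skip patterns, generic field names, and the length limit of 4 come from the sample configuration; the vowel-ratio threshold of 0.25 and the `skip_reason` function are assumptions, since the exact vowel analysis is not described here.

```python
import re
from typing import Optional

# Values taken from the sample config.yaml in this README.
SKIP_FIELD_PATTERNS = [
    r"^id$", r"^_id$", r"^uuid$", r"^created_at$", r"^updated_at$",
    r"^debug_.*", r"^test_.*", r"^temp_.*",
]
GENERIC_FIELDS = {"id", "key", "uid", "guid", "name"}
MAX_CRYPTIC_FIELD_LENGTH = 4
MIN_VOWEL_RATIO = 0.25  # assumed threshold for "looks like an abbreviation"


def vowel_ratio(name: str) -> float:
    """Share of alphabetic characters that are vowels."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in "aeiou" for c in letters) / len(letters)


def skip_reason(field_name: str) -> Optional[str]:
    """Return why a field should be skipped, or None if it should be described."""
    lowered = field_name.lower()
    if any(re.search(pattern, lowered) for pattern in SKIP_FIELD_PATTERNS):
        return "skip pattern"
    if lowered in GENERIC_FIELDS:
        return "generic"
    if len(lowered) <= MAX_CRYPTIC_FIELD_LENGTH and vowel_ratio(lowered) < MIN_VOWEL_RATIO:
        return "cryptic"
    return None


for name in ("created_at", "name", "xfr", "riskScore"):
    print(name, "->", skip_reason(name))
```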

📦 Installation

From PyPI (Recommended)

pip install ddn-metadata-bootstrap

Provider-Specific Dependencies

The tool supports multiple AI providers. Install the dependencies for your chosen provider:

# For Anthropic Claude (default)
pip install ddn-metadata-bootstrap[anthropic]
# or separately:
pip install anthropic

# For OpenAI GPT  
pip install ddn-metadata-bootstrap[openai]
# or separately:
pip install openai

# For Google Gemini
pip install ddn-metadata-bootstrap[gemini]
# or separately: 
pip install google-generativeai

# Install all providers
pip install ddn-metadata-bootstrap[all]

From Source

git clone https://github.com/hasura/ddn-metadata-bootstrap.git
cd ddn-metadata-bootstrap
pip install -e .

🏃 Quick Start

1. Choose Your AI Provider

Option A: Anthropic Claude (Default - Recommended)

export ANTHROPIC_API_KEY="your-anthropic-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"  # Optional (default)
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-haiku-20240307"  # Optional

Option B: OpenAI GPT

export OPENAI_API_KEY="your-openai-api-key"  
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_OPENAI_MODEL="gpt-3.5-turbo"  # Optional

Option C: Google Gemini

export GEMINI_API_KEY="your-gemini-api-key"
# or alternatively:
export GOOGLE_API_KEY="your-gemini-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="gemini"
export METADATA_BOOTSTRAP_GEMINI_MODEL="gemini-pro"  # Optional

2. Set up your directories

export METADATA_BOOTSTRAP_INPUT_DIR="./app/metadata"
export METADATA_BOOTSTRAP_OUTPUT_DIR="./enhanced_metadata"

3. Create a configuration file (Recommended)

Create a config.yaml file in your project directory:

# config.yaml - DDN Metadata Bootstrap Configuration

# =============================================================================
# GLOBAL PROCESSING CONFIGURATION
# =============================================================================
# Controls which features are enabled and basic processing behavior

# Feature Flags - what processing to perform
create_fk: all                           # all|forward|none - FK relationships
create_shared_keys: all                  # all|forward|none - Shared key relationships
create_command_relationship_hints: true  # true|false - Command relationship hints
create_descriptions: true                # true|false - AI-generated descriptions
rebuild_relationships: false             # true|false - Rebuild existing relationships from scratch

enable_quality_assessment: true         # Enable AI to score and improve its own descriptions

# AI Provider Configuration
ai_provider: "anthropic"  # Choose: anthropic, openai, gemini

# Provider-specific API keys (alternatively set via environment variables)
# anthropic_api_key: "your-anthropic-key"
# openai_api_key: "your-openai-key" 
# gemini_api_key: "your-gemini-key"

# Provider-specific models
anthropic_model: "claude-3-haiku-20240307"  # claude-3-sonnet-20240229, claude-3-opus-20240229
openai_model: "gpt-3.5-turbo"               # gpt-4, gpt-4o-mini, gpt-4-turbo-preview
gemini_model: "gemini-pro"                  # gemini-1.5-pro-latest, gemini-1.5-flash

# =============================================================================
# DESCRIPTION GENERATION CONFIGURATION
# =============================================================================

# Domain-specific system prompt for your organization
system_prompt: |
  You generate concise field descriptions for database schema metadata at a global financial services firm.
  
  DOMAIN CONTEXT:
  - Organization: Global bank
  - Department: Cybersecurity operations  
  - Use case: Risk management and security compliance
  - Regulatory environment: Financial services (SOX, Basel III, GDPR, etc.)
  
  Think: "What would a cybersecurity analyst at a bank need to know about this field?"

# Token and length limits
field_tokens: 25                    # Max tokens AI can generate for field descriptions
kind_tokens: 50                     # Max tokens AI can generate for kind descriptions
field_desc_max_length: 120          # Maximum total characters for field descriptions
kind_desc_max_length: 250           # Maximum total characters for entity descriptions

# Quality thresholds
minimum_description_score: 70       # Minimum score (0-100) to accept a description
max_description_retry_attempts: 3   # How many times to retry for better quality

# =============================================================================
# ENHANCED ACRONYM EXPANSION
# =============================================================================
acronym_mappings:
  # Technology & Computing
  api: "Application Programming Interface"
  ui: "User Interface"
  db: "Database"
  
  # Security & Access Management
  mfa: "Multi-Factor Authentication"
  sso: "Single Sign-On"
  iam: "Identity and Access Management"
  siem: "Security Information and Event Management"
  
  # Financial Services & Compliance
  pci: "Payment Card Industry"
  sox: "Sarbanes-Oxley Act"
  kyc: "Know-Your-Customer"
  aml: "Anti-Money Laundering"
  # ... 200+ total mappings available

# =============================================================================
# INTELLIGENT FIELD SELECTION
# =============================================================================
# Fields to skip entirely - these will not get descriptions at all
skip_field_patterns:
  - "^id$"
  - "^_id$"
  - "^uuid$"
  - "^created_at$"
  - "^updated_at$"
  - "^debug_.*"
  - "^test_.*"
  - "^temp_.*"

# Generic fields - won't get unique descriptions (too common)
generic_fields:
  - "id"
  - "key"
  - "uid"
  - "guid"
  - "name"

# Self-explanatory fields - simple patterns that don't need descriptions
self_explanatory_patterns:
  - '^id$'
  - '^_id$'
  - '^guid$'
  - '^uuid$'
  - '^key$'

# Cryptic Field Handling
skip_cryptic_abbreviations: true   # Skip fields with unclear abbreviations
skip_ultra_short_fields: true      # Skip very short field names that are likely abbreviations
max_cryptic_field_length: 4        # Field names this length or shorter are considered cryptic

# Content quality controls
buzzwords: [
  'synergy', 'leverage', 'paradigm', 'ecosystem',
  'contains', 'stores', 'holds', 'represents'
]

# Single-quoted YAML strings keep backslashes literally, so use one backslash per regex escape
forbidden_patterns: [
  'this\s+field\s+represents',
  'used\s+to\s+(track|manage|identify)',
  'business.*information'
]

# =============================================================================
# RELATIONSHIP DETECTION
# =============================================================================
# FK Template Patterns for relationship detection
# Format: "{pk_pattern}|{fk_pattern}"
# Placeholders: {gi}=generic_id, {pt}=primary_table, {ps}=primary_subgraph, {pm}=prefix_modifier
fk_templates:
  - "{gi}|{pm}_{pt}_{gi}"           # active_service_name → Services.name
  - "{gi}|{pt}_{gi}"                # user_id → Users.id
  - "{pt}_{gi}|{pm}_{pt}_{gi}"      # user_id → ActiveUsers.active_user_id

# =============================================================================
# ADVANCED RELATIONSHIP BLOCKING
# =============================================================================
# Precision rule-based system to prevent inappropriate relationships
# Uses bidirectional validation with data_connector + entity + field pattern matching
fk_key_blacklist:
  # Block cross-cloud provider connections with infrastructure fields
  - entity_pattern_a:
      data_connector: "^(gcp|arg|various)$"  # Google Cloud Platform, Azure Resource Graph, Various
      entity: "^(gcp_|google_).*"            # Google/GCP entities
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^(gcp|arg|various)$"  
      entity: "^(az_|azure_).*"              # Azure entities
      field: ".*(resource|project|policy|storage|compute).*"  # Infrastructure fields only
    logic: "and"
    reason: "Block google/gcp entities from connecting to azure entities with infrastructure-related fields"
  
  # Complete isolation between major cloud platforms
  - entity_pattern_a:
      data_connector: "^gcp$"                # Google Cloud Platform connector
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^arg$"                # Azure Resource Graph connector  
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    logic: "and"
    reason: "Block all connections between Google Cloud Platform and Azure Resource Graph connectors"

# Shared relationship limits
max_shared_relationships: 10000
max_shared_per_entity: 10
min_shared_confidence: 30

# Shared Key Rejection Patterns - fields matching these won't be used for shared relationships
shared_key_rejection_patterns:
  # Private/Technical Fields (leading underscore indicates internal use)
  - "^_.*$"
  # Primary Identifiers (too generic for meaningful relationships)
  - "^_?(id|key)$"
  # Generic Classification Fields (overly broad categorization)
  - "^(name|type|category|title|code|level|kind)$"
  # State/Status Fields (frequently changing, not structural)
  - "^(status|state|active)$"
  # Audit Fields - Temporal Only (timestamp-based, not relational)
  - "^(created|updated|modified)(_at|_date|_time|_timestamp)?$"
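
The shared key rejection patterns at the end of the configuration above can be checked quickly in Python before a run. The patterns below are copied from that configuration; `is_rejected_shared_key` is a hypothetical helper, and case-sensitive matching is an assumption about how the tool applies them.

```python
import re

# Copied from shared_key_rejection_patterns in the sample config.yaml.
SHARED_KEY_REJECTION_PATTERNS = [
    r"^_.*$",
    r"^_?(id|key)$",
    r"^(name|type|category|title|code|level|kind)$",
    r"^(status|state|active)$",
    r"^(created|updated|modified)(_at|_date|_time|_timestamp)?$",
]
_COMPILED = [re.compile(pattern) for pattern in SHARED_KEY_REJECTION_PATTERNS]


def is_rejected_shared_key(field_name: str) -> bool:
    """True if the field should not be used to infer a shared-key relationship."""
    return any(pattern.search(field_name) for pattern in _COMPILED)


for name in ("_internal", "id", "status", "created_at", "account_number"):
    print(name, "->", "rejected" if is_rejected_shared_key(name) else "kept")
```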

4. Run the tool with your chosen provider

# Use default provider (Anthropic) with default settings
ddn-metadata-bootstrap

# Use OpenAI explicitly
ddn-metadata-bootstrap --ai-provider openai --openai-api-key your-key

# Use Gemini with specific model
ddn-metadata-bootstrap --ai-provider gemini --gemini-model gemini-1.5-pro

# Show configuration including AI provider setup
ddn-metadata-bootstrap --show-config

# Test your AI provider connection
ddn-metadata-bootstrap --test-provider

# Process only relationships (skip descriptions)
ddn-metadata-bootstrap --create-descriptions false

# Process only descriptions (skip relationships)
ddn-metadata-bootstrap --create-fk none --create-shared-keys none

# Rebuild all relationships from scratch
ddn-metadata-bootstrap --rebuild-relationships

# Use custom configuration file
ddn-metadata-bootstrap --config custom-config.yaml

# Enable verbose logging to see AI provider selection and caching
ddn-metadata-bootstrap --verbose

🎯 Feature Control System

Each processing feature has its own flag, which can be set in config.yaml, through an environment variable, or on the command line:

Core Processing Features

Feature | Config Key | Values | Description
FK Relationships | create_fk | all, forward, none | Foreign key relationship detection
Shared Key Relationships | create_shared_keys | all, forward, none | Shared field relationship detection
Command Hints | create_command_relationship_hints | true, false | Command relationship hints
Descriptions | create_descriptions | true, false | AI-generated descriptions
Rebuild Mode | rebuild_relationships | true, false | Rebuild existing relationships

Processing Modes

All Mode (all)

  • Creates relationships in both directions
  • Full bidirectional relationship graph
  • Best for comprehensive schema understanding

Forward Mode (forward)

  • Creates relationships in forward direction only
  • Reduces relationship complexity
  • Useful for directed schema analysis

None Mode (none)

  • Skips the feature entirely
  • Fastest processing
  • Use when feature not needed
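
The three modes can be summarized as a function from a detected link to the relationship directions that get emitted. This is only a sketch: `relationship_directions` is a hypothetical name, and treating "forward" as the direction from the entity holding the key to the entity it references is an assumption.

```python
from typing import List, Tuple


def relationship_directions(mode: str, source: str, target: str) -> List[Tuple[str, str]]:
    """Return the (from_entity, to_entity) pairs to create for one detected link."""
    if mode == "none":
        return []  # feature disabled, nothing is created
    forward = (source, target)
    if mode == "forward":
        return [forward]
    if mode == "all":
        return [forward, (target, source)]  # add the reverse direction as well
    raise ValueError(f"unknown mode: {mode!r} (expected all, forward, or none)")


for mode in ("all", "forward", "none"):
    print(mode, relationship_directions(mode, "Orders", "Users"))
```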

Feature Combinations

# Only generate descriptions (no relationships)
create_fk: none
create_shared_keys: none
create_descriptions: true

# Only generate FK relationships (no descriptions or shared keys)
create_fk: all
create_shared_keys: none
create_descriptions: false

# Minimal processing (relationships only, forward direction)
create_fk: forward
create_shared_keys: forward
create_descriptions: false

# Full processing with rebuild
create_fk: all
create_shared_keys: all
create_descriptions: true
rebuild_relationships: true

🔗 Advanced Relationship Blocking System

The tool includes a sophisticated bidirectional relationship blocking system that prevents inappropriate foreign key relationships from being generated. This is particularly important in enterprise environments with multiple data connectors, cloud providers, and security boundaries.

Key Features

Precision Pattern Matching

Each blocking rule uses three-part patterns for maximum precision:

  • Data Connector: Regex pattern matching the connector name (e.g., ^gcp$, ^(test|dev)_.*)
  • Entity Name: Regex pattern matching the entity/table name (e.g., ^google_.*, ^azure_storage.*)
  • Field Name: Regex pattern matching the field name (e.g., .*resource.*, .*secret.*)

Bidirectional Validation

Rules automatically check both directions of a relationship:

  • Pattern A → Pattern B: google_compute → azure_storage_resource
  • Pattern B → Pattern A: azure_vm → google_analytics_data

Both directions are blocked by a single rule definition.

Flexible Logic Operators

  • AND Logic: All patterns (connector AND entity AND field) must match for both sides
  • OR Logic: Either side matching its full pattern triggers the block
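
The three-part patterns, bidirectional check, and logic operators described above could be evaluated roughly as follows. The rule dictionary uses the same keys as the `fk_key_blacklist` entries in the configuration; the `Endpoint` class, `side_matches`, and `is_blocked` are hypothetical names, and the use of `re.search` with case-sensitive matching is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Endpoint:
    """One side of a candidate relationship."""
    data_connector: str
    entity: str
    field: str


def side_matches(pattern: Dict[str, str], endpoint: Endpoint) -> bool:
    """All three regexes (connector, entity, field) must match the endpoint."""
    return all(
        re.search(pattern[key], getattr(endpoint, key))
        for key in ("data_connector", "entity", "field")
    )


def is_blocked(rule: Dict, source: Endpoint, target: Endpoint) -> bool:
    """Check the rule in both directions using its 'and' / 'or' logic."""
    a, b = rule["entity_pattern_a"], rule["entity_pattern_b"]
    combine = all if rule.get("logic", "and") == "and" else any
    forward = combine([side_matches(a, source), side_matches(b, target)])
    reverse = combine([side_matches(a, target), side_matches(b, source)])
    return forward or reverse


rule = {
    "entity_pattern_a": {"data_connector": "^gcp$", "entity": ".*", "field": ".*"},
    "entity_pattern_b": {"data_connector": "^arg$", "entity": ".*", "field": ".*"},
    "logic": "and",
}
gcp = Endpoint("gcp", "gcp_compute_instance", "project_id")
azure = Endpoint("arg", "az_storage_account", "resource_id")
aws = Endpoint("aws", "ec2_instance", "instance_id")

print(is_blocked(rule, gcp, azure))  # A -> B
print(is_blocked(rule, azure, gcp))  # B -> A, same rule
print(is_blocked(rule, gcp, aws))    # only one side matches under "and"
```

Under "or" logic the last case would be blocked as well, because one side matching its full pattern is enough.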

Real-World Examples

Cross-Cloud Security Isolation

# Block Google Cloud from Azure Resource Graph
- entity_pattern_a:
    data_connector: "^gcp$"        # Google Cloud Platform
    entity: ".*"                   # Any GCP entity
    field: ".*"                    # Any field
  entity_pattern_b:
    data_connector: "^arg$"        # Azure Resource Graph  
    entity: ".*"                   # Any Azure entity
    field: ".*"                    # Any field
  logic: "and"
  reason: "Complete isolation between cloud providers for security compliance"

Environment Separation

# Block test environments from production sensitive data
- entity_pattern_a:
    data_connector: "^(test|dev)_.*"
    entity: ".*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^prod_.*"
    entity: ".*"
    field: ".*(pii|ssn|credit_card|password).*"
  logic: "and"
  reason: "Prevent test/dev access to production sensitive data"

Infrastructure Boundaries

# Block legacy systems from modern cloud infrastructure
- entity_pattern_a:
    data_connector: "^legacy_.*"
    entity: "^(mainframe|cobol)_.*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^(gcp|aws|azure)_.*"
    entity: "^(kubernetes|container|serverless)_.*"
    field: ".*"
  logic: "and"
  reason: "Prevent direct legacy-to-cloud connections without proper integration layer"

Configuration Validation

Validate the blocking rules and inspect the compiled patterns before running a full pass:

# Validate your FK blacklist rules
ddn-metadata-bootstrap --validate-config

# Test specific blocking scenarios
ddn-metadata-bootstrap --test-fk-blocking

# Show compiled regex patterns
ddn-metadata-bootstrap --show-config --verbose

🤖 AI Provider Comparison

Performance & Cost Comparison

Provider | Speed | Cost | Quality | Best For
Anthropic Claude Haiku | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Development, High Volume
Anthropic Claude Sonnet | ⚡⚡ Fast | 💰💰 Medium | ⭐⭐⭐⭐⭐ Excellent | Production, Balanced
Anthropic Claude Opus | ⚡ Medium | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Critical Schemas
OpenAI GPT-3.5 Turbo | ⚡⚡⚡ Very Fast | 💰 Very Low | ⭐⭐⭐ Good | Development, Budget
OpenAI GPT-4o Mini | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Production, Cost-Optimized
OpenAI GPT-4 | ⚡⚡ Fast | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Premium Quality
Google Gemini Pro | ⚡⚡ Fast | 💰 Very Low | ⭐⭐⭐⭐ High | Large Scale, Budget
Google Gemini 1.5 Flash | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐ Good | High Throughput

Provider-Specific Configuration Examples

Anthropic Claude (Recommended)

ai_provider: "anthropic"
anthropic_model: "claude-3-haiku-20240307"  # Fast & cost-effective
# anthropic_model: "claude-3-sonnet-20240229"  # Balanced
# anthropic_model: "claude-3-opus-20240229"    # Highest quality

# Anthropic-optimized settings
field_tokens: 30
system_prompt: |
  Generate concise, business-focused field descriptions.
  Focus on practical utility and clear business meaning.

OpenAI GPT (Cost-Optimized)

ai_provider: "openai"
openai_model: "gpt-4o-mini"  # Best balance of cost and quality
# openai_model: "gpt-3.5-turbo"     # Most cost-effective
# openai_model: "gpt-4-turbo-preview"  # Highest quality

# OpenAI-optimized settings
field_tokens: 25
system_prompt: |
  You are a technical writer creating database field descriptions.
  Be concise, specific, and business-focused.

Google Gemini (High Volume)

ai_provider: "gemini"
gemini_model: "gemini-1.5-flash"  # High throughput
# gemini_model: "gemini-pro"           # Balanced
# gemini_model: "gemini-1.5-pro-latest"  # Highest quality

# Gemini-optimized settings
field_tokens: 35
system_prompt: |
  Create clear, professional descriptions for database schema fields.
  Focus on business value and practical understanding.

📝 Enhanced Examples

Multi-Provider Description Generation

Input Schema (HML)

kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  fields:
    - name: riskId
      type: String!
    - name: mfaEnabled
      type: Boolean!
    - name: ssoConfig
      type: String
    - name: iamPolicy
      type: String

Output with Different Providers

Anthropic Claude (Business-Focused)

kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Security risk evaluation and compliance status tracking for 
    organizational threat management and regulatory oversight.
  fields:
    - name: riskId
      type: String!
      description: Risk assessment identifier for tracking security evaluations.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication enablement status for security policy compliance.
    - name: ssoConfig
      type: String
      description: Single Sign-On configuration settings for identity management.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy governing user permissions.

Feature Control Examples

Descriptions Only (No Relationships)

# CLI
ddn-metadata-bootstrap --create-fk none --create-shared-keys none --create-descriptions true

# Config YAML
create_fk: none
create_shared_keys: none
create_command_relationship_hints: false
create_descriptions: true

Relationships Only (No Descriptions)

# CLI
ddn-metadata-bootstrap --create-descriptions false

# Config YAML
create_fk: all
create_shared_keys: all
create_command_relationship_hints: true
create_descriptions: false

Forward-Only Relationships (Reduced Complexity)

# Config YAML
create_fk: forward
create_shared_keys: forward
create_command_relationship_hints: true
create_descriptions: true

Rebuild Mode (Start Fresh)

# CLI
ddn-metadata-bootstrap --rebuild-relationships

# Config YAML
rebuild_relationships: true

🐍 Python API with Multi-Provider Support

from ddn_metadata_bootstrap import BootstrapperConfig, MetadataBootstrapper
from ddn_metadata_bootstrap.description_generator import DescriptionGenerator
import logging

# Configure logging to see provider selection and caching
logging.basicConfig(level=logging.INFO)

# Method 1: Load settings from a configuration file
config = BootstrapperConfig(config_file="./config.yaml")

# Method 2: Configure programmatically (an alternative to Method 1;
# this assignment replaces the config loaded above)
config = BootstrapperConfig()
config.ai_provider = "openai"
config.openai_api_key = "your-openai-key"
config.openai_model = "gpt-4o-mini"

# Feature control
config.create_descriptions = True
config.create_fk = "all"  # all|forward|none
config.create_shared_keys = "forward"  # all|forward|none
config.create_command_relationship_hints = True
config.rebuild_relationships = False

# Method 3: Direct generator creation with provider
generator = DescriptionGenerator(
    api_key="your-api-key",
    model="claude-3-haiku-20240307",
    provider="anthropic"  # or "openai", "gemini"
)

# Create bootstrapper with feature control
bootstrapper = MetadataBootstrapper(config)

# Process directory with configured features
results = bootstrapper.process_directory(
    input_dir="./app/metadata",
    output_dir="./enhanced_metadata"
)

# Check what features were processed
processing_summary = config.get_processing_summary()
print(f"Processed: {processing_summary}")

# Get provider-specific statistics
stats = bootstrapper.get_statistics()
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model Used: {stats['model_used']}")
print(f"Provider API Calls: {stats['provider_api_calls']}")
print(f"Provider Cost: ${stats['estimated_provider_cost']:.2f}")

📊 Enhanced Statistics & Monitoring

# Feature-specific performance tracking
stats = bootstrapper.get_statistics()

# Feature processing summary
print(f"Processing Summary: {config.get_processing_summary()}")
print("Features Enabled:")
print(f"  - Descriptions: {config.should_create_descriptions()}")
print(f"  - FK Relationships: {config.should_create_fk_relationships()}")
print(f"  - Shared Key Relationships: {config.should_create_shared_key_relationships()}")
print(f"  - Command Hints: {config.should_create_command_relationship_hints()}")
print(f"  - Rebuild Mode: {config.should_rebuild_relationships()}")

# AI Provider metrics
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model: {stats['model_used']}")
print(f"Provider API calls: {stats['provider_api_calls']}")
print(f"Average response time: {stats['avg_response_time_ms']}ms")
print(f"Provider cost: ${stats['estimated_provider_cost']:.3f}")

# Relationship blocking statistics
if 'relationship_stats' in stats:
    rel_stats = stats['relationship_stats']
    print(f"Relationships considered: {rel_stats['relationships_considered']}")
    print(f"Relationships blocked: {rel_stats['relationships_blocked']}")
    print(f"FK blacklist hits: {rel_stats['fk_blacklist_hits']}")
    print(f"Cross-connector blocks: {rel_stats['cross_connector_blocks']}")

🚀 Provider-Specific Performance Improvements

Real-World Performance by Provider

Anthropic Claude

Provider: Anthropic Claude Haiku
Processing Features: descriptions, FK relationships (all), shared keys (forward)
Processing 500 fields...
✅ Strengths:
- Excellent business context understanding
- Consistent quality across attempts
- Good acronym expansion integration
- Fast response times (avg 850ms)

📊 Results:
- API calls: 127 (after caching)
- Processing time: 2.1 minutes  
- Average quality score: 82
- Cost: $0.89

Configuration-Based Performance

Feature Set: Descriptions only (relationships disabled)
Provider: OpenAI GPT-4o Mini
Processing 500 fields...
✅ Results:
- API calls: 89 (descriptions only)
- Processing time: 1.2 minutes
- Average quality score: 78
- Cost: $0.31

Feature Set: Relationships only (descriptions disabled)
Provider: Local processing
Processing 500 fields...
✅ Results:
- API calls: 0 (no AI needed)
- Processing time: 0.3 minutes
- Relationships generated: 247
- Cost: $0.00

⚙️ Advanced Multi-Provider Configuration

Environment-Based Provider Selection

# Development environment - fast and cheap
export ENVIRONMENT="development"
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_CREATE_DESCRIPTIONS="true"
export METADATA_BOOTSTRAP_CREATE_FK="forward"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="none"

# Staging environment - balanced  
export ENVIRONMENT="staging"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export METADATA_BOOTSTRAP_CREATE_FK="all"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="forward"

# Production environment - comprehensive
export ENVIRONMENT="production"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-sonnet-20240229"
export METADATA_BOOTSTRAP_CREATE_FK="all"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="all"
export METADATA_BOOTSTRAP_REBUILD_RELATIONSHIPS="true"

🧪 Testing Multi-Provider Features

# Test all providers
pytest tests/test_multi_provider.py -v

# Test feature control system
pytest tests/test_feature_flags.py -v

# Test provider-specific optimizations
pytest tests/test_provider_optimization.py -v

# Test configuration validation for all providers
pytest tests/test_provider_config.py -v

# Test FK blacklist functionality
pytest tests/test_fk_blacklist.py -v

# Run performance benchmarks across providers
pytest tests/benchmark_providers.py -v --benchmark-only

🤝 Contributing

Multi-Provider Development Areas

  1. Provider Integration

    • Additional AI provider support (Claude-4, GPT-5, etc.)
    • Provider-specific optimization algorithms
    • Custom model fine-tuning support
  2. Feature Control Enhancements

    • Advanced processing pipelines
    • Conditional feature dependencies
    • Performance profiling per feature
  3. Performance Optimization

    • Provider-specific prompt engineering
    • Dynamic provider selection based on workload
    • Cost optimization strategies
  4. Quality Assessment

    • Provider-specific quality metrics
    • Cross-provider quality comparison
    • A/B testing frameworks
  5. Relationship Blocking

    • Visual rule builder for FK blacklists
    • Rule impact analysis and testing
    • Advanced pattern matching algorithms

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

🏷️ Version History

See CHANGELOG.md for complete version history and breaking changes.

⭐ Acknowledgments

  • Built for Hasura DDN
  • Powered by Anthropic Claude, OpenAI GPT, and Google Gemini
  • Linguistic analysis powered by NLTK and WordNet
  • Inspired by the GraphQL and OpenAPI communities
  • Caching algorithms inspired by database query optimization techniques
  • Relationship blocking patterns inspired by enterprise security frameworks

Made with ❤️ by the Hasura team
