AI-powered metadata enhancement for Hasura DDN schema files

DDN Metadata Bootstrap

AI-powered metadata enhancement for Hasura DDN (Data Delivery Network) schema files. The tool generates descriptions and detects relationships in your YAML/HML schema definitions. Provider selection, quality thresholds, field filtering, and relationship rules are all controlled through a central configuration.

🚀 Features

🤖 Multi-Provider AI Support

Anthropic Claude: Default provider with claude-3-haiku, claude-3-sonnet, and claude-3-opus models
OpenAI GPT: Support for gpt-3.5-turbo, gpt-4, gpt-4o-mini, and latest models
Google Gemini: Support for gemini-pro, gemini-1.5-pro, and gemini-1.5-flash models
Automatic Fallback: Graceful degradation between providers with configurable priorities
Provider-Specific Optimization: Model-specific prompting and parameter tuning
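
The fallback behaviour can be pictured as trying providers in priority order until one succeeds. The following is a minimal sketch of that idea, not the tool's implementation: `generate_with_fallback` and the two stand-in provider functions are hypothetical names introduced only for illustration.

```python
from typing import Callable, Sequence, Tuple

ProviderCall = Callable[[str], str]


def generate_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, ProviderCall]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in priority order; return (name, text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real tool would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def failing_provider(prompt: str) -> str:
    raise ConnectionError("simulated outage")


def echo_provider(prompt: str) -> str:
    return f"description for {prompt}"


name, text = generate_with_fallback(
    "riskId",
    [("anthropic", failing_provider), ("openai", echo_provider)],
)
print(name, "->", text)
```

Collecting the per-provider errors keeps the final failure message useful when every provider in the list is unavailable.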

🧠 Advanced AI Generation

Quality Assessment with Retry Logic: Multi-attempt generation with configurable scoring thresholds
Context-Aware Business Descriptions: Domain-specific system prompts with industry context
Smart Field Analysis: Automatically detects and skips self-explanatory, generic, or cryptic fields
Configurable Length Controls: Precise control over description length and token usage
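
Quality assessment with retries amounts to a generate, score, and compare loop. The sketch below assumes a threshold and attempt limit like `minimum_description_score` and `max_description_retry_attempts` from the configuration; the function names and the toy scorer are hypothetical, and returning `None` when no attempt passes is an assumption about what "rejecting" a description means.

```python
from typing import Callable, Optional


def generate_with_retry(
    generate: Callable[[int], str],
    score: Callable[[str], int],
    minimum_score: int = 70,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first description scoring at or above the threshold."""
    for attempt in range(max_attempts):
        text = generate(attempt)
        if score(text) >= minimum_score:
            return text
    return None  # assumption: leave the field undescribed rather than accept a weak result


# Toy stand-ins: the first attempt is weak, the second is acceptable.
candidates = [
    "This field represents data.",
    "Risk assessment identifier for tracking security evaluations.",
]


def toy_generate(attempt: int) -> str:
    return candidates[min(attempt, len(candidates) - 1)]


def toy_score(text: str) -> int:
    return 30 if text.lower().startswith("this field") else 85


print(generate_with_retry(toy_generate, toy_score))
```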

🧠 Intelligent Caching System

Similarity-Based Matching: Reuses descriptions for similar fields across entities (85% similarity threshold)
Performance Optimization: Reduces API calls by up to 70% on large schemas through intelligent caching
Cache Statistics: Real-time performance monitoring with hit rates and API cost savings tracking
Type-Aware Matching: Considers field types and entity context for better cache accuracy
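
To make similarity-based, type-aware matching concrete, here is a small sketch using the standard library's `difflib.SequenceMatcher`. The metric the tool actually uses is not documented here, so treat the choice of `SequenceMatcher`, the `DescriptionCache` class, and its methods as assumptions; only the 85% threshold comes from the feature list above.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


class DescriptionCache:
    """Reuse a description when a field of the same type has a similar name."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[str, str, str]] = []  # (name, type, description)
        self.hits = 0
        self.misses = 0

    def put(self, name: str, field_type: str, description: str) -> None:
        self._entries.append((name.lower(), field_type, description))

    def get(self, name: str, field_type: str) -> Optional[str]:
        best_ratio, best_description = 0.0, None
        for cached_name, cached_type, description in self._entries:
            if cached_type != field_type:  # type-aware: never match across types
                continue
            ratio = SequenceMatcher(None, name.lower(), cached_name).ratio()
            if ratio >= self.threshold and ratio > best_ratio:
                best_ratio, best_description = ratio, description
        if best_description is None:
            self.misses += 1
        else:
            self.hits += 1
        return best_description


cache = DescriptionCache()
cache.put("customer_email", "String", "Primary contact email address for the customer.")
print(cache.get("customer_emails", "String"))  # similar name, same type
print(cache.get("customer_email", "Int"))      # same name, different type
print(cache.get("order_total", "String"))      # unrelated name
```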

🔍 WordNet-Based Linguistic Analysis

Generic Term Detection: Uses NLTK and WordNet for sophisticated term analysis to skip meaningless fields
Semantic Density Analysis: Evaluates conceptual richness and specificity of field names
Definition Quality Scoring: Ensures meaningful, non-circular descriptions through linguistic validation
Abstraction Level Calculation: Determines appropriate description depth based on semantic analysis

📝 Enhanced Acronym Expansion

Comprehensive Mappings: 200+ pre-configured acronyms for technology, finance, and business domains
Context-Aware Expansion: Industry-specific acronym interpretation based on domain context
Pre-Generation Enhancement: Expands acronyms BEFORE AI generation for better context
Custom Domain Support: Fully configurable acronym mappings via YAML configuration
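
Pre-generation expansion can be sketched as splitting a field name into tokens and replacing any token found in the acronym map. The helper names below are hypothetical and the three mappings are a small subset copied from the sample `acronym_mappings`; how the tool tokenizes names internally is an assumption.

```python
import re
from typing import Dict, List

ACRONYMS: Dict[str, str] = {
    "mfa": "Multi-Factor Authentication",
    "sso": "Single Sign-On",
    "iam": "Identity and Access Management",
}


def split_field_name(name: str) -> List[str]:
    """Split snake_case and camelCase names into word tokens."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [part for part in re.split(r"[\s_]+", spaced) if part]


def expand_acronyms(name: str, mappings: Dict[str, str] = ACRONYMS) -> str:
    """Build a readable phrase to hand to the model instead of the raw name."""
    words = [mappings.get(tok.lower(), tok.capitalize()) for tok in split_field_name(name)]
    return " ".join(words)


for field in ("mfaEnabled", "sso_config", "iamPolicy"):
    print(field, "->", expand_acronyms(field))
```

Expanding before the prompt is built means the model sees "Multi-Factor Authentication Enabled" rather than having to guess what `mfa` stands for.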

🔗 Advanced Relationship Detection

Template-Based FK Detection: Sophisticated foreign key detection with confidence scoring and semantic validation
Shared Business Key Relationships: Many-to-many relationships via shared field analysis with FK-aware filtering
Cross-Subgraph Intelligence: Smart entity matching across different subgraphs
Configurable Templates: Flexible FK template patterns with placeholders for complex naming conventions
Advanced Relationship Blocking: Precision rule-based system to prevent inappropriate cross-connector relationships
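
One way to read an FK template such as `{gi}|{pt}_{gi}` is: if the primary table's key matches the left side, look for foreign keys named like the right side. The sketch below follows that reading. The function name, the default generic ids and prefix modifiers, the singular table name passed in by the caller, and the empty value for `{ps}` are all assumptions made for illustration.

```python
from typing import Iterable, Set


def expected_fk_names(
    template: str,
    primary_table: str,
    pk_field: str,
    generic_ids: Iterable[str] = ("id", "key", "name"),
    prefix_modifiers: Iterable[str] = ("active",),
) -> Set[str]:
    """Expand a 'pk_pattern|fk_pattern' template into candidate FK field names."""
    pk_pattern, fk_pattern = template.split("|", 1)
    names: Set[str] = set()
    for gi in generic_ids:
        for pm in prefix_modifiers:
            values = {"gi": gi, "pt": primary_table, "pm": pm, "ps": ""}
            # Only emit FK candidates when the PK side of the template fits.
            if pk_pattern.format(**values) == pk_field:
                names.add(fk_pattern.format(**values))
    return names


print(expected_fk_names("{gi}|{pt}_{gi}", "user", "id"))
print(expected_fk_names("{gi}|{pm}_{pt}_{gi}", "service", "name"))
```

A detector built this way would then scan other entities for fields whose names appear in the returned set and attach a confidence score to each match.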

⚙️ Comprehensive Configuration System

YAML-First Configuration: Central config.yaml file for all settings with full documentation
Waterfall Precedence: CLI args > Environment variables > config.yaml > defaults
Configuration Validation: Comprehensive validation with helpful error messages and source tracking
Feature Toggles: Granular control over processing features (descriptions vs relationships)
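
The waterfall precedence can be illustrated with a small resolver that checks each source in order and records where the value came from. This is a sketch rather than the tool's code: `resolve` and `DEFAULTS` are hypothetical, and the `METADATA_BOOTSTRAP_` prefix follows the environment variable names used in the Quick Start.

```python
from typing import Any, Dict, Mapping, Tuple

DEFAULTS: Dict[str, Any] = {"ai_provider": "anthropic", "field_tokens": 25}


def resolve(
    key: str,
    cli_args: Mapping[str, Any],
    env: Mapping[str, str],
    file_config: Mapping[str, Any],
    defaults: Mapping[str, Any] = DEFAULTS,
    env_prefix: str = "METADATA_BOOTSTRAP_",
) -> Tuple[Any, str]:
    """Return (value, source) using CLI > environment > config.yaml > defaults."""
    if cli_args.get(key) is not None:
        return cli_args[key], "cli"
    env_name = env_prefix + key.upper()
    if env_name in env:
        return env[env_name], f"env:{env_name}"  # note: env values arrive as strings
    if key in file_config:
        return file_config[key], "config.yaml"
    return defaults[key], "defaults"


env = {"METADATA_BOOTSTRAP_AI_PROVIDER": "openai"}
file_config = {"ai_provider": "gemini", "field_tokens": 30}

print(resolve("ai_provider", {}, env, file_config))
print(resolve("ai_provider", {"ai_provider": "anthropic"}, env, file_config))
print(resolve("field_tokens", {}, env, file_config))
```

Pass `os.environ` as `env` to resolve against the real process environment.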

🎯 Advanced Quality Controls

Buzzword Detection: Avoids corporate jargon and meaningless generic terms
Pattern-Based Filtering: Regex-based rejection of poor description formats
Technical Language Translation: Converts technical terms to business-friendly language
Length Optimization: Multiple validation layers with hard limits and target lengths
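
These checks can be thought of as a chain of cheap filters applied to each generated description. A minimal sketch follows, assuming a hard length limit, a buzzword set, and a list of regex patterns similar to the `buzzwords` and `forbidden_patterns` settings; the function name and the order of the checks are assumptions, and only a subset of the sample lists is used.

```python
import re
from typing import Optional

BUZZWORDS = {"synergy", "leverage", "paradigm", "ecosystem"}
FORBIDDEN = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"this\s+field\s+represents",
        r"used\s+to\s+(track|manage|identify)",
    )
]


def rejection_reason(description: str, max_length: int = 120) -> Optional[str]:
    """Return why a description should be rejected, or None if it passes."""
    if len(description) > max_length:
        return "too long"
    words = set(re.findall(r"[a-z]+", description.lower()))
    hits = words & BUZZWORDS
    if hits:
        return f"buzzword: {sorted(hits)[0]}"
    for pattern in FORBIDDEN:
        if pattern.search(description):
            return f"forbidden pattern: {pattern.pattern}"
    return None


for text in (
    "This field represents the user.",
    "Leverage synergy across teams.",
    "Risk assessment identifier for tracking security evaluations.",
):
    print(repr(text), "->", rejection_reason(text))
```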

🔍 Intelligent Field Selection

Generic Field Detection: Skips overly common fields that don't benefit from descriptions
Cryptic Abbreviation Handling: Configurable handling of unclear field names with vowel analysis
Self-Explanatory Pattern Recognition: Automatically identifies fields that don't need descriptions
Value Assessment: Only generates descriptions that add meaningful business value
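
Vowel analysis is a simple heuristic: abbreviations tend to drop vowels, so a very low vowel ratio suggests a name the model cannot describe reliably. The sketch below combines that with a length cutoff like `max_cryptic_field_length`. The function name and the `min_vowel_ratio` value are assumptions; the tool's actual thresholds are not documented here.

```python
VOWELS = set("aeiou")


def is_cryptic(name: str, max_length: int = 4, min_vowel_ratio: float = 0.2) -> bool:
    """Flag names that are very short or contain almost no vowels."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters or len(letters) <= max_length:
        return True
    vowel_ratio = sum(ch in VOWELS for ch in letters) / len(letters)
    return vowel_ratio < min_vowel_ratio


for field in ("xfr", "cstmr_nbr", "customer_email"):
    print(field, "->", is_cryptic(field))
```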

📦 Installation

From PyPI (Recommended)

pip install ddn-metadata-bootstrap

Provider-Specific Dependencies

The tool supports multiple AI providers. Install the dependencies for your chosen provider:

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern

# For Anthropic Claude (default)
pip install "ddn-metadata-bootstrap[anthropic]"
# or separately:
pip install anthropic

# For OpenAI GPT
pip install "ddn-metadata-bootstrap[openai]"
# or separately:
pip install openai

# For Google Gemini
pip install "ddn-metadata-bootstrap[gemini]"
# or separately:
pip install google-generativeai

# Install all providers
pip install "ddn-metadata-bootstrap[all]"

From Source

git clone https://github.com/hasura/ddn-metadata-bootstrap.git
cd ddn-metadata-bootstrap
pip install -e .

🏃 Quick Start

1. Choose Your AI Provider

Option A: Anthropic Claude (Default - Recommended)

export ANTHROPIC_API_KEY="your-anthropic-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"  # Optional (default)
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-haiku-20240307"  # Optional

Option B: OpenAI GPT

export OPENAI_API_KEY="your-openai-api-key"  
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_OPENAI_MODEL="gpt-3.5-turbo"  # Optional

Option C: Google Gemini

export GEMINI_API_KEY="your-gemini-api-key"
# or alternatively:
export GOOGLE_API_KEY="your-gemini-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="gemini"
export METADATA_BOOTSTRAP_GEMINI_MODEL="gemini-pro"  # Optional

2. Set up your directories

export METADATA_BOOTSTRAP_INPUT_DIR="./app/metadata"
export METADATA_BOOTSTRAP_OUTPUT_DIR="./enhanced_metadata"

3. Create a configuration file (Recommended)

Create a config.yaml file in your project directory:

# config.yaml - DDN Metadata Bootstrap Configuration

# =============================================================================
# AI PROVIDER CONFIGURATION
# =============================================================================
ai_provider: "anthropic"  # Choose: anthropic, openai, gemini

# Provider-specific API keys (alternatively set via environment variables)
# anthropic_api_key: "your-anthropic-key"
# openai_api_key: "your-openai-key" 
# gemini_api_key: "your-gemini-key"

# Provider-specific models
anthropic_model: "claude-3-haiku-20240307"  # claude-3-sonnet-20240229, claude-3-opus-20240229
openai_model: "gpt-3.5-turbo"               # gpt-4, gpt-4o-mini, gpt-4-turbo-preview
gemini_model: "gemini-pro"                  # gemini-1.5-pro-latest, gemini-1.5-flash

# =============================================================================
# FEATURE CONTROL
# =============================================================================
relationships_only: false          # Set to true to only generate relationships, skip descriptions
enable_quality_assessment: true    # Enable AI quality scoring and retry logic

# =============================================================================
# AI GENERATION SETTINGS
# =============================================================================
# Domain-specific system prompt for your organization
system_prompt: |
  You generate concise field descriptions for database schema metadata at a global financial services firm.
  
  DOMAIN CONTEXT:
  - Organization: Global bank
  - Department: Cybersecurity operations  
  - Use case: Risk management and security compliance
  - Regulatory environment: Financial services (SOX, Basel III, GDPR, etc.)
  
  Think: "What would a cybersecurity analyst at a bank need to know about this field?"

# Token and length limits
field_tokens: 25                    # Max tokens AI can generate for field descriptions
kind_tokens: 50                     # Max tokens AI can generate for kind descriptions
field_desc_max_length: 120          # Maximum total characters for field descriptions
kind_desc_max_length: 250           # Maximum total characters for entity descriptions

# Quality thresholds
minimum_description_score: 70       # Minimum score (0-100) to accept a description
max_description_retry_attempts: 3   # How many times to retry for better quality

# =============================================================================
# ENHANCED ACRONYM EXPANSION
# =============================================================================
acronym_mappings:
  # Technology & Computing
  api: "Application Programming Interface"
  ui: "User Interface"
  db: "Database"
  
  # Security & Access Management
  mfa: "Multi-Factor Authentication"
  sso: "Single Sign-On"
  iam: "Identity and Access Management"
  siem: "Security Information and Event Management"
  
  # Financial Services & Compliance
  pci: "Payment Card Industry"
  sox: "Sarbanes-Oxley Act"
  kyc: "Know-Your-Customer"
  aml: "Anti-Money Laundering"
  # ... 200+ total mappings available

# =============================================================================
# INTELLIGENT FIELD SELECTION
# =============================================================================
# Fields to skip entirely - these will not get descriptions at all
skip_field_patterns:
  - "^id$"
  - "^_id$"
  - "^uuid$"
  - "^created_at$"
  - "^updated_at$"
  - "^debug_.*"
  - "^test_.*"
  - "^temp_.*"

# Generic fields - won't get unique descriptions (too common)
generic_fields:
  - "id"
  - "key"
  - "uid"
  - "guid"
  - "name"

# Self-explanatory fields - simple patterns that don't need descriptions
self_explanatory_patterns:
  - '^id$'
  - '^_id$'
  - '^guid$'
  - '^uuid$'
  - '^key$'

# Cryptic Field Handling
skip_cryptic_abbreviations: true   # Skip fields with unclear abbreviations
skip_ultra_short_fields: true      # Skip very short field names that are likely abbreviations
max_cryptic_field_length: 4        # Field names this length or shorter are considered cryptic

# Content quality controls
buzzwords: [
  'synergy', 'leverage', 'paradigm', 'ecosystem',
  'contains', 'stores', 'holds', 'represents'
]

forbidden_patterns: [
  'this\s+field\s+represents',    # single-quoted YAML keeps backslashes literal, so use one
  'used\s+to\s+(track|manage|identify)',
  'business.*information'
]

# =============================================================================
# RELATIONSHIP DETECTION
# =============================================================================
# FK Template Patterns for relationship detection
# Format: "{pk_pattern}|{fk_pattern}"
# Placeholders: {gi}=generic_id, {pt}=primary_table, {ps}=primary_subgraph, {pm}=prefix_modifier
fk_templates:
  - "{gi}|{pm}_{pt}_{gi}"           # active_service_name → Services.name
  - "{gi}|{pt}_{gi}"                # user_id → Users.id
  - "{pt}_{gi}|{pm}_{pt}_{gi}"      # user_id → ActiveUsers.active_user_id

# =============================================================================
# ADVANCED RELATIONSHIP BLOCKING
# =============================================================================
# Precision rule-based system to prevent inappropriate relationships
# Uses bidirectional validation with data_connector + entity + field pattern matching
fk_key_blacklist:
  # Block cross-cloud provider connections with infrastructure fields
  - entity_pattern_a:
      data_connector: "^(gcp|arg|various)$"  # Google Cloud Platform, Azure Resource Graph, Various
      entity: "^(gcp_|google_).*"            # Google/GCP entities
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^(gcp|arg|various)$"  
      entity: "^(az_|azure_).*"              # Azure entities
      field: ".*(resource|project|policy|storage|compute).*"  # Infrastructure fields only
    logic: "and"
    reason: "Block google/gcp entities from connecting to azure entities with infrastructure-related fields"
  
  # Complete isolation between major cloud platforms
  - entity_pattern_a:
      data_connector: "^gcp$"                # Google Cloud Platform connector
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^arg$"                # Azure Resource Graph connector  
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    logic: "and"
    reason: "Block all connections between Google Cloud Platform and Azure Resource Graph connectors"
  
  # Block legacy systems from modern cloud entities
  - entity_pattern_a:
      data_connector: "^legacy_.*"           # Any legacy connector
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^(gcp|azure|aws)_.*"  # Modern cloud connectors
      entity: "^(google_|azure_|aws_).*"     # Modern cloud entities
      field: ".*"                            # Any field
    logic: "and"
    reason: "Prevent legacy systems from directly connecting to modern cloud infrastructure"
  
  # Block test/dev environments from production data
  - entity_pattern_a:
      data_connector: "^(test|dev)_.*"       # Test/dev connectors
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^prod_.*"             # Production connectors
      entity: ".*"                           # Any entity  
      field: ".*(sensitive|private|confidential|secret).*"  # Sensitive fields
    logic: "and"
    reason: "Prevent test/dev environments from accessing production sensitive data"

# Shared relationship limits
max_shared_relationships: 10000
max_shared_per_entity: 10
min_shared_confidence: 30

4. Run the tool with your chosen provider

# Use default provider (Anthropic)
ddn-metadata-bootstrap

# Use OpenAI explicitly
ddn-metadata-bootstrap --ai-provider openai --openai-api-key your-key

# Use Gemini with specific model
ddn-metadata-bootstrap --ai-provider gemini --gemini-model gemini-1.5-pro

# Show configuration including AI provider setup
ddn-metadata-bootstrap --show-config

# Test your AI provider connection
ddn-metadata-bootstrap --test-provider

# Process only relationships (skip descriptions)
ddn-metadata-bootstrap --relationships-only

# Use custom configuration file
ddn-metadata-bootstrap --config custom-config.yaml

# Enable verbose logging to see AI provider selection and caching
ddn-metadata-bootstrap --verbose

🔗 Advanced Relationship Blocking System

The tool includes a bidirectional relationship blocking system that prevents inappropriate foreign key relationships from being generated. This matters most in enterprise environments with multiple data connectors, cloud providers, and security boundaries.

Key Features

Precision Pattern Matching

Each blocking rule uses three-part patterns for maximum precision:

Data Connector: Regex pattern matching the connector name (e.g., ^gcp$, ^(test|dev)_.*)
Entity Name: Regex pattern matching the entity/table name (e.g., ^google_.*, ^azure_storage.*)
Field Name: Regex pattern matching the field name (e.g., .*resource.*, .*secret.*)

Bidirectional Validation

Rules automatically check both directions of a relationship:

Pattern A → Pattern B: google_compute → azure_storage_resource
Pattern B → Pattern A: azure_vm → google_analytics_data

Both directions are blocked by a single rule definition.

Flexible Logic Operators

AND Logic: All patterns (connector AND entity AND field) must match for both sides
OR Logic: Either side matching its full pattern triggers the block

Real-World Examples

Cross-Cloud Security Isolation

# Block Google Cloud from Azure Resource Graph
- entity_pattern_a:
    data_connector: "^gcp$"        # Google Cloud Platform
    entity: ".*"                   # Any GCP entity
    field: ".*"                    # Any field
  entity_pattern_b:
    data_connector: "^arg$"        # Azure Resource Graph  
    entity: ".*"                   # Any Azure entity
    field: ".*"                    # Any field
  logic: "and"
  reason: "Complete isolation between cloud providers for security compliance"

Environment Separation

# Block test environments from production sensitive data
- entity_pattern_a:
    data_connector: "^(test|dev)_.*"
    entity: ".*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^prod_.*"
    entity: ".*"
    field: ".*(pii|ssn|credit_card|password).*"
  logic: "and"
  reason: "Prevent test/dev access to production sensitive data"

Infrastructure Boundaries

# Block legacy systems from modern cloud infrastructure
- entity_pattern_a:
    data_connector: "^legacy_.*"
    entity: "^(mainframe|cobol)_.*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^(gcp|aws|azure)_.*"
    entity: "^(kubernetes|container|serverless)_.*"
    field: ".*"
  logic: "and"
  reason: "Prevent direct legacy-to-cloud connections without proper integration layer"

Validation Process

When the tool evaluates a potential relationship like gcp_users.user_id → azure_storage.user_reference:

Rule Matching: Check if both entities match any blocking rule patterns
Bidirectional Check: Test both A→B and B→A directions
Pattern Validation: Verify connector + entity + field patterns all match
Logic Application: Apply AND/OR logic to determine if rule triggers
Blocking Decision: Block relationship if any rule matches
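
The steps above can be sketched with the standard `re` module. This is an illustration of the described behaviour, not the tool's code: the `Endpoint`, `SidePattern`, and `BlockRule` names are hypothetical, and the use of `re.search` with case-sensitive matching is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    data_connector: str
    entity: str
    field: str


class SidePattern:
    """Three-part pattern; regexes are compiled once when the rule is built."""

    def __init__(self, data_connector: str, entity: str, field: str) -> None:
        self.connector = re.compile(data_connector)
        self.entity = re.compile(entity)
        self.field = re.compile(field)

    def matches(self, ep: Endpoint) -> bool:
        # Connector is checked first so non-matching rules exit early.
        return bool(
            self.connector.search(ep.data_connector)
            and self.entity.search(ep.entity)
            and self.field.search(ep.field)
        )


class BlockRule:
    def __init__(self, a: SidePattern, b: SidePattern, logic: str = "and", reason: str = "") -> None:
        self.a, self.b, self.logic, self.reason = a, b, logic, reason

    def blocks(self, source: Endpoint, target: Endpoint) -> bool:
        # Bidirectional: test A->B and then B->A with the same rule.
        for x, y in ((source, target), (target, source)):
            a_hit, b_hit = self.a.matches(x), self.b.matches(y)
            triggered = (a_hit and b_hit) if self.logic == "and" else (a_hit or b_hit)
            if triggered:
                return True
        return False


rule = BlockRule(
    SidePattern("^gcp$", ".*", ".*"),
    SidePattern("^arg$", ".*", ".*"),
    logic="and",
    reason="Block all connections between GCP and Azure Resource Graph connectors",
)
gcp_user = Endpoint("gcp", "gcp_users", "user_id")
azure_ref = Endpoint("arg", "azure_storage", "user_reference")

print(rule.blocks(gcp_user, azure_ref))  # A -> B
print(rule.blocks(azure_ref, gcp_user))  # B -> A
print(rule.blocks(gcp_user, gcp_user))   # same connector on both sides
```

A relationship would be dropped as soon as any rule in the list returns true from `blocks`.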

Configuration Validation

The system includes comprehensive validation:

# Validate your FK blacklist rules
ddn-metadata-bootstrap --validate-config

# Test specific blocking scenarios
ddn-metadata-bootstrap --test-fk-blocking

# Show compiled regex patterns
ddn-metadata-bootstrap --show-config --verbose

Performance Impact

The blocking system is highly optimized:

Regex Compilation: Patterns compiled once at startup
Hash-Based Lookups: Connector names normalized for fast matching
Early Exit: Rules skip quickly when connector patterns don't match
Minimal Overhead: Adds <1ms per relationship evaluation

Enterprise Security Benefits

Compliance Support

SOX Compliance: Prevent cross-environment data access
GDPR Compliance: Block PII data from reaching test systems
Industry Regulations: Enforce data residency and access controls

Cloud Security

Multi-Cloud Isolation: Prevent accidental cross-cloud data references
Vendor Lock-in Prevention: Maintain clear cloud provider boundaries
Zero Trust Architecture: Explicit relationship approval required

Operational Safety

Environment Isolation: Prevent production data leaks to development
Legacy System Protection: Control access to deprecated systems
Change Management: Explicit approval required for cross-boundary relationships

🤖 AI Provider Comparison

Performance & Cost Comparison

Provider	Speed	Cost	Quality	Best For
Anthropic Claude Haiku	⚡⚡⚡ Very Fast	💰 Low	⭐⭐⭐⭐ High	Development, High Volume
Anthropic Claude Sonnet	⚡⚡ Fast	💰💰 Medium	⭐⭐⭐⭐⭐ Excellent	Production, Balanced
Anthropic Claude Opus	⚡ Medium	💰💰💰 High	⭐⭐⭐⭐⭐ Excellent	Critical Schemas
OpenAI GPT-3.5 Turbo	⚡⚡⚡ Very Fast	💰 Very Low	⭐⭐⭐ Good	Development, Budget
OpenAI GPT-4o Mini	⚡⚡⚡ Very Fast	💰 Low	⭐⭐⭐⭐ High	Production, Cost-Optimized
OpenAI GPT-4	⚡⚡ Fast	💰💰💰 High	⭐⭐⭐⭐⭐ Excellent	Premium Quality
Google Gemini Pro	⚡⚡ Fast	💰 Very Low	⭐⭐⭐⭐ High	Large Scale, Budget
Google Gemini 1.5 Flash	⚡⚡⚡ Very Fast	💰 Low	⭐⭐⭐ Good	High Throughput

Provider-Specific Configuration Examples

Anthropic Claude (Recommended)

ai_provider: "anthropic"
anthropic_model: "claude-3-haiku-20240307"  # Fast & cost-effective
# anthropic_model: "claude-3-sonnet-20240229"  # Balanced
# anthropic_model: "claude-3-opus-20240229"    # Highest quality

# Anthropic-optimized settings
field_tokens: 30
system_prompt: |
  Generate concise, business-focused field descriptions.
  Focus on practical utility and clear business meaning.

OpenAI GPT (Cost-Optimized)

ai_provider: "openai"
openai_model: "gpt-4o-mini"  # Best balance of cost and quality
# openai_model: "gpt-3.5-turbo"     # Most cost-effective
# openai_model: "gpt-4-turbo-preview"  # Highest quality

# OpenAI-optimized settings
field_tokens: 25
system_prompt: |
  You are a technical writer creating database field descriptions.
  Be concise, specific, and business-focused.

Google Gemini (High Volume)

ai_provider: "gemini"
gemini_model: "gemini-1.5-flash"  # High throughput
# gemini_model: "gemini-pro"           # Balanced
# gemini_model: "gemini-1.5-pro-latest"  # Highest quality

# Gemini-optimized settings
field_tokens: 35
system_prompt: |
  Create clear, professional descriptions for database schema fields.
  Focus on business value and practical understanding.

📝 Enhanced Examples

Multi-Provider Description Generation

Input Schema (HML)

kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  fields:
    - name: riskId
      type: String!
    - name: mfaEnabled
      type: Boolean!
    - name: ssoConfig
      type: String
    - name: iamPolicy
      type: String

Output with Different Providers

Anthropic Claude (Business-Focused)

kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Security risk evaluation and compliance status tracking for 
    organizational threat management and regulatory oversight.
  fields:
    - name: riskId
      type: String!
      description: Risk assessment identifier for tracking security evaluations.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication enablement status for security policy compliance.
    - name: ssoConfig
      type: String
      description: Single Sign-On configuration settings for identity management.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy governing user permissions.

OpenAI GPT (Technical-Focused)

kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Cybersecurity threat assessment data structure containing risk metrics
    and security configuration parameters for compliance monitoring.
  fields:
    - name: riskId
      type: String!
      description: Unique identifier for security risk assessment records.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication activation flag for access control.
    - name: ssoConfig
      type: String
      description: Single Sign-On system configuration parameters.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy specification document.

Google Gemini (Comprehensive)

kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Comprehensive security threat assessment record containing risk analysis,
    authentication configurations, and access management policies for enterprise security.
  fields:
    - name: riskId
      type: String!
      description: Risk assessment record identifier for security threat tracking.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication status indicator for enhanced security protocols.
    - name: ssoConfig
      type: String
      description: Single Sign-On integration configuration for unified authentication.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy definition for authorization control.

Provider Fallback and Testing

# Test provider connectivity
ddn-metadata-bootstrap --test-provider
# Output:
# 🧪 Testing ANTHROPIC provider connection...
# ✅ ANTHROPIC connection successful
#    Model: claude-3-haiku-20240307
#    Response: Hello

# Test specific provider
ddn-metadata-bootstrap --ai-provider openai --test-provider
# Output:
# 🧪 Testing OPENAI provider connection...
# ✅ OPENAI connection successful
#    Model: gpt-3.5-turbo
#    Response: Hello

# Show detailed provider configuration
ddn-metadata-bootstrap --show-config
# Output:
# 📋 Configuration Sources:
# ai_provider                    = anthropic              [defaults]
# anthropic_api_key              = ***masked***           [env:ANTHROPIC_API_KEY]
# anthropic_model                = claude-3-haiku-20240307 [defaults]
# 
# 🤖 AI Provider Configuration:
#    Provider: anthropic
#    Model: claude-3-haiku-20240307
#    API Key: ***configured***

Performance with Caching Across Providers

# Provider performance comparison with caching
🔄 Processing with ANTHROPIC (claude-3-haiku-20240307)...
Processing 500 fields across 50 entities...
Cache hits: 298 (70.1% hit rate)
API calls made: 127
Processing time: 2.1 minutes
Provider cost: $0.89

🔄 Processing with OPENAI (gpt-4o-mini)...
Processing 500 fields across 50 entities...
Cache hits: 298 (70.1% hit rate)  # Same cache used!
API calls made: 127
Processing time: 1.8 minutes
Provider cost: $0.52

🔄 Processing with GEMINI (gemini-1.5-flash)...
Processing 500 fields across 50 entities...
Cache hits: 298 (70.1% hit rate)  # Same cache used!
API calls made: 127
Processing time: 2.3 minutes
Provider cost: $0.31
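
The 70.1% figure in the sample output is consistent with a hit rate measured over cache lookups rather than over all 500 fields. The arithmetic below assumes that every cache miss leads to exactly one API call and that the remaining fields were skipped by field selection; both are inferences from the numbers, not something the output states.

```python
total_fields = 500
cache_hits = 298
api_calls = 127

lookups = cache_hits + api_calls   # fields that reached the cache at all
hit_rate = cache_hits / lookups    # share of lookups answered from the cache
skipped = total_fields - lookups   # fields assumed to be filtered out earlier

print(f"lookups={lookups}, skipped={skipped}, hit rate={hit_rate:.1%}")
```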

⚙️ Advanced Multi-Provider Configuration

Provider-Specific Optimization

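# The three profiles below are alternatives; keep only one in a given config.yaml,
# since repeating the same keys in a single YAML file is ambiguous.

# Development configuration - prioritize speed and cost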
ai_provider: "openai"
openai_model: "gpt-4o-mini"
field_tokens: 20
minimum_description_score: 60
enable_quality_assessment: false

# Production configuration - prioritize quality
ai_provider: "anthropic"  
anthropic_model: "claude-3-sonnet-20240229"
field_tokens: 35
minimum_description_score: 80
max_description_retry_attempts: 5

# High-volume configuration - prioritize throughput
ai_provider: "gemini"
gemini_model: "gemini-1.5-flash"
field_tokens: 25
minimum_description_score: 65
enable_quality_assessment: true

Environment-Based Provider Selection

# Development environment
export ENVIRONMENT="development"
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export OPENAI_API_KEY="your-dev-key"

# Staging environment  
export ENVIRONMENT="staging"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export ANTHROPIC_API_KEY="your-staging-key"

# Production environment
export ENVIRONMENT="production"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export ANTHROPIC_API_KEY="your-prod-key"
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-sonnet-20240229"

🐍 Python API with Multi-Provider Support

from ddn_metadata_bootstrap import BootstrapperConfig, MetadataBootstrapper
from ddn_metadata_bootstrap.description_generator import DescriptionGenerator
import logging

# Configure logging to see provider selection and caching
logging.basicConfig(level=logging.INFO)

# Method 1: Use configuration file
config = BootstrapperConfig(config_file="./config.yaml")

# Method 2: Programmatic provider selection
config = BootstrapperConfig()
config.ai_provider = "openai"
config.openai_api_key = "your-openai-key"
config.openai_model = "gpt-4o-mini"

# Method 3: Direct generator creation with provider
generator = DescriptionGenerator(
    api_key="your-api-key",
    model="claude-3-haiku-20240307",
    provider="anthropic"  # or "openai", "gemini"
)

# Create bootstrapper with multi-provider support
bootstrapper = MetadataBootstrapper(config)

# Process directory with provider-optimized settings
results = bootstrapper.process_directory(
    input_dir="./app/metadata",
    output_dir="./enhanced_metadata"
)

# Get provider-specific statistics
stats = bootstrapper.get_statistics()
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model Used: {stats['model_used']}")
print(f"Provider API Calls: {stats['provider_api_calls']}")
print(f"Provider Cost: ${stats['estimated_provider_cost']:.2f}")

# Switch providers dynamically
for provider in ['anthropic', 'openai', 'gemini']:
    try:
        test_generator = DescriptionGenerator(
            api_key=f"your-{provider}-key",
            provider=provider
        )
        print(f"✅ {provider.upper()} available")
    except ImportError as e:
        print(f"❌ {provider.upper()} unavailable: {e}")

📊 Enhanced Statistics & Monitoring

# Provider-specific performance tracking
stats = bootstrapper.get_statistics()

# AI Provider metrics
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model: {stats['model_used']}")
print(f"Provider API calls: {stats['provider_api_calls']}")
print(f"Average response time: {stats['avg_response_time_ms']}ms")
print(f"Provider cost: ${stats['estimated_provider_cost']:.3f}")

# Quality comparison across providers
print(f"Average quality score: {stats['average_quality_score']}")
print(f"Quality retries: {stats['quality_retries']}")
print(f"Provider-specific quality: {stats['provider_quality_metrics']}")

# Cross-provider caching efficiency
if 'cache_stats' in stats:
    cache_stats = stats['cache_stats']
    print(f"Cache hit rate: {cache_stats['hit_rate']:.1%}")
    print(f"Cross-provider cache reuse: {cache_stats['cross_provider_reuse']}")
    print(f"Provider switching savings: ${cache_stats['switching_savings']:.2f}")

# Relationship blocking statistics
if 'relationship_stats' in stats:
    rel_stats = stats['relationship_stats']
    print(f"Relationships considered: {rel_stats['relationships_considered']}")
    print(f"Relationships blocked: {rel_stats['relationships_blocked']}")
    print(f"FK blacklist hits: {rel_stats['fk_blacklist_hits']}")
    print(f"Cross-connector blocks: {rel_stats['cross_connector_blocks']}")

🚀 Provider-Specific Performance Improvements

Real-World Performance by Provider

Anthropic Claude

Provider: Anthropic Claude Haiku
Processing 500 fields...
✅ Strengths:
- Excellent business context understanding
- Consistent quality across attempts
- Good acronym expansion integration
- Fast response times (avg 850ms)

📊 Results:
- API calls: 127 (after caching)
- Processing time: 2.1 minutes  
- Average quality score: 82
- Cost: $0.89

OpenAI GPT

Provider: OpenAI GPT-4o Mini
Processing 500 fields...
✅ Strengths:
- Very fast response times (avg 650ms)
- Excellent technical accuracy
- Cost-effective for high volume
- Good structured output

📊 Results:
- API calls: 127 (after caching)
- Processing time: 1.8 minutes
- Average quality score: 78
- Cost: $0.52

Google Gemini

Provider: Google Gemini 1.5 Flash
Processing 500 fields...
✅ Strengths:
- Lowest cost per operation
- Good multilingual support
- Generous rate limits
- Comprehensive descriptions

📊 Results:
- API calls: 127 (after caching)
- Processing time: 2.3 minutes
- Average quality score: 76
- Cost: $0.31

🧪 Testing Multi-Provider Features

# Test all providers
pytest tests/test_multi_provider.py -v

# Test provider switching
pytest tests/test_provider_switching.py -v

# Test provider-specific optimizations
pytest tests/test_provider_optimization.py -v

# Test configuration validation for all providers
pytest tests/test_provider_config.py -v

# Test FK blacklist functionality
pytest tests/test_fk_blacklist.py -v

# Run performance benchmarks across providers
pytest tests/benchmark_providers.py -v --benchmark-only

🤝 Contributing

Multi-Provider Development Areas

Provider Integration
- Additional AI provider support (Claude-4, GPT-5, etc.)
- Provider-specific optimization algorithms
- Custom model fine-tuning support
Performance Optimization
- Provider-specific prompt engineering
- Dynamic provider selection based on workload
- Cost optimization strategies
Quality Assessment
- Provider-specific quality metrics
- Cross-provider quality comparison
- A/B testing frameworks
Caching Enhancements
- Provider-aware cache invalidation
- Cross-provider description comparison
- Quality-based cache prioritization
Relationship Blocking
- Visual rule builder for FK blacklists
- Rule impact analysis and testing
- Advanced pattern matching algorithms

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🏷️ Version History

See CHANGELOG.md for complete version history and breaking changes.

⭐ Acknowledgments

Built for Hasura DDN
Powered by Anthropic Claude, OpenAI GPT, and Google Gemini
Linguistic analysis powered by NLTK and WordNet
Inspired by the GraphQL and OpenAPI communities
Caching algorithms inspired by database query optimization techniques
Relationship blocking patterns inspired by enterprise security frameworks

Made with ❤️ by the Hasura team
