netrun-llm

Multi-provider LLM orchestration with automatic fallback chains and three-tier cognition system.

Features

  • Multi-Adapter Fallback Chains: Automatic failover between LLM providers (Claude -> GPT-4 -> Llama3)
  • Three-Tier Cognition: Fast ack (<100ms), RAG response (<2s), Deep insight (<5s)
  • Circuit Breaker Protection: Per-adapter circuit breakers prevent cascade failures
  • Cost Tracking: Automatic cost estimation and tracking across all providers
  • Async-First: Full async support with sync wrappers for compatibility
  • Project-Agnostic: No Wilbur-specific dependencies, works in any Python project
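
The "sync wrappers" item above can be pictured as a thin layer over an async method. The following is a minimal sketch of that pattern, not the package's actual implementation; `EchoAdapter` and `execute_async` are hypothetical names used only for illustration.

```python
import asyncio


class EchoAdapter:
    """Hypothetical adapter used only for illustration."""

    async def execute_async(self, prompt: str) -> str:
        # Stand-in for an awaitable network call to a provider.
        await asyncio.sleep(0)
        return f"echo: {prompt}"

    def execute(self, prompt: str) -> str:
        # Sync wrapper: run the coroutine to completion on a fresh event loop.
        # asyncio.run() raises RuntimeError if a loop is already running in
        # this thread, so async callers should await execute_async() directly.
        return asyncio.run(self.execute_async(prompt))


result = EchoAdapter().execute("hello")
print(result)
```

Keeping the async method as the primary path and deriving the sync one from it avoids maintaining two separate implementations.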

Installation

# Base installation (Ollama support only)
pip install netrun-llm

# With Claude/Anthropic support (quotes keep shells such as zsh from expanding the brackets)
pip install "netrun-llm[anthropic]"

# With OpenAI support
pip install "netrun-llm[openai]"

# Full installation (all providers)
pip install "netrun-llm[all]"

Quick Start

Basic Usage with Fallback Chain

from netrun_llm import LLMFallbackChain

# Create default chain: Claude -> OpenAI -> Ollama
chain = LLMFallbackChain()

# Execute with automatic fallback
response = chain.execute("Explain quantum computing in 3 sentences")

print(f"Response: {response.content}")
print(f"Handled by: {response.adapter_name}")
print(f"Cost: ${response.cost_usd:.6f}")
print(f"Fallbacks used: {response.metadata.get('fallback_attempts', 0)}")
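
To make the fallback behaviour concrete, here is a self-contained sketch of the general pattern: try each adapter in order and return the first success. It does not import netrun_llm; `StubAdapter`, `Response`, and `execute_with_fallback` are hypothetical stand-ins, and the real chain may differ (for example, in which exceptions it catches).

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    content: str
    adapter_name: str
    metadata: dict = field(default_factory=dict)


class StubAdapter:
    """Hypothetical adapter that either answers or raises."""

    def __init__(self, name, fails=False):
        self.name = name
        self.fails = fails

    def execute(self, prompt):
        if self.fails:
            raise RuntimeError(f"{self.name} unavailable")
        return Response(content=f"{self.name} says hi", adapter_name=self.name)


def execute_with_fallback(adapters, prompt):
    """Try each adapter in order and return the first successful response."""
    errors = {}
    for attempt, adapter in enumerate(adapters):
        try:
            response = adapter.execute(prompt)
        except Exception as exc:  # a real chain would likely catch narrower errors
            errors[adapter.name] = str(exc)
            continue
        # Number of adapters that failed before this one succeeded.
        response.metadata["fallback_attempts"] = attempt
        return response
    raise RuntimeError(f"All adapters failed: {errors}")


chain = [StubAdapter("claude", fails=True), StubAdapter("openai"), StubAdapter("ollama")]
response = execute_with_fallback(chain, "Explain quantum computing")
print(response.adapter_name, response.metadata["fallback_attempts"])
```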

Three-Tier Cognition (Streaming)

import asyncio
from netrun_llm import ThreeTierCognition, CognitionTier

async def main():
    cognition = ThreeTierCognition()

    async for response in cognition.stream_response("What is machine learning?"):
        if response.tier == CognitionTier.FAST_ACK:
            print(f"[Thinking...] {response.content}")
        elif response.tier == CognitionTier.RAG:
            print(f"[Context] {response.content}")
        elif response.tier == CognitionTier.DEEP:
            print(f"[Answer] {response.content}")

asyncio.run(main())

Individual Adapters

from netrun_llm import ClaudeAdapter, OpenAIAdapter, OllamaAdapter

# Claude adapter
claude = ClaudeAdapter()
response = claude.execute("Write a haiku about Python")
print(response.content)

# OpenAI adapter
openai = OpenAIAdapter()
response = openai.execute("What is 2+2?")
print(response.content)

# Ollama adapter (local, free)
ollama = OllamaAdapter(model="llama3")
if ollama.check_availability():
    response = ollama.execute("Hello, world!")
    print(response.content)

Configuration

Environment Variables

# API Keys (use placeholders in code, set actual values in env)
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OLLAMA_HOST=http://localhost:11434

# Optional: Default models
CLAUDE_DEFAULT_MODEL=claude-sonnet-4-5-20250929
OPENAI_DEFAULT_MODEL=gpt-4-turbo
OLLAMA_DEFAULT_MODEL=llama3

# Optional: Timeouts and limits
LLM_REQUEST_TIMEOUT=30
LLM_DEFAULT_MAX_TOKENS=4096
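
For reference, an application could read these variables with the standard library as sketched below. The `load_settings` helper is hypothetical, and the fallback values are simply the sample values shown above, not confirmed library defaults.

```python
import os


def load_settings(env=None):
    """Read the variables listed above from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return {
        # API keys have no fallback: None signals "not configured".
        "anthropic_api_key": env.get("ANTHROPIC_API_KEY"),
        "openai_api_key": env.get("OPENAI_API_KEY"),
        "ollama_host": env.get("OLLAMA_HOST", "http://localhost:11434"),
        "claude_model": env.get("CLAUDE_DEFAULT_MODEL", "claude-sonnet-4-5-20250929"),
        "openai_model": env.get("OPENAI_DEFAULT_MODEL", "gpt-4-turbo"),
        "ollama_model": env.get("OLLAMA_DEFAULT_MODEL", "llama3"),
        # Environment values are strings, so numeric settings are converted.
        "request_timeout": float(env.get("LLM_REQUEST_TIMEOUT", "30")),
        "max_tokens": int(env.get("LLM_DEFAULT_MAX_TOKENS", "4096")),
    }


settings = load_settings({"LLM_REQUEST_TIMEOUT": "45"})
print(settings["request_timeout"], settings["max_tokens"])
```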

Using Placeholders (Security Best Practice)

from netrun_llm import LLMConfig

# Placeholders are resolved from environment at runtime
config = LLMConfig(
    anthropic_api_key="{{ANTHROPIC_API_KEY}}",  # Resolved from env
    openai_api_key="{{OPENAI_API_KEY}}",
    ollama_host="{{OLLAMA_HOST}}",
)

# Validate configuration
issues = config.validate()
if issues:
    print(f"Configuration issues: {issues}")
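
The placeholder idea can be illustrated with a few lines of standard-library code. This is a sketch of one plausible resolution rule, assuming a value is a placeholder only when it is exactly `{{NAME}}`; `resolve_placeholder` is a hypothetical helper, not the library's API.

```python
import os
import re

# Matches a value that consists solely of {{NAME}}.
PLACEHOLDER = re.compile(r"^\{\{([A-Z0-9_]+)\}\}$")


def resolve_placeholder(value, env=None):
    """Return the env value for '{{NAME}}', or the input unchanged otherwise."""
    env = os.environ if env is None else env
    match = PLACEHOLDER.match(value)
    if match is None:
        return value  # literal value, nothing to resolve
    name = match.group(1)
    if name not in env:
        raise KeyError(f"Environment variable {name} is not set")
    return env[name]


fake_env = {"OLLAMA_HOST": "http://localhost:11434"}
print(resolve_placeholder("{{OLLAMA_HOST}}", fake_env))
```

Because only the placeholder string appears in source code, the real secret never needs to be committed.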

Adapters

ClaudeAdapter (Anthropic)

from netrun_llm import ClaudeAdapter

adapter = ClaudeAdapter(
    default_model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
)

response = adapter.execute(
    "Analyze this code",
    context={
        "model": "claude-3-opus-20240229",  # Override model
        "temperature": 0.7,
        "system": "You are a code reviewer.",
    }
)

Supported Models:

  • claude-sonnet-4-5-20250929 (recommended)
  • claude-3-5-sonnet-20241022
  • claude-3-opus-20240229
  • claude-3-sonnet-20240229
  • claude-3-haiku-20240307

OpenAIAdapter

from netrun_llm import OpenAIAdapter

adapter = OpenAIAdapter(
    default_model="gpt-4-turbo",
    max_tokens=4096,
    timeout=30,
)

response = adapter.execute(
    "Write a Python function to sort a list",
    context={
        "model": "gpt-4o",
        "temperature": 0.5,
    }
)

Supported Models:

  • gpt-4-turbo (recommended)
  • gpt-4o, gpt-4o-mini
  • gpt-4
  • gpt-3.5-turbo

OllamaAdapter (Local/Free)

from netrun_llm import OllamaAdapter

adapter = OllamaAdapter(
    model="llama3",
    host="http://localhost:11434",
    fallback_hosts=["http://backup-server:11434"],
)

# Check if Ollama is running
if adapter.check_availability():
    response = adapter.execute("Hello!")
    print(response.content)
    print(f"Cost: ${response.cost_usd}")  # Always $0.00

# List available models
models = adapter.list_available_models()
print(f"Available: {models}")
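
The `fallback_hosts` option suggests trying hosts in order until one responds. A minimal sketch of that selection logic follows; `first_available_host` and the injected `is_reachable` probe are hypothetical. A real probe might issue an HTTP GET against Ollama's `/api/tags` endpoint, but the example uses a fake probe so it runs without a server.

```python
def first_available_host(hosts, is_reachable):
    """Return the first host for which is_reachable(host) is true, else None."""
    for host in hosts:
        try:
            if is_reachable(host):
                return host
        except OSError:
            # Treat connection errors as "not reachable" and keep trying.
            continue
    return None


hosts = ["http://localhost:11434", "http://backup-server:11434"]
# Fake probe for the example: pretend only the backup host is up.
chosen = first_available_host(hosts, lambda host: "backup" in host)
print(chosen)
```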

Supported Models:

  • llama3, llama3.1, llama3.2
  • codellama
  • mistral
  • phi-3
  • gemma2
  • qwen2

Fallback Chain

Default Chain

from netrun_llm import LLMFallbackChain

# Default: Claude -> OpenAI -> Ollama
chain = LLMFallbackChain()

Custom Chain

from netrun_llm import LLMFallbackChain, ClaudeAdapter, OpenAIAdapter, OllamaAdapter

# Cost-optimized: Free first, premium last
chain = LLMFallbackChain(adapters=[
    OllamaAdapter(model="llama3"),      # Free
    OpenAIAdapter(default_model="gpt-3.5-turbo"),  # Cheap
    ClaudeAdapter(),                     # Premium fallback
])

response = chain.execute("Simple question")
print(f"Cost: ${response.cost_usd}")  # Likely $0.00 if Ollama available

Chain Metrics

metrics = chain.get_metrics()
print(f"Success rate: {metrics['success_rate']:.1f}%")
print(f"Fallback rate: {metrics['fallback_rate']:.1f}%")
print(f"Total cost: ${metrics['total_cost_usd']:.4f}")
print(f"Adapter usage: {metrics['adapter_usage']}")
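
For intuition about how such percentages could be derived, here is a small sketch that accumulates per-request counters and reports the same keys. `ChainMetrics` is a hypothetical class, and the definitions used (fallback rate as the share of requests that needed at least one fallback) are assumptions rather than the library's documented semantics.

```python
from collections import Counter


class ChainMetrics:
    """Hypothetical counter-based metrics collector."""

    def __init__(self):
        self.total = 0
        self.successes = 0
        self.fallbacks = 0
        self.total_cost_usd = 0.0
        self.adapter_usage = Counter()

    def record(self, adapter_name, success, fallback_attempts=0, cost_usd=0.0):
        self.total += 1
        if success:
            self.successes += 1
            self.adapter_usage[adapter_name] += 1
            self.total_cost_usd += cost_usd
        if fallback_attempts > 0:
            self.fallbacks += 1

    def snapshot(self):
        def pct(count):
            # Guard against division by zero before any request is recorded.
            return 100.0 * count / self.total if self.total else 0.0

        return {
            "success_rate": pct(self.successes),
            "fallback_rate": pct(self.fallbacks),
            "total_cost_usd": self.total_cost_usd,
            "adapter_usage": dict(self.adapter_usage),
        }


collector = ChainMetrics()
collector.record("claude", success=True, cost_usd=0.004)
collector.record("openai", success=True, fallback_attempts=1, cost_usd=0.002)
summary = collector.snapshot()
print(summary)
```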

Three-Tier Cognition

The cognition system provides progressive response generation with latency targets:

Tier       Target Latency   Purpose
FAST_ACK   <100ms           Immediate acknowledgment
RAG        <2s              Knowledge-enhanced response
DEEP       <5s              Full LLM reasoning

Streaming Mode

import asyncio
from netrun_llm import ThreeTierCognition, CognitionTier

async def chat():
    cognition = ThreeTierCognition()

    async for response in cognition.stream_response("Explain quantum computing"):
        print(f"[{response.tier.name}] {response.content}")
        print(f"  Latency: {response.latency_ms}ms, Final: {response.is_final}")

asyncio.run(chat())
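
The progressive pattern itself can be sketched with a plain async generator. The example below is a toy model, not the library's implementation: `Tier`, `TierResponse`, and `stream_tiers` are hypothetical, the delays are placeholders for real work, and the tiers run sequentially here even though a real system might start them concurrently.

```python
import asyncio
import time
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FAST_ACK = 1
    RAG = 2
    DEEP = 3


@dataclass
class TierResponse:
    tier: Tier
    content: str
    latency_ms: float
    is_final: bool


async def stream_tiers(prompt):
    """Yield one response per tier, cheapest first; only the last is final."""
    start = time.perf_counter()
    # (tier, simulated delay in seconds, content)
    stages = [
        (Tier.FAST_ACK, 0.0, "Working on it..."),
        (Tier.RAG, 0.01, f"Context for: {prompt}"),
        (Tier.DEEP, 0.02, f"Full answer to: {prompt}"),
    ]
    for index, (tier, delay, content) in enumerate(stages):
        await asyncio.sleep(delay)
        latency_ms = (time.perf_counter() - start) * 1000
        yield TierResponse(tier, content, latency_ms, index == len(stages) - 1)


async def collect(prompt):
    return [response async for response in stream_tiers(prompt)]


responses = asyncio.run(collect("What is machine learning?"))
for item in responses:
    print(item.tier.name, item.is_final)
```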

Blocking Mode

import asyncio
from netrun_llm import ThreeTierCognition

async def quick_answer():
    cognition = ThreeTierCognition()

    # Returns the best response available within the timeout
    response = await cognition.execute("What is 2+2?", min_confidence=0.5)
    print(f"Answer: {response.content}")
    print(f"Tier: {response.tier.name}, Confidence: {response.confidence}")

asyncio.run(quick_answer())
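
One way to read "best response within timeout" together with `min_confidence` is: stop at the first response that meets the threshold, otherwise return the highest-confidence response seen before time runs out. The sketch below implements that interpretation with hypothetical names (`fake_stream`, `best_response`); the library's actual selection rule is not documented here and may differ.

```python
import asyncio


async def fake_stream():
    """Hypothetical tier stream yielding (content, confidence) pairs."""
    stages = [(0.0, "ack", 0.1), (0.01, "draft", 0.6), (0.02, "final", 0.9)]
    for delay, content, confidence in stages:
        await asyncio.sleep(delay)
        yield content, confidence


async def best_response(stream, min_confidence, timeout):
    """Return the first item meeting min_confidence, or the best seen before timeout."""
    best = None

    async def consume():
        nonlocal best
        async for content, confidence in stream:
            if best is None or confidence > best[1]:
                best = (content, confidence)
            if confidence >= min_confidence:
                return  # good enough, stop early

    try:
        await asyncio.wait_for(consume(), timeout)
    except asyncio.TimeoutError:
        pass  # keep whatever was collected so far
    return best


answer = asyncio.run(best_response(fake_stream(), min_confidence=0.5, timeout=1.0))
print(answer)
```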

With RAG Integration

from netrun_llm import ThreeTierCognition

async def retrieve_documents(query: str) -> list[str]:
    """Your document retrieval function."""
    # Could use Pinecone, Chroma, etc.
    return ["Relevant document 1", "Relevant document 2"]

cognition = ThreeTierCognition(
    enable_rag=True,
    rag_retrieval=retrieve_documents,
)
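
To show what a retrieval callback like this is typically used for, the sketch below awaits the retriever and folds the returned documents into a prompt. `build_rag_prompt` and its prompt layout are assumptions for illustration; how ThreeTierCognition actually combines retrieved documents with the query is not specified above.

```python
import asyncio


async def retrieve_documents(query: str) -> list[str]:
    """Stub retriever standing in for a vector store lookup."""
    return ["Relevant document 1", "Relevant document 2"]


async def build_rag_prompt(query, retrieve, max_docs=3):
    """Fetch documents and prepend them to the question as numbered context."""
    documents = (await retrieve(query))[:max_docs]
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return f"Context:\n{context}\n\nQuestion: {query}"


prompt = asyncio.run(build_rag_prompt("What is RAG?", retrieve_documents))
print(prompt)
```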

Error Handling

from netrun_llm import (
    LLMFallbackChain,
    AllAdaptersFailedError,
    RateLimitError,
    CircuitBreakerOpenError,
)

chain = LLMFallbackChain()

try:
    response = chain.execute("Test prompt")
except AllAdaptersFailedError as e:
    print(f"All adapters failed: {e.failed_adapters}")
    print(f"Errors: {e.errors}")
except RateLimitError as e:
    print(f"Rate limited on {e.adapter_name}")
    print(f"Retry after: {e.retry_after_seconds}s")
except CircuitBreakerOpenError as e:
    print(f"Circuit breaker open for {e.adapter_name}")
    print(f"Cooldown: {e.cooldown_remaining_seconds}s")
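
The circuit breaker behind `CircuitBreakerOpenError` can be understood through a small model: after a number of consecutive failures the breaker "opens" and rejects calls until a cooldown has passed. The class below is a generic sketch of that pattern with hypothetical names and thresholds; it is not the package's implementation.

```python
import time


class CircuitBreaker:
    """Toy breaker: opens after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None

    def cooldown_remaining(self):
        """Seconds until calls are allowed again (0.0 if the breaker is closed)."""
        if self._opened_at is None:
            return 0.0
        elapsed = self._clock() - self._opened_at
        return max(0.0, self.cooldown_seconds - elapsed)

    def allow_request(self):
        if self._opened_at is not None and self.cooldown_remaining() == 0.0:
            # Cooldown elapsed: reset and let a trial call through.
            self._opened_at = None
            self._failures = 0
        return self._opened_at is None

    def record_success(self):
        self._failures = 0

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


now = [0.0]  # fake clock for a deterministic demonstration
breaker = CircuitBreaker(failure_threshold=2, cooldown_seconds=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request(), breaker.cooldown_remaining())
```

Keeping one breaker per adapter means a failing provider is skipped quickly while the others continue to serve requests.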

Pricing Reference (2025)

Provider   Model            Input (per 1M tokens)   Output (per 1M tokens)
Claude     Sonnet 4.5/3.5   $3.00                   $15.00
Claude     Opus 3           $15.00                  $75.00
Claude     Haiku 3          $0.25                   $1.25
OpenAI     GPT-4 Turbo      $10.00                  $30.00
OpenAI     GPT-4o           $5.00                   $15.00
OpenAI     GPT-3.5 Turbo    $0.50                   $1.50
Ollama     All models       $0.00                   $0.00
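
As a worked example of how these rates translate into a per-request cost, the sketch below multiplies token counts by the per-million prices from the table. The dictionary keys and the `estimate_cost_usd` helper are hypothetical labels chosen for this example, not identifiers from the library.

```python
# (input, output) USD per 1M tokens, copied from the table above.
PRICING = {
    "claude-sonnet": (3.00, 15.00),
    "claude-opus-3": (15.00, 75.00),
    "claude-haiku-3": (0.25, 1.25),
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o": (5.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "ollama": (0.00, 0.00),
}


def estimate_cost_usd(model_key, input_tokens, output_tokens):
    """Cost = tokens * rate per token, with rates quoted per 1M tokens."""
    input_rate, output_rate = PRICING[model_key]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# 1,200 input tokens and 300 output tokens on GPT-4 Turbo:
# (1200 * 10.00 + 300 * 30.00) / 1,000,000 = 0.021 USD
cost = estimate_cost_usd("gpt-4-turbo", input_tokens=1_200, output_tokens=300)
print(f"${cost:.6f}")
```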

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
