Smart AI model cascading for cost optimization - Save 40-85% on LLM costs with 2-6x faster responses. Available for Python and TypeScript/JavaScript.

Smart AI model cascading for cost optimization

Python • TypeScript • n8n • 📖 Docs • 💡 Examples

Stop Bleeding Money on AI Calls. Cut Costs 30-65% in 3 Lines of Code.

40-70% of text prompts and 20-60% of agent calls don't need expensive flagship models. You're overpaying every single day.

cascadeflow fixes this with intelligent model cascading, available in Python and TypeScript.

pip install cascadeflow

npm install @cascadeflow/core

Why cascadeflow?

cascadeflow is an intelligent AI model cascading library that dynamically selects the optimal model for each query or tool call through speculative execution. It builds on research showing that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. For the remaining queries that need advanced reasoning, cascadeflow automatically escalates to flagship models.

Use Cases

Use cascadeflow for:

Cost Optimization. Reduce API costs by 40-85% through intelligent model cascading and speculative execution with automatic per-query cost tracking.
Cost Control and Transparency. Built-in telemetry for query, model, and provider-level cost tracking with configurable budget limits and programmable spending caps.
Low Latency & Speed Optimization. Sub-2ms framework overhead with fast provider routing (Groq sub-50ms). Cascade simple queries to fast models while reserving expensive models for complex reasoning, achieving a 2-10x latency reduction overall. Use the PRESET_ULTRA_FAST preset for this profile.
Multi-Provider Flexibility. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and Hugging Face with automatic provider detection and zero vendor lock-in. Optional LiteLLM integration for 100+ additional providers.
Edge & Local-Hosted AI Deployment. Get the best of both worlds: handle most queries with local models (vLLM, Ollama), then automatically escalate complex queries to cloud providers only when needed.
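
The budget limits and spending caps mentioned above can be pictured with a small tracker. This is a minimal sketch, not the cascadeflow API: `BudgetTracker`, `BudgetExceeded`, and the model names are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a call would push spending past the configured cap."""


@dataclass
class BudgetTracker:
    # Hypothetical helper for illustration; not part of the cascadeflow API.
    limit_usd: float
    spent_usd: float = 0.0
    by_model: dict = field(default_factory=dict)

    def record(self, model: str, cost_usd: float) -> None:
        """Add one call's cost, refusing it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(f"cap of ${self.limit_usd:.2f} would be exceeded")
        self.spent_usd += cost_usd
        self.by_model[model] = self.by_model.get(model, 0.0) + cost_usd


tracker = BudgetTracker(limit_usd=0.01)
tracker.record("small-model", 0.002)
tracker.record("large-model", 0.006)
try:
    tracker.record("large-model", 0.006)  # 0.008 + 0.006 > 0.01, so this is refused
    blocked = False
except BudgetExceeded:
    blocked = True

print(f"spent ${tracker.spent_usd:.3f}, last call blocked: {blocked}")
```

Keeping a per-model breakdown next to the running total is what makes model-level cost reporting possible from the same object that enforces the cap.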

ℹ️ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper

How cascadeflow Works

cascadeflow uses speculative execution with quality validation:

Speculatively executes small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
Learns patterns to optimize future cascading decisions and domain-specific routing
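
The first three steps above can be sketched as a short loop. This is a toy illustration under stated assumptions, not cascadeflow's implementation: `Tier`, `good_enough`, and `cascade` are hypothetical names, the "models" are plain functions, and the quality gate is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost_per_call: float
    generate: Callable[[str], str]  # stand-in for a real provider call


def good_enough(answer: str, min_words: int = 3) -> bool:
    """Toy quality gate: long enough and not an explicit refusal."""
    text = answer.strip().lower()
    return len(text.split()) >= min_words and "i don't know" not in text


def cascade(query: str, tiers: List[Tier]) -> Tuple[str, str, float]:
    """Try tiers from cheapest to most expensive; stop at the first accepted answer.

    If no tier passes the gate, the last tier's answer is returned as-is.
    """
    total_cost = 0.0
    answer, used = "", ""
    for tier in tiers:
        answer, used = tier.generate(query), tier.name
        total_cost += tier.cost_per_call  # a rejected draft still costs money
        if good_enough(answer):
            break
    return answer, used, total_cost


small = Tier("small", 0.0001,
             lambda q: "I don't know" if "prove" in q else "Paris is the capital of France.")
large = Tier("large", 0.0020,
             lambda q: "Here is a careful, step-by-step answer.")

easy = cascade("What is the capital of France?", [small, large])
hard = cascade("Please prove this theorem.", [small, large])
print(easy[1], hard[1])
```

Note that an escalated query pays for both the rejected draft and the larger model, which is why the acceptance rate of the small model drives the overall savings.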

Zero configuration. Works with YOUR existing models (7 providers currently supported).

In practice, 60-70% of queries are handled by small, efficient models (8-20x cost difference) without requiring escalation.

Result: 40-85% cost reduction, 2-10x faster responses, zero quality loss.
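
As a rough back-of-the-envelope check on the cost claim, the expected cost of a cascade can be estimated from the acceptance rate of the small model. The sketch below assumes every query is drafted by the small model, rejected drafts are re-run on the large model, and both models process the same number of tokens; the function names are hypothetical and the prices are picked from the per-1M-token ranges quoted above.

```python
def blended_cost(small_cost: float, large_cost: float, accept_rate: float) -> float:
    """Expected cost when all queries hit the small model and only
    rejected drafts are re-run on the large model."""
    return small_cost + (1.0 - accept_rate) * large_cost


def savings_vs_large_only(small_cost: float, large_cost: float, accept_rate: float) -> float:
    """Fraction saved compared with sending everything to the large model."""
    return 1.0 - blended_cost(small_cost, large_cost, accept_rate) / large_cost


# Illustrative prices in USD per 1M tokens (upper ends of the ranges above).
small, large = 0.30, 3.00
for rate in (0.60, 0.65, 0.70):
    pct = savings_vs_large_only(small, large, rate)
    print(f"accept rate {rate:.0%}: savings {pct:.0%}")
```

With these inputs the savings come out at roughly 50-60%. A wider price gap between the two models or a higher acceptance rate pushes the figure up, and a narrower gap pulls it down.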

┌─────────────────────────────────────────────────────────────┐
│                      cascadeflow Stack                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascade Agent                                        │  │
│  │                                                       │  │
│  │  Orchestrates the entire cascade execution            │  │
│  │  • Query routing & model selection                    │  │
│  │  • Drafter -> Verifier coordination                   │  │
│  │  • Cost tracking & telemetry                          │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Domain Pipeline                                      │  │
│  │                                                       │  │
│  │  Automatic domain classification                      │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)      │  │
│  │  • Optional ML semantic classification                │  │
│  │  • Domain-optimized pipelines & model selection       │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Quality Validation Engine                            │  │
│  │                                                       │  │
│  │  Multi-dimensional quality checks                     │  │
│  │  • Length validation (too short/verbose)              │  │
│  │  • Confidence scoring (logprobs analysis)             │  │
│  │  • Format validation (JSON, structured output)        │  │
│  │  • Semantic alignment (intent matching)               │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascading Engine (<2ms overhead)                     │  │
│  │                                                       │  │
│  │  Smart model escalation strategy                      │  │
│  │  • Try cheap models first (speculative execution)     │  │
│  │  • Validate quality instantly                         │  │
│  │  • Escalate only when needed                          │  │
│  │  • Automatic retry & fallback                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Provider Abstraction Layer                           │  │
│  │                                                       │  │
│  │  Unified interface for 7+ providers                   │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                 │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM            │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
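
To make the Quality Validation Engine box more concrete, here is a minimal sketch of three of the checks it lists: length, logprob-based confidence, and JSON format. All function names and default thresholds are assumptions made for illustration; cascadeflow's actual checks may be implemented differently.

```python
import json
import math
from typing import List


def length_ok(text: str, min_words: int = 3, max_words: int = 400) -> bool:
    """Reject answers that are suspiciously short or excessively verbose."""
    return min_words <= len(text.split()) <= max_words


def confidence_from_logprobs(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability: exp(mean(logprob)), in (0, 1]."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def json_ok(text: str) -> bool:
    """Format check for structured output: does the text parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def passes(text: str, token_logprobs: List[float],
           expect_json: bool = False, min_confidence: float = 0.4) -> bool:
    """Accept a draft only if the format/length check and the confidence check pass."""
    shape_ok = json_ok(text) if expect_json else length_ok(text)
    return shape_ok and confidence_from_logprobs(token_logprobs) >= min_confidence


confident = passes("Paris is the capital of France.", [-0.05, -0.1, -0.02])
unsure = passes("Maybe it is Lyon?", [-2.0, -1.5, -2.5])
structured = passes('{"city": "Paris"}', [-0.1], expect_json=True)
print(confident, unsure, structured)
```

Averaging log-probabilities before exponentiating keeps the score comparable across answers of different lengths, which a plain product of token probabilities would not.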

Quick Start

Python

pip install cascadeflow[all]

import asyncio

from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),         # Verifier model (~$5.62/1M tokens)
])


async def main():
    # Run query - automatically routes to optimal model
    result = await agent.run("What's the capital of France?")

    print(f"Answer: {result.content}")
    print(f"Model used: {result.model_used}")
    print(f"Cost: ${result.total_cost:.6f}")


# agent.run is awaited, so a script needs an event loop to drive it
asyncio.run(main())

💡 Optional: Use ML-based Semantic Quality Validation

For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.

Step 1: Install the optional ML package:

pip install cascadeflow[ml]  # Adds semantic similarity via FastEmbed (~80MB model)

Step 2: Use semantic quality validation:

from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize semantic checker (downloads model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7     # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."

result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")

What you get:

🎯 Semantic similarity scoring (query ↔ response alignment)
🛡️ Optional toxicity detection
🔄 Automatic model download and caching
🚀 Fast inference (~100ms per check)
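
For intuition, semantic similarity scoring usually comes down to comparing embedding vectors with cosine similarity and applying a threshold. The sketch below uses hand-written 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `aligned` are hypothetical helpers, and this is not necessarily how SemanticQualityChecker computes its score.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def aligned(query_vec: Sequence[float], response_vec: Sequence[float],
            threshold: float = 0.5) -> bool:
    """A response counts as on-topic when its similarity reaches the threshold."""
    return cosine_similarity(query_vec, response_vec) >= threshold


# Toy "embeddings"; a real checker would obtain these from an embedding model.
query_vec = [0.9, 0.1, 0.0]
on_topic_vec = [0.8, 0.2, 0.1]
off_topic_vec = [0.0, 0.1, 0.9]

print(aligned(query_vec, on_topic_vec), aligned(query_vec, off_topic_vec))
```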

Full example: See semantic_quality_domain_detection.py

⚠️ GPT-5 Note: GPT-5 streaming requires organization verification; non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without verification. GPT-5 is only called when needed (typically 20-30% of requests).

📖 Learn more: Python Documentation | Quickstart Guide | Providers Guide

TypeScript

npm install @cascadeflow/core

import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);

💡 Optional: ML-based Semantic Quality Validation

For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.

Step 1: Install the optional ML packages:

npm install @cascadeflow/ml @xenova/transformers

Step 2: Enable semantic validation in your cascade:

import { CascadeAgent } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,                    // Traditional confidence threshold
    requireMinimumTokens: 5,            // Minimum response length
    useSemanticValidation: true,        // Enable ML validation
    semanticThreshold: 0.5,             // 50% minimum similarity
  },
});

// Responses now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');

Step 3: Or use semantic validation directly:

import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();

if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );

  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}

What you get:

🎯 Query-response semantic alignment detection
🚫 Off-topic response filtering
📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
⚡ Fast CPU inference (~50-100ms with caching)
🔄 Request-scoped caching (50% latency reduction)
🌐 Works in Node.js, Browser, and Edge Functions
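
The idea behind request-scoped caching can be shown in a few lines, sketched here in Python: within one request the same query text is compared against more than one response, so its embedding only needs to be computed once. `RequestEmbeddingCache` and `fake_embed` are hypothetical stand-ins and do not reflect the library's internals.

```python
from typing import Callable, Dict, List


class RequestEmbeddingCache:
    """Memoizes embeddings for the lifetime of one request (hypothetical helper)."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self._embed = embed
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the underlying model was actually called

    def get(self, text: str) -> List[float]:
        if text not in self._store:
            self.misses += 1
            self._store[text] = self._embed(text)
        return self._store[text]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding model call, which is the slow part.
    return [float(len(text)), float(text.count(" "))]


cache = RequestEmbeddingCache(fake_embed)
query = "What is TypeScript?"
cache.get(query)              # computed
cache.get("draft answer")     # computed
cache.get(query)              # reused when a second answer is checked
cache.get("verifier answer")  # computed

print(f"4 lookups, {cache.misses} embedding calls")
```

Creating a fresh cache per request keeps memory bounded and avoids serving stale entries across unrelated requests.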

Example: semantic-quality.ts

📖 Learn more: TypeScript Documentation | Quickstart Guide | Node.js Examples | Browser/Edge Guide

🔄 Migration Example

Migrate from a direct provider integration in about 5 minutes and gain cost savings plus full cost control and transparency.

Before (Standard Approach)

Cost: $0.000113, Latency: 850ms

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Using expensive model for everything
result = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

After (With cascadeflow)

Cost: $0.000007, Latency: 234ms

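import asyncio
from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])

# agent.run is async; asyncio.run drives it from synchronous code
result = asyncio.run(agent.run("What's 2+2?"))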

🔥 Saved: $0.000106 (94% reduction), 3.6x faster
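
The summary line above follows directly from the before and after figures. A short check, using only the numbers quoted in this example:

```python
before_cost, before_latency_ms = 0.000113, 850
after_cost, after_latency_ms = 0.000007, 234

saved = before_cost - after_cost                  # absolute saving per call, in USD
reduction = saved / before_cost                   # fraction of the original cost
speedup = before_latency_ms / after_latency_ms    # how many times faster

summary = f"Saved: ${saved:.6f} ({reduction:.0%} reduction), {speedup:.1f}x faster"
print(summary)  # Saved: $0.000106 (94% reduction), 3.6x faster
```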

📊 Learn more: Cost Tracking Guide | Production Best Practices | Performance Optimization

n8n Integration

Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!

Installation

Open n8n
Go to Settings → Community Nodes
Search for: @cascadeflow/n8n-nodes-cascadeflow
Click Install

Quick Example

Create a workflow:

Manual Trigger → cascadeflow Node → Set Node

Configure cascadeflow node:

Draft Model: gpt-4o-mini ($0.000375)
Verifier Model: gpt-4o ($0.00625)
Message: Your prompt
Output: Full Metrics

Result: 40-85% cost savings in your n8n workflows!

Features:

✅ Visual workflow integration
✅ Multi-provider support
✅ Cost tracking in workflow
✅ Tool calling support
✅ Easy debugging with metrics

🔌 Learn more: n8n Integration Guide | n8n Documentation

Resources

Examples

Python Examples:

Basic Examples - Get started quickly

Example	Description	Link
Basic Usage	Simple cascade setup with OpenAI models	View
Preset Usage	Use built-in presets for quick setup	View
Multi-Provider	Mix multiple AI providers in one cascade	View
Reasoning Models	Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1)	View
Tool Execution	Function calling and tool usage	View
Streaming Text	Stream responses from cascade agents	View
Cost Tracking	Track and analyze costs across queries	View

Advanced Examples - Production & customization

Example	Description	Link
Production Patterns	Best practices for production deployments	View
FastAPI Integration	Integrate cascades with FastAPI	View
Streaming Tools	Stream tool calls and responses	View
Batch Processing	Process multiple queries efficiently	View
Multi-Step Cascade	Build complex multi-step cascades	View
Edge Device	Run cascades on edge devices with local models	View
vLLM Example	Use vLLM for local model deployment	View
Custom Cascade	Build custom cascade strategies	View
Custom Validation	Implement custom quality validators	View
User Budget Tracking	Per-user budget enforcement and tracking	View
User Profile Usage	User-specific routing and configurations	View
Rate Limiting	Implement rate limiting for cascades	View
Guardrails	Add safety and content guardrails	View
Cost Forecasting	Forecast costs and detect anomalies	View
Semantic Quality Detection	ML-based domain and quality detection	View
Profile Database Integration	Integrate user profiles with databases	View

TypeScript Examples:

Basic Examples - Get started quickly

Example	Description	Link
Basic Usage	Simple cascade setup (Node.js)	View
Tool Calling	Function calling with tools (Node.js)	View
Multi-Provider	Mix providers in TypeScript (Node.js)	View
Reasoning Models	Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1)	View
Cost Tracking	Track and analyze costs across queries	View
Semantic Quality	ML-based semantic validation with embeddings	View
Streaming	Stream responses in TypeScript	View

Advanced Examples - Production & edge deployment

Example	Description	Link
Production Patterns	Production best practices (Node.js)	View
Browser/Edge	Vercel Edge runtime example	View

📂 View All Python Examples → | View All TypeScript Examples →

Documentation

Getting Started - Core concepts and basics

Guide	Description	Link
Quickstart	Get started with cascadeflow in 5 minutes	Read
Providers Guide	Configure and use different AI providers	Read
Presets Guide	Using and creating custom presets	Read
Streaming Guide	Stream responses from cascade agents	Read
Tools Guide	Function calling and tool usage	Read
Cost Tracking	Track and analyze API costs	Read

Advanced Topics - Production, customization & integrations

Guide	Description	Link
Production Guide	Best practices for production deployments	Read
Performance Guide	Optimize cascade performance and latency	Read
Custom Cascade	Build custom cascade strategies	Read
Custom Validation	Implement custom quality validators	Read
Edge Device	Deploy cascades on edge devices	Read
Browser Cascading	Run cascades in the browser/edge	Read
FastAPI Integration	Integrate with FastAPI applications	Read
n8n Integration	Use cascadeflow in n8n workflows	Read

📚 View All Documentation →

Features

Feature	Benefit
🎯 Speculative Cascading	Tries cheap models first, escalates intelligently
💰 40-85% Cost Savings	Research-backed, proven in production
⚡ 2-10x Faster	Small models respond in <50ms vs 500-2000ms
⚡ Low Latency	Sub-2ms framework overhead, negligible performance impact
🔄 Mix Any Providers	OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional)
👤 User Profile System	Per-user budgets, tier-aware routing, enforcement callbacks
✅ Quality Validation	Automatic checks + semantic similarity (optional ML, ~80MB, CPU)
🎨 Cascading Policies	Domain-specific pipelines, multi-step validation strategies
🧠 Domain Understanding	Auto-detects code/medical/legal/math/structured data, routes to specialists
🤖 Drafter/Validator Pattern	20-60% savings for agent/tool systems
🔧 Tool Calling Support	Universal format, works across all providers
📊 Cost Tracking	Built-in analytics + OpenTelemetry export (vendor-neutral)
🚀 3-Line Integration	Zero architecture changes needed
🏭 Production Ready	Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection
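
Rule-based domain detection, as listed under Domain Understanding, can be approximated with keyword patterns. The sketch below is an assumption-laden illustration: the pattern lists, the `detect_domain` function, and the GENERAL fallback label are invented for this example and are not cascadeflow's actual rules.

```python
import re
from typing import Dict, List

# Hypothetical keyword rules; a real deployment would tune or replace these.
DOMAIN_PATTERNS: Dict[str, List[str]] = {
    "CODE": [r"\bdef\b", r"\bfunction\b", r"\bstack trace\b", r"\bpython\b", r"\btypescript\b"],
    "MATH": [r"\bintegral\b", r"\bderivative\b", r"\bsolve\b", r"\d+\s*[-+*/^]\s*\d+"],
    "DATA": [r"\bcsv\b", r"\bsql\b", r"\bdataframe\b", r"\bjson\b"],
}


def detect_domain(query: str) -> str:
    """Return the domain whose patterns match most often, or GENERAL if none match."""
    text = query.lower()
    scores = {
        domain: sum(bool(re.search(pattern, text)) for pattern in patterns)
        for domain, patterns in DOMAIN_PATTERNS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "GENERAL"


for q in ("Write a Python function that reverses a list",
          "Solve 12 * 7 and explain the steps",
          "Load this CSV into a dataframe",
          "Tell me a joke"):
    print(f"{detect_domain(q):8s} <- {q}")
```

A detected domain could then be used as a lookup key for a domain-specific model list, with GENERAL falling back to the default cascade.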

License

MIT License. See the LICENSE file.

Free for commercial use. Attribution appreciated but not required.

Contributing

We ❤️ contributions!

📝 Contributing Guide - Python & TypeScript development setup

Roadmap

Cascade Profiler - Analyzes your AI API logs to calculate cost savings potential and generate optimized cascadeflow configurations automatically
User Tier Management - Cost controls and limits per user tier with advanced routing
Semantic Quality Validators - Optional lightweight local quality scoring (200MB CPU model, no external API calls)
Code Complexity Detection - Dynamic cascading based on task complexity analysis
Domain Aware Cascading - Multi-stage pipelines tailored to specific domains
Benchmark Reports - Automated performance and cost benchmarking

Support

📖 GitHub Discussions - Searchable Q&A
🐛 GitHub Issues - Bug reports & feature requests
📧 Email Support - Direct support

Citation

If you use cascadeflow in your research or project, please cite:

@software{cascadeflow2025,
  author = {{Lemony Inc.} and Buehrle, Sascha and {Contributors}},
  title = {cascadeflow: Smart AI model cascading for cost optimization},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}

Ready to cut your AI costs by 40-85%?

pip install cascadeflow

npm install @cascadeflow/core

Read the Docs • View Python Examples • View TypeScript Examples • Join Discussions

About

Built with ❤️ by Lemony Inc. and the cascadeflow Community

One cascade. Hundreds of specialists.

New York | Zurich

⭐ Star us on GitHub if cascadeflow helps you save money!

Release history

Version	Release date
1.2.0	Apr 2, 2026
1.1.0	Mar 8, 2026
1.0.0	Feb 22, 2026
0.7.1	Feb 14, 2026
0.7.0	Feb 14, 2026
0.6.5	Dec 8, 2025
0.6.0	Nov 18, 2025
0.5.0	Nov 7, 2025
0.4.0	Nov 6, 2025 (this version)
0.3.0	Nov 6, 2025
0.1.1	Nov 6, 2025

Download files

Source Distribution

cascadeflow-0.4.0.tar.gz (358.9 kB view details)

Uploaded Nov 6, 2025 Source

Built Distribution

cascadeflow-0.4.0-py3-none-any.whl (330.7 kB view details)

Uploaded Nov 6, 2025 Python 3

File details

Details for the file cascadeflow-0.4.0.tar.gz.

File metadata

Download URL: cascadeflow-0.4.0.tar.gz
Upload date: Nov 6, 2025
Size: 358.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cascadeflow-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`d2c59321246a61791dd43689f6ff1c283339b62a176136bcfe357635219ecebf`
MD5	`759d9801dd051df32857e2af83f737f9`
BLAKE2b-256	`c28f97771ee616b304c40409bb24e17d0d7cf2ca4408bee9771b734151699747`

File details

Details for the file cascadeflow-0.4.0-py3-none-any.whl.

File metadata

Download URL: cascadeflow-0.4.0-py3-none-any.whl
Upload date: Nov 6, 2025
Size: 330.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cascadeflow-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa221b09700c41dec5d8d98bd97c94c2563afcffa08ef35c6a6bc4035a3f6f69`
MD5	`837af61cf0ba58aee4fb9adc1da901e2`
BLAKE2b-256	`6b2f60f3116a9bcbe10fca7bf619b77aca39ed57cec8a55c54f354f234a8d706`
