FakeAI

OpenAI-Compatible API Server for Testing, Development & Benchmarking


FakeAI is a high-performance simulation of the OpenAI API that returns realistic responses without performing actual inference. Ideal for development, testing, CI/CD pipelines, and LLM application benchmarking.

Key Features

  • Zero-cost testing with unlimited requests
  • 100% OpenAI schema compliance
  • No external dependencies or API keys required
  • Handles 10,000+ concurrent requests
  • Comprehensive metrics and monitoring
  • AIPerf benchmarking support

Installation

pip install fakeai

Quick Start

# Start server
fakeai-server

# Use with OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Core Features

OpenAI API Compatibility

  • Chat Completions - Streaming, tools, multi-modal content
  • Embeddings - Vector generation for RAG applications
  • Images - DALL-E simulation
  • Audio - Speech synthesis and transcription
  • Moderations - Content safety classification
  • Files - Upload and management
  • Batches - Asynchronous batch processing
  • Tool Calling - Function calls with parallel execution
  • Structured Outputs - JSON Schema validation
  • Log Probabilities - Token-level logprobs
  • Responses API - March 2025 format
  • Vector Stores - Document storage and retrieval

NVIDIA AI-Dynamo Features

  • KV Cache Reuse - Radix tree prefix matching with smart routing
  • Smart Router - Cache-aware request routing
  • KVBM - 4-tier memory hierarchy (GPU HBM, CPU DRAM, SSD, Remote)
  • SLA-Based Planner - Load prediction with ARIMA and Prophet
  • DCGM Metrics - GPU metrics simulation (25+ field IDs)
  • Dynamo Metrics - TTFT, ITL, TPOT, latency breakdown

Performance & Monitoring

  • uvloop Integration - 2-4x faster async I/O
  • Numpy Metrics - Sliding window rate calculations
  • Prometheus Export - Standard metrics format
  • Real-Time Dashboards - Web UI with Chart.js
  • Configurable Latency - TTFT and ITL with variance control
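
The sliding-window idea behind the numpy metrics is easy to picture. A toy sketch of the technique (not FakeAI's actual metrics code), counting events inside a fixed window:

import time
import numpy as np

# Count events falling inside the last `window` seconds and convert
# the count to a rate. Illustrative only.
def sliding_rate(timestamps, window=5.0):
    cutoff = time.monotonic() - window
    return int((np.asarray(timestamps) >= cutoff).sum()) / window

now = time.monotonic()
events = [now - age for age in (0.1, 0.5, 1.2, 7.0)]  # event ages in seconds
print(f"{sliding_rate(events):.2f} events/s")          # 3 events / 5 s = 0.60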

Configuration

Environment Variables

# Server
export FAKEAI_HOST=0.0.0.0
export FAKEAI_PORT=9001

# Latency (milliseconds)
export FAKEAI_TTFT_MS=20
export FAKEAI_ITL_MS=5

# Security (disabled by default)
export FAKEAI_ENABLE_SECURITY=false
export FAKEAI_REQUIRE_API_KEY=false

CLI Arguments

# Basic usage
fakeai-server --host 0.0.0.0 --port 9001

# Configure latency
fakeai-server --ttft 20 --itl 5

# Latency with variance
fakeai-server --ttft 50:30 --itl 10:20  # 50ms ±30%, 10ms ±20%

# Enable security
fakeai-server --enable-security --api-key sk-test-key

Config File

Create fakeai.yaml:

host: 0.0.0.0
port: 9001
ttft_ms: 20.0
itl_ms: 5.0
kv_cache_enabled: true
enable_security: false

Run with config:

fakeai-server --config-file fakeai.yaml

API Endpoints

Core OpenAI API

Endpoint                 Method           Description
/v1/chat/completions     POST             Chat completions with streaming and tools
/v1/completions          POST             Text completions
/v1/embeddings           POST             Text embeddings
/v1/images/generations   POST             Image generation
/v1/audio/speech         POST             Text-to-speech
/v1/moderations          POST             Content moderation
/v1/models               GET              List models
/v1/files                GET/POST/DELETE  File management
/v1/batches              POST/GET         Batch processing

Extended APIs

Endpoint           Method           Description
/v1/responses      POST             Responses API (March 2025)
/v1/ranking        POST             NVIDIA NIM Rankings
/v1/vector_stores  POST/GET/DELETE  Vector store management

Organization APIs

Endpoint                   Method    Description
/v1/organization/users     GET/POST  User management
/v1/organization/projects  GET/POST  Project management
/v1/organization/usage/*   GET       Usage tracking
/v1/organization/costs     GET       Cost tracking

Monitoring

Endpoint             Method  Description
/health              GET     Health check
/metrics             GET     JSON metrics
/metrics/prometheus  GET     Prometheus format
/kv-cache-metrics    GET     KV cache performance
/dcgm-metrics        GET     GPU metrics
/dynamo-metrics      GET     Inference metrics
/dashboard           GET     Main dashboard
/dashboard/dynamo    GET     Advanced metrics dashboard

Examples

Basic Chat

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing."}
    ]
)
print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Write a haiku"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
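
Since FakeAI simulates TTFT and ITL (see Configuration), a streaming loop is a quick client-side sanity check of those settings. A minimal sketch, reusing the client above:

import time

start = time.perf_counter()
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Write a haiku"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # The first content chunk marks time to first token
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.1f} ms")
        break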

Tool Calling

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools
)

for tool_call in response.choices[0].message.tool_calls:
    print(f"{tool_call.function.name}: {tool_call.function.arguments}")

Structured Outputs

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Extract: Alice is 25"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "number"}
                },
                "required": ["name", "age"]
            }
        }
    }
)
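
The returned message content is a JSON string conforming to the schema, so it parses directly:

import json

person = json.loads(response.choices[0].message.content)
print(person["name"], person["age"])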

Embeddings

response = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input="The quick brown fox jumps over the lazy dog"
)
print(f"Dimensions: {len(response.data[0].embedding)}")

Reasoning Models

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Solve: 2x + 5 = 15"}]
)

print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

NVIDIA NIM Rankings

import httpx

response = httpx.post(
    "http://localhost:8000/v1/ranking",
    json={
        "model": "nvidia/nv-rerankqa-mistral-4b-v3",
        "query": {"text": "What is machine learning?"},
        "passages": [
            {"text": "Machine learning is a subset of AI"},
            {"text": "Python is a programming language"}
        ]
    }
)
print(response.json())

Benchmarking

FakeAI supports AIPerf, the successor to NVIDIA GenAI-Perf.

Install AIPerf

pip install aiperf

Run Benchmark

# Start server
fakeai-server --port 9001 --ttft 20 --itl 5

# Run benchmark
aiperf profile \
  --model openai/gpt-oss-120b \
  --url http://localhost:9001 \
  --endpoint-type chat \
  --service-kind openai \
  --streaming \
  --concurrency 100 \
  --request-count 1000

Automated Suite

cd benchmarks
python run_aiperf_benchmarks.py --quick

See benchmarks/README.md for the complete benchmarking guide.

Performance

Benchmarked at a concurrency of 500 over 1,000 requests:

Request Throughput:        930 req/s
Output Token Throughput:   198,910 tokens/s
Average Latency:           492 ms
P99 Latency:               636 ms
Time to First Token:       492 ms (avg), 636 ms (p99)

Proven capacity: 10,000 concurrent requests with a 100% success rate.

Monitoring

Metrics Endpoints

# JSON format
curl http://localhost:8000/metrics

# Prometheus format
curl http://localhost:8000/metrics/prometheus

# KV cache metrics
curl http://localhost:8000/kv-cache-metrics

# GPU metrics
curl http://localhost:8000/dcgm-metrics

# Inference metrics
curl http://localhost:8000/dynamo-metrics

CLI Tools

# Show metrics
fakeai-server metrics

# Watch metrics (refresh every 5s)
fakeai-server metrics --watch

# Check server status
fakeai-server status

# KV cache statistics
fakeai-server cache-stats

Dashboards

  • Main dashboard: http://localhost:8000/dashboard
  • Advanced metrics: http://localhost:8000/dashboard/dynamo

Supported Models

Chat Models

  • openai/gpt-oss-120b - 117B MoE
  • openai/gpt-oss-20b - 20B MoE
  • meta-llama/Llama-3.1-8B-Instruct - Llama 3.1 8B
  • meta-llama/Llama-3.1-70B-Instruct - Llama 3.1 70B
  • deepseek-ai/DeepSeek-R1 - 671B MoE with reasoning
  • mistralai/Mixtral-8x7B-Instruct-v0.1 - Mixtral 8x7B

Embedding Models

  • sentence-transformers/all-mpnet-base-v2 - 768 dimensions
  • nomic-ai/nomic-embed-text-v1.5 - 768 dimensions
  • BAAI/bge-m3 - Multilingual

Image Models

  • stabilityai/stable-diffusion-2-1
  • stabilityai/stable-diffusion-xl-base-1.0

Audio Models

  • whisper-1 - Speech recognition
  • tts-1, tts-1-hd - Text-to-speech

Dynamic Model Support

Any model ID is automatically created on first use, including LoRA fine-tuned models using the format ft:base_model:org:id.
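
For example, requesting an arbitrary fine-tuned ID just works (the ID below is made up for illustration):

response = client.chat.completions.create(
    model="ft:openai/gpt-oss-20b:acme:abc123",  # hypothetical fine-tune ID
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.model)  # the model now exists server-side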

Advanced Features

AI-Dynamo KV Cache

Simulates NVIDIA AI-Dynamo's datacenter-scale KV cache optimization:

  • Radix tree with O(k) prefix matching
  • Smart router with cache-aware routing
  • 4 simulated workers with block management
  • Configurable block size and overlap weight

Metrics available at /kv-cache-metrics.
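
To illustrate the prefix-matching idea, here is a toy trie over token IDs (a real radix tree compresses token runs into single edges, and this is not FakeAI's implementation):

# Walk cached prompts token by token: O(k) in the query prefix length.
class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, matched = node[t], matched + 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # tokens of a cached prompt
print(cache.longest_prefix([1, 2, 3, 9]))  # -> 3 reusable tokens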

DCGM GPU Metrics

Simulates NVIDIA DCGM metrics without real GPUs:

  • GPU utilization, temperature, power
  • Memory usage and bandwidth
  • SM occupancy, tensor core activity
  • PCIe and NVLink throughput
  • ECC errors and thermal throttling

Available in Prometheus format at /dcgm-metrics.

Dynamo Inference Metrics

LLM inference metrics in NVIDIA Dynamo style:

  • Time to First Token (TTFT): p50, p90, p99
  • Inter-Token Latency (ITL): p50, p90, p99
  • Time Per Output Token (TPOT)
  • Latency breakdown: Queue, Prefill, Decode, Total
  • Request and token throughput

Available at /dynamo-metrics (Prometheus) and /dynamo-metrics/json.
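
For programmatic access, poll the JSON endpoint; the exact payload layout is easiest to discover by inspecting a live response:

import httpx

metrics = httpx.get("http://localhost:8000/dynamo-metrics/json").json()
print(metrics)  # inspect TTFT/ITL/TPOT fields in the returned payload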

Configuration Reference

Server Settings

Variable      Default    Description
FAKEAI_HOST   127.0.0.1  Host to bind
FAKEAI_PORT   8000       Port number
FAKEAI_DEBUG  false      Debug mode

Latency Settings

Variable                      Default  Description
FAKEAI_TTFT_MS                20       Time to first token (ms)
FAKEAI_TTFT_VARIANCE_PERCENT  10       TTFT variance (%)
FAKEAI_ITL_MS                 5        Inter-token latency (ms)
FAKEAI_ITL_VARIANCE_PERCENT   10       ITL variance (%)

Security Settings

Variable                 Default    Description
FAKEAI_ENABLE_SECURITY   false      Master security flag
FAKEAI_API_KEYS          -          Comma-separated API keys
FAKEAI_REQUIRE_API_KEY   false      Require authentication
FAKEAI_MAX_REQUEST_SIZE  104857600  Max request size (100 MB)

Performance Settings

Variable                     Default  Description
FAKEAI_KV_CACHE_ENABLED      true     Enable KV cache
FAKEAI_KV_CACHE_NUM_WORKERS  4        Cache workers
FAKEAI_RATE_LIMIT_ENABLED    false    Enable rate limiting

See CONFIGURATION_REFERENCE.md for the complete list.

CLI Commands

# Start server
fakeai-server

# Server options
fakeai-server --host 0.0.0.0 --port 9001 --debug
fakeai-server --ttft 20 --itl 5
fakeai-server --enable-security --api-key sk-test

# Monitoring
fakeai-server status
fakeai-server metrics
fakeai-server metrics --watch
fakeai-server cache-stats

# Interactive mode
fakeai-server interactive

Use Cases

Local Development

fakeai-server

Test OpenAI API integrations locally without API costs or rate limits.

CI/CD Pipelines

- name: Start FakeAI
  run: |
    pip install fakeai
    fakeai-server --port 8000 &
    sleep 2

- name: Run tests
  run: pytest tests/
  env:
    OPENAI_BASE_URL: http://localhost:8000
    OPENAI_API_KEY: not-needed

Benchmarking

# Minimal latency
fakeai-server --ttft 5 --itl 1

# Run AIPerf
aiperf profile \
  --model openai/gpt-oss-120b \
  --url http://localhost:8000 \
  --concurrency 500 \
  --request-count 5000

RAG Development

# Generate embeddings
embeddings = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input=["doc1", "doc2", "doc3"]
)

# Rerank results
import httpx
rankings = httpx.post(
    "http://localhost:8000/v1/ranking",
    json={
        "model": "nvidia/nv-rerankqa-mistral-4b-v3",
        "query": {"text": "user query"},
        "passages": [{"text": "doc1"}, {"text": "doc2"}]
    }
)
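
FakeAI's embedding vectors are simulated, so similarity scores carry no semantic meaning, but the client-side plumbing matches production. A sketch of cosine scoring with numpy, reusing the embeddings response above:

import numpy as np

query = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",
    input="user query",
)
q = np.array(query.data[0].embedding)
docs = np.array([d.embedding for d in embeddings.data])
scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
print(scores.argsort()[::-1])  # document indices, best match first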

Architecture

fakeai/
├── app.py                # FastAPI application (60+ endpoints)
├── fakeai_service.py     # Core business logic
├── models.py             # Pydantic schemas
├── config.py             # Configuration management
├── metrics.py            # Numpy sliding window metrics
├── kv_cache.py           # AI-Dynamo KV cache
├── dcgm_metrics.py       # GPU metrics simulator
├── dynamo_metrics.py     # LLM inference metrics
├── dynamo_advanced.py    # KVBM, SLA planner, router
├── async_server.py       # uvloop utilities
├── cli.py                # CLI interface
└── utils.py              # Utilities and generators

Development

Setup

git clone https://github.com/ajcasagrande/fakeai.git
cd fakeai
pip install -e ".[dev]"

Testing

pytest
pytest tests/test_numpy_metrics_window.py -v
pytest --cov=fakeai

Code Quality

black fakeai/
isort fakeai/
mypy fakeai/

Docker

# Build
docker build -t fakeai .

# Run
docker run -p 8000:8000 fakeai

# With environment variables
docker run -p 9001:9001 \
  -e FAKEAI_PORT=9001 \
  -e FAKEAI_TTFT_MS=10 \
  fakeai

Deployment

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fakeai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fakeai
  template:
    metadata:
      labels:
        app: fakeai
    spec:
      containers:
      - name: fakeai
        image: fakeai:latest
        ports:
        - containerPort: 8000
        env:
        - name: FAKEAI_HOST
          value: "0.0.0.0"
See DEPLOYMENT_K8S.md for complete manifests.

Cloud Platforms

  • AWS: See DEPLOYMENT_AWS.md
  • Google Cloud: See DEPLOYMENT_CLOUD_RUN.md
  • Azure: See DEPLOYMENT_AZURE.md

Documentation

  • CLAUDE.md - Project knowledge base
  • API_REFERENCE.md - Complete API reference
  • CONFIGURATION_REFERENCE.md - All configuration options
  • PERFORMANCE.md - Performance tuning guide
  • SECURITY.md - Security features
  • benchmarks/README.md - Benchmarking guide
  • Auto-generated docs at http://localhost:8000/docs

Troubleshooting

Port already in use:

fakeai-server --port 9001

Slow performance:

pip install uvloop
fakeai-server --ttft 5 --itl 1

AIPerf benchmark fails:

# Always include http:// in URL
aiperf profile --url http://localhost:8000  # Correct

Dashboard not loading:

curl http://localhost:8000/health  # Verify server running

Contributing

Contributions welcome. Please follow PEP-8, KISS, and DRY principles. Add tests for new features.

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0. See LICENSE file for details.

Version: 0.0.5
Python: 3.10, 3.11, 3.12
Status: Active Development
