
cloooooo — SGLang + RAG hybrid + tools + router + structured outputs + eval


clovis

Local LLM API · Web Search · Deep Research · RAG · Embeddings · Structured Outputs



clovis is a Python client and production-ready API server for local LLMs, built on top of SGLang and Ollama. It ships with multi-step web research, a full RAG pipeline, vector embeddings, reranking, structured JSON outputs, vision, and an agentic deep-research mode, all served from a single HTTP server. Most inference modes go through one universal endpoint, POST /ia.

Features

  • Simple inference — one-line calls with streaming, negative prompts, and extended reasoning
  • Web search — live SearXNG results injected into context, date-aware
  • Deep thinking — multi-step agentic research pipeline (MiroFlow) with source citations
  • Ultra deep thinking — multi-axis research with automated gap analysis, 280+ sources synthesized into a structured report
  • RAG — ingest PDF, DOCX, TXT documents; semantic search over your corpus
  • Embeddings — 768-dim dense vectors via nomic-embed-text-v1.5
  • Reranking — cross-encoder reranking of document candidates
  • Structured output — JSON Schema-constrained generation
  • Vision — image description from URL, file path, or base64
  • Auto-routing — automatic mode selection based on query type
  • Conversation memory — short-term history per conversation ID

Installation

pip install clovis

Requirements: Python 3.10+ · SGLang or Ollama running locally


Quick start

from clovis import cloooooo

ai = cloooooo()  # connects to SGLang on localhost:61005

# Direct call
response = ai("Explain transformer architecture")
print(response)

# With options
response = ai(
    "Write a sonnet about entropy",
    negative_prompt="no rhymes",
    thinking=True,            # enables extended chain-of-thought
    context="You are a physicist who loves poetry.",
)

# Streaming
for token in ai.stream("Describe the Big Bang in detail"):
    print(token, end="", flush=True)

# Multi-turn conversation
conv = ai.conversation(context="You are a senior software engineer.")
conv("Explain dependency injection")
conv("Show me a Python example")  # remembers previous turn
conv("How would you test it?")

API server

Start the server:

clovis serve --port 8000
clovis serve --port 8000 --key sk-your-secret-key   # with API key auth

All endpoints accept Content-Type: application/json. Streaming responses use text/plain.
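When the server is started with --key, clients presumably have to send that key with each request. The Configuration section describes CLOVIS_API_KEY as a bearer token, so the sketch below assumes the standard Authorization: Bearer header. The helper name build_request is hypothetical, and the exact header scheme is worth confirming against /docs on a running server. It uses only the standard library so that nothing is sent until the request is opened.

```python
import json
import urllib.request
from typing import Optional


def build_request(url: str, payload: dict, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a JSON POST request, adding a bearer token when a key is given.

    Assumption: the server expects "Authorization: Bearer <key>".
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_request(
    "http://localhost:8000/ia",
    {"prompt": "What is quantum entanglement?"},
    api_key="sk-your-secret-key",
)

# Nothing has been sent yet. To actually call the server:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["response"])
```

With httpx, the same headers dictionary can be passed as headers=... to httpx.post.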


POST /ia — Universal endpoint

The main endpoint. Handles all inference modes.

curl -X POST http://localhost:8000/ia \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is quantum entanglement?", "use_web": true}'

Parameters

Parameter        Type  Default   Description
prompt           str   required  The question or instruction
mode             str   null      "simple" · "deep_thinking" · "ultra_deep_thinking"
use_web          bool  false     Inject live web search results with current date
thinking         bool  false     Enable extended reasoning (chain-of-thought)
stream           bool  false     Stream tokens via text/plain
use_memory       bool  false     Load and save conversation history
conversation_id  str   null      Key for conversation memory
context          str   null      System-level context injected before the prompt
negative_prompt  str   null      Instructions for what to avoid
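A small client-side helper can catch typos in parameter names before a request is sent. The sketch below is illustrative only: ia_payload is a hypothetical name, the allowed keys are copied from the table above, and the server may well accept or reject bodies differently.

```python
from typing import Any, Dict, Optional

VALID_MODES = {"simple", "deep_thinking", "ultra_deep_thinking"}
OPTIONAL_KEYS = {
    "use_web", "thinking", "stream", "use_memory",
    "conversation_id", "context", "negative_prompt",
}


def ia_payload(prompt: str, mode: Optional[str] = None, **options: Any) -> Dict[str, Any]:
    """Build a JSON body for POST /ia, omitting parameters that were not set."""
    if not prompt:
        raise ValueError("prompt is required")
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    unknown = set(options) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload: Dict[str, Any] = {"prompt": prompt}
    if mode is not None:
        payload["mode"] = mode
    # Leave out None values so the server applies its own defaults.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


body = ia_payload(
    "Summarise our last discussion",
    use_memory=True,
    conversation_id="demo-session",  # hypothetical conversation ID
)
print(body)
```

Pairing use_memory with a stable conversation_id is what lets later requests pick up earlier turns.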

Response

{
  "response": "Quantum entanglement is a phenomenon where..."
}

For deep_thinking and ultra_deep_thinking, the response includes:

{
  "answer": "...",
  "sources": ["https://...", "https://..."],
  "model_used": "miroflow:Qwen/Qwen3-32B-AWQ",
  "fallback_used": false
}
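Since the two response shapes use different keys ("response" versus "answer" plus "sources"), client code may want to normalise them. A minimal sketch, with extract_answer as a hypothetical helper name:

```python
from typing import List, Tuple


def extract_answer(data: dict) -> Tuple[str, List[str]]:
    """Return (text, sources) from either response shape shown above."""
    if "answer" in data:
        # deep_thinking / ultra_deep_thinking shape
        return data["answer"], list(data.get("sources", []))
    # simple shape: no sources are returned
    return data["response"], []


simple = {"response": "Quantum entanglement is a phenomenon where..."}
deep = {
    "answer": "...",
    "sources": ["https://example.org/a", "https://example.org/b"],  # placeholder URLs
    "fallback_used": False,
}

print(extract_answer(simple))
print(extract_answer(deep))
```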

Modes

simple — Direct inference

Fast, direct LLM call. Optionally augmented with web search (use_web: true) or reasoning (thinking: true).

import httpx

r = httpx.post("http://localhost:8000/ia", json={
    "prompt": "Latest news on fusion energy",
    "use_web": True,
    "thinking": True,
})
print(r.json()["response"])

deep_thinking — Agentic web research

Multi-step research pipeline. Performs web searches, reasons over the results, and returns a structured answer with source citations. Designed for complex questions that require up-to-date information.

r = httpx.post("http://localhost:8000/ia", json={
    "prompt": "What are the geopolitical implications of AGI development?",
    "mode": "deep_thinking",
}, timeout=300)

data = r.json()
print(data["answer"])       # full structured answer
print(data["sources"])      # list of URLs cited
print(data["fallback_used"])  # False = MiroFlow pipeline used

Streaming mode returns progress updates then the final JSON:

curl -X POST http://localhost:8000/ia \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Impact of interest rates on tech stocks", "mode": "deep_thinking", "stream": true}'

# [deep_thinking... 5s]
# [deep_thinking... 10s]
# ...
# {"answer": "...", "sources": [...], "fallback_used": false}
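A client that consumes this stream has to tell progress markers apart from the final JSON object. The sketch below assumes each marker and the final JSON arrive on their own line, as in the sample output above; the function name is hypothetical and the real stream may chunk text differently.

```python
import json
import re
from typing import Iterable, List, Optional, Tuple

# Matches lines such as "[deep_thinking... 5s]" and captures the seconds.
PROGRESS = re.compile(r"^\[deep_thinking\.\.\. (\d+)s\]$")


def parse_deep_thinking_stream(lines: Iterable[str]) -> Tuple[List[int], Optional[dict]]:
    """Split a line-based stream into elapsed-second markers and the final JSON."""
    elapsed: List[int] = []
    result: Optional[dict] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = PROGRESS.match(line)
        if match:
            elapsed.append(int(match.group(1)))
        elif line.startswith("{"):
            result = json.loads(line)
    return elapsed, result


sample = [
    "[deep_thinking... 5s]",
    "[deep_thinking... 10s]",
    '{"answer": "...", "sources": [], "fallback_used": false}',
]
elapsed, result = parse_deep_thinking_stream(sample)
print(elapsed, result["fallback_used"])  # [5, 10] False
```

With httpx, the lines argument could be r.iter_lines() inside an httpx.stream(...) block.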

ultra_deep_thinking — Multi-axis deep research

The most thorough mode. Decomposes the question into independent research axes, runs parallel searches on each, identifies knowledge gaps, fills them with additional targeted searches, then synthesizes a comprehensive structured report. Typically produces 10 000–15 000 character reports with 250–300 unique sources.

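# httpx.post() buffers the whole response before returning;
# httpx.stream() yields chunks as the server emits them.
with httpx.stream("POST", "http://localhost:8000/ia", json={
    "prompt": "How does reinforcement learning from human feedback (RLHF) work?",
    "mode": "ultra_deep_thinking",
    "stream": True,
}, timeout=600) as r:
    for chunk in r.iter_text():
        print(chunk, end="", flush=True)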

Streaming output example:

[axe:Definition and mechanism] researching...
[axe:Definition and mechanism] OK — 5 832 chars, 36 sources
[axe:Historical context] researching...
...
[gap analysis round 1/2] 5 gaps identified
[axe:Gap-1.1] researching...
...
[synthesis] 15 sections · 63 000 chars · 288 sources...
{"answer": "## RLHF: Complete Technical Overview\n\n...", "sources_count": 281}
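The per-axis lines in this output can be turned into a small progress summary. The sketch below is a guess at a parser based only on the sample lines above: the line format is not documented as stable, and summarize_axes is a hypothetical name. Digit groups such as "5 832" are joined before conversion.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches "[axe:NAME] OK <separator> N chars, M sources"; \W+ skips the dash separator.
AXIS_DONE = re.compile(
    r"^\[axe:(?P<axis>[^\]]+)\] OK\W+(?P<chars>[\d\s]+?) chars, (?P<sources>\d+) sources$"
)


def summarize_axes(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each completed axis to (characters, sources)."""
    summary: Dict[str, Tuple[int, int]] = {}
    for raw in lines:
        match = AXIS_DONE.match(raw.strip())
        if match:
            chars = int(re.sub(r"\D", "", match.group("chars")))
            summary[match.group("axis")] = (chars, int(match.group("sources")))
    return summary


log = [
    "[axe:Definition and mechanism] researching...",
    "[axe:Definition and mechanism] OK \u2014 5 832 chars, 36 sources",
    "[axe:Historical context] researching...",
]
summary = summarize_axes(log)
print(summary)  # {'Definition and mechanism': (5832, 36)}
```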

Presets (configurable via /ultra_deep_thinking endpoint):

Preset          Axes  Depth  Gap rounds  Searches/axis
fast            3     2      1           2
deep (default)  5     3      2           3
ultra           8     4      3           3
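To compare presets in code, the table can be mirrored as plain data. The sketch below is not part of clovis. It also includes a rough first-pass estimate that assumes "Searches/axis" means each axis runs that many searches once; depth and gap rounds are left out because their exact effect is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    axes: int
    depth: int
    gap_rounds: int
    searches_per_axis: int

    def first_pass_searches(self) -> int:
        """Searches in a single pass over all axes (assumption: one quota per axis)."""
        return self.axes * self.searches_per_axis


# Values copied from the preset table above.
PRESETS = {
    "fast": Preset(axes=3, depth=2, gap_rounds=1, searches_per_axis=2),
    "deep": Preset(axes=5, depth=3, gap_rounds=2, searches_per_axis=3),  # default
    "ultra": Preset(axes=8, depth=4, gap_rounds=3, searches_per_axis=3),
}

for name, preset in PRESETS.items():
    print(name, preset.first_pass_searches())
```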

GET /health — Server status

curl http://localhost:8000/health
{
  "status": "ok",
  "version": "0.5.6",
  "model": "Qwen/Qwen3-32B-AWQ",
  "sglang_url": "http://localhost:61005",
  "modes": ["simple", "search", "thinking", "deep_thinking", "ultra_deep_thinking", "embed", "rerank", "vision"]
}
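The "modes" list makes it possible to check at startup that a server offers what a client needs. A minimal sketch, using the sample response above as input; missing_modes is a hypothetical helper and "rag" is a made-up mode name included only to show a miss.

```python
from typing import Iterable, List


def missing_modes(health: dict, required: Iterable[str]) -> List[str]:
    """Return the required modes that the /health response does not list."""
    available = set(health.get("modes", []))
    return sorted(mode for mode in required if mode not in available)


health = {
    "status": "ok",
    "modes": ["simple", "search", "thinking", "deep_thinking",
              "ultra_deep_thinking", "embed", "rerank", "vision"],
}

print(missing_modes(health, ["deep_thinking", "rerank", "rag"]))  # ['rag']
```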

POST /embed — Text embeddings

Generate 768-dimensional dense vectors (nomic-embed-text-v1.5).

r = httpx.post("http://localhost:8000/embed", json={
    "texts": ["Hello world", "Machine learning basics", "Deep neural networks"],
    "prefix": "search_document",   # or "search_query"
})
data = r.json()
print(data["dim"])         # 768
print(len(data["embeddings"]))  # 3
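The two prefixes suggest the usual pattern of embedding documents and queries separately, then comparing the vectors. Cosine similarity is a common choice for that comparison; the sketch below uses toy 3-dimensional vectors in place of real 768-dimensional embeddings, and both function names are illustrative.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for data["embeddings"].
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_by_similarity(query, docs))  # [1, 2, 0]
```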

POST /rerank — Document reranking

Re-order documents by relevance to a query using a cross-encoder.

r = httpx.post("http://localhost:8000/rerank", json={
    "query": "machine learning optimization",
    "documents": [
        "Gradient descent is an optimization algorithm for ML",
        "The weather in Paris is sunny today",
        "Adam optimizer adapts learning rates per parameter",
        "Football match results from last weekend",
    ],
    "top_k": 3,
})
for item in r.json()["results"]:
    print(f"{item['score']:.3f}  {item['document'][:60]}")

POST /structured — JSON Schema output

Guarantee structured output conforming to any JSON Schema.

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
        "genres": {"type": "array", "items": {"type": "string"}},
        "rating": {"type": "number"},
    },
    "required": ["title", "year", "genres", "rating"],
}

r = httpx.post("http://localhost:8000/structured", json={
    "prompt": "Describe the movie Inception",
    "schema": schema,
})
print(r.json()["result"])
# {"title": "Inception", "year": 2010, "genres": ["sci-fi", "thriller"], "rating": 8.8}
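Even with schema-constrained generation, a quick client-side check can be useful before passing the result on. The sketch below handles only flat schemas like the one above (required keys and primitive types); check_result is a hypothetical helper, and a full validator such as the jsonschema package would cover far more of the specification.

```python
from typing import List

# JSON Schema primitive types mapped to Python types.
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_result(result: dict, schema: dict) -> List[str]:
    """Return a list of problems; an empty list means the flat checks passed."""
    problems = [f"missing required key: {key}"
                for key in schema.get("required", []) if key not in result]
    for key, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if key not in result or expected is None:
            continue
        value = result[key]
        # bool is a subclass of int in Python, so exclude it for numeric types.
        is_bool_as_number = isinstance(value, bool) and spec["type"] in ("integer", "number")
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(f"{key}: expected {spec['type']}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
        "genres": {"type": "array", "items": {"type": "string"}},
        "rating": {"type": "number"},
    },
    "required": ["title", "year", "genres", "rating"],
}
good = {"title": "Inception", "year": 2010, "genres": ["sci-fi", "thriller"], "rating": 8.8}
print(check_result(good, schema))  # []
```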

POST /vision — Image understanding

Describe or analyze images from a URL, local file path, or base64 string.

r = httpx.post("http://localhost:8000/vision", json={
    "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
    "prompt": "What objects do you see in this image?",
})
print(r.json()["response"])
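For the base64 case, the image bytes have to be encoded before they go into the JSON body. The sketch below assumes the server accepts a bare base64 string in the "image" field; whether it also expects a data: URI prefix is not stated here. vision_payload is a hypothetical helper.

```python
import base64


def vision_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build a /vision request body with the image as a base64 string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"image": encoded, "prompt": prompt}


# In practice the bytes would come from something like Path("photo.png").read_bytes().
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8  # PNG signature plus padding, not a real image
payload = vision_payload(fake_png, "What objects do you see in this image?")
print(len(payload["image"]))
```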

POST /rag/ingest + POST /rag/ask — Retrieval-augmented generation

Ingest your documents and ask questions over them.

# Ingest a document
httpx.post("http://localhost:8000/rag/ingest", json={
    "path": "/path/to/your/document.pdf"
})

# Ask a question
r = httpx.post("http://localhost:8000/rag/ask", json={
    "question": "What are the main conclusions of the report?",
    "top_k": 5,
})
print(r.json()["response"])

Supported formats: PDF, DOCX, TXT, Markdown.
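When ingesting a whole folder, it may help to filter out unsupported files first and send one /rag/ingest request per remaining path. A minimal sketch; the extension set is inferred from the format list above (treating ".md" as Markdown is an assumption) and both function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List

# Inferred from "PDF, DOCX, TXT, Markdown"; the server may recognise other extensions.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}


def ingestable(paths: Iterable[str]) -> List[Path]:
    """Keep only paths whose extension looks supported (case-insensitive)."""
    return [p for p in map(Path, paths) if p.suffix.lower() in SUPPORTED_SUFFIXES]


def ingest_payloads(paths: Iterable[str]) -> List[dict]:
    """One /rag/ingest JSON body per supported file."""
    return [{"path": str(p)} for p in ingestable(paths)]


bodies = ingest_payloads(["report.PDF", "diagram.png", "notes.md"])
print(bodies)  # [{'path': 'report.PDF'}, {'path': 'notes.md'}]
```

Each body can then be posted to /rag/ingest in a loop, exactly as in the single-document example.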


POST /route — Auto-routing

Automatically select the best inference mode for a given prompt.

r = httpx.post("http://localhost:8000/route", json={
    "prompt": "Write a Python function to sort a list",
})
print(r.json())
# {"response": "...", "task_type": "code", "model": "Qwen/Qwen3-32B-AWQ", "confidence": 0.92}

Other endpoints

Endpoint              Method  Description
/deep_think           POST    Standalone multi-iteration deep research with streaming
/ultra_deep_thinking  POST    Standalone ultra deep research with preset control
/tools/exec           POST    Execute a registered tool
/tools                GET     List available tools
/eval/run             POST    Evaluate model responses against expected answers
/rag/sources          GET     List ingested document sources
/openapi.json         GET     OpenAPI schema
/docs                 GET     Interactive API documentation

CLI

# Direct question
clovis "Explain the Riemann hypothesis"

# With options
clovis "Write a haiku about code" --no "no syllable counting"
clovis "Solve this integral" --think
clovis "Latest AI news" --web

# Interactive REPL
clovis repl

# Start API server
clovis serve --port 8000
clovis serve --port 8000 --key sk-your-secret-key

Configuration

export CLOVIS_LOCAL_URL="http://localhost:61005"   # SGLang or Ollama endpoint
export CLOVIS_MODEL="Qwen/Qwen3-32B-AWQ"          # model name
export CLOVIS_API_KEY="sk-..."                    # bearer token for the API server
export SEARXNG_URL="http://localhost:8888"        # SearXNG instance (enables web search)
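Scripts that wrap clovis may want to read the same variables. The sketch below shows one way to do that; it is not clovis's own loader. The only default it applies is the localhost:61005 address mentioned in the quick start, and Settings and load_settings are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    local_url: str
    model: Optional[str]
    api_key: Optional[str]
    searxng_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables from a mapping (the process environment by default)."""
    return Settings(
        local_url=env.get("CLOVIS_LOCAL_URL", "http://localhost:61005"),
        model=env.get("CLOVIS_MODEL"),
        api_key=env.get("CLOVIS_API_KEY"),
        searxng_url=env.get("SEARXNG_URL"),  # unset means web search is unavailable
    )


settings = load_settings({"CLOVIS_MODEL": "Qwen/Qwen3-32B-AWQ"})
print(settings)
```

Passing a plain dictionary instead of os.environ keeps the function easy to test.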

Async usage

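Blocking calls made through the Python client can be moved off the event loop with loop.run_in_executor (a method of the event loop, not of the asyncio module). The HTTP API can also be called natively from async code; the example below runs three requests concurrently with httpx.AsyncClient: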

import asyncio
import httpx

async def ask(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=300) as client:
        r = await client.post("http://localhost:8000/ia", json={"prompt": prompt})
        return r.json()["response"]

async def main():
    results = await asyncio.gather(
        ask("What is Python?"),
        ask("What is Rust?"),
        ask("What is Go?"),
    )
    for r in results:
        print(r[:80])

asyncio.run(main())
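For blocking code, such as calls through the synchronous Python client, the usual pattern is to hand the work to a thread pool. The sketch below uses loop.run_in_executor with a stand-in function; blocking_call is hypothetical and merely sleeps where a real call like ai(prompt) would go.

```python
import asyncio
import time
from typing import List


def blocking_call(prompt: str) -> str:
    """Stand-in for a blocking client call; sleeps instead of contacting a server."""
    time.sleep(0.1)
    return f"answer to: {prompt}"


async def ask_all(prompts: List[str]) -> List[str]:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    futures = [loop.run_in_executor(None, blocking_call, p) for p in prompts]
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*futures)


results = asyncio.run(ask_all(["What is Python?", "What is Rust?"]))
print(results)
```

On Python 3.9 and later, asyncio.to_thread(blocking_call, prompt) is a shorter equivalent for the default executor.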

Streaming

All endpoints support "stream": true. Streaming responses use Content-Type: text/plain and emit tokens progressively.

import httpx

# httpx defaults to a 5-second timeout, which long generations can exceed.
with httpx.stream("POST", "http://localhost:8000/ia", json={
    "prompt": "Write a detailed explanation of CRISPR-Cas9",
    "stream": True,
    "thinking": True,
}, timeout=300) as r:
    for chunk in r.iter_text():
        print(chunk, end="", flush=True)

For deep_thinking streaming, progress markers are emitted every 5 seconds:

[deep_thinking... 5s]
[deep_thinking... 10s]
[deep_thinking... 55s]
{"answer": "...", "sources": [...], "fallback_used": false}

License

MIT — Clovis Sfeir
