ZIM-based retrieval augmented proxy for OpenAI-compatible AI

Tensor (Serve)

tensor-serve is a ZIM-based retrieval-augmented proxy for any OpenAI-compatible AI. It downloads ZIM documentation from the live Kiwix OPDS catalog, builds a local semantic vector database from it, and uses that database to give the AI model relevant context when answering questions.

Its purpose is to let you tailor an AI model to your own needs by grounding its answers in the documentation you choose.

By combining keyword search and semantic search, Tensor helps produce more accurate responses about the content you have included in a ZIM database.


1. How the AI pipeline works

  1. Download — ZIM files fetched from Kiwix and stored in the configured ZIM source folder (zim_files/ by default)
  2. Ingest — Articles extracted, HTML stripped, split into 500-word overlapping chunks, embedded with sentence-transformers, indexed in FAISS and BM25
  3. Auto-load — On server startup, the last active collection's FAISS and BM25 indexes are loaded automatically
  4. Analyze — Simple queries can skip retrieval; domain-specific queries use the query analyzer to choose the best search mode (hybrid, faiss, or bm25); time-sensitive queries optionally trigger web search
  5. OpenAI-compatible proxy — For /v1/chat/completions, the user message is embedded (or served from cache) → hybrid search retrieves top-k chunks (optionally merged with web results) → optional cross-encoder reranking improves result order → retrieved context is injected into the request before it is forwarded to the upstream AI server.
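
The chunking in the Ingest step can be illustrated with a minimal sketch. It assumes whitespace-separated words and a 50-word overlap; the source only states that chunks are 500 words and overlapping, so the `overlap` value and the function name `chunk_words` are illustrative assumptions, not Tensor Serve's actual implementation.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words (the overlap size is an assumption).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


article = " ".join(str(i) for i in range(1200))  # stand-in for a stripped article
chunks = chunk_words(article)
```

With these settings a 1,200-word article yields three chunks, and the last 50 words of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still retrievable as a whole.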

Hybrid search (FAISS + BM25 + optional Web Search w/ Reciprocal Rank Fusion)

Search requests and OpenAI-compatible chat requests can run up to three retrievals in parallel and merge them:

|          | FAISS (semantic) | BM25 (keyword) | Web Search |
|----------|------------------|----------------|------------|
| Finds | Conceptually related chunks | Exact term / token matches | Current / recent information |
| Good for | "How does backpressure work?" | "asyncio.gather", error codes, API names | "latest news", "today's events", time-sensitive queries |
| Requires setup | Automatic | Automatic | Optional; disabled by default |

Results are merged with Reciprocal Rank Fusion (score = Σ 1 / (60 + rank)). Chunks that rank well in multiple result sets float to the top. The pipeline degrades gracefully — if one index is unavailable it is skipped.
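
A minimal sketch of the fusion formula above, assuming each retriever returns an ordered list of chunk IDs and that ranks start at 1 (the source does not say whether ranking is 0- or 1-based). The function name and the sample IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of chunk IDs; score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):  # 1-based rank (assumption)
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID so the order is deterministic
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


faiss_hits = ["c1", "c2", "c3"]
bm25_hits = ["c3", "c1", "c4"]
web_hits = []  # an unavailable or disabled source simply contributes nothing
fused = reciprocal_rank_fusion([faiss_hits, bm25_hits, web_hits])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Here c1 and c3 appear in both lists, so they outrank c2 and c4, which each appear in only one. The empty web list shows why skipping a missing index needs no special case.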

The query analyzer automatically selects the search strategy:

| Mode | When it is used |
|------|-----------------|
| hybrid | Mixed or general queries where semantic and keyword signals both help |
| faiss | Conceptual queries such as explanations, architecture, patterns, and design questions |
| bm25 | Keyword-heavy queries such as API names, code symbols, methods, classes, errors, and short exact searches |
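
To make the mode selection concrete, here is a deliberately simple heuristic sketch. The actual rules used by the query analyzer are not described here, so the regular expression, the word list, and the function name `choose_search_mode` are all hypothetical.

```python
import re

# Hypothetical signals: dotted names, call syntax, snake_case, camelCase
CODE_PATTERN = re.compile(r"[A-Za-z_]\w*\.\w+|\w+\(\)|\w+_\w+|[a-z]+[A-Z]\w*")
# Hypothetical markers of a conceptual question
CONCEPT_WORDS = {"how", "why", "explain", "architecture", "pattern", "design"}


def choose_search_mode(query: str) -> str:
    """Return 'bm25', 'faiss', or 'hybrid' for a query (illustrative rules only)."""
    words = query.lower().split()
    has_code = bool(CODE_PATTERN.search(query))
    has_concept = any(w.strip("?.,!") in CONCEPT_WORDS for w in words)
    if has_code and not has_concept:
        return "bm25"    # symbol-like tokens favour exact matching
    if has_concept and not has_code:
        return "faiss"   # conceptual wording favours semantic search
    if len(words) <= 2:
        return "bm25"    # short exact searches
    return "hybrid"      # mixed or general queries


print(choose_search_mode("asyncio.gather"))               # bm25
print(choose_search_mode("How does backpressure work?"))  # faiss
print(choose_search_mode("How do I use asyncio.gather"))  # hybrid
```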

Query embeddings and search results are cached with an in-memory LRU cache to reduce repeated embedding and retrieval work. If enabled, the optional cross-encoder reranker performs a second-stage pass over retrieved chunks before context is sent to the model.
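
The caching idea can be sketched with a small LRU cache built on `OrderedDict`. The capacity, the key normalization, and the names `LRUCache`, `cached_embed`, and `embed_fn` are assumptions for illustration; Tensor Serve's own cache may be structured differently.

```python
from collections import OrderedDict


class LRUCache:
    """In-memory cache that evicts the least recently used entry when full."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry


def cached_embed(query, cache, embed_fn):
    """Return the embedding for query, calling embed_fn only on a cache miss."""
    key = query.strip().lower()  # normalization is an assumption
    vector = cache.get(key)
    if vector is None:
        vector = embed_fn(query)
        cache.put(key, vector)
    return vector
```

In use, `embed_fn` would wrap the embedding model (for example a sentence-transformers `encode` call), so repeated questions skip the model entirely.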


2. Search Complexity Profiles

Tensor Serve supports configurable search complexity tiers allowing you to optimize for your specific deployment:

| Profile | Search Algorithms | Use Case | Latency | Memory |
|---------|-------------------|----------|---------|--------|
| Lightweight | BM25 Okapi + FAISS Flat | Local machines, embedded | <20ms | <500MB |
| Balanced (default) | BM25 Okapi + FAISS Flat + Reranking | General purpose servers | 50-100ms | 1-2GB |
| Production | BM25+ + FAISS-IVF + Query Expansion + Advanced Reranking | Enterprise servers, large scale | 200-500ms | 4-8GB |
| Manual | Custom backend selection | Fine-tuned deployments | Varies | Varies |

→ Read the full Search Profiles Guide

Quick Start: Switching Profiles

Use a preset profile:

tensor-serve config set-search-profile lightweight
tensor-serve config set-search-profile production

Fine-tune with overrides:

tensor-serve config set-search-profile balanced \
  --query-expansion prf \
  --enable-reranker

Manual profile (full control):

tensor-serve config set-search-profile manual \
  --keyword-backend bm25_plus \
  --semantic-backend faiss_ivf \
  --query-expansion prf \
  --enable-reranker \
  --reranker-model balanced

REST endpoints remain available for automation and custom integrations, for example POST /config/search-profiles/production.

Available Backends

Keyword Search:

  • bm25_okapi - Standard BM25, fast baseline
  • bm25_plus - Enhanced BM25 with better precision

Semantic Search:

  • faiss_flat - Exact L2 distance search (good for <500K vectors)
  • faiss_ivf - Approximate search with clustering (optimal for 500K+ vectors)

Query Expansion (Optional)

Dynamically expands queries to improve recall:

  • none - No expansion (default, fastest)
  • prf - Pseudo-relevance feedback (expand with top-1 result terms)
  • entity - Entity extraction and weighting
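
As a rough illustration of the prf option, the sketch below appends the most frequent new terms from the top-ranked result to the query. The stopword list, the tokenization, the number of added terms, and the function name are assumptions; the source only says that expansion uses terms from the top-1 result.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list)
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "with", "by"}


def expand_query_prf(query: str, top_result_text: str, num_terms: int = 3) -> str:
    """Append the most frequent new terms from the top result to the query."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    candidates = [
        term
        for term in re.findall(r"[a-z0-9]+", top_result_text.lower())
        if term not in STOPWORDS and term not in query_terms and len(term) > 2
    ]
    # most_common keeps first-seen order among terms with equal counts
    extra = [term for term, _ in Counter(candidates).most_common(num_terms)]
    return " ".join([query] + extra)


top_hit = (
    "Alexander Graham Bell patented the telephone in 1876. "
    "Bell founded the Bell Telephone Company."
)
print(expand_query_prf("telephone inventor", top_hit, num_terms=2))
# telephone inventor bell alexander
```

The expanded query is then searched again, which tends to improve recall at the cost of one extra retrieval pass.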

Reranker Models

Fine-tune quality vs. latency trade-off:

  • lightweight - 22M params, ~50ms per batch (default)
  • balanced - 71M params, ~100ms per batch (recommended for production)

Detailed information about the RAG proxy implementation can be found here.


3. CLI Reference

The CLI Reference can be found here. It covers ZIM downloads, configuration, health, cache, cleanup, ingestion, vector databases, and collections.


4. REST API (api/main.py)

The API Reference can be found here. It covers health and configuration, cache, collections, ZIM file management, vector databases, download progress fields, cleanup, the OpenAI-compatible API, settings, web search for time-sensitive information, search mode customization, and model auto-detection.


Using With OpenAI-compatible Tools (Code Editors, etc.)

Point any OpenAI-compatible tool at http://localhost:8000/v1 (or http://localhost:8000 for tools that auto-discover models):

| Tool | Configuration |
|------|---------------|
| Zed | Settings: assistant.openai_api_url = http://localhost:8000 |
| Cursor | Settings → Models → OpenAI Base URL = http://localhost:8000 |
| Continue (VS Code) | ~/.continue/config.json → models → apiBase = http://localhost:8000/v1 |
| Aider | --openai-api-base http://localhost:8000/v1 |
| Open WebUI | Admin → Connections → OpenAI API → Base URL = http://localhost:8000/v1 |
| OpenAI SDKs | client = OpenAI(base_url="http://localhost:8000/v1") |

Setup

Prerequisites

  • Python 3.10+ (check with python3 --version)
  • pip (Python package manager, usually bundled with Python)
  • An OpenAI-compatible AI endpoint (examples: Ollama, LM Studio, OpenAI API, Anthropic, LiteLLM gateway) — optional for basic setup, required for chat functionality

Setup Example

1. Install via pip:

pip install tensor-serve

2. Alternatively, to keep dependencies isolated, create and activate a virtual environment first and install inside it (optional but recommended):

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install tensor-serve

3. Configure the upstream OpenAI-compatible AI endpoint:

tensor-serve config detect-local-ai
tensor-serve config set-ai-endpoint \
  --endpoint http://localhost:11434 \
  --model mistral

Optional: inspect models exposed by the configured endpoint:

tensor-serve config list-models

4. Choose where ZIM files are stored:

tensor-serve config set-zim-source ./zim_files

5. Browse and download ZIM content from Kiwix:

tensor-serve zim list
tensor-serve zim install wikivoyage_en_europe

Optional: use an interactive category downloader instead:

tensor-serve zim install-category coding

6. Review the saved configuration and installed ZIM files:

tensor-serve config show
tensor-serve zim status

7. Start the server:

tensor-serve start

Other start options:

tensor-serve start --port 3000              # Custom port
tensor-serve start --auto-port              # Auto-select available port if 8000 is in use
tensor-serve start --reload                 # Development mode with auto-reload

Note: If you prefer to install from source (development), clone the repository and install in editable mode:

git clone https://github.com/3M1RY33T/tensor-serve.git
cd tensor-serve
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

For cloud or gateway providers, include an API key and provider-specific endpoint:

tensor-serve config set-ai-endpoint \
  --endpoint https://api.openai.com/v1 \
  --model gpt-4o-mini \
  --api-key "$OPENAI_API_KEY"

API keys are encrypted before they are written to config.json. Tensor Serve uses a local .tensor_config.key file by default, or you can provide TENSOR_CONFIG_KEY / TENSOR_CONFIG_KEY_FILE for deployments that manage secrets externally.

Docker

Build and run locally:

docker build -t tensor-serve:local .
docker run --rm -p 8000:8000 -v tensor_serve_data:/data tensor-serve:local

Or use Compose:

docker compose up --build

The container stores runtime state in /data, including config.json, encrypted config key material, ZIM files, collections, and generated vector databases. When connecting to a host machine AI runtime from Docker Desktop, use the host gateway address:

docker compose exec tensor-serve tensor-serve config set-ai-endpoint \
  --endpoint http://host.docker.internal:11434 \
  --model mistral

Supported Environments

Local AI Runtimes (no API key needed):

  • Ollama — easy single-command setup
  • LM Studio — GUI-based model management
  • vLLM — high-performance serving

Cloud APIs (API key required):

  • OpenAI (https://api.openai.com/v1)
  • Anthropic Claude
  • Other OpenAI-compatible endpoints

Gateways:

  • LiteLLM — unified interface for multiple providers

Workflow

Complete Example

Prerequisites:

  • Tensor Serve is installed (see Setup above)
  • An OpenAI-compatible AI endpoint is running locally or accessible via API (e.g., Ollama on http://localhost:11434)

Steps:

# 1. Start the server
tensor-serve start

# 2. Leave the server running. In another terminal, check health
tensor-serve health

# 3. Ingest all files from the configured ZIM source folder into a vector database
tensor-serve ingest --source-folder --output-name travel

# 4. Load the database into memory
tensor-serve db load travel

# 5. Optional: enable web search for time-sensitive queries with the configuration CLI
tensor-serve config enable-web-search --provider duckduckgo
tensor-serve config set-search-modes --keyword-mode auto --semantic-mode on

# 6. Start chatting through the OpenAI-compatible proxy
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "tensor_show_resources": false,
    "messages": [
      {"role": "user", "content": "Who invented the telephone?"}
    ]
  }'

# 7. Time-sensitive query (if web search is enabled, Tensor Serve can search web + ZIM)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [
      {"role": "user", "content": "What is the latest news about AI?"}
    ]
  }'

Error Handling

  • 400: Bad request (DB not loaded, AI not configured, invalid input)
  • 404: Resource not found (database files missing)
  • 500: Server error
  • 502: AI endpoint unreachable or error
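
The chat call from the workflow and the status codes above can be combined in a small standard-library client. This is a sketch under assumptions: the server listens on http://localhost:8000, the model is named mistral as in the examples, and the helper names `build_chat_request`, `ask`, and `STATUS_HINTS` are illustrative rather than part of Tensor Serve.

```python
import json
import urllib.error
import urllib.request

# Hints taken from the status code list above
STATUS_HINTS = {
    400: "Bad request: DB not loaded, AI not configured, or invalid input",
    404: "Resource not found: database files missing",
    500: "Server error",
    502: "AI endpoint unreachable or returned an error",
}


def build_chat_request(question, model="mistral", base_url="http://localhost:8000/v1"):
    """Build a POST request equivalent to the curl example in the workflow."""
    payload = {"model": model, "messages": [{"role": "user", "content": question}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, **kwargs):
    """Send the question and return the parsed JSON response."""
    try:
        with urllib.request.urlopen(build_chat_request(question, **kwargs)) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        hint = STATUS_HINTS.get(err.code, "Unexpected status")
        raise RuntimeError(f"HTTP {err.code}: {hint}") from err
```

Calling `ask("Who invented the telephone?")` with the server running should return an OpenAI-style response object; a 400 at this point usually means no database has been loaded yet or no AI endpoint has been configured.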

Performance Notes

  • Large ZIM files (>1GB) may take 10-30 minutes to ingest
  • Both FAISS (.index + .pkl) and BM25 (.bm25) indexes are saved to disk and reloaded on startup — no re-ingestion needed
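  • FAISS flat search is an exact linear scan, O(n) in the number of vectors, while faiss_ivf probes only a subset of clusters; BM25 scoring is also O(n). Both are extremely fast in practice
  • Hybrid RRF adds negligible overhead: the FAISS and BM25 searches each run in milliseconds, and the fusion step only merges the ranked lists (up to three when web search is enabled)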
  • Chat responses depend on AI endpoint response time
  • Existing databases ingested before hybrid search was added will use semantic-only search until re-ingested (no .bm25 file present → graceful fallback)
  • Web search (when enabled): adds 1-3 seconds per time-sensitive query; cached results are instant; disabled by default (zero overhead)
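
The graceful fallback for older databases can be pictured as a check on which index files exist. The sketch assumes the files are named `<name>.index`, `<name>.pkl`, and `<name>.bm25` inside one directory; only the extensions come from the notes above, so the layout and the function name are hypothetical.

```python
from pathlib import Path


def available_search_modes(db_dir, name):
    """Return the search modes usable for a database, based on files on disk."""
    base = Path(db_dir)
    # FAISS needs both the index and its pickled metadata (assumed file names)
    has_faiss = (base / f"{name}.index").exists() and (base / f"{name}.pkl").exists()
    has_bm25 = (base / f"{name}.bm25").exists()
    if has_faiss and has_bm25:
        return ["hybrid", "faiss", "bm25"]
    if has_faiss:
        return ["faiss"]  # database ingested before hybrid search: semantic only
    if has_bm25:
        return ["bm25"]
    return []
```

A database that returns only `["faiss"]` here corresponds to the semantic-only case described above and would need re-ingestion to gain keyword search.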

Contributing

Thanks for helping improve Tensor Serve.

Please refer to Contributing for more information.
