ZIM-based retrieval augmented proxy for OpenAI-compatible AI
Tensor (Serve)
tensor-serve is a ZIM-based retrieval augmented proxy for any OpenAI-compatible AI. This program lets you download ZIM documentation from the live Kiwix OPDS catalog, builds a local semantic vector database from it, and uses that database to provide an AI model relevant context when answering questions.
The goal is to make it easy to customize your AI for your specific needs.
Combining keyword search and semantic search, Tensor helps produce more accurate responses for the data you have included in a ZIM database.
1. How the AI pipeline works
- Download: ZIM files fetched from Kiwix and stored in the configured ZIM source folder (zim_files/ by default)
- Ingest: Articles extracted, HTML stripped, split into 500-word overlapping chunks, embedded with sentence-transformers, indexed in FAISS and BM25
- Auto-load: On server startup, the last active collection's FAISS and BM25 indexes are loaded automatically
- Analyze: Simple queries can skip retrieval; domain-specific queries use the query analyzer to choose the best search mode (hybrid, faiss, or bm25); time-sensitive queries optionally trigger web search
- OpenAI-compatible proxy: For /v1/chat/completions, the user message is embedded (or served from cache) → hybrid search retrieves top-k chunks (optionally merged with web results) → optional cross-encoder reranking improves result order → retrieved context is injected into the request before it is forwarded to the upstream AI server.
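The chunking step in the Ingest stage can be sketched as follows. This is a minimal illustration, not Tensor Serve's actual code: the function name `chunk_words` and the 50-word overlap are assumptions, since the description above only states that chunks are 500 words long and overlap.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words.

    The overlap value is an assumption for illustration only.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlapping windows keep a sentence that straddles a chunk boundary fully visible in at least one chunk, at the cost of embedding some words twice.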
Hybrid search (FAISS + BM25 + optional Web Search w/ Reciprocal Rank Fusion)
Search requests and OpenAI-compatible chat requests can run up to three retrievals in parallel and merge them:
| | FAISS (semantic) | BM25 (keyword) | Web Search |
|---|---|---|---|
| Finds | Conceptually related chunks | Exact term / token matches | Current / recent information |
| Good for | "How does backpressure work?" | "asyncio.gather", error codes, API names | "latest news", "today's events", time-sensitive queries |
| Requires setup | Automatic | Automatic | Optional; disabled by default |
Results are merged with Reciprocal Rank Fusion (score = Σ 1 / (60 + rank)). Chunks that rank well in multiple result sets float to the top. The pipeline degrades gracefully — if one index is unavailable it is skipped.
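The fusion formula above can be illustrated with a short sketch. The function name `reciprocal_rank_fusion`, the use of chunk IDs as strings, and ranks starting at 1 are assumptions made for this example; only the constant 60 and the summation come from the formula.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of chunk IDs with score = sum of 1 / (k + rank).

    result_lists: iterable of lists, each ordered best-first.
    Returns (chunk_id, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for results in result_lists:
        # An empty list (for example, an unavailable index) contributes nothing.
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


faiss_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([faiss_hits, bm25_hits])
```

Here chunks "a" and "c" appear in both lists, so they outrank "b" and "d", which each appear in only one.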
The query analyzer automatically selects the search strategy:
| Mode | When it is used |
|---|---|
| hybrid | Mixed or general queries where semantic and keyword signals both help |
| faiss | Conceptual queries such as explanations, architecture, patterns, and design questions |
| bm25 | Keyword-heavy queries such as API names, code symbols, methods, classes, errors, and short exact searches |
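To make the three modes concrete, here is a toy classifier that follows the table. It is purely hypothetical: the regular expression, the list of conceptual prefixes, and the two-word threshold are assumptions for illustration and do not describe the heuristics Tensor Serve's query analyzer actually uses.

```python
import re

# Hypothetical signals for "code-like" tokens: dotted names, snake_case,
# call syntax, and CamelCase identifiers.
CODE_TOKEN = re.compile(
    r"[A-Za-z_]\w*(\.\w+)+|\w+_\w+|\w+\(\)|\b[A-Z][a-z]+[A-Z]\w*"
)
# Hypothetical openers that suggest a conceptual question.
CONCEPTUAL_PREFIXES = ("how ", "why ", "explain ", "what is ", "describe ")


def choose_search_mode(query: str) -> str:
    """Return 'hybrid', 'faiss', or 'bm25' for a query (toy heuristic)."""
    q = query.strip()
    has_code_token = bool(CODE_TOKEN.search(q))
    is_conceptual = q.lower().startswith(CONCEPTUAL_PREFIXES)
    if has_code_token and is_conceptual:
        return "hybrid"  # both signals present, so use both indexes
    if has_code_token or len(q.split()) <= 2:
        return "bm25"  # symbols and short exact searches
    if is_conceptual:
        return "faiss"  # explanations and design questions
    return "hybrid"  # default for mixed or general queries
```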
Query embeddings and search results are cached with an in-memory LRU cache to reduce repeated embedding and retrieval work. If enabled, the optional cross-encoder reranker performs a second-stage pass over retrieved chunks before context is sent to the model.
Detailed information about the RAG proxy implementation can be found here.
3. CLI Reference
CLI Reference can be found here. It covers ZIM downloads, configuration, health, cache, cleanup, ingestion, vector databases, and collections.
5. REST API (api/main.py)
API Reference can be found here. It covers Health & Configuration, Cache, Collections, ZIM File Management, Vector Database, Download progress fields, Cleanup, OpenAI-Compatible API, Settings, Web Search for Time-Sensitive Information, Search Mode Customization, and Model auto-detection.
Using With OpenAI-compatible Tools (Code Editors, etc.)
Point any OpenAI-compatible tool at http://localhost:8000/v1 (or http://localhost:8000 for tools that auto-discover models):
| Tool | Configuration |
|---|---|
| Zed | Settings: assistant.openai_api_url = http://localhost:8000 |
| Cursor | Settings → Models → OpenAI Base URL = http://localhost:8000 |
| Continue (VS Code) | ~/.continue/config.json → models → apiBase = http://localhost:8000/v1 |
| Aider | --openai-api-base http://localhost:8000/v1 |
| Open WebUI | Admin → Connections → OpenAI API → Base URL = http://localhost:8000/v1 |
| OpenAI SDKs | client = OpenAI(base_url="http://localhost:8000/v1") |
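If you prefer not to install an SDK, a request can also be built with the Python standard library alone. The sketch below assumes the server is on the default http://localhost:8000/v1 and that the proxy returns the usual OpenAI chat completion response shape (`choices[0].message.content`); the helper names `build_chat_request` and `ask` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default address used in this README


def build_chat_request(question: str, model: str = "mistral",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> str:
    """Send the request; requires a running Tensor Serve instance."""
    with urllib.request.urlopen(build_chat_request(question), timeout=60) as response:
        body = json.load(response)
    # Assumes the standard OpenAI chat completion response layout.
    return body["choices"][0]["message"]["content"]
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything goes over the network.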
Setup
Prerequisites
- Python 3.10+ (check with python3 --version)
- pip (Python package manager, usually bundled with Python)
- An OpenAI-compatible AI endpoint (examples: Ollama, LM Studio, OpenAI API, Anthropic, LiteLLM gateway) — optional for basic setup, required for chat functionality
Setup Example
# 1. Clone and enter the project
git clone https://github.com/3M1RY33T/tensor-serve.git
cd tensor-serve
# 2. Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# 3. Install Tensor Serve and its dependencies
pip install -r requirements.txt
pip install -e .
# 4. Configure the upstream OpenAI-compatible AI endpoint
tensor-serve config detect-local-ai
tensor-serve config set-ai-endpoint \
--endpoint http://localhost:11434 \
--model mistral
# Optional: inspect models exposed by the configured endpoint
tensor-serve config list-models
# 5. Choose where ZIM files are stored
tensor-serve config set-zim-source ./zim_files
# 6. Browse and download ZIM content from Kiwix
tensor-serve zim list
tensor-serve zim install wikivoyage_en_europe
# Optional: use an interactive category downloader instead
# tensor-serve zim install-category coding
# 7. Review the saved configuration and installed ZIM files
tensor-serve config show
tensor-serve zim status
# 8. Start the server
tensor-serve start
# Custom port
tensor-serve start --port 3000
# Auto-select available port if 8000 is in use
tensor-serve start --auto-port
# Development mode with auto-reload
tensor-serve start --reload
For cloud or gateway providers, include an API key and provider-specific endpoint:
tensor-serve config set-ai-endpoint \
--endpoint https://api.openai.com/v1 \
--model gpt-4o-mini \
--api-key "$OPENAI_API_KEY"
API keys are encrypted before they are written to config.json. Tensor Serve uses a local .tensor_config.key file by default, or you can provide TENSOR_CONFIG_KEY / TENSOR_CONFIG_KEY_FILE for deployments that manage secrets externally.
Docker
Build and run locally:
docker build -t tensor-serve:local .
docker run --rm -p 8000:8000 -v tensor_serve_data:/data tensor-serve:local
Or use Compose:
docker compose up --build
The container stores runtime state in /data, including config.json, encrypted
config key material, ZIM files, collections, and generated vector databases.
When connecting to a host machine AI runtime from Docker Desktop, use the host
gateway address:
docker compose exec tensor-serve tensor-serve config set-ai-endpoint \
--endpoint http://host.docker.internal:11434 \
--model mistral
Supported Environments
Local AI Runtimes (no API key needed):
- Ollama — easy single-command setup
- LM Studio — GUI-based model management
- vLLM — high-performance serving
Cloud APIs (API key required):
- OpenAI (https://api.openai.com/v1)
- Anthropic Claude
- Other OpenAI-compatible endpoints
Gateways:
- LiteLLM — unified interface for multiple providers
Workflow
Complete Example
Prerequisites:
- Tensor Serve is installed (see Setup above)
- An OpenAI-compatible AI endpoint is running locally or accessible via API (e.g., Ollama on http://localhost:11434)
Steps:
# 1. Start the server
tensor-serve start
# 2. Leave the server running. In another terminal, check health
tensor-serve health
# 3. Ingest all files from the configured ZIM source folder into a vector database
tensor-serve ingest --source-folder --output-name travel
# 4. Load the database into memory
tensor-serve db load travel
# 5. Optional: enable web search for time-sensitive queries with the configuration CLI
tensor-serve config enable-web-search --provider duckduckgo
tensor-serve config set-search-modes --keyword-mode auto --semantic-mode on
# 6. Start chatting through the OpenAI-compatible proxy
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"tensor_show_resources": false,
"messages": [
{"role": "user", "content": "Who invented the telephone?"}
]
}'
# 7. Time-sensitive query (if web search is enabled, Tensor Serve can search web + ZIM)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"messages": [
{"role": "user", "content": "What is the latest news about AI?"}
]
}'
Error Handling
- 400: Bad request (DB not loaded, AI not configured, invalid input)
- 404: Resource not found (database files missing)
- 500: Server error
- 502: AI endpoint unreachable or error
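A client can turn these status codes into actionable messages. The sketch below is a hypothetical helper, not part of Tensor Serve; in particular, treating only 502 as retryable is an assumption based on the descriptions above (an unreachable upstream may recover, while a bad request will not).

```python
# Hints derived from the status code list above.
STATUS_HINTS = {
    400: "Bad request: load a database, configure the AI endpoint, or fix the input.",
    404: "Not found: the database files are missing.",
    500: "Server error inside Tensor Serve.",
    502: "The upstream AI endpoint is unreachable or returned an error.",
}


def explain_status(status: int) -> str:
    """Return a human-readable hint for a Tensor Serve HTTP status code."""
    return STATUS_HINTS.get(status, f"Unexpected HTTP status {status}.")


def should_retry(status: int) -> bool:
    # Assumption: only upstream failures are worth retrying automatically.
    return status == 502
```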
Performance Notes
- Large ZIM files (>1GB) may take 10-30 minutes to ingest
- Both FAISS (.index + .pkl) and BM25 (.bm25) indexes are saved to disk and reloaded on startup, so no re-ingestion is needed
- FAISS search cost depends on the index type: an exact flat index compares the query against every vector (O(n)), while approximate indexes are sublinear; BM25 scoring is O(n) but extremely fast in practice
- Hybrid RRF adds negligible overhead: the FAISS and BM25 searches each run in milliseconds, and fusion only merges the ranked lists (up to 3 sources with web search)
- Chat responses depend on AI endpoint response time
- Existing databases ingested before hybrid search was added will use semantic-only search until re-ingested (no .bm25 file present → graceful fallback)
- Web search (when enabled): adds 1-3 seconds per time-sensitive query; cached results are instant; disabled by default (zero overhead)
Contributing
Thanks for helping improve Tensor Serve.
Please refer to Contributing for more information.