Single Model Embedding & Reranker API with Apple Silicon acceleration
Embeddings + Reranking on your Mac (MLX-first)
Blazing-fast local embeddings and true cross-encoder reranking on Apple Silicon. Works with Native, OpenAI, TEI, and Cohere APIs.
This page is a beginner-friendly quick start. Detailed guides live in docs/.
Four APIs, One Service
| API | Endpoint | Use Case |
|---|---|---|
| Native | /api/v1/embed, /api/v1/rerank | New projects |
| OpenAI | /v1/embeddings, /v1/openai/rerank (alias: /v1/rerank_openai) | Existing OpenAI code |
| TEI | /embed, /rerank, /info | Hugging Face TEI replacement |
| Cohere | /v1/rerank, /v2/rerank | Cohere API replacement |

More info: /docs and /health.
Performance Visualization
Latency Comparison (Projected)
| Backend | Single text embedding latency (ms) |
|---|---|
| Apple MLX | 0.2 |
| PyTorch MPS | 45 |
| PyTorch CPU | 120 |
| CUDA (Est.) | 12 |
| Vulkan (Est.) | 25 |

Throughput Comparison (texts/second)
| Backend | Maximum throughput (texts per second) |
|---|---|
| Apple MLX | 35,000 |
| CUDA (Est.) | 8,000 |
| PyTorch MPS | 1,500 |
| Vulkan (Est.) | 3,000 |
| PyTorch CPU | 500 |
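To put the projected figures above in relative terms, the short sketch below divides each backend's number by a chosen baseline. The values are copied from this page; the helper name `relative_to` is only illustrative and is not part of embed-rerank.

```python
# Projected numbers copied from this page (not measured by this script).
LATENCY_MS = {
    "Apple MLX": 0.2,
    "PyTorch MPS": 45,
    "PyTorch CPU": 120,
    "CUDA (Est.)": 12,
    "Vulkan (Est.)": 25,
}
THROUGHPUT_TEXTS_PER_S = {
    "Apple MLX": 35_000,
    "CUDA (Est.)": 8_000,
    "PyTorch MPS": 1_500,
    "Vulkan (Est.)": 3_000,
    "PyTorch CPU": 500,
}


def relative_to(baseline: str, values: dict) -> dict:
    """Return every backend's value divided by the baseline backend's value."""
    base = values[baseline]
    return {name: round(value / base, 1) for name, value in values.items()}


if __name__ == "__main__":
    # Latency as a multiple of the MLX latency.
    for name, ratio in relative_to("Apple MLX", LATENCY_MS).items():
        print(f"{name}: {ratio}x the MLX latency")
    # Throughput as a multiple of the PyTorch CPU throughput.
    for name, ratio in relative_to("PyTorch CPU", THROUGHPUT_TEXTS_PER_S).items():
        print(f"{name}: {ratio}x the CPU throughput")
```

Since the figures are labeled as projected or estimated, the ratios are only as reliable as those inputs.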
Start here (60 seconds)
- Install and run (embeddings only)
pip install embed-rerank
# Minimal .env
cat > .env <<'ENV'
BACKEND=auto
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ
PORT=9000
HOST=0.0.0.0
ENV
embed-rerank # http://localhost:9000
Want 2560-D vectors by default? Add this to .env and restart:
cat >> .env <<'ENV'
# Use the model hidden_size (2560 for Qwen3-Embedding-4B) as output dimension
DIMENSION_STRATEGY=hidden_size
# Or enforce a fixed size (pads/truncates as needed):
# OUTPUT_EMBEDDING_DIMENSION=2560
# DIMENSION_STRATEGY=pad_or_truncate
ENV
# Verify
curl -s http://localhost:9000/api/v1/embed/ \
-H 'Content-Type: application/json' \
-d '{"texts":["hello"],"normalize":true}' | jq '.vectors[0] | length'
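For intuition about what "pads/truncates as needed" could mean, here is a minimal sketch of a pad-or-truncate step on a single vector. It assumes zero-padding at the end and truncation from the end; the function name and those details are assumptions for illustration, and the server's actual behavior (for example, whether it re-normalizes afterwards) may differ.

```python
from typing import List, Sequence


def pad_or_truncate(vector: Sequence[float], target_dim: int) -> List[float]:
    """Return a copy of `vector` with exactly `target_dim` entries.

    Assumption: extra dimensions are dropped from the end and missing
    dimensions are filled with zeros. This is an illustration only,
    not embed-rerank's implementation.
    """
    if target_dim <= 0:
        raise ValueError("target_dim must be positive")
    if len(vector) >= target_dim:
        # Truncate: keep the leading target_dim components.
        return list(vector[:target_dim])
    # Pad: append zeros until the vector reaches target_dim.
    return list(vector) + [0.0] * (target_dim - len(vector))
```

With this sketch, a 3-D vector requested at 2 dimensions keeps its first two components, and a 1-D vector requested at 3 dimensions gains two trailing zeros.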
- Try it (embeddings + simple rerank)
# Embeddings (Native)
curl -s http://localhost:9000/api/v1/embed/ \
-H 'Content-Type: application/json' \
-d '{"texts":["Hello MLX","Apple Silicon rocks"]}' | jq '.embeddings | length'
# Rerank fallback (no dedicated reranker yet)
curl -s http://localhost:9000/api/v1/rerank/ \
-H 'Content-Type: application/json' \
-d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
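This page does not say how the fallback scores documents when no dedicated reranker is configured. One common approach, assumed here purely for illustration, is to rank documents by cosine similarity between the query embedding and each document embedding. The function names below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_by_similarity(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_n: int,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best first, limited to top_n."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

A cross-encoder reads the query and document together, which is why a dedicated reranker usually gives better quality than this kind of embedding-only comparison.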
- Add a dedicated reranker (better quality)
cat >> .env <<'ENV'
RERANKER_BACKEND=auto
RERANKER_MODEL_ID=cross-encoder/ms-marco-MiniLM-L-6-v2 # Torch (stable)
# MLX experimental v1 also available: vserifsaglam/Qwen3-Reranker-4B-4bit-MLX
ENV
# Restart server, then call Native or OpenAI-compatible rerank
curl -s http://localhost:9000/api/v1/rerank/ \
-H 'Content-Type: application/json' \
-d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
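The same Native rerank call can be made from Python with only the standard library. This is a sketch that mirrors the curl request above: the URL, payload fields, and the `results` key come from that example, while the function names and the 30-second timeout are assumptions.

```python
import json
import urllib.request
from typing import Any, Dict, List

BASE_URL = "http://localhost:9000"  # matches PORT=9000 in the .env above


def build_rerank_request(
    query: str, documents: List[str], top_n: int
) -> urllib.request.Request:
    """Build a POST request for the Native rerank endpoint."""
    payload = {"query": query, "documents": documents, "top_n": top_n}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/rerank/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(query: str, documents: List[str], top_n: int = 2) -> List[Dict[str, Any]]:
    """Send the request and return the `results` list from the JSON response."""
    request = build_rerank_request(query, documents, top_n)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body.get("results", [])
```

With the server running, `rerank("capital of france", ["Paris is the capital of France", "Berlin is in Germany"])` should return the same entries that the curl command prints through jq.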
- (Optional) Run as a macOS service
# Uses your .env to generate a LaunchAgent and start the service
./tools/setup-macos-service.sh
# Check status and health
launchctl list | grep com.embed-rerank.server
open http://localhost:9000/health/
Notes
- OpenAI drop-in is supported for both embeddings and rerank (/v1/embeddings, /v1/openai/rerank). See docs for a tiny SDK example.
- Scores may be auto-sigmoid-normalized for OpenAI clients by default (disable via OPENAI_RERANK_AUTO_SIGMOID=false).
- The root endpoint / shows both embedding_dimension (served) and hidden_size (model config) for clarity.
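To illustrate what sigmoid normalization does to reranker scores, the sketch below maps raw scores (which can be any real number) into the range 0 to 1 with the standard logistic function. That the server uses exactly this formula is an assumption; the page only names the option.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function: 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, rewrite the formula to avoid overflow in exp(-x).
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_scores(raw_scores: Sequence[float]) -> List[float]:
    """Map raw scores into [0, 1]; the relative order is preserved."""
    return [sigmoid(score) for score in raw_scores]
```

Because the logistic function is monotonic, normalization changes the score values but not the ranking, so turning it off only matters to clients that interpret the absolute numbers.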
Run the full validation suite
./tools/server-tests.sh --full
🧭 Pick your path
- Deployment profiles (Embeddings-only, Fallback rerank, Dedicated reranker): docs/DEPLOYMENT_PROFILES.md
- OpenAI usage (tiny example + options): docs/ENHANCED_OPENAI_API.md
- Quality benchmarks (JSONL/CSV judgments): docs/QUALITY_BENCHMARKS.md
- Troubleshooting: docs/TROUBLESHOOTING.md
- Backend specs and performance: docs/BACKEND_TECHNICAL_SPECS.md, docs/PERFORMANCE_COMPARISON_CHARTS.md
Try it with OpenAI SDK (tiny)
import httpx
import openai

# api_key is a placeholder value
client = openai.OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

# Embeddings
res = client.embeddings.create(model="text-embedding-ada-002", input=["hello world"])
print(len(res.data[0].embedding))

# Rerank (OpenAI-compatible). The SDK has no rerank method, so send a custom
# request; the path is relative to base_url, so this resolves to /v1/openai/rerank.
response = client.post(
    "/openai/rerank",
    cast_to=httpx.Response,
    body={
        "query": "capital of france",
        "documents": [
            {"id": "a", "text": "Paris is the capital of France"},
            {"id": "b", "text": "Berlin is in Germany"},
        ],
        "top_n": 2,
    },
)
rr = response.json()
print(rr.get("results", rr))
Tested Frameworks
| Framework | Tests |
|---|---|
| Open WebUI | Embed |
| LightRAG | Embed, Rerank |
| continue.dev | Embed, Rerank |
| Kilo Code | Embed |
We are waiting for your reports!
License
MIT License. Build amazing things locally.