Single Model Embedding & Reranker API with Apple Silicon acceleration

🔥 Embeddings + Reranking on your Mac (MLX-first)

OpenAI-compatible rerank is supported at /v1/openai/rerank, with auto-sigmoid score normalization on by default.

Blazing-fast local embeddings and true cross-encoder reranking on Apple Silicon. Works with Native, OpenAI, TEI, and Cohere APIs.

This page is a beginner-friendly quick start. Detailed guides live in docs/.

Four APIs, One Service

API    | Endpoint                                                   | Use Case
Native | /api/v1/embed, /api/v1/rerank                              | New projects
OpenAI | /v1/embeddings, /v1/openai/rerank (alias: /v1/rerank_openai) | Existing OpenAI code
TEI    | /embed, /rerank, /info                                     | Hugging Face TEI replacement
Cohere | /v1/rerank, /v2/rerank                                     | Cohere API replacement

More info: /docs and /health.
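
As a rough illustration of the table above, the sketch below maps each API style to its paths and builds a full URL. The `ENDPOINTS` dictionary and the `endpoint_url` helper are hypothetical names, not part of the package, and the default base URL assumes the server listens on port 9000 as in the quick start.

```python
# Hypothetical lookup table built from the endpoint table above.
ENDPOINTS = {
    "native": {"embed": "/api/v1/embed", "rerank": "/api/v1/rerank"},
    "openai": {"embed": "/v1/embeddings", "rerank": "/v1/openai/rerank"},
    "tei": {"embed": "/embed", "rerank": "/rerank"},
    "cohere": {"rerank": "/v1/rerank"},  # /v2/rerank is also listed
}


def endpoint_url(api: str, task: str, base_url: str = "http://localhost:9000") -> str:
    """Return the full URL for an API style and task ("embed" or "rerank")."""
    try:
        path = ENDPOINTS[api][task]
    except KeyError:
        raise ValueError(f"no {task!r} endpoint listed for API style {api!r}") from None
    return base_url.rstrip("/") + path


print(endpoint_url("openai", "rerank"))  # http://localhost:9000/v1/openai/rerank
```

Keeping the paths in one place makes it easy to switch an existing client between API styles without touching the request code.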

📈 Performance Visualization

Latency Comparison (Projected)

Single Text Embedding Latency (milliseconds)

Apple MLX      █ 0.2ms
PyTorch MPS    ██████████████████ 45ms
PyTorch CPU    ████████████████████████████████████████████████ 120ms
CUDA (Est.)    █████ 12ms
Vulkan (Est.)  ██████████ 25ms

               0ms       25ms      50ms      75ms      100ms     125ms

Throughput Comparison (texts/second)

Maximum Throughput (texts per second)

Apple MLX      ███████████████████████████████████ 35,000
CUDA (Est.)    ████████ 8,000
Vulkan (Est.)  ███ 3,000
PyTorch MPS    ██ 1,500
PyTorch CPU    █ 500

               0         10k       20k       30k       40k

🚀 Start here (60 seconds)

  1. Install and run (embeddings only)
pip install embed-rerank

# Minimal .env
cat > .env <<'ENV'
BACKEND=auto
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ
PORT=9000
HOST=0.0.0.0
ENV

embed-rerank  # http://localhost:9000

Want 2560-D vectors by default? Add this to .env and restart:

cat >> .env <<'ENV'
# Use the model hidden_size (2560 for Qwen3-Embedding-4B) as output dimension
DIMENSION_STRATEGY=hidden_size
# Or enforce a fixed size (pads/truncates as needed):
# OUTPUT_EMBEDDING_DIMENSION=2560
# DIMENSION_STRATEGY=pad_or_truncate
ENV

# Verify
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["hello"],"normalize":true}' | jq '.vectors[0] | length'
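
To make the `pad_or_truncate` strategy concrete, here is a minimal sketch of what padding or truncating a vector to a fixed size could look like. It is an assumption, not the server's code: the fill value (zeros), the renormalization step, and the helper names `pad_or_truncate` and `l2_normalize` are all illustrative.

```python
import math


def pad_or_truncate(vector: list[float], target_dim: int) -> list[float]:
    """Cut the vector to target_dim, or right-pad it with zeros (assumed fill value)."""
    if target_dim <= 0:
        raise ValueError("target_dim must be positive")
    if len(vector) >= target_dim:
        return vector[:target_dim]
    return vector + [0.0] * (target_dim - len(vector))


def l2_normalize(vector: list[float]) -> list[float]:
    """Rescale to unit length; truncation changes the norm, so renormalize afterwards."""
    norm = math.sqrt(sum(x * x for x in vector))
    return vector if norm == 0.0 else [x / norm for x in vector]


v = [3.0, 4.0, 12.0]
print(pad_or_truncate(v, 5))                # [3.0, 4.0, 12.0, 0.0, 0.0]
print(l2_normalize(pad_or_truncate(v, 2)))  # [0.6, 0.8]
```

Zero padding leaves dot products between padded vectors unchanged, while truncation discards information, which is why `DIMENSION_STRATEGY=hidden_size` is the simpler choice when the client can accept the model's native size.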
  2. Try it (embeddings + simple rerank)
# Embeddings (Native)
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["Hello MLX","Apple Silicon rocks"]}' | jq '.embeddings | length'

# Rerank fallback (no dedicated reranker yet)
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
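
Without a dedicated reranker, a fallback has only the embedding model to work with. The sketch below shows one plausible approach, ranking documents by cosine similarity between query and document embeddings. This page does not describe how the server's fallback actually scores documents, so treat the method as an assumption; `cosine` and `rerank_by_similarity` are hypothetical helpers, and toy 3-D vectors stand in for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_by_similarity(query_vec, doc_vecs, top_n=None):
    """Return (index, score) pairs sorted by descending similarity to the query."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]  # top_n=None keeps every document


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
for index, score in rerank_by_similarity(query, docs, top_n=2):
    print(index, round(score, 3))
```

A cross-encoder reads the query and document together, which a similarity-based ranking like this cannot do; that is the quality gap a dedicated reranker closes.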
  3. Add a dedicated reranker (better quality)
cat >> .env <<'ENV'
RERANKER_BACKEND=auto
RERANKER_MODEL_ID=cross-encoder/ms-marco-MiniLM-L-6-v2  # Torch (stable)
# MLX experimental v1 also available: vserifsaglam/Qwen3-Reranker-4B-4bit-MLX
ENV

# Restart server, then call Native or OpenAI-compatible rerank
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
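
The same Native rerank call can be made from Python with only the standard library. This is a minimal sketch assuming the server from the quick start is listening on localhost:9000; `build_rerank_request` and `rerank` are hypothetical helper names, and the payload fields mirror the curl example above.

```python
import json
import urllib.request


def build_rerank_request(query, documents, top_n, base_url="http://localhost:9000"):
    """Build (but do not send) a POST request for the Native rerank endpoint."""
    payload = {"query": query, "documents": documents, "top_n": top_n}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/rerank/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(query, documents, top_n=2):
    """Send the request and return the decoded JSON body (needs a running server)."""
    request = build_rerank_request(query, documents, top_n)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


req = build_rerank_request(
    "capital of france",
    ["Paris is the capital of France", "Berlin is in Germany"],
    top_n=2,
)
print(req.get_method(), req.full_url)
# With the server running:
# print(rerank("capital of france", ["Paris is the capital of France"])["results"][0])
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything goes over the wire.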
  4. (Optional) Run as a macOS service
# Uses your .env to generate a LaunchAgent and start the service
./tools/setup-macos-service.sh

# Check status and health
launchctl list | grep com.embed-rerank.server
open http://localhost:9000/health/

Notes

  • OpenAI drop-in supported for both embeddings and rerank (/v1/embeddings, /v1/openai/rerank). See docs for a tiny SDK example.
  • Rerank scores are auto-sigmoid-normalized for OpenAI clients by default (disable via OPENAI_RERANK_AUTO_SIGMOID=false).
  • The root endpoint / shows both embedding_dimension (served) and hidden_size (model config) for clarity.
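
To show what sigmoid normalization does to rerank scores, here is a minimal sketch. It assumes the reranker emits unbounded raw scores (logits), which is typical for cross-encoders; the sample values and the `sigmoid` helper are illustrative, not taken from the server.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real score into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for large negative x
    return z / (1.0 + z)


# Hypothetical raw cross-encoder scores (logits) for three documents.
raw_scores = [7.2, 0.0, -4.5]
normalized = [sigmoid(s) for s in raw_scores]
print([round(s, 4) for s in normalized])
```

Because the sigmoid is monotonic, normalization changes the scale of the scores but never the ranking order, so turning OPENAI_RERANK_AUTO_SIGMOID off only matters to clients that apply score thresholds.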

Run the full validation suite

./tools/server-tests.sh --full

🧭 Pick your path

  • Deployment profiles (Embeddings-only, Fallback rerank, Dedicated reranker): docs/DEPLOYMENT_PROFILES.md
  • OpenAI usage (tiny example + options): docs/ENHANCED_OPENAI_API.md
  • Quality benchmarks (JSONL/CSV judgments): docs/QUALITY_BENCHMARKS.md
  • Troubleshooting: docs/TROUBLESHOOTING.md
  • Backend specs and performance: docs/BACKEND_TECHNICAL_SPECS.md, docs/PERFORMANCE_COMPARISON_CHARTS.md

Try it with OpenAI SDK (tiny)

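import httpx
import openai

# base_url already ends in /v1, so request paths below are relative to it
client = openai.OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

# Embeddings
res = client.embeddings.create(model="text-embedding-ada-002", input=["hello world"])
print(len(res.data[0].embedding))

# Rerank (OpenAI-compatible): the SDK has no rerank method, so use the
# generic client.post() and read the raw JSON response
rr = client.post(
    "/openai/rerank",  # resolves to http://localhost:9000/v1/openai/rerank
    cast_to=httpx.Response,
    body={
        "query": "capital of france",
        "documents": [
            {"id": "a", "text": "Paris is the capital of France"},
            {"id": "b", "text": "Berlin is in Germany"},
        ],
        "top_n": 2,
    },
)
data = rr.json()
print(data.get("results", data))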

Tested Frameworks

Framework       | Tests
✅ Open WebUI   | Embed
✅ LightRAG     | Embed, Rerank
✅ continue.dev | Embed, Rerank
✅ Kilo Code    | Embed

We are waiting for your reports!

📄 License

MIT License – build amazing things locally.
