Single Model Embedding & Reranker API with Apple Silicon acceleration


🔥 Embeddings + Reranking on your Mac (MLX‑first)

OpenAI rerank supported (/v1/openai/rerank)

Blazing‑fast local embeddings and true cross‑encoder reranking on Apple Silicon. Works with Native, OpenAI, TEI, and Cohere APIs.

This page is a beginner‑friendly quick start. Detailed guides live in docs/.

🌐 Four APIs, One Service

API	Endpoint	Use Case
Native	`/api/v1/embed`, `/api/v1/rerank`	New projects
OpenAI	`/v1/embeddings`, `/v1/openai/rerank` (alias: `/v1/rerank_openai`)	Existing OpenAI code
TEI	`/embed`, `/rerank`, `/info`	Hugging Face TEI replacement
Cohere	`/v1/rerank`, `/v2/rerank`	Cohere API replacement
	`/docs` `/health`	More info.
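
The table above can be read as a lookup from API family and operation to a URL path. A minimal Python sketch of that lookup, assuming the server listens on http://localhost:9000 as in the quick start; the `ENDPOINTS` dict and `endpoint_url` helper are illustrative names, not part of embed-rerank:

```python
# Illustrative lookup built from the endpoint table; not part of embed-rerank itself.
ENDPOINTS = {
    "native": {"embed": "/api/v1/embed", "rerank": "/api/v1/rerank"},
    "openai": {"embed": "/v1/embeddings", "rerank": "/v1/openai/rerank"},
    "tei": {"embed": "/embed", "rerank": "/rerank"},
    "cohere": {"rerank": "/v1/rerank"},  # the table lists only rerank paths for Cohere
}


def endpoint_url(family: str, operation: str, base: str = "http://localhost:9000") -> str:
    """Return the full URL for an API family ("native", "openai", ...) and operation."""
    try:
        path = ENDPOINTS[family][operation]
    except KeyError:
        raise ValueError(f"no {operation!r} endpoint listed for {family!r}") from None
    return base.rstrip("/") + path


print(endpoint_url("openai", "rerank"))
```

Keeping the paths in one place makes it easy to point existing OpenAI, TEI, or Cohere client code at the same local service.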

📈 Performance Visualization

Latency Comparison (Projected)

Single Text Embedding Latency (milliseconds)

Apple MLX    ████ 0.2ms
PyTorch MPS  ████████████████████████████████████████████████ 45ms  
PyTorch CPU  ████████████████████████████████████████████████████████████████████████████████████████████████████████ 120ms
CUDA (Est.)  ████████████ 12ms
Vulkan (Est.) ████████████████████████ 25ms

0ms        25ms       50ms       75ms       100ms      125ms

Throughput Comparison (texts/second)

Maximum Throughput (texts per second)

Apple MLX     ████████████████████████████████████████████████████████████████████████████████████████████████████████ 35,000
CUDA (Est.)   ████████████████████████████████ 8,000  
PyTorch MPS   ██████ 1,500
Vulkan (Est.) ████████████ 3,000
PyTorch CPU   ██ 500

0          10k        20k        30k        40k

🚀 Start here (60 seconds)

Install and run (embeddings only)

pip install embed-rerank

# Minimal .env
cat > .env <<'ENV'
BACKEND=auto
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ
PORT=9000
HOST=0.0.0.0
ENV

embed-rerank  # http://localhost:9000

Want 2560‑D vectors by default? Add this to .env and restart:

cat >> .env <<'ENV'
# Use the model hidden_size (2560 for Qwen3-Embedding-4B) as output dimension
DIMENSION_STRATEGY=hidden_size
# Or enforce a fixed size (pads/truncates as needed):
# OUTPUT_EMBEDDING_DIMENSION=2560
# DIMENSION_STRATEGY=pad_or_truncate
ENV

# Verify
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["hello"],"normalize":true}' | jq '.vectors[0] | length'
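
The `pad_or_truncate` strategy is described above only as "pads/truncates as needed". A minimal sketch of what that could mean for a single vector, assuming zero-padding on the right and dropping trailing components; the server's actual behavior (for example, whether it re-normalizes afterwards) is not documented here, and `pad_or_truncate` is a hypothetical helper name:

```python
from typing import List


def pad_or_truncate(vector: List[float], dimension: int) -> List[float]:
    """Force a vector to exactly `dimension` entries.

    Assumption: shorter vectors are zero-padded on the right and longer
    vectors lose their trailing components.
    """
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    if len(vector) >= dimension:
        return list(vector[:dimension])
    return list(vector) + [0.0] * (dimension - len(vector))


print(pad_or_truncate([0.1, 0.2, 0.3], 5))  # padded with two zeros
print(pad_or_truncate([0.1, 0.2, 0.3], 2))  # last component dropped
```

Note that truncating a normalized vector generally leaves it with a norm below 1, so downstream cosine-similarity code may want to re-normalize.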

Try it (embeddings + simple rerank)

# Embeddings (Native)
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["Hello MLX","Apple Silicon rocks"]}' | jq '.embeddings | length'

# Rerank fallback (no dedicated reranker yet)
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
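
The same two calls can be made from Python using only the standard library. This sketch mirrors the curl commands above; `build_request`, `post_json`, and `demo` are illustrative helper names, and the code prints the raw JSON rather than assuming a particular response shape:

```python
import json
import urllib.request

BASE_URL = "http://localhost:9000"  # matches PORT=9000 from the .env above


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the Native endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload)) as response:
        return json.load(response)


def demo() -> None:
    """Call both Native endpoints; requires a running server."""
    print(post_json("/api/v1/embed/", {"texts": ["Hello MLX", "Apple Silicon rocks"]}))
    print(post_json("/api/v1/rerank/", {
        "query": "capital of france",
        "documents": ["Paris is the capital of France", "Berlin is in Germany"],
        "top_n": 2,
    }))
```

Call `demo()` once the server is up to see the responses.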

Add a dedicated reranker (better quality)

cat >> .env <<'ENV'
RERANKER_BACKEND=auto
# Torch (stable)
RERANKER_MODEL_ID=cross-encoder/ms-marco-MiniLM-L-6-v2
# MLX experimental v1 also available: vserifsaglam/Qwen3-Reranker-4B-4bit-MLX
ENV

# Restart server, then call Native or OpenAI-compatible rerank
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'

(Optional) Run as a macOS service

# Uses your .env to generate a LaunchAgent and start the service
./tools/setup-macos-service.sh

# Check status and health
launchctl list | grep com.embed-rerank.server
open http://localhost:9000/health/

Notes

OpenAI drop-in is supported for both embeddings (/v1/embeddings) and rerank (/v1/openai/rerank; /v1/rerank is the Cohere-compatible endpoint). See docs for a tiny SDK example.
Scores may be auto‑sigmoid‑normalized for OpenAI clients by default (disable via OPENAI_RERANK_AUTO_SIGMOID=false).
The root endpoint / shows both embedding_dimension (served) and hidden_size (model config) for clarity.
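
The sigmoid normalization mentioned in the notes above maps raw reranker scores (logits) into the range 0 to 1 without changing their order. A minimal sketch of the standard logistic function, not the server's actual code; `sigmoid_normalize` is a hypothetical name:

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z < 1 and cannot overflow
    return z / (1.0 + z)


def sigmoid_normalize(scores: List[float]) -> List[float]:
    """Map raw scores to (0, 1); the ranking order is preserved."""
    return [sigmoid(s) for s in scores]


print(sigmoid_normalize([0.0]))  # [0.5]
```

Because the function is monotonic, turning it off with OPENAI_RERANK_AUTO_SIGMOID=false should change the score scale but not the ranking itself.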

Run the full validation suite

./tools/server-tests.sh --full

🧭 Pick your path

Deployment profiles (Embeddings‑only, Fallback rerank, Dedicated reranker): docs/DEPLOYMENT_PROFILES.md
OpenAI usage (tiny example + options): docs/ENHANCED_OPENAI_API.md
Quality benchmarks (JSONL/CSV judgments): docs/QUALITY_BENCHMARKS.md
Troubleshooting: docs/TROUBLESHOOTING.md
Backend specs and performance: docs/BACKEND_TECHNICAL_SPECS.md, docs/PERFORMANCE_COMPARISON_CHARTS.md

Try it with OpenAI SDK (tiny)

import httpx
import openai

# api_key is a placeholder value; the SDK requires one to be set
client = openai.OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

# Embeddings
res = client.embeddings.create(model="text-embedding-ada-002", input=["hello world"])
print(len(res.data[0].embedding))

# Rerank (OpenAI-compatible). The SDK has no rerank method, so use its generic
# post() helper; the path is relative to base_url, i.e. /v1/openai/rerank.
rr = client.post(
    "/openai/rerank",
    cast_to=httpx.Response,  # return the raw HTTP response
    body={
        "query": "capital of france",
        "documents": [
            {"id": "a", "text": "Paris is the capital of France"},
            {"id": "b", "text": "Berlin is in Germany"},
        ],
        "top_n": 2,
    },
).json()
print(rr.get("results", rr))

Tested Frameworks

	Framework	Tests
✅	Open WebUI	`Embed`
✅	LightRAG	`Embed` `Rerank`
✅	continue.dev	`Embed` `Rerank`
✅	Kilo Code	`Embed`

We are waiting for your reports!

📄 License

MIT License – build amazing things locally.


Release history

Version	Released
1.5.1 (this version)	Nov 14, 2025
1.5.0	Nov 5, 2025
1.3.0	Nov 4, 2025
1.2.3	Oct 30, 2025
1.2.2	Sep 10, 2025
1.2.1	Sep 10, 2025
1.2.0	Sep 9, 2025
1.1.3	Sep 3, 2025
1.1.1	Sep 3, 2025
1.1.0	Aug 28, 2025
1.0.2	Aug 28, 2025
1.0.1	Aug 28, 2025
1.0.0	Aug 28, 2025

Download files

Source Distribution

embed_rerank-1.5.1.tar.gz (147.7 kB), uploaded Nov 14, 2025

Built Distribution

embed_rerank-1.5.1-py3-none-any.whl (90.3 kB), uploaded Nov 14, 2025, Python 3

File details

Details for the file embed_rerank-1.5.1.tar.gz.

File metadata

Download URL: embed_rerank-1.5.1.tar.gz
Upload date: Nov 14, 2025
Size: 147.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for embed_rerank-1.5.1.tar.gz
Algorithm	Hash digest
SHA256	`c58ace7f7310bfc0ba894cb4b8430757e5fb8ce4ad443c4a7873c65400fb2b2d`
MD5	`3a872775dd902705b6c29c101d3c70ab`
BLAKE2b-256	`d3a40dac741750b26ccbb4e949414ad0decab9b7b4c96ed6c8bda6b5854732b8`


File details

Details for the file embed_rerank-1.5.1-py3-none-any.whl.

File metadata

Download URL: embed_rerank-1.5.1-py3-none-any.whl
Upload date: Nov 14, 2025
Size: 90.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for embed_rerank-1.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7df23056345d69018b5e015cb0a110e68e828732d2ea4e414908e10d45e05566`
MD5	`618e7940d973d315fc404e6c84c780f3`
BLAKE2b-256	`97d6dbabee1d600319a1fbc46573b19be7972f222ab64172f676c0acd8b235f5`
