Single Model Embedding & Reranker API with Apple Silicon acceleration

🔥 Single Model Embedding & Reranking API

Lightning-fast local embeddings & reranking for Apple Silicon (MLX-first, OpenAI & TEI compatible)

⚡ Why This Matters

Get embeddings and reranking roughly 10x faster on Apple Silicon by running them locally through MLX. The service is a drop-in replacement for the OpenAI API and Hugging Face TEI: point your existing client at the local server and the rest of your code stays unchanged.

🏆 Performance Comparison

Operation	This API (MLX)	OpenAI API	Hugging Face TEI
Embeddings	`0.78ms`	`200ms+`	`15ms`
Reranking	`1.04ms`	`N/A`	`25ms`
Model Loading	`0.36s`	`N/A`	`3.2s`
Cost	`$0`	`$0.02/1K`	`$0`

Tested on Apple M4 Max
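
Latency figures like these depend on hardware, model, and input length, so it is worth measuring on your own machine. The following is a minimal timing sketch, not the benchmark method used for the table above; `measure_latency_ms` is a hypothetical helper, and the workload passed to it here is a stand-in that you would replace with a call to your running server.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(
    call: Callable[[], object], runs: int = 20, warmup: int = 3
) -> Dict[str, float]:
    """Time `call` and return min/median/max latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Warm-up calls are executed but not timed (first calls are often slower).
    for _ in range(warmup):
        call()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; swap in a function that sends a request to the server.
    print(measure_latency_ms(lambda: sum(range(10_000))))
```

Reporting the median alongside min and max makes a single slow outlier easy to spot.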

🚀 Quick Start

Option 1: Install from PyPI (Recommended)

# Install the package
pip install embed-rerank

# Start the server (default port 9000)
embed-rerank

# Or with custom port and options
embed-rerank --port 8080 --host 127.0.0.1

# See all options
embed-rerank --help

CLI Options:

--host: Server host (default: 0.0.0.0)
--port: Server port (default: 9000)
--reload: Enable auto-reload for development
--log-level: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

Environment Variables:

# Alternative: Use environment variables
export PORT=8080
export HOST=127.0.0.1
embed-rerank

Option 2: From Source

# 1. Clone and setup
git clone https://github.com/joonsoo-me/embed-rerank.git
cd embed-rerank
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Start server (macOS/Linux)
./tools/server-run.sh

# 3. Test it works
curl http://localhost:9000/health/

🎉 Done! Visit http://localhost:9000/docs for interactive API documentation.
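
Because the first run downloads the model, the server may take a while before it answers. As a rough sketch, a script could poll the health endpoint shown above until it responds; `http_ok` and `wait_until_healthy` are hypothetical helper names, and treating HTTP 200 as "healthy" is an assumption about the endpoint.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200 (assumed healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe(url)` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example (requires a running server):
# wait_until_healthy("http://localhost:9000/health/")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a network connection.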

🛠 Server Management (macOS/Linux)

# Start server (background)
./tools/server-run.sh

# Start server (foreground/development)
./tools/server-run-foreground.sh

# Stop server
./tools/server-stop.sh

Windows Support: Coming soon! Currently optimized for macOS/Linux.

⚙️ Configuration

Create a .env file (optional):

# Server
PORT=9000
HOST=0.0.0.0

# Backend
BACKEND=auto                                   # auto | mlx | torch
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ

# Model Cache (first run downloads ~2.3GB model)
MODEL_PATH=                                    # Custom model directory
TRANSFORMERS_CACHE=                            # HF cache override
# Default: ~/.cache/huggingface/hub/

# Performance
BATCH_SIZE=32
MAX_TEXTS_PER_REQUEST=100
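
To make the meaning of these variables concrete, here is a small sketch of how such settings could be read and validated in Python. It is not the project's actual configuration loader: `Settings` and `load_settings` are hypothetical names, and the fallback values simply mirror the sample file above rather than confirmed defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Fallbacks mirror the sample .env above (assumed, not confirmed defaults).
    port: int = 9000
    host: str = "0.0.0.0"
    backend: str = "auto"
    model_name: str = "mlx-community/Qwen3-Embedding-4B-4bit-DWQ"
    batch_size: int = 32
    max_texts_per_request: int = 100


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment-style key/value pairs."""
    env = os.environ if env is None else env
    defaults = Settings()
    backend = env.get("BACKEND", defaults.backend).strip().lower()
    if backend not in {"auto", "mlx", "torch"}:
        raise ValueError(f"BACKEND must be auto, mlx, or torch, got {backend!r}")
    return Settings(
        port=int(env.get("PORT", defaults.port)),
        host=env.get("HOST", defaults.host),
        backend=backend,
        model_name=env.get("MODEL_NAME", defaults.model_name),
        batch_size=int(env.get("BATCH_SIZE", defaults.batch_size)),
        max_texts_per_request=int(
            env.get("MAX_TEXTS_PER_REQUEST", defaults.max_texts_per_request)
        ),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to check in isolation, for example `load_settings({"PORT": "8080"})`.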

📂 Model Cache Management

The service automatically manages model downloads and caching:

Environment Variable	Purpose	Default
`MODEL_PATH`	Custom model directory	(uses HF cache)
`TRANSFORMERS_CACHE`	Override HF cache location	`~/.cache/huggingface/transformers`
`HF_HOME`	HF home directory	`~/.cache/huggingface`
(auto)	Default HF cache	`~/.cache/huggingface/hub/`

Cache Location Check

# Find where your model is cached
python3 -c "
import os
print('MODEL_PATH:', os.getenv('MODEL_PATH', '<not set>'))
print('TRANSFORMERS_CACHE:', os.getenv('TRANSFORMERS_CACHE', '<not set>'))
print('HF_HOME:', os.getenv('HF_HOME', '<not set>'))
print('Default cache:', os.path.expanduser('~/.cache/huggingface/hub'))
"

# List cached Qwen3 models
ls ~/.cache/huggingface/hub | grep -i qwen3 || echo "No Qwen3 models found in cache"
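
The same check can be written as a reusable Python function. This is a sketch under one loud assumption: the precedence used here (`MODEL_PATH`, then `TRANSFORMERS_CACHE`, then `HF_HOME/hub`, then the default) is inferred from the table above and is not confirmed to be the order the service itself applies. `resolve_cache_dir` and `list_cached_models` are hypothetical names.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def resolve_cache_dir(
    env: Optional[Mapping[str, str]] = None, home: Optional[Path] = None
) -> Path:
    """Pick a cache directory. The precedence order is an assumption."""
    env = os.environ if env is None else env
    if env.get("MODEL_PATH"):
        return Path(env["MODEL_PATH"]).expanduser()
    if env.get("TRANSFORMERS_CACHE"):
        return Path(env["TRANSFORMERS_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    base = Path.home() if home is None else home
    return base / ".cache" / "huggingface" / "hub"


def list_cached_models(cache_dir: Path, needle: str = "qwen3") -> List[str]:
    """Return sorted entry names in `cache_dir` that contain `needle` (case-insensitive)."""
    if not cache_dir.is_dir():
        return []
    return sorted(
        p.name for p in cache_dir.iterdir() if needle.lower() in p.name.lower()
    )


# Example:
# print(list_cached_models(resolve_cache_dir()))
```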

🌐 Three APIs, One Service

API	Endpoint	Use Case
Native	`/api/v1/embed`, `/api/v1/rerank`	New projects
OpenAI	`/v1/embeddings`	Existing OpenAI code
TEI	`/embed`, `/rerank`	Hugging Face TEI replacement
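
The three styles differ mainly in URL path and in the name of the JSON field that carries the texts. As an illustration, the hypothetical helper `embedding_request` below builds the URL and body for each style, using only the paths and payload shapes that appear in this README's request examples; it does not send anything.

```python
from typing import Dict, List, Tuple


def embedding_request(
    style: str, texts: List[str], base_url: str = "http://localhost:9000"
) -> Tuple[str, Dict[str, object]]:
    """Return (url, json_body) for an embeddings call in the given API style."""
    base = base_url.rstrip("/")
    if style == "native":
        return f"{base}/api/v1/embed/", {"texts": texts}
    if style == "openai":
        # Model name copied from the OpenAI example in this README.
        return f"{base}/v1/embeddings", {
            "input": texts,
            "model": "text-embedding-ada-002",
        }
    if style == "tei":
        return f"{base}/embed", {"inputs": texts, "truncate": True}
    raise ValueError(f"unknown API style: {style!r}")


# Example:
# url, body = embedding_request("tei", ["Hello world"])
```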

OpenAI Compatible (Drop-in)

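import openai

# Point the official OpenAI client at the local server.
# The api_key here is a placeholder value, not a real OpenAI key.
client = openai.OpenAI(
    api_key="dummy-key",
    base_url="http://localhost:9000/v1"
)

response = client.embeddings.create(
    input=["Hello world", "Apple Silicon is fast!"],
    model="text-embedding-ada-002"
)
print(len(response.data), "embeddings returned")
# 🚀 10x faster than OpenAI, same code!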
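
A common next step with embedding vectors is comparing them. The sketch below computes cosine similarity in plain Python; `cosine_similarity` is a hypothetical helper written for this example and is not part of the package.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# With an OpenAI-client embeddings response:
# score = cosine_similarity(response.data[0].embedding, response.data[1].embedding)
```

Values close to 1 indicate vectors pointing in nearly the same direction; for large batches a NumPy implementation would be faster.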

TEI Compatible

curl -X POST "http://localhost:9000/embed" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Hello world"], "truncate": true}'

Native API

# Embeddings
curl -X POST "http://localhost:9000/api/v1/embed/" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Apple Silicon", "MLX acceleration"]}'

# Reranking
curl -X POST "http://localhost:9000/api/v1/rerank/" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["AI is cool", "Dogs are pets", "MLX is fast"]}'
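
The same native calls can be made from Python with only the standard library. This is a minimal sketch: `batched`, `build_json_post`, and `post_json` are hypothetical helpers, the response is returned as parsed JSON without assuming its schema, and splitting inputs into chunks of 100 assumes that `MAX_TEXTS_PER_REQUEST` is a per-request cap you need to respect on the client side.

```python
import json
import urllib.request
from typing import Dict, Iterator, List


def batched(texts: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` texts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def build_json_post(url: str, payload: Dict[str, object]) -> urllib.request.Request:
    """Create a POST request with a JSON body, matching the curl examples."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: Dict[str, object], timeout: float = 30.0):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_json_post(url, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Examples (require a running server):
# for chunk in batched(all_texts, 100):
#     result = post_json("http://localhost:9000/api/v1/embed/", {"texts": chunk})
# ranked = post_json(
#     "http://localhost:9000/api/v1/rerank/",
#     {"query": "machine learning", "passages": ["AI is cool", "Dogs are pets"]},
# )
```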

🧪 Testing

# Comprehensive test suite
./tools/server-tests.sh

# Quick check of server health and loaded-model info
curl http://localhost:9000/health/

# Run pytest
pytest tests/ -v

🚀 What You Get

✅ Zero Code Changes: Drop-in replacement for OpenAI API and TEI
⚡ 10x Performance: Apple MLX acceleration on Apple Silicon
💰 Zero Costs: No API fees, runs locally
🔒 Privacy: Your data never leaves your machine
🎯 Three APIs: Native, OpenAI, and TEI compatibility
📊 Production Ready: Health checks, monitoring, structured logging

📄 License

MIT License - build amazing things with this code!

embed-rerank 1.0.2 (uploaded to PyPI on Aug 28, 2025)
