Single Model Embedding & Reranker API with Apple Silicon acceleration

Single Model Embedding & Reranking API

Lightning-fast local embeddings & reranking for Apple Silicon (MLX-first, OpenAI & TEI compatible)


Why This Matters

Run embeddings and reranking locally on Apple Silicon, more than 10x faster than the alternatives in the comparison below. Drop-in replacement for the OpenAI API and Hugging Face TEI: existing client code only needs to point at the local server.

Performance Comparison

| Operation | This API (MLX) | OpenAI API | Hugging Face TEI |
|---|---|---|---|
| Embeddings | 0.78ms | 200ms+ | 15ms |
| Reranking | 1.04ms | N/A | 25ms |
| Model Loading | 0.36s | N/A | 3.2s |
| Cost | $0 | $0.02/1K | $0 |

Tested on Apple M4 Max
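
To put the latency figures in perspective, the speedup ratios can be worked out directly from the table. The snippet below is only arithmetic on the published numbers. The dictionary layout and the `speedup` helper are illustrative names, not part of the package. Because the OpenAI figure is listed as "200ms+", its ratio should be read as a lower bound.

```python
# Latencies in milliseconds, copied from the comparison table above.
LATENCY_MS = {
    "embeddings": {"mlx": 0.78, "openai": 200.0, "tei": 15.0},
    "reranking": {"mlx": 1.04, "tei": 25.0},
}


def speedup(operation: str, baseline: str) -> float:
    """Return how many times faster the MLX figure is than the baseline."""
    row = LATENCY_MS[operation]
    return row[baseline] / row["mlx"]


for operation, row in LATENCY_MS.items():
    for baseline in row:
        if baseline != "mlx":
            ratio = speedup(operation, baseline)
            print(f"{operation}: {ratio:.1f}x faster than {baseline}")
```

On these numbers the MLX backend comes out roughly 19x faster than TEI for embeddings and about 24x faster for reranking.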


Quick Start

Option 1: Install from PyPI (Recommended)

# Install the package
pip install embed-rerank

# Start the server (default port 9000)
embed-rerank

# Or with custom port and options
embed-rerank --port 8080 --host 127.0.0.1

# See all options
embed-rerank --help

Option 2: From Source (Development)

# 1. Clone and setup
git clone https://github.com/joonsoo-me/embed-rerank.git
cd embed-rerank
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Start server (macOS/Linux)
./tools/server-run.sh

# 3. Test it works
curl http://localhost:9000/health/

Done! Visit http://localhost:9000/docs for interactive API documentation.
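
The same health check can be scripted from Python. This is a minimal sketch using only the standard library. The helper names `health_url` and `is_healthy` are made up for illustration, and it assumes the `/health/` endpoint answers with HTTP 200 when the server is up.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health endpoint URL from a server base URL."""
    return base_url.rstrip("/") + "/health/"


def is_healthy(base_url: str = "http://localhost:9000", timeout: float = 2.0) -> bool:
    """Return True if the health endpoint responds with HTTP 200 (assumed)."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


if __name__ == "__main__":
    print("server healthy:", is_healthy())
```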


Server Management (macOS/Linux)

# Start server (background)
./tools/server-run.sh

# Start server (foreground/development)
./tools/server-run-foreground.sh

# Stop server
./tools/server-stop.sh

Windows Support: Coming soon! Currently optimized for macOS/Linux.


CLI Configuration

PyPI Package CLI Options

Server Options:

  • --host: Server host (default: 0.0.0.0)
  • --port: Server port (default: 9000)
  • --reload: Enable auto-reload for development
  • --log-level: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

Testing Options:

  • --test quick: Run quick validation tests
  • --test performance: Run performance benchmark tests
  • --test quality: Run quality validation tests
  • --test full: Run comprehensive test suite
  • --test-url: Custom server URL for testing
  • --test-output: Test output directory

Examples:

# Custom server configuration
embed-rerank --port 8080 --host 127.0.0.1 --reload

# Built-in performance testing
embed-rerank --port 8080 &
embed-rerank --test performance --test-url http://localhost:8080
pkill -f embed-rerank

# Environment variables
export PORT=8080 HOST=127.0.0.1
embed-rerank
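
If you launch the server from a Python script (for example in a test harness), the server options listed above can be assembled into a command line. This is a sketch only: `build_command` is a hypothetical helper, and it uses only the flags documented in this section.

```python
import shlex

VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def build_command(port=9000, host="0.0.0.0", reload=False, log_level=None):
    """Assemble an embed-rerank command line from the documented options."""
    if log_level is not None and log_level.upper() not in VALID_LOG_LEVELS:
        raise ValueError(f"unknown log level: {log_level!r}")
    cmd = ["embed-rerank", "--port", str(port), "--host", host]
    if reload:
        cmd.append("--reload")
    if log_level is not None:
        cmd += ["--log-level", log_level.upper()]
    return cmd


# Print the command as it would be typed in a shell.
print(shlex.join(build_command(port=8080, host="127.0.0.1", reload=True)))
```

The resulting list can be passed to `subprocess.Popen` to start the server as a child process.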

Source Code Configuration

Create a .env file for development:

# Server
PORT=9000
HOST=0.0.0.0

# Backend
BACKEND=auto                                   # auto | mlx | torch
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ

# Model Cache (first run downloads ~2.3GB model)
MODEL_PATH=                               # Custom model directory
TRANSFORMERS_CACHE=                           # HF cache override
# Default: ~/.cache/huggingface/hub/

# Performance
BATCH_SIZE=32
MAX_TEXTS_PER_REQUEST=100
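
To sanity-check a .env file before starting the server, you can parse it with a few lines of Python. This is an illustrative parser, not the loader the project itself uses. `parse_env` is a hypothetical helper that treats everything after `#` as a comment, so it would truncate values that contain a literal `#`.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=value lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


SAMPLE = """
# Server
PORT=9000
HOST=0.0.0.0

# Backend
BACKEND=auto                                   # auto | mlx | torch
MODEL_PATH=                               # Custom model directory

# Performance
BATCH_SIZE=32
MAX_TEXTS_PER_REQUEST=100
"""

settings = parse_env(SAMPLE)
print(settings)
print("batch size as int:", int(settings["BATCH_SIZE"]))
```

An empty value such as `MODEL_PATH=` parses to an empty string, which you can treat as "not set".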

Model Cache Management

The service automatically manages model downloads and caching:

| Environment Variable | Purpose | Default |
|---|---|---|
| MODEL_PATH | Custom model directory | (uses HF cache) |
| TRANSFORMERS_CACHE | Override HF cache location | ~/.cache/huggingface/transformers |
| HF_HOME | HF home directory | ~/.cache/huggingface |
| (auto) | Default HF cache | ~/.cache/huggingface/hub/ |

Cache Location Check

# Find where your model is cached
python3 -c "
import os
print('MODEL_PATH:', os.getenv('MODEL_PATH', '<not set>'))
print('TRANSFORMERS_CACHE:', os.getenv('TRANSFORMERS_CACHE', '<not set>'))
print('HF_HOME:', os.getenv('HF_HOME', '<not set>'))
print('Default cache:', os.path.expanduser('~/.cache/huggingface/hub'))
"

# List cached Qwen3 models
ls ~/.cache/huggingface/hub | grep -i qwen3 || echo "No Qwen3 models found in cache"
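
The `ls | grep` check above can also be written in Python, which is convenient inside setup scripts. This is a small sketch using `pathlib`. `find_cached_models` is an illustrative name, and the match is a plain case-insensitive substring test on directory names.

```python
from pathlib import Path


def find_cached_models(cache_dir: str, needle: str = "qwen3") -> list:
    """List entries in cache_dir whose names contain needle (case-insensitive)."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if needle.lower() in p.name.lower())


matches = find_cached_models("~/.cache/huggingface/hub")
print(matches or "No Qwen3 models found in cache")
```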

Three APIs, One Service

| API | Endpoint | Use Case |
|---|---|---|
| Native | /api/v1/embed, /api/v1/rerank | New projects |
| OpenAI | /v1/embeddings | Existing OpenAI code |
| TEI | /embed, /rerank | Hugging Face TEI replacement |

OpenAI Compatible (Drop-in)

import openai

client = openai.OpenAI(
    api_key="dummy-key",
    base_url="http://localhost:9000/v1"
)

response = client.embeddings.create(
    input=["Hello world", "Apple Silicon is fast!"],
    model="text-embedding-ada-002"
)
# 10x faster than OpenAI, same code!
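
Once you have embeddings back, a common next step is comparing them. The sketch below computes cosine similarity in plain Python. It runs on toy vectors so it needs no server. With the OpenAI client you would pass `response.data[0].embedding` and `response.data[1].embedding` instead. The `cosine_similarity` helper is illustrative and not part of this package.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```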

TEI Compatible

curl -X POST "http://localhost:9000/embed" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Hello world"], "truncate": true}'
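
The same TEI-style request can be sent from Python with the standard library. This sketch mirrors the curl call above: same path, header, and JSON body. The helper names `build_tei_embed_request` and `send` are illustrative. The response is returned as parsed JSON without assuming its shape.

```python
import json
import urllib.request


def build_tei_embed_request(texts, base_url="http://localhost:9000", truncate=True):
    """Build a POST request for the TEI-compatible /embed endpoint."""
    body = json.dumps({"inputs": list(texts), "truncate": truncate}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_tei_embed_request(["Hello world"])
print(request.get_method(), request.full_url)
# With the server running: result = send(request)
```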

Native API

# Embeddings
curl -X POST "http://localhost:9000/api/v1/embed/" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Apple Silicon", "MLX acceleration"]}'

# Reranking
curl -X POST "http://localhost:9000/api/v1/rerank/" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["AI is cool", "Dogs are pets", "MLX is fast"]}'
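
For larger workloads it helps to build the native request bodies in code and split big inputs into several calls. The sketch below assumes `MAX_TEXTS_PER_REQUEST=100` caps the number of texts per embed call. How the server reacts to larger requests is not covered here. `chunk_texts`, `embed_payloads`, and `rerank_payload` are hypothetical helpers. The field names (`texts`, `query`, `passages`) come from the curl examples above.

```python
def chunk_texts(texts, limit=100):
    """Split a list of texts into consecutive chunks of at most `limit` items."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [texts[i:i + limit] for i in range(0, len(texts), limit)]


def embed_payloads(texts, limit=100):
    """Build one /api/v1/embed/ request body per chunk."""
    return [{"texts": chunk} for chunk in chunk_texts(list(texts), limit)]


def rerank_payload(query, passages):
    """Build a /api/v1/rerank/ request body."""
    return {"query": query, "passages": list(passages)}


payloads = embed_payloads([f"doc {i}" for i in range(250)])
print([len(p["texts"]) for p in payloads])  # [100, 100, 50]
print(rerank_payload("machine learning", ["AI is cool", "Dogs are pets"]))
```

Each payload can then be serialized with `json.dumps` and POSTed to the matching endpoint.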

Performance Testing & Validation

Built-in CLI Testing (PyPI Package)

The PyPI package includes powerful built-in testing capabilities:

# Quick validation (basic functionality check)
embed-rerank --test quick

# Performance benchmark (latency, throughput, concurrency)
embed-rerank --test performance --test-url http://localhost:9000

# Quality validation (semantic similarity, multilingual)  
embed-rerank --test quality --test-url http://localhost:9000

# Full comprehensive test suite
embed-rerank --test full --test-url http://localhost:9000

Test Results Include:

  • Latency Metrics: Mean, P95, P99 response times
  • Throughput Analysis: Texts/sec processing rates
  • Concurrency Testing: Multi-threaded request handling
  • Semantic Validation: Quality of embeddings and reranking
  • Multilingual Support: Cross-language performance
  • JSON Reports: Detailed metrics for automation
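
If you collect your own timings, the latency and throughput metrics listed above can be computed as follows. This is a sketch that uses the nearest-rank percentile definition. The built-in test suite may calculate its numbers differently, and `percentile` and `summarize` are illustrative names.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, texts_processed, elapsed_s):
    """Return mean, P95, P99 latency and texts/sec throughput."""
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "texts_per_sec": texts_processed / elapsed_s,
    }


# Synthetic samples (1.0 .. 100.0 ms) standing in for measured latencies.
samples = [float(i) for i in range(1, 101)]
print(summarize(samples, texts_processed=100, elapsed_s=0.5))
```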

Example Output:

Running Embed-Rerank Test Suite
Target URL: http://localhost:9000
Test Mode: performance

Performance Results:
• Latency: 0.8ms avg, 1.2ms max
• Throughput: 1,250 texts/sec peak
• Concurrency: 5/5 successful (100%)
Results saved to: ./test-results/performance_test_results.json

Advanced Testing (Source Code)

For development and comprehensive testing with the source code:

```bash
# Comprehensive test suite (shell script)
./tools/server-tests.sh

# Run with specific test modes
./tools/server-tests.sh --quick            # Quick validation only
./tools/server-tests.sh --performance      # Performance tests only
./tools/server-tests.sh --full             # Full test suite

# Custom server URL
./tools/server-tests.sh --url http://localhost:8080

# Manual health check
curl http://localhost:9000/health/

# Unit tests with pytest
pytest tests/ -v
```

Development & Deployment

Local Development (Source Code)

# Start server (background)
./tools/server-run.sh

# Start server (foreground/development)
./tools/server-run-foreground.sh

# Stop server
./tools/server-stop.sh

Production Deployment (PyPI Package)

# Install and run
pip install embed-rerank
embed-rerank --port 9000 --host 0.0.0.0

# With custom configuration
embed-rerank --port 8080 --reload --log-level DEBUG

# Background deployment
embed-rerank --port 9000 &
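
When the server is started in the background, it may take a moment before it accepts requests, so scripts that use it right away can race the startup. A small polling loop avoids that. This is a generic sketch: `wait_until_ready` is a hypothetical helper, and the `probe` argument is any callable that returns True once the server is ready, for example one that requests `GET /health/`.

```python
import time


def wait_until_ready(probe, attempts=30, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it returns True; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # wait before the next try
    raise TimeoutError(f"server not ready after {attempts} attempts")


# Demo with a fake probe that succeeds on the third call (no server needed).
answers = iter([False, False, True])
print(wait_until_ready(lambda: next(answers), sleep=lambda _: None))  # 3
```

Injecting `sleep` keeps the helper easy to test without real delays.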


---

## What You Get

### Core Features
- **Zero Code Changes**: Drop-in replacement for OpenAI API and TEI
- **10x Performance**: Apple MLX acceleration on Apple Silicon
- **Zero Costs**: No API fees, runs locally
- **Privacy**: Your data never leaves your machine
- **Three APIs**: Native, OpenAI, and TEI compatibility
- **Production Ready**: Health checks, monitoring, structured logging

### Built-in Testing & Benchmarking
- **CLI Performance Testing**: One-command benchmarking
- **Concurrency Testing**: Multi-threaded request validation
- **Quality Validation**: Semantic similarity and multilingual testing
- **JSON Reports**: Automated performance monitoring
- **Real-time Metrics**: Latency, throughput, and success rates

### Deployment Options
- **PyPI Package**: `pip install embed-rerank` for instant deployment
- **Source Code**: Full development environment with advanced tooling
- **Multi-API Support**: OpenAI, TEI, and native endpoints
- **Flexible Configuration**: Environment variables, CLI args, .env files

---

## Quick Reference

### Installation & Startup
```bash
# PyPI Package (Production)
pip install embed-rerank && embed-rerank

# Source Code (Development)  
git clone https://github.com/joonsoo-me/embed-rerank.git
cd embed-rerank && ./tools/server-run.sh
```

Performance Testing

# One-command benchmark
embed-rerank --test performance --test-url http://localhost:9000

# Comprehensive testing
./tools/server-tests.sh --full

API Endpoints

  • Native: POST /api/v1/embed/ and /api/v1/rerank/
  • OpenAI: POST /v1/embeddings (drop-in replacement)
  • TEI: POST /embed and /rerank (Hugging Face compatible)
  • Health: GET /health/ (monitoring and diagnostics)

License

MIT License - build amazing things with this code!
