Single Model Embedding & Reranker API with Apple Silicon acceleration
Single Model Embedding & Reranking API
Why This Matters
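Run embeddings and reranking locally on Apple Silicon with MLX acceleration. In the benchmarks below, latency is more than 10x lower than both Hugging Face TEI and the OpenAI API. The service is a drop-in replacement for the OpenAI embeddings API and Hugging Face TEI: existing client code needs no changes beyond pointing it at the local server.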
Performance Comparison
| Operation | This API (MLX) | OpenAI API | Hugging Face TEI |
|---|---|---|---|
| Embeddings | 0.78ms | 200ms+ | 15ms |
| Reranking | 1.04ms | N/A | 25ms |
| Model Loading | 0.36s | N/A | 3.2s |
| Cost | $0 | $0.02/1K | $0 |
Tested on Apple M4 Max
Quick Start
Option 1: Install from PyPI (Recommended)
# Install the package
pip install embed-rerank
# Start the server (default port 9000)
embed-rerank
# Or with custom port and options
embed-rerank --port 8080 --host 127.0.0.1
# See all options
embed-rerank --help
Option 2: From Source (Development)
# 1. Clone and setup
git clone https://github.com/joonsoo-me/embed-rerank.git
cd embed-rerank
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 2. Start server (macOS/Linux)
./tools/server-run.sh
# 3. Test it works
curl http://localhost:9000/health/
Done! Visit http://localhost:9000/docs for interactive API documentation.
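When the server is started from a script, the health check in step 3 can be polled until the service is ready. The sketch below is a minimal example that assumes only that `GET /health/` answers with HTTP 200 once the server is up; the helper names `http_status` and `wait_for_health` and the retry policy are illustrative, not part of the package.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=2.0):
    """Return the HTTP status code for a GET on `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def wait_for_health(url="http://localhost:9000/health/", timeout_s=30.0,
                    interval_s=0.5, probe=http_status):
    """Poll `url` until it returns 200 or `timeout_s` seconds have passed.

    Hypothetical helper: the URL matches the curl check above; `probe` is
    injectable so the loop can be exercised without a running server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_health()` right after launching `embed-rerank` in the background gives the model time to load before the first request is sent.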
Server Management (macOS/Linux)
# Start server (background)
./tools/server-run.sh
# Start server (foreground/development)
./tools/server-run-foreground.sh
# Stop server
./tools/server-stop.sh
# Development automation tools (NEW!)
./tools/setup-macos-service.sh # Auto-generate macOS LaunchAgent
./tools/test-ci-locally.sh # Run GitHub CI tests locally
Windows Support: Coming soon! Currently optimized for macOS/Linux.
CLI Configuration
PyPI Package CLI Options
Server Options:
- --host: Server host (default: 0.0.0.0)
- --port: Server port (default: 9000)
- --reload: Enable auto-reload for development
- --log-level: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
Testing Options:
- --test quick: Run quick validation tests
- --test performance: Run performance benchmark tests
- --test quality: Run quality validation tests
- --test full: Run comprehensive test suite
- --test-url: Custom server URL for testing
- --test-output: Test output directory
Examples:
# Custom server configuration
embed-rerank --port 8080 --host 127.0.0.1 --reload
# Built-in performance testing
embed-rerank --port 8080 &
embed-rerank --test performance --test-url http://localhost:8080
pkill -f embed-rerank
# Environment variables
export PORT=8080 HOST=127.0.0.1
embed-rerank
Source Code Configuration
Create a .env file for development:
# Server
PORT=9000
HOST=0.0.0.0
# Backend
BACKEND=auto # auto | mlx | torch
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ
# Model Cache (first run downloads ~2.3GB model)
MODEL_PATH= # Custom model directory
TRANSFORMERS_CACHE= # HF cache override
# Default: ~/.cache/huggingface/hub/
# Performance & Auto-Configuration
BATCH_SIZE=32
MAX_TEXTS_PER_REQUEST=100
# Note: Token limits and dimensions are automatically extracted from model metadata
# The service dynamically configures itself based on the loaded model's capabilities
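To make the defaults and the inline-comment format above concrete, here is a minimal sketch of how such a `.env` file could be parsed into typed settings. The names `Settings`, `parse_env_text`, and `load_settings` are hypothetical, and the package's own loader may behave differently; only the variable names and default values are taken from the example above.

```python
from dataclasses import dataclass

VALID_BACKENDS = {"auto", "mlx", "torch"}


@dataclass(frozen=True)
class Settings:
    """Defaults copied from the .env example above."""
    port: int = 9000
    host: str = "0.0.0.0"
    backend: str = "auto"
    model_name: str = "mlx-community/Qwen3-Embedding-4B-4bit-DWQ"
    batch_size: int = 32
    max_texts_per_request: int = 100


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines, comments, and empty values.

    Simplification: everything after '#' is treated as a comment, so values
    containing '#' are not supported by this sketch.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value.strip():
            values[key.strip()] = value.strip()
    return values


def load_settings(env):
    """Build Settings from a mapping such as os.environ, using defaults for gaps."""
    defaults = Settings()
    backend = env.get("BACKEND", defaults.backend).lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"BACKEND must be one of {sorted(VALID_BACKENDS)}, got {backend!r}")
    return Settings(
        port=int(env.get("PORT", defaults.port)),
        host=env.get("HOST", defaults.host),
        backend=backend,
        model_name=env.get("MODEL_NAME", defaults.model_name),
        batch_size=int(env.get("BATCH_SIZE", defaults.batch_size)),
        max_texts_per_request=int(env.get("MAX_TEXTS_PER_REQUEST", defaults.max_texts_per_request)),
    )
```

Passing `os.environ` to `load_settings` covers the `export PORT=8080 HOST=127.0.0.1` style shown earlier, while `parse_env_text` handles the file form.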
Smart Text Processing Features
The service automatically handles long texts with intelligent processing:
- Auto-Truncation: Texts exceeding token limits are automatically reduced by ~75%
- Smart Summarization: Key sentences are preserved while removing redundancy
- Dynamic Token Limits: Automatically detected from model metadata (e.g., 512 tokens for Qwen3)
- Dimension Detection: Vector dimensions auto-configured from model (e.g., 1024D for Qwen3)
- Processing Transparency: Optional processing info in API responses
Example: 8000+ character text → 2037 tokens automatically
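The general idea behind truncating to a token budget while keeping whole sentences can be sketched in a few lines. This is an illustration only, not the service's algorithm: it assumes one whitespace-delimited word counts as one token (a real model tokenizer will count differently), and `truncate_to_budget` and its info dictionary are hypothetical names.

```python
import re


def truncate_to_budget(text, max_tokens):
    """Keep leading whole sentences while the word count stays within max_tokens.

    Returns the shortened text plus a small info dict, loosely mirroring the
    "processing info" idea above. Word counts stand in for token counts.
    """
    words = text.split()
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, used = [], 0
    for sentence in sentences:
        count = len(sentence.split())
        if used + count > max_tokens:
            break
        kept.append(sentence)
        used += count
    if kept:
        result = " ".join(kept)
    else:
        # The first sentence alone exceeds the budget: fall back to a hard cut.
        result = " ".join(words[:max_tokens])
    final_tokens = len(result.split())
    info = {
        "original_tokens": len(words),
        "final_tokens": final_tokens,
        "truncated": final_tokens < len(words),
    }
    return result, info
```

Cutting at sentence boundaries keeps each retained sentence intact, at the cost of sometimes using less than the full budget.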
Model Cache Management
The service automatically manages model downloads and caching:
| Environment Variable | Purpose | Default |
|---|---|---|
| MODEL_PATH | Custom model directory | (uses HF cache) |
| TRANSFORMERS_CACHE | Override HF cache location | ~/.cache/huggingface/transformers |
| HF_HOME | HF home directory | ~/.cache/huggingface |
| (auto) | Default HF cache | ~/.cache/huggingface/hub/ |
Cache Location Check
# Find where your model is cached
python3 -c "
import os
print('MODEL_PATH:', os.getenv('MODEL_PATH', '<not set>'))
print('TRANSFORMERS_CACHE:', os.getenv('TRANSFORMERS_CACHE', '<not set>'))
print('HF_HOME:', os.getenv('HF_HOME', '<not set>'))
print('Default cache:', os.path.expanduser('~/.cache/huggingface/hub'))
"
# List cached Qwen3 models
ls ~/.cache/huggingface/hub | grep -i qwen3 || echo "No Qwen3 models found in cache"
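The same lookup can be written as a small Python helper that returns the effective cache directory. The precedence used here (MODEL_PATH, then TRANSFORMERS_CACHE, then HF_HOME with a `hub` subdirectory, then the default) is an assumption read off the table above; the service and the Hugging Face libraries may resolve these variables differently. `resolve_model_cache` and `list_cached_models` are hypothetical names.

```python
import os
from pathlib import Path


def resolve_model_cache(env=None):
    """Return (cache_dir, source) using an assumed precedence of the variables above."""
    env = os.environ if env is None else env
    for name in ("MODEL_PATH", "TRANSFORMERS_CACHE"):
        value = env.get(name, "").strip()
        if value:
            return Path(value).expanduser(), name
    hf_home = env.get("HF_HOME", "").strip()
    if hf_home:
        return Path(hf_home).expanduser() / "hub", "HF_HOME"
    return Path("~/.cache/huggingface/hub").expanduser(), "default"


def list_cached_models(cache_dir, needle="qwen3"):
    """List cache entries whose name contains `needle`, like `ls | grep -i`."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if needle.lower() in p.name.lower())
```

For example, `list_cached_models(resolve_model_cache()[0])` reports the Qwen3 entries in whichever directory is in effect.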
Three APIs, One Service
| API | Endpoint | Use Case |
|---|---|---|
| Native | /api/v1/embed, /api/v1/rerank | New projects |
| OpenAI | /v1/embeddings | Existing OpenAI code |
| TEI | /embed, /rerank | Hugging Face TEI replacement |
OpenAI Compatible (Drop-in)
import openai
client = openai.OpenAI(
api_key="dummy-key",
base_url="http://localhost:9000/v1"
)
response = client.embeddings.create(
input=["Hello world", "Apple Silicon is fast!"],
model="text-embedding-ada-002"
)
# 10x faster than OpenAI, same code!
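Once the embeddings are back, a common next step is to compare them. The sketch below computes cosine similarity in pure Python; with the OpenAI client, each vector is available as `response.data[i].embedding`. The function name is illustrative and nothing here is specific to this service.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For the request above, `cosine_similarity(response.data[0].embedding, response.data[1].embedding)` scores how close the two input sentences are.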
TEI Compatible
curl -X POST "http://localhost:9000/embed" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Hello world"], "truncate": true}'
Native API
# Embeddings
curl -X POST "http://localhost:9000/api/v1/embed/" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Apple Silicon", "MLX acceleration"]}'
# Reranking
curl -X POST "http://localhost:9000/api/v1/rerank/" \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "passages": ["AI is cool", "Dogs are pets", "MLX is fast"]}'
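The native endpoints can also be called from Python with only the standard library. The sketch below builds the same two requests as the curl commands; `build_json_post` and `send` are hypothetical helper names, and because the response schema is not shown in this README, `send` simply returns the decoded JSON without assuming its fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9000"  # default port from the Quick Start


def build_json_post(path, payload, base_url=BASE_URL):
    """Create a POST request with a JSON body, mirroring `curl -X POST -H ... -d ...`."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout_s=30.0):
    """Send the request to a running server and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payloads as the curl examples above.
embed_request = build_json_post(
    "/api/v1/embed/",
    {"texts": ["Apple Silicon", "MLX acceleration"]},
)
rerank_request = build_json_post(
    "/api/v1/rerank/",
    {"query": "machine learning", "passages": ["AI is cool", "Dogs are pets", "MLX is fast"]},
)
```

With the server running, `send(embed_request)` and `send(rerank_request)` return the parsed responses; building the request separately from sending it keeps the payload easy to inspect or log.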
Performance Testing & Validation
Built-in CLI Testing (PyPI Package)
The PyPI package includes powerful built-in testing capabilities:
# Quick validation (basic functionality check)
embed-rerank --test quick
# Performance benchmark (latency, throughput, concurrency)
embed-rerank --test performance --test-url http://localhost:9000
# Quality validation (semantic similarity, multilingual)
embed-rerank --test quality --test-url http://localhost:9000
# Full comprehensive test suite
embed-rerank --test full --test-url http://localhost:9000
Test Results Include:
- Latency Metrics: Mean, P95, P99 response times
- Throughput Analysis: Texts/sec processing rates
- Concurrency Testing: Multi-threaded request handling
- Semantic Validation: Quality of embeddings and reranking
- Multilingual Support: Cross-language performance
- JSON Reports: Detailed metrics for automation
Example Output:
Running Embed-Rerank Test Suite
Target URL: http://localhost:9000
Test Mode: performance
Performance Results:
• Latency: 0.8ms avg, 1.2ms max
• Throughput: 1,250 texts/sec peak
• Concurrency: 5/5 successful (100%)
Results saved to: ./test-results/performance_test_results.json
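For readers who collect their own timings, the mean, P95, and P99 figures listed above can be derived from raw per-request latencies as follows. This sketch uses the nearest-rank percentile definition, which is an assumption: the built-in test suite may use a different method. `percentile` and `summarize_latencies` are hypothetical names.

```python
import math
import statistics


def percentile(sorted_samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not sorted_samples:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(pct * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def summarize_latencies(samples_ms):
    """Summarize request latencies given in milliseconds."""
    ordered = sorted(samples_ms)
    return {
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "max_ms": ordered[-1],
    }
```

Tail percentiles matter here because a low average can hide occasional slow requests, for example the first call after model loading.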
Advanced Testing (Source Code)
For development and comprehensive testing with the source code:
```bash
# Comprehensive test suite (shell script)
./tools/server-tests.sh
# Run with specific test modes
./tools/server-tests.sh --quick # Quick validation only
./tools/server-tests.sh --performance # Performance tests only
./tools/server-tests.sh --full # Full test suite
./tools/server-tests.sh --text-processing # Text processing validation
# Custom server URL
./tools/server-tests.sh --url http://localhost:8080
# Development automation (NEW!)
./tools/test-ci-locally.sh # Run GitHub CI tests locally
./tools/setup-macos-service.sh # Generate macOS LaunchAgent
# Manual health check
curl http://localhost:9000/health/
# Unit tests with pytest
pytest tests/ -v
```
Development & Deployment
Local Development (Source Code)
# Start server (background)
./tools/server-run.sh
# Start server (foreground/development)
./tools/server-run-foreground.sh
# Stop server
./tools/server-stop.sh
Production Deployment (PyPI Package)
# Install and run
pip install embed-rerank
embed-rerank --port 9000 --host 0.0.0.0
# With custom configuration
embed-rerank --port 8080 --reload --log-level DEBUG
# Background deployment
embed-rerank --port 9000 &
---
## What You Get
### Core Features
- **Zero Code Changes**: Drop-in replacement for OpenAI API and TEI
- **10x Performance**: Apple MLX acceleration on Apple Silicon
- **Zero Costs**: No API fees, runs locally
- **Privacy**: Your data never leaves your machine
- **Three APIs**: Native, OpenAI, and TEI compatibility
- **Production Ready**: Health checks, monitoring, structured logging
- **Smart Text Processing**: Auto-truncation and summarization for long texts
- **Dynamic Configuration**: Automatic model metadata extraction and dimension detection
### Built-in Testing & Benchmarking
- **CLI Performance Testing**: One-command benchmarking
- **Concurrency Testing**: Multi-threaded request validation
- **Quality Validation**: Semantic similarity and multilingual testing
- **JSON Reports**: Automated performance monitoring
- **Real-time Metrics**: Latency, throughput, and success rates
### Development Automation (New!)
- **macOS Service Management**: Auto-generate LaunchAgent from configuration
- **Local CI Testing**: Run GitHub CI tests locally before commits
- **Code Quality Tools**: Automated Black, isort, and flake8 validation
- **Smart Development Workflow**: Virtual environment checks and setup automation
### Deployment Options
- **PyPI Package**: `pip install embed-rerank` for instant deployment
- **Source Code**: Full development environment with advanced tooling
- **Multi-API Support**: OpenAI, TEI, and native endpoints
- **Flexible Configuration**: Environment variables, CLI args, .env files
---
## Quick Reference
### Installation & Startup
```bash
# PyPI Package (Production)
pip install embed-rerank && embed-rerank
# Source Code (Development)
git clone https://github.com/joonsoo-me/embed-rerank.git
cd embed-rerank && ./tools/server-run.sh
```
### Performance Testing
# One-command benchmark
embed-rerank --test performance --test-url http://localhost:9000
# Comprehensive testing
./tools/server-tests.sh --full
### API Endpoints
- Native: POST /api/v1/embed/ and /api/v1/rerank/
- OpenAI: POST /v1/embeddings (drop-in replacement)
- TEI: POST /embed and /rerank (Hugging Face compatible)
- Health: GET /health/ (monitoring and diagnostics with model metadata)
### Development Tools (New!)
# macOS service automation
./tools/setup-macos-service.sh # Auto-generate LaunchAgent from .env.example
# Local CI testing
./tools/test-ci-locally.sh # Run complete GitHub CI suite locally
# Code quality automation
black --line-length 120 app/ tests/ # Consistent formatting
isort --profile black app/ tests/ # Import organization
flake8 app/ tests/ --max-line-length=120 --extend-ignore=E203,W503 # Linting
## License
MIT License - build amazing things with this code!