Lightweight Qwen3 text embedding & reranking via ONNX Runtime and GGUF (fork of fastembed)
Qwen3 Embed
Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF
Trimmed fork of fastembed, keeping only Qwen3 models.
Features
- Last-token pooling: Uses the final token representation (with left-padding) instead of mean pooling.
- MRL support: Matryoshka Representation Learning allows truncating embeddings to any dimension from 32 to 1024 while preserving quality.
- Instruction-aware: Query embedding supports task instructions for better retrieval performance.
- Causal LM reranking: Reranker uses yes/no logit scoring via causal language model, producing calibrated [0, 1] scores.
- Multiple backends: ONNX Runtime (INT8, Q4F16) and GGUF (Q4_K_M via llama-cpp-python).
- GPU optional, no PyTorch: Runs on ONNX Runtime or llama-cpp-python -- no heavy ML framework required. Auto-detects GPU (CUDA, DirectML) when available.
- Multilingual: Both the embedding and reranker models accept multilingual input.
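To make the last-token pooling bullet concrete, here is a minimal NumPy sketch. It is not the package's internal code: the function names `last_token_pool` and `l2_normalize` and the toy tensors are assumptions for illustration only.

```python
import numpy as np


def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last real token of each sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    # With left padding, the final position holds a real token in every row.
    if attention_mask[:, -1].all():
        return hidden_states[:, -1, :]
    # Fallback for right padding: index of the last real token per row.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx, :]


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


# Toy batch (hypothetical values): 2 sequences, 4 positions, 3 dims.
# The first row is left-padded by two positions.
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[0, 0, 1, 1], [1, 1, 1, 1]])
pooled = l2_normalize(last_token_pool(hidden, mask))
```

Left padding keeps the pooling step a plain slice of the last position, which is presumably why the feature list mentions it alongside last-token pooling.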
Supported Models
ONNX (default)
| Model | Type | Dims | Max Tokens | Size |
|---|---|---|---|---|
| n24q02m/Qwen3-Embedding-0.6B-ONNX | Embedding | 32-1024 (MRL) | 32768 | 573 MB |
| n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16 | Embedding | 32-1024 (MRL) | 32768 | 517 MB |
| n24q02m/Qwen3-Reranker-0.6B-ONNX | Reranker | - | 40960 | 573 MB |
| n24q02m/Qwen3-Reranker-0.6B-ONNX-Q4F16 | Reranker | - | 40960 | 518 MB |
| n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo | Reranker | - | 40960 | 598 MB |
GGUF (optional, requires llama-cpp-python)
| Model | Type | Dims | Max Tokens | Size |
|---|---|---|---|---|
| n24q02m/Qwen3-Embedding-0.6B-GGUF | Embedding | 32-1024 (MRL) | 32768 | 378 MB |
| n24q02m/Qwen3-Reranker-0.6B-GGUF | Reranker | - | 40960 | 378 MB |
HuggingFace Repos
| Format | Embedding | Reranker |
|---|---|---|
| ONNX | n24q02m/Qwen3-Embedding-0.6B-ONNX | n24q02m/Qwen3-Reranker-0.6B-ONNX |
| GGUF | n24q02m/Qwen3-Embedding-0.6B-GGUF | n24q02m/Qwen3-Reranker-0.6B-GGUF |
Installation
pip install qwen3-embed
# For GGUF support
pip install qwen3-embed[gguf]
Usage
Text Embedding
from qwen3_embed import TextEmbedding
# INT8 (default)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")
# Q4F16 (smaller, slightly less accurate)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16")
# GGUF (requires: pip install qwen3-embed[gguf])
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")
documents = [
    "Qwen3 is a multilingual embedding model.",
    "ONNX Runtime enables fast CPU inference.",
]
embeddings = list(model.embed(documents))
# Each embedding: numpy array of shape (1024,), L2-normalized
# Matryoshka Representation Learning (MRL) -- truncate to smaller dims
embeddings_256 = list(model.embed(documents, dim=256))
# Each embedding: numpy array of shape (256,), L2-normalized
# Query with instruction (for retrieval tasks)
queries = list(model.query_embed(
    ["What is Qwen3?"],
    task="Given a question, retrieve relevant passages",
))
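The MRL comments above say that truncated embeddings stay L2-normalized. A rough NumPy sketch of that idea: keep the leading components, then re-normalize. Whether `embed(..., dim=256)` does exactly this internally is an assumption. `truncate_embedding` is a hypothetical helper, and the random vectors stand in for real model output.

```python
import numpy as np


def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    head = vec[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.maximum(norm, 1e-12)


# Stand-in for three full-size, L2-normalized embeddings (random, not model output).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_embedding(full, 256)

# For unit vectors, cosine similarity reduces to a dot product.
sims = small @ small.T
```

Re-normalizing after truncation matters because a prefix of a unit vector is generally shorter than unit length, so dot products would no longer be cosine similarities.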
Reranking
from qwen3_embed import TextCrossEncoder
reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX")
# YesNo variant: far less RAM at inference (~598 MB vs ~12 GB)
# reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo")
query = "What is Qwen3?"
documents = [
    "Qwen3 is a series of large language models by Alibaba.",
    "The weather today is sunny.",
    "Qwen3-Embedding supports multilingual text embedding.",
]
scores = list(reranker.rerank(query, documents))
# scores: list of float in [0, 1], higher = more relevant
# Or rerank pairs directly
pairs = [
    ("What is AI?", "Artificial intelligence is a branch of computer science."),
    ("What is ML?", "Machine learning is a subset of AI."),
]
pair_scores = list(reranker.rerank_pairs(pairs))
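The feature list describes the reranker scores as coming from yes/no logit scoring. One common way to turn two such logits into a score in [0, 1] is a two-way softmax, which equals a sigmoid of the logit difference. The sketch below shows only that arithmetic. The function `yes_no_score` and the logit values are invented for illustration, and the package's exact reduction is not shown in this README.

```python
import numpy as np


def yes_no_score(yes_logit, no_logit) -> np.ndarray:
    """P("yes") from a two-way softmax over the yes/no logits."""
    diff = np.asarray(yes_logit, dtype=np.float64) - np.asarray(no_logit, dtype=np.float64)
    # softmax([no, yes])[yes] == sigmoid(yes - no); logaddexp keeps it numerically stable.
    return np.exp(-np.logaddexp(0.0, -diff))


# Made-up logits for three (query, document) pairs.
yes_logits = np.array([4.0, -3.0, 1.5])
no_logits = np.array([-1.0, 2.0, 1.5])

scores = yes_no_score(yes_logits, no_logits)

# Indices of documents from most to least relevant.
order = np.argsort(-scores)
```

Equal logits give a score of exactly 0.5, so scores above 0.5 mean the model leans towards "yes".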
Configuration
GPU Acceleration
Both ONNX and GGUF backends auto-detect GPU when available (Device.AUTO is the default).
ONNX
Requires onnxruntime-gpu (CUDA) or onnxruntime-directml (Windows) instead of onnxruntime:
pip install onnxruntime-gpu # NVIDIA CUDA
# or
pip install onnxruntime-directml # Windows AMD/Intel/NVIDIA
from qwen3_embed import TextEmbedding, Device
# Auto-detect GPU (default)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")
# Force CPU
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CPU)
# Force CUDA
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CUDA)
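To check which ONNX Runtime execution providers your installed wheel exposes, you can call `onnxruntime.get_available_providers()`. The sketch below wraps that call in a simple preference order. The order and the `pick_provider` helper are assumptions for illustration, not the package's actual `Device.AUTO` logic.

```python
def pick_provider(available: list[str]) -> str:
    """Return the first preferred execution provider that is available."""
    # Assumed preference order: CUDA, then DirectML, then CPU.
    preferred = ["CUDAExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]
    for name in preferred:
        if name in available:
            return name
    return "CPUExecutionProvider"


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed; pretend only the CPU provider exists.
    available = ["CPUExecutionProvider"]

print(pick_provider(available))
```

If this prints `CPUExecutionProvider` on a machine with a GPU, the plain `onnxruntime` wheel is probably installed instead of `onnxruntime-gpu` or `onnxruntime-directml`.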
GGUF
GPU is handled by llama-cpp-python. The default pip install qwen3-embed[gguf] is CPU-only.
For CUDA GPU support, build with:
CMAKE_ARGS="-DGGML_CUDA=on" pip install qwen3-embed[gguf]
from qwen3_embed import TextEmbedding, Device
# Auto-detect GPU (default, offloads all layers)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")
# Force CPU only
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF", cuda=Device.CPU)
Development
mise run setup # Install deps + pre-commit hooks
mise run lint # ruff check + format --check
mise run test # pytest
mise run fix # ruff auto-fix + format
Related Projects
- wet-mcp -- MCP web search server with vector-based docs search, uses qwen3-embed for local embedding
- mnemo-mcp -- MCP memory server with semantic search powered by qwen3-embed
- better-code-review-graph -- Knowledge graph for code reviews, uses qwen3-embed for local ONNX embedding
- modalcom-ai-workers -- GPU-serverless workers that convert Qwen3 models to ONNX/GGUF format
Contributing
See CONTRIBUTING.md.
License