Lightweight Qwen3 text embedding & reranking via ONNX Runtime and GGUF (fork of fastembed)

Qwen3 Embed

Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF

Trimmed fork of fastembed, keeping only Qwen3 models.

Features

  • Last-token pooling: Uses the final token representation (with left-padding) instead of mean pooling.
  • MRL support: Matryoshka Representation Learning allows truncating embeddings to any dimension from 32 to 1024 while preserving quality.
  • Instruction-aware: Query embedding supports task instructions for better retrieval performance.
  • Causal LM reranking: Reranker uses yes/no logit scoring via causal language model, producing calibrated [0, 1] scores.
  • Multiple backends: ONNX Runtime (INT8, Q4F16) and GGUF (Q4_K_M via llama-cpp-python).
  • GPU optional, no PyTorch: Runs on ONNX Runtime or llama-cpp-python -- no heavy ML framework required. Auto-detects GPU (CUDA, DirectML) when available.
  • Multilingual: Both the embedding and reranker models accept multilingual input.
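
To illustrate the last-token pooling feature listed above, here is a minimal NumPy sketch. It is not the library's internal code: the function name `last_token_pool` and the toy arrays are hypothetical, and it assumes the batch has already been left-padded so that every sequence ends at the final position.

```python
import numpy as np


def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Take the final position of each left-padded sequence and L2-normalize it.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # With left-padding, the last column of the mask must be all real tokens.
    if not np.all(attention_mask[:, -1] == 1):
        raise ValueError("expected a left-padded batch")
    pooled = hidden_states[:, -1, :]
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    # Guard against division by zero for degenerate all-zero vectors.
    return pooled / np.clip(norms, 1e-12, None)


# Toy batch (illustrative values only): 2 sequences, 3 positions, 2 hidden dims.
hidden = np.array([
    [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]],  # first position is padding
    [[1.0, 1.0], [2.0, 2.0], [0.0, 5.0]],  # no padding
])
mask = np.array([[0, 1, 1], [1, 1, 1]])
pooled = last_token_pool(hidden, mask)
```

Left-padding is what makes the indexing trivial: no per-sequence length lookup is needed, because the last real token always sits at index -1.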

Supported Models

ONNX (default)

Model                                    Type       Dims           Max Tokens  Size
n24q02m/Qwen3-Embedding-0.6B-ONNX        Embedding  32-1024 (MRL)  32768       573 MB
n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16  Embedding  32-1024 (MRL)  32768       517 MB
n24q02m/Qwen3-Reranker-0.6B-ONNX         Reranker   -              40960       573 MB
n24q02m/Qwen3-Reranker-0.6B-ONNX-Q4F16   Reranker   -              40960       518 MB
n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo   Reranker   -              40960       598 MB

GGUF (optional, requires llama-cpp-python)

Model                              Type       Dims           Max Tokens  Size
n24q02m/Qwen3-Embedding-0.6B-GGUF  Embedding  32-1024 (MRL)  32768       378 MB
n24q02m/Qwen3-Reranker-0.6B-GGUF   Reranker   -              40960       378 MB

HuggingFace Repos

Format  Embedding                          Reranker
ONNX    n24q02m/Qwen3-Embedding-0.6B-ONNX  n24q02m/Qwen3-Reranker-0.6B-ONNX
GGUF    n24q02m/Qwen3-Embedding-0.6B-GGUF  n24q02m/Qwen3-Reranker-0.6B-GGUF

Installation

pip install qwen3-embed

# For GGUF support
pip install qwen3-embed[gguf]
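
Because GGUF support is an optional extra, application code may want to check for it before choosing a model. The following is a small sketch using only the standard library; the helper `gguf_available` and the fallback logic are illustrative assumptions, not part of qwen3-embed. It relies on the fact that llama-cpp-python is imported as `llama_cpp`.

```python
from importlib.util import find_spec

ONNX_MODEL = "n24q02m/Qwen3-Embedding-0.6B-ONNX"
GGUF_MODEL = "n24q02m/Qwen3-Embedding-0.6B-GGUF"


def gguf_available() -> bool:
    """Return True if llama-cpp-python (import name: llama_cpp) is installed."""
    return find_spec("llama_cpp") is not None


# Prefer the smaller GGUF model when its backend is installed,
# otherwise fall back to the default ONNX model.
model_name = GGUF_MODEL if gguf_available() else ONNX_MODEL
print(model_name)
```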

Usage

Text Embedding

from qwen3_embed import TextEmbedding

# Pick one variant. INT8 (default):
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")

# Q4F16 (smaller, slightly less accurate)
# model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16")

# GGUF (requires: pip install qwen3-embed[gguf])
# model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")

documents = [
    "Qwen3 is a multilingual embedding model.",
    "ONNX Runtime enables fast CPU inference.",
]

embeddings = list(model.embed(documents))
# Each embedding: numpy array of shape (1024,), L2-normalized

# Matryoshka Representation Learning (MRL) -- truncate to smaller dims
embeddings_256 = list(model.embed(documents, dim=256))
# Each embedding: numpy array of shape (256,), L2-normalized

# Query with instruction (for retrieval tasks)
queries = list(model.query_embed(
    ["What is Qwen3?"],
    task="Given a question, retrieve relevant passages",
))
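
The `dim=256` call above returns shorter, L2-normalized vectors. A plausible way to picture what MRL truncation does is to keep the leading components and renormalize. The sketch below does that on random stand-in vectors; `truncate_mrl` is a hypothetical helper, and whether the library performs exactly these steps internally is an assumption.

```python
import numpy as np


def truncate_mrl(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and L2-normalize again."""
    full_dim = embeddings.shape[-1]
    if not 32 <= dim <= full_dim:
        raise ValueError(f"dim must be between 32 and {full_dim}")
    head = embeddings[..., :dim]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)


# Random stand-ins for two 1024-dim, L2-normalized embeddings (not model output).
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_mrl(full, 256)

# For unit-length vectors, cosine similarity is just the dot product.
similarity = float(small[0] @ small[1])
```

Renormalizing after truncation matters: without it, the shortened vectors no longer have unit length and dot products stop being cosine similarities.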

Reranking

from qwen3_embed import TextCrossEncoder

reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX")

# YesNo variant: ~10x less RAM (~598MB vs ~12GB at inference)
# reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo")

query = "What is Qwen3?"
documents = [
    "Qwen3 is a series of large language models by Alibaba.",
    "The weather today is sunny.",
    "Qwen3-Embedding supports multilingual text embedding.",
]

scores = list(reranker.rerank(query, documents))
# scores: list of float in [0, 1], higher = more relevant

# Or rerank pairs directly
pairs = [
    ("What is AI?", "Artificial intelligence is a branch of computer science."),
    ("What is ML?", "Machine learning is a subset of AI."),
]
pair_scores = list(reranker.rerank_pairs(pairs))
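
The scores above lie in [0, 1] because the reranker compares the logits of a "yes" and a "no" token. One common way to turn two logits into such a score is a two-way softmax, which equals a sigmoid of their difference. The sketch below shows that arithmetic and how scores are typically used to order documents. The function names and the example scores are hypothetical placeholders, and the exact formula used inside the library is an assumption.

```python
import math


def yes_no_score(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over (yes, no), written as a numerically stable sigmoid."""
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rank_by_score(documents: list[str], scores: list[float]) -> list[tuple[str, float]]:
    """Pair each document with its score and sort from most to least relevant."""
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


docs = [
    "Qwen3 is a series of large language models by Alibaba.",
    "The weather today is sunny.",
    "Qwen3-Embedding supports multilingual text embedding.",
]
# Placeholder scores for illustration only, not real model output.
example_scores = [0.93, 0.02, 0.71]
ranked = rank_by_score(docs, example_scores)
```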

Configuration

GPU Acceleration

Both ONNX and GGUF backends auto-detect GPU when available (Device.AUTO is the default).

ONNX

Requires onnxruntime-gpu (CUDA) or onnxruntime-directml (Windows) instead of onnxruntime:

pip install onnxruntime-gpu  # NVIDIA CUDA
# or
pip install onnxruntime-directml  # Windows AMD/Intel/NVIDIA

from qwen3_embed import TextEmbedding, Device

# Auto-detect GPU (default)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")

# Force CPU
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CPU)

# Force CUDA
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CUDA)
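
To see which backend auto-detection could pick on a given machine, it can help to inspect the execution providers that the installed ONNX Runtime build exposes. The sketch below uses the real `onnxruntime.get_available_providers()` call, but `pick_provider` and its preference order (CUDA, then DirectML, then CPU) are illustrative assumptions and not necessarily how qwen3-embed resolves `Device.AUTO`.

```python
def pick_provider(available: list[str]) -> str:
    """Return the first preferred GPU provider that is available, else CPU."""
    for preferred in ("CUDAExecutionProvider", "DmlExecutionProvider"):
        if preferred in available:
            return preferred
    return "CPUExecutionProvider"


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # ONNX Runtime is not installed in this environment; assume CPU only.
    available = ["CPUExecutionProvider"]

print(pick_provider(available))
```

If the output is `CPUExecutionProvider` on a machine with a GPU, the plain `onnxruntime` package is probably installed instead of `onnxruntime-gpu` or `onnxruntime-directml`.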

GGUF

GPU is handled by llama-cpp-python. The default pip install qwen3-embed[gguf] is CPU-only. For CUDA GPU support, build with:

CMAKE_ARGS="-DGGML_CUDA=on" pip install qwen3-embed[gguf]

from qwen3_embed import TextEmbedding, Device

# Auto-detect GPU (default, offloads all layers)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")

# Force CPU only
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF", cuda=Device.CPU)

Development

mise run setup   # Install deps + pre-commit hooks
mise run lint    # ruff check + format --check
mise run test    # pytest
mise run fix     # ruff auto-fix + format

Related Projects

  • wet-mcp -- MCP web search server with vector-based docs search, uses qwen3-embed for local embedding
  • mnemo-mcp -- MCP memory server with semantic search powered by qwen3-embed
  • better-code-review-graph -- Knowledge graph for code reviews, uses qwen3-embed for local ONNX embedding
  • modalcom-ai-workers -- GPU-serverless workers that convert Qwen3 models to ONNX/GGUF format

Contributing

See CONTRIBUTING.md.

License

Apache-2.0 -- See LICENSE. Original fastembed by Qdrant.
