Lightweight Qwen3 text embedding & reranking via ONNX Runtime and GGUF (fork of fastembed)

Maintainers

n24q02m

Qwen3 Embed

Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF

Trimmed fork of fastembed, keeping only Qwen3 models.

Features

Last-token pooling: Uses the final token representation (with left-padding) instead of mean pooling.
MRL support: Matryoshka Representation Learning allows truncating embeddings to any dimension from 32 to 1024 while preserving quality.
Instruction-aware: Query embedding supports task instructions for better retrieval performance.
Causal LM reranking: The reranker scores each query-document pair from the yes/no logits of a causal language model, producing calibrated [0, 1] scores.
Multiple backends: ONNX Runtime (INT8, Q4F16) and GGUF (Q4_K_M via llama-cpp-python).
GPU optional, no PyTorch: Runs on ONNX Runtime or llama-cpp-python -- no heavy ML framework required. Auto-detects GPU (CUDA, DirectML) when available.
Multilingual: Both models support multi-language inputs.
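
To make the last-token pooling feature concrete, here is a minimal sketch in plain Python. It is an illustration rather than the library's implementation: `last_token_pool`, `l2_normalize`, and the toy data are hypothetical, and the hidden states are assumed to be one list of per-token vectors for a single sequence.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token.

    hidden_states: list of per-token vectors for one sequence.
    attention_mask: list of 0/1 flags, 1 marking a real token.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")
    if 1 not in attention_mask:
        raise ValueError("sequence contains no real tokens")
    # Index of the last real token; with left-padding this is always the final position.
    last_real = len(attention_mask) - 1 - attention_mask[::-1].index(1)
    return hidden_states[last_real]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy sequence: two padding positions on the left, then three real tokens.
hidden = [[0.0, 0.0], [0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0]]
mask = [0, 0, 1, 1, 1]

pooled = last_token_pool(hidden, mask)
embedding = l2_normalize(pooled)
```

With left-padding the last real token sits at the same position in every sequence of a batch, so pooling can simply take the final position; the mask lookup above also covers right-padded input.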

Supported Models

ONNX (default)

Model	Type	Dims	Max Tokens	Size
`n24q02m/Qwen3-Embedding-0.6B-ONNX`	Embedding	32-1024 (MRL)	32768	573 MB
`n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16`	Embedding	32-1024 (MRL)	32768	517 MB
`n24q02m/Qwen3-Reranker-0.6B-ONNX`	Reranker	-	40960	573 MB
`n24q02m/Qwen3-Reranker-0.6B-ONNX-Q4F16`	Reranker	-	40960	518 MB
`n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo`	Reranker	-	40960	598 MB

GGUF (optional, requires `llama-cpp-python`)

Model	Type	Dims	Max Tokens	Size
`n24q02m/Qwen3-Embedding-0.6B-GGUF`	Embedding	32-1024 (MRL)	32768	378 MB
`n24q02m/Qwen3-Reranker-0.6B-GGUF`	Reranker	-	40960	378 MB

HuggingFace Repos

Format	Embedding	Reranker
ONNX	n24q02m/Qwen3-Embedding-0.6B-ONNX	n24q02m/Qwen3-Reranker-0.6B-ONNX
GGUF	n24q02m/Qwen3-Embedding-0.6B-GGUF	n24q02m/Qwen3-Reranker-0.6B-GGUF

Installation

pip install qwen3-embed

# For GGUF support (quotes stop shells such as zsh from expanding the brackets)
pip install "qwen3-embed[gguf]"

Usage

Text Embedding

from qwen3_embed import TextEmbedding

# INT8 (default)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")

# Q4F16 (smaller, slightly less accurate)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16")

# GGUF (requires: pip install qwen3-embed[gguf])
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")

documents = [
    "Qwen3 is a multilingual embedding model.",
    "ONNX Runtime enables fast CPU inference.",
]

embeddings = list(model.embed(documents))
# Each embedding: numpy array of shape (1024,), L2-normalized

# Matryoshka Representation Learning (MRL) -- truncate to smaller dims
embeddings_256 = list(model.embed(documents, dim=256))
# Each embedding: numpy array of shape (256,), L2-normalized

# Query with instruction (for retrieval tasks)
queries = list(model.query_embed(
    ["What is Qwen3?"],
    task="Given a question, retrieve relevant passages",
))
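
Conceptually, MRL truncation keeps the leading components of an embedding and re-normalizes the result. The sketch below shows that idea on a toy vector; it assumes nothing about how `embed(..., dim=256)` is implemented internally, and `truncate_embedding` and `cosine_similarity` are hypothetical helpers, not part of qwen3-embed.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    if not 1 <= dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this reduces to the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


full = [0.5, 0.5, 0.5, 0.5]          # unit-length toy vector (real ones have 1024 dims)
small = truncate_embedding(full, 2)  # first two components, re-normalized
```

Vectors being compared should be truncated to the same dimension, since similarity is only meaningful between vectors of equal length.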

Reranking

from qwen3_embed import TextCrossEncoder

reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX")

# YesNo variant: much lower RAM use at inference (~598 MB vs ~12 GB)
# reranker = TextCrossEncoder(model_name="n24q02m/Qwen3-Reranker-0.6B-ONNX-YesNo")

query = "What is Qwen3?"
documents = [
    "Qwen3 is a series of large language models by Alibaba.",
    "The weather today is sunny.",
    "Qwen3-Embedding supports multilingual text embedding.",
]

scores = list(reranker.rerank(query, documents))
# scores: list of float in [0, 1], higher = more relevant

# Or rerank pairs directly
pairs = [
    ("What is AI?", "Artificial intelligence is a branch of computer science."),
    ("What is ML?", "Machine learning is a subset of AI."),
]
pair_scores = list(reranker.rerank_pairs(pairs))
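
As a rough illustration of yes/no logit scoring, a two-way softmax over the "yes" and "no" logits gives a score in [0, 1]. The sketch below assumes that formulation; how the library extracts the logits is not shown here, and `yes_no_score`, `rank_by_score`, and the logit values are hypothetical.

```python
import math


def yes_no_score(yes_logit, no_logit):
    """Probability of "yes" under a two-way softmax over the yes/no logits.

    Equivalent to sigmoid(yes_logit - no_logit); branches to avoid overflow.
    """
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def rank_by_score(documents, scores):
    """Pair each document with its score and sort from most to least relevant."""
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for three documents (not real model output).
logits = [(4.0, -1.0), (-3.0, 2.0), (0.5, 0.5)]
scores = [yes_no_score(yes, no) for yes, no in logits]
ranked = rank_by_score(["doc a", "doc b", "doc c"], scores)
```

The same `rank_by_score` helper can be applied to the scores returned by `rerank` to order the original documents.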

Configuration

GPU Acceleration

Both ONNX and GGUF backends auto-detect GPU when available (`Device.AUTO` is the default).

ONNX

Requires `onnxruntime-gpu` (CUDA) or `onnxruntime-directml` (Windows) instead of `onnxruntime`:

pip install onnxruntime-gpu  # NVIDIA CUDA
# or
pip install onnxruntime-directml  # Windows AMD/Intel/NVIDIA

from qwen3_embed import TextEmbedding, Device

# Auto-detect GPU (default)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX")

# Force CPU
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CPU)

# Force CUDA
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-ONNX", cuda=Device.CUDA)
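
One way auto-detection could work is to pick the first preferred execution provider that ONNX Runtime reports as available. The sketch below is an assumption for illustration, not the documented behavior of `Device.AUTO`: `pick_provider` and the preference order are hypothetical, while the provider names are the ones ONNX Runtime uses. The list is passed in as an argument so the example runs without ONNX Runtime installed.

```python
# Preference order is an assumption for illustration, not the library's documented behavior.
PREFERRED_PROVIDERS = (
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
)


def pick_provider(available):
    """Return the first preferred provider present in `available`."""
    for name in PREFERRED_PROVIDERS:
        if name in available:
            return name
    raise RuntimeError("no supported ONNX Runtime execution provider found")


# In a real session the list would come from onnxruntime.get_available_providers().
cpu_only = pick_provider(["CPUExecutionProvider"])
with_cuda = pick_provider(["CUDAExecutionProvider", "CPUExecutionProvider"])
```

Printing the output of `onnxruntime.get_available_providers()` is also a quick way to confirm that the GPU build of ONNX Runtime is the one actually installed.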

GGUF

GPU is handled by `llama-cpp-python`. The default `pip install "qwen3-embed[gguf]"` is CPU-only. For CUDA GPU support, build with:

CMAKE_ARGS="-DGGML_CUDA=on" pip install "qwen3-embed[gguf]"

from qwen3_embed import TextEmbedding, Device

# Auto-detect GPU (default, offloads all layers)
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF")

# Force CPU only
model = TextEmbedding(model_name="n24q02m/Qwen3-Embedding-0.6B-GGUF", cuda=Device.CPU)

Development

mise run setup   # Install deps + pre-commit hooks
mise run lint    # ruff check + format --check
mise run test    # pytest
mise run fix     # ruff auto-fix + format

Related Projects

wet-mcp -- MCP web search server with vector-based docs search, uses qwen3-embed for local embedding
mnemo-mcp -- MCP memory server with semantic search powered by qwen3-embed
better-code-review-graph -- Knowledge graph for code reviews, uses qwen3-embed for local ONNX embedding
modalcom-ai-workers -- GPU-serverless workers that convert Qwen3 models to ONNX/GGUF format

Contributing

See CONTRIBUTING.md.

License

Apache-2.0 -- See LICENSE. Original fastembed by Qdrant.

Release history

1.9.1

Apr 27, 2026

1.9.0

Apr 21, 2026

1.8.0 (this version)

Apr 4, 2026

1.7.0

Apr 3, 2026

1.6.0

Mar 31, 2026

1.5.1

Mar 20, 2026

1.5.0

Mar 18, 2026

1.4.3

Mar 13, 2026

1.4.0

Mar 12, 2026

1.4.0b1 pre-release

Mar 12, 2026

1.3.0

Mar 11, 2026

1.2.0

Mar 1, 2026

1.1.3

Feb 18, 2026

1.1.2

Feb 18, 2026

1.1.1

Feb 17, 2026

1.1.0

Feb 17, 2026

1.0.0

Feb 14, 2026

0.2.1

Feb 14, 2026

0.2.1b0 pre-release

Feb 14, 2026

0.2.0

Feb 13, 2026

0.2.0b0 pre-release

Feb 13, 2026

Download files

Source Distribution

qwen3_embed-1.8.0.tar.gz (180.6 kB)

Uploaded Apr 4, 2026 Source

Built Distribution

qwen3_embed-1.8.0-py3-none-any.whl (56.3 kB)

Uploaded Apr 4, 2026 Python 3

File details

Details for the file qwen3_embed-1.8.0.tar.gz.

File metadata

Download URL: qwen3_embed-1.8.0.tar.gz
Upload date: Apr 4, 2026
Size: 180.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.3 (`uv publish`, Ubuntu 24.04, CI)

File hashes

Hashes for qwen3_embed-1.8.0.tar.gz
Algorithm	Hash digest
SHA256	`a7a2bb99c734c0b047b3c7f7d036a17069472822037ace4e2c7b5fba2195475e`
MD5	`03e835c6178b60c50eba3aa7bd33a095`
BLAKE2b-256	`b98b86679949c5ffc301c1b3cab6f085ce02a93822382bbb6e113bada22d46d6`
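
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_expected` are hypothetical, and the sketch assumes the file has already been downloaded to a local path.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's SHA256 digest with the published one, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, `matches_expected("qwen3_embed-1.8.0.tar.gz", "a7a2bb99c734c0b047b3c7f7d036a17069472822037ace4e2c7b5fba2195475e")` should return `True` if the download is intact.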

File details

Details for the file qwen3_embed-1.8.0-py3-none-any.whl.

File metadata

Download URL: qwen3_embed-1.8.0-py3-none-any.whl
Upload date: Apr 4, 2026
Size: 56.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.3 (`uv publish`, Ubuntu 24.04, CI)

File hashes

Hashes for qwen3_embed-1.8.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fc23920384cf3bb2c3d1b17a446d12a5b22fd087fa1b0a60b48542e837b41c14`
MD5	`355d5596f2685282a9012918a0f13209`
BLAKE2b-256	`9375ba3b7c32c8fbc574cf969f9813bdd7120692835c39c4db84b10a61a79233`
