The LEGO set for custom vLLM model plugins — build, test, and deploy custom encoders, poolers, and kernels

Maintainers

ddickmann

Project description

vLLM Factory

Production inference for encoders, poolers, and structured prediction — as vLLM plugins.

12 encoder plugins · IOProcessor pre/post-processing · continuous batching · zero vLLM forks

# Install and serve any model in 3 commands (run from a clone of the repository)
pip install -e ".[gliner]"
pip install "vllm==0.15.1"          # always install vLLM last

vllm serve VAGOsolutions/SauerkrautLM-Multi-Reason-ModernColBERT \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --io-processor-plugin moderncolbert_io

# Query it
curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"VAGOsolutions/SauerkrautLM-Multi-Reason-ModernColBERT",
       "data":{"text":"European Central Bank monetary policy"}}'

Why vLLM Factory?

Decoder-based LLM serving is a solved problem. Encoder-based serving is not.

Production traffic is heterogeneous: staggered requests at unpredictable intervals, mixed sequence lengths, variable batch sizes — none of it neatly padded or synchronized. Vanilla PyTorch pipelines (GLiNER, PyLate, SentenceTransformers) process requests sequentially or require manual batching. They block on each model.forward(), waste GPU cycles waiting for the next request, and have no scheduler to absorb traffic spikes.

vLLM Factory bridges that gap. Every bespoke encoder architecture — ColBERT, GLiNER, entity linking, multimodal retrieval — gets the same production-grade scheduling and memory management as a 70B chat model. No fork. No custom server. Just vllm serve.

Each plugin ships an IOProcessor that handles all pre- and post-processing inside the vLLM process. Clients send structured JSON ({"data": {"text": ...}} or {"data": {"image": ...}}), and the IOProcessor converts to model inputs, runs inference, and returns structured results. No client-side tokenization. No manual extra_kwargs. Just POST /pooling.
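
The request shape described above can be reproduced from Python with the standard library alone. This is a minimal client sketch, not part of vLLM Factory: `build_pooling_request` and `post_pooling` are hypothetical helper names, and the only things taken from this README are the `/pooling` path and the `{"model": ..., "data": {...}}` body.

```python
import json
import urllib.request


def build_pooling_request(model: str, data: dict) -> bytes:
    """Serialize the {"model": ..., "data": {...}} body used by the curl examples."""
    return json.dumps({"model": model, "data": data}).encode("utf-8")


def post_pooling(base_url: str, model: str, data: dict, timeout: float = 30.0) -> dict:
    """POST the body to <base_url>/pooling and decode the JSON reply.

    Requires a running `vllm serve` instance; the reply layout depends on the plugin.
    """
    request = urllib.request.Request(
        url=f"{base_url}/pooling",
        data=build_pooling_request(model, data),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, `post_pooling("http://localhost:8000", "<model name>", {"text": "..."})` sends the same request as the curl command in the quick start.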

Capability	HF / SentenceTransformers	TEI (Text Embeddings Inference)	vLLM Factory
ColBERT multi-vector retrieval	❌	❌	✅
GLiNER span-level NER	❌	❌	✅
GLiNER2 schema extraction	❌	❌	✅
Entity linking + reranking pipeline	❌	❌	✅
Multimodal retrieval (ColPali/ColQwen/Nemotron)	❌	❌	✅
Continuous batching for encoders	❌	✅	✅
CUDA graphs for encoders	❌	✅	✅
Built-in pre/post-processing (IOProcessor)	❌	❌	✅
Plugin architecture (no fork)	—	—	✅
End-to-end parity tests	—	—	✅

Installation

pip install vllm-factory          # from PyPI (Linux, requires CUDA)

Or from source for development:

Critical: vLLM must be the last package installed. Other dependencies (especially gliner) can pull in transformers versions that conflict with vLLM. Installing vLLM last lets pip resolve the shared dependencies to versions that vLLM accepts.

Standard install

git clone https://github.com/ddickmann/vllm-factory.git && cd vllm-factory

# Step 1: Install vllm-factory + base dependencies (+ gliner for NER/linking models)
pip install -e ".[gliner]"

# Step 2: Install vLLM — ALWAYS LAST
pip install "vllm==0.15.1"

# Step 3: Apply the pooling patch (one-time, enables extra_kwargs passthrough)
python -m forge.patches.pooling_extra_kwargs
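
After the steps above, a quick way to confirm that the pin survived is to read the installed version from package metadata. The sketch below is an illustration only: `installed_version` and `describe_pin` are hypothetical helpers, and the comparison assumes the plain `0.15.1` version string that the install command requests.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

EXPECTED_VLLM = "0.15.1"  # the pin used in the install steps


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def describe_pin(found: Optional[str], expected: str = EXPECTED_VLLM) -> str:
    """Turn the lookup result into a short human-readable status line."""
    if found is None:
        return "vllm is not installed"
    if found != expected:
        return f"vllm {found} is installed, expected {expected}"
    return f"vllm {found} matches the pin"


print(describe_pin(installed_version("vllm")))
```

If the status line reports a different version, re-running the vLLM install step last should restore the pin.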

Minimal install (no GLiNER models)

If you only need embedding or ColBERT models (no NER/linking):

pip install -e .
pip install "vllm==0.15.1"
python -m forge.patches.pooling_extra_kwargs

Docker

FROM vllm/vllm-openai:v0.15.1

COPY . /app/vllm-factory
WORKDIR /app/vllm-factory

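# vLLM ships with the base image, so only vllm-factory and its extras are installed here.
# If this step changes a dependency that vLLM pins, reinstall vllm==0.15.1 afterwards.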
RUN pip install -e ".[gliner]"
RUN python -m forge.patches.pooling_extra_kwargs

CMD ["vllm", "serve", "VAGOsolutions/SauerkrautLM-Multi-Reason-ModernColBERT", \
     "--runner", "pooling", "--trust-remote-code", "--dtype", "bfloat16", \
     "--io-processor-plugin", "moderncolbert_io"]

Verify installation

make test-serve P=embeddinggemma   # Fastest model — starts server, runs test, reports pass/fail

Serving — all 12 models

Every plugin is served with vllm serve + --io-processor-plugin. The IOProcessor handles all tokenization, formatting, and output decoding server-side. Clients send simple JSON.

Embedding

EmbeddingGemma — dense CLS embeddings (300M)

vllm serve unsloth/embeddinggemma-300m \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching \
  --io-processor-plugin embeddinggemma_io

curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"unsloth/embeddinggemma-300m",
       "data":{"text":"What is the knapsack problem?"}}'

Late Interaction / Retrieval

ModernColBERT — multi-vector ColBERT (ModernBERT backbone)

vllm serve VAGOsolutions/SauerkrautLM-Multi-Reason-ModernColBERT \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin moderncolbert_io

curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"VAGOsolutions/SauerkrautLM-Multi-Reason-ModernColBERT",
       "data":{"text":"European Central Bank monetary policy"}}'

LFM2-ColBERT — Mamba/SSM hybrid ColBERT (350M)

vllm serve LiquidAI/LFM2-ColBERT-350M \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin lfm2_colbert_io

curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"LiquidAI/LFM2-ColBERT-350M",
       "data":{"text":"Mamba state-space model architecture"}}'

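Multi-vector models such as the two above return one vector per token, so a query and a document are compared with a late-interaction score rather than a single dot product. The sketch below shows the standard ColBERT MaxSim scoring on plain Python lists. It is an assumption about how a client might use the embeddings: this README does not specify the response layout or a scoring helper, and `maxsim` is a hypothetical name.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim(query_vecs: Sequence[Sequence[float]],
           doc_vecs: Sequence[Sequence[float]]) -> float:
    """ColBERT-style score: for each query token, take its best cosine match
    among the document tokens, then sum those maxima."""
    docs = [_normalize(v) for v in doc_vecs]
    if not docs:
        return 0.0
    total = 0.0
    for q in (_normalize(v) for v in query_vecs):
        total += max(sum(a * b for a, b in zip(q, d)) for d in docs)
    return total
```

Ranking a set of documents then amounts to computing `maxsim` once per document and sorting by the result.
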
Multimodal Retrieval (text + vision)

ColQwen3 — Qwen3-VL + ColPali (1.7B)

vllm serve VAGOsolutions/SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1 \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --max-model-len 8192 --limit-mm-per-prompt '{"image": 1}' \
  --io-processor-plugin colqwen3_io

# Text query
curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"VAGOsolutions/SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1",
       "data":{"text":"What does the revenue chart show?", "is_query": true}}'

# Image document
curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"VAGOsolutions/SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1",
       "data":{"image":"https://example.com/document.png", "is_query": false}}'

ColLFM2 — LFM2-VL + ColPali (450M, multimodal)

vllm serve VAGOsolutions/SauerkrautLM-ColLFM2-450M-v0.1 \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin collfm2_io

curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"VAGOsolutions/SauerkrautLM-ColLFM2-450M-v0.1",
       "data":{"text":"Summarize the table contents"}}'

Nemotron-ColEmbed — bidirectional Qwen3-VL (4B, multimodal)

vllm serve nvidia/nemotron-colembed-vl-4b-v2 \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin nemotron_colembed_io

curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/nemotron-colembed-vl-4b-v2",
       "data":{"text":"Neural network optimization techniques", "is_query": true}}'

Named Entity Recognition (GLiNER)

GLiNER models use custom model directories prepared by forge/model_prep.py. The IOProcessor handles all NER preprocessing (tokenization, span generation) and postprocessing (entity decoding) server-side.

Requires pip install -e ".[gliner]" at install time.

mmbert_gliner — ModernBERT + GLiNER span head

# Prepare model (one-time)
vllm-factory-prep --model VAGOsolutions/SauerkrautLM-GLiNER --output /tmp/sauerkraut-gliner-vllm

# Serve
vllm serve /tmp/sauerkraut-gliner-vllm \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin mmbert_gliner_io

curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"/tmp/sauerkraut-gliner-vllm",
       "data":{
         "text":"Apple Inc. announced a partnership with OpenAI. Tim Cook presented at WWDC 2024.",
         "labels":["company","person","event"],
         "threshold":0.3
       }}'

Returns: {"data": [{"text": "Apple Inc.", "label": "company", "score": 0.95}, ...]}
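
A response of the shape shown above is easy to post-process on the client. The following sketch groups entity texts by label and applies an optional score cut-off; `group_entities` is a hypothetical helper, and it assumes every entry carries the `text`, `label`, and `score` keys from the example.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(response: dict, min_score: float = 0.0) -> Dict[str, List[str]]:
    """Collect entity texts per label, keeping only entries at or above min_score."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in response.get("data", []):
        if entity.get("score", 0.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)
```

Calling `group_entities(reply, min_score=0.5)` on a reply returns a plain dictionary keyed by label, which is convenient for downstream filtering.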

mt5_gliner — mT5 encoder + multilingual GLiNER

vllm-factory-prep --model knowledgator/gliner-x-large --output /tmp/gliner-x-large-vllm

vllm serve /tmp/gliner-x-large-vllm \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin mt5_gliner_io

deberta_gliner — DeBERTa v2 + GLiNER span head

vllm-factory-prep --model urchade/gliner_small-v2.1 --output /tmp/gliner-pii-vllm

vllm serve /tmp/gliner-pii-vllm \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin deberta_gliner_io

deberta_gliner2 — DeBERTa v3 + GLiNER2 schema extraction

vllm-factory-prep --model fastino/gliner2-large-v1 --output /tmp/gliner2-vllm

vllm serve /tmp/gliner2-vllm \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin deberta_gliner2_io

Entity Linking & Reranking

deberta_gliner_linker — dual DeBERTa + LSTM + scorer (L3)

vllm serve plugins/deberta_gliner_linker/_model_cache \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin deberta_gliner_linker_io

curl -s http://localhost:8000/pooling \
  -H "Content-Type: application/json" \
  -d '{"model":"plugins/deberta_gliner_linker/_model_cache",
       "data":{
         "text":"Tesla announced record earnings in Austin.",
         "labels":["company","location"],
         "threshold":0.3,
         "candidate_labels":["Tesla Inc.","Austin, TX","TSLA"]
       }}'

modernbert_gliner_rerank — ModernBERT + projection + LSTM + scorer (L4)

vllm serve plugins/modernbert_gliner_rerank/_model_cache \
  --runner pooling --trust-remote-code --dtype bfloat16 \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --io-processor-plugin modernbert_gliner_rerank_io

Plugins

Embedding

Plugin	Architecture	Checkpoint	Params
`embeddinggemma`	Gemma + CLS projection	`unsloth/embeddinggemma-300m`	300M

Late Interaction / Retrieval

Plugin	Architecture	Checkpoint	Params
`moderncolbert`	ModernBERT + ColBERT	`VAGOsolutions/SauerkrautLM-Multi-Reason-ModernColBERT`	149M
`lfm2_colbert`	LFM2 (Mamba/SSM) + ColBERT	`LiquidAI/LFM2-ColBERT-350M`	350M
`colqwen3`	Qwen3-VL + ColPali (vision)	`VAGOsolutions/SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1`	1.7B
`collfm2`	LFM2-VL + ColPali (vision)	`VAGOsolutions/SauerkrautLM-ColLFM2-450M-v0.1`	450M
`nemotron_colembed`	Qwen3-VL bidirectional + ColBERT	`nvidia/nemotron-colembed-vl-4b-v2`	4B

Named Entity Recognition (GLiNER)

Plugin	Architecture	Checkpoint	Params
`mmbert_gliner`	ModernBERT + GLiNER span head	`VAGOsolutions/SauerkrautLM-GLiNER`	150M
`deberta_gliner`	DeBERTa v2 + GLiNER span head	`urchade/gliner_small-v2.1`	166M
`mt5_gliner`	mT5 encoder + multilingual GLiNER	`knowledgator/gliner-x-large`	800M
`deberta_gliner2`	DeBERTa v3 + GLiNER2 schema extraction	`fastino/gliner2-large-v1`	304M

Entity Linking & Reranking

Plugin	Architecture	Checkpoint	Params
`deberta_gliner_linker`	Dual DeBERTa + LSTM + scorer	`knowledgator/gliner-linker-large-v1.0`	304M
`modernbert_gliner_rerank`	ModernBERT + projection + LSTM	`knowledgator/gliner-linker-rerank-v1.0`	68M

Parity — all 12 plugins validated

Every plugin passes end-to-end parity testing: vllm serve → HTTP request → compare against reference implementation. No smoke tests — real model inference, real outputs.

NER models are validated by comparing actual entity text and labels (not counts). The gating metric is recall — every reference entity must be found by vLLM. vLLM finding extra entities is acceptable. Entity confidence scores are compared informally (score deltas reported but not gating, since dtype rounding produces small drift).
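
The recall gate described above can be written in a few lines. This is a sketch of the metric as stated, not the project's test code: `entity_recall` is a hypothetical name, and entities are assumed to be dictionaries with `text` and `label` keys.

```python
from typing import Dict, Iterable


def entity_recall(reference: Iterable[Dict], predicted: Iterable[Dict]) -> float:
    """Fraction of reference (text, label) pairs that also appear in the prediction.

    Extra predicted entities do not lower the score, matching the rule that
    vLLM finding additional entities is acceptable.
    """
    ref = {(e["text"], e["label"]) for e in reference}
    pred = {(e["text"], e["label"]) for e in predicted}
    if not ref:
        return 1.0  # nothing to find, so nothing was missed
    return len(ref & pred) / len(ref)
```

A parity check under this rule passes only when the function returns 1.0.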

Embedding/ColBERT models are validated by the cosine similarity between the full output vector and the reference tensor produced by the vanilla library.
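
For reference, the cosine similarity used as the metric here is the dot product of two vectors divided by the product of their norms. A minimal sketch on plain lists follows; `cosine_similarity` is a hypothetical helper, and how the project flattens multi-vector outputs before comparing them is not specified in this README.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

A score of 1.0 means the served output points in exactly the same direction as the reference; small bfloat16 rounding differences show up as values slightly below 1.0.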

All models run in bfloat16.

Plugin	Reference	Metric	Score
`embeddinggemma`	HF SentenceTransformer	cosine sim	1.0000
`mmbert_gliner`	GLiNER library	recall (entity text+label)	1.000
`deberta_gliner`	GLiNER library	recall (entity text+label)	1.000
`deberta_gliner2`	GLiNER2 library	recall (entity text+label)	1.000
`mt5_gliner`	GLiNER library	recall (entity text+label)	1.000
`deberta_gliner_linker`	Knowledgator GLinker	recall + link match	1.000
`modernbert_gliner_rerank`	Knowledgator GLinker	recall (entity text+label)	1.000
`moderncolbert`	PyLate	cosine sim	0.970
`lfm2_colbert`	HF transformers	cosine sim	1.000
`collfm2`	sauerkrautlm-colpali	cosine sim	0.9996
`colqwen3`	sauerkrautlm-colpali	cosine sim	0.9966
`nemotron_colembed`	HF transformers	cosine sim	0.9997

python scripts/serve_parity_test.py                # all 12 plugins
python scripts/serve_parity_test.py --plugin colqwen3  # single plugin

How it works

IOProcessor architecture

Each plugin registers an IOProcessor — a vLLM-native plugin that runs pre/post-processing inside the serving process. No client-side tokenization needed.

POST /pooling {"data": {"text": "..."}}
    │
    ▼
┌─────────────────────────────────────────────────┐
│  IOProcessor.parse_request()  → typed input      │
│  IOProcessor.pre_process()    → tokenized prompt  │
│  engine.encode()              → model forward     │
│  IOProcessor.post_process()   → structured output │
│  IOProcessor.output_to_response() → JSON response │
└─────────────────────────────────────────────────┘
    │
    ▼
{"data": [{"text": "Apple Inc.", "label": "company", "score": 0.95}, ...]}
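
The flow in the diagram can be mimicked with a toy pipeline to make the order of the stages concrete. The class below is illustrative only: it reuses the four stage names from the diagram but does not subclass or call vLLM's real IOProcessor interface, and `fake_encode` is a stand-in for `engine.encode()`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TextRequest:
    text: str


class ToyIOProcessor:
    """Illustrative stand-in; NOT vLLM's IOProcessor base class."""

    def parse_request(self, body: Dict) -> TextRequest:
        # Validate and type the incoming JSON body.
        return TextRequest(text=body["data"]["text"])

    def pre_process(self, request: TextRequest) -> List[str]:
        # A whitespace split stands in for a real tokenizer.
        return request.text.split()

    def post_process(self, model_output: List[float]) -> List[float]:
        # Real plugins decode spans or embeddings here; this only rounds.
        return [round(x, 3) for x in model_output]

    def output_to_response(self, result: List[float]) -> Dict:
        return {"data": result}


def handle(body: Dict, processor: ToyIOProcessor,
           encode: Callable[[List[str]], List[float]]) -> Dict:
    """Run the four stages around a model call, in the order shown in the diagram."""
    request = processor.parse_request(body)
    prompt = processor.pre_process(request)
    raw = encode(prompt)
    result = processor.post_process(raw)
    return processor.output_to_response(result)


def fake_encode(tokens: List[str]) -> List[float]:
    """Pretend model forward pass: one number per token."""
    return [float(len(token)) for token in tokens]
```

The point of the design is that the client only ever sees the first and last shapes (JSON in, JSON out); everything in between stays inside the server process.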

Custom Triton Kernels

Kernel	What it optimizes
`flash_deberta_attention`	Fused c2p + p2c disentangled relative position bias for DeBERTa
`fused_glu_mlp`	Fused GeGLU chunk + GELU + mul + dropout
`fused_rope_global`	RoPE for ModernBERT global attention layers
`fused_rope_local`	RoPE for ModernBERT sliding-window local attention
`fused_layernorm`	Single-pass mean/var/normalize + affine
`fused_dropout_residual`	In-place dropout + residual add

Repository structure

vllm-factory/
├── plugins/              # 12 model plugins (each with io_processor.py + parity_test.py)
├── models/               # Encoder backbones (DeBERTa, ModernBERT, mT5, ...)
├── kernels/              # Custom Triton kernels
├── poolers/              # Shared pooler heads (ColBERT, GLiNER, ColPali, linker)
├── forge/                # Shared infrastructure (model_prep, patches, server utilities)
├── examples/             # Ready-to-run example scripts
├── scripts/              # Parity test orchestrator, reference generators
├── notebooks/            # Jupyter notebooks for each model family
├── Makefile              # install · serve · test · bench · lint
└── pyproject.toml        # All 12 plugins registered as vLLM entry points

Building custom plugins

See docs/PLUGIN_GUIDE.md for the step-by-step walkthrough.

A new plugin needs:

File	Purpose
`config.py`	HuggingFace-compatible config (dimensions, layers)
`model.py`	Encoder forward path + `self.pooler` wiring
`io_processor.py`	IOProcessor — parse, pre-process, post-process, response
`parity_test.py`	Validation against reference implementation

Why it's fast

Vanilla PyTorch blocks. One model.forward() at a time. If request B arrives while request A is mid-inference, B waits. Under staggered, heterogeneous load — which is what production actually looks like — GPU utilization craters.

vLLM schedules. Incoming requests are continuously batched by the async scheduler. Variable-length sequences are packed efficiently via PagedAttention. CUDA graphs eliminate kernel launch overhead. The GPU stays saturated regardless of arrival pattern.
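
The difference between blocking and scheduling can be illustrated with a deliberately idealized toy model. It assumes one forward pass costs the same fixed time for any batch up to `max_batch`, which overstates the benefit on real hardware, and it is not a description of vLLM's actual scheduler; both function names are hypothetical.

```python
from typing import Sequence


def sequential_finish(arrivals: Sequence[float], step: float) -> float:
    """Time at which the last request completes when requests run one at a time."""
    free_at = 0.0
    for arrival in sorted(arrivals):
        free_at = max(free_at, arrival) + step
    return free_at


def batched_finish(arrivals: Sequence[float], step: float, max_batch: int) -> float:
    """Same workload, but every request already waiting joins the next forward pass."""
    pending = sorted(arrivals)
    free_at = 0.0
    i = 0
    while i < len(pending):
        start = max(free_at, pending[i])
        j = i
        # Admit everything that has arrived by `start`, up to the batch limit.
        while j < len(pending) and pending[j] <= start and j - i < max_batch:
            j += 1
        free_at = start + step
        i = j
    return free_at


arrivals = [0.0, 0.1, 0.2, 0.3]  # staggered requests, in seconds
print(sequential_finish(arrivals, step=1.0))            # 4.0
print(batched_finish(arrivals, step=1.0, max_batch=8))  # 2.0
```

In this toy run the first request starts alone, and the three that arrive while it is running share the second pass, so the batched server finishes in two steps instead of four.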

vLLM Factory brings this to every encoder architecture with zero custom serving code.

Measured speedups (RTX 4090, 124 requests, 512 tokens)

Model	Vanilla	vLLM Factory	Speedup
LFM2-ColBERT (350M, Mamba/SSM)	HF AutoModel	`vllm serve`	6.7×
MT5 GLiNER (800M, NER)	GLiNER lib	`vllm serve`	2.7×
EmbeddingGemma (300M, dense)	SentenceTransformers	`vllm serve`	1.8×

Design principles

No vLLM forks — plugins, not patches
Parity before performance — every optimization validated against reference
IOProcessor-first — all pre/post-processing runs server-side
vLLM must install last — dependency order is enforced to avoid version conflicts
Task-aware architecture — backbone + pooler + IOProcessor = single deployment contract

Requirements

Python 3.11+
PyTorch 2.0+
vLLM 0.15+ (installed last)
NVIDIA GPU with CUDA support (production)
Triton 2.0+ (for custom kernels, optional)
macOS users: see docs/macos_vllm.md for local dev setup (CPU only, no production serving)

Enterprise support

Running vLLM Factory in production? Latence AI provides custom plugin development, performance optimization, and deployment review.

→ hello@latence.ai · GitHub Issues

Contributing

See CONTRIBUTING.md.

make install       # install everything (correct dep order)
make serve P=name  # serve a plugin
make test P=name   # run parity test
make lint          # ruff check

Acknowledgements

Project	Authors	Contribution
vLLM	vLLM Team	High-throughput serving engine
GLiNER	Urchade Zaratiana et al.	Generalist NER architecture
FlashDeBERTa	Knowledgator	Triton kernel for DeBERTa attention
GLinker	Knowledgator	Entity linking architecture
PyLate	LightOn AI	ColBERT training/inference reference
sauerkrautlm-colpali	VAGO Solutions	ColQwen/ColPali models
NV-Retriever	NVIDIA	Nemotron-ColEmbed architecture
LFM2	Liquid AI	LFM2 Mamba/SSM hybrid models
ColBERT	Omar Khattab (Stanford)	Late-interaction retrieval paradigm
ColPali	Illuin Technology	Vision-language retrieval
ModernBERT	Answer.AI & LightOn	Modern BERT architecture

License

Apache 2.0

Release history

0.2.1

Apr 13, 2026

0.2.0

Apr 5, 2026

0.1.2 (this version)

Mar 31, 2026

0.1.1

Mar 31, 2026

0.1.0

Mar 31, 2026

Download files

Source Distribution

vllm_factory-0.1.2.tar.gz (262.6 kB)

Uploaded Mar 31, 2026 Source

Built Distribution

vllm_factory-0.1.2-py3-none-any.whl (322.3 kB)

Uploaded Mar 31, 2026 Python 3

File details

Details for the file vllm_factory-0.1.2.tar.gz.

File metadata

Download URL: vllm_factory-0.1.2.tar.gz
Upload date: Mar 31, 2026
Size: 262.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vllm_factory-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`fe33539cfd291e684faee5ad985c417caacb250b06bfb2cfacc4aeccc0e44926`
MD5	`dbf21ed3179f8b3865dd349a7f90db8b`
BLAKE2b-256	`1adcde2189132f19b45755272ff1a4832f3122ae86362cb617fce25f7650fd60`

Provenance

The following attestation bundles were made for vllm_factory-0.1.2.tar.gz:

Publisher: release.yml on ddickmann/vllm-factory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vllm_factory-0.1.2.tar.gz
- Subject digest: fe33539cfd291e684faee5ad985c417caacb250b06bfb2cfacc4aeccc0e44926
- Sigstore transparency entry: 1203617561
- Sigstore integration time: Mar 31, 2026
Source repository:
- Permalink: ddickmann/vllm-factory@84db321ffc6a0a468235e5ccb1fba369c743fa2b
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/ddickmann
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@84db321ffc6a0a468235e5ccb1fba369c743fa2b
- Trigger Event: push

File details

Details for the file vllm_factory-0.1.2-py3-none-any.whl.

File metadata

Download URL: vllm_factory-0.1.2-py3-none-any.whl
Upload date: Mar 31, 2026
Size: 322.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vllm_factory-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4eb40edf49a81abb55dd811a26af4e8070b12051f3cccb5c8f9b333c69ecca3a`
MD5	`b944700c69d9fa04dfe2fc303d316967`
BLAKE2b-256	`065a284815b40bfc0d92ff41a488140d79389f75548c96665b3a7a8dcd3fc126`

Provenance

The following attestation bundles were made for vllm_factory-0.1.2-py3-none-any.whl:

Publisher: release.yml on ddickmann/vllm-factory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vllm_factory-0.1.2-py3-none-any.whl
- Subject digest: 4eb40edf49a81abb55dd811a26af4e8070b12051f3cccb5c8f9b333c69ecca3a
- Sigstore transparency entry: 1203617566
- Sigstore integration time: Mar 31, 2026
Source repository:
- Permalink: ddickmann/vllm-factory@84db321ffc6a0a468235e5ccb1fba369c743fa2b
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/ddickmann
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@84db321ffc6a0a468235e5ccb1fba369c743fa2b
- Trigger Event: push

vllm-factory 0.1.2
