
LangExtract vLLM Provider

A provider plugin for LangExtract that supports vLLM models with high-performance inference using PagedAttention.

Installation

pip install langextract-vllm
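Depending on your environment, vLLM itself may need to be installed separately (see Requirements below):

pip install vllm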

Supported Model IDs

Model IDs use the following format:

  • vLLM models: vllm:<model_name_or_path>

Where <model_name_or_path> can be either of the following (see the example below):

  • A Hugging Face model repository ID (e.g., meta-llama/Llama-2-7b-chat-hf)
  • A local model path (e.g., /path/to/model)
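For illustration, both forms are passed as the model_id string (the local path below is a placeholder):

import langextract as lx

# Hugging Face repository ID
config_hf = lx.factory.ModelConfig(model_id="vllm:meta-llama/Llama-2-7b-chat-hf")

# Local model path (placeholder)
config_local = lx.factory.ModelConfig(model_id="vllm:/path/to/model")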

Usage

Basic Usage

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="vllm:meta-llama/Llama-2-7b-chat-hf",
    provider="VLLMLanguageModel",  # optional as vllm: will resolve to the model
    provider_kwargs=dict(
        temperature=0.7,
        top_p=0.9,
        max_tokens=1024,
        gpu_memory_utilization=0.7,
        max_model_len=1024,
    ),
)

model = lx.factory.create_model(config)

result = lx.extract(
    model=model,
    text_or_documents="Your input text",
    prompt_description="Extract entities",
    examples=[...],
)

Advanced Configuration

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="vllm:mistralai/Mistral-7B-Instruct-v0.1",
    provider="VLLMLanguageModel",
    provider_kwargs=dict(
        temperature=0.8,
        top_p=0.95,
        max_tokens=2048,
        gpu_memory_utilization=0.8,
        max_model_len=4096,
        max_workers=4,
        # Additional vLLM engine parameters
        enforce_eager=True,
        disable_custom_all_reduce=True,
    ),
)

model = lx.factory.create_model(config)

result = lx.extract(
    model=model,
    text_or_documents="Your input text",
    prompt_description="Extract named entities and their relationships",
    examples=[...],
)

Multi-GPU Configuration

For large models or high throughput scenarios, you can distribute the model across multiple GPUs:

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="vllm:meta-llama/Llama-2-70b-chat-hf",
    provider="VLLMLanguageModel",
    provider_kwargs=dict(
        temperature=0.7,
        top_p=0.9,
        max_tokens=2048,
        # Multi-GPU configuration
        tensor_parallel_size=4,  # Distribute across 4 GPUs
        gpu_memory_utilization=0.8,
        max_model_len=4096,
        # Optimize for multi-GPU setup
        disable_custom_all_reduce=True,
        enforce_eager=True,
        max_workers=8,  # More parallel workers for better GPU utilization
    ),
)

model = lx.factory.create_model(config)

# Process multiple documents; max_workers above controls parallel inference within each call
results = []
documents = ["Document 1", "Document 2", "Document 3", ...]

for doc in documents:
    result = lx.extract(
        model=model,
        text_or_documents=doc,
        prompt_description="Extract key information",
        examples=[...],
    )
    results.append(result)

Multi-GPU Parameters:

  • tensor_parallel_size: Number of GPUs to use (typically a power of 2: 1, 2, 4, 8)
  • pipeline_parallel_size: Pipeline parallelism degree (advanced feature)
  • disable_custom_all_reduce=True: Recommended for multi-GPU stability

Using Local vLLM Server

The provider can also connect to a locally running vLLM server. First, start the server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000

Then point the provider at the server URL:

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="vllm:http://localhost:8000/v1",
    provider="VLLMLanguageModel",
    provider_kwargs=dict(
        temperature=0.7,
        max_tokens=1024,
        # Server connection settings
        timeout=60.0,
    ),
)

model = lx.factory.create_model(config)

result = lx.extract(
    model=model,
    text_or_documents="Your input text",
    prompt_description="Extract information",
    examples=[...],
)

Local Server Benefits:

  • Decouple model serving from client applications
  • Share one model instance across multiple clients
  • Better resource management and scaling
  • Easier deployment in production environments

Configuration Parameters

Provider Arguments (provider_kwargs)

Parameter | Type | Default | Description
temperature | float | 0.7 | Sampling temperature (0.0-2.0)
top_p | float | 0.9 | Top-p sampling parameter
max_tokens | int | 1024 | Maximum tokens to generate
gpu_memory_utilization | float | 0.7 | GPU memory utilization ratio (0.0-1.0)
max_model_len | int | 1024 | Maximum sequence length
max_workers | int | 1 | Maximum parallel workers
enforce_eager | bool | True | Disable torch.compile for stability
disable_custom_all_reduce | bool | True | Disable custom all-reduce operations
tensor_parallel_size | int | 1 | Number of GPUs for tensor parallelism (1, 2, 4, 8)
pipeline_parallel_size | int | 1 | Number of GPUs for pipeline parallelism
api_key | str | None | API key for server connection (use "EMPTY" for vLLM)
timeout | float | 60.0 | Request timeout in seconds for server connections

Additional vLLM engine parameters can be passed through provider_kwargs. Refer to the vLLM documentation for complete parameter reference.
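For example, standard vLLM engine arguments such as dtype or trust_remote_code can be forwarded this way (whether a given argument is accepted depends on your vLLM version):

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="vllm:meta-llama/Llama-2-7b-chat-hf",
    provider_kwargs=dict(
        max_tokens=1024,
        # Extra vLLM engine arguments, forwarded to the engine as-is
        dtype="float16",
        trust_remote_code=True,
    ),
)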

Performance Optimization

Memory Management

  • Adjust gpu_memory_utilization based on your GPU memory
  • Lower max_model_len to reduce KV cache memory usage
  • Use max_workers to control batch processing
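For a memory-constrained GPU, a conservative configuration might look like this (the values are illustrative and should be tuned to your hardware):

import langextract as lx

config = lx.factory.ModelConfig(
    model_id="vllm:meta-llama/Llama-2-7b-chat-hf",
    provider_kwargs=dict(
        gpu_memory_utilization=0.5,  # leave headroom for other GPU processes
        max_model_len=512,           # smaller KV cache
        max_workers=1,               # limit concurrent generation
    ),
)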

Inference Optimization

  • Set enforce_eager=True for stability (disables torch.compile)
  • Use disable_custom_all_reduce=True for multi-GPU setups
  • Batch multiple requests for better throughput

Multi-GPU Optimization

  • Use tensor_parallel_size for large models that don't fit on a single GPU
  • Ensure GPU memory is balanced across all devices
  • Set tensor_parallel_size to a power of 2 (typically 1, 2, 4, 8)
  • Increase max_workers proportionally to GPU count for better utilization

Requirements

  • Python >= 3.10
  • CUDA-compatible GPU (for GPU acceleration)
  • vLLM >= 0.5.0
  • PyTorch >= 2.0.0
  • Transformers >= 4.30.0

Development

# Install in development mode
uv pip install -e .

# Run tests
uv run test_plugin.py

# Build package
uv build

# Publish to PyPI
uv publish

Troubleshooting

Common Issues

  1. CUDA Out of Memory: Reduce gpu_memory_utilization or max_model_len
  2. Model Loading Errors: Ensure the model path/repository is correct and accessible
  3. Performance Issues: Increase max_workers for better batching

Error Messages

  • vLLM library not installed: Install vLLM with pip install vllm
  • InferenceRuntimeError: Check GPU memory and model compatibility

License

MIT License - see LICENSE file for details.
