
spacy-accelerate

Accelerate spaCy transformers with TensorRT/ONNX Runtime. A drop-in optimization for transformer-based spaCy pipelines, with Docker-verified GPU benchmark workflows.

Installation

spacy-accelerate depends on a CUDA/TensorRT stack that must stay version-aligned. The two failure modes we hit in practice were:

  • a second dependency resolution pass upgrading parts of the stack to different CUDA majors;
  • CUDA/TensorRT shared libraries from pip wheels not being visible to CuPy / ONNX Runtime.

The package now pins the runtime versions in pyproject.toml, and it configures the pip-installed native libraries automatically on import.

Benchmark Dockerfiles live under benchmarks/docker/, and canonical benchmark artifacts are saved under artifacts/benchmarks/docker/. The .dockerignore is kept at the repository root because Docker build-context filtering applies to the entire repository root.

PyPI install

pip install spacy-accelerate
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

The second command is still required: it guarantees you get the TensorRT-enabled onnxruntime-gpu build from NVIDIA's index.

Source / editable install

pip install -r requirements.txt
pip install -e . --no-deps

Do not run plain pip install -e . after that. It can trigger a second resolver pass and replace the pinned CUDA 12 stack with newer incompatible packages.

Verify the installation:

python -m spacy_accelerate

You should see TensorRT EP : OK and CUDA EP : OK in the output.
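If you prefer a programmatic check, ONNX Runtime itself reports which execution providers its build supports. This is a minimal sketch using only standard onnxruntime calls; the provider names are ONNX Runtime identifiers, not spacy-accelerate ones:

import spacy_accelerate  # import first so the pip-installed CUDA/TensorRT libraries are configured
import onnxruntime as ort

print(ort.get_available_providers())
# A working install should list 'TensorrtExecutionProvider' and 'CUDAExecutionProvider'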

Requirements:

  • Python 3.11+
  • CUDA 12.x
  • NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
  • spaCy 3.8+ with spacy-transformers

Quick Start

import spacy
import spacy_accelerate

# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")

# Optimize with one line!
nlp = spacy_accelerate.optimize(nlp, precision="fp16")

# Use as normal - same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]

# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))

API Reference

optimize(nlp, **kwargs)

Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.

Parameters:

Parameter        Type                          Default                     Description
nlp              spacy.Language                required                    spaCy pipeline with transformer
precision        "fp32" | "fp16"               "fp16"                      Model precision
provider         "tensorrt" | "cuda" | "cpu"   "cuda"                      Execution provider
cache_dir        Path | str                    ~/.cache/spacy-accelerate   ONNX model cache directory
warmup           bool                          True                        Run warmup inference
device_id        int                           0                           CUDA device ID
max_batch_size   int                           128                         Max batch size for IO Binding
max_seq_length   int                           512                         Max sequence length for IO Binding
use_io_binding   bool                          True                        Use zero-copy IO Binding
verbose          bool                          False                       Enable verbose logging

TensorRT-specific parameters:

Parameter                        Type   Default   Description
trt_max_workspace_size           int    4 GB      TensorRT workspace size in bytes
trt_builder_optimization_level   int    3         Builder optimization level (0-5)
trt_timing_cache                 bool   True      Enable timing cache

Returns: The optimized spacy.Language object (modified in-place).
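
As a usage sketch, here is a call that sets several of the parameters above explicitly; the specific values are illustrative, not recommendations:

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="cuda",        # "tensorrt", "cuda", or "cpu"
    precision="fp16",
    device_id=0,
    max_batch_size=64,      # IO Binding buffers sized for up to 64 docs per batch
    max_seq_length=256,     # and up to 256 tokens per doc
    use_io_binding=True,
    warmup=True,
    verbose=True,
)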

Cache Management

import spacy_accelerate

# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")

# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")

# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")

Performance

Canonical benchmark results are the Docker runs under artifacts/benchmarks/docker.

Benchmark commands and runner details are maintained in benchmarks/README.md.

Latest full-pipeline Docker measurement for en_core_web_trf on an NVIDIA RTX 4000 SFF Ada Generation GPU, CoNLL-2003 test set, batch_size=128, with one discarded warm-up (prime) pass and three measured passes averaged:

Execution Provider        Speed (WPS)   Speedup vs PyTorch   Accuracy
PyTorch Baseline (FP32)   6,241         1.00x                100.00%
PyTorch Baseline (FP16)   6,166         0.99x                100.00%
CUDA FP32                 9,910         1.59x                99.90%
CUDA FP16                 15,763        2.53x                99.75%
TensorRT FP32             10,552        1.69x                99.95%
TensorRT FP16             16,935        2.71x                99.50%

Latest Docker NER-only measurement for en_core_web_trf with tagger, parser, attribute_ruler, and lemmatizer disabled:

Execution Provider        Speed (WPS)   Speedup vs PyTorch   Accuracy
PyTorch Baseline (FP32)   7,066         1.00x                100.00%
PyTorch Baseline (FP16)   6,859         0.97x                100.00%
CUDA FP32                 11,972        1.69x                99.90%
CUDA FP16                 22,394        3.17x                99.75%
TensorRT FP32             13,138        1.86x                99.95%
TensorRT FP16             24,823        3.51x                99.65%

Examples

Using TensorRT for Maximum Performance

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    trt_max_workspace_size=8 * 1024**3,  # 8GB
    trt_builder_optimization_level=5,     # Maximum optimization
)

# First inference builds TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")

Custom Cache Directory

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    cache_dir="/path/to/custom/cache",
    precision="fp16",
)

Verbose Mode for Debugging

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    verbose=True,  # Print detailed logs
)

Supported Models

The only spaCy model currently confirmed as supported is:

  • en_core_web_trf

Earlier versions of this section listed transformer architecture families rather than published spaCy package names. Internally, the exporter and architecture-detection logic currently target curated-transformer / RoBERTa-style backbones, with partial code paths for the BERT and XLM-RoBERTa families, but those are not yet claimed as generally supported spaCy packages.

How It Works

  1. Weight Mapping: Extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.

  2. ONNX Export: Exports the mapped model to ONNX format with dynamic batch and sequence dimensions (see the sketch after this list).

  3. FP16 Optimization (optional): Applies BERT-style optimizations and converts to FP16 for faster inference.

  4. Runtime Patching: Replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same interface.

  5. Caching: Converted models are cached to avoid re-conversion on subsequent loads.
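
The sketch below roughly illustrates steps 2 and 4 under simplifying assumptions: it exports a plain HuggingFace RoBERTa backbone ("roberta-base", the file name "backbone.onnx", and the opset are all illustrative) with dynamic batch/sequence axes, then loads it into an ONNX Runtime session with the TensorRT and CUDA providers. spacy-accelerate's real exporter additionally performs the weight mapping, FP16 conversion, and caching described above.

import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

# Step 2: export a transformer backbone with dynamic batch and sequence dimensions.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.config.return_dict = False  # return a plain tuple so tracing/export is straightforward
model.eval()
sample = tokenizer("example input", return_tensors="pt")

torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "backbone.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
        "pooler_output": {0: "batch"},
    },
    opset_version=17,
)

# Step 4 (simplified): the exported model is served through an ONNX Runtime session;
# spacy-accelerate wraps such a session in a proxy exposing the original transformer's interface.
session = ort.InferenceSession(
    "backbone.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
outputs = session.run(
    None,
    {
        "input_ids": sample["input_ids"].numpy(),
        "attention_mask": sample["attention_mask"].numpy(),
    },
)
print(outputs[0].shape)  # (batch, sequence, hidden_size)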

Troubleshooting

TensorRT provider not available

Run the diagnostic tool first:

python -m spacy_accelerate

If you see TensorRT EP : MISSING, the NVIDIA build of onnxruntime-gpu is not installed. Fix it by re-running the second command from the installation instructions:

pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

libnvinfer.so / libcublas.so / libcublasLt.so not found

If you see errors like libnvinfer.so.10, libcublas.so.12, or libcublasLt.so.12: cannot open shared object file:

Automatic fix: spacy-accelerate automatically configures both the TensorRT libraries (under site-packages/tensorrt_libs) and the CUDA libraries (under site-packages/nvidia/*/lib) installed by pip. Import spacy_accelerate before creating ONNX Runtime sessions or calling spacy.require_gpu().
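
In practice that just means putting the import first. A minimal sketch:

import spacy_accelerate  # configures the pip-installed TensorRT/CUDA libraries on import
import spacy

spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(nlp, provider="tensorrt", precision="fp16")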

Manual fix: If the automatic configuration doesn't work (e.g., in scripts that never import spacy_accelerate), set LD_LIBRARY_PATH yourself:

SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE_PACKAGES/tensorrt_libs:$SITE_PACKAGES/nvidia/cublas/lib:$SITE_PACKAGES/nvidia/cuda_runtime/lib:$SITE_PACKAGES/nvidia/cudnn/lib:$LD_LIBRARY_PATH"

CUDA out of memory

Reduce workspace size or batch size:

nlp = spacy_accelerate.optimize(
    nlp,
    trt_max_workspace_size=2 * 1024**3,  # 2GB instead of 4GB
    max_batch_size=16,                    # Smaller batches
)

First inference is slow

TensorRT builds optimized engines on first run. Enable caching:

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_timing_cache=True,  # Cache timing data
)

License

MIT License

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
