

spacy-accelerate


Accelerate spaCy transformer pipelines with TensorRT and ONNX Runtime. Drop-in replacement — one line of code. Speedup depends on your GPU: 1.2–3.5× faster inference on the tested setups, with small accuracy deltas.

Repository: GitHub • Package: PyPI

Requirements

  • Python 3.11+
  • CUDA 12.x
  • NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
  • spaCy 3.8+ with spacy-transformers

Installation

PyPI

pip install spacy-accelerate
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

The second command installs the TensorRT-enabled build of onnxruntime-gpu from NVIDIA's index. It is required because the default PyPI build does not include the TensorRT execution provider.

[!NOTE] spacy-accelerate pins the full CUDA/TensorRT stack to keep versions aligned. On import it also configures the native library paths automatically, so no manual LD_LIBRARY_PATH setup is needed in most cases.

Source / editable install

pip install -r requirements.txt
pip install -e . --no-deps

[!WARNING] Use --no-deps when doing an editable install. Running plain pip install -e . triggers a second resolver pass that can replace the pinned CUDA 12 stack with newer, incompatible packages.

Verify the installation

python -m spacy_accelerate

You should see TensorRT EP : OK and CUDA EP : OK in the output.
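As an additional check, you can list the ONNX Runtime execution providers directly; this uses onnxruntime's public API and is independent of spacy-accelerate:

import onnxruntime as ort

# The NVIDIA build should expose both GPU providers alongside the CPU fallback.
print(ort.get_available_providers())
# e.g. ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']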

Quick Start

import spacy
import spacy_accelerate

# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")

# Optimize with one line
nlp = spacy_accelerate.optimize(nlp, precision="fp16")

# Use as normal — same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]

# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))

API Reference

optimize(nlp, **kwargs)

Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| nlp | spacy.Language | required | spaCy pipeline with transformer |
| precision | "fp32" \| "fp16" | "fp16" | Model precision |
| provider | "tensorrt" \| "cuda" \| "cpu" | "cuda" | Execution provider |
| cache_dir | Path \| str | ~/.cache/spacy-accelerate | ONNX model cache directory |
| warmup | bool | True | Run warmup inference |
| device_id | int | 0 | CUDA device ID |
| max_batch_size | int | 128 | Max batch size for IO Binding |
| max_seq_length | int | 512 | Max sequence length for IO Binding |
| use_io_binding | bool | True | Use zero-copy IO Binding |
| verbose | bool | False | Enable verbose logging |

TensorRT-specific parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| trt_max_workspace_size | int | 4 * 1024**3 | TensorRT workspace size in bytes |
| trt_builder_optimization_level | int | 3 | Optimization level (0–5) |
| trt_timing_cache | bool | True | Enable timing cache |

Advanced parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| fixed_batch_size | int \| None | None | Export ONNX with a fixed batch size (dynamic if None) |
| batch_buckets | list[int] \| None | None | Pre-compiled TRT batch sizes; defaults to [1, 2, 4, 8, 16, 32, 64, 128] for TensorRT |
| fixed_seq_length | int \| None | None | Pad/truncate all sequences to this length on the GPU |
| align_seq_length | int | 16 | Pad sequence length to the nearest multiple of this value |

Returns: The optimized spacy.Language object (modified in-place).
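For example, the provider and the advanced padding parameters from the tables above can be combined in a single call (a sketch with illustrative values, not tuned recommendations):

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

# Pre-compile a small set of TensorRT batch sizes and align sequence
# lengths to multiples of 32 (illustrative values).
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    batch_buckets=[1, 8, 32],
    align_seq_length=32,
)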

Cache Management

import spacy_accelerate

# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")

# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")

# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")

Examples

Maximum performance with TensorRT

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    trt_max_workspace_size=8 * 1024**3,  # 8 GB
    trt_builder_optimization_level=5,     # Maximum optimization
)

# First inference builds the TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")

Custom cache directory

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    cache_dir="/path/to/custom/cache",
    precision="fp16",
)

Verbose mode for debugging

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    verbose=True,
)

Performance

Conditions: en_core_web_trf, CoNLL-2003 test set, batch_size=128, 1 warmup pass + 3 measured passes averaged.

How much speedup to expect

The relative gain depends on your GPU. On a faster GPU, PyTorch is already faster — GPU compute stops being the bottleneck and the overhead shifts to the spaCy pipeline (tokenization, NER decoding, Python). ONNX Runtime does not accelerate that part.
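If you want to know where your own hardware lands, a minimal throughput measurement looks like this (a generic sketch; texts is any list of strings and nlp is the already-optimized pipeline):

import time

def words_per_second(nlp, texts, batch_size=128):
    # One pass over the corpus: count tokens produced, divide by wall-clock time.
    # Run a warmup pass first if you want numbers comparable to the tables below.
    start = time.perf_counter()
    n_tokens = sum(len(doc) for doc in nlp.pipe(texts, batch_size=batch_size))
    return n_tokens / (time.perf_counter() - start)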

| GPU | Mode | PyTorch baseline | Best provider | Best WPS | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 4000 SFF Ada ¹ | Full pipeline | 6,241 WPS | TensorRT FP16 | 16,935 | 2.71× |
| RTX 4000 SFF Ada ¹ | NER only | 7,066 WPS | TensorRT FP16 | 24,823 | 3.51× |
| A100 80 GB ² | Full pipeline | 6,881 WPS | CUDA FP16 | 9,670 | 1.41× |
| A100 80 GB ² | NER only | 9,486 WPS | CUDA FP16 | 15,291 | 1.61× |
| RTX 4090 ³ | Full pipeline | 9,924 WPS | TensorRT FP16 | 11,726 | 1.18× |
| RTX 4090 ³ | NER only | 14,728 WPS | TensorRT FP16 | 19,313 | 1.31× |

NER-only mode disables the tagger, parser, attribute_ruler, and lemmatizer components. Only the transformer and ner components run, which reduces non-GPU overhead and yields higher speedups.
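A simple way to run in this mode is to disable those components when loading the model (standard spaCy API; component names as listed above):

import spacy
import spacy_accelerate

# Disabled components are never run, so their per-document overhead
# disappears; only the transformer and ner components execute.
nlp = spacy.load(
    "en_core_web_trf",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer"],
)
nlp = spacy_accelerate.optimize(nlp, precision="fp16")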


RTX 4000 SFF Ada Generation — full results ¹

Full pipeline:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 6,241 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,166 | 0.99× | 100.00% |
| CUDA EP FP32 | 9,910 | 1.59× | 99.90% |
| CUDA EP FP16 | 15,763 | 2.53× | 99.75% |
| TensorRT FP32 | 10,552 | 1.69× | 99.95% |
| TensorRT FP16 | 16,935 | 2.71× | 99.50% |

NER only:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 7,066 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,859 | 0.97× | 100.00% |
| CUDA EP FP32 | 11,972 | 1.69× | 99.90% |
| CUDA EP FP16 | 22,394 | 3.17× | 99.75% |
| TensorRT FP32 | 13,138 | 1.86× | 99.95% |
| TensorRT FP16 | 24,823 | 3.51× | 99.65% |

A100 80 GB — full results ²

Full pipeline:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 6,881 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,882 | 1.00× | 100.00% |
| CUDA EP FP32 | 8,822 | 1.28× | 99.85% |
| CUDA EP FP16 | 9,670 | 1.41× | 99.75% |
| TensorRT FP32 | 8,846 | 1.29× | 99.95% |
| TensorRT FP16 | 9,491 | 1.38× | 99.05% |

NER only:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 9,486 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,414 | 0.99× | 100.00% |
| CUDA EP FP32 | 13,554 | 1.43× | 99.85% |
| CUDA EP FP16 | 15,291 | 1.61× | 99.75% |
| TensorRT FP32 | 13,579 | 1.43× | 99.95% |
| TensorRT FP16 | 13,078 | 1.38× | 99.05% |

On A100, CUDA EP FP16 outperforms TensorRT FP16. This is expected: A100 was optimized for BF16 and large-batch datacenter workloads; TensorRT gains are less pronounced for the NLP batch sizes typical in spaCy pipelines.


RTX 4090 — full results ³

Full pipeline:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 9,924 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,839 | 0.99× | 100.00% |
| CUDA EP FP32 | 10,102 | 1.02× | 99.85% |
| CUDA EP FP16 | 11,381 | 1.15× | 99.85% |
| TensorRT FP32 | 10,397 | 1.05× | 99.95% |
| TensorRT FP16 | 11,726 | 1.18× | 99.65% |

NER only:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 14,728 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 14,557 | 0.99× | 100.00% |
| CUDA EP FP32 | 15,153 | 1.03× | 99.85% |
| CUDA EP FP16 | 18,126 | 1.23× | 99.85% |
| TensorRT FP32 | 15,853 | 1.08× | 99.95% |
| TensorRT FP16 | 19,313 | 1.31× | 99.65% |

On RTX 4090, the PyTorch baseline is already fast (~10k WPS full pipeline). The remaining gains come mostly from FP16 precision, not from the runtime switch. Switching to NER-only mode shows the clearest improvement (1.31×).


¹ Cloud instance (Hetzner). Ada Lovelace architecture benefits most from TensorRT FP16.

² Virtual partition (GRID A100D-80C). On this GPU CUDA EP FP16 is the recommended provider — TensorRT does not outperform it for typical spaCy batch sizes.

³ Cloud instance (RunPod). Local RTX 4090 results may differ due to power limits or virtualization overhead.

Supported Models

Currently tested, confirmed, and supported:

  • en_core_web_trf (RoBERTa-based)

Other spaCy transformer packages should be treated as unsupported for now, even if related architecture-detection code exists internally.

How It Works

  1. Weight Mapping — extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.
  2. ONNX Export — exports the model to ONNX with dynamic batch and sequence dimensions.
  3. FP16 Optimization (optional) — applies BERT-style graph optimizations and converts weights to FP16.
  4. Runtime Patching — replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same spaCy interface.
  5. Caching — converted models are cached to disk to avoid re-conversion on subsequent runs.
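Step 2 follows the standard torch.onnx.export pattern. The sketch below is a generic illustration of that call for a RoBERTa-style encoder, not the package's actual internals (spacy-accelerate exports the weights it extracted from the spaCy pipeline rather than downloading a model):

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model; the real export uses the weights mapped out of spaCy.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()
model.config.return_dict = False  # export a plain tuple of outputs

inputs = tokenizer("example text", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    # Dynamic axes keep batch size and sequence length flexible at runtime.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)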

Troubleshooting

TensorRT provider not available

Run the diagnostic tool first:

python -m spacy_accelerate

If you see TensorRT EP : MISSING, the NVIDIA build of onnxruntime-gpu is not installed. Fix it with:

pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

libnvinfer.so / libcublas.so / libcublasLt.so not found

If you see errors like libnvinfer.so.10: cannot open shared object file:

Automatic fix: spacy-accelerate configures the native library paths on import. Make sure to import spacy_accelerate before calling spacy.require_gpu() or creating any ONNX Runtime sessions.
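In practice the ordering looks like this (a sketch; whether you call spacy.require_gpu() at all depends on your setup):

import spacy_accelerate  # sets up the native library paths on import
import spacy

spacy.require_gpu()  # safe now that the CUDA/TensorRT paths are configured

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(nlp, provider="tensorrt", precision="fp16")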

Manual fix: Set LD_LIBRARY_PATH explicitly:

SITE=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE/tensorrt_libs:$SITE/nvidia/cublas/lib:$SITE/nvidia/cuda_runtime/lib:$SITE/nvidia/cudnn/lib:$LD_LIBRARY_PATH"

CUDA out of memory

Reduce workspace size or batch size:

# For TensorRT provider — reduce workspace
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_max_workspace_size=2 * 1024**3,  # 2 GB instead of 4 GB
)

# For any provider — reduce batch size
nlp = spacy_accelerate.optimize(
    nlp,
    max_batch_size=16,
)

First inference is slow

This only applies to the TensorRT provider. TensorRT compiles an optimized engine on the first run — this can take tens of seconds. Subsequent runs reuse the cached engine and are fast.

The timing cache is enabled by default and carries over build history between runs. If build time matters, prefer a lower optimization level:

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_timing_cache=True,             # on by default
    trt_builder_optimization_level=3,  # lower = faster build, 5 = best runtime perf
)

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

MIT License
