Accelerate spaCy transformers with TensorRT/ONNX Runtime
Project description
spacy-accelerate
Accelerate spaCy transformer pipelines with TensorRT and ONNX Runtime. Drop-in replacement — one line of code. Speedup depends on your GPU: 1.2–3.5× faster inference on the tested setups, with small accuracy deltas.
Repository: GitHub • Package: PyPI
Requirements
- Python 3.11+
- CUDA 12.x
- NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
- spaCy 3.8+ with spacy-transformers
Installation
PyPI
pip install spacy-accelerate
pip install --force-reinstall \
--extra-index-url https://pypi.nvidia.com \
onnxruntime-gpu==1.23.2
The second command installs the TensorRT-enabled build of onnxruntime-gpu from NVIDIA's index. It is required because the default PyPI build does not include the TensorRT execution provider.
[!NOTE]
`spacy-accelerate` pins the full CUDA/TensorRT stack to keep versions aligned. On import it also configures the native library paths automatically, so no manual `LD_LIBRARY_PATH` setup is needed in most cases.
Source / editable install
pip install -r requirements.txt
pip install -e . --no-deps
[!WARNING]
Use `--no-deps` when doing an editable install. Running plain `pip install -e .` triggers a second resolver pass that can replace the pinned CUDA 12 stack with newer, incompatible packages.
Verify the installation
python -m spacy_accelerate
You should see `TensorRT EP : OK` and `CUDA EP : OK` in the output.
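You can also cross-check with ONNX Runtime's own public API; the provider names below are ONNX Runtime's standard identifiers:
import onnxruntime as ort

# Both the TensorRT and CUDA execution providers should be listed
providers = ort.get_available_providers()
print(providers)
assert "TensorrtExecutionProvider" in providers
assert "CUDAExecutionProvider" in providers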
Quick Start
import spacy
import spacy_accelerate
# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")
# Optimize with one line
nlp = spacy_accelerate.optimize(nlp, precision="fp16")
# Use as normal — same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]
# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))
API Reference
optimize(nlp, **kwargs)
Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `nlp` | `spacy.Language` | required | spaCy pipeline with transformer |
| `precision` | `"fp32" \| "fp16"` | `"fp16"` | Model precision |
| `provider` | `"tensorrt" \| "cuda" \| "cpu"` | `"cuda"` | Execution provider |
| `cache_dir` | `Path \| str` | `~/.cache/spacy-accelerate` | ONNX model cache directory |
| `warmup` | `bool` | `True` | Run warmup inference |
| `device_id` | `int` | `0` | CUDA device ID |
| `max_batch_size` | `int` | `128` | Max batch size for IO Binding |
| `max_seq_length` | `int` | `512` | Max sequence length for IO Binding |
| `use_io_binding` | `bool` | `True` | Use zero-copy IO Binding |
| `verbose` | `bool` | `False` | Enable verbose logging |
TensorRT-specific parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `trt_max_workspace_size` | `int` | `4 * 1024**3` | TensorRT workspace size in bytes |
| `trt_builder_optimization_level` | `int` | `3` | Optimization level (0–5) |
| `trt_timing_cache` | `bool` | `True` | Enable timing cache |
Advanced parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `fixed_batch_size` | `int \| None` | `None` | Export ONNX with fixed batch size (dynamic if `None`) |
| `batch_buckets` | `list[int] \| None` | `None` | Pre-compiled TRT batch sizes; defaults to `[1, 2, 4, 8, 16, 32, 64, 128]` for TensorRT |
| `fixed_seq_length` | `int \| None` | `None` | Pad/truncate all sequences to this length on GPU |
| `align_seq_length` | `int` | `16` | Pad sequence length to the nearest multiple of this value |
Returns: The optimized `spacy.Language` object (modified in-place).
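For reference, a call combining several of the parameters above (values are illustrative, not recommendations):
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    max_batch_size=64,             # IO Binding upper bound
    max_seq_length=256,            # IO Binding upper bound
    batch_buckets=[1, 8, 32, 64],  # pre-compiled TRT batch sizes
    align_seq_length=32,           # pad sequence length to multiples of 32
    verbose=True,
)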
Cache Management
import spacy_accelerate
# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")
# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")
# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")
Examples
Maximum performance with TensorRT
import spacy
import spacy_accelerate
nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
nlp,
provider="tensorrt",
precision="fp16",
trt_max_workspace_size=8 * 1024**3, # 8 GB
trt_builder_optimization_level=5, # Maximum optimization
)
# First inference builds the TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")
Custom cache directory
import spacy
import spacy_accelerate
nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
nlp,
cache_dir="/path/to/custom/cache",
precision="fp16",
)
Verbose mode for debugging
import spacy
import spacy_accelerate
nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
nlp,
verbose=True,
)
Performance
Conditions: `en_core_web_trf`, CoNLL-2003 test set, `batch_size=128`, 1 warmup pass + 3 measured passes averaged.
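The numbers below come from a dedicated benchmark harness. As a rough illustration, a comparable measurement can be sketched with nothing but `nlp.pipe` and a timer; the corpus here is a placeholder, not the CoNLL-2003 setup:
import time
import spacy
import spacy_accelerate

nlp = spacy_accelerate.optimize(spacy.load("en_core_web_trf"), precision="fp16")
texts = ["Apple Inc. was founded by Steve Jobs in Cupertino."] * 1000  # placeholder corpus

list(nlp.pipe(texts[:100], batch_size=128))  # warmup pass
start = time.perf_counter()
docs = list(nlp.pipe(texts, batch_size=128))
elapsed = time.perf_counter() - start
print(f"{sum(len(d) for d in docs) / elapsed:,.0f} WPS")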
How much speedup to expect
The relative gain depends on your GPU. On a faster GPU, PyTorch is already faster — GPU compute stops being the bottleneck and the overhead shifts to the spaCy pipeline (tokenization, NER decoding, Python). ONNX Runtime does not accelerate that part.
| GPU | Mode | PyTorch baseline | Best provider | Best WPS | Speedup |
|---|---|---|---|---|---|
| RTX 4000 SFF Ada ¹ | Full pipeline | 6,241 WPS | TensorRT FP16 | 16,935 | 2.71× |
| RTX 4000 SFF Ada ¹ | NER only | 7,066 WPS | TensorRT FP16 | 24,823 | 3.51× |
| A100 80 GB ² | Full pipeline | 6,881 WPS | CUDA FP16 | 9,670 | 1.41× |
| A100 80 GB ² | NER only | 9,486 WPS | CUDA FP16 | 15,291 | 1.61× |
| RTX 4090 ³ | Full pipeline | 9,924 WPS | TensorRT FP16 | 11,726 | 1.18× |
| RTX 4090 ³ | NER only | 14,728 WPS | TensorRT FP16 | 19,313 | 1.31× |
NER-only mode disables `tagger`, `parser`, `attribute_ruler`, and `lemmatizer`. Only the transformer and NER components run, which reduces non-GPU overhead and yields higher speedups.
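A sketch of setting up this NER-only configuration with standard spaCy loading options (the benchmark harness itself is not part of this package):
import spacy
import spacy_accelerate

# Disable everything except the transformer and NER, as in the NER-only rows
nlp = spacy.load(
    "en_core_web_trf",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer"],
)
nlp = spacy_accelerate.optimize(nlp, provider="tensorrt", precision="fp16")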
RTX 4000 SFF Ada Generation — full results ¹
Full pipeline:
| Execution Provider | WPS | Speedup | Accuracy |
|---|---|---|---|
| PyTorch Baseline FP32 | 6,241 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,166 | 0.99× | 100.00% |
| CUDA EP FP32 | 9,910 | 1.59× | 99.90% |
| CUDA EP FP16 | 15,763 | 2.53× | 99.75% |
| TensorRT FP32 | 10,552 | 1.69× | 99.95% |
| TensorRT FP16 | 16,935 | 2.71× | 99.50% |
NER only:
| Execution Provider | WPS | Speedup | Accuracy |
|---|---|---|---|
| PyTorch Baseline FP32 | 7,066 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,859 | 0.97× | 100.00% |
| CUDA EP FP32 | 11,972 | 1.69× | 99.90% |
| CUDA EP FP16 | 22,394 | 3.17× | 99.75% |
| TensorRT FP32 | 13,138 | 1.86× | 99.95% |
| TensorRT FP16 | 24,823 | 3.51× | 99.65% |
A100 80 GB — full results ²
Full pipeline:
| Execution Provider | WPS | Speedup | Accuracy |
|---|---|---|---|
| PyTorch Baseline FP32 | 6,881 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,882 | 1.00× | 100.00% |
| CUDA EP FP32 | 8,822 | 1.28× | 99.85% |
| CUDA EP FP16 | 9,670 | 1.41× | 99.75% |
| TensorRT FP32 | 8,846 | 1.29× | 99.95% |
| TensorRT FP16 | 9,491 | 1.38× | 99.05% |
NER only:
| Execution Provider | WPS | Speedup | Accuracy |
|---|---|---|---|
| PyTorch Baseline FP32 | 9,486 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,414 | 0.99× | 100.00% |
| CUDA EP FP32 | 13,554 | 1.43× | 99.85% |
| CUDA EP FP16 | 15,291 | 1.61× | 99.75% |
| TensorRT FP32 | 13,579 | 1.43× | 99.95% |
| TensorRT FP16 | 13,078 | 1.38× | 99.05% |
On A100, CUDA EP FP16 outperforms TensorRT FP16. This is expected: A100 was optimized for BF16 and large-batch datacenter workloads; TensorRT gains are less pronounced for the NLP batch sizes typical in spaCy pipelines.
RTX 4090 — full results ³
Full pipeline:
| Execution Provider | WPS | Speedup | Accuracy |
|---|---|---|---|
| PyTorch Baseline FP32 | 9,924 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,839 | 0.99× | 100.00% |
| CUDA EP FP32 | 10,102 | 1.02× | 99.85% |
| CUDA EP FP16 | 11,381 | 1.15× | 99.85% |
| TensorRT FP32 | 10,397 | 1.05× | 99.95% |
| TensorRT FP16 | 11,726 | 1.18× | 99.65% |
NER only:
| Execution Provider | WPS | Speedup | Accuracy |
|---|---|---|---|
| PyTorch Baseline FP32 | 14,728 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 14,557 | 0.99× | 100.00% |
| CUDA EP FP32 | 15,153 | 1.03× | 99.85% |
| CUDA EP FP16 | 18,126 | 1.23× | 99.85% |
| TensorRT FP32 | 15,853 | 1.08× | 99.95% |
| TensorRT FP16 | 19,313 | 1.31× | 99.65% |
On RTX 4090, the PyTorch baseline is already fast (~10k WPS full pipeline). The remaining gains come mostly from FP16 precision, not from the runtime switch. Switching to NER-only mode shows the clearest improvement (1.31×).
¹ Cloud instance (Hetzner). Ada Lovelace architecture benefits most from TensorRT FP16.
² Virtual partition (GRID A100D-80C). On this GPU, CUDA EP FP16 is the recommended provider; TensorRT does not outperform it for typical spaCy batch sizes.
³ Cloud instance (RunPod). Local RTX 4090 results may differ due to power limits or virtualization overhead.
Supported Models
Currently tested, confirmed, and supported:
- `en_core_web_trf` (RoBERTa-based)
Other spaCy transformer packages should be treated as unsupported for now, even if related architecture-detection code exists internally.
How It Works
- Weight Mapping — extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.
- ONNX Export — exports the model to ONNX with dynamic batch and sequence dimensions (see the sketch after this list).
- FP16 Optimization (optional) — applies BERT-style graph optimizations and converts weights to FP16.
- Runtime Patching — replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same spaCy interface.
- Caching — converted models are cached to disk to avoid re-conversion on subsequent runs.
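To make step 2 concrete, here is a rough sketch of an ONNX export with dynamic batch and sequence axes, using `torch.onnx.export` and `roberta-base` as a stand-in model; the library's actual export code is not shown here and may differ:
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("roberta-base").eval()
model.config.return_dict = False  # export tuple outputs instead of a dict
tok = AutoTokenizer.from_pretrained("roberta-base")
dummy = tok("A short example.", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    # Dynamic batch and sequence dimensions, as described in step 2
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
)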
Troubleshooting
TensorRT provider not available
Run the diagnostic tool first:
python -m spacy_accelerate
If you see `TensorRT EP : MISSING`, the NVIDIA build of `onnxruntime-gpu` is not installed. Fix it with:
pip install --force-reinstall \
--extra-index-url https://pypi.nvidia.com \
onnxruntime-gpu==1.23.2
`libnvinfer.so` / `libcublas.so` / `libcublasLt.so` not found
If you see errors like `libnvinfer.so.10: cannot open shared object file`:
Automatic fix: `spacy-accelerate` configures the native library paths on import. Make sure to import `spacy_accelerate` before calling `spacy.require_gpu()` or creating any ONNX Runtime sessions.
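A minimal sketch of the correct import order (`spacy.require_gpu()` is the standard spaCy call):
import spacy_accelerate  # import first: configures native library paths
import spacy

spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")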
Manual fix: Set LD_LIBRARY_PATH explicitly:
SITE=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE/tensorrt_libs:$SITE/nvidia/cublas/lib:$SITE/nvidia/cuda_runtime/lib:$SITE/nvidia/cudnn/lib:$LD_LIBRARY_PATH"
CUDA out of memory
Reduce workspace size or batch size:
# For TensorRT provider — reduce workspace
nlp = spacy_accelerate.optimize(
nlp,
provider="tensorrt",
trt_max_workspace_size=2 * 1024**3, # 2 GB instead of 4 GB
)
# For any provider — reduce batch size
nlp = spacy_accelerate.optimize(
nlp,
max_batch_size=16,
)
First inference is slow
This only applies to the TensorRT provider. TensorRT compiles an optimized engine on the first run — this can take tens of seconds. Subsequent runs reuse the cached engine and are fast.
The timing cache is enabled by default and carries over build history between runs. If build time matters, prefer a lower optimization level:
nlp = spacy_accelerate.optimize(
nlp,
provider="tensorrt",
trt_timing_cache=True, # on by default
trt_builder_optimization_level=3, # lower = faster build, 5 = best runtime perf
)
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
License
MIT License
File details
Details for the file `spacy_accelerate-0.3.1.tar.gz`.
File metadata
- Download URL: spacy_accelerate-0.3.1.tar.gz
- Upload date:
- Size: 29.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `991a055722022d33bb664e3c86a98dc2215850f195674202ea5ce7af20f238e3` |
| MD5 | `958aea94c164a1b393ad724c622dbfb0` |
| BLAKE2b-256 | `d88473fee2dbbcd0763b8f1cf46942a300bcea96a7ea95d3ddb8162684aca1e2` |
File details
Details for the file `spacy_accelerate-0.3.1-py3-none-any.whl`.
File metadata
- Download URL: spacy_accelerate-0.3.1-py3-none-any.whl
- Upload date:
- Size: 40.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `9cefe4526eca52df01fd1210491421378974ec30fa567b600a7d8b69c6852b7e` |
| MD5 | `2609354f9b0daee7af20a19a01ccdcbc` |
| BLAKE2b-256 | `80c6caa7ebe951b6b0b3b7e39688318f195d6656672b5b40dfc23a7a32bcfdf9` |