spacy-accelerate
Accelerate spaCy transformers with TensorRT/ONNX Runtime. Drop-in replacement for transformer-based spaCy pipelines with Docker-verified GPU benchmark workflows.
Installation
spacy-accelerate depends on a CUDA/TensorRT stack that must stay version-aligned.
The two failure modes we hit in practice were:
- a second dependency resolution pass upgrading parts of the stack to different CUDA majors;
- CUDA/TensorRT shared libraries from pip wheels not being visible to CuPy / ONNX Runtime.
The package now pins the runtime versions in `pyproject.toml`, and it configures
the pip-installed native libraries automatically on import.
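In practice, that means `spacy_accelerate` should be imported before anything that loads CUDA libraries. A minimal sketch of the intended order (the `require_gpu()` call is optional and shown only to illustrate ordering):

```python
# Importing spacy_accelerate first lets it extend the dynamic-loader path
# to the pip-installed CUDA/TensorRT libraries before anything dlopens them.
import spacy_accelerate  # noqa: F401  (the import itself has side effects)
import spacy

spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
```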
Benchmark Docker files live under `benchmarks/docker/`, and canonical benchmark
artifacts are saved under `artifacts/benchmarks/docker/`. The root
`.dockerignore` is kept at repository level because Docker build-context
filtering applies to the whole repo root.
PyPI install
```bash
pip install spacy-accelerate

pip install --force-reinstall \
  --extra-index-url https://pypi.nvidia.com \
  onnxruntime-gpu==1.23.2
```
The second command is still required to ensure you get the TensorRT-enabled
`onnxruntime-gpu` build from NVIDIA.
Source / editable install
```bash
pip install -r requirements.txt
pip install -e . --no-deps
```
Do not run plain `pip install -e .` after that. It can trigger a second resolver
pass and replace the pinned CUDA 12 stack with newer, incompatible packages.
Verify the installation:
```bash
python -m spacy_accelerate
```
You should see `TensorRT EP : OK` and `CUDA EP : OK` in the output.
Requirements:
- Python 3.11+
- CUDA 12.x
- NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
- spaCy 3.8+ with spacy-transformers
Quick Start
```python
import spacy
import spacy_accelerate

# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")

# Optimize with one line!
nlp = spacy_accelerate.optimize(nlp, precision="fp16")

# Use as normal - same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]

# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))
```
API Reference
`optimize(nlp, **kwargs)`
Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `nlp` | `spacy.Language` | required | spaCy pipeline with transformer |
| `precision` | `"fp32"` \| `"fp16"` | `"fp16"` | Model precision |
| `provider` | `"tensorrt"` \| `"cuda"` \| `"cpu"` | `"cuda"` | Execution provider |
| `cache_dir` | `Path` \| `str` | `~/.cache/spacy-accelerate` | ONNX model cache directory |
| `warmup` | `bool` | `True` | Run warmup inference |
| `device_id` | `int` | `0` | CUDA device ID |
| `max_batch_size` | `int` | `128` | Max batch size for IO Binding |
| `max_seq_length` | `int` | `512` | Max sequence length for IO Binding |
| `use_io_binding` | `bool` | `True` | Use zero-copy IO Binding |
| `verbose` | `bool` | `False` | Enable verbose logging |
TensorRT-specific parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `trt_max_workspace_size` | `int` | 4 GB | TensorRT workspace size |
| `trt_builder_optimization_level` | `int` | `3` | Optimization level (0-5) |
| `trt_timing_cache` | `bool` | `True` | Enable timing cache |
Returns: The optimized `spacy.Language` object (modified in-place).
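For reference, a call that spells out the documented defaults explicitly (behaviorally the same as `spacy_accelerate.optimize(nlp)`; `cache_dir` is omitted and left at its default):

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
    nlp,
    precision="fp16",
    provider="cuda",
    warmup=True,
    device_id=0,
    max_batch_size=128,
    max_seq_length=512,
    use_io_binding=True,
    verbose=False,
)
```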
Cache Management
```python
import spacy_accelerate

# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")

# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")

# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")
```
Performance
Canonical benchmark results are the Docker runs under `artifacts/benchmarks/docker/`.
Benchmark commands and runner details are maintained in `benchmarks/README.md`.

Latest full-pipeline Docker measurement for `en_core_web_trf` on an NVIDIA RTX 4000 SFF Ada Generation GPU, CoNLL-2003 test set, `batch_size=128`, with one discarded priming pass and three measured passes averaged:
| Execution Provider | Speed (WPS) | Speedup vs. PyTorch | Accuracy (vs. baseline) |
|---|---|---|---|
| PyTorch Baseline (FP32) | 6,241 | 1.00x | 100.00% |
| PyTorch Baseline (FP16) | 6,166 | 0.99x | 100.00% |
| CUDA FP32 | 9,910 | 1.59x | 99.90% |
| CUDA FP16 | 15,763 | 2.53x | 99.75% |
| TensorRT FP32 | 10,552 | 1.69x | 99.95% |
| TensorRT FP16 | 16,935 | 2.71x | 99.50% |
Latest Docker NER-only measurement for `en_core_web_trf` with the tagger, parser, `attribute_ruler`, and lemmatizer components disabled:
| Execution Provider | Speed (WPS) | Speedup vs. PyTorch | Accuracy (vs. baseline) |
|---|---|---|---|
| PyTorch Baseline (FP32) | 7,066 | 1.00x | 100.00% |
| PyTorch Baseline (FP16) | 6,859 | 0.97x | 100.00% |
| CUDA FP32 | 11,972 | 1.69x | 99.90% |
| CUDA FP16 | 22,394 | 3.17x | 99.75% |
| TensorRT FP32 | 13,138 | 1.86x | 99.95% |
| TensorRT FP16 | 24,823 | 3.51x | 99.65% |
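For a rough local sanity check (not a replacement for the canonical Docker runs), words per second can be estimated with `nlp.pipe`; the texts, pass counts, and token-based word count below are illustrative assumptions:

```python
import time

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(nlp, provider="tensorrt", precision="fp16")

texts = ["Apple Inc. was founded by Steve Jobs in Cupertino."] * 1000

# Mirror the protocol above: one discarded priming pass, three measured passes.
list(nlp.pipe(texts, batch_size=128))
speeds = []
for _ in range(3):
    start = time.perf_counter()
    docs = list(nlp.pipe(texts, batch_size=128))
    elapsed = time.perf_counter() - start
    speeds.append(sum(len(doc) for doc in docs) / elapsed)
print(f"~{sum(speeds) / len(speeds):,.0f} WPS")
```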
Examples
Using TensorRT for Maximum Performance
```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    trt_max_workspace_size=8 * 1024**3,  # 8 GB
    trt_builder_optimization_level=5,  # Maximum optimization
)

# First inference builds the TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")
```
Custom Cache Directory
```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
    nlp,
    cache_dir="/path/to/custom/cache",
    precision="fp16",
)
```
Verbose Mode for Debugging
```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(
    nlp,
    verbose=True,  # Print detailed logs
)
```
Supported Models
Confirmed spaCy model support is currently limited to:

- `en_core_web_trf`

Internally, the exporter and architecture-detection logic target curated-transformer / RoBERTa-style backbones, with partial code paths for the BERT and XLM-RoBERTa families, but those are not yet claimed as generally supported spaCy packages.
How It Works
- **Weight Mapping**: Extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.
- **ONNX Export**: Exports the mapped model to ONNX format with dynamic batch and sequence dimensions.
- **FP16 Optimization** (optional): Applies BERT-style optimizations and converts to FP16 for faster inference.
- **Runtime Patching**: Replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same interface.
- **Caching**: Converted models are cached to avoid re-conversion on subsequent loads.
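To make the runtime-patching step concrete, here is a hypothetical sketch of the proxy idea; the class name, input names, and output name are illustrative, not the package's actual internals:

```python
import numpy as np
import onnxruntime as ort

class OnnxTransformerProxy:
    """Stands in for the PyTorch transformer, delegating to ONNX Runtime."""

    def __init__(self, onnx_path, providers=("CUDAExecutionProvider", "CPUExecutionProvider")):
        self.session = ort.InferenceSession(onnx_path, providers=list(providers))

    def __call__(self, input_ids: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
        # Same tensors the PyTorch module would consume and produce.
        (last_hidden_state,) = self.session.run(
            ["last_hidden_state"],
            {"input_ids": input_ids, "attention_mask": attention_mask},
        )
        return last_hidden_state
```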
Troubleshooting
TensorRT provider not available
Run the diagnostic tool first:
```bash
python -m spacy_accelerate
```
If you see `TensorRT EP : MISSING`, the NVIDIA build of `onnxruntime-gpu` is not installed.
Fix with step 2 from the installation instructions:
```bash
pip install --force-reinstall \
  --extra-index-url https://pypi.nvidia.com \
  onnxruntime-gpu==1.23.2
```
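You can also check the installed build directly, independent of spacy-accelerate:

```python
import onnxruntime as ort

# A TensorRT-enabled GPU build should list both providers below.
print(ort.get_available_providers())
# Expected to include: 'TensorrtExecutionProvider', 'CUDAExecutionProvider'
```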
`libnvinfer.so` / `libcublas.so` / `libcublasLt.so` not found

If you see errors like `libnvinfer.so.10`, `libcublas.so.12`, or
`libcublasLt.so.12: cannot open shared object file`:
**Automatic fix:** spacy-accelerate automatically configures both the TensorRT
libraries and the CUDA libraries installed under `site-packages/nvidia/*/lib`.
Import `spacy_accelerate` before creating ONNX Runtime sessions or calling
`spacy.require_gpu()`.
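To verify where those libraries actually are before falling back to the manual fix, a quick illustrative check of the paths the automatic configuration targets:

```python
import glob
import site

site_packages = site.getsitepackages()[0]
for pattern in ("tensorrt_libs/libnvinfer.so*", "nvidia/*/lib/libcublas*.so*"):
    # Empty output means the corresponding pip wheels are not installed.
    print(pattern, "->", glob.glob(f"{site_packages}/{pattern}"))
```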
**Manual fix:** If the automatic configuration doesn't work (e.g., running scripts directly):

```bash
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE_PACKAGES/tensorrt_libs:$SITE_PACKAGES/nvidia/cublas/lib:$SITE_PACKAGES/nvidia/cuda_runtime/lib:$SITE_PACKAGES/nvidia/cudnn/lib:$LD_LIBRARY_PATH"
```
CUDA out of memory
Reduce workspace size or batch size:
```python
nlp = spacy_accelerate.optimize(
    nlp,
    trt_max_workspace_size=2 * 1024**3,  # 2 GB instead of 4 GB
    max_batch_size=16,  # Smaller batches
)
```
First inference is slow
TensorRT builds optimized engines on first run. Enable caching:
```python
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_timing_cache=True,  # Cache timing data
)
```
License
MIT License
Contributing
Contributions are welcome! Please open an issue or submit a pull request.