

spacy-accelerate


Accelerate spaCy transformer pipelines with TensorRT and ONNX Runtime. Drop-in replacement — one line of code. Speedup depends on your GPU: 1.2–3.5× faster inference on the tested setups, with small accuracy deltas.

Repository: GitHub • Package: PyPI

Requirements

  • Python 3.11+
  • CUDA 12.x
  • NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
  • spaCy 3.8+ with spacy-transformers

Installation

PyPI

pip install spacy-accelerate
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

The second command installs the TensorRT-enabled build of onnxruntime-gpu from NVIDIA's index. It is required because the default PyPI build does not include the TensorRT execution provider.

[!NOTE] spacy-accelerate pins the full CUDA/TensorRT stack to keep versions aligned. On import it also configures the native library paths automatically, so no manual LD_LIBRARY_PATH setup is needed in most cases.

Source / editable install

pip install -r requirements.txt
pip install -e . --no-deps

[!WARNING] Use --no-deps when doing an editable install. Running plain pip install -e . triggers a second resolver pass that can replace the pinned CUDA 12 stack with newer, incompatible packages.

Verify the installation

python -m spacy_accelerate

You should see TensorRT EP : OK and CUDA EP : OK in the output.
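As an additional check, you can list the ONNX Runtime execution providers directly; this uses onnxruntime's public API and is independent of spacy-accelerate:

import onnxruntime as ort

# The NVIDIA build should expose both GPU providers alongside the CPU fallback.
print(ort.get_available_providers())
# e.g. ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']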

Quick Start

import spacy
import spacy_accelerate

# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")

# Optimize with one line
nlp = spacy_accelerate.optimize(nlp, precision="fp16")

# Use as normal — same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]

# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))

API Reference

optimize(nlp, **kwargs)

Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| nlp | spacy.Language | required | spaCy pipeline with transformer |
| precision | "fp32" \| "fp16" | "fp16" | Model precision |
| provider | "tensorrt" \| "cuda" \| "cpu" | "cuda" | Execution provider |
| cache_dir | Path \| str | ~/.cache/spacy-accelerate | ONNX model cache directory |
| warmup | bool | True | Run warmup inference |
| device_id | int | 0 | CUDA device ID |
| max_batch_size | int | 128 | Max batch size for IO Binding |
| max_seq_length | int | 512 | Max sequence length for IO Binding |
| use_io_binding | bool | True | Use zero-copy IO Binding |
| verbose | bool | False | Enable verbose logging |

TensorRT-specific parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| trt_max_workspace_size | int | 4 * 1024**3 | TensorRT workspace size in bytes |
| trt_builder_optimization_level | int | 3 | Optimization level (0–5) |
| trt_timing_cache | bool | True | Enable timing cache |

Advanced parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| fixed_batch_size | int \| None | None | Export ONNX with a fixed batch size (dynamic if None) |
| batch_buckets | list[int] \| None | None | Pre-compiled TRT batch sizes; defaults to [1, 2, 4, 8, 16, 32, 64, 128] for TensorRT |
| fixed_seq_length | int \| None | None | Pad/truncate all sequences to this length on the GPU |
| align_seq_length | int | 16 | Pad sequence length to the nearest multiple of this value |

Returns: The optimized spacy.Language object (modified in-place).
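For example, the provider and the advanced padding parameters from the tables above can be combined in a single call (a sketch with illustrative values, not tuned recommendations):

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

# Pre-compile a small set of TensorRT batch sizes and align sequence
# lengths to multiples of 32 (illustrative values).
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    batch_buckets=[1, 8, 32],
    align_seq_length=32,
)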

Cache Management

import spacy_accelerate

# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")

# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")

# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")

Examples

Maximum performance with TensorRT

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    trt_max_workspace_size=8 * 1024**3,  # 8 GB
    trt_builder_optimization_level=5,     # Maximum optimization
)

# First inference builds the TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")

Custom cache directory

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    cache_dir="/path/to/custom/cache",
    precision="fp16",
)

Verbose mode for debugging

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    verbose=True,
)

Performance

Conditions: en_core_web_trf, CoNLL-2003 test set, batch_size=128, 1 warmup pass + 3 measured passes averaged.

How much speedup to expect

The relative gain depends on your GPU. On a faster GPU, PyTorch is already faster — GPU compute stops being the bottleneck and the overhead shifts to the spaCy pipeline (tokenization, NER decoding, Python). ONNX Runtime does not accelerate that part.
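If you want to know where your own hardware lands, a minimal throughput measurement looks like this (a generic sketch; texts is any list of strings and nlp is the already-optimized pipeline):

import time

def words_per_second(nlp, texts, batch_size=128):
    # One pass over the corpus: count tokens produced, divide by wall-clock time.
    # Run a warmup pass first if you want numbers comparable to the tables below.
    start = time.perf_counter()
    n_tokens = sum(len(doc) for doc in nlp.pipe(texts, batch_size=batch_size))
    return n_tokens / (time.perf_counter() - start)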

| GPU | Mode | PyTorch baseline | Best provider | Best WPS | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 4000 SFF Ada ¹ | Full pipeline | 6,241 WPS | TensorRT FP16 | 16,935 | 2.71× |
| RTX 4000 SFF Ada ¹ | NER only | 7,066 WPS | TensorRT FP16 | 24,823 | 3.51× |
| A100 80 GB ² | Full pipeline | 6,881 WPS | CUDA FP16 | 9,670 | 1.41× |
| A100 80 GB ² | NER only | 9,486 WPS | CUDA FP16 | 15,291 | 1.61× |
| RTX 4090 ³ | Full pipeline | 9,924 WPS | TensorRT FP16 | 11,726 | 1.18× |
| RTX 4090 ³ | NER only | 14,728 WPS | TensorRT FP16 | 19,313 | 1.31× |

NER-only mode disables the tagger, parser, attribute_ruler, and lemmatizer components. Only the transformer and ner components run, which reduces non-GPU overhead and yields higher speedups.
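A simple way to run in this mode is to disable those components when loading the model (standard spaCy API; component names as listed above):

import spacy
import spacy_accelerate

# Disabled components are never run, so their per-document overhead
# disappears; only the transformer and ner components execute.
nlp = spacy.load(
    "en_core_web_trf",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer"],
)
nlp = spacy_accelerate.optimize(nlp, precision="fp16")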


RTX 4000 SFF Ada Generation — full results ¹

Full pipeline:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 6,241 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,166 | 0.99× | 100.00% |
| CUDA EP FP32 | 9,910 | 1.59× | 99.90% |
| CUDA EP FP16 | 15,763 | 2.53× | 99.75% |
| TensorRT FP32 | 10,552 | 1.69× | 99.95% |
| TensorRT FP16 | 16,935 | 2.71× | 99.50% |

NER only:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 7,066 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,859 | 0.97× | 100.00% |
| CUDA EP FP32 | 11,972 | 1.69× | 99.90% |
| CUDA EP FP16 | 22,394 | 3.17× | 99.75% |
| TensorRT FP32 | 13,138 | 1.86× | 99.95% |
| TensorRT FP16 | 24,823 | 3.51× | 99.65% |

A100 80 GB — full results ²

Full pipeline:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 6,881 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,882 | 1.00× | 100.00% |
| CUDA EP FP32 | 8,822 | 1.28× | 99.85% |
| CUDA EP FP16 | 9,670 | 1.41× | 99.75% |
| TensorRT FP32 | 8,846 | 1.29× | 99.95% |
| TensorRT FP16 | 9,491 | 1.38× | 99.05% |

NER only:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 9,486 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,414 | 0.99× | 100.00% |
| CUDA EP FP32 | 13,554 | 1.43× | 99.85% |
| CUDA EP FP16 | 15,291 | 1.61× | 99.75% |
| TensorRT FP32 | 13,579 | 1.43× | 99.95% |
| TensorRT FP16 | 13,078 | 1.38× | 99.05% |

On A100, CUDA EP FP16 outperforms TensorRT FP16. This is expected: A100 was optimized for BF16 and large-batch datacenter workloads; TensorRT gains are less pronounced for the NLP batch sizes typical in spaCy pipelines.


RTX 4090 — full results ³

Full pipeline:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 9,924 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,839 | 0.99× | 100.00% |
| CUDA EP FP32 | 10,102 | 1.02× | 99.85% |
| CUDA EP FP16 | 11,381 | 1.15× | 99.85% |
| TensorRT FP32 | 10,397 | 1.05× | 99.95% |
| TensorRT FP16 | 11,726 | 1.18× | 99.65% |

NER only:

| Execution Provider | WPS | Speedup | Accuracy |
| --- | --- | --- | --- |
| PyTorch Baseline FP32 | 14,728 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 14,557 | 0.99× | 100.00% |
| CUDA EP FP32 | 15,153 | 1.03× | 99.85% |
| CUDA EP FP16 | 18,126 | 1.23× | 99.85% |
| TensorRT FP32 | 15,853 | 1.08× | 99.95% |
| TensorRT FP16 | 19,313 | 1.31× | 99.65% |

On RTX 4090, the PyTorch baseline is already fast (~10k WPS full pipeline). The remaining gains come mostly from FP16 precision, not from the runtime switch. Switching to NER-only mode shows the clearest improvement (1.31×).


¹ Cloud instance (Hetzner). Ada Lovelace architecture benefits most from TensorRT FP16.

² Virtual partition (GRID A100D-80C). On this GPU CUDA EP FP16 is the recommended provider — TensorRT does not outperform it for typical spaCy batch sizes.

³ Cloud instance (RunPod). Local RTX 4090 results may differ due to power limits or virtualization overhead.

Supported Models

Currently tested, confirmed, and supported:

  • en_core_web_trf (RoBERTa-based)

Other spaCy transformer packages should be treated as unsupported for now, even if related architecture-detection code exists internally.

How It Works

  1. Weight Mapping — extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.
  2. ONNX Export — exports the model to ONNX with dynamic batch and sequence dimensions.
  3. FP16 Optimization (optional) — applies BERT-style graph optimizations and converts weights to FP16.
  4. Runtime Patching — replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same spaCy interface.
  5. Caching — converted models are cached to disk to avoid re-conversion on subsequent runs.
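Step 2 follows the standard torch.onnx.export pattern. The sketch below is a generic illustration of that call for a RoBERTa-style encoder, not the package's actual internals (spacy-accelerate exports the weights it extracted from the spaCy pipeline rather than downloading a model):

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model; the real export uses the weights mapped out of spaCy.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()
model.config.return_dict = False  # export a plain tuple of outputs

inputs = tokenizer("example text", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    # Dynamic axes keep batch size and sequence length flexible at runtime.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)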

Troubleshooting

TensorRT provider not available

Run the diagnostic tool first:

python -m spacy_accelerate

If you see TensorRT EP : MISSING, the NVIDIA build of onnxruntime-gpu is not installed. Fix it with:

pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

libnvinfer.so / libcublas.so / libcublasLt.so not found

If you see errors like libnvinfer.so.10: cannot open shared object file:

Automatic fix: spacy-accelerate configures the native library paths on import. Make sure to import spacy_accelerate before calling spacy.require_gpu() or creating any ONNX Runtime sessions.
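In practice the ordering looks like this (a sketch; whether you call spacy.require_gpu() at all depends on your setup):

import spacy_accelerate  # sets up the native library paths on import
import spacy

spacy.require_gpu()  # safe now that the CUDA/TensorRT paths are configured

nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(nlp, provider="tensorrt", precision="fp16")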

Manual fix: Set LD_LIBRARY_PATH explicitly:

SITE=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE/tensorrt_libs:$SITE/nvidia/cublas/lib:$SITE/nvidia/cuda_runtime/lib:$SITE/nvidia/cudnn/lib:$LD_LIBRARY_PATH"

CUDA out of memory

Reduce workspace size or batch size:

# For TensorRT provider — reduce workspace
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_max_workspace_size=2 * 1024**3,  # 2 GB instead of 4 GB
)

# For any provider — reduce batch size
nlp = spacy_accelerate.optimize(
    nlp,
    max_batch_size=16,
)

First inference is slow

This only applies to the TensorRT provider. TensorRT compiles an optimized engine on the first run — this can take tens of seconds. Subsequent runs reuse the cached engine and are fast.

The timing cache is enabled by default and carries over build history between runs. If build time matters, prefer a lower optimization level:

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_timing_cache=True,             # on by default
    trt_builder_optimization_level=3,  # lower = faster build, 5 = best runtime perf
)

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

MIT License
