
spacy-accelerate

Accelerate spaCy transformers with TensorRT/ONNX Runtime. A drop-in optimization for transformer-based spaCy pipelines, with Docker-verified GPU benchmark workflows.

Installation

spacy-accelerate depends on a CUDA/TensorRT stack that must stay version-aligned. The two failure modes we hit in practice were:

  • a second dependency resolution pass upgrading parts of the stack to different CUDA majors;
  • CUDA/TensorRT shared libraries from pip wheels not being visible to CuPy / ONNX Runtime.

The package now pins the runtime versions in pyproject.toml, and it configures the pip-installed native libraries automatically on import.

Benchmark Dockerfiles live under benchmarks/docker/, and canonical benchmark artifacts are saved under artifacts/benchmarks/docker/. The .dockerignore is kept at the repository root because Docker build-context filtering applies to the entire repository root.

PyPI install

pip install spacy-accelerate
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

The second command is still required: it guarantees you get the TensorRT-enabled onnxruntime-gpu build from NVIDIA's index.

Source / editable install

pip install -r requirements.txt
pip install -e . --no-deps

Do not run plain pip install -e . after that. It can trigger a second resolver pass and replace the pinned CUDA 12 stack with newer incompatible packages.

Verify the installation:

python -m spacy_accelerate

You should see TensorRT EP : OK and CUDA EP : OK in the output.
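If you prefer a programmatic check, ONNX Runtime itself reports which execution providers its build supports. This is a minimal sketch using only standard onnxruntime calls; the provider names are ONNX Runtime identifiers, not spacy-accelerate ones:

import spacy_accelerate  # import first so the pip-installed CUDA/TensorRT libraries are configured
import onnxruntime as ort

print(ort.get_available_providers())
# A working install should list 'TensorrtExecutionProvider' and 'CUDAExecutionProvider'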

Requirements:

  • Python 3.11+
  • CUDA 12.x
  • NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
  • spaCy 3.8+ with spacy-transformers

Quick Start

import spacy
import spacy_accelerate

# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")

# Optimize with one line!
nlp = spacy_accelerate.optimize(nlp, precision="fp16")

# Use as normal - same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]

# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))

API Reference

optimize(nlp, **kwargs)

Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.

Parameters:

Parameter        Type                          Default                     Description
nlp              spacy.Language                required                    spaCy pipeline with transformer
precision        "fp32" | "fp16"               "fp16"                      Model precision
provider         "tensorrt" | "cuda" | "cpu"   "cuda"                      Execution provider
cache_dir        Path | str                    ~/.cache/spacy-accelerate   ONNX model cache directory
warmup           bool                          True                        Run warmup inference
device_id        int                           0                           CUDA device ID
max_batch_size   int                           128                         Max batch size for IO Binding
max_seq_length   int                           512                         Max sequence length for IO Binding
use_io_binding   bool                          True                        Use zero-copy IO Binding
verbose          bool                          False                       Enable verbose logging

TensorRT-specific parameters:

Parameter                        Type   Default   Description
trt_max_workspace_size           int    4 GB      TensorRT workspace size in bytes
trt_builder_optimization_level   int    3         Builder optimization level (0-5)
trt_timing_cache                 bool   True      Enable timing cache

Returns: The optimized spacy.Language object (modified in-place).
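
As a usage sketch, here is a call that sets several of the parameters above explicitly; the specific values are illustrative, not recommendations:

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="cuda",        # "tensorrt", "cuda", or "cpu"
    precision="fp16",
    device_id=0,
    max_batch_size=64,      # IO Binding buffers sized for up to 64 docs per batch
    max_seq_length=256,     # and up to 256 tokens per doc
    use_io_binding=True,
    warmup=True,
    verbose=True,
)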

Cache Management

import spacy_accelerate

# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")

# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")

# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")

Performance

Canonical benchmark results are the Docker runs under artifacts/benchmarks/docker.

Benchmark commands and runner details are maintained in benchmarks/README.md.

Latest full-pipeline Docker measurement for en_core_web_trf on an NVIDIA RTX 4000 SFF Ada Generation GPU, CoNLL-2003 test set, batch_size=128, with one discarded warm-up (prime) pass and three measured passes averaged:

Execution Provider        Speed (WPS)   Speedup vs PyTorch   Accuracy
PyTorch Baseline (FP32)   6,241         1.00x                100.00%
PyTorch Baseline (FP16)   6,166         0.99x                100.00%
CUDA FP32                 9,910         1.59x                99.90%
CUDA FP16                 15,763        2.53x                99.75%
TensorRT FP32             10,552        1.69x                99.95%
TensorRT FP16             16,935        2.71x                99.50%

Latest Docker NER-only measurement for en_core_web_trf with tagger, parser, attribute_ruler, and lemmatizer disabled:

Execution Provider        Speed (WPS)   Speedup vs PyTorch   Accuracy
PyTorch Baseline (FP32)   7,066         1.00x                100.00%
PyTorch Baseline (FP16)   6,859         0.97x                100.00%
CUDA FP32                 11,972        1.69x                99.90%
CUDA FP16                 22,394        3.17x                99.75%
TensorRT FP32             13,138        1.86x                99.95%
TensorRT FP16             24,823        3.51x                99.65%

Examples

Using TensorRT for Maximum Performance

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    trt_max_workspace_size=8 * 1024**3,  # 8GB
    trt_builder_optimization_level=5,     # Maximum optimization
)

# First inference builds TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")

Custom Cache Directory

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    cache_dir="/path/to/custom/cache",
    precision="fp16",
)

Verbose Mode for Debugging

import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    verbose=True,  # Print detailed logs
)

Supported Models

The only spaCy model currently confirmed as supported is:

  • en_core_web_trf

Earlier versions of this section listed transformer architecture families rather than published spaCy package names. Internally, the exporter and architecture-detection logic currently target curated-transformer / RoBERTa-style backbones, with partial code paths for the BERT and XLM-RoBERTa families, but those are not yet claimed as generally supported spaCy packages.

How It Works

  1. Weight Mapping: Extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.

  2. ONNX Export: Exports the mapped model to ONNX format with dynamic batch and sequence dimensions (see the sketch after this list).

  3. FP16 Optimization (optional): Applies BERT-style optimizations and converts to FP16 for faster inference.

  4. Runtime Patching: Replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same interface.

  5. Caching: Converted models are cached to avoid re-conversion on subsequent loads.
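
The sketch below roughly illustrates steps 2 and 4 under simplifying assumptions: it exports a plain HuggingFace RoBERTa backbone ("roberta-base", the file name "backbone.onnx", and the opset are all illustrative) with dynamic batch/sequence axes, then loads it into an ONNX Runtime session with the TensorRT and CUDA providers. spacy-accelerate's real exporter additionally performs the weight mapping, FP16 conversion, and caching described above.

import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

# Step 2: export a transformer backbone with dynamic batch and sequence dimensions.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.config.return_dict = False  # return a plain tuple so tracing/export is straightforward
model.eval()
sample = tokenizer("example input", return_tensors="pt")

torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "backbone.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
        "pooler_output": {0: "batch"},
    },
    opset_version=17,
)

# Step 4 (simplified): the exported model is served through an ONNX Runtime session;
# spacy-accelerate wraps such a session in a proxy exposing the original transformer's interface.
session = ort.InferenceSession(
    "backbone.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
outputs = session.run(
    None,
    {
        "input_ids": sample["input_ids"].numpy(),
        "attention_mask": sample["attention_mask"].numpy(),
    },
)
print(outputs[0].shape)  # (batch, sequence, hidden_size)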

Troubleshooting

TensorRT provider not available

Run the diagnostic tool first:

python -m spacy_accelerate

If you see TensorRT EP : MISSING, the NVIDIA build of onnxruntime-gpu is not installed. Fix it by re-running the second command from the installation instructions:

pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2

libnvinfer.so / libcublas.so / libcublasLt.so not found

If you see errors like libnvinfer.so.10, libcublas.so.12, or libcublasLt.so.12: cannot open shared object file:

Automatic fix: spacy-accelerate automatically configures both the TensorRT libraries (under site-packages/tensorrt_libs) and the CUDA libraries (under site-packages/nvidia/*/lib) installed by pip. Import spacy_accelerate before creating ONNX Runtime sessions or calling spacy.require_gpu().
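
In practice that just means putting the import first. A minimal sketch:

import spacy_accelerate  # configures the pip-installed TensorRT/CUDA libraries on import
import spacy

spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(nlp, provider="tensorrt", precision="fp16")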

Manual fix: If the automatic configuration doesn't work (e.g., in scripts that never import spacy_accelerate), set LD_LIBRARY_PATH yourself:

SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE_PACKAGES/tensorrt_libs:$SITE_PACKAGES/nvidia/cublas/lib:$SITE_PACKAGES/nvidia/cuda_runtime/lib:$SITE_PACKAGES/nvidia/cudnn/lib:$LD_LIBRARY_PATH"

CUDA out of memory

Reduce workspace size or batch size:

nlp = spacy_accelerate.optimize(
    nlp,
    trt_max_workspace_size=2 * 1024**3,  # 2GB instead of 4GB
    max_batch_size=16,                    # Smaller batches
)

First inference is slow

TensorRT builds optimized engines on first run. Enable caching:

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_timing_cache=True,  # Cache timing data
)

License

MIT License

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
