
NeMo Export-Deploy


📖 Documentation 🔧 Installation 🚀 Quick start 🤝 Contributing

The Export-Deploy library ("NeMo Export-Deploy") provides tools and APIs for exporting and deploying NeMo and 🤗Hugging Face models to production environments. It supports various deployment paths including TensorRT, TensorRT-LLM, and vLLM deployment through NVIDIA Triton Inference Server and Ray Serve.
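
To give a sense of the workflow, here is a minimal sketch that exports a checkpoint to TensorRT-LLM and serves it via PyTriton. The import paths and arguments follow the classic NeMo export/deploy API (TensorRTLLM, DeployPyTriton) and are assumptions here; check the documentation for the exact interfaces in this package:

# Minimal sketch, assuming the classic NeMo export/deploy API surface
# (TensorRTLLM, DeployPyTriton); exact paths and arguments may differ here.
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Build a TensorRT-LLM engine from a checkpoint (hypothetical paths).
exporter = TensorRTLLM(model_dir="/tmp/trtllm_engine")
exporter.export(
    nemo_checkpoint_path="/models/llama-3-8b.nemo",
    model_type="llama",
)

# Serve the exported engine through NVIDIA Triton via PyTriton.
server = DeployPyTriton(model=exporter, triton_model_name="llama")
server.deploy()
server.serve()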


📣 News

  • [03/12/2026] Deprecating Python 3.10 support: We are officially dropping Python 3.10 support with the upcoming 0.4.0 release. Downstream applications must raise their lower bound to Python 3.12 to stay compatible with Export-Deploy.

🚀 Key Features

  • Support for Large Language Models (LLMs) and Multimodal Models (MMs)
  • Export Megatron-Bridge and Hugging Face models to optimized inference formats including TensorRT-LLM and vLLM
  • Deploy Megatron-Bridge and Hugging Face models using Ray Serve or NVIDIA Triton Inference Server
  • Multi-GPU and distributed inference capabilities
  • Multi-instance deployment options

Feature Support Matrix

Model Export Capabilities

| Model / Checkpoint | TensorRT-LLM | vLLM | ONNX | TensorRT |
|---|---|---|---|---|
| Hugging Face | bf16 | bf16 | N/A | N/A |
| NIM Embedding | N/A | N/A | bf16, fp8, int8 (PTQ) | bf16, fp8, int8 (PTQ) |
| NIM Reranking | N/A | N/A | Coming Soon | Coming Soon |

The support matrix above outlines the export capabilities for each model or checkpoint, including the precision options supported by each inference-optimized library. The export module can export models that were quantized with post-training quantization (PTQ) via the TensorRT Model Optimizer library, and models trained with low precision or quantization-aware training are supported as well.

The inference-optimized libraries listed above also support on-the-fly quantization during model export, with configurable parameters exposed in the export APIs. Note, however, that the precision options shown in the table indicate support for exporting models that have already been quantized, not the ability to quantize models during export.
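
As an illustration, the classic NeMo TensorRT-LLM export call accepts a dtype argument for the engine precision; the parameter name here is an assumption to verify against the export API reference:

# Sketch of selecting precision at export time; `dtype` is assumed from
# the classic NeMo API and may differ in this package.
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trtllm_engine")
exporter.export(
    nemo_checkpoint_path="/models/llama-3-8b.nemo",  # hypothetical path
    model_type="llama",
    dtype="bfloat16",  # precision of the built engine weights
)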

Please note that not all large language models (LLMs) and multimodal models (MMs) are currently supported. For the most complete and up-to-date information, please refer to the LLM documentation and MM documentation.

Model Deployment Capabilities

| Model / Checkpoint | RayServe | PyTriton |
|---|---|---|
| Megatron-LM | Limited | Limited |
| Hugging Face | Single-Node Multi-GPU, Multi-instance | Single-Node Multi-GPU |
| TensorRT-LLM | Single-Node Multi-GPU, Multi-instance | Multi-Node Multi-GPU |
| vLLM | N/A | Single-Node Multi-GPU |

The support matrix above outlines the available deployment options for each model or checkpoint, highlighting multi-node and multi-GPU support where applicable. For comprehensive details, please refer to the documentation.
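
Once deployed, models are reachable through standard Triton endpoints. The query helper below (NemoQueryLLM) comes from the classic NeMo deploy API and is an assumption to verify for this package:

# Sketch of querying a model served via PyTriton; `NemoQueryLLM` and its
# arguments are assumed from the classic NeMo deploy API.
from nemo.deploy.nlp import NemoQueryLLM

client = NemoQueryLLM(url="localhost:8000", model_name="llama")
answers = client.query_llm(
    prompts=["What is the capital of France?"],
    max_output_len=64,
)
print(answers)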

Refer to the table below for an overview of optimized inference and deployment support for NeMo Framework and Hugging Face models with Triton Inference Server.

| Model / Checkpoint | TensorRT-LLM + Triton Inference Server | vLLM + Triton Inference Server | Direct Triton Inference Server |
|---|---|---|---|
| Hugging Face |  |  |  |

🔧 Install

For quick exploration of NeMo Export-Deploy, we recommend installing our pip package:

pip install nemo-export-deploy

This installation comes without extra dependencies such as TransformerEngine, TensorRT-LLM, or vLLM; it is intended for browsing and exploring the project.

For a feature-complete install, please refer to the following sections.

Use NeMo-FW Container

The NeMo Framework container provides the best experience, highest performance, and full feature support. Fetch the most recent $TAG and run the following command to start a container:

docker run --rm -it -w /workdir -v $(pwd):/workdir \
  --entrypoint bash \
  --gpus all \
  nvcr.io/nvidia/nemo:${TAG}

Build with Dockerfile

For containerized development, use our Dockerfile to build your own container. It comes in three flavors: INFERENCE_FRAMEWORK=inframework, INFERENCE_FRAMEWORK=trtllm, and INFERENCE_FRAMEWORK=vllm:

docker build \
    -f docker/Dockerfile.pytorch \
    -t nemo-export-deploy \
    --build-arg INFERENCE_FRAMEWORK=$INFERENCE_FRAMEWORK \
    .

Start your container:

docker run --rm -it -w /workdir -v $(pwd):/workdir \
  --entrypoint bash \
  --gpus all \
  nemo-export-deploy

Install from Source

For complete feature coverage, we recommend installing TransformerEngine together with either TensorRT-LLM or vLLM.

Recommended Requirements

  • Python 3.12
  • PyTorch 2.7
  • CUDA 12.9
  • Ubuntu 24.04

Install TransformerEngine + InFramework

For the highly optimized TransformerEngine path with the PyTriton backend, first install the following prerequisites:

pip install torch==2.7.0 setuptools pybind11 wheel_stub  # Required for TE

Now proceed with the main installation:

git clone https://github.com/NVIDIA-NeMo/Export-Deploy
cd Export-Deploy/
pip install --no-build-isolation .
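
A quick sanity check that the InFramework path is usable, using only public PyTorch and TransformerEngine entry points:

# Post-install smoke test for the TransformerEngine (InFramework) path.
import torch
import transformer_engine.pytorch as te

assert torch.cuda.is_available(), "TransformerEngine needs a CUDA device"
print(f"torch {torch.__version__}, CUDA {torch.version.cuda}")
layer = te.Linear(16, 16)  # fails here if the TE extension is broken
print("TransformerEngine OK:", type(layer).__name__)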

Install TransformerEngine + TensorRT-LLM

For the highly optimized TransformerEngine path with the TensorRT-LLM backend, first install the following prerequisites:

sudo apt-get -y install libopenmpi-dev  # Required for TensorRT-LLM
pip install torch==2.7.0 setuptools pybind11 wheel_stub  # Required for TE

Now proceed with the main installation from the cloned Export-Deploy directory:

pip install --no-build-isolation .[trtllm]
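
A version check is usually enough to confirm the TensorRT-LLM extra resolved correctly:

# Verify the TensorRT-LLM backend is importable.
import tensorrt_llm

print("TensorRT-LLM:", tensorrt_llm.__version__)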

Install TransformerEngine + vLLM

For the highly optimized TransformerEngine path with the vLLM backend, first install the following prerequisites:

pip install torch==2.7.0 setuptools pybind11 wheel_stub  # Required for TE

Now proceed with the main installation:

pip install --no-build-isolation .[vllm]
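
To exercise the vLLM backend end to end, you can run vLLM's public offline-inference API directly; the model ID below is only an example and will be downloaded from Hugging Face:

# Smoke test via vLLM's offline API; any small HF causal LM works.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model
params = SamplingParams(max_tokens=32, temperature=0.8)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)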

Install TransformerEngine + TRT-ONNX

For the highly optimized TransformerEngine path with the TRT-ONNX backend, first install the following prerequisites:

pip install torch==2.7.0 setuptools pybind11 wheel_stub  # Required for TE

Now proceed with the main installation:

pip install --no-build-isolation .[trt-onnx]
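
As with the other backends, quick import checks confirm that the TRT-ONNX extra installed correctly:

# Verify the TensorRT and ONNX components of the trt-onnx extra.
import onnx
import tensorrt as trt

print("ONNX:", onnx.__version__)
print("TensorRT:", trt.__version__)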

🤝 Contributing

We welcome contributions to NeMo Export-Deploy! Please see our Contributing Guidelines for more information on how to get involved.

License

NeMo Export-Deploy is licensed under the Apache License 2.0.

