Automatic Speech Recognition in Python using ONNX models
ONNX ASR
onnx-asr is a Python package for Automatic Speech Recognition using ONNX models. It's written in pure Python with minimal dependencies (no PyTorch, Transformers, or FFmpeg required):
[!TIP] Supports Parakeet v2 (En) / v3 (Multilingual), Canary v2 (Multilingual) and GigaAM v2/v3 (Ru) models!
The onnx-asr package supports many modern ASR models and the following features:
- Runs on Windows, Linux, and macOS on a variety of devices, from IoT devices with Arm CPUs to servers with Nvidia GPUs (benchmarks)
- Loading models from Hugging Face or local folders (including quantized versions)
- Accepts wav files or NumPy arrays (built-in support for file reading and resampling)
- Batch processing
- (experimental) Longform recognition with VAD (Voice Activity Detection)
- (experimental) Returns token timestamps
- Simple CLI
- Online demo in HF Spaces
Supported model architectures
The package supports the following modern ASR model architectures (comparison with original implementations):
- Nvidia NeMo Conformer/FastConformer/Parakeet/Canary (with CTC, RNN-T, TDT and Transformer decoders)
- Kaldi Icefall Zipformer (with stateless RNN-T decoder) including Alpha Cephei Vosk 0.52+
- Sber GigaAM v2/v3 (with CTC and RNN-T decoders, including E2E versions)
- T-Tech T-one (with CTC decoder, no streaming support yet)
- OpenAI Whisper
When these models are saved in ONNX format, usually only the encoder and decoder are exported. To run them, the corresponding preprocessing and decoding must be implemented as well, so the package contains these implementations for all supported models:
- Log-mel spectrogram preprocessors
- Greedy search decoding
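As an illustration of the decoding step, greedy search for a CTC model reduces to an argmax over the vocabulary at each frame, collapsing consecutive repeats and dropping blanks. Here is a minimal NumPy sketch; the toy log-probability matrix and three-letter vocabulary are invented for illustration, while onnx-asr's actual decoders operate on real encoder outputs:

```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, vocab: list[str], blank_id: int) -> str:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    best = log_probs.argmax(axis=-1)  # best token index for each frame
    # Keep a token only if it differs from the previous frame's token
    collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
    # Drop blanks and map the remaining indices to vocabulary entries
    return "".join(vocab[t] for t in collapsed if t != blank_id)

# Toy example: 3 tokens plus a blank, 6 frames
vocab = ["a", "b", "c"]
blank_id = 3
frames = np.full((6, 4), -10.0)
# Make the per-frame argmax spell out "a", "a", blank, "b", "b", "c"
for i, t in enumerate([0, 0, 3, 1, 1, 2]):
    frames[i, t] = 0.0
print(ctc_greedy_decode(frames, vocab, blank_id))  # -> abc
```

Real RNN-T and TDT decoders are more involved (the prediction network is fed back autoregressively), but the greedy principle is the same.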
Installation
The package can be installed from PyPI:
- With CPU `onnxruntime` and `huggingface-hub`:

pip install onnx-asr[cpu,hub]
- With GPU `onnxruntime` and `huggingface-hub`:

[!IMPORTANT] First, you need to install the required version of CUDA.

pip install onnx-asr[gpu,hub]
- Without `onnxruntime` and `huggingface-hub` (if you already have some version of `onnxruntime` installed and prefer to download the models yourself):

pip install onnx-asr
- To build onnx-asr from source, you need to install pdm. Then you can build onnx-asr with the command:
pdm build
Usage examples
Load ONNX model from Hugging Face
Load ONNX model from Hugging Face and recognize wav file:
import onnx_asr
model = onnx_asr.load_model("gigaam-v2-rnnt")
print(model.recognize("test.wav"))
[!IMPORTANT] Supported wav file formats: PCM_U8, PCM_16, PCM_24 and PCM_32. For other formats, you either need to convert them first or use a library that can read them into a NumPy array.
Supported model names:
- `gigaam-v2-ctc` for Sber GigaAM v2 CTC (origin, onnx)
- `gigaam-v2-rnnt` for Sber GigaAM v2 RNN-T (origin, onnx)
- `gigaam-v3-ctc` for Sber GigaAM v3 CTC (origin, onnx)
- `gigaam-v3-rnnt` for Sber GigaAM v3 RNN-T (origin, onnx)
- `gigaam-v3-e2e-ctc` for Sber GigaAM v3 E2E CTC (origin, onnx)
- `gigaam-v3-e2e-rnnt` for Sber GigaAM v3 E2E RNN-T (origin, onnx)
- `nemo-fastconformer-ru-ctc` for Nvidia FastConformer-Hybrid Large (ru) with CTC decoder (origin, onnx)
- `nemo-fastconformer-ru-rnnt` for Nvidia FastConformer-Hybrid Large (ru) with RNN-T decoder (origin, onnx)
- `nemo-parakeet-ctc-0.6b` for Nvidia Parakeet CTC 0.6B (en) (origin, onnx)
- `nemo-parakeet-rnnt-0.6b` for Nvidia Parakeet RNNT 0.6B (en) (origin, onnx)
- `nemo-parakeet-tdt-0.6b-v2` for Nvidia Parakeet TDT 0.6B V2 (en) (origin, onnx)
- `nemo-parakeet-tdt-0.6b-v3` for Nvidia Parakeet TDT 0.6B V3 (multilingual) (origin, onnx)
- `nemo-canary-1b-v2` for Nvidia Canary 1B V2 (multilingual) (origin, onnx)
- `whisper-base` for OpenAI Whisper Base exported with onnxruntime (origin, onnx)
- `alphacep/vosk-model-ru` for Alpha Cephei Vosk 0.54-ru (origin)
- `alphacep/vosk-model-small-ru` for Alpha Cephei Vosk 0.52-small-ru (origin)
- `t-tech/t-one` for T-Tech T-one (origin)
- `onnx-community/whisper-tiny`, `onnx-community/whisper-base`, `onnx-community/whisper-small`, `onnx-community/whisper-large-v3-turbo`, etc. for OpenAI Whisper exported with Hugging Face optimum (onnx-community)
[!IMPORTANT] Some older `onnx-community` model conversions have a broken `fp16` precision version.
[!IMPORTANT] Canary models do not work with the CoreML provider.
Example with soundfile:
import onnx_asr
import soundfile as sf
model = onnx_asr.load_model("whisper-base")
waveform, sample_rate = sf.read("test.wav", dtype="float32")
model.recognize(waveform, sample_rate=sample_rate)
Batch processing is also supported:
import onnx_asr
model = onnx_asr.load_model("nemo-fastconformer-ru-ctc")
print(model.recognize(["test1.wav", "test2.wav", "test3.wav", "test4.wav"]))
Some models have quantized versions:
import onnx_asr
model = onnx_asr.load_model("alphacep/vosk-model-ru", quantization="int8")
print(model.recognize("test.wav"))
Return tokens and timestamps:
import onnx_asr
model = onnx_asr.load_model("alphacep/vosk-model-ru").with_timestamps()
print(model.recognize("test1.wav"))
VAD
Load VAD ONNX model from Hugging Face and recognize wav file:
import onnx_asr
vad = onnx_asr.load_vad("silero")
model = onnx_asr.load_model("gigaam-v2-rnnt").with_vad(vad)
for res in model.recognize("test.wav"):
    print(res)
[!NOTE]
You will most likely need to adjust VAD parameters to get the correct results.
Supported VAD names:
- `silero` for Silero VAD
CLI
The package has a simple CLI interface:
onnx-asr nemo-fastconformer-ru-ctc test.wav
For full usage parameters, see help:
onnx-asr -h
Gradio
Create simple web interface with Gradio:
import onnx_asr
import gradio as gr
model = onnx_asr.load_model("gigaam-v2-rnnt")
def recognize(audio):
    if audio:
        sample_rate, waveform = audio
        # Gradio provides int16 PCM; scale to [-1, 1] float
        waveform = waveform / 2**15
        if waveform.ndim == 2:
            # Average stereo channels down to mono
            waveform = waveform.mean(axis=1)
        return model.recognize(waveform, sample_rate=sample_rate)

demo = gr.Interface(fn=recognize, inputs=gr.Audio(min_length=1, max_length=30), outputs="text")
demo.launch()
Load ONNX model from local directory
Load ONNX model from local directory and recognize wav file:
import onnx_asr
model = onnx_asr.load_model("gigaam-v2-ctc", "models/gigaam-onnx")
print(model.recognize("test.wav"))
Supported model types:
- All models from supported model names
- `nemo-conformer-ctc` for NeMo Conformer/FastConformer/Parakeet with CTC decoder
- `nemo-conformer-rnnt` for NeMo Conformer/FastConformer/Parakeet with RNN-T decoder
- `nemo-conformer-tdt` for NeMo Conformer/FastConformer/Parakeet with TDT decoder
- `nemo-conformer-aed` for NeMo Canary with Transformer decoder
- `kaldi-rnnt` or `vosk` for Kaldi Icefall Zipformer with stateless RNN-T decoder
- `whisper-ort` for Whisper (exported with onnxruntime)
- `whisper` for Whisper (exported with optimum)
Comparison with original implementations
Packages with original implementations:
- `gigaam` for GigaAM models (github)
- `nemo-toolkit` for NeMo models (github)
- `openai-whisper` for Whisper models (github)
- `sherpa-onnx` for Vosk models (github, docs)
- `T-one` for T-Tech T-one model (github)
Hardware:
- CPU tests were run on a laptop with an Intel i7-7700HQ processor.
- GPU tests were run in Google Colab on an Nvidia T4.

RTFx is the inverse real-time factor (seconds of audio transcribed per second of compute; higher is better). Tests of Russian ASR models were performed on a test subset of the Russian LibriSpeech dataset.
| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |
|---|---|---|---|---|---|
| GigaAM v2 CTC | default | 1.06% | 5.23% | 7.2 | 44.2 |
| GigaAM v2 CTC | onnx-asr | 1.06% | 5.23% | 11.6 | 64.3 |
| GigaAM v2 RNN-T | default | 1.10% | 5.22% | 5.5 | 23.3 |
| GigaAM v2 RNN-T | onnx-asr | 1.10% | 5.22% | 10.7 | 38.7 |
| GigaAM v3 CTC | default | 0.98% | 4.72% | 12.2 | 73.3 |
| GigaAM v3 CTC | onnx-asr | 0.98% | 4.72% | 14.5 | 68.3 |
| GigaAM v3 RNN-T | default | 0.93% | 4.39% | 8.2 | 41.6 |
| GigaAM v3 RNN-T | onnx-asr | 0.93% | 4.39% | 13.3 | 39.9 |
| GigaAM v3 E2E CTC | default | 1.50% | 7.10% | N/A | 178.0 |
| GigaAM v3 E2E CTC | onnx-asr | 1.56% | 7.80% | N/A | 65.6 |
| GigaAM v3 E2E RNN-T | default | 1.61% | 6.94% | N/A | 47.6 |
| GigaAM v3 E2E RNN-T | onnx-asr | 1.67% | 7.60% | N/A | 42.8 |
| Nemo FastConformer CTC | default | 3.11% | 13.12% | 29.1 | 143.0 |
| Nemo FastConformer CTC | onnx-asr | 3.11% | 13.12% | 45.8 | 103.3 |
| Nemo FastConformer RNN-T | default | 2.63% | 11.62% | 17.4 | 111.6 |
| Nemo FastConformer RNN-T | onnx-asr | 2.63% | 11.62% | 27.2 | 53.4 |
| Nemo Parakeet TDT 0.6B V3 | default | 2.34% | 10.95% | 5.6 | 75.4 |
| Nemo Parakeet TDT 0.6B V3 | onnx-asr | 2.38% | 10.95% | 9.7 | 59.7 |
| Nemo Canary 1B V2 | default | 4.89% | 20.00% | N/A | 14.0 |
| Nemo Canary 1B V2 | onnx-asr | 5.00% | 20.03% | N/A | 17.4 |
| T-Tech T-one | default | 1.28% | 6.56% | 11.9 | N/A |
| T-Tech T-one | onnx-asr | 1.28% | 6.57% | 11.7 | 16.5 |
| Vosk 0.52 small | greedy_search | 3.64% | 14.53% | 48.2 | 71.4 |
| Vosk 0.52 small | modified_beam_search | 3.50% | 14.25% | 29.0 | 24.7 |
| Vosk 0.52 small | onnx-asr | 3.64% | 14.53% | 45.5 | 75.2 |
| Vosk 0.54 | greedy_search | 2.21% | 9.89% | 34.8 | 64.2 |
| Vosk 0.54 | modified_beam_search | 2.21% | 9.85% | 23.9 | 24 |
| Vosk 0.54 | onnx-asr | 2.21% | 9.89% | 33.6 | 69.6 |
| Whisper base | default | 10.61% | 38.89% | 5.4 | 17.3 |
| Whisper base | onnx-asr* | 10.64% | 38.33% | 6.6 | 20.1 |
| Whisper large-v3-turbo | default | 2.96% | 10.27% | N/A | 13.6 |
| Whisper large-v3-turbo | onnx-asr** | 2.63% | 10.13% | N/A | 12.4 |
Tests of English ASR models were performed on a test subset of the Voxpopuli dataset.
| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |
|---|---|---|---|---|---|
| Nemo Parakeet CTC 0.6B | default | 4.09% | 7.20% | 8.3 | 107.7 |
| Nemo Parakeet CTC 0.6B | onnx-asr | 4.09% | 7.20% | 11.5 | 89.0 |
| Nemo Parakeet RNN-T 0.6B | default | 3.64% | 6.32% | 6.7 | 85.0 |
| Nemo Parakeet RNN-T 0.6B | onnx-asr | 3.64% | 6.32% | 8.7 | 48.0 |
| Nemo Parakeet TDT 0.6B V2 | default | 3.88% | 6.52% | 6.5 | 87.6 |
| Nemo Parakeet TDT 0.6B V2 | onnx-asr | 3.88% | 6.52% | 10.5 | 70.1 |
| Nemo Parakeet TDT 0.6B V3 | default | 3.97% | 6.76% | 6.1 | 90.0 |
| Nemo Parakeet TDT 0.6B V3 | onnx-asr | 3.97% | 6.75% | 9.5 | 68.2 |
| Nemo Canary 1B V2 | default | 4.62% | 7.42% | N/A | 17.5 |
| Nemo Canary 1B V2 | onnx-asr | 4.67% | 7.47% | N/A | 20.8 |
| Whisper base | default | 7.81% | 13.24% | 8.4 | 27.7 |
| Whisper base | onnx-asr* | 7.52% | 12.76% | 9.2 | 28.9 |
| Whisper large-v3-turbo | default | 6.85% | 11.16% | N/A | 20.4 |
| Whisper large-v3-turbo | onnx-asr** | 10.31% | 14.65% | N/A | 17.9 |
[!NOTE]
- \* `whisper-ort` model (model types).
- \*\* `whisper` model (model types) with `fp16` precision.
- All other models were run with the default precision - `fp32` on CPU and `fp32` or `fp16` (some of the original models) on GPU.
Benchmarks
Hardware:
- Arm tests were run on an Orange Pi Zero 3 with a Cortex-A53 processor.
- x64 tests were run on a laptop with an Intel i7-7700HQ processor.
- T4 tests were run in Google Colab on an Nvidia T4.
Russian ASR models
Notebook with benchmark code - benchmark-ru
| Model | RTFx (Arm) | RTFx (x64) | RTFx (T4) |
|---|---|---|---|
| GigaAM v2 CTC | 0.8 | 11.6 | 64.3 |
| GigaAM v2 RNN-T | 0.8 | 10.7 | 38.7 |
| GigaAM v3 CTC | N/A | 14.5 | 68.3 |
| GigaAM v3 RNN-T | N/A | 13.3 | 39.9 |
| Nemo FastConformer CTC | 4.0 | 45.8 | 103.3 |
| Nemo FastConformer RNN-T | 3.2 | 27.2 | 53.4 |
| Nemo Parakeet TDT 0.6B V3 | N/A | 9.7 | 59.7 |
| Nemo Canary 1B V2 | N/A | N/A | 17.4 |
| T-Tech T-one | N/A | 11.7 | 16.5 |
| Vosk 0.52 small | 5.1 | 45.5 | 75.2 |
| Vosk 0.54 | 3.8 | 33.6 | 69.6 |
| Whisper base | 0.8 | 6.6 | 20.1 |
| Whisper large-v3-turbo | N/A | N/A | 12.4 |
English ASR models
Notebook with benchmark code - benchmark-en
| Model | RTFx (Arm) | RTFx (x64) | RTFx (T4) |
|---|---|---|---|
| Nemo Parakeet CTC 0.6B | 1.1 | 11.5 | 89.0 |
| Nemo Parakeet RNN-T 0.6B | 1.0 | 8.7 | 48.0 |
| Nemo Parakeet TDT 0.6B V2 | 1.1 | 10.5 | 70.1 |
| Nemo Parakeet TDT 0.6B V3 | N/A | 9.5 | 68.2 |
| Nemo Canary 1B V2 | N/A | N/A | 20.8 |
| Whisper base | 1.2 | 9.2 | 28.9 |
| Whisper large-v3-turbo | N/A | N/A | 17.9 |
Convert model to ONNX
Save the model according to the instructions below and add config.json:
{
    "model_type": "nemo-conformer-rnnt", // See "Supported model types"
    "features_size": 80, // Preprocessor feature size for Whisper or NeMo models (80 and 128 are supported)
    "subsampling_factor": 8, // 4 for Conformer models, 8 for FastConformer and Parakeet models
    "max_tokens_per_step": 10 // Max tokens per step for the RNN-T decoder
}
Then you can upload the model to Hugging Face and use load_model to download it.
Nvidia NeMo Conformer/FastConformer/Parakeet
Install NeMo Toolkit
pip install nemo_toolkit['asr']
Download the model and export it to ONNX format:
import nemo.collections.asr as nemo_asr
from pathlib import Path
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_ru_fastconformer_hybrid_large_pc")
# To export Hybrid models with the CTC decoder:
# model.set_export_config({"decoder_type": "ctc"})
onnx_dir = Path("nemo-onnx")
onnx_dir.mkdir(exist_ok=True)
model.export(str(Path(onnx_dir, "model.onnx")))
with Path(onnx_dir, "vocab.txt").open("wt") as f:
    for i, token in enumerate([*model.tokenizer.vocab, "<blk>"]):
        f.write(f"{token} {i}\n")
Sber GigaAM v2/v3
Install GigaAM
git clone https://github.com/salute-developers/GigaAM.git
pip install ./GigaAM --extra-index-url https://download.pytorch.org/whl/cpu
Download the model and export it to ONNX format:
import gigaam
from pathlib import Path
onnx_dir = "gigaam-onnx"
model_type = "rnnt" # or "ctc"
model = gigaam.load_model(
    model_type,
    fp16_encoder=False,  # only fp32 tensors
    use_flash=False,  # disable flash attention
)
model.to_onnx(dir_path=onnx_dir)
with Path(onnx_dir, "v2_vocab.txt").open("wt") as f:
    for i, token in enumerate(["\u2581", *(chr(ord("а") + i) for i in range(32)), "<blk>"]):
        f.write(f"{token} {i}\n")
OpenAI Whisper (with onnxruntime export)
See the onnxruntime instructions for converting Whisper to ONNX.
Download the model and export it with Beam Search and Forced Decoder Input Ids:
python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-base --output ./whisper-onnx --use_forced_decoder_ids --optimize_onnx --precision fp32
Save the tokenizer config:
from transformers import WhisperTokenizer
processor = WhisperTokenizer.from_pretrained("openai/whisper-base")
processor.save_pretrained("whisper-onnx")
OpenAI Whisper (with optimum export)
Export the model to ONNX with the Hugging Face optimum-cli:
optimum-cli export onnx --model openai/whisper-base ./whisper-onnx