Pure-PyTorch inference port of NeMo ASR (VAD + Parakeet-TDT) for Windows/WSL

Project description

nemoasr2pytorch

Pure‑PyTorch inference port of several NeMo ASR models, with a focus on Windows / WSL support and no NeMo runtime dependency.

Currently supported:

- Frame‑level VAD: Frame_VAD_Multilingual_MarbleNet_v2.0
- ASR (RNNT‑TDT):
  - parakeet-tdt-0.6b-v2 – English
  - parakeet-tdt-0.6b-v3 – Multilingual

The project targets inference only (no training or data pipelines) and mirrors NeMo’s architecture closely so that results match NeMo’s as closely as possible.

Installation

Install a suitable PyTorch + torchaudio build first (GPU or CPU), following the official instructions.
For example, on CUDA 12.6:

```
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
  --index-url https://download.pytorch.org/whl/cu126
```

Then install this package:
```
pip install nemoasr2pytorch
```

Torch is not pinned as a dependency on purpose – you stay in control of the exact CUDA / CPU build.
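Because the PyTorch build is chosen by you, it can help to confirm what is actually installed before loading a model. The following is a minimal sketch and not part of nemoasr2pytorch: the helper name `describe_torch_install` is hypothetical, and it relies only on `importlib` plus standard `torch` attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`).

```python
import importlib
import importlib.util


def describe_torch_install():
    """Summarize the local torch / torchaudio build (hypothetical helper)."""
    info = {
        "torch": None,
        "torchaudio": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    if importlib.util.find_spec("torch") is None:
        return info  # torch is not installed at all

    torch = importlib.import_module("torch")
    info["torch"] = str(torch.__version__)
    info["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = bool(torch.cuda.is_available())

    if importlib.util.find_spec("torchaudio") is not None:
        try:
            torchaudio = importlib.import_module("torchaudio")
            info["torchaudio"] = str(torchaudio.__version__)
        except Exception:
            # A torchaudio build that does not match torch can fail on import.
            info["torchaudio"] = "import failed"
    return info


if __name__ == "__main__":
    for key, value in describe_torch_install().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False on a machine with a GPU, the installed wheel is most likely a CPU-only build or targets a different CUDA version.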

Quick ASR usage (Parakeet‑TDT)

The simplest way to run ASR on a single WAV file:

```
from nemoasr2pytorch.asr.api import load_default_parakeet_tdt_model, transcribe

# lang="EN" -> parakeet-tdt-0.6b-v2 (English)
# lang="EU" -> parakeet-tdt-0.6b-v3 (multilingual)
model = load_default_parakeet_tdt_model(lang="EU")

text = transcribe(model, "your_audio.wav")
print(text)
```

Details:

On first use, the corresponding .pt weights are automatically downloaded from ModelScope and cached under exports/parakeet_tdt_0.6b_v{2,3}.pt in your working directory.
Subsequent runs reuse the local .pt directly.
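To see whether a run will trigger a download, you can check for the cached file yourself. This is a sketch under the assumptions stated in this section only: the path pattern exports/parakeet_tdt_0.6b_v{2,3}.pt relative to the working directory, and the EN -> v2 / EU -> v3 mapping. The helper names `cached_weights_path` and `is_cached` are hypothetical and not part of the package.

```python
from pathlib import Path

# Mapping as described in this README: EN -> v2 (English), EU -> v3 (multilingual).
_LANG_TO_VERSION = {"EN": 2, "EU": 3}


def cached_weights_path(lang, workdir="."):
    """Return where the Parakeet .pt file for `lang` is expected to be cached."""
    version = _LANG_TO_VERSION[lang.upper()]
    return Path(workdir) / "exports" / f"parakeet_tdt_0.6b_v{version}.pt"


def is_cached(lang, workdir="."):
    """True if the weights already exist locally, so no download should be needed."""
    return cached_weights_path(lang, workdir).is_file()


if __name__ == "__main__":
    for lang in ("EN", "EU"):
        state = "cached" if is_cached(lang) else "missing"
        print(f"{lang}: {cached_weights_path(lang)} ({state})")
```

Since the cache location is relative to the working directory, starting your program from a different directory would be expected to trigger a fresh download.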

Low‑precision inference (FP16 / BF16)

On GPU you can load the model directly in low precision to save memory:

```
from nemoasr2pytorch.asr.api import (
    load_parakeet_tdt_fp16,
    load_parakeet_tdt_bf16,
    transcribe_amp,
)

# FP16 model (GPU only)
model_fp16 = load_parakeet_tdt_fp16(lang="EU")
print("FP16:", transcribe_amp(model_fp16, "your_audio.wav"))

# BF16 model (if hardware supports it)
model_bf16 = load_parakeet_tdt_bf16(lang="EU")
print("BF16:", transcribe_amp(model_bf16, "your_audio.wav"))
```

transcribe_amp uses PyTorch AMP (torch.amp.autocast) on CUDA to run the model in mixed precision.
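Which loader to call depends on the hardware at hand. The sketch below shows one possible selection policy; it is an assumption for illustration, not behaviour of the package. It prefers BF16 when CUDA reports support, falls back to FP16 on other GPUs, and uses the FP32 loader on CPU. The function names `choose_loader_name` and `detect_and_choose` are hypothetical; `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()` are standard PyTorch calls.

```python
def choose_loader_name(cuda_available, bf16_supported):
    """Pick a loader name from nemoasr2pytorch.asr.api (hypothetical policy)."""
    if not cuda_available:
        return "load_default_parakeet_tdt_model"  # FP32, works on CPU
    if bf16_supported:
        return "load_parakeet_tdt_bf16"
    return "load_parakeet_tdt_fp16"


def detect_and_choose():
    """Query the local PyTorch build and apply the policy above."""
    import torch  # imported lazily so the policy itself has no dependencies

    cuda = torch.cuda.is_available()
    # Only ask about BF16 when a CUDA device is actually present.
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return choose_loader_name(cuda, bf16)
```

With the returned name you would then call the matching loader and use `transcribe_amp` for the FP16/BF16 models or `transcribe` for the FP32 model.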

VAD (MarbleNet) for pre‑segmentation

Frame‑level VAD API:

```
import torchaudio

from nemoasr2pytorch.vad.api import load_default_frame_vad_model, run_vad_on_waveform

# Loads MarbleNet VAD; if the .pt is missing, it is auto-downloaded
# from ModelScope to ./exports/frame_vad_multilingual_marblenet_v2.0.pt
vad_model = load_default_frame_vad_model()

# torchaudio.load returns a (channels, samples) tensor and the sample rate
waveform, sr = torchaudio.load("your_audio.wav")
if sr != vad_model.preprocessor.sample_rate:
    waveform = torchaudio.functional.resample(
        waveform, sr, vad_model.preprocessor.sample_rate
    )

# Downmix to a 1-D mono signal; unlike squeeze(0), this also handles stereo files
probs, segments = run_vad_on_waveform(vad_model, waveform.mean(dim=0))
print("Segments:", segments)
```
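To illustrate how per-frame probabilities relate to speech segments, here is a simplified sketch of thresholding. It is not the implementation behind `run_vad_on_waveform`, which may apply smoothing or padding; the function name `probs_to_segments`, the 0.02 s frame duration, the 0.5 threshold, and the `(start_sec, end_sec)` output format are all assumptions made for this example.

```python
def probs_to_segments(probs, frame_sec=0.02, threshold=0.5):
    """Turn per-frame speech probabilities into (start_sec, end_sec) segments."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins
        elif p < threshold and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None  # the speech run ended at frame i
    if start is not None:  # speech continues through the last frame
        segments.append((start * frame_sec, len(probs) * frame_sec))
    return segments


if __name__ == "__main__":
    # With 1-second frames for readability: speech in frames 1-2 and frame 4.
    print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7], frame_sec=1.0))
```

Check the actual structure of `segments` returned by the package before relying on this format.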

Long‑audio inference (concept)

The GitHub repository ships a reference script, inference.py, which:

- loads a Parakeet model (v2/v3, chosen by lang);
- optionally runs MarbleNet VAD to detect speech regions;
- merges VAD segments into chunks based on min_seg / max_seg length;
- runs Parakeet on each chunk and concatenates the results.
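The merging step can be sketched as follows. This is one possible interpretation of min_seg / max_seg, not the logic of inference.py itself: a chunk keeps absorbing the next segment while it is still shorter than `min_seg` and the merged span would not exceed `max_seg`. The function name `merge_segments` and the `(start_sec, end_sec)` tuple format are assumptions for illustration.

```python
def merge_segments(segments, min_seg, max_seg):
    """Greedily merge time-ordered (start_sec, end_sec) segments into chunks."""
    chunks = []
    cur_start, cur_end = None, None
    for start, end in segments:
        if cur_start is None:
            cur_start, cur_end = start, end
            continue
        too_short = (cur_end - cur_start) < min_seg
        fits = (end - cur_start) <= max_seg
        if too_short and fits:
            cur_end = end  # absorb this segment and the gap before it
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


if __name__ == "__main__":
    vad_segments = [(0, 2), (3, 5), (6, 12), (13, 14), (50, 51)]
    print(merge_segments(vad_segments, min_seg=10, max_seg=30))
```

Note that this sketch never splits a single segment that is already longer than `max_seg`; a real pipeline would need an extra rule for that case.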

The core logic is implemented via the public APIs:

- nemoasr2pytorch.vad.api – VAD model + run_vad_on_waveform
- nemoasr2pytorch.asr.api – Parakeet model + transcribe / transcribe_amp

You can either:

- copy inference.py from the GitHub repo and adapt it to your own CLI; or
- re‑implement a similar pipeline in your application using the two APIs above.
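If you re-implement the pipeline, the final step (transcribing each chunk and joining the text) might look like the sketch below. The function name `transcribe_long` and the `transcribe_chunk` callable are hypothetical. This description does not confirm whether `transcribe` accepts in-memory audio, so the callable is left to you: it could, for example, write the slice to a temporary WAV file and call `transcribe(model, path)`.

```python
def transcribe_long(samples, sample_rate, chunks, transcribe_chunk):
    """Transcribe each (start_sec, end_sec) chunk and join the resulting text.

    `samples` is any sliceable 1-D sequence (list, NumPy array, torch tensor).
    `transcribe_chunk` is a caller-supplied callable: slice of samples -> str.
    """
    texts = []
    for start_sec, end_sec in chunks:
        lo = int(round(start_sec * sample_rate))
        hi = int(round(end_sec * sample_rate))
        text = transcribe_chunk(samples[lo:hi]).strip()
        if text:  # skip chunks that produced no text
            texts.append(text)
    return " ".join(texts)


if __name__ == "__main__":
    # Dummy "recognizer" that reports how many samples it received.
    fake = lambda chunk: f"<{len(chunk)} samples>"
    print(transcribe_long(list(range(16000)), 16000, [(0.0, 0.5), (0.5, 1.0)], fake))
```

Passing the recognizer in as a callable keeps the chunking logic independent of the model, which also makes it easy to test without downloading any weights.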

Package APIs

Main public modules:

nemoasr2pytorch.asr.api
- load_default_parakeet_tdt_model(lang="EN" | "EU", device=None, dtype=torch.float32)
  Load Parakeet‑TDT in FP32; lang chooses v2 (EN) vs v3 (EU).
- load_parakeet_tdt_fp16(lang="EN" | "EU", device=None)
  Load FP16 model (usually on GPU).
- load_parakeet_tdt_bf16(lang="EN" | "EU", device=None)
  Load BF16 model (if supported).
- transcribe(model, audio)
  Greedy TDT decoding in full precision (CPU or GPU).
- transcribe_amp(model, audio)
  Greedy TDT decoding with AMP on CUDA for low‑precision models.
nemoasr2pytorch.vad.api
- load_default_frame_vad_model(device=None, dtype=torch.float32)
  Load the MarbleNet VAD model from a local .pt.
- run_vad_on_waveform(model, audio, ...)
  Compute per‑frame speech probabilities and return merged speech segments.

Notes / Limitations

- This package focuses on inference only; training and NeMo’s full config stack (Hydra/Lightning) are intentionally omitted.
- Parakeet weights (.pt) are auto‑downloaded from ModelScope on first use; VAD .pt is currently expected to be provided by the user (converted from NeMo).
- For best performance and lower memory usage, a CUDA‑enabled PyTorch build is recommended; CPU‑only inference also works but will be slower on long audio.

Project details

Release history

- 0.1.6 – Jun 7, 2026 (this version)
- 0.1.5 – Nov 26, 2025
- 0.1.4 – Nov 25, 2025
- 0.1.3 – Nov 22, 2025
- 0.1.2 – Nov 21, 2025

Download files

Source Distribution

nemoasr2pytorch-0.1.5.tar.gz (37.4 kB)

Uploaded Nov 26, 2025 Source

Built Distribution

nemoasr2pytorch-0.1.5-py3-none-any.whl (40.8 kB)

Uploaded Nov 26, 2025 Python 3

File details

Details for the file nemoasr2pytorch-0.1.5.tar.gz.

File metadata

Download URL: nemoasr2pytorch-0.1.5.tar.gz
Upload date: Nov 26, 2025
Size: 37.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for nemoasr2pytorch-0.1.5.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `108622462ec3e9b9b6126c7597992e8a1e1e655e5039cc0382631f94df8a69cd` |
| MD5 | `41cf766b6963dfb6e5e4d7cd61339634` |
| BLAKE2b-256 | `a8bd91d7123ad6eca676ad2e4cf2299b84943593016972134de69f26b0319dcb` |

File details

Details for the file nemoasr2pytorch-0.1.5-py3-none-any.whl.

File metadata

Download URL: nemoasr2pytorch-0.1.5-py3-none-any.whl
Upload date: Nov 26, 2025
Size: 40.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for nemoasr2pytorch-0.1.5-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `ab302bdc8181cce89ab9b6f648f855aa21fa2f12c58c8529e71a2db54828db66` |
| MD5 | `fd91d8eee5ff0ee1ae73e18dd86d02a8` |
| BLAKE2b-256 | `8522d1cb644a191791ae7a99ffe93eec14a41c9214954724173fa638df860223` |
