# nano-parakeet
Pure-PyTorch inference for NVIDIA Parakeet TDT — no NeMo required.
```python
from nano_parakeet import from_pretrained

model = from_pretrained()
print(model.transcribe("audio.wav"))
```
## Why?
The official NeMo inference stack pulls in ~180 packages — PyTorch Lightning, Hydra, OmegaConf, apex, distributed training scaffolding — none of which are needed at inference time. This makes it painful to integrate Parakeet into existing projects: version conflicts, long installs, and a 30-second cold-start on every process launch.
nano-parakeet reimplements the full inference pipeline in plain PyTorch. The only dependencies are things you probably already have:
| | nano-parakeet | NeMo |
|---|---|---|
| Dependencies | 5 (torch, numpy, soundfile, sentencepiece, huggingface-hub) | ~180 |
| Cold start | ~3s (weights only) | ~30s (framework init + CUDA kernel compile) |
| Warm RTF (Jetson AGX Orin) | 93× | 73× |
Transcriptions are byte-identical to NeMo's output.
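To check this yourself, here is a small parity sketch. It assumes NeMo's `ASRModel` API; the pretrained model name is illustrative, and `transcribe` may return plain strings or hypothesis objects depending on the NeMo version:

```python
import nemo.collections.asr as nemo_asr
from nano_parakeet import from_pretrained

# Model name is a placeholder; use the checkpoint you are comparing against
nemo_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")
hyp = nemo_model.transcribe(["audio.wav"])[0]
nemo_text = hyp.text if hasattr(hyp, "text") else hyp

assert from_pretrained().transcribe("audio.wav") == nemo_text
```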
## Install
```bash
pip install nano-parakeet
```
Requires Python 3.10+, PyTorch with CUDA, and ffmpeg.
## Usage

### Python API
```python
from nano_parakeet import from_pretrained

model = from_pretrained()             # downloads ~1.1GB on first run
text = model.transcribe("audio.wav")  # path, numpy array, or tensor
print(text)
```
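Since `transcribe` also accepts in-memory audio, here is a short sketch using `soundfile` (already a dependency). It assumes the array is 16 kHz mono, matching what the model expects:

```python
import soundfile as sf
from nano_parakeet import from_pretrained

model = from_pretrained()

# Decode the file to a float32 numpy array; the model expects 16 kHz mono
audio, sample_rate = sf.read("audio.wav", dtype="float32")
assert sample_rate == 16000
print(model.transcribe(audio))
```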
### CLI
```bash
nano-parakeet audio.wav
# or
python -m nano_parakeet audio.wav
```
Accepts OGG, WAV, M4A, or any format ffmpeg can read.
## Benchmark
RTF (real-time factor) = audio duration ÷ processing time, so RTF > 1.0 means faster than real time. Each figure is the best of 5 timed runs after a warm-up.
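A minimal sketch of that measurement loop (the file name and audio length are placeholders):

```python
import time
import torch
from nano_parakeet import from_pretrained

model = from_pretrained()
AUDIO_SECONDS = 12.0                    # length of sample.wav, assumed here

model.transcribe("sample.wav")          # warm-up run (not timed)
times = []
for _ in range(5):
    torch.cuda.synchronize()            # don't time queued async work
    start = time.perf_counter()
    model.transcribe("sample.wav")
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

print(f"RTF: {AUDIO_SECONDS / min(times):.0f}x")
```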
### Warm throughput
| GPU | Audio | NeMo RTF | nano-parakeet RTF | Speedup |
|---|---|---|---|---|
| RTX 4090 | 12s | ~207× | ~519× | 2.5× |
| Jetson AGX Orin 64GB | 12s | ~73× | ~92× | 1.3× |
**Note (RTX 4090):** NeMo is run with `strategy='greedy'` (single-item, not batch). The default `greedy_batch` strategy uses TDT label-looping CUDA graphs that fail to compile on NeMo 2.6.2 + cuda-python 12.9 (NVRTC is not permitted inside a graph capture context); `strategy='greedy'` uses a different CUDA graph path that works fine.
### Cold start (first inference, including framework load)
| GPU | NeMo | nano-parakeet |
|---|---|---|
| RTX 4090 | ~30s | ~3s |
| Jetson AGX Orin 64GB | ~30s | ~3s |
Run both yourself:
```bash
git clone https://github.com/andimarafioti/nano-parakeet
cd nano-parakeet
./benchmark.sh sample.wav
```
## How It Works
The full pipeline in plain PyTorch — no NeMo at runtime:
```
Audio (16 kHz, mono)
 │
 ▼ pre-emphasis (α=0.97) → STFT (n_fft=512, hop=160, win=400)
   → Mel filterbank (128 bins) → log → per-feature normalisation
 │
 ▼ FastConformer Encoder (24 layers, d_model=1024, 8 heads)
   └─ ConvSubsampling (3× stride-2 → 8× time reduction)
   └─ RelPositionalEncoding (Transformer-XL style)
   └─ 24 × FastConformerLayer:
        FF₁ (×0.5) → Self-Attn (rel-pos) → Conv (k=9) → FF₂ (×0.5) → LN
 │
 ▼ TDT Decoder
   └─ RNNT Prediction: Embed(8193, 640) + 2-layer LSTM(640)
   └─ Joint: Linear(1024→640) + Linear(640→640) → ReLU → Linear(640→8198)
   └─ TDT greedy decode (durations [0,1,2,3,4], blank_id=8192)
 │
 ▼ SentencePiece decode → text
```
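The TDT decode loop is the main departure from vanilla RNN-T: alongside each label, the joint network predicts how many encoder frames to skip. A minimal sketch, assuming `predict` (prediction-network step) and `joint` callables; the names and state handling are illustrative, not the package's actual internals:

```python
import torch

BLANK_ID = 8192
DURATIONS = [0, 1, 2, 3, 4]

def tdt_greedy_decode(enc: torch.Tensor) -> list[int]:
    """enc: (T, 1024) encoder output. Returns SentencePiece token ids.

    `predict` and `joint` are stand-ins for the prediction and joint networks.
    """
    tokens: list[int] = []
    state = None                            # LSTM hidden state
    last = torch.tensor([BLANK_ID])         # start symbol
    t = 0
    while t < enc.shape[0]:
        dec, new_state = predict(last, state)
        logits = joint(enc[t], dec)         # (8198,): 8193 labels + 5 durations
        label = int(logits[:8193].argmax())
        skip = DURATIONS[int(logits[8193:].argmax())]
        if label != BLANK_ID:
            tokens.append(label)
            last = torch.tensor([label])
            state = new_state               # prediction net advances on non-blank only
            t += skip                       # duration 0 allows another label on this frame
        else:
            t += max(skip, 1)               # always advance past a blank
    return tokens
```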
Weights are loaded directly from the .nemo file (a ZIP archive) without importing any NeMo module.
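A rough sketch of that loading path; the member names inside the archive vary by model and NeMo version, so treat the ones here as placeholders:

```python
import io
import zipfile
import torch

# A .nemo file is a plain ZIP: config YAML, checkpoint, SentencePiece model, ...
with zipfile.ZipFile("parakeet-tdt-1.1b.nemo") as archive:
    print(archive.namelist())
    with archive.open("model_weights.ckpt") as f:
        state_dict = torch.load(io.BytesIO(f.read()), map_location="cpu")
# The tensors can then be mapped onto the plain-PyTorch modules above.
```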
## Optimisations
| | Encoder | Decoder | Effect |
|---|---|---|---|
| fp16 autocast | ✓ | ✗ | tensor cores for 1024→4096→1024 FFN matmuls × 24 layers |
| CUDA graph | ✗ | ✓ | ~20 kernel launches per decode step → 1 graph replay |
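The decoder-side trick follows the standard `torch.cuda.CUDAGraph` capture-and-replay pattern. A minimal sketch, assuming a `decode_step` callable with fixed-shape inputs and `next_frame`/`prev_token` tensors supplied by the decode loop; all names are illustrative:

```python
import torch

# Static input buffers: graph replay reuses the same memory addresses
static_frame = torch.zeros(1, 1024, device="cuda")
static_token = torch.zeros(1, dtype=torch.long, device="cuda")

# Warm up on a side stream so lazy kernel init happens outside capture
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    out = decode_step(static_frame, static_token)
torch.cuda.current_stream().wait_stream(side)

# Capture one decode step into a graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    out = decode_step(static_frame, static_token)

# Per decode step: copy fresh inputs into the static buffers, then replay;
# the whole step becomes one launch instead of ~20
static_frame.copy_(next_frame)
static_token.copy_(prev_token)
graph.replay()
```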
## Jetson Setup
The PyPI wheel works on standard x86 CUDA machines. For Jetson (JetPack 6), PyTorch needs to be installed from NVIDIA's distribution first:
```bash
# Install CUDA-enabled PyTorch for JetPack 6
UV_SKIP_WHEEL_FILENAME_CHECK=1 uv pip install \
  https://developer.download.nvidia.com/compute/redist/jp/v61/pytorch/torch-2.5.0a0+872d972e41.nv24.08.17622132-cp310-cp310-linux_aarch64.whl

# Then install nano-parakeet (skipping torch since it's already installed)
pip install nano-parakeet --no-deps
pip install numpy soundfile sentencepiece huggingface-hub
```