PyTorch implementation of NVIDIA NeMo NanoCodec - Ultra-lightweight neural audio codec (0.6 kbps, 1764:1 compression)
Project description
NanoCodec PyTorch
A PyTorch implementation of NVIDIA NeMo NanoCodec, an ultra-lightweight neural audio codec achieving 0.6 kbps bitrate with 1764:1 compression ratio.
Features
- Ultra-Low Bitrate: 0.6 kbps at 22.05 kHz (12.5 fps frame rate)
- High Compression: 1764:1 compression ratio (2×3×6×7×7 downsampling)
- Multi-Device Support: CPU, CUDA (NVIDIA GPUs), MPS (Apple Silicon)
- Production Ready: 164/164 tests passing, comprehensive validation
- Causal Architecture: Supports streaming inference
- Efficient: ~105M parameters, optimized for real-time inference
Model Architecture
- Encoder: HiFiGAN-based encoder with 5 downsampling stages
- Quantizer: Grouped Finite Scalar Quantization (4 groups, 4032 codes per group)
- Decoder: Causal HiFiGAN decoder with HalfSnake activations
- Sample Rate: 22.05 kHz mono
- Parameters: ~105M.
Installation
From PyPI (when available)
pip install nanocodec-torch soundfile
From Source
git clone https://github.com/nineninesix-ai/nanocodec-torch.git
cd nanocodec-torch
pip install -e .
Dependencies
- Python 3.10+
- PyTorch 2.0+
- soundfile
- numpy
- huggingface-hub
- safetensors
Quick Start
Basic Usage
import torch
from nanocodec_torch.models.audio_codec import AudioCodecModel
import soundfile as sf
# Load pretrained model from HuggingFace Hub
model = AudioCodecModel.from_pretrained(
"nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch"
)
# Move to desired device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
# Load audio (will be resampled to 22050 Hz if needed)
audio, sr = sf.read("input.wav")
audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).unsqueeze(0).to(device)
audio_len = torch.tensor([len(audio)], dtype=torch.int32).to(device)
# Encode and decode
with torch.no_grad():
tokens, tokens_len = model.encode(audio_tensor, audio_len)
reconstructed, recon_len = model.decode(tokens, tokens_len)
# Save reconstructed audio
output = reconstructed[0, 0, :int(recon_len[0])].cpu().numpy()
sf.write("output.wav", output, 22050)
print(f"Compression ratio: {len(audio) / tokens.shape[2]:.0f}:1")
print(f"Tokens shape: {tokens.shape}") # [B, 4, T/1764]
Device Selection
# CUDA (NVIDIA GPU)
if torch.cuda.is_available():
device = "cuda"
model = model.to(device)
# MPS (Apple Silicon M1/M2/M3)
if torch.backends.mps.is_available():
device = "mps"
model = model.to(device)
# CPU (fallback)
device = "cpu"
model = model.to(device)
Batch Processing
import torch
from nanocodec_torch.models.audio_codec import AudioCodecModel
import soundfile as sf
model = AudioCodecModel.from_pretrained(
"nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch"
).to("cuda").eval()
# Load multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
audio_list = []
audio_lens = []
for file in audio_files:
audio, sr = sf.read(file)
audio_list.append(torch.tensor(audio, dtype=torch.float32))
audio_lens.append(len(audio))
# Pad to same length
max_len = max(audio_lens)
audio_batch = torch.zeros(len(audio_list), 1, max_len)
for i, audio in enumerate(audio_list):
audio_batch[i, 0, :len(audio)] = audio
audio_lens = torch.tensor(audio_lens, dtype=torch.int32).to("cuda")
audio_batch = audio_batch.to("cuda")
# Process batch
with torch.no_grad():
tokens, tokens_len = model.encode(audio_batch, audio_lens)
reconstructed, recon_lens = model.decode(tokens, tokens_len)
# Save outputs
for i, (audio, length) in enumerate(zip(reconstructed, recon_lens)):
output = audio[0, :int(length)].cpu().numpy()
sf.write(f"output_{i}.wav", output, 22050)
Examples
Comprehensive examples are available in the examples/ directory:
- basic_encode_decode.py - Basic encode/decode workflow
- batch_processing.py - Batch process multiple files
- streaming_inference.py - Causal streaming inference
- device_examples.py - Multi-device usage examples
API Reference
AudioCodecModel
The main model class for audio encoding and decoding.
Methods
from_pretrained(repo_id: str, device: str = "cpu") -> AudioCodecModel
Load pretrained model from HuggingFace Hub.
model = AudioCodecModel.from_pretrained(
"nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch"
)
encode(audio: torch.Tensor, audio_len: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]
Encode audio to discrete tokens.
- Input:
audio: Audio tensor[B, 1, T], float32, range [-1, 1]audio_len: Length tensor[B], int32
- Output:
tokens: Discrete tokens[B, 4, T/1764], int32tokens_len: Token lengths[B], int32
decode(tokens: torch.Tensor, tokens_len: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]
Decode tokens back to audio.
- Input:
tokens: Discrete tokens[B, 4, T], int32tokens_len: Token lengths[B], int32
- Output:
audio: Reconstructed audio[B, 1, T*1764], float32, range [-1, 1]audio_len: Audio lengths[B], int32
forward(audio: torch.Tensor, audio_len: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Full encode-decode roundtrip.
- Output:
(reconstructed_audio, tokens, audio_len)
For detailed API documentation, see API_REFERENCE.md.
Input/Output Specifications
Input
- Type: Audio waveform
- Format: .wav, .mp3, .flac (any format supported by soundfile)
- Sample Rate: 22.05 kHz (audio will be resampled if necessary)
- Channels: Mono (stereo will be converted to mono)
- Range: [-1.0, 1.0] (normalized float32)
Output
- Type: Reconstructed audio waveform
- Format: .wav (or any format supported by soundfile)
- Sample Rate: 22.05 kHz
- Channels: Mono
- Range: [-1.0, 1.0] (clamped)
- Bitrate: 0.6 kbps (12.5 fps × 4 groups × log2(4032) ≈ 600 bps)
Performance Benchmarks
Inference Speed
-
Use
torch.compile(mode="reduce-overhead"). Note: torch.compile() not supported on MPS. -
Real-time factor = audio duration / processing time
Performance Optimization
For optimal inference performance on CUDA/CPU:
# Load and compile model (PyTorch 2.0+)
model = AudioCodecModel.from_pretrained(
"nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch",
device="cuda"
)
model.eval()
# Compile for 1.2-2x speedup
model.compile(mode="reduce-overhead")
# First inference includes compilation overhead (~5-10 seconds)
with torch.no_grad():
tokens, _ = model.encode(audio_tensor, audio_len)
# Subsequent inferences are faster
with torch.no_grad():
reconstructed, _ = model.decode(tokens, tokens_len)
Compilation modes:
default: Balanced optimizationreduce-overhead: Best for inference (recommended)max-autotune: Aggressive optimization (longer compile time)
Note: torch.compile() requires PyTorch 2.0+ and is not supported on MPS (Apple Silicon).
Memory Usage
| Batch Size | Audio Length | Model Size | Peak Memory (CUDA) |
|---|---|---|---|
| 1 | 5s | 420 MB | 500 MB |
| 4 | 5s | 420 MB | 650 MB |
| 16 | 5s | 420 MB | 1.2 GB |
Testing
The codebase includes comprehensive test coverage:
# Run all tests
pytest
# Run with coverage
pytest --cov=src/nanocodec_torch --cov-report=html
# Run specific test categories
pytest -m unit # Unit tests only
pytest -m integration # Integration tests only
pytest -m quality # Audio quality tests
Test Results: 164/164 tests passing (98.8%), 2 skipped (device-specific).
Documentation
- API_REFERENCE.md - Complete API reference
- CONTRIBUTING.md - Contribution guidelines
Known Limitations
- Audio Quality: As an ultra-low bitrate codec (0.6 kbps), expect significant quality degradation compared to higher bitrate codecs
- Sample Rate: Fixed at 22.05 kHz, not suitable for high-fidelity audio
- Mono Only: Stereo audio will be converted to mono
- Compression Artifacts: Extreme compression ratio (1764:1) introduces noticeable artifacts
- Use Case: Best suited for speech/voice applications, not music production
License
This code is licensed under the Apache License 2.0. See LICENSE for details.
The original NVIDIA NeMo NanoCodec model weights and architecture are developed by NVIDIA Corporation and are licensed under the NVIDIA Open Model License. See NOTICE for attribution.
When using this project, you must comply with both licenses.
Acknowledgments
- Original implementation by NVIDIA NeMo Team
- Architecture based on HiFi-GAN
- Quantization based on Finite Scalar Quantization (FSQ)
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Full Documentation
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nanocodec_torch-1.0.0.tar.gz.
File metadata
- Download URL: nanocodec_torch-1.0.0.tar.gz
- Upload date:
- Size: 46.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
309f60a8ca17e3d9e4bbb126f926ce04e1a46346dd9de53b016f04f6fd288196
|
|
| MD5 |
7052f0bf8719e966feba3b82b9149215
|
|
| BLAKE2b-256 |
914c5933d7c84d44da346cf0f360455dd9fa131094215a0c90f8ec0fa48d4593
|
File details
Details for the file nanocodec_torch-1.0.0-py3-none-any.whl.
File metadata
- Download URL: nanocodec_torch-1.0.0-py3-none-any.whl
- Upload date:
- Size: 32.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2b75897f375aefa191bafe1db0e6fc24dccf3851150846a4fb09b29db891968c
|
|
| MD5 |
ad30c88e7cf99542247608b31f758ec7
|
|
| BLAKE2b-256 |
693cc2c0c5f21a90aef0bc0509c2d29caab8582efa4c910a0de0a9d6d6821dae
|