PyTorch implementation of NVIDIA NeMo NanoCodec - Ultra-lightweight neural audio codec (0.6 kbps, 1764:1 compression)

NanoCodec PyTorch

A PyTorch implementation of NVIDIA NeMo NanoCodec, an ultra-lightweight neural audio codec achieving 0.6 kbps bitrate with 1764:1 compression ratio.

Features

  • Ultra-Low Bitrate: 0.6 kbps at 22.05 kHz (12.5 fps frame rate)
  • High Compression: 1764:1 compression ratio (2×3×6×7×7 downsampling)
  • Multi-Device Support: CPU, CUDA (NVIDIA GPUs), MPS (Apple Silicon)
  • Production Ready: 164 tests passing, comprehensive validation
  • Causal Architecture: Supports streaming inference
  • Efficient: ~105M parameters, optimized for real-time inference
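
The frame rate and the 1764:1 figure follow directly from the downsampling factors listed above. A minimal sketch of that arithmetic; the variable names are illustrative and not part of the package:

```python
import math

SAMPLE_RATE = 22050          # Hz, from the feature list
STRIDES = [2, 3, 6, 7, 7]    # downsampling factors from the feature list

# Total downsampling: input samples consumed per token frame.
samples_per_frame = math.prod(STRIDES)

# Frame rate: token frames produced per second of audio.
frames_per_second = SAMPLE_RATE / samples_per_frame

print(samples_per_frame)    # 1764
print(frames_per_second)    # 12.5
```

In other words, each token frame summarizes 1764 input samples, which is where the 12.5 fps frame rate comes from.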

Model Architecture

  • Encoder: HiFiGAN-based encoder with 5 downsampling stages
  • Quantizer: Grouped Finite Scalar Quantization (4 groups, 4032 codes per group)
  • Decoder: Causal HiFiGAN decoder with HalfSnake activations
  • Sample Rate: 22.05 kHz mono
  • Parameters: ~105M
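
To give a feel for how finite scalar quantization turns a small latent vector into one of 4032 codes, here is a simplified sketch for a single group. The level counts (9, 8, 8, 7) are an assumption chosen only because their product is 4032, and the function names are hypothetical; the package's quantizer may bound, round, and index values differently.

```python
import math

# Hypothetical per-dimension level counts for one group; 9 * 8 * 8 * 7 = 4032.
LEVELS = [9, 8, 8, 7]

def quantize(latent, levels=LEVELS):
    """Round each bounded coordinate to one of its integer levels (0 .. L-1)."""
    digits = []
    for value, num_levels in zip(latent, levels):
        bounded = math.tanh(value)                          # squash to (-1, 1)
        scaled = (bounded + 1.0) / 2.0 * (num_levels - 1)   # map to [0, L-1]
        digits.append(round(scaled))
    return digits

def to_code_index(digits, levels=LEVELS):
    """Combine the per-dimension levels into a single mixed-radix code index."""
    index, base = 0, 1
    for digit, num_levels in zip(digits, levels):
        index += digit * base
        base *= num_levels
    return index

digits = quantize([0.3, -1.2, 2.0, 0.0])
print(digits, to_code_index(digits))
print(math.prod(LEVELS))  # 4032 possible codes per group
```

With four such groups, each frame is described by four independent code indices, which matches the token shape [B, 4, T/1764] used elsewhere in this document.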

Installation

From PyPI

pip install nanocodec-torch soundfile

From Source

git clone https://github.com/nineninesix-ai/nanocodec-torch.git
cd nanocodec-torch
pip install -e .

Dependencies

  • Python 3.10+
  • PyTorch 2.0+
  • soundfile
  • numpy
  • huggingface-hub
  • safetensors

Quick Start

Basic Usage

import torch
from nanocodec_torch.models.audio_codec import AudioCodecModel
import soundfile as sf

# Load pretrained model from HuggingFace Hub
model = AudioCodecModel.from_pretrained(
    "nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch"
)

# Move to desired device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

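# Load audio. The model expects 22050 Hz mono; this snippet downmixes stereo
# but does not resample, so resample beforehand if needed.
audio, sr = sf.read("input.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix [T, channels] to mono [T]
if sr != 22050:
    raise ValueError(f"Expected 22050 Hz audio, got {sr} Hz")
audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).unsqueeze(0).to(device)
audio_len = torch.tensor([len(audio)], dtype=torch.int32).to(device)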

# Encode and decode
with torch.no_grad():
    tokens, tokens_len = model.encode(audio_tensor, audio_len)
    reconstructed, recon_len = model.decode(tokens, tokens_len)

# Save reconstructed audio
output = reconstructed[0, 0, :int(recon_len[0])].cpu().numpy()
sf.write("output.wav", output, 22050)

print(f"Compression ratio: {len(audio) / tokens.shape[2]:.0f}:1")
print(f"Tokens shape: {tokens.shape}")  # [B, 4, T/1764]

Device Selection

# Prefer CUDA (NVIDIA GPU), then MPS (Apple Silicon M1/M2/M3), then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

model = model.to(device)

Batch Processing

import torch
from nanocodec_torch.models.audio_codec import AudioCodecModel
import soundfile as sf

model = AudioCodecModel.from_pretrained(
    "nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch"
).to("cuda").eval()

# Load multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
audio_list = []
audio_lens = []

for file in audio_files:
    audio, sr = sf.read(file)
    audio_list.append(torch.tensor(audio, dtype=torch.float32))
    audio_lens.append(len(audio))

# Pad to same length
max_len = max(audio_lens)
audio_batch = torch.zeros(len(audio_list), 1, max_len)
for i, audio in enumerate(audio_list):
    audio_batch[i, 0, :len(audio)] = audio

audio_lens = torch.tensor(audio_lens, dtype=torch.int32).to("cuda")
audio_batch = audio_batch.to("cuda")

# Process batch
with torch.no_grad():
    tokens, tokens_len = model.encode(audio_batch, audio_lens)
    reconstructed, recon_lens = model.decode(tokens, tokens_len)

# Save outputs
for i, (audio, length) in enumerate(zip(reconstructed, recon_lens)):
    output = audio[0, :int(length)].cpu().numpy()
    sf.write(f"output_{i}.wav", output, 22050)

Examples

Comprehensive examples are available in the examples/ directory.

API Reference

AudioCodecModel

The main model class for audio encoding and decoding.

Methods

from_pretrained(repo_id: str, device: str = "cpu") -> AudioCodecModel

Load pretrained model from HuggingFace Hub.

model = AudioCodecModel.from_pretrained(
    "nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch"
)

encode(audio: torch.Tensor, audio_len: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]

Encode audio to discrete tokens.

  • Input:
    • audio: Audio tensor [B, 1, T], float32, range [-1, 1]
    • audio_len: Length tensor [B], int32
  • Output:
    • tokens: Discrete tokens [B, 4, T/1764], int32
    • tokens_len: Token lengths [B], int32

decode(tokens: torch.Tensor, tokens_len: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]

Decode tokens back to audio.

  • Input:
    • tokens: Discrete tokens [B, 4, T], int32, where T is the number of token frames
    • tokens_len: Token lengths [B], int32
  • Output:
    • audio: Reconstructed audio [B, 1, T*1764], float32, range [-1, 1]
    • audio_len: Audio lengths [B], int32

forward(audio: torch.Tensor, audio_len: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Full encode-decode roundtrip.

  • Output: (reconstructed_audio, tokens, audio_len)

For detailed API documentation, see API_REFERENCE.md.

Input/Output Specifications

Input

  • Type: Audio waveform
  • Format: .wav, .mp3, .flac (any format supported by soundfile)
  • Sample Rate: 22.05 kHz (audio will be resampled if necessary)
  • Channels: Mono (stereo will be converted to mono)
  • Range: [-1.0, 1.0] (normalized float32)

Output

  • Type: Reconstructed audio waveform
  • Format: .wav (or any format supported by soundfile)
  • Sample Rate: 22.05 kHz
  • Channels: Mono
  • Range: [-1.0, 1.0] (clamped)
  • Bitrate: 0.6 kbps (12.5 fps × 4 groups × log2(4032) ≈ 600 bps)
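
The bitrate formula above can be checked with a few lines of arithmetic. This is only a worked example of the stated formula; the constant names are illustrative.

```python
import math

FRAMES_PER_SECOND = 12.5   # token frames per second
NUM_GROUPS = 4             # quantizer groups, one token per group per frame
CODES_PER_GROUP = 4032     # codebook size of each group

# Each token carries log2(codebook size) bits.
bits_per_token = math.log2(CODES_PER_GROUP)
bitrate_bps = FRAMES_PER_SECOND * NUM_GROUPS * bits_per_token

print(f"{bits_per_token:.3f} bits per token")
print(f"{bitrate_bps:.1f} bps")
```

The result lands just under 600 bps, which is rounded to 0.6 kbps throughout this document.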

Performance Benchmarks

Inference Speed

  • For best throughput, use torch.compile(mode="reduce-overhead"). Note that torch.compile() is not supported on MPS.

  • Real-time factor = audio duration / processing time, so values above 1 mean faster than real time.
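
To measure the real-time factor on your own hardware, a small timing helper along these lines may be enough. The helper name and the stand-in workload are hypothetical; swap in your own encode or decode call.

```python
import time

def real_time_factor(process, audio_seconds, repeats=3):
    """Return audio duration divided by the best wall-clock time of process()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process()
        best = min(best, time.perf_counter() - start)
    return audio_seconds / best

# Stand-in workload; replace with a call such as
# lambda: model.encode(audio_tensor, audio_len)
rtf = real_time_factor(lambda: sum(range(100_000)), audio_seconds=5.0)
print(f"RTF: {rtf:.1f}x real time")
```

When timing on CUDA, call torch.cuda.synchronize() inside the timed callable, since GPU kernels run asynchronously and the measurement would otherwise stop before the work finishes.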

Performance Optimization

For optimal inference performance on CUDA/CPU:

# Load and compile model (PyTorch 2.0+)
model = AudioCodecModel.from_pretrained(
    "nineninesix/nemo-nano-codec-22khz-0.6kbps-12.5fps-pytorch",
    device="cuda"
)
model.eval()

# Compile for 1.2-2x speedup
model.compile(mode="reduce-overhead")

# First inference includes compilation overhead (~5-10 seconds)
with torch.no_grad():
    tokens, tokens_len = model.encode(audio_tensor, audio_len)

# Subsequent inferences are faster
with torch.no_grad():
    reconstructed, _ = model.decode(tokens, tokens_len)

Compilation modes:

  • default: Balanced optimization
  • reduce-overhead: Best for inference (recommended)
  • max-autotune: Aggressive optimization (longer compile time)

Note: torch.compile() requires PyTorch 2.0+ and is not supported on MPS (Apple Silicon).

Memory Usage

Batch Size   Audio Length   Model Size   Peak Memory (CUDA)
1            5 s            420 MB       500 MB
4            5 s            420 MB       650 MB
16           5 s            420 MB       1.2 GB
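
The 420 MB model size is consistent with roughly 105M parameters if the weights are stored as float32, which is an assumption here and not something the table states. A short sketch of that arithmetic, plus the memory used beyond the weights at each batch size (treating 1.2 GB as 1200 MB):

```python
NUM_PARAMS = 105_000_000   # ~105M parameters, from the feature list
BYTES_PER_PARAM = 4        # assumption: weights stored as float32

model_size_mb = NUM_PARAMS * BYTES_PER_PARAM / 1e6
print(f"Model size: {model_size_mb:.0f} MB")  # Model size: 420 MB

# Peak memory from the table (MB), keyed by batch size.
PEAK_MB = {1: 500, 4: 650, 16: 1200}
for batch_size, peak in PEAK_MB.items():
    overhead = peak - model_size_mb
    print(f"batch {batch_size}: {overhead:.0f} MB beyond the weights")
```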

Testing

The codebase includes comprehensive test coverage:

# Run all tests
pytest

# Run with coverage
pytest --cov=src/nanocodec_torch --cov-report=html

# Run specific test categories
pytest -m unit          # Unit tests only
pytest -m integration   # Integration tests only
pytest -m quality       # Audio quality tests

Test Results: 164 of 166 tests passing (98.8%); the remaining 2 are skipped (device-specific).

Known Limitations

  1. Audio Quality: As an ultra-low bitrate codec (0.6 kbps), expect significant quality degradation compared to higher bitrate codecs
  2. Sample Rate: Fixed at 22.05 kHz, not suitable for high-fidelity audio
  3. Mono Only: Stereo audio will be converted to mono
  4. Compression Artifacts: Extreme compression ratio (1764:1) introduces noticeable artifacts
  5. Use Case: Best suited for speech/voice applications, not music production

License

This code is licensed under the Apache License 2.0. See LICENSE for details.

The original NVIDIA NeMo NanoCodec model weights and architecture are developed by NVIDIA Corporation and are licensed under the NVIDIA Open Model License. See NOTICE for attribution.

When using this project, you must comply with both licenses.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

