MLX ContentVec

MLX implementation of ContentVec / HuBERT for Apple Silicon.

This is the feature extraction backbone for RVC-MLX (coming soon), a native Apple Silicon implementation of Retrieval-based Voice Conversion.

What is ContentVec?

ContentVec extracts speaker-agnostic semantic features from audio. In the RVC pipeline, it captures the phonetic content of speech while discarding speaker identity, enabling voice conversion:

Input Audio (16kHz) → ContentVec → Semantic Features (768-dim) → RVC Decoder → Converted Voice

Installation

pip install mlx-contentvec

For development:

git clone https://github.com/lexandstuff/mlx-contentvec.git
cd mlx-contentvec
pip install -e .

Quick Start

import mlx.core as mx
import librosa
from mlx_contentvec import ContentvecModel

# Load model (auto-downloads weights from HuggingFace)
model = ContentvecModel.from_pretrained()

# Load audio at 16kHz
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)

print(f"Audio: {len(audio)/16000:.2f}s -> Features: {features.shape}")
# Example: Audio: 3.00s -> Features: (1, 149, 768)

Manual Weight Loading

If you prefer to manage weights yourself:

from huggingface_hub import hf_hub_download
from mlx_contentvec import ContentvecModel

# Download weights
weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors"
)

# Load model manually
model = ContentvecModel(encoder_layers_1=0)
model.load_weights(weights_path)
model.eval()

Converting weights from PyTorch (advanced)

If you need to convert from PyTorch yourself:

# Download original PyTorch weights
wget -O weights/hubert_base.pt \
  "https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt"

# Convert (requires Python 3.9 + fairseq)
uv run --python 3.9 python scripts/convert_weights.py \
  --pytorch_ckpt weights/hubert_base.pt \
  --mlx_ckpt weights/contentvec_base.safetensors

See IMPLEMENTATION_NOTES.md for details.

API Reference

ContentvecModel

ContentvecModel(
    encoder_layers: int = 12,      # Number of transformer layers
    encoder_layers_1: int = 0,     # Speaker-conditioned layers (set to 0 for RVC)
    encoder_embed_dim: int = 768,  # Feature dimension
    ...
)

Methods:

Method                          Description
load_weights(path)              Load weights from SafeTensors file
eval()                          Set to inference mode (disables dropout)
__call__(source, spk_emb=None)  Extract features from audio

Input:

  • source: Audio waveform tensor, shape (batch, samples), 16kHz sample rate

Output:

  • Returns {"x": features, "padding_mask": None}
  • features shape: (batch, num_frames, 768)
  • Frame rate: ~50 frames/second (hop size = 320 samples at 16kHz)
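
The frame rate follows from the total stride of the CNN feature extractor. The sketch below estimates the number of output frames for a given number of input samples. It assumes the standard HuBERT base layout of seven unpadded conv layers with kernel sizes (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2); these values are not stated in this README, and the helper name `num_frames` is hypothetical rather than part of the package.

```python
# Assumed (kernel, stride) pairs for the 7-layer CNN feature extractor.
# These follow the standard HuBERT base configuration, not this README.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Estimate the number of output frames for unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to produce a frame
        length = (length - kernel) // stride + 1
    return length


# The hop size is the product of all strides.
hop = 1
for _, stride in CONV_LAYERS:
    hop *= stride

print(hop)                # 320
print(num_frames(16000))  # 49
print(num_frames(48000))  # 149
```

Because the convolutions are unpadded, the frame count comes out slightly below samples / 320, which is why the rate is approximately 50 frames/second rather than exactly 50.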

RVC Integration

In the RVC voice conversion pipeline, ContentVec provides semantic features that preserve speech content while enabling voice transformation:

# 1. Extract content features with ContentVec
features = contentvec_model(audio)["x"]  # (1, T, 768)

# 2. Optional: Blend with nearest neighbours from a voice index for timbre transfer
# (pseudocode; `retrieved` stands for the features looked up in the index)
# features = retrieved * index_rate + features * (1 - index_rate)

# 3. Extract pitch (F0) with separate model (RMVPE, etc.)
f0 = pitch_extractor(audio)  # (1, T)

# 4. Generate converted audio with RVC synthesizer
output = rvc_synthesizer(features, f0, speaker_id)

The key insight is that ContentVec captures what is being said (phonetic content) while the RVC decoder adds who is saying it (speaker identity via F0 and speaker embedding).
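
The optional blending step can be illustrated with a small NumPy sketch. This is not the RVC implementation: it swaps the FAISS index for a brute-force top-1 nearest-neighbour search, and the names `nearest_neighbors`, `blend_with_index`, and `bank` are hypothetical. Random arrays stand in for real ContentVec features.

```python
import numpy as np


def nearest_neighbors(features: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """For each frame, return the closest row of `bank` by L2 distance."""
    # features: (T, D), bank: (N, D) -> squared distances: (T, N)
    d2 = ((features[:, None, :] - bank[None, :, :]) ** 2).sum(axis=-1)
    return bank[d2.argmin(axis=1)]


def blend_with_index(features: np.ndarray, bank: np.ndarray, index_rate: float) -> np.ndarray:
    """Linearly mix source features with features retrieved from the target bank."""
    retrieved = nearest_neighbors(features, bank)
    return index_rate * retrieved + (1.0 - index_rate) * features


rng = np.random.default_rng(0)
features = rng.standard_normal((5, 768)).astype(np.float32)  # source frames
bank = rng.standard_normal((20, 768)).astype(np.float32)     # target-speaker frames

blended = blend_with_index(features, bank, index_rate=0.5)
print(blended.shape)  # (5, 768)
```

With `index_rate=0.0` the source features pass through unchanged. With `index_rate=1.0` every frame is replaced by its nearest neighbour from the target bank.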

Validation

This implementation matches the PyTorch reference to within float32 rounding error:

Metric                   Value
Max absolute difference  8e-6
Cosine similarity        1.000000

See IMPLEMENTATION_NOTES.md for detailed validation methodology.
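
For readers who want to run a similar comparison, the two metrics can be computed as sketched below. This is a minimal illustration on flattened lists of floats, not the project's validation script. The function names and the sample values are made up for the example.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative values only: a reference vector and a slightly perturbed copy.
reference = [0.25, -1.5, 3.0, 0.125]
candidate = [0.25, -1.5, 3.0, 0.125 + 8e-6]

print(max_abs_diff(reference, candidate))
print(f"{cosine_similarity(reference, candidate):.6f}")  # 1.000000
```

To compare real model outputs, flatten both feature arrays to one dimension before passing them in.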

Development

Project Structure

mlx-contentvec/
├── mlx_contentvec/
│   ├── __init__.py
│   ├── contentvec.py              # Main model class
│   ├── conv_feature_extraction.py # 7-layer CNN feature extractor
│   ├── transformer_encoder.py     # 12-layer transformer with pos conv
│   └── modules/
│       ├── multihead_attention.py # Multi-head self-attention
│       ├── weight_norm.py         # Weight normalization for pos conv
│       ├── group_norm.py          # Group norm (incl. masked variant)
│       └── cond_layer_norm.py     # Conditional layer norm (speaker)
├── scripts/
│   └── convert_weights.py         # PyTorch → SafeTensors conversion
├── tests/
│   ├── test_conv_feature_extraction.py
│   ├── test_end_to_end.py
│   └── test_weight_norm.py
├── IMPLEMENTATION_NOTES.md        # Technical details & validation
└── README.md

Setting Up for Development

Clone reference implementations for comparison:

mkdir -p vendor && cd vendor

# ContentVec reference
git clone https://github.com/auspicious3000/contentvec.git

# fairseq (required for loading PyTorch checkpoint)
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq && git checkout 0b21875

Running Tests

uv run pytest

Test suite (48 tests):

Test File                        Tests  Description
test_conv_feature_extraction.py  24     CNN feature extractor unit tests
test_end_to_end.py               10     Integration tests (HuggingFace → inference)
test_weight_norm.py              16     Weight normalization unit tests

The end-to-end tests download weights from HuggingFace and verify:

  • Model loading and initialization
  • Inference with various input shapes
  • Deterministic output in eval mode
  • Feature statistics (no NaN/Inf)
  • Real audio file processing

Weight Conversion Details

The conversion from PyTorch to MLX requires:

  1. Tensor transposition: Conv1d weights change from (out, in, kernel) to (out, kernel, in)
  2. Weight normalization: The positional conv uses weight norm with g and v parameters
  3. Float32 precision: Weights must be saved as float32 (not float16) for numerical accuracy

See scripts/convert_weights.py and IMPLEMENTATION_NOTES.md for details.
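
The three steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the contents of scripts/convert_weights.py. The function names are hypothetical, and the choice to take the weight-norm over every axis except the kernel axis (`keep_axis=2`) is an assumption about how the reference applies weight norm to the positional conv. Grouped-convolution handling is not covered.

```python
import numpy as np


def fuse_weight_norm(g: np.ndarray, v: np.ndarray, keep_axis: int = 2) -> np.ndarray:
    """Rebuild w = g * v / ||v||, taking the norm over every axis except `keep_axis`."""
    axes = tuple(i for i in range(v.ndim) if i != keep_axis)
    norm = np.sqrt((v.astype(np.float32) ** 2).sum(axis=axes, keepdims=True))
    return (g * v / norm).astype(np.float32)


def conv1d_weight_to_mlx(weight: np.ndarray) -> np.ndarray:
    """Transpose (out, in, kernel) -> (out, kernel, in) and store as float32."""
    return np.ascontiguousarray(weight.transpose(0, 2, 1), dtype=np.float32)


# Toy tensors standing in for checkpoint parameters.
rng = np.random.default_rng(0)
v = rng.standard_normal((8, 4, 3)).astype(np.float16)  # (out, in, kernel)
g = np.ones((1, 1, 3), dtype=np.float32)               # one gain per kernel tap

w = conv1d_weight_to_mlx(fuse_weight_norm(g, v))
print(w.shape, w.dtype)  # (8, 3, 4) float32
```

The norm is accumulated in float32 even when the source tensor is float16, which is in keeping with the precision requirement in step 3.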

Publishing to PyPI

  1. Update the version in pyproject.toml
  2. Update CHANGELOG.md with the new version
  3. Build and upload:

# Build distribution packages
uv run python -m build

# Upload to PyPI
uv run twine upload dist/*

License

MIT

Acknowledgments

  • ContentVec - Original implementation by Kaizhi Qian
  • fairseq - HuBERT/wav2vec2 implementation
  • RVC - Voice conversion pipeline
  • MLX - Apple's machine learning framework
