# MLX ContentVec

MLX implementation of ContentVec / HuBERT for Apple Silicon.
This is the feature extraction backbone for RVC-MLX (coming soon), a native Apple Silicon implementation of Retrieval-based Voice Conversion.
## What is ContentVec?
ContentVec extracts speaker-agnostic semantic features from audio. In the RVC pipeline, it captures the phonetic content of speech while discarding speaker identity, enabling voice conversion:
```
Input Audio (16kHz) → ContentVec → Semantic Features (768-dim) → RVC Decoder → Converted Voice
```
## Installation

```bash
pip install mlx-contentvec
```

For development:

```bash
git clone https://github.com/lexandstuff/mlx-contentvec.git
cd mlx-contentvec
pip install -e .
```
## Quick Start

### 1. Download Weights

Download pre-converted MLX weights from HuggingFace:
```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors",
)
```
Or manually:

```bash
mkdir -p weights
wget -O weights/contentvec_base.safetensors \
  "https://huggingface.co/lexandstuff/mlx-contentvec/resolve/main/contentvec_base.safetensors"
```
#### Converting weights manually (advanced)

If you need to convert from PyTorch yourself:

```bash
# Download original PyTorch weights
wget -O weights/hubert_base.pt \
  "https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt"

# Convert (requires Python 3.9 + fairseq)
uv run --python 3.9 python scripts/convert_weights.py \
  --pytorch_ckpt weights/hubert_base.pt \
  --mlx_ckpt weights/contentvec_base.safetensors
```

See `IMPLEMENTATION_NOTES.md` for details.
### 2. Extract Features
```python
import librosa
import mlx.core as mx

from mlx_contentvec import ContentvecModel

# Load model (12 transformer layers, no speaker conditioning)
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("weights/contentvec_base.safetensors")
model.eval()

# Load audio at 16kHz
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)

print(f"Audio: {len(audio)/16000:.2f}s -> Features: {features.shape}")
# Example: Audio: 3.00s -> Features: (1, 93, 768)
```
## API Reference

### ContentvecModel

```python
ContentvecModel(
    encoder_layers: int = 12,      # Number of transformer layers
    encoder_layers_1: int = 0,     # Speaker-conditioned layers (set to 0 for RVC)
    encoder_embed_dim: int = 768,  # Feature dimension
    ...
)
```
**Methods:**

| Method | Description |
|---|---|
| `load_weights(path)` | Load weights from a SafeTensors file |
| `eval()` | Set to inference mode (disables dropout) |
| `__call__(source, spk_emb=None)` | Extract features from audio |
**Input:**

- `source`: audio waveform tensor of shape `(batch, samples)`, 16 kHz sample rate

**Output:**

- Returns `{"x": features, "padding_mask": None}`
- `features` shape: `(batch, num_frames, 768)`
- Frame rate: ~50 frames/second (hop size = 320 samples at 16 kHz)
## RVC Integration
In the RVC voice conversion pipeline, ContentVec provides semantic features that preserve speech content while enabling voice transformation:
```python
# 1. Extract content features with ContentVec
features = contentvec_model(audio)["x"]  # (1, T, 768)

# 2. Optional: Blend with voice index for timbre transfer
# features = faiss_index.search(features) * index_rate + features * (1 - index_rate)

# 3. Extract pitch (F0) with a separate model (RMVPE, etc.)
f0 = pitch_extractor(audio)  # (1, T)

# 4. Generate converted audio with the RVC synthesizer
output = rvc_synthesizer(features, f0, speaker_id)
```
The key insight is that ContentVec captures what is being said (phonetic content) while the RVC decoder adds who is saying it (speaker identity via F0 and speaker embedding).
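The optional index blend in step 2 can be sketched with plain NumPy. The arrays, shapes, and `index_rate` value below are illustrative stand-ins for the real ContentVec output and FAISS lookup, not the actual RVC implementation:

```python
import numpy as np

# Hypothetical stand-ins for ContentVec features and a FAISS nearest-neighbour result
features = np.random.rand(1, 93, 768).astype(np.float32)
retrieved = np.random.rand(1, 93, 768).astype(np.float32)
index_rate = 0.5  # 0.0 = keep original features, 1.0 = fully use the index

# Linear blend between retrieved and original features
blended = retrieved * index_rate + features * (1 - index_rate)
print(blended.shape)
```

Higher `index_rate` pulls the output toward the timbre stored in the voice index; lower values preserve more of the source content.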
## Validation
This implementation produces numerically equivalent outputs to the PyTorch reference (within float32 tolerance):
| Metric | Value |
|---|---|
| Max absolute difference | 8e-6 |
| Cosine similarity | 1.000000 |
See IMPLEMENTATION_NOTES.md for detailed validation methodology.
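The two metrics in the table above can be computed like this; the arrays here are synthetic stand-ins for the MLX and PyTorch feature outputs:

```python
import numpy as np

# Synthetic stand-ins for features from the MLX and PyTorch models
mlx_feats = np.random.rand(1, 93, 768)
pt_feats = mlx_feats + 1e-6  # simulate a tiny numerical discrepancy

max_abs_diff = np.abs(mlx_feats - pt_feats).max()

a, b = mlx_feats.ravel(), pt_feats.ravel()
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"max abs diff: {max_abs_diff:.1e}, cosine similarity: {cos_sim:.6f}")
```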
## Development

### Project Structure
```
mlx-contentvec/
├── mlx_contentvec/
│   ├── __init__.py
│   ├── contentvec.py                # Main model class
│   ├── conv_feature_extraction.py   # 7-layer CNN feature extractor
│   ├── transformer_encoder.py       # 12-layer transformer with pos conv
│   └── modules/
│       ├── multihead_attention.py   # Multi-head self-attention
│       ├── weight_norm.py           # Weight normalization for pos conv
│       ├── group_norm.py            # Group norm (incl. masked variant)
│       └── cond_layer_norm.py       # Conditional layer norm (speaker)
├── scripts/
│   └── convert_weights.py           # PyTorch → SafeTensors conversion
├── tests/
│   ├── test_conv_feature_extraction.py
│   ├── test_end_to_end.py
│   └── test_weight_norm.py
├── IMPLEMENTATION_NOTES.md          # Technical details & validation
└── README.md
```
### Setting Up for Development

Clone reference implementations for comparison:

```bash
mkdir -p vendor && cd vendor

# ContentVec reference
git clone https://github.com/auspicious3000/contentvec.git

# fairseq (required for loading the PyTorch checkpoint)
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq && git checkout 0b21875
```
### Running Tests

```bash
uv run pytest
```
Test suite (48 tests):

| Test File | Tests | Description |
|---|---|---|
| `test_conv_feature_extraction.py` | 24 | CNN feature extractor unit tests |
| `test_end_to_end.py` | 10 | Integration tests (HuggingFace → inference) |
| `test_weight_norm.py` | 16 | Weight normalization unit tests |
The end-to-end tests download weights from HuggingFace and verify:
- Model loading and initialization
- Inference with various input shapes
- Deterministic output in eval mode
- Feature statistics (no NaN/Inf)
- Real audio file processing
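The NaN/Inf feature check can be sketched as a small helper; the function name is hypothetical (the real checks live in `tests/test_end_to_end.py`):

```python
import numpy as np

def assert_finite(features: np.ndarray) -> None:
    """Fail if extracted features contain NaN or Inf."""
    assert np.isfinite(features).all(), "features contain NaN or Inf"

# A well-formed feature block passes the check
assert_finite(np.zeros((1, 93, 768), dtype=np.float32))
```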
## Weight Conversion Details

The conversion from PyTorch to MLX requires:

- **Tensor transposition**: Conv1d weights change from `(out, in, kernel)` to `(out, kernel, in)`
- **Weight normalization**: the positional conv uses weight norm with `g` and `v` parameters
- **Float32 precision**: weights must be saved as float32 (not float16) for numerical accuracy

See `scripts/convert_weights.py` and `IMPLEMENTATION_NOTES.md` for details.
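The first two points can be illustrated with NumPy. The shapes below are simplified stand-ins, and the weight-norm axes are reduced to a whole-tensor norm for clarity (the real conversion follows each layer's weight-norm `dim`):

```python
import numpy as np

# 1. Conv1d layout: PyTorch stores (out, in, kernel); MLX expects (out, kernel, in)
pt_weight = np.zeros((512, 1, 10), dtype=np.float32)  # e.g. a first-layer CNN weight
mlx_weight = pt_weight.transpose(0, 2, 1)
print(mlx_weight.shape)  # (512, 10, 1)

# 2. Weight norm: the effective weight is reconstructed as w = g * v / ||v||
v = np.random.rand(4, 3).astype(np.float32)
g = np.float32(2.0)
w = g * v / np.linalg.norm(v)
print(np.linalg.norm(w))  # equals g
```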
## Publishing to PyPI

1. Update the version in `pyproject.toml`
2. Update `CHANGELOG.md` with the new version
3. Build and upload:

```bash
# Build distribution packages
uv run python -m build

# Upload to PyPI
uv run twine upload dist/*
```
## License

MIT

## Acknowledgments

- ContentVec - original implementation by Kaizhi Qian
- fairseq - HuBERT/wav2vec2 implementation
- RVC - voice conversion pipeline
- MLX - Apple's machine learning framework