# MLX ContentVec

MLX implementation of ContentVec / HuBERT for Apple Silicon.
This is the feature extraction backbone for RVC-MLX (coming soon), a native Apple Silicon implementation of Retrieval-based Voice Conversion.
## What is ContentVec?
ContentVec extracts speaker-agnostic semantic features from audio. In the RVC pipeline, it captures the phonetic content of speech while discarding speaker identity, enabling voice conversion:
```
Input Audio (16kHz) → ContentVec → Semantic Features (768-dim) → RVC Decoder → Converted Voice
```
## Installation

```bash
pip install mlx-contentvec
```

For development:

```bash
git clone https://github.com/lexandstuff/mlx-contentvec.git
cd mlx-contentvec
pip install -e .
```
## Quick Start

### 1. Download Weights

Download pre-converted MLX weights from HuggingFace:
```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors",
)
```
Or manually:

```bash
mkdir -p weights
wget -O weights/contentvec_base.safetensors \
  "https://huggingface.co/lexandstuff/mlx-contentvec/resolve/main/contentvec_base.safetensors"
```
#### Converting weights manually (advanced)

If you need to convert from PyTorch yourself:

```bash
# Download original PyTorch weights
wget -O weights/hubert_base.pt \
  "https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt"

# Convert (requires Python 3.9 + fairseq)
uv run --python 3.9 python scripts/convert_weights.py \
  --pytorch_ckpt weights/hubert_base.pt \
  --mlx_ckpt weights/contentvec_base.safetensors
```

See `IMPLEMENTATION_NOTES.md` for details.
### 2. Extract Features
```python
import librosa
import mlx.core as mx

from mlx_contentvec import ContentvecModel

# Load model (12 transformer layers, no speaker conditioning)
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("weights/contentvec_base.safetensors")
model.eval()

# Load audio at 16kHz
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)

print(f"Audio: {len(audio)/16000:.2f}s -> Features: {features.shape}")
# Example: Audio: 3.00s -> Features: (1, 93, 768)
```
## API Reference

### ContentvecModel

```python
ContentvecModel(
    encoder_layers: int = 12,      # Number of transformer layers
    encoder_layers_1: int = 0,     # Speaker-conditioned layers (set to 0 for RVC)
    encoder_embed_dim: int = 768,  # Feature dimension
    ...
)
```
**Methods:**

| Method | Description |
|---|---|
| `load_weights(path)` | Load weights from a SafeTensors file |
| `eval()` | Set to inference mode (disables dropout) |
| `__call__(source, spk_emb=None)` | Extract features from audio |
**Input:**

- `source`: audio waveform tensor of shape `(batch, samples)`, 16 kHz sample rate

**Output:**

- Returns `{"x": features, "padding_mask": None}`
- `features` shape: `(batch, num_frames, 768)`
- Frame rate: ~50 frames/second (hop size = 320 samples at 16 kHz)
## RVC Integration
In the RVC voice conversion pipeline, ContentVec provides semantic features that preserve speech content while enabling voice transformation:
```python
# 1. Extract content features with ContentVec
features = contentvec_model(audio)["x"]  # (1, T, 768)

# 2. Optional: Blend with voice index for timbre transfer
# features = faiss_index.search(features) * index_rate + features * (1 - index_rate)

# 3. Extract pitch (F0) with a separate model (RMVPE, etc.)
f0 = pitch_extractor(audio)  # (1, T)

# 4. Generate converted audio with the RVC synthesizer
output = rvc_synthesizer(features, f0, speaker_id)
```
The key insight is that ContentVec captures what is being said (phonetic content) while the RVC decoder adds who is saying it (speaker identity via F0 and speaker embedding).
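The optional index blend in step 2 can be sketched with plain NumPy. The arrays, shapes, and `index_rate` value below are illustrative stand-ins for the real ContentVec output and FAISS lookup, not the actual RVC implementation:

```python
import numpy as np

# Hypothetical stand-ins for ContentVec features and a FAISS nearest-neighbour result
features = np.random.rand(1, 93, 768).astype(np.float32)
retrieved = np.random.rand(1, 93, 768).astype(np.float32)
index_rate = 0.5  # 0.0 = keep original features, 1.0 = fully use the index

# Linear blend between retrieved and original features
blended = retrieved * index_rate + features * (1 - index_rate)
print(blended.shape)
```

Higher `index_rate` pulls the output toward the timbre stored in the voice index; lower values preserve more of the source content.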
## Validation
This implementation produces numerically equivalent outputs to the PyTorch reference (within float32 tolerance):
| Metric | Value |
|---|---|
| Max absolute difference | 8e-6 |
| Cosine similarity | 1.000000 |
See IMPLEMENTATION_NOTES.md for detailed validation methodology.
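The two metrics in the table above can be computed like this; the arrays here are synthetic stand-ins for the MLX and PyTorch feature outputs:

```python
import numpy as np

# Synthetic stand-ins for features from the MLX and PyTorch models
mlx_feats = np.random.rand(1, 93, 768)
pt_feats = mlx_feats + 1e-6  # simulate a tiny numerical discrepancy

max_abs_diff = np.abs(mlx_feats - pt_feats).max()

a, b = mlx_feats.ravel(), pt_feats.ravel()
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"max abs diff: {max_abs_diff:.1e}, cosine similarity: {cos_sim:.6f}")
```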
## Development

### Project Structure
```
mlx-contentvec/
├── mlx_contentvec/
│   ├── __init__.py
│   ├── contentvec.py                # Main model class
│   ├── conv_feature_extraction.py   # 7-layer CNN feature extractor
│   ├── transformer_encoder.py       # 12-layer transformer with pos conv
│   └── modules/
│       ├── multihead_attention.py   # Multi-head self-attention
│       ├── weight_norm.py           # Weight normalization for pos conv
│       ├── group_norm.py            # Group norm (incl. masked variant)
│       └── cond_layer_norm.py       # Conditional layer norm (speaker)
├── scripts/
│   └── convert_weights.py           # PyTorch → SafeTensors conversion
├── tests/
│   ├── test_conv_feature_extraction.py
│   ├── test_end_to_end.py
│   └── test_weight_norm.py
├── IMPLEMENTATION_NOTES.md          # Technical details & validation
└── README.md
```
### Setting Up for Development

Clone reference implementations for comparison:

```bash
mkdir -p vendor && cd vendor

# ContentVec reference
git clone https://github.com/auspicious3000/contentvec.git

# fairseq (required for loading the PyTorch checkpoint)
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq && git checkout 0b21875
```
### Running Tests

```bash
uv run pytest
```
Test suite (48 tests):

| Test File | Tests | Description |
|---|---|---|
| `test_conv_feature_extraction.py` | 24 | CNN feature extractor unit tests |
| `test_end_to_end.py` | 10 | Integration tests (HuggingFace → inference) |
| `test_weight_norm.py` | 16 | Weight normalization unit tests |
The end-to-end tests download weights from HuggingFace and verify:
- Model loading and initialization
- Inference with various input shapes
- Deterministic output in eval mode
- Feature statistics (no NaN/Inf)
- Real audio file processing
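The NaN/Inf feature check can be sketched as a small helper; the function name is hypothetical (the real checks live in `tests/test_end_to_end.py`):

```python
import numpy as np

def assert_finite(features: np.ndarray) -> None:
    """Fail if extracted features contain NaN or Inf."""
    assert np.isfinite(features).all(), "features contain NaN or Inf"

# A well-formed feature block passes the check
assert_finite(np.zeros((1, 93, 768), dtype=np.float32))
```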
## Weight Conversion Details

The conversion from PyTorch to MLX requires:

- **Tensor transposition**: Conv1d weights change from `(out, in, kernel)` to `(out, kernel, in)`
- **Weight normalization**: the positional conv uses weight norm with `g` and `v` parameters
- **Float32 precision**: weights must be saved as float32 (not float16) for numerical accuracy

See `scripts/convert_weights.py` and `IMPLEMENTATION_NOTES.md` for details.
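The first two points can be illustrated with NumPy. The shapes below are simplified stand-ins, and the weight-norm axes are reduced to a whole-tensor norm for clarity (the real conversion follows each layer's weight-norm `dim`):

```python
import numpy as np

# 1. Conv1d layout: PyTorch stores (out, in, kernel); MLX expects (out, kernel, in)
pt_weight = np.zeros((512, 1, 10), dtype=np.float32)  # e.g. a first-layer CNN weight
mlx_weight = pt_weight.transpose(0, 2, 1)
print(mlx_weight.shape)  # (512, 10, 1)

# 2. Weight norm: the effective weight is reconstructed as w = g * v / ||v||
v = np.random.rand(4, 3).astype(np.float32)
g = np.float32(2.0)
w = g * v / np.linalg.norm(v)
print(np.linalg.norm(w))  # equals g
```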
## Publishing to PyPI

1. Update the version in `pyproject.toml`
2. Update `CHANGELOG.md` with the new version
3. Build and upload:

```bash
# Build distribution packages
uv run python -m build

# Upload to PyPI
uv run twine upload dist/*
```
## License

MIT

## Acknowledgments

- ContentVec - original implementation by Kaizhi Qian
- fairseq - HuBERT/wav2vec2 implementation
- RVC - voice conversion pipeline
- MLX - Apple's machine learning framework