# VG-HuBERT: Speech Segmentation with a Simplified Interface

Unsupervised syllable and word segmentation using visually grounded HuBERT (VG-HuBERT). This fork provides a simplified interface with HuggingFace Hub integration, an updated PyTorch code path that removes the need to patch `multi_head_attention_forward`, an optimized MinCut algorithm (~40x speedup), and PyPI package distribution.
## Quick Start

```python
from vg_hubert import Segmenter

# Syllable segmentation (recommended: includes MinCutMerge post-processing)
segmenter = Segmenter(mode="syllable", merge_threshold=0.3)
outputs = segmenter("audio.wav")

# Word segmentation
word_segmenter = Segmenter(mode="word")
word_outputs = word_segmenter("audio.wav")
```
## Installation

```bash
pip install vg-hubert
```

Requirements: Python ≥3.8, PyTorch ≥2.0, transformers, scipy, soundfile

Optional (for development):

```bash
pip install git+https://github.com/human-ai-lab/VG-HuBERT.git
```
## Features

✨ New in this fork:

- 🚀 40x faster MinCut: Optimized algorithm from SyllableLM (Baade et al., 2024)
- 🔧 MinCutMerge post-processing: Prevents over-segmentation (matches original paper)
- 🤗 HuggingFace integration: Auto-download models from the Hub
- 🍎 Apple Silicon support: Native MPS acceleration
- 📦 PyPI distribution: Simple `pip install`
- 🧹 No fairseq for inference: Removed complex dependency
## Usage

### Basic Example

```python
from vg_hubert import Segmenter
import soundfile as sf

# Load and segment
segmenter = Segmenter(
    model_ckpt="hjvm/VG-HuBERT",  # HuggingFace Hub ID or local path
    mode="syllable",
    device="cuda",                # or "mps" or "cpu" (auto-detects the best available)
    merge_threshold=0.3           # enable MinCutMerge (recommended)
)
outputs = segmenter("audio.wav")

# Access results
for start, end in outputs['segments']:
    print(f"Segment: {start:.2f}s - {end:.2f}s")

# Access features
segment_features = outputs['segment_features']  # [num_segments, 768]
frame_features = outputs['hidden_states']       # [num_frames, 768]
```
### MinCut Configuration

The package supports multiple MinCut configurations for different use cases:

```python
# Configuration 1: recommended (matches original paper)
# - fast algorithm + MinCutMerge post-processing
# - prevents over-segmentation
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=0.3,   # original paper value
    min_segment_frames=2   # filter very short segments
)

# Configuration 2: plain MinCut (no merging)
# - useful for analysis or more granular segmentation
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=None   # disable MinCutMerge
)

# Configuration 3: custom merge threshold
# - higher = more merging = fewer segments
# - lower = less merging = more segments
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=0.5    # more aggressive merging
)
```

See `examples/mincut_comparison.py` for a detailed comparison.
### Low-Level API

For advanced users who need full control:

```python
from vg_hubert.mincut import segment_with_mincut
import numpy as np

# Extract features (see examples/ for full code)
features = ...  # shape: (num_frames, 768)

# Apply MinCut with full control
boundaries, ssm = segment_with_mincut(
    features=features,
    K=10,                 # number of boundaries
    merge_threshold=0.3,  # set to None for plain MinCut
    min_segment_frames=2,
    min_hop=3,            # minimum segment length (frames)
    max_hop=50            # maximum segment length (frames)
)
```
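The low-level API returns boundaries as frame indices rather than seconds. Assuming the standard HuBERT frame stride of 20 ms (320 samples at 16 kHz) — an assumption about this package's convention, not something its docs state — cut points can be converted to time spans with a small helper like this:

```python
def cuts_to_spans(cuts, stride_s=0.02):
    """Turn a sorted list of frame cut points into (start, end) spans in seconds.

    `stride_s` is the assumed HuBERT frame stride (320 samples at 16 kHz = 20 ms).
    """
    return [(cuts[i] * stride_s, cuts[i + 1] * stride_s) for i in range(len(cuts) - 1)]

# e.g. cut points at frames 0, 10, and 25 yield two spans
spans = cuts_to_spans([0, 10, 25])
```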
### Parameters

- `mode`: `"syllable"` (MinCut + feature similarity) or `"word"` (CLS attention)
- `layer`: HuBERT layer to use (default: 8 for syllables, 9 for words)
- `device`: `"cuda"`, `"mps"`, or `"cpu"` (defaults to CUDA if available, falls back to MPS on Apple Silicon, then CPU)
- `sec_per_syllable`: Target syllable duration for MinCut (default: 0.2)
- `merge_threshold`: Cosine similarity threshold for merging adjacent segments (default: 0.3; set to `None` to disable)
- `min_segment_frames`: Filter segments with ≤ this many frames (default: 2)
- `attn_threshold`: Attention threshold for word boundaries (default: 0.25)

See `examples/` for more usage patterns.
## Model Details

### Checkpoints

Two pre-trained models optimized for different tasks:

| Checkpoint | Task | Layer | Algorithm | Size |
|---|---|---|---|---|
| `vg-hubert-syllable.pth` | Syllable | 8 | MinCut + MinCutMerge | 474 MB |
| `vg-hubert-word.pth` | Word | 9 | CLS Attention | 361 MB |
### Algorithm Details

MinCut Segmentation (Syllables):

1. Extract HuBERT features from layer 8
2. Compute the self-similarity matrix (SSM)
3. Apply the efficient MinCut algorithm (Baade et al., 2024)
   - ~40x faster than the original O(N²K) implementation
   - Uses cumulative sums for O(1) range queries
4. Optionally apply MinCutMerge post-processing (Peng et al., 2023)
   - Iteratively merges adjacent segments with cosine similarity ≥ threshold
   - Prevents over-segmentation
   - Recommended for production use
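The MinCutMerge step can be sketched in isolation. The pure-Python version below is a minimal illustration of the idea — greedily merge the most similar adjacent pair until every adjacent similarity falls below the threshold — and uses mean-pooled features as a stand-in for whatever pooling the package actually applies:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mincut_merge(feats, spans, threshold=0.3):
    """Iteratively merge the most similar pair of adjacent segments.

    feats: one feature vector per segment; spans: (start, end) frame pairs.
    Stops when no adjacent pair has cosine similarity >= threshold.
    """
    feats, spans = [list(f) for f in feats], list(spans)
    while len(feats) > 1:
        sims = [cosine(feats[i], feats[i + 1]) for i in range(len(feats) - 1)]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break
        spans[best] = (spans[best][0], spans[best + 1][1])  # absorb the neighbor's span
        feats[best] = [(a + b) / 2 for a, b in zip(feats[best], feats[best + 1])]
        del spans[best + 1], feats[best + 1]
    return spans
```

With a high threshold only near-duplicate neighbors merge; lowering it merges more aggressively, which is why `merge_threshold` trades segment granularity against over-segmentation.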
Performance Comparison:

| Configuration | F1 (LibriSpeech) | Speed (ms/utt) | Speedup |
|---|---|---|---|
| Original MinCut | 0.501 | 7524 | 1.0x |
| New MinCut | 0.501 | 169 | 44.5x |
| New + MinCutMerge-0.3 ⭐ | TBD | 171 | 44.0x |

Note: LibriSpeech results are shown here; the original paper reports F1 = 0.603 on SpokenCOCO.
### Performance (SpokenCOCO, Original Paper)

Syllable Segmentation:

- Boundary F1: 0.603
- Boundary Precision: 0.574
- Boundary Recall: 0.636

Word Discovery:

- Token F1: 0.195
- Type F1: 0.174
- NED: 0.748
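As a quick consistency check, boundary F1 is the harmonic mean of precision and recall, and the reported syllable numbers line up:

```python
def f1_score(precision, recall):
    """Boundary F1 = harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported syllable boundary precision/recall reproduce the reported F1
print(round(f1_score(0.574, 0.636), 3))  # → 0.603
```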
## Training
VG-HuBERT uses visually-grounded contrastive learning to learn speech representations. The model jointly trains on speech and images using datasets like SpokenCOCO or Places.
### Training Setup

1. Install training dependencies:

   ```bash
   pip install -r requirements.txt  # includes fairseq, apex, Pillow, etc.
   ```

2. Download datasets:
   - SpokenCOCO: spoken captions + MSCOCO images
   - Places: spoken descriptions + Places365 images

3. Download pre-trained models for initialization:
   - HuBERT Base (pretrained on LibriSpeech 960h)
   - DINO ViT (vision encoder)

4. Configure training:

   ```yaml
   # configs/spokencoco.yaml
   train_audio_dataset_json_file: "/path/to/SpokenCOCO_train.json"
   val_audio_dataset_json_file: "/path/to/SpokenCOCO_val.json"
   load_hubert_weights: "/path/to/hubert_base_ls960.pt"
   load_pretrained_vit: "/path/to/dino_vitsmall8_pretrain.pth"
   batch_size: 32
   n_epochs: 30
   gpus: "0,1,2,3"
   ```

5. Train:

   ```bash
   python train.py --config configs/spokencoco.yaml
   ```
### Training Outputs

- Checkpoints saved to `exp_dir/` (default: `./checkpoints/`)
- TensorBoard logs in the experiment directory
- Config saved as `config.yaml` in the experiment directory
## Architecture

Dual-encoder with cross-modal transformer:

- Audio encoder: HuBERT Base (12 layers, 768-dim)
- Vision encoder: ViT Small/Base (DINO pretrained)
- Cross-modal layers: 5 transformer layers for audio-image interaction
- Loss: Margin InfoNCE (contrastive learning in a common embedding space)

The trained audio encoder can then be used for segmentation without the vision components.
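The margin InfoNCE objective can be illustrated for a single anchor. This is a minimal sketch under a common convention (margin subtracted from the positive logit before the softmax); the exact formulation used in VG-HuBERT's training may differ:

```python
import math

def margin_info_nce(sims, pos_idx, margin=0.2):
    """Contrastive loss for one anchor embedding against a batch of candidates.

    sims: similarity scores to every candidate; sims[pos_idx] is the true pair.
    Subtracting the margin from the positive logit forces the matching pair to
    score at least `margin` above the negatives before the loss gets small.
    """
    logits = [s - margin if j == pos_idx else s for j, s in enumerate(sims)]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return log_denom - logits[pos_idx]
```

During training this would be computed symmetrically (audio-to-image and image-to-audio) and averaged over the batch.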
### Training from Scratch

The package includes all training code:

- `vg_hubert/model/`: Dual encoder, audio/vision transformers
- `vg_hubert/training/`: Trainer, optimizers, utilities
- `vg_hubert/datasets/`: SpokenCOCO and Places data loaders

See `configs/` for complete training examples.
## What's Different in This Fork

- No PyTorch patching: uses native `attn_implementation='eager'` (PyTorch 2.0+)
- Simplified interface: single `Segmenter` class for all use cases
- HuggingFace Hub: automatic model downloading
- Complete package: both training and inference (like Sylber)
- PyPI distribution: easy installation via pip
- Apple Silicon support: automatic MPS (Metal Performance Shaders) GPU acceleration
- Optimized MinCut: ~20-50x faster syllable segmentation using the efficient algorithm from SyllableLM (Baade et al., 2024), with no quality degradation
## Implementation Details
For inference, this package uses HuggingFace's transformers.HubertModel instead of the original fairseq implementation. This is possible because VG-HuBERT's audio encoder architecture is identical to the standard HuBERT model. The visual grounding training adds a vision encoder and cross-modal transformer layers, but these components are only used during training to learn better speech representations. At inference time, only the audio encoder weights are needed, which are fully compatible with the HuggingFace HuBERT architecture. This simplifies deployment and eliminates the fairseq dependency for inference.
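A hypothetical sketch of what this inference path looks like with `transformers`. The checkpoint name and layer index follow this README, but whether the Hub repo loads directly through `HubertModel.from_pretrained` is an assumption, and the package's internal loading code may differ:

```python
import torch
from transformers import HubertModel

# Load the audio encoder as a plain HuggingFace HuBERT model (no fairseq needed)
model = HubertModel.from_pretrained("hjvm/VG-HuBERT", attn_implementation="eager")
model.eval()

# One second of 16 kHz audio, with random samples as a stand-in for a real waveform
wav = torch.randn(1, 16000)
with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# Layer-8 hidden states are the features used for syllable segmentation
features = out.hidden_states[8]  # shape: (1, num_frames, 768)
```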
## Citations

### VG-HuBERT Original Work

Syllable Segmentation:

```bibtex
@inproceedings{peng2023syllable,
  title={Syllable Segmentation and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model},
  author={Peng, Puyuan and Li, Shang-Wen and Räsänen, Okko and Mohamed, Abdelrahman and Harwath, David},
  booktitle={Interspeech},
  year={2023}
}
```

Word Discovery:

```bibtex
@inproceedings{peng2022word,
  title={Word Discovery in Visually Grounded, Self-Supervised Speech Models},
  author={Peng, Puyuan and Harwath, David},
  booktitle={Interspeech},
  year={2022}
}
```
### Interface Design

This package follows the interface design of Sylber:

```bibtex
@article{cho2024sylber,
  title={Sylber: Syllabic Embedding Representation of Speech from Raw Audio},
  author={Cho, Cheol Jun and Lee, Nicholas and Gupta, Akshat and Agarwal, Dhruv and Chen, Ethan and Black, Alan W and Anumanchipalli, Gopala K},
  journal={arXiv preprint arXiv:2410.07168},
  year={2024}
}
```
### Optimized MinCut Algorithm

The MinCut algorithm used for syllable segmentation has been updated to use the efficient implementation from SyllableLM (Baade et al., 2024), which provides a ~20-50x speedup over the original with no statistically significant quality difference:

```bibtex
@misc{baade2024syllablelmlearningcoarsesemantic,
  title={SyllableLM: Learning Coarse Semantic Units for Speech Language Models},
  author={Alan Baade and Puyuan Peng and David Harwath},
  year={2024},
  eprint={2410.04029},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.04029},
}
```
Performance Comparison (LibriSpeech test-clean, 50 utterances):

- Speed: 6961 ms → 133 ms per utterance (52x faster)
- Quality: F1 = 0.377 → 0.372 (p = 0.22, not statistically significant)
- 82% of utterances produce identical segmentations

Key optimizations:

- Cumulative-sum preprocessing for O(1) range queries
- Segment length constraints (min_hop=3, max_hop=50 frames)
- 5-component cost calculation
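The cumulative-sum trick can be sketched on its own: precompute prefix sums once in O(N), then any range sum over an SSM row (or a 2-D block, with a 2-D prefix table) costs O(1) instead of O(N). A minimal 1-D illustration:

```python
def prefix_sums(xs):
    """Cumulative sums with a leading zero: ps[j] = sum(xs[:j])."""
    ps = [0.0]
    for x in xs:
        ps.append(ps[-1] + x)
    return ps

def range_sum(ps, i, j):
    """Sum of xs[i:j] in O(1) using the precomputed prefix sums."""
    return ps[j] - ps[i]

ps = prefix_sums([1.0, 2.0, 3.0, 4.0])
print(range_sum(ps, 1, 3))  # → 5.0 (2.0 + 3.0)
```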
See `vg_hubert/tests/mincut_validation.ipynb` for full validation results.
## Related Repositories

- Original implementations: word-discovery, syllable-discovery
- Fork parent: human-ai-lab/VG-HuBERT
- Interface inspiration: Sylber
## License

BSD-3-Clause License (same as the original repositories)

## Contributing

Issues and pull requests are welcome. Please ensure changes maintain compatibility with the original model weights and include proper attribution.