
VG-HuBERT: Speech Segmentation with Simplified Interface

Unsupervised syllable and word segmentation using visually grounded HuBERT (VG-HuBERT). This fork provides a simplified interface with HuggingFace Hub integration, a PyTorch ≥2.0 code path that removes the need to patch multi_head_attention_forward, an optimized MinCut algorithm (~40x speedup), and PyPI package distribution.

Quick Start

from vg_hubert import Segmenter

# Syllable segmentation (RECOMMENDED: includes MinCutMerge post-processing)
segmenter = Segmenter(mode="syllable", merge_threshold=0.3)
outputs = segmenter("audio.wav")

# Word segmentation  
word_segmenter = Segmenter(mode="word")
word_outputs = word_segmenter("audio.wav")

Installation

# From source
pip install git+https://github.com/hjvm/VG-HuBERT.git

# Or from PyPI
pip install vg-hubert

Requirements: Python ≥3.8, PyTorch ≥2.0, transformers, scipy, soundfile

Features

New in this fork:

  • 🚀 40x faster MinCut: Optimized algorithm from SyllableLM (Baade et al., 2024)
  • 🔧 MinCutMerge post-processing: Prevents over-segmentation (matches original paper)
  • 🤗 HuggingFace integration: Auto-download models from Hub
  • 🍎 Apple Silicon support: Native MPS acceleration
  • 📦 PyPI distribution: Simple pip install
  • 🧹 No fairseq for inference: Removed complex dependency

Usage

Basic Example

from vg_hubert import Segmenter
import soundfile as sf

# Load and segment
segmenter = Segmenter(
    model_ckpt="hjvm/VG-HuBERT",  # HuggingFace Hub or local path
    mode="syllable",
    device="cuda",  # or "mps" or "cpu" (auto-detects best available)
    merge_threshold=0.3  # Enable MinCutMerge (recommended)
)

outputs = segmenter("audio.wav")

# Access results
for start, end in outputs['segments']:
    print(f"Segment: {start:.2f}s - {end:.2f}s")

# Access features
segment_features = outputs['segment_features']  # [num_segments, 768]
frame_features = outputs['hidden_states']       # [num_frames, 768]
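
As a small illustration of the quantity that merge_threshold compares against, the snippet below computes the cosine similarity between adjacent segment features; it assumes segment_features converts cleanly to a NumPy array of shape (num_segments, 768), as described above.

import numpy as np

# Illustrative post-processing of the outputs above: cosine similarity
# between adjacent segment features (the quantity MinCutMerge thresholds).
feats = np.asarray(outputs['segment_features'])          # (num_segments, 768)
unit = feats / np.clip(np.linalg.norm(feats, axis=1, keepdims=True), 1e-8, None)

for i in range(len(unit) - 1):
    sim = float(unit[i] @ unit[i + 1])
    print(f"segments {i} and {i + 1}: cosine similarity = {sim:.3f}")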

MinCut Configuration

The package supports multiple MinCut configurations for different use cases:

# Configuration 1: RECOMMENDED (matches original paper)
# - Fast algorithm + MinCutMerge post-processing
# - Prevents over-segmentation
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=0.3,  # Original paper value
    min_segment_frames=2  # Filter very short segments
)

# Configuration 2: Plain MinCut (no merging)
# - Useful for analysis or more granular segmentation
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=None  # Disable MinCutMerge
)

# Configuration 3: Custom merge threshold
# - Tune for your specific needs
# - Higher = more merging = fewer segments
# - Lower = less merging = more segments
segmenter = Segmenter(
    mode="syllable",
    merge_threshold=0.5  # More aggressive merging
)

See examples/mincut_comparison.py for detailed comparison.

Low-Level API

For advanced users who need full control:

from vg_hubert.mincut import segment_with_mincut
import numpy as np

# Extract features (see examples/ for full code)
features = ...  # Shape: (num_frames, 768)

# Apply MinCut with full control
boundaries, ssm = segment_with_mincut(
    features=features,
    K=10,  # Number of boundaries
    merge_threshold=0.3,  # Set to None for plain MinCut
    min_segment_frames=2,
    min_hop=3,  # Minimum segment length in frames
    max_hop=50  # Maximum segment length in frames
)
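
The boundaries returned above are in feature-frame units. A small follow-up sketch, assuming the boundaries come back as (start_frame, end_frame) pairs and using HuBERT's standard 20 ms frame hop (both assumptions; check the actual return format in the source):

FRAME_SEC = 0.02  # HuBERT features are emitted at ~50 Hz (one frame per 20 ms)

# Convert frame-index boundaries to timestamps in seconds
segments_sec = [(start * FRAME_SEC, end * FRAME_SEC) for start, end in boundaries]
for start, end in segments_sec:
    print(f"{start:.2f}s - {end:.2f}s")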

Parameters

  • mode: "syllable" (MinCut + feature similarity) or "word" (CLS attention)
  • layer: HuBERT layer to use (default: 8 for syllables, 9 for words)
  • device: "cuda", "mps", or "cpu" (defaults to CUDA if available, falls back to MPS on Apple Silicon, then CPU)
  • sec_per_syllable: Target syllable duration for MinCut (default: 0.2)
  • merge_threshold: Cosine similarity threshold for merging adjacent segments (default: 0.3, set to None to disable)
  • min_segment_frames: Discard segments with this many frames or fewer (default: 2)
  • attn_threshold: Attention threshold for word boundaries (default: 0.25)
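
For example, a word-mode segmenter with the documented defaults spelled out explicitly (parameter names and values taken from the list above):

from vg_hubert import Segmenter

word_segmenter = Segmenter(
    mode="word",          # CLS-attention-based word discovery
    layer=9,              # default layer for word mode
    attn_threshold=0.25,  # attention threshold for word boundaries
    device="cpu",
)
word_outputs = word_segmenter("audio.wav")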

See examples/ for more usage patterns.

Model Details

Checkpoints

Two pre-trained models optimized for different tasks:

Checkpoint               Task      Layer  Algorithm              Size
vg-hubert-syllable.pth   Syllable  8      MinCut + MinCutMerge   474 MB
vg-hubert-word.pth       Word      9      CLS Attention          361 MB

Algorithm Details

MinCut Segmentation (Syllables):

  1. Extract HuBERT features from layer 8
  2. Compute self-similarity matrix (SSM)
  3. Apply efficient MinCut algorithm (Baade et al., 2024)
    • ~40x faster than original O(N²K) implementation
    • Uses cumulative sums for O(1) range queries
  4. Optional: Apply MinCutMerge post-processing (Peng et al., 2023); a sketch follows this list
    • Iteratively merge adjacent segments with cosine similarity ≥ threshold
    • Prevents over-segmentation
    • Recommended for production use
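
A minimal sketch of the MinCutMerge idea in step 4, assuming segments are (start_frame, end_frame) spans over a (num_frames, dim) feature array; this illustrates the merge rule, not the package's actual implementation:

import numpy as np

def mincut_merge(segments, features, threshold=0.3):
    """Repeatedly merge adjacent segments whose mean-pooled features have
    cosine similarity >= threshold (illustrative, not the library code)."""
    def pooled(seg):
        vec = features[seg[0]:seg[1]].mean(axis=0)
        return vec / (np.linalg.norm(vec) + 1e-8)

    segments = list(segments)
    merged = True
    while merged and len(segments) > 1:
        merged = False
        for i in range(len(segments) - 1):
            if pooled(segments[i]) @ pooled(segments[i + 1]) >= threshold:
                # Fuse the pair into one span and restart the scan
                segments[i] = (segments[i][0], segments[i + 1][1])
                del segments[i + 1]
                merged = True
                break
    return segments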

Performance Comparison:

Configuration              F1 (LibriSpeech)  Speed (ms/utt)  Speedup
Original MinCut            0.501             7524            1.0x
New MinCut                 0.501             169             44.5x
New + MinCutMerge-0.3 ⭐    TBD               171             44.0x

Note: LibriSpeech results shown; original paper reports F1=0.603 on SpokenCOCO

Performance (SpokenCOCO - Original Paper)

Syllable Segmentation:

  • Boundary F1: 0.603
  • Boundary Precision: 0.574
  • Boundary Recall: 0.636
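
Boundary F1 is the harmonic mean of boundary precision and recall, where a hypothesized boundary counts as correct if it lands within a small tolerance of an unmatched reference boundary. A hedged sketch of that computation (the 50 ms tolerance is an assumption; the papers define the exact matching rule):

def boundary_scores(ref, hyp, tol=0.05):
    """Illustrative boundary precision/recall/F1 with greedy one-to-one
    matching of hypothesis boundaries to reference boundaries (in seconds)."""
    ref, hyp = sorted(ref), sorted(hyp)
    matched = [False] * len(ref)
    hits = 0
    for b in hyp:
        for i, r in enumerate(ref):
            if not matched[i] and abs(b - r) <= tol:
                matched[i] = True
                hits += 1
                break
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1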

Word Discovery:

  • Token F1: 0.195
  • Type F1: 0.174
  • NED: 0.748

Training

VG-HuBERT uses visually-grounded contrastive learning to learn speech representations. The model jointly trains on speech and images using datasets like SpokenCOCO or Places.

Training Setup

  1. Install training dependencies:

pip install -r requirements.txt  # Includes fairseq, apex, Pillow, etc.

  2. Download the datasets (SpokenCOCO or Places).

  3. Download the pre-trained models used for initialization (the HuBERT Base and DINO ViT checkpoints referenced in the config below).

  4. Configure training:

# configs/spokencoco.yaml
train_audio_dataset_json_file: "/path/to/SpokenCOCO_train.json"
val_audio_dataset_json_file: "/path/to/SpokenCOCO_val.json"
load_hubert_weights: "/path/to/hubert_base_ls960.pt"
load_pretrained_vit: "/path/to/dino_vitsmall8_pretrain.pth"
batch_size: 32
n_epochs: 30
gpus: "0,1,2,3"

  5. Train:

python train.py --config configs/spokencoco.yaml

Training Outputs

  • Checkpoints saved to exp_dir/ (default: ./checkpoints/)
  • TensorBoard logs in experiment directory
  • Config saved as config.yaml in experiment directory

Architecture

Dual-encoder with cross-modal transformer:

  • Audio encoder: HuBERT Base (12 layers, 768-dim)
  • Vision encoder: ViT Small/Base (DINO pretrained)
  • Cross-modal layers: 5 transformer layers for audio-image interaction
  • Loss: Margin InfoNCE (contrastive learning in common embedding space)

The trained audio encoder can then be used for segmentation without the vision components.
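
As a rough sketch of the contrastive objective, one common margin-InfoNCE formulation over a batch of paired audio/image embeddings looks like the following; the margin placement, temperature, and symmetric form are assumptions, and the paper's exact loss may differ:

import torch
import torch.nn.functional as F

def margin_infonce(audio_emb, image_emb, margin=0.2, temperature=0.07):
    """Illustrative margin InfoNCE: paired rows of audio_emb and image_emb
    are positives; all other pairings in the batch are negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    sim = a @ v.t()                                                  # (B, B) cosine similarities
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)   # penalize positive pairs
    logits = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    # symmetric cross-entropy: audio-to-image and image-to-audio retrieval
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))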

Training from Scratch

The package includes all training code:

  • vg_hubert/model/: Dual encoder, audio/vision transformers
  • vg_hubert/training/: Trainer, optimizers, utilities
  • vg_hubert/datasets/: SpokenCOCO and Places data loaders

See configs/ for complete training examples.

What's Different in This Fork

  1. No PyTorch patching: Uses native attn_implementation='eager' (PyTorch 2.0+)
  2. Simplified interface: Single Segmenter class for all use cases
  3. HuggingFace Hub: Automatic model downloading
  4. Complete package: Both training and inference (like Sylber)
  5. PyPI distribution: Easy installation via pip
  6. Apple Silicon support: Automatic MPS (Metal Performance Shaders) GPU acceleration
  7. Optimized MinCut: ~20-50x faster syllable segmentation using the efficient algorithm from SyllableLM (Baade et al., 2024), with no quality degradation

Implementation Details

For inference, this package uses HuggingFace's transformers.HubertModel instead of the original fairseq implementation. This is possible because VG-HuBERT's audio encoder architecture is identical to the standard HuBERT model. The visual grounding training adds a vision encoder and cross-modal transformer layers, but these components are only used during training to learn better speech representations. At inference time, only the audio encoder weights are needed, which are fully compatible with the HuggingFace HuBERT architecture. This simplifies deployment and eliminates the fairseq dependency for inference.
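
A minimal sketch of what this looks like with transformers directly; whether the "hjvm/VG-HuBERT" Hub repository exposes weights loadable this way is an assumption, and the Segmenter class handles the equivalent loading internally:

import torch
import soundfile as sf
from transformers import HubertModel

# Load the audio encoder as a standard HuBERT model (eager attention keeps
# per-head attention weights accessible, e.g. for word segmentation).
model = HubertModel.from_pretrained("hjvm/VG-HuBERT", attn_implementation="eager")
model.eval()

wav, sr = sf.read("audio.wav")  # expects 16 kHz mono audio
input_values = torch.tensor(wav, dtype=torch.float32).unsqueeze(0)

with torch.no_grad():
    out = model(input_values, output_hidden_states=True)

# hidden_states[0] is the convolutional feature projection output;
# hidden_states[8] is the layer-8 output used for syllable segmentation.
layer8 = out.hidden_states[8]   # shape (1, num_frames, 768)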

Citations

VG-HuBERT Original Work

Syllable Segmentation:

@inproceedings{peng2023syllable,
  title={Syllable Segmentation and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model},
  author={Peng, Puyuan and Li, Shang-Wen and Räsänen, Okko and Mohamed, Abdelrahman and Harwath, David},
  booktitle={Interspeech},
  year={2023}
}

Word Discovery:

@inproceedings{peng2022word,
  title={Word Discovery in Visually Grounded, Self-Supervised Speech Models},
  author={Peng, Puyuan and Harwath, David},
  booktitle={Interspeech},
  year={2022}
}

Interface Design

This package follows the interface design of Sylber:

@article{cho2024sylber,
  title={Sylber: Syllabic Embedding Representation of Speech from Raw Audio},
  author={Cho, Cheol Jun and Lee, Nicholas and Gupta, Akshat and Agarwal, Dhruv and Chen, Ethan and Black, Alan W and Anumanchipalli, Gopala K},
  journal={arXiv preprint arXiv:2410.07168},
  year={2024}
}

Optimized MinCut Algorithm

The MinCut algorithm used for syllable segmentation has been updated to use the efficient implementation from SyllableLM (Baade et al., 2024), which provides ~20-50x speedup over the original with no statistically significant quality difference:

@misc{baade2024syllablelmlearningcoarsesemantic,
      title={SyllableLM: Learning Coarse Semantic Units for Speech Language Models}, 
      author={Alan Baade and Puyuan Peng and David Harwath},
      year={2024},
      eprint={2410.04029},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.04029}, 
}

Performance Comparison (LibriSpeech test-clean, 50 utterances):

  • Speed: 6961ms → 133ms per utterance (52x faster)
  • Quality: F1=0.377 → 0.372 (p=0.22, not significant)
  • 82% of utterances produce identical segmentations

Key optimizations:

  • Cumulative sum preprocessing for O(1) range queries
  • Segment length constraints (min_hop=3, max_hop=50 frames)
  • 5-component cost calculation
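
The cumulative-sum trick, shown in isolation: a 2-D prefix-sum table over the self-similarity matrix makes any rectangular block sum (the quantity the MinCut cost evaluates repeatedly) an O(1) lookup. Whether SyllableLM's code uses exactly this 2-D form is an assumption; the sketch just demonstrates the technique.

import numpy as np

def block_sum_table(ssm):
    """2-D prefix sums over the SSM, padded with a zero row/column so that
    block sums reduce to four table lookups."""
    return np.pad(ssm, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def block_sum(table, r0, r1, c0, c1):
    """Sum of ssm[r0:r1, c0:c1] in O(1) using the prefix-sum table."""
    return table[r1, c1] - table[r0, c1] - table[r1, c0] + table[r0, c0]

# Quick self-check against direct summation
ssm = np.random.rand(6, 6)
table = block_sum_table(ssm)
assert np.isclose(block_sum(table, 1, 4, 2, 5), ssm[1:4, 2:5].sum())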

See vg_hubert/tests/mincut_validation.ipynb for full validation results.

Related Repositories

License

BSD-3-Clause License (same as original repositories)

Contributing

Issues and pull requests welcome. Please ensure changes maintain compatibility with original model weights and include proper attribution.
