
findsylls

Unsupervised syllable segmentation, evaluation, and embedding extraction toolkit for speech audio: extract amplitude/modulation envelopes, segment them into syllables, extract per-syllable embeddings, and evaluate the results against TextGrid annotations.

Features

Core Segmentation & Evaluation

  • Classical envelope-based methods: RMS, Hilbert, low-pass, spectral band subtraction (SBS), gammatone, theta oscillator
  • End-to-end neural methods (optional): Sylber (self-supervised syllabic distillation)
  • Peak & valley segmentation using Billauer's peakdetect algorithm
  • Robust TextGrid parsing for phones, syllables, words with vowel filtering
  • Multi-level evaluation metrics (precision/recall/F1 at nuclei, boundaries, spans)
  • Batch pipeline utilities + fuzzy filename matching
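
The fuzzy filename matching can be approximated with the standard library. The sketch below (using difflib; this is an illustration, not the library's actual matcher) pairs WAV files with TextGrids by stem similarity:

```python
from difflib import get_close_matches
from pathlib import Path

def match_textgrids(wav_paths, textgrid_paths, cutoff=0.6):
    """Pair each WAV with the TextGrid whose filename stem is most similar.

    Illustrative only -- findsylls' own matcher may differ.
    """
    tg_by_stem = {Path(p).stem: p for p in textgrid_paths}
    pairs = {}
    for wav in wav_paths:
        hits = get_close_matches(Path(wav).stem, tg_by_stem, n=1, cutoff=cutoff)
        if hits:
            pairs[wav] = tg_by_stem[hits[0]]
    return pairs

pairs = match_textgrids(["s01_utt1.wav", "s02_utt2.wav"],
                        ["s01_utt1.TextGrid", "s02-utt2.TextGrid"])
```

Tolerant stem matching lets annotations named with slightly different separators (underscore vs. hyphen) still pair up with their audio.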

Syllable Embedding Pipeline ✨ NEW in v1.0.0

Extract per-syllable embeddings for clustering, classification, and cross-lingual analysis:

Feature Extraction Methods:

  • sylber - 768-dim self-supervised (GPU, ~50 fps, auto-downloads from HuggingFace)
  • vg_hubert - 768-dim VG-HuBERT (GPU, ~50 fps, auto-downloads from HuggingFace)
  • mfcc - 13/26/39-dim with delta/delta-delta (CPU, ~100 fps)
  • melspec - 80-dim mel-filterbank (CPU, ~100 fps)

Pooling Methods:

  • mean - Average frames within syllable (default)
  • onc - Onset-Nucleus-Coda template (3× dimensions, preserves structure)
  • max, median - Alternative aggregations

Storage Formats:

  • NPZ (NumPy, always available)
  • HDF5 (optional, requires h5py, memory-mapped for large corpora)
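
As a rough sketch of what the NPZ format involves (the per-utterance layout here is an assumption, not necessarily findsylls' actual schema), each file's embedding matrix can be stored as a named array in one archive:

```python
import io

import numpy as np

# Hypothetical corpus results: {utterance_id: (num_syllables, dim) array}
results = {
    "utt_001": np.random.rand(5, 13),
    "utt_002": np.random.rand(8, 13),
}

# Write one named array per utterance (in-memory buffer for illustration)
buf = io.BytesIO()
np.savez_compressed(buf, **results)
buf.seek(0)

# Arrays are read back by name
loaded = np.load(buf)
print(loaded["utt_001"].shape)
```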

Install

# Core installation (envelope-based segmentation + MFCC/Mel embeddings)
pip install findsylls

# With end-to-end neural segmentation (Sylber)
pip install 'findsylls[end2end]'

# With HDF5 storage for large embedding datasets
pip install 'findsylls[storage]'

# With neural embedders (VG-HuBERT, requires PyTorch)
pip install 'findsylls[embedding]'

# With all optional features
pip install 'findsylls[all]'

# With visualization extras
pip install 'findsylls[viz]'

# Local development (editable)
pip install -e '.[dev]'

Note: Neural methods (Sylber, VG-HuBERT) require PyTorch and automatically download pre-trained models from HuggingFace Hub on first use (~500-800MB each).

Quick Start

Syllable Segmentation

from findsylls import segment_audio

# Segment audio into syllables
sylls, env, t = segment_audio(
    "example.wav",
    envelope_fn="sbs",
    segment_fn="peaks_and_valleys"
)
# sylls: [(start, peak, end), ...]
print(f"Found {len(sylls)} syllables")

Syllable Embeddings ✨

from findsylls import embed_audio

# Extract per-syllable embeddings (single file)
embeddings, metadata = embed_audio(
    'audio.wav',
    segmentation='peaks_and_valleys',  # or 'sylber'
    features='mfcc',                   # or 'sylber', 'vg_hubert'
    pooling='mean'                     # or 'onc', 'max'
)
# embeddings: (num_syllables, embedding_dim) NumPy array
# metadata: dict with boundaries, methods, timestamps

print(f"Shape: {embeddings.shape}")
print(f"Syllables: {metadata['num_syllables']}")

# Process entire corpus (batch)
from findsylls import embed_corpus, save_embeddings

results = embed_corpus(
    audio_paths=['audio1.wav', 'audio2.wav', 'audio3.wav'],
    segmentation='peaks_and_valleys',
    features='mfcc',
    pooling='mean',
    n_jobs=4  # Parallel processing
)

# Save to disk
save_embeddings(results, 'embeddings.npz')  # or .h5 for HDF5

MFCC with Delta Features

from findsylls import embed_audio

# 13-dim MFCC
embeddings_13, _ = embed_audio('audio.wav', features='mfcc')

# 26-dim MFCC (13 + 13 delta)
embeddings_26, _ = embed_audio(
    'audio.wav',
    features='mfcc',
    feature_kwargs={'include_delta': True}
)

# 39-dim MFCC (13 + 13 delta + 13 delta-delta)
embeddings_39, _ = embed_audio(
    'audio.wav',
    features='mfcc',
    feature_kwargs={'include_delta': True, 'include_delta_delta': True}
)

Batch Evaluation

from findsylls import run_evaluation, aggregate_results

results = run_evaluation(
    textgrid_paths="data/**/*.TextGrid",
    wav_paths="data/**/*.wav",
    phone_tier=1,
    syllable_tier=2,
    word_tier=3,
    envelope_fn="hilbert",
)
print(results.head())

# Aggregate metrics
summary = aggregate_results(results, dataset_name="MyCorpus")
print(summary)

CLI

After install:

# Segment audio
findsylls segment input.wav --envelope sbs --method peaks_and_valleys --out sylls.json

# Extract embeddings
findsylls embed input.wav --features mfcc --pooling mean --out embeddings.npz

# Evaluate against TextGrid annotations
findsylls evaluate "data/**/*.wav" "data/**/*.TextGrid" \
    --phone-tier 1 --syllable-tier 2 --word-tier 3 \
    --envelope hilbert --out results.csv

# Show help
findsylls --help
findsylls segment --help
findsylls embed --help
findsylls evaluate --help

Documentation


Syllable Embedding Pipeline

The embedding pipeline extends findsylls to extract per-syllable embeddings for downstream tasks.

Key Concepts

Two orthogonal dimensions:

  1. Features (feature extraction): Sylber, VG-HuBERT, MFCC, Mel-spectrogram
  2. Pooling (frame → syllable aggregation): mean, ONC template, max, median

Any segmentation method can feed any feature extraction method.

Complete Example

from findsylls import embed_audio

# Single file with full control
embeddings, metadata = embed_audio(
    'audio.wav',
    segmentation='peaks_and_valleys',  # Segmentation method
    features='sylber',                 # Feature extractor
    pooling='mean',                    # Aggregation method
    sr=16000,                          # Sample rate
    return_metadata=True               # Include metadata
)

# embeddings: (num_syllables, 768) for Sylber
# metadata contains: boundaries, peaks, num_syllables, methods, etc.

# Corpus processing with storage
from findsylls import embed_corpus, save_embeddings

results = embed_corpus(
    audio_paths=['data/audio1.wav', 'data/audio2.wav'],
    segmentation='peaks_and_valleys',
    features='mfcc',
    pooling='mean',
    feature_kwargs={'include_delta': True},  # 26-dim MFCC
    n_jobs=4  # Parallel processing
)

# Save to disk (auto-detects format from extension)
save_embeddings(results, 'embeddings.npz')   # NumPy format
save_embeddings(results, 'embeddings.h5')    # HDF5 format (requires h5py)

Feature Extraction Options

Neural (GPU-enabled, requires [embedding] or [all]):

  • sylber - 768-dim, ~50 fps, self-supervised
  • vg_hubert - 768-dim, ~50 fps, visually-grounded (auto-downloads from HuggingFace)

Classical (CPU-optimized, always available):

  • mfcc - 13/26/39-dim (with deltas), ~100 fps
  • melspec - 80-dim mel-filterbank, ~100 fps

Pooling Options

  • mean - Average frames in syllable span (default)
  • onc - Onset-Nucleus-Coda template (3× dimensions)
    • Onset: 30% from start to peak
    • Nucleus: peak frame
    • Coda: 70% from peak to end
  • max - Max pooling across time
  • median - Median pooling
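
A minimal NumPy sketch of ONC-style pooling, under one plausible reading of the template above (onset = mean over the first 30% of the start→peak region, nucleus = the single peak frame, coda = mean over the final 70% of the peak→end region); the library's exact region boundaries may differ:

```python
import numpy as np

def onc_pool(frames, peak_idx, onset_frac=0.3, coda_frac=0.7):
    """Pool (T, D) frames into a (3*D,) Onset-Nucleus-Coda vector.

    Region boundaries are one plausible reading of the template and may
    differ from findsylls' exact 'onc' definition:
      onset   = mean over the first `onset_frac` of start->peak
      nucleus = the single peak frame
      coda    = mean over the final `coda_frac` of peak->end
    """
    T, D = frames.shape
    onset_end = max(1, int(round(peak_idx * onset_frac)))
    coda_start = min(T - 1, peak_idx + int(round((T - 1 - peak_idx) * (1 - coda_frac))))
    onset = frames[:onset_end].mean(axis=0)
    nucleus = frames[peak_idx]
    coda = frames[coda_start:].mean(axis=0)
    return np.concatenate([onset, nucleus, coda])  # (3*D,)

vec = onc_pool(np.random.rand(20, 13), peak_idx=8)
```

Unlike mean pooling, the concatenated vector keeps onset, nucleus, and coda information in separate dimensions, at the cost of tripling the embedding size.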

Storage Options

from findsylls import save_embeddings, load_embeddings

# NumPy format (always available, good for small-medium datasets)
save_embeddings(results, 'embeddings.npz')
loaded = load_embeddings('embeddings.npz')

# HDF5 format (requires h5py, best for large corpora)
# Supports memory-mapped partial loading
save_embeddings(results, 'embeddings.h5')
loaded = load_embeddings('embeddings.h5')  # Full load
loaded = load_embeddings('embeddings.h5', indices=[0, 5, 10])  # Partial load

For complete documentation, see docs/EMBEDDING_PIPELINE.md.


API Reference

Segmentation & Evaluation

  • segment_audio: one-file end-to-end pipeline (load → envelope → segment)
  • run_evaluation: batch-match WAV/TextGrid pairs and compute metrics
  • get_amplitude_envelope: compute an envelope via a registered method
  • segment_envelope: dispatch a segmentation algorithm
  • flatten_results / aggregate_results: reshape and aggregate evaluation outputs
  • plot_segmentation_result: multi-panel qualitative plot (optional)

Embedding Extraction ✨

  • embed_audio: extract syllable embeddings from a single audio file
  • embed_corpus: batch embedding extraction with parallel processing
  • save_embeddings: save embeddings in NPZ or HDF5 format
  • load_embeddings: load embeddings from NPZ or HDF5 format

Validation

v1.0.0 was validated against the legacy spot_the_word implementation:

  • r = 0.9990 correlation on MFCC embeddings
  • 100% syllable count match (10/10 test files)
  • Identical segmentation boundaries
  • Tested on the Brent corpus (4,209 syllables, 862 utterances)

See docs/VALIDATION_RESULTS.md for full report.

Adding Methods

  1. Envelope: implement compute_* returning (env, times) in envelope/ and register in envelope/dispatch.py.
  2. Segmentation: implement segment_<name>(envelope, times, **kwargs) in segmentation/ and add branch in segmentation/dispatch.py.
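
For illustration, here is a minimal frame-wise RMS envelope that satisfies the (env, times) contract described above. The function body, parameter names, and defaults are assumptions; registration itself happens in envelope/dispatch.py:

```python
import numpy as np

def compute_myenv(audio, sr, frame_len=0.025, hop=0.010):
    """Sketch of a custom envelope method: frame-wise RMS of the waveform.

    Returns (env, times) as the dispatch contract requires. Parameter
    names and defaults are illustrative assumptions.
    """
    flen = int(frame_len * sr)
    hlen = int(hop * sr)
    n_frames = max(1, 1 + (len(audio) - flen) // hlen)
    env = np.empty(n_frames)
    times = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hlen : i * hlen + flen]
        env[i] = np.sqrt(np.mean(frame ** 2))
        times[i] = (i * hlen + flen / 2) / sr  # frame-center time in seconds
    return env, times

env, times = compute_myenv(np.random.randn(16000), sr=16000)
```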

TextGrid Tier Indexing

Indices are 0-based (as provided by the textgrid library). Pass None to skip a tier or -1 for placeholder syllable generation (currently returns empty list).

Evaluation Conventions

  • Default tolerance = 0.05s.
  • Evaluation keys are generated dynamically based on tier specifications (e.g., syllable_boundaries, word_spans).
  • To evaluate against a new tier, pass it via the tiers parameter: tiers={'my_tier': 3}.
  • Substitutions count toward span metrics; they remain zero under nuclei/boundary F1 semantics.
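
To make the tolerance semantics concrete, here is a sketch of boundary precision/recall/F1 with a ±0.05 s window and greedy one-to-one matching. This illustrates the metric's semantics and is not findsylls' internal implementation:

```python
def boundary_prf(pred, ref, tol=0.05):
    """Greedy one-to-one matching of predicted vs reference boundary times.

    Each reference boundary can be matched at most once; a predicted
    boundary is a true positive if a reference lies within +/- tol seconds.
    Illustrative sketch only.
    """
    ref_left = sorted(ref)
    tp = 0
    for b in sorted(pred):
        hit = next((r for r in ref_left if abs(r - b) <= tol), None)
        if hit is not None:
            tp += 1
            ref_left.remove(hit)  # consume the matched reference
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = boundary_prf([0.10, 0.52, 0.94], [0.08, 0.50, 1.20])
```

In the example, two of three predictions fall within 50 ms of a reference boundary, so precision, recall, and F1 all come out to 2/3.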

Roadmap

  • Classical envelope-based segmentation (v0.1.0)
  • Multi-level TextGrid evaluation (v0.1.0)
  • End-to-end neural segmentation (Sylber) (v0.1.1)
  • Syllable embedding pipeline (v1.0.0)
    • Core extractors (Sylber, MFCC, Mel-spectrogram)
    • Pooling methods (mean, ONC, max, median)
    • Corpus processing with parallel execution
    • Storage utilities (NPZ, HDF5)
    • VG-HuBERT support with MFCC delta features
  • Additional neural embedders (HuBERT, Wav2Vec2, WavLM)
  • Streaming / large-file handling
  • Alternative segmentation algorithms (Mermelstein, oscillator-based)
  • Enhanced CLI with progress tracking

Performance Notes

  • Audio loading prefers torchaudio (install it separately); otherwise it falls back to soundfile/librosa
  • Envelope computation is vectorized (NumPy); SBS/Hilbert faster than theta/gammatone
  • Embedding extraction: Neural methods (GPU) ~50 fps, Classical methods (CPU) ~100 fps
  • Corpus processing supports parallel execution via n_jobs parameter

FAQ

Can I use my own segmentation boundaries?
Yes! Extract embeddings from pre-segmented syllables by calling the lower-level API (see docs/EMBEDDING_PIPELINE.md).
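
Independent of the findsylls API, the core idea is simple enough to sketch directly: given frame-level features and your own (start, end) times in seconds, mean-pool the frames inside each span. The frame rate and function name below are illustrative assumptions:

```python
import numpy as np

def pool_own_boundaries(frame_feats, frame_rate, spans):
    """Mean-pool (T, D) frame features over user-supplied (start, end) spans.

    frame_rate: frames per second of `frame_feats` (e.g. ~100 for MFCC).
    Returns a (num_spans, D) array. Illustrative sketch only.
    """
    out = []
    for start, end in spans:
        i0 = int(np.floor(start * frame_rate))
        i1 = max(i0 + 1, int(np.ceil(end * frame_rate)))  # at least one frame
        out.append(frame_feats[i0:i1].mean(axis=0))
    return np.stack(out)

emb = pool_own_boundaries(np.random.rand(300, 13), 100, [(0.10, 0.35), (0.40, 0.72)])
```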

Why are boundary/spans columns missing in evaluation?
If a tier index is None or produces no intervals, those metrics are skipped.

How do I add a custom envelope method?
Implement a function returning (envelope, times) and register it in envelope/dispatch.py.

Can I stream long recordings?
Not yet; current design assumes full in-memory arrays. Streaming is on the roadmap.

Why do I get 0 TP for nuclei evaluation?
Likely vowel set mismatch; confirm phone tier labels match ARPABET or adjust SYLLABIC constant.

How does onc pooling relate to the legacy implementation?
Our onc pooling matches the legacy spot_the_word onc-strict variant (30% onset, peak-frame nucleus, 70% coda).

Roadmap / TODO

  • Implement generate_syllable_intervals (placeholder now).
  • Additional segmentation algorithms (Mermelstein, oscillator-based).
  • More robust CLI progress + JSON schema for outputs.
  • Optional streaming / large-file handling.

Legacy Code

The previous exploratory/monolithic implementations are retained under a legacy/ folder (formerly old/ and findsylls_old/) for reference only. They are excluded from distribution and not supported; prefer the public API described above.

License

MIT. See LICENSE.

Citation

If you use this software in academic work, please cite the preprint:

Plain text:

Vázquez Martínez, Héctor Javier. (2026). findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding. arXiv:2603.26292. https://arxiv.org/abs/2603.26292

BibTeX:

@misc{martinez2026findsyllslanguageagnostictoolkitsyllablelevel,
    title={findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding},
    author={Héctor Javier Vázquez Martínez},
    year={2026},
    eprint={2603.26292},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2603.26292},
}

For development guidelines see .github/copilot-instructions.md.

