Unsupervised syllable segmentation, evaluation, and embedding extraction toolkit for speech audio

findsylls

Language-agnostic toolkit for unsupervised syllable-level speech segmentation, embedding extraction, and evaluation.

findsylls provides a full pipeline from raw audio to clustered syllable embeddings:

Envelope computation — RMS, Hilbert, low-pass, SBS, theta, and neural pseudo-envelopes
Syllable segmentation — classical peak detection and neural end-to-end methods (Sylber, VG-HuBERT)
Feature extraction — MFCC, mel spectrogram, HuBERT, Sylber, VG-HuBERT
Syllable embedding — pooled per-syllable vectors for downstream tasks
Unsupervised discovery — k-means, mini-batch k-means, agglomerative clustering
Evaluation — F1 against TextGrid annotations at phone, syllable, and word granularity
Visualization — waveform, envelope, segmentation, and feature-matrix plots

Install

pip install findsylls                  # core (classical methods)
pip install 'findsylls[embedding]'     # neural feature extraction (HuBERT, VG-HuBERT)
pip install 'findsylls[end2end]'       # neural segmenters (Sylber, VG-HuBERT)
pip install 'findsylls[viz]'           # plotting extras
pip install 'findsylls[storage]'       # HDF5 corpus storage
pip install 'findsylls[all]'           # everything

Quick Start

1 — Segment audio into syllables

from findsylls import segment_audio

# Classical: peak detection on an SBS amplitude envelope
syllables, envelope, times = segment_audio(
    "audio.wav",
    method="peakdetect",
    segmentation_kwargs={"envelope_method": "sbs"},
    return_envelope=True,
)

print(f"Found {len(syllables)} syllables")
# syllables: [(start_s, nucleus_s, end_s), ...]

Syllable segmentation on a sample utterance

Waveform (gray), SBS amplitude envelope (blue), syllable boundaries (green), and detected nuclei (red dots) for a sample utterance.

Module Guide

Envelope (`findsylls.envelope`)

The envelope module converts a raw audio waveform into a 1-D amplitude signal. Every envelope computer implements EnvelopeComputer.compute(audio, sr) → (envelope, times).

from findsylls.audio.utils import load_audio
from findsylls.envelope import (
    RMSEnvelope, HilbertEnvelope, ThetaEnvelope, SBSEnvelope,
    LowpassEnvelope, CLSAttentionEnvelope, GreedyCosineEnvelope,
)
from findsylls.plotting import plot_multiple_envelopes

audio, sr = load_audio("audio.wav")

envelopes = {}
for name, computer in [
    ("RMS",     RMSEnvelope()),
    ("Hilbert", HilbertEnvelope()),
    ("Theta",   ThetaEnvelope()),
    ("SBS",     SBSEnvelope()),
]:
    env, times = computer.compute(audio, sr)
    envelopes[name] = (env, times)

fig = plot_multiple_envelopes(audio, sr, envelopes)

Envelope method comparison

Four classical envelope methods on the same utterance. SBS and Theta track syllabic rhythm most closely; Hilbert and RMS give a more continuous energy contour.

You can also call the functional dispatch directly:

from findsylls import get_amplitude_envelope

envelope, times = get_amplitude_envelope(audio, sr, method="theta")

Available envelope methods: rms, hilbert, lowpass, sbs, theta, cls_attention, greedy_cosine, mincut
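To make the (envelope, times) contract concrete, here is a minimal NumPy sketch of a frame-wise RMS envelope. It is not the findsylls implementation: the function name rms_envelope and the frame_ms / hop_ms parameters are assumptions chosen for illustration, and the library's own defaults may differ.

```python
import numpy as np

def rms_envelope(audio, sr, frame_ms=25.0, hop_ms=10.0):
    """Frame-wise RMS envelope; returns (envelope, times)."""
    audio = np.asarray(audio, dtype=np.float64)
    frame = max(1, int(round(sr * frame_ms / 1000)))
    hop = max(1, int(round(sr * hop_ms / 1000)))
    if len(audio) < frame:
        # Zero-pad very short clips so that at least one frame exists
        audio = np.pad(audio, (0, frame - len(audio)))
    n_frames = 1 + (len(audio) - frame) // hop
    starts = np.arange(n_frames) * hop
    # Gather all frames at once: row i holds samples starts[i] .. starts[i] + frame - 1
    frames = audio[starts[:, None] + np.arange(frame)[None, :]]
    envelope = np.sqrt(np.mean(frames ** 2, axis=1))
    times = (starts + frame / 2) / sr  # frame centres, in seconds
    return envelope, times

# Demo: 1 s of a 220 Hz tone with 4 Hz amplitude modulation (a crude "syllable rhythm")
sr = 16000
t = np.arange(sr) / sr
audio = 0.5 * (1 + np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 220 * t)
envelope, times = rms_envelope(audio, sr)
print(envelope.shape, times.shape)
```

The two returned arrays always have the same length, which is what lets downstream segmenters map envelope indices back to seconds.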

Segmentation (`findsylls.segmentation`)

All segmenters return List[(start_s, nucleus_s, end_s)].

Classical — peak detection

from findsylls import segment_audio
from findsylls.plotting import plot_multiple_envelope_segmentations
from findsylls.audio.utils import load_audio
from findsylls.envelope import HilbertEnvelope, ThetaEnvelope, SBSEnvelope
from findsylls.segmentation import get_segmenter

audio, sr = load_audio("audio.wav")

results = {}
for name, env_method in [("Hilbert", "hilbert"), ("Theta", "theta"), ("SBS", "sbs")]:
    env_computer = {"hilbert": HilbertEnvelope, "theta": ThetaEnvelope, "sbs": SBSEnvelope}[env_method]()
    env, times = env_computer.compute(audio, sr)
    segmenter = get_segmenter("peakdetect", envelope_method=env_method)
    segments = segmenter.segment(audio=audio, sr=sr)
    results[name] = (env, times, segments)

fig = plot_multiple_envelope_segmentations(audio, sr, results)

Peak detection with three envelope methods

The same audio segmented by peakdetect using three different envelope methods. Each panel shows how the chosen envelope shape influences where boundaries fall.
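The idea behind envelope peak detection can be sketched in a few lines: treat local maxima of the envelope as syllable nuclei and place boundaries at the envelope minimum between neighbouring nuclei. The code below is a simplified illustration, not the algorithm findsylls ships; the function name peakdetect_sketch and the min_height threshold are assumptions.

```python
import numpy as np

def peakdetect_sketch(envelope, times, min_height=0.1):
    """Return [(start_s, nucleus_s, end_s), ...] from an amplitude envelope."""
    env = np.asarray(envelope, dtype=float)
    times = np.asarray(times, dtype=float)
    # Interior local maxima that clear the height threshold become nuclei
    is_peak = (env[1:-1] > env[:-2]) & (env[1:-1] >= env[2:]) & (env[1:-1] >= min_height)
    peaks = np.flatnonzero(is_peak) + 1
    if len(peaks) == 0:
        return []
    # Boundaries: first sample, the minimum between each pair of nuclei, last sample
    cuts = [0]
    for a, b in zip(peaks[:-1], peaks[1:]):
        cuts.append(int(a) + int(np.argmin(env[a:b + 1])))
    cuts.append(len(env) - 1)
    return [
        (float(times[cuts[i]]), float(times[p]), float(times[cuts[i + 1]]))
        for i, p in enumerate(peaks)
    ]

# Demo: three Gaussian bumps centred at 0.2 s, 0.5 s and 0.8 s
times = np.arange(101) / 100
envelope = sum(np.exp(-0.5 * ((times - c) / 0.05) ** 2) for c in (0.2, 0.5, 0.8))
syllables = peakdetect_sketch(envelope, times)
for start, nucleus, end in syllables:
    print(f"{start:.2f}  {nucleus:.2f}  {end:.2f}")
# 0.00  0.20  0.35
# 0.35  0.50  0.65
# 0.65  0.80  1.00
```

Because the boundaries come from envelope minima, a smoother envelope (for example Hilbert) and a more rhythm-tuned one (for example Theta or SBS) can place them differently on the same audio.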

Preset segmenters (paper-replication configurations)

Preset classes replicate the exact configurations from published papers. Each carries a REFERENCE attribute and a cite() method — see Preset Citations below.

from findsylls.segmentation.presets import (
    ThetaOscillatorSegmenter,  # Räsänen et al. 2018 — gammatone + oscillator (no GPU)
    SylberSegmenter,           # Cho et al. 2025 — greedy cosine on Sylber HuBERT
    VGHubertMinCutSegmenter,   # Peng et al. 2023 — SSM MinCut on VG-HuBERT
    VGHubertCLSSegmenter,      # Peng & Harwath 2022 — CLS attention on VG-HuBERT
)
from findsylls.audio.utils import load_audio

audio, sr = load_audio("audio.wav")

# Theta oscillator (no model download, paper defaults: f=5, Q=0.5, N=8)
theta = ThetaOscillatorSegmenter()
syllables = theta.segment(audio, sr=sr)

# Sylber (requires findsylls[end2end])
sylber = SylberSegmenter()
syllables = sylber.segment(audio, sr=sr)

# VG-HuBERT MinCut (syllable mode, layer 8; requires findsylls[end2end])
vgh_mincut = VGHubertMinCutSegmenter(mode="syllable")
syllables = vgh_mincut.segment(audio, sr=sr)

# VG-HuBERT CLS attention (word mode, layer 9; requires findsylls[end2end])
vgh_cls = VGHubertCLSSegmenter(mode="word")
words = vgh_cls.segment(audio, sr=sr)

Generic dispatch

from findsylls.audio.utils import load_audio
from findsylls.segmentation import get_segmenter, list_segmenters, list_segmenter_presets

print(list_segmenters())
# ['peakdetect', 'cls_attention', 'mincut', 'greedy_cosine']

print(list_segmenter_presets())
# {'theta_oscillator': ThetaOscillatorSegmenter, 'sylber': SylberSegmenter, ...}

# Load the audio so this snippet runs on its own
audio, sr = load_audio("audio.wav")

segmenter = get_segmenter("mincut")
syllables = segmenter.segment(audio, sr=sr)

Feature Extraction (`findsylls.features`)

Feature extractors implement FeatureExtractor.extract(audio, sr) → np.ndarray (shape: [T, D]).

from findsylls.audio.utils import load_audio
from findsylls.features import MFCCExtractor, MelSpectrogramExtractor, HuBERTExtractor
from findsylls.plotting import plot_multiple_feature_matrices
import numpy as np

audio, sr = load_audio("audio.wav")

mfcc    = MFCCExtractor(n_mfcc=13)
melspec = MelSpectrogramExtractor(n_mels=64)

mfcc_feat = mfcc.extract(audio, sr)
mel_feat  = melspec.extract(audio, sr)

feature_results = {
    "MFCC (13 coeffs)":        (mfcc_feat,  np.linspace(0, len(audio)/sr, mfcc_feat.shape[0])),
    "Mel Spectrogram (64 bins)": (mel_feat, np.linspace(0, len(audio)/sr, mel_feat.shape[0])),
}

fig = plot_multiple_feature_matrices(audio, sr, feature_results)

Feature matrix comparison

MFCC and mel spectrogram feature matrices for the same utterance. Color encodes feature value; brighter = higher activation.

Available extractors: mfcc, melspectrogram, hubert, sylber, vghubert

from findsylls.features import get_extractor

extractor = get_extractor("hubert")          # vanilla HuBERT base (layer 9)
features  = extractor.extract(audio, sr)     # shape: [T, 768]

Embedding (`findsylls.embedding`)

Embedding wraps feature extraction + segmentation + pooling into a single call.

Single file

from findsylls import embed_audio

embeddings, metadata = embed_audio(
    "audio.wav",
    segmentation="peakdetect",
    features="mfcc",
    pooling="mean",                          # mean | max | median | onc
    segmentation_kwargs={"envelope_method": "hilbert"},
    return_metadata=True,
)

print(embeddings.shape)                      # (n_syllables, 13)
print(metadata["num_syllables"])
print(metadata["boundaries"])                # [(start, end), ...]

Corpus

from findsylls import embed_corpus, save_embeddings

results = embed_corpus(
    audio_files=["a.wav", "b.wav", "c.wav"],
    segmentation="peakdetect",
    features="mfcc",
    pooling="mean",
    segmentation_kwargs={"envelope_method": "hilbert"},
    n_jobs=4,
)

save_embeddings(results, "embeddings.npz")

Storage-backed corpus (large datasets)

For datasets that don't fit in RAM, write embeddings directly to disk:

from findsylls.embedding import embed_corpus_to_storage

bundle = embed_corpus_to_storage(
    audio_files=["a.wav", "b.wav", ...],
    output_dir="./embeddings",
    segmentation="peakdetect",
    features="mfcc",
    pooling="mean",
    segmentation_kwargs={"envelope_method": "hilbert"},
)

print(f"Embedded {bundle['num_success']}/{bundle['num_files']} files")
# Writes: ./embeddings/embedding_manifest.csv + ./embeddings/000000_*.npz

Preset-based embedding

from findsylls.embedding import EmbeddingPipeline

pipeline = EmbeddingPipeline(preset="sylber", pooling="mean")
embeddings, metadata = pipeline.embed_audio("audio.wav", return_metadata=True)

Available pooling methods: mean, max, median, onc
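Pooling turns a frame-level feature matrix of shape [T, D] into one vector per syllable. The sketch below illustrates mean, max and median pooling with plain NumPy. It is an assumption-laden illustration rather than the library's code: pool_segments, its arguments, and the nearest-frame fallback for empty spans are all hypothetical, and the onc method is not covered here.

```python
import numpy as np

def pool_segments(features, frame_times, spans, pooling="mean"):
    """Reduce frame features [T, D] to one row per (start_s, end_s) span."""
    reducers = {"mean": np.mean, "max": np.max, "median": np.median}
    reduce = reducers[pooling]
    features = np.asarray(features, dtype=float)
    frame_times = np.asarray(frame_times, dtype=float)
    rows = []
    for start, end in spans:
        mask = (frame_times >= start) & (frame_times < end)
        if not mask.any():
            # Span shorter than one hop: fall back to the frame nearest its midpoint
            mask = np.zeros(len(frame_times), dtype=bool)
            mask[np.argmin(np.abs(frame_times - (start + end) / 2))] = True
        rows.append(reduce(features[mask], axis=0))
    return np.vstack(rows) if rows else np.empty((0, features.shape[1]))

# Demo: 10 frames of 2-D features, one frame every 100 ms
features = np.arange(20, dtype=float).reshape(10, 2)
frame_times = np.arange(10) / 10
spans = [(0.0, 0.25), (0.25, 1.0)]
print(pool_segments(features, frame_times, spans, pooling="mean"))
# [[ 2.  3.]
#  [12. 13.]]
```

Whatever the pooling method, the output has one row per syllable and D columns, which matches the (n_syllables, 13) shape shown for 13 MFCC coefficients above.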

Discovery (`findsylls.discovery`)

Discovery clusters syllable embeddings into unsupervised categories.

from findsylls import embed_corpus, save_embeddings
from findsylls.discovery import DiscoveryPipeline
import numpy as np

# Embed a corpus
results = embed_corpus(audio_files=["a.wav", "b.wav", "c.wav"],
                       segmentation="peakdetect", features="mfcc", pooling="mean",
                       segmentation_kwargs={"envelope_method": "hilbert"})
embeddings = np.vstack([r["embeddings"] for r in results if r.get("success")])

# Cluster
pipeline = DiscoveryPipeline(method="kmeans", model_kwargs={"n_clusters": 50})
result   = pipeline.discover(embeddings)

print(result.labels)                          # cluster assignment per syllable
print(result.fit_metrics["silhouette"])
print(result.fit_metrics["davies_bouldin"])

Streaming clustering (corpus too large for RAM)

from findsylls.embedding import embed_corpus_to_storage
from findsylls.discovery import DiscoveryPipeline

bundle = embed_corpus_to_storage(audio_files=[...], output_dir="./embeddings",
                                  segmentation="peakdetect", features="mfcc", pooling="mean",
                                  segmentation_kwargs={"envelope_method": "hilbert"})

pipeline = DiscoveryPipeline(method="minibatch_kmeans", model_kwargs={"n_clusters": 50})
result   = pipeline.discover_from_storage(manifest_path=bundle["manifest_path"])
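Streaming clustering works because mini-batch k-means only needs one chunk of embeddings in memory at a time. The following is a simplified single-pass sketch in NumPy, assuming batches arrive as [n, D] arrays. It is not the findsylls or scikit-learn implementation; minibatch_kmeans and its first-rows initialisation are illustrative assumptions.

```python
import numpy as np

def minibatch_kmeans(batches, n_clusters):
    """Single pass over an iterable of [n, D] arrays; returns [n_clusters, D] centres."""
    centers, counts = None, None
    for batch in batches:
        batch = np.asarray(batch, dtype=np.float64)
        if centers is None:
            if len(batch) < n_clusters:
                raise ValueError("first batch must contain at least n_clusters rows")
            # Simplistic initialisation (an assumption): the first n_clusters rows
            centers = batch[:n_clusters].copy()
            counts = np.zeros(n_clusters)
        # Squared Euclidean distance from every row to every centre, shape [n, k]
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in np.unique(labels):
            members = batch[labels == k]
            counts[k] += len(members)
            # Running-mean update: each centre is the mean of all points assigned so far
            centers[k] += (len(members) / counts[k]) * (members.mean(axis=0) - centers[k])
    return centers

# Demo: two well-separated blobs streamed in five batches
rng = np.random.default_rng(0)

def batch_stream(n_batches=5, per_blob=100):
    for _ in range(n_batches):
        batch = np.empty((2 * per_blob, 2))
        batch[0::2] = rng.normal(0.0, 0.5, size=(per_blob, 2))   # blob around (0, 0)
        batch[1::2] = rng.normal(10.0, 0.5, size=(per_blob, 2))  # blob around (10, 10)
        yield batch

centers = minibatch_kmeans(batch_stream(), n_clusters=2)
print(np.round(centers, 1))
```

Only the current batch and the centres are held in memory, so peak usage depends on the batch size rather than on the corpus size.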

Memory comparison:

Approach	Approximate RAM (~500K syllables × 768-D)
`embed_corpus` + `vstack` + `KMeans`	~10 GB RAM
`embed_corpus_to_storage` + `discover_from_storage`	~500 MB RAM
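As a rough sanity check on the in-memory figure, the raw embedding matrix alone can be sized with simple arithmetic. This is a back-of-the-envelope sketch, not a measurement: raw_matrix_gb is a hypothetical helper, and the dtype findsylls actually stores is an assumption.

```python
def raw_matrix_gb(n_rows, n_dims, bytes_per_value):
    """Size of a dense n_rows x n_dims matrix in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_dims * bytes_per_value / 1e9

for dtype_name, nbytes in [("float32", 4), ("float64", 8)]:
    print(f"{dtype_name}: {raw_matrix_gb(500_000, 768, nbytes):.2f} GB")
# float32: 1.54 GB
# float64: 3.07 GB
```

The stacked matrix is therefore only part of the in-memory total. np.vstack briefly holds both the per-file arrays and the stacked copy, and clustering adds working memory of its own, which is consistent with an overall figure several times larger than the raw matrix.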

Available methods: kmeans, minibatch_kmeans, agglomerative

Full Corpus Workflow (`findsylls.pipeline`)

FindSyllsOrchestrator and discover_corpus run the entire pipeline — embed, discover, build manifests — in one call:

from findsylls import discover_corpus

result = discover_corpus(
    audio_files="data/**/*.wav",
    output_dir="./output",
    segmentation_method="peakdetect",
    features_method="mfcc",
    pooling_method="mean",
    discovery_method="kmeans",
    segmentation_kwargs={"envelope_method": "hilbert"},
)

print(result["corpus_manifest"])             # joined DataFrame
print(result["discovery_manifest_path"])
print(result["discovery_metrics"])

Or use the class directly:

from findsylls.pipeline.orchestrator import FindSyllsOrchestrator

orch = FindSyllsOrchestrator()

# Single file: segment + embed
embeddings, metadata = orch.segment_and_embed_audio(
    "audio.wav",
    segmentation_method="peakdetect",
    features_method="mfcc",
    pooling_method="mean",
    segmentation_kwargs={"envelope_method": "hilbert"},
)

Evaluation (`findsylls.evaluation`)

Evaluate segmentation against TextGrid annotations

from findsylls import segment_audio, evaluate_segmentation

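# return_envelope=True yields the (syllables, envelope, times) triple unpacked here
syllables, _, _ = segment_audio(
    "audio.wav",
    method="peakdetect",
    segmentation_kwargs={"envelope_method": "hilbert"},
    return_envelope=True,
)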

peaks = [nucleus for _, nucleus, _ in syllables]
spans = [(start, end) for start, _, end in syllables]

metrics = evaluate_segmentation(
    peaks=peaks,
    spans=spans,
    textgrid_path="annotations.TextGrid",
    tiers={"phone": 2, "syllable": 1, "word": 0},
)

# Keys: nuclei, syllable_boundaries, syllable_spans, word_boundaries, word_spans
print(metrics["syllable_boundaries"])
# {'TP': 12, 'Ins': 2, 'Del': 1, 'Sub': 0, 'Precision': ..., 'Recall': ..., 'F1': ...}
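For readers unfamiliar with boundary scoring, the sketch below shows one common way to obtain TP, Ins, Del, Precision, Recall and F1: match each reference boundary to at most one predicted boundary within a time tolerance. The greedy matching rule, the 50 ms tolerance and the function name boundary_scores are assumptions for illustration; findsylls may match differently, and the Sub count is not modelled here.

```python
def boundary_scores(predicted, reference, tolerance=0.05):
    """Greedy one-to-one matching of boundaries (seconds) within a tolerance."""
    predicted = sorted(predicted)
    reference = sorted(reference)
    used = [False] * len(predicted)
    tp = 0
    for ref in reference:
        best, best_dist = None, tolerance
        for i, pred in enumerate(predicted):
            dist = abs(pred - ref)
            # Keep the closest still-unmatched prediction inside the tolerance window
            if not used[i] and dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            used[best] = True
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "TP": tp,
        "Ins": len(predicted) - tp,   # predicted boundaries with no reference match
        "Del": len(reference) - tp,   # reference boundaries that were missed
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }

scores = boundary_scores(
    predicted=[0.02, 0.48, 0.80, 1.49, 1.90],
    reference=[0.00, 0.50, 1.00, 1.50],
)
print(scores["TP"], scores["Ins"], scores["Del"], round(scores["F1"], 3))
# 3 2 1 0.667
```

In this example three of four reference boundaries are found (recall 0.75) and three of five predictions are correct (precision 0.6).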

Batch evaluation over a corpus

from findsylls import run_evaluation

df = run_evaluation(
    textgrid_paths="data/**/*.TextGrid",
    wav_paths="data/**/*.wav",
    tiers={"phone": 2, "syllable": 1, "word": 0},
    method="peakdetect",
    segmentation_kwargs={"envelope_method": "hilbert"},
)

print(df.groupby("method")[["syllable_boundaries_f1", "word_spans_f1"]].mean())

Discovery label metrics

Connect cluster assignments to ground-truth TextGrid labels:

from findsylls.evaluation import (
    attach_textgrid_labels_to_manifest,
    compute_discovery_label_metrics,
)

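# corpus_manifest and file_manifest_df are assumed to be manifest DataFrames from an
# earlier run (for example, result["corpus_manifest"] from discover_corpus)
labeled = attach_textgrid_labels_to_manifest(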
    manifest=corpus_manifest,
    file_manifest=file_manifest_df,
    wav_paths=["a.wav", "b.wav"],
    textgrid_paths=["a.TextGrid", "b.TextGrid"],
    textgrid_tier_index=2,                       # phone tier
)

metrics = compute_discovery_label_metrics(labeled)
print(f"Cluster purity:  {metrics['cluster_purity']:.3f}")
print(f"Label purity:    {metrics['label_purity']:.3f}")
print(f"Normalized MI:   {metrics['label_norm_mutual_info']:.3f}")
print(f"Macro F1:        {metrics['macro_f1']:.3f}")
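The two purity scores are easiest to read side by side. The sketch below uses the textbook definition of purity (the share of items that carry the majority value of their group) in both directions. Whether findsylls computes cluster_purity and label_purity exactly this way is an assumption; the purity helper is hypothetical.

```python
from collections import Counter, defaultdict

def purity(groups, targets):
    """Fraction of items whose target is the majority target within their group."""
    by_group = defaultdict(Counter)
    for group, target in zip(groups, targets):
        by_group[group][target] += 1
    total = sum(sum(counter.values()) for counter in by_group.values())
    if total == 0:
        return 0.0
    return sum(max(counter.values()) for counter in by_group.values()) / total

# Cluster 0 merges labels "a" and "b"; cluster 1 contains only "c"
clusters = [0, 0, 0, 0, 1, 1]
labels = ["a", "a", "b", "b", "c", "c"]

cluster_purity = purity(clusters, labels)  # how homogeneous each cluster is
label_purity = purity(labels, clusters)    # how concentrated each label is in one cluster
print(f"{cluster_purity:.3f} {label_purity:.3f}")
# 0.667 1.000
```

An over-merged clustering like this one scores high label purity but lower cluster purity; over-splitting shows the opposite pattern, which is why both numbers are worth reporting.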

Visualize evaluation results

from findsylls import plot_segmentation_result

# df = output of run_evaluation(), file_id = stem of the audio file
fig, ax = plot_segmentation_result(
    df,
    file_id="SP20_117",
    envelope_fn="sbs",
    syll_tier=1,
    phone_tier=2,
    word_tier=0,
)

Preset System (`findsylls.presets`)

Named presets bundle segmentation + feature + pooling configurations from published papers:

from findsylls import get_preset, resolve_preset, list_presets

print(list_presets())
# ['sylber', 'vg_hubert_cls', 'vg_hubert_mincut']

cfg = get_preset("sylber")
# {'segmentation': 'greedy_cosine', 'features': 'sylber', 'pooling': 'mean', ...}

# Merge a preset with user overrides
cfg = resolve_preset("sylber", pooling="onc")

# Use directly with EmbeddingPipeline
from findsylls.embedding import EmbeddingPipeline
pipeline = EmbeddingPipeline(preset="sylber", pooling="mean")
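Conceptually, resolving a preset is a dictionary merge in which user overrides win. The sketch below shows that pattern under stated assumptions: PRESETS_SKETCH and resolve_preset_sketch are hypothetical stand-ins, and the single registry entry only repeats the three keys shown in the get_preset comment above.

```python
# Hypothetical registry; the real presets may carry more keys than these three
PRESETS_SKETCH = {
    "sylber": {"segmentation": "greedy_cosine", "features": "sylber", "pooling": "mean"},
}

def resolve_preset_sketch(name, **overrides):
    """Return a copy of the named preset with keyword overrides applied on top."""
    if name not in PRESETS_SKETCH:
        raise KeyError(f"unknown preset: {name!r}")
    cfg = dict(PRESETS_SKETCH[name])  # copy so the registry itself is never mutated
    cfg.update(overrides)
    return cfg

cfg = resolve_preset_sketch("sylber", pooling="onc")
print(cfg)
# {'segmentation': 'greedy_cosine', 'features': 'sylber', 'pooling': 'onc'}
```

Copying before updating keeps the named preset reproducible: every caller starts from the same paper configuration, whatever overrides other callers applied.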

CLI

# Segment audio into syllable boundaries
findsylls segment audio.wav --envelope hilbert --method peakdetect --out syllables.json

# Batch evaluation against TextGrid annotations
findsylls evaluate "data/**/*.wav" "data/**/*.TextGrid" \
  --phone-tier 2 --syllable-tier 1 --word-tier 0 \
  --envelope hilbert --method peakdetect \
  --out results.csv --aggregate summary.csv

Methods Reference

Envelope methods

rms · hilbert · lowpass · sbs · theta · cls_attention · greedy_cosine · mincut

Segmentation methods (dispatch strings)

peakdetect · cls_attention · mincut · greedy_cosine

Preset segmenters (paper-replication classes)

ThetaOscillatorSegmenter · SylberSegmenter · VGHubertMinCutSegmenter · VGHubertCLSSegmenter

Feature extractors

mfcc · melspectrogram · hubert · sylber · vghubert

Pooling methods

mean · max · median · onc

Discovery methods

kmeans · minibatch_kmeans · agglomerative

Preset Citations

Every preset segmenter ships with the full citation for its source paper. Access it programmatically without loading any model:

from findsylls.segmentation.presets import list_segmenter_presets

for name, cls in list_segmenter_presets().items():
    print(f"[{name}]")
    print(cls.REFERENCE)
    print()

Or on an instance (useful when you already have the object):

from findsylls.segmentation.presets import ThetaOscillatorSegmenter

seg = ThetaOscillatorSegmenter()
seg.cite()

Theta Oscillator — Räsänen, Doyle & Frank (2018)

Räsänen, O., Doyle, G., & Frank, M. C. (2018). "Pre-linguistic segmentation of speech into syllable-like units." Cognition, 171, 130–150. https://doi.org/10.1016/j.cognition.2017.11.003

MATLAB implementation: https://github.com/orasanen/thetaOscillator

Sylber — Cho et al. (2025)

Cho, C. J., Lee, N., Gupta, A., Agarwal, D., Chen, E., Black, A. W., & Anumanchipalli, G. K. (2025). "Sylber: Syllabic Embedding Representation of Speech from Raw Audio." ICLR 2025. https://arxiv.org/abs/2410.07168

VG-HuBERT MinCut — Peng et al. (2023)

Peng, P., Li, S.-W., Räsänen, O., Mohamed, A., & Harwath, D. (2023). "Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model." Interspeech 2023. https://doi.org/10.21437/Interspeech.2023-1430

Code: https://github.com/jasonppy/syllable-discovery

VG-HuBERT CLS Attention — Peng & Harwath (2022)

Peng, P., & Harwath, D. (2022). "Word Discovery in Visually Grounded, Self-Supervised Speech Models." Interspeech 2022. https://doi.org/10.21437/Interspeech.2022-10631

Code: https://github.com/jasonppy/word-discovery

Citation

@misc{martinez2026findsyllslanguageagnostictoolkitsyllablelevel,
  title={findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding},
  author={Héctor Javier Vázquez Martínez},
  year={2026},
  eprint={2603.26292},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.26292},
}

License

MIT. See LICENSE.

Release history

Version	Release date
3.2.1	Jun 15, 2026
3.2.0	Jun 4, 2026
3.1.1	May 30, 2026
3.1.0	May 29, 2026
3.0.2 (this version)	May 28, 2026
3.0.1	May 4, 2026
3.0.0	May 4, 2026
2.0.0	Mar 30, 2026
1.0.3	Mar 30, 2026
1.0.1	Dec 17, 2025
1.0.0	Dec 17, 2025
0.2.0	Dec 1, 2025
0.1.1	Sep 23, 2025

Download files

Source Distribution

findsylls-3.0.2.tar.gz (500.5 kB view details)

Uploaded May 28, 2026 Source

Built Distribution

findsylls-3.0.2-py3-none-any.whl (145.6 kB view details)

Uploaded May 28, 2026 Python 3

File details

Details for the file findsylls-3.0.2.tar.gz.

File metadata

Download URL: findsylls-3.0.2.tar.gz
Upload date: May 28, 2026
Size: 500.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for findsylls-3.0.2.tar.gz
Algorithm	Hash digest
SHA256	`b2d8fb7312a1ecf5a249cfc7a5bf28d0d5610dbaae3a3e4cd1967844b2f710f1`
MD5	`270e881845f793cd3b8bd03dfa0d1c82`
BLAKE2b-256	`8c7bf43e75ff02426aeaf14c7dda915267c9bc93d2e532446700bcf7cd0ef91a`

File details

Details for the file findsylls-3.0.2-py3-none-any.whl.

File metadata

Download URL: findsylls-3.0.2-py3-none-any.whl
Upload date: May 28, 2026
Size: 145.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for findsylls-3.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5feff6e376eff596233e73133831a395c91c0e0bdb9e5acb3249f173129a39c7`
MD5	`4601a7947a41ff16589bd46317b6bea8`
BLAKE2b-256	`89fdd6577f40766f08eb0748e2a012f68cbd102738509a575917bebf4de370a1`
