# findsylls

Language-agnostic toolkit for unsupervised syllable segmentation, evaluation, and syllable-level embedding extraction from speech audio.
findsylls provides:
- Envelope computation from waveform (RMS, Hilbert, low-pass, SBS, gammatone, theta)
- Syllable segmentation (peak/valley and neural options)
- Evaluation against TextGrid annotations (nuclei, boundaries, spans)
- Per-syllable embedding extraction for downstream tasks
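The envelope stage reduces the waveform to a smooth amplitude curve whose peaks tend to align with syllable nuclei. As a rough illustration of the general technique behind the `hilbert` option (a minimal sketch, not findsylls' own implementation, which may differ in parameters):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert, butter, filtfilt

# Load a mono waveform (illustrative; findsylls handles loading internally)
sr, x = wavfile.read("example.wav")
x = x.astype(np.float64)
if x.ndim > 1:
    x = x.mean(axis=1)  # collapse to mono

# Magnitude of the analytic signal gives the raw amplitude envelope
env = np.abs(hilbert(x))

# Smooth with a ~10 Hz low-pass so only syllable-rate modulation remains
b, a = butter(2, 10 / (sr / 2), btype="low")
env_smooth = filtfilt(b, a, env)
```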
## Install

```bash
# Core package
pip install findsylls

# Optional extras
pip install 'findsylls[viz]'        # plotting helpers
pip install 'findsylls[embedding]'  # neural feature extraction
pip install 'findsylls[end2end]'    # neural segmentation methods
pip install 'findsylls[storage]'    # HDF5 storage support
pip install 'findsylls[all]'        # all extras
```
## Quick Start

### 1) Segment a file into syllables

```python
from findsylls import segment_audio

sylls, envelope, times = segment_audio(
    "example.wav",
    envelope_fn="sbs",
    segment_fn="peakdetect",
)

print(f"Found {len(sylls)} syllables")
# sylls: [(start, peak, end), ...]
```
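Each syllable is a `(start, peak, end)` tuple in seconds, so simple prosodic statistics fall out directly. A small sketch, assuming only the documented tuple format:

```python
import numpy as np

# sylls from segment_audio above: [(start, peak, end), ...] in seconds
starts, peaks, ends = map(np.array, zip(*sylls))
durations = ends - starts

print(f"Mean syllable duration: {durations.mean() * 1000:.1f} ms")
print(f"Approx. syllabic rate: {len(sylls) / (ends[-1] - starts[0]):.2f} syll/s")
```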
### 2) Evaluate against TextGrid annotations

```python
from findsylls import run_evaluation, aggregate_results

results = run_evaluation(
    textgrid_paths="data/**/*.TextGrid",
    wav_paths="data/**/*.wav",
    phone_tier=1,
    syllable_tier=2,
    word_tier=3,
    envelope_fn="hilbert",
)

summary = aggregate_results(results, dataset_name="MyCorpus")
print(summary)
```
### 3) Extract syllable embeddings

```python
from findsylls import embed_audio

embeddings, metadata = embed_audio(
    "example.wav",
    segmentation="peakdetect",
    features="mfcc",   # mfcc | melspec | sylber | vg_hubert
    pooling="mean",    # mean | onc | max | median
)

print(embeddings.shape)
print(metadata["num_syllables"])
```
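With one pooled vector per syllable, downstream similarity analyses are straightforward. For example, a minimal sketch (assuming `embeddings` is a `(num_syllables, dim)` NumPy array, as the `.shape` call above suggests) measuring cosine similarity between adjacent syllables:

```python
import numpy as np

# L2-normalize each syllable vector, then dot neighbouring rows
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / np.clip(norms, 1e-12, None)
adjacent_sim = (unit[:-1] * unit[1:]).sum(axis=1)

print(f"Mean adjacent-syllable cosine similarity: {adjacent_sim.mean():.3f}")
```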
### 4) Batch embedding extraction

```python
from findsylls import embed_corpus, save_embeddings

results = embed_corpus(
    audio_paths=["a.wav", "b.wav", "c.wav"],
    segmentation="peakdetect",
    features="mfcc",
    pooling="mean",
    n_jobs=4,
)

save_embeddings(results, "embeddings.npz")
```
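The `.npz` container can be inspected with NumPy alone. The exact key layout written by `save_embeddings()` is not documented here, so listing the keys first is the safe move (a sketch, not the documented schema):

```python
import numpy as np

data = np.load("embeddings.npz", allow_pickle=True)
print(data.files)  # inspect which arrays were actually written

for key in data.files:
    print(key, data[key].shape, data[key].dtype)
```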
## CLI

```bash
# Segment audio
findsylls segment input.wav --envelope sbs --method peakdetect --out sylls.json

# Extract embeddings
findsylls embed input.wav --features mfcc --pooling mean --out embeddings.npz

# Evaluate against TextGrid annotations
findsylls evaluate "data/**/*.wav" "data/**/*.TextGrid" \
    --phone-tier 1 --syllable-tier 2 --word-tier 3 \
    --envelope hilbert --out results.csv
```
## Methods Overview

### Envelope Methods

- `rms`
- `hilbert`
- `lowpass`
- `sbs`
- `gammatone`
- `theta`
- Feature-based envelopes (e.g., SSM / GreedyCosine / CLS-attention where available)

### Segmentation Methods

- `peakdetect`
- `cls_attention`
- `sylber`
- `greedy_cosine`
- `vg_hubert_mincut` (aliases: `vg_hubert`, `vg_hubert_ssm`, `featssm`)
- `vg_hubert_cls`
- `syllablelm`

Backward-compatibility aliases are still accepted by the dispatcher, but the canonical names above are what the notebook and API docs should use.

### Embedding Features

- `mfcc` (13/26/39 dims with deltas)
- `melspec` (mel filterbank)
- `sylber`
- `vghubert` (also accepted as `vg-hubert` or `vg_hubert`)
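As context for the `mfcc` dimensionalities above: 13 base coefficients become 26 with first-order deltas and 39 with delta-deltas appended. A minimal sketch of that construction using librosa (illustrative only; findsylls' own extractor may use different parameters):

```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames)
d1 = librosa.feature.delta(mfcc)                    # first-order deltas
d2 = librosa.feature.delta(mfcc, order=2)           # delta-deltas

feats39 = np.vstack([mfcc, d1, d2])                 # (39, frames)
print(feats39.shape)
```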
## Examples and Notebook

- Interactive demo notebook: `findsylls_demo.ipynb`
- Example scripts: `examples/`
- Streaming workflow tutorial: `notebooks/streaming_workflows.ipynb` (coming soon)
## Corpus-Scale Workflows

For large corpora, findsylls supports storage-backed embedding extraction and streaming clustering to avoid loading all embeddings into memory.

### Storage-First Embedding Extraction

Use `embed_corpus_to_storage()` to write embeddings directly to disk per file, with a manifest CSV for indexing:

```python
from findsylls import embed_corpus_to_storage

info = embed_corpus_to_storage(
    audio_files=['a.wav', 'b.wav', 'c.wav', ...],
    output_dir='./embeddings',
    segmentation='peakdetect',
    features='mfcc',
    pooling='mean',
)

print(f"Embedded {info['num_success']}/{info['num_files']} files")
# Output: ./embeddings/embedding_manifest.csv + ./embeddings/000000*.npz
```
The manifest CSV contains:

- `file_id`: File index
- `audio_path`: Original audio file path
- `embedding_path`: Path to the `.npz` file with embeddings
- `num_rows`: Number of syllables
- `embedding_dim`: Embedding dimensionality
- `success`: Whether embedding succeeded
- `error`: Error message if failed
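Because the manifest is plain CSV with the columns above, it can be consumed with the standard library alone. A sketch (the serialization of the `success` column and the per-file `.npz` key layout are assumptions, so the code only lists the keys):

```python
import csv
import numpy as np

with open('./embeddings/embedding_manifest.csv', newline='') as f:
    for row in csv.DictReader(f):
        # Assumption: success is serialized as a truthy string
        if row['success'] not in ('True', 'true', '1'):
            print(f"Skipping {row['audio_path']}: {row['error']}")
            continue
        data = np.load(row['embedding_path'], allow_pickle=True)
        print(row['file_id'], row['num_rows'], 'syllables ->', data.files)
```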
### Streaming Clustering Discovery

Cluster large embedding sets with MiniBatchKMeans without loading them all into memory:

```python
from findsylls import DiscoveryPipeline
from findsylls.embedding.storage import load_embedding_manifest  # direct manifest access, if needed

# Manifest from storage-backed embedding extraction
manifest_path = './embeddings/embedding_manifest.csv'

pipeline = DiscoveryPipeline(
    discovery_method='minibatch_kmeans',
    n_clusters=50,
)

# Fit and predict in chunks (default: 10K embeddings per chunk)
labels_by_file = pipeline.discover_from_storage(
    manifest_path=manifest_path,
    chunk_size=10000,
)

print("Discovered clusters across all files")
for file_id, labels in labels_by_file.items():
    print(f"  File {file_id}: {len(labels)} syllables")
```
Memory Profile (500K syllables × 768-dim embeddings):

- In-memory clustering (`embed_corpus()` + `vstack()` + KMeans): ~10 GB
- Streaming clustering (`embed_corpus_to_storage()` + `discover_from_storage()` with MiniBatchKMeans): ~500 MB
This makes corpus-scale analysis practical on commodity hardware.
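The streaming figure follows from the underlying technique: a MiniBatchKMeans-style model only ever sees one chunk at a time. A generic sketch of that pattern using the standard scikit-learn API (this illustrates the principle, not findsylls' internals; `iter_chunks` is a hypothetical stand-in for reading embedding chunks off disk):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

def iter_chunks(n_chunks=50, chunk_size=10_000, dim=768):
    """Hypothetical chunk reader (synthetic data here for illustration)."""
    for _ in range(n_chunks):
        yield rng.standard_normal((chunk_size, dim)).astype(np.float32)

km = MiniBatchKMeans(n_clusters=50, random_state=0)

# Pass 1: incremental fit -- peak memory is one chunk, never the full corpus
for chunk in iter_chunks():
    km.partial_fit(chunk)

# Pass 2: assign cluster labels, again chunk by chunk
labels = np.concatenate([km.predict(chunk) for chunk in iter_chunks()])
print(labels.shape)  # (500000,)
```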
## Evaluation & Metrics

### Intrinsic Clustering Metrics (No Ground Truth Required)
When you run discovery, findsylls automatically computes:
- Silhouette Score (-1 to +1, higher is better): Measures how close samples are to their cluster vs other clusters
- Davies-Bouldin Index (lower is better): Ratio of within-cluster to between-cluster distances
- Calinski-Harabasz Index (higher is better): Ratio of between-cluster to within-cluster dispersion
Example:

```python
from findsylls import DiscoveryPipeline

pipeline = DiscoveryPipeline(method='kmeans', n_clusters=50)
result = pipeline.discover(embeddings)

print(f"Silhouette: {result.fit_metrics['silhouette']:.3f}")
print(f"Davies-Bouldin: {result.fit_metrics['davies_bouldin']:.3f}")
print(f"Calinski-Harabasz: {result.fit_metrics['calinski_harabasz']:.1f}")
```
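These three scores have standard scikit-learn implementations, so you can cross-check any clustering directly. A sketch assuming `embeddings` is a 2-D array and `labels` its cluster assignments (whether findsylls computes the scores via scikit-learn is an assumption, but the definitions match):

```python
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

print(f"Silhouette:        {silhouette_score(embeddings, labels):.3f}")
print(f"Davies-Bouldin:    {davies_bouldin_score(embeddings, labels):.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(embeddings, labels):.1f}")
```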
### Evaluating Against TextGrid Annotations

Compare segmentation output against manual annotations:

```python
from findsylls import evaluate_segmentation

# Evaluate segmentation at multiple granularities
eval_result = evaluate_segmentation(
    peaks=[0.15, 0.35, ...],              # syllable nuclei in seconds
    spans=[(0.1, 0.2), (0.3, 0.4), ...],  # syllable boundaries
    textgrid_path="annotations.TextGrid",
    tiers={'phone': 2, 'syllable': 1, 'word': 0},  # TextGrid tier indices
)

# Reported metrics include:
# - nuclei_f1: align detected nuclei with vowel intervals
# - syllable_boundaries_f1: align boundaries with syllable tiers
# - word_spans_f1: align spans with word-level boundaries
```
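Boundary-style F1 scores of this kind are conventionally computed by matching each detected time point to at most one reference point within a tolerance window. A generic sketch of that logic (the 50 ms tolerance and greedy one-to-one matching are assumptions, not findsylls' documented procedure):

```python
def boundary_f1(detected, reference, tol=0.05):
    """Greedy one-to-one matching of time points within +/- tol seconds."""
    ref_free = list(reference)
    hits = 0
    for t in sorted(detected):
        match = next((r for r in ref_free if abs(r - t) <= tol), None)
        if match is not None:
            ref_free.remove(match)  # each reference point matches at most once
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(boundary_f1([0.14, 0.36, 0.80], [0.15, 0.35, 0.55]))  # 2 hits -> 0.667
```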
### Label-Aware Discovery Metrics

Connect discovered clusters to ground-truth labels:

```python
from findsylls.evaluation import (
    attach_textgrid_labels_to_manifest,
    compute_discovery_label_metrics,
)

# Add TextGrid labels to discovery results
labeled_manifest = attach_textgrid_labels_to_manifest(
    manifest=discovery_result_manifest,
    wav_paths=['audio.wav', ...],
    textgrid_paths=['annotations.TextGrid', ...],
    textgrid_tier_index=2,  # phone tier
)

# Compute metrics
metrics = compute_discovery_label_metrics(labeled_manifest)

print(f"Cluster Purity: {metrics['cluster_purity']:.3f}")
print(f"Label Purity: {metrics['label_purity']:.3f}")
print(f"Normalized MI: {metrics['label_norm_mutual_info']:.3f}")
print(f"Macro F1: {metrics['macro_f1']:.3f}")
```
Metrics Glossary:
- Cluster Purity: What fraction of each cluster's members share the most common label (0-1)
- Label Purity: What fraction of each label's instances fall in the most common cluster (0-1)
- Normalized MI: Mutual information between cluster and label assignments, normalized by entropy (0-1)
- Macro F1: Unweighted average F1 across clusters (treating each cluster's dominant label as class)
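Purity and NMI are easy to compute from two parallel label arrays, which is useful for sanity-checking reported values. A sketch using the standard definitions above (not findsylls' internal implementation):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
labels = np.array(['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a'])

# Cluster purity: each cluster is scored by its most common label
purity = sum(
    np.max(np.unique(labels[clusters == c], return_counts=True)[1])
    for c in np.unique(clusters)
) / len(labels)

nmi = normalized_mutual_info_score(labels, clusters)
print(f"Cluster purity: {purity:.3f}, NMI: {nmi:.3f}")  # purity = 6/8 = 0.750
```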
## Common Workflows

1. Corpus discovery with evaluation:
   - Run `embed_corpus_to_storage()` to extract syllable embeddings
   - Run `discover_from_storage()` to cluster them
   - Attach ground truth via `attach_textgrid_labels_to_manifest()`
   - Compute metrics with `compute_discovery_label_metrics()`

2. Comparing segmentation methods:
   - Run `evaluate_segmentation()` on each segmentation method
   - Compare F1 scores across methods

3. Hyperparameter tuning:
   - Extract embeddings with different pooling methods
   - Cluster with varying `n_clusters`
   - Compare intrinsic metrics (Silhouette, Davies-Bouldin) to choose the best settings
## Citation

If you use findsylls in academic work, please cite:

Plain text:

Vázquez Martínez, Héctor Javier. (2026). findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding. arXiv:2603.26292. https://arxiv.org/abs/2603.26292

BibTeX:

```bibtex
@misc{martinez2026findsyllslanguageagnostictoolkitsyllablelevel,
  title={findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding},
  author={Héctor Javier Vázquez Martínez},
  year={2026},
  eprint={2603.26292},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.26292},
}
```
## License

MIT. See LICENSE.