speech-detect
A Python library for detecting speech segments and non-speech gaps in audio/video files using FSMN-VAD-ONNX with streaming processing.
Features
- Streaming VAD detection: Process large audio/video files in chunks without loading everything into memory
- Speech segment detection: Detect all speech segments in audio/video files
- Non-speech gap derivation: Compute non-speech gaps from speech segments
- Format support: Supports all audio/video formats that FFmpeg supports (MP3, WAV, FLAC, Opus, MP4, etc.)
- Time range support: Optional start time and duration parameters restrict processing to part of a file
- Memory efficient: Constant memory usage regardless of audio file duration
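The gap derivation mentioned above can be illustrated with a short sketch. It is not the library's implementation: the function name `derive_gaps` and the `total_ms` parameter are hypothetical, and it assumes the segments are sorted, and that silence before the first segment and after the last one counts as a gap.

```python
def derive_gaps(segments, total_ms):
    """Return the non-speech gaps around sorted speech segments (illustrative only)."""
    gaps = []
    cursor = 0  # end of the speech seen so far
    for seg in segments:
        if seg["start"] > cursor:
            gaps.append({"start": cursor, "end": seg["start"]})
        cursor = max(cursor, seg["end"])
    # Assumption: trailing silence up to total_ms is also reported as a gap.
    if cursor < total_ms:
        gaps.append({"start": cursor, "end": total_ms})
    return gaps


speech = [{"start": 200, "end": 900}, {"start": 1500, "end": 2000}]
print(derive_gaps(speech, total_ms=2500))
```

With this input the sketch reports three gaps: 0 to 200 ms, 900 to 1500 ms, and 2000 to 2500 ms.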
Installation
pip install speech-detect
Note: This package requires:
- FFmpeg to be installed on your system and available in PATH
- FSMN-VAD-ONNX model files (see Model Setup below)
Model Setup
This package requires FSMN-VAD-ONNX model files. The model is available on Hugging Face:
Model Repository: funasr/fsmn-vad-onnx
Download the Model
- Install Git LFS (required for downloading large model files):
  git lfs install
- Clone the model repository:
  git clone https://huggingface.co/funasr/fsmn-vad-onnx
  This will download the model files, including model_quant.onnx, config.yaml, am.mvn, etc.
- Set the MODEL_FSMN_VAD_DIR environment variable to point to the model directory:
  export MODEL_FSMN_VAD_DIR=/path/to/fsmn-vad-onnx
Alternatively, you can specify the model directory when initializing SpeechDetector:
from speech_detect import SpeechDetector
detector = SpeechDetector(model_dir="/path/to/fsmn-vad-onnx")
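Before constructing the detector it can help to check that the model directory is in place. The sketch below is a hypothetical helper, not part of speech-detect: `resolve_model_dir` and `EXPECTED_FILES` are names invented here, and the file list is only what the download step mentions, so the set the library actually requires may differ.

```python
import os
from pathlib import Path

# File names taken from the download step; the exact set the library
# requires is an assumption.
EXPECTED_FILES = ("model_quant.onnx", "config.yaml", "am.mvn")


def resolve_model_dir(model_dir=None):
    """Hypothetical helper: pick the model directory and list missing files."""
    chosen = model_dir or os.environ.get("MODEL_FSMN_VAD_DIR")
    if not chosen:
        raise ValueError("Pass model_dir or set MODEL_FSMN_VAD_DIR")
    path = Path(chosen)
    missing = [name for name in EXPECTED_FILES if not (path / name).is_file()]
    return path, missing


path, missing = resolve_model_dir("/path/to/fsmn-vad-onnx")
if missing:
    print(f"{path} is missing: {', '.join(missing)}")
```

An explicit argument takes precedence over the environment variable here, mirroring the two configuration options described above.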
Quick Start
Detect Speech Segments and Gaps
from speech_detect import SpeechDetector
# Initialize detector (reads MODEL_FSMN_VAD_DIR from environment)
detector = SpeechDetector()
# Detect speech segments and non-speech gaps in an audio file
speech_segments, gaps = detector.detect("audio.mp3")
# speech_segments is a list of dictionaries: [{"start": 0, "end": 500}, ...]
for segment in speech_segments:
    start_ms = segment["start"]
    end_ms = segment["end"]
    duration = end_ms - start_ms
    print(f"Speech segment: {start_ms}ms - {end_ms}ms (duration: {duration}ms)")
# gaps is a list of dictionaries: [{"start": 0, "end": 500}, ...]
for gap in gaps:
    start_ms = gap["start"]
    end_ms = gap["end"]
    duration = end_ms - start_ms
    print(f"Non-speech gap: {start_ms}ms - {end_ms}ms (duration: {duration}ms)")
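Because both results are plain lists of dictionaries, they can be post-processed with ordinary Python. The following sketch uses sample data in the documented format rather than real detector output; the helper names `long_gaps` and `total_ms` are illustrative and not part of the library.

```python
def long_gaps(gaps, min_ms):
    """Keep only the gaps that last at least min_ms milliseconds."""
    return [g for g in gaps if g["end"] - g["start"] >= min_ms]


def total_ms(segments):
    """Sum the durations of all segments, in milliseconds."""
    return sum(s["end"] - s["start"] for s in segments)


# Sample data in the same shape that detect() is documented to return.
speech_segments = [{"start": 0, "end": 1800}, {"start": 2400, "end": 5000}]
gaps = [{"start": 1800, "end": 2400}, {"start": 5000, "end": 6500}]

print(long_gaps(gaps, min_ms=1000))
print(total_ms(speech_segments))
```

Here only the 1500 ms gap survives the 1000 ms threshold, and the total speech time is 4400 ms.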
Processing Specific Time Range
# Process only the first 30 seconds
speech_segments, gaps = detector.detect(
    file_path="audio.mp3",
    start_ms=0,
    duration_ms=30000,
)
# Process 5 seconds, starting at the 10-second mark
speech_segments, gaps = detector.detect(
    file_path="audio.mp3",
    start_ms=10000,
    duration_ms=5000,
)
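When a range is known as timecodes rather than milliseconds, a small conversion step produces the `start_ms` and `duration_ms` arguments. This is a sketch with hypothetical helper names (`timecode_to_ms`, `range_to_args`); it assumes timecodes of the form `HH:MM:SS` or `HH:MM:SS.mmm`.

```python
def timecode_to_ms(timecode):
    """Convert 'HH:MM:SS' or 'HH:MM:SS.mmm' to integer milliseconds."""
    hours, minutes, seconds = timecode.split(":")
    total_seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return round(total_seconds * 1000)


def range_to_args(start, end):
    """Build the start_ms / duration_ms keyword arguments for a time range."""
    start_ms = timecode_to_ms(start)
    end_ms = timecode_to_ms(end)
    if end_ms <= start_ms:
        raise ValueError("end must be after start")
    return {"start_ms": start_ms, "duration_ms": end_ms - start_ms}


args = range_to_args("00:00:10", "00:00:15")
print(args)
```

The resulting dictionary could then be unpacked into the call, for example `detector.detect(file_path="audio.mp3", **args)`.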
Custom Chunk Size
# Use 1-minute chunks instead of default 20-minute chunks
speech_segments, gaps = detector.detect(
    file_path="audio.mp3",
    chunk_duration_sec=60,
)
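To get a feel for what a chunk size means for a given file, the window arithmetic can be sketched as follows. This only illustrates how a duration divides into chunks; `chunk_windows` is a hypothetical function, and how the library itself reads chunks or handles speech that spans a chunk boundary is not described here.

```python
def chunk_windows(total_ms, chunk_duration_sec=1200):
    """Split [0, total_ms) into consecutive (start_ms, end_ms) windows."""
    if chunk_duration_sec <= 0:
        raise ValueError("chunk_duration_sec must be > 0")
    chunk_ms = chunk_duration_sec * 1000
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# A 2.5-minute file processed in 1-minute chunks.
print(chunk_windows(150_000, chunk_duration_sec=60))
```

A 2.5-minute file yields two full one-minute windows and a final 30-second window; with the default of 1200 seconds, a one-hour file would be covered by three windows.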
API Reference
SpeechDetector
Main class for speech detection. All methods are instance methods.
SpeechDetector.__init__(model_dir=None)
Initialize the speech detector.
Parameters:
- model_dir (str, optional): Path to the FSMN-VAD model directory. If None, the path is read from the MODEL_FSMN_VAD_DIR environment variable.
Note: The FSMN-VAD model only has a quantized version, so quantize=True is always used internally.
Raises:
- VadModelNotFoundError: If model directory is not found or not set
- VadModelInitializationError: If model initialization fails
SpeechDetector.detect(file_path, chunk_duration_sec=None, start_ms=None, duration_ms=None)
Detect speech segments and non-speech gaps in an audio/video file using streaming processing.
Parameters:
- file_path (str): Path to the audio/video file (supports all FFmpeg formats)
- chunk_duration_sec (int, optional): Duration of each chunk in seconds. Defaults to 1200 (20 minutes). Must be > 0 if provided.
- start_ms (int, optional): Start position in milliseconds. None means the beginning of the file. If start_ms is None but duration_ms is provided, it defaults to 0.
- duration_ms (int, optional): Total duration to process in milliseconds. None means process until the end of the file. If specified, processing stops when this duration is reached.
Returns:
tuple[list[VadSegment], list[VadSegment]]: Tuple of (speech_segments, gaps)
- speech_segments: List of speech segments, format: [{"start": ms, "end": ms}, ...]
  - Timestamps are relative to audio start (from 0)
  - Unit: milliseconds
- gaps: List of non-speech gaps, format: [{"start": ms, "end": ms}, ...]
  - Timestamps are relative to audio start (from 0)
  - Unit: milliseconds
Raises:
- VadProcessingError: If processing fails
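When several files are processed in one run, a failure on one file need not abort the rest. The sketch below is one possible pattern, not part of the library: `detect_many` is a hypothetical helper, and `StubDetector` is a stand-in so the example runs without model files. It catches the generic `Exception` only because the import path of `VadProcessingError` is not shown in this document; real code would catch the specific class.

```python
def detect_many(detector, paths, **detect_kwargs):
    """Run detector.detect() on each path; collect results and failures separately."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = detector.detect(file_path=path, **detect_kwargs)
        except Exception as exc:  # narrow this to VadProcessingError in real code
            failures[path] = exc
    return results, failures


class StubDetector:
    """Stand-in for SpeechDetector so the sketch runs without model files."""

    def detect(self, file_path, **kwargs):
        if file_path.endswith(".bad"):
            raise RuntimeError(f"cannot decode {file_path}")
        return [{"start": 0, "end": 500}], [{"start": 500, "end": 800}]


results, failures = detect_many(StubDetector(), ["a.mp3", "b.bad"], chunk_duration_sec=60)
print(sorted(results), sorted(failures))
```

Each entry in `results` holds the `(speech_segments, gaps)` tuple for one file, while `failures` maps the remaining paths to the exception that was raised.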
Data Types
VadSegment
A TypedDict representing a time segment (can be a speech segment or a non-speech gap).
Fields:
- start (int): Segment start time in milliseconds
- end (int): Segment end time in milliseconds
Example:
segment: VadSegment = {"start": 100, "end": 500}
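For code that consumes segments, a local type with the same two fields can be handy for annotations and sanity checks. The class below is a stand-in that mirrors the documented fields, not the library's own definition, and `is_valid_segment` is a hypothetical helper. It assumes a valid segment has a non-negative start and a strictly later end.

```python
from typing import TypedDict


class VadSegment(TypedDict):
    """Local stand-in mirroring the documented fields."""

    start: int  # segment start time in milliseconds
    end: int  # segment end time in milliseconds


def is_valid_segment(segment):
    """Check that both fields are integers and that 0 <= start < end (an assumption)."""
    start, end = segment.get("start"), segment.get("end")
    return isinstance(start, int) and isinstance(end, int) and 0 <= start < end


segment: VadSegment = {"start": 100, "end": 500}
print(is_valid_segment(segment))
```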
Exceptions
VadModelNotFoundError
Raised when VAD model directory is not found or not set.
Attributes:
- message: Human-readable error message
VadModelInitializationError
Raised when VAD model initialization fails.
Attributes:
- message: Primary error message
- model_dir: Path to the model directory that caused the error
VadProcessingError
Raised when VAD processing fails.
Attributes:
- message: Primary error message
- file_path: Path to the file being processed
- details: Additional error details dictionary
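The attributes listed above can be folded into a single log line. The sketch below is illustrative: `describe_error` is a hypothetical helper that reads the attributes with `getattr`, so it works with any exception, and `FakeProcessingError` is a stand-in class that merely imitates the documented attributes.

```python
def describe_error(exc):
    """Format an exception and any documented attributes it carries."""
    parts = [f"{type(exc).__name__}: {getattr(exc, 'message', str(exc))}"]
    for attr in ("model_dir", "file_path", "details"):
        value = getattr(exc, attr, None)
        if value:  # skip attributes that are missing or empty
            parts.append(f"{attr}={value!r}")
    return "; ".join(parts)


class FakeProcessingError(Exception):
    """Stand-in with the same attribute names as VadProcessingError."""

    def __init__(self, message, file_path=None, details=None):
        super().__init__(message)
        self.message = message
        self.file_path = file_path
        self.details = details or {}


print(describe_error(FakeProcessingError("decode failed", file_path="audio.mp3")))
```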
Requirements
- Python >= 3.10
- FFmpeg (must be installed separately)
- numpy >= 1.26.4
- funasr-onnx >= 0.4.1
- ffmpeg-audio >= 0.1.3
License
MIT License