JaVAD: Just Another Voice Activity Detector

JaVAD is a state-of-the-art Voice Activity Detection package, lightweight and fast, built on PyTorch with minimal dependencies. Core functionality (without audio loading) requires only NumPy and PyTorch, with no registration, tokens, or installation of large unnecessary packages. While it is built using sliding windows over mel spectrograms, it supports streaming. You can also export results to RTTM, CSV, or TextGrid.
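
To illustrate what an RTTM export contains, here is a minimal sketch that formats speech intervals as RTTM `SPEAKER` lines. It assumes the intervals are `(start, end)` pairs in seconds; the function name `intervals_to_rttm`, the file id, and the speaker label are hypothetical, and this is not JaVAD's own exporter.

```python
from typing import Iterable, List, Tuple


def intervals_to_rttm(
    intervals: Iterable[Tuple[float, float]],
    file_id: str = "audio",    # hypothetical recording identifier
    speaker: str = "speech",   # hypothetical label; VAD has no speaker identity
) -> List[str]:
    """Format (start, end) pairs in seconds as RTTM SPEAKER lines."""
    lines = []
    for start, end in intervals:
        duration = end - start
        if duration <= 0:
            continue  # skip empty or inverted intervals
        # RTTM fields: type, file id, channel, onset, duration,
        # orthography, speaker type, speaker name, confidence, lookahead
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


rttm = intervals_to_rttm([(0.5, 2.25), (3.0, 4.5)], file_id="clip01")
print("\n".join(rttm))
```

RTTM stores onset and duration rather than start and end, which is why the sketch subtracts the two before formatting.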

There are three models:

  • tiny: 0.64s window, optimal for quickest voice detection
  • balanced: 1.92s window, fastest while small
  • precise: 3.84s window with extra DirectionalAlignment layer and 2 additional transformer layers for best accuracy.

All models use mono audio with a sample rate of 16,000 Hz.

Comparison

  • Evaluation used Google's AVA Speech dataset or, more precisely, only the clips that are still available (74 in total). Since AVA Speech labels only 15 minutes of each clip, 18.5 hours of audio were used in total.
  • For evaluation purposes, JaVAD was trained on a custom, manually labelled dataset using a separate, different collection of YouTube clips. Production models were trained on all available data.
Model                  | Precision | Recall | F1 Score | AUROC  | Time, GPU (Nvidia 3090) | Time, CPU (Ryzen 3900XT)
Nvidia NEMO            | 0.7676    | 0.9526 | 0.8502   | 0.9201 | 26.24s                  | 56.94s
WebRTC (via py-webrtc) | 0.6099    | 0.9454 | 0.7415   | n/a¹   | n/a²                    | 59.85s
SpeechBrain            | 0.8213    | 0.8534 | 0.8370   | 0.8961 | 1371.00s                | 1981.40s
Pyannote               | 0.9173    | 0.8463 | 0.8804   | 0.9495 | 75.49s                  | 823.19s
Silero                 | 0.9678    | 0.6503 | 0.7779   | 0.9169 | 830.27s³                | 695.58s
JaVAD tiny⁴*           | 0.9263    | 0.8846 | 0.9050   | 0.9550 | 22.32s                  | 476.93s
JaVAD balanced*        | 0.9284    | 0.8938 | 0.9108   | 0.9642 | 16.38s                  | 220.00s
JaVAD precise*         | 0.9359    | 0.8980 | 0.9166   | 0.9696 | 18.58s                  | 236.61s

¹ WebRTC does not return logits.
² WebRTC via py-webrtc can be run only on CPU.
³ Silero JIT model is slower on GPU, and ONNX model cannot be run on GPU.
⁴ Tiny model is the slowest of the JaVAD models here due to the smaller window size of 0.64s. It is best suited to immediate speech detection in a streaming pipeline.
* For information about the training dataset, see the text above the table.
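
As a quick sanity check on the table, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes the table uses the standard definition, F1 = 2PR / (P + R); the helper name `f1_score` is hypothetical and is not part of JaVAD.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the table above
rows = {
    "Nvidia NEMO": (0.7676, 0.9526),
    "Pyannote": (0.9173, 0.8463),
    "JaVAD precise": (0.9359, 0.8980),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model with high precision but low recall cannot reach a high F1 score.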

ROCs

[Figure: ROC curves for the compared models]

Installation

Requirements

  • Python 3.8+
  • PyTorch 2.0.0+
  • NumPy 1.20.0+
  • Optional: soundfile for loading audio and simplified processing

Install via pip

pip install javad  # or
pip install "javad[extras]"  # with audio loading (quoted so shells such as zsh do not expand the brackets)

Usage

Basic Usage (if installed with [extras]), single file/CPU:

from javad.extras import get_speech_intervals

intervals = get_speech_intervals("path/to/audio.wav")
print(intervals)

Usage via Processor class, single file/CUDA[if available]:

import torch
from javad import Processor
from javad.extras import load_audio

# Load audio file
audio = load_audio("path/to/audio.wav")

# Initialize Processor with default 'balanced' model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Processor(device=device)
print(processor)

# Process audio
# Get logits (moved to CPU before converting to a NumPy array)
logits = processor.logits(audio).cpu().numpy()
print(logits)
# Get boolean predictions based on threshold
predictions = processor.predict(audio).cpu().numpy()
print(predictions)
# Get speech intervals
intervals = processor.intervals(audio)
print(intervals)
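
The relationship between per-frame scores, thresholded predictions, and speech intervals can be sketched in plain Python. This is only an illustration of the general idea: the function name `scores_to_intervals`, the default threshold of 0.5, and the frame duration are assumptions and do not describe JaVAD's internals.

```python
from typing import List, Sequence, Tuple


def scores_to_intervals(
    scores: Sequence[float],
    threshold: float = 0.5,       # hypothetical default
    frame_seconds: float = 0.01,  # hypothetical frame duration
) -> List[Tuple[float, float]]:
    """Turn per-frame speech scores into (start, end) intervals in seconds."""
    intervals = []
    start = None
    for i, score in enumerate(scores):
        is_speech = score >= threshold
        if is_speech and start is None:
            start = i  # a run of speech frames begins
        elif not is_speech and start is not None:
            intervals.append((start * frame_seconds, i * frame_seconds))
            start = None  # the run ends at the first non-speech frame
    if start is not None:  # close a run that reaches the end of the input
        intervals.append((start * frame_seconds, len(scores) * frame_seconds))
    return intervals


scores = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.6]
print(scores_to_intervals(scores, frame_seconds=0.5))
# [(0.5, 1.5), (2.5, 3.5)]
```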

You can increase accuracy by specifying a smaller step size for the sliding window. A smaller step produces more overlapping windows whose predictions are averaged, which gives a more accurate result at the cost of longer computation.
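
The effect of the step size can be illustrated with a toy sliding-window average. The sketch below is not JaVAD's implementation: `sliding_average` and `toy_scorer` are hypothetical names, and the scorer simply assigns every frame in a window the window's mean value as a stand-in for a real model.

```python
from typing import Callable, List, Sequence


def sliding_average(
    frames: Sequence[float],
    window: int,
    step: int,
    score_window: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Average per-frame scores over all windows that cover each frame."""
    totals = [0.0] * len(frames)
    counts = [0] * len(frames)
    for start in range(0, len(frames) - window + 1, step):
        scores = score_window(frames[start : start + window])
        for offset, score in enumerate(scores):
            totals[start + offset] += score
            counts[start + offset] += 1
    # Frames never covered by any window keep a score of 0.0
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def toy_scorer(window_frames: Sequence[float]) -> List[float]:
    """Stand-in model: every frame gets the mean of its window."""
    mean = sum(window_frames) / len(window_frames)
    return [mean] * len(window_frames)


frames = [0, 0, 1, 1, 1, 1, 0, 0]  # 1 marks speech frames
coarse = sliding_average(frames, window=4, step=4, score_window=toy_scorer)
fine = sliding_average(frames, window=4, step=2, score_window=toy_scorer)
print(coarse)  # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(fine)    # [0.5, 0.5, 0.75, 0.75, 0.75, 0.75, 0.5, 0.5]
```

With a step equal to the window, each frame is scored once and the speech region is indistinguishable from its surroundings. Halving the step adds an overlapping window, so the frames in the middle receive a second, higher score, at the price of evaluating three windows instead of two.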

Stream Processing, stream/MPS[if available]:

import torch
from javad.stream import Pipeline
from javad.extras import load_audio

# Initialize pipeline
pipeline = Pipeline()  # by default, Pipeline uses 'tiny' model
pipeline.to(torch.device("mps" if torch.backends.mps.is_available() else "cpu"))
print(pipeline)

# Load audio file
audio = load_audio("path/to/audio.wav")

# Process audio in chunks
chunk_size = int(pipeline.config.sample_rate * 0.5)  # 0.5-second chunks
for i in range(0, len(audio), chunk_size):
    audio_chunk = audio[i : i + chunk_size]
    predictions = pipeline.intervals(audio_chunk)
    print(predictions)

There are two modes for streaming: instant and gradual. Instant mode returns results only for the current chunk pushed into the pipeline, while gradual mode keeps updating and averaging the predictions for a chunk for as long as that chunk remains within the audio buffer. For example, with a chunk size of 0.25s and the balanced model's window size of 1.92s, a chunk receives about 8 updates (1.92 / 0.25 ≈ 7.7, rounded up).

TL;DR: Use instant mode for the fastest response, or gradual mode for the most accurate results in stream mode.
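
The count of 8 updates in the example can be reproduced with a short calculation. The sketch assumes that the audio buffer is as long as the model window, that each push shifts the buffer by exactly one chunk, and that a chunk still counts while it is only partly inside the buffer; `updates_per_chunk` is a hypothetical helper, not part of JaVAD.

```python
import math


def updates_per_chunk(window_seconds: float, chunk_seconds: float) -> int:
    """Number of pushes during which a chunk is at least partly in the buffer."""
    return math.ceil(window_seconds / chunk_seconds)


print(updates_per_chunk(1.92, 0.25))  # 8 (balanced model window)
print(updates_per_chunk(0.64, 0.25))  # 3 (tiny model window)
```

Under these assumptions, a larger window or a smaller chunk gives each chunk more updates in gradual mode, while instant mode always uses only the first one.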

Instant detection

import torch
from javad.stream import Pipeline
from javad.extras import load_audio

# Initialize pipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = Pipeline(device=device) # by default, Pipeline uses 'tiny' model

# Generate chunk of audio
audio = load_audio("path/to/audio.wav")
chunk_size = int(pipeline.config.sample_rate * 0.5)  # 0.5-second chunks
audio_chunk = audio[:chunk_size]

# Process and detect speech once per stream
bool_prediction = pipeline.detect(audio_chunk)
print(bool_prediction)

# Reset Pipeline for new stream
pipeline.reset()

License

This project is licensed under the MIT License.

Citation

If you use this package in your research, please cite it as follows:

@misc{JaVAD,
  author = {Sergey Skrebnev},
  title = {JaVAD: Just Another Voice Activity Detector},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/skrbnv/javad}},
}

Alternatively, you can use the following BibTeX entry:

@software{JaVAD,
  author = {Sergey Skrebnev},
  title = {JaVAD: Just Another Voice Activity Detector},
  year = {2024},
  url = {https://github.com/skrbnv/javad},
}
