DeepTalk Active Speaker Detection

DeepTalk-ASD

LR-ASD is a state-of-the-art Active Speaker Detection (ASD) model. Building on LR-ASD, this project provides an industrial-grade ASD system with full-pipeline ONNX conversion, integrated speaker verification, and real-time processing support. It runs efficiently on CPUs and does not require a heavyweight PyTorch/GPU environment.

DeepTalk-ASD fuses audio features, video features, and speaker embeddings to determine, in real time, which face in a video frame is speaking.

Key Features

  • Multimodal Fusion: Deeply integrates face detection, voice activity detection (VAD), audio-visual ASD models, and Speaker Embeddings.
  • Speaker Verification Enhancement (New):
    • Feature Fusion: Weighted fusion of ASD probability and audio speaker similarity to effectively suppress off-screen voice interference.
    • Fast Matching: Tracks whose speaker embedding is already known are resolved through a fast path that skips the more expensive computation.
    • Automatic Profile Update: Automatically updates speaker embedding profiles for detected speakers using EMA (Exponential Moving Average).
  • Modular Design:
    • FaceDetector: Provides face detection and tracking (currently supports InspireFace).
    • TurnDetector: Audio VAD and turn management (integrates Silero VAD).
    • SpeakerDetector: Core decision layer performing audio-visual feature extraction, fusion decision, and speaker comparison (based on LR-ASD ONNX).
  • High Performance: Full-pipeline ONNX inference optimized for CPU, supporting real-time operation on mobile devices or standard laptops.
  • Auto Model Management: Models are automatically downloaded on first use; supports offline mode and custom cache directories.
  • Multiple Scenarios: Provides demos for real-time camera, video files, speaker embedding extraction, and pVAD.
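
The Feature Fusion and Automatic Profile Update items above can be illustrated with a small sketch. This is not the project's implementation: the function names, the fusion weight, the mapping of cosine similarity into [0, 1], and the EMA factor alpha are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def fuse_scores(asd_prob, speaker_sim, weight=0.5):
    """Weighted fusion of an ASD probability and a speaker similarity.

    `weight` is a hypothetical parameter: 0.0 uses only the ASD
    probability, 1.0 uses only the speaker similarity.
    """
    # Map cosine similarity from [-1, 1] into [0, 1] so both terms
    # live on the same scale (an assumption of this sketch).
    sim01 = (max(-1.0, min(1.0, speaker_sim)) + 1.0) / 2.0
    return (1.0 - weight) * asd_prob + weight * sim01


def ema_update(profile, embedding, alpha=0.1):
    """Exponential moving average update of a stored speaker profile.

    `alpha` (hypothetical) controls how fast the profile follows new
    embeddings; smaller values change the profile more slowly.
    """
    return [(1.0 - alpha) * p + alpha * e for p, e in zip(profile, embedding)]
```

With this kind of fusion, a face that looks active but whose track profile does not match the current voice gets a lower fused score, which is how an off-screen voice can be suppressed.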

System Architecture

The system works through three main sub-components:

  1. Face Detection (FaceDetector): Locates and tracks faces (Tracks) in each video frame.
  2. Turn Detection (TurnDetector): Determines if the current audio stream contains speech segments and manages their lifecycle (START, CONTINUE, END).
  3. Speaker Detection (SpeakerDetector):
    • Extracts audio MFCC features and 112x112 grayscale mouth images.
    • Uses Sherpa-ONNX to extract speaker embeddings (Voiceprint).
    • Calculates raw ASD scores through a three-stage ONNX model (Audio/Visual Frontend + AV Backend).
    • Dynamically fuses the ASD scores and the speaker matching scores into the final scores.
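
The turn lifecycle (START, CONTINUE, END) from step 2 can be sketched as a small state machine driven by per-chunk VAD decisions. This is a minimal illustration rather than the TurnDetector itself: the class name, the `update` method, and the `max_silence_chunks` hangover parameter are hypothetical.

```python
from enum import Enum


class TurnEvent(Enum):
    START = "START"
    CONTINUE = "CONTINUE"
    END = "END"


class TurnTracker:
    """Turns a stream of per-chunk speech flags into turn events."""

    def __init__(self, max_silence_chunks=3):
        # Number of consecutive silent chunks that closes a turn
        # (hypothetical parameter; short pauses do not end the turn).
        self.max_silence_chunks = max_silence_chunks
        self.in_turn = False
        self.silence_run = 0

    def update(self, is_speech):
        """Return a TurnEvent, or None while no turn is active."""
        if is_speech:
            self.silence_run = 0
            if not self.in_turn:
                self.in_turn = True
                return TurnEvent.START
            return TurnEvent.CONTINUE
        if not self.in_turn:
            return None
        self.silence_run += 1
        if self.silence_run >= self.max_silence_chunks:
            self.in_turn = False
            self.silence_run = 0
            return TurnEvent.END
        return TurnEvent.CONTINUE
```

In a pipeline like the one described above, the speech flag would come from the VAD for each audio chunk, and speaker detection would only need to run between a START and the matching END.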

Quick Start

1. Install

pip install deeptalk_asd

2. Zero-Configuration Usage

Models are automatically downloaded on first use (~46 MB total). No manual setup required:

from deeptalk_asd import ASDDetectorFactory

asd = ASDDetectorFactory().create()

3. Running Demos

Core ASD Demos

  • Real-time Camera Demo (Recommended first choice):
    python3 demo/realtime_asd_demo.py
    
  • Offline Video File Processing:
    python3 demo/video_asd_demo.py --input demo/demo.mp4 --display
    

Model Management

Auto-Download (Default)

Models are downloaded to ~/.cache/deeptalk_asd/ on first use. Subsequent runs load them directly from the cache without downloading again.

Model                               Size     Description
audio_frontend.onnx                 0.9 MB   LR-ASD Audio Frontend
visual_frontend.onnx                1.5 MB   LR-ASD Visual Frontend
av_backend.onnx                     0.8 MB   LR-ASD AV Backend
silero_vad.onnx                     2.2 MB   Silero VAD
Pikachu                             16 MB    InspireFace Detection Resources
wespeaker_zh_cnceleb_resnet34.onnx  25 MB    Speaker Embedding (WeSpeaker)
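
Before switching to offline mode, it can help to confirm that every entry in the table is present. The sketch below assumes the models sit directly under the cache directory with the names listed above; the actual on-disk layout is not documented here, so treat that as an assumption. `missing_models` is a hypothetical helper, not part of the package (the `info` command shown below is the supported way to check the cache).

```python
from pathlib import Path

# Names taken from the model table; a flat layout is an assumption.
EXPECTED_MODELS = [
    "audio_frontend.onnx",
    "visual_frontend.onnx",
    "av_backend.onnx",
    "silero_vad.onnx",
    "Pikachu",
    "wespeaker_zh_cnceleb_resnet34.onnx",
]


def missing_models(cache_dir="~/.cache/deeptalk_asd/"):
    """Return the expected model names not found under cache_dir."""
    root = Path(cache_dir).expanduser()
    return [name for name in EXPECTED_MODELS if not (root / name).exists()]
```

An empty list means every expected name was found at the assumed location.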

Pre-Download Models (for offline environments)

# Download all models
python3 -m deeptalk_asd download-models

# Download to a specific directory
python3 -m deeptalk_asd download-models --cache-dir /path/to/models

# Check cache status
python3 -m deeptalk_asd info

Offline Mode

Set DEEPTALK_ASD_OFFLINE=1 to disable all network requests. Models must be pre-downloaded:

export DEEPTALK_ASD_OFFLINE=1
export DEEPTALK_ASD_CACHE_DIR=/path/to/models

Environment Variables

Variable                 Default                  Description
DEEPTALK_ASD_OFFLINE     (unset)                  Set to 1 to enable offline mode
DEEPTALK_ASD_CACHE_DIR   ~/.cache/deeptalk_asd/   Custom model cache directory
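
As an illustration of how these two variables could be interpreted, here is a minimal sketch. It mirrors the table (offline only when the value is exactly 1, default cache directory otherwise) but is not the package's own resolution logic; `resolve_settings` is a hypothetical name.

```python
import os
from pathlib import Path

DEFAULT_CACHE_DIR = "~/.cache/deeptalk_asd/"


def resolve_settings(env=None):
    """Return (offline, cache_dir) from a mapping of environment variables.

    `env` defaults to os.environ; passing a dict makes the function
    easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    offline = env.get("DEEPTALK_ASD_OFFLINE") == "1"
    cache_dir = Path(env.get("DEEPTALK_ASD_CACHE_DIR") or DEFAULT_CACHE_DIR)
    return offline, cache_dir.expanduser()
```

Calling `resolve_settings()` with no arguments reads the current process environment, which is useful for logging the effective settings at startup.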

Configuration

Each component and its parameters can be configured through the factory:

from deeptalk_asd import ASDDetectorFactory

# Zero-configuration (recommended)
asd = ASDDetectorFactory().create()

# Custom configuration
config = {
    "face_detector": {
        "type": "inspireface",
        "model_dir": "weights"
    },
    "turn_detector": {
        "type": "silero-vad", 
        "model_dir": "weights"
    },
    "speaker_detector": {
        "type": "LR-ASD-ONNX", 
        "model_dir": "weights",
        "voiceprint_model_name": "wespeaker_zh_cnceleb_resnet34.onnx"
    }
}

asd = ASDDetectorFactory(**config).create()
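
Since all three components above point at the same model_dir, the dictionary can be generated from a single path. The helper below is a hypothetical convenience, not part of deeptalk_asd; it only rebuilds the configuration keys shown above.

```python
def build_config(model_dir="weights",
                 voiceprint_model_name="wespeaker_zh_cnceleb_resnet34.onnx"):
    """Build the three-component config with one shared model directory."""
    return {
        "face_detector": {"type": "inspireface", "model_dir": model_dir},
        "turn_detector": {"type": "silero-vad", "model_dir": model_dir},
        "speaker_detector": {
            "type": "LR-ASD-ONNX",
            "model_dir": model_dir,
            "voiceprint_model_name": voiceprint_model_name,
        },
    }


# Usage, following the factory call shown above:
#   asd = ASDDetectorFactory(**build_config("/path/to/models")).create()
```

Keeping the directory in one place avoids the three entries drifting apart when models are moved.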

License

The code in this project is licensed under the MIT License. However, the integrated pre-trained models are subject to their respective licenses:

  1. InspireFace: Code is open-source, but weight files are typically restricted to non-commercial research. Please check official terms for commercial use.
  2. Silero VAD: Licensed under the MIT License.
  3. LR-ASD: Licensed under the MIT License.
  4. Sherpa-ONNX: Licensed under the Apache-2.0 License.

Note: When releasing or commercializing applications based on this project, you must verify the license compliance for each model's weights.
