DeepTalk Active Speaker Detection

DeepTalk-ASD

LR-ASD is a state-of-the-art Active Speaker Detection (ASD) model. Building on LR-ASD, this project provides a production-oriented ASD system with full-pipeline ONNX conversion, integrated speaker verification, and real-time processing support. It runs efficiently on CPUs and does not require a large PyTorch/GPU environment.

DeepTalk-ASD is an efficient Active Speaker Detection (ASD) system. By fusing audio features, video features, and speaker embeddings, it determines in real time which face in a video frame is speaking.

Key Features

- Multimodal Fusion: Integrates face detection, voice activity detection (VAD), audio-visual ASD models, and speaker embeddings.
- Speaker Verification Enhancement (New):
  - Feature Fusion: Weighted fusion of the ASD probability and the audio speaker similarity, which suppresses off-screen voice interference.
  - Fast Matching: Speeds up the decision for tracks with known speaker embeddings by skipping the more expensive calculations.
  - Automatic Profile Update: Updates the speaker embedding profiles of detected speakers using an EMA (Exponential Moving Average).
- Modular Design:
  - FaceDetector: Provides face detection and tracking (currently supports InspireFace).
  - TurnDetector: Audio VAD and turn management (integrates Silero VAD).
  - SpeakerDetector: Core decision layer that performs audio-visual feature extraction, the fusion decision, and speaker comparison (based on LR-ASD ONNX).
- High Performance: Full-pipeline ONNX inference optimized for CPU, supporting real-time operation on mobile devices or standard laptops.
- Auto Model Management: Models are downloaded automatically on first use; offline mode and custom cache directories are supported.
- Multiple Scenarios: Provides demos for real-time camera, video files, speaker embedding extraction, and pVAD.
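The automatic profile update mentioned above can be pictured with a small sketch. This is not the project's implementation: the function names, the smoothing factor `alpha`, and the re-normalization step are all assumptions made for illustration. It only shows what an EMA update of a speaker embedding typically looks like.

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def ema_update(profile: np.ndarray, embedding: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """Blend a new embedding into a stored profile.

    alpha is a hypothetical smoothing factor: values close to 1 keep the
    profile stable, lower values adapt faster to the newest embedding.
    """
    updated = alpha * profile + (1.0 - alpha) * embedding
    return l2_normalize(updated)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1] between two embeddings."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))


# Toy 2-D "embeddings"; real speaker embeddings have many more dimensions.
profile = l2_normalize(np.array([1.0, 0.0]))
new_embedding = l2_normalize(np.array([0.0, 1.0]))
profile = ema_update(profile, new_embedding, alpha=0.9)
```

With `alpha=0.9` the profile moves only slightly toward the new embedding, so a single noisy utterance does not overwrite an established speaker profile.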

System Architecture

The system works through three main sub-components:

- Face Detection (FaceDetector): Locates and tracks faces (Tracks) in each video frame.
- Turn Detection (TurnDetector): Determines whether the current audio stream contains speech segments and manages their lifecycle (START, CONTINUE, END).
- Speaker Detection (SpeakerDetector):
  - Extracts audio MFCC features and 112x112 grayscale mouth images.
  - Uses Sherpa-ONNX to extract speaker embeddings (Voiceprint).
  - Calculates raw ASD scores through a three-stage ONNX model (Audio/Visual Frontend + AV Backend).
  - Dynamically fuses the ASD scores and the speaker matching scores into the final scores.
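The final fusion step can be illustrated with a minimal sketch. The README does not give the formula, so everything here is an assumption: the function name `fuse_scores`, the `asd_weight` parameter, the mapping of cosine similarity to the range [0, 1], and the fallback to the ASD score when no speaker profile exists.

```python
from typing import Optional


def fuse_scores(asd_prob: float, speaker_sim: Optional[float], asd_weight: float = 0.7) -> float:
    """Combine an ASD probability with a speaker-similarity score.

    asd_prob:    ASD probability in [0, 1] for one face track.
    speaker_sim: cosine similarity in [-1, 1] between the current audio
                 embedding and the track's stored profile, or None if the
                 track has no profile yet.
    asd_weight:  hypothetical weight given to the ASD probability.
    """
    if speaker_sim is None:
        # No speaker profile available: rely on the audio-visual score alone.
        return asd_prob
    # Clamp, then map cosine similarity from [-1, 1] to [0, 1].
    sim01 = (min(1.0, max(-1.0, speaker_sim)) + 1.0) / 2.0
    return asd_weight * asd_prob + (1.0 - asd_weight) * sim01


matching_voice = fuse_scores(0.8, 1.0)     # voice matches the track's profile
mismatched_voice = fuse_scores(0.8, -1.0)  # voice belongs to someone else
```

Under these assumptions a track whose voice does not match its profile ends up with a lower final score than a matching one, which is the effect the off-screen voice suppression relies on.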

Quick Start

1. Install

```
pip install deeptalk_asd
```

2. Zero-Configuration Usage

Models are automatically downloaded on first use (~46 MB total). No manual setup required:

```
from deeptalk_asd import ASDDetectorFactory

asd = ASDDetectorFactory().create()
```

3. Running Demos

Core ASD Demos

Real-time Camera Demo (Recommended first choice):
```
python3 demo/realtime_asd_demo.py
```

Offline Video File Processing:

```
python3 demo/video_asd_demo.py --input demo/demo.mp4 --display
```

Model Management

Auto-Download (Default)

Models are downloaded to ~/.cache/deeptalk_asd/ on first use. Subsequent runs load them directly from the cache.

| Model | Size | Description |
| --- | --- | --- |
| `audio_frontend.onnx` | 0.9 MB | LR-ASD Audio Frontend |
| `visual_frontend.onnx` | 1.5 MB | LR-ASD Visual Frontend |
| `av_backend.onnx` | 0.8 MB | LR-ASD AV Backend |
| `silero_vad.onnx` | 2.2 MB | Silero VAD |
| `Pikachu` | 16 MB | InspireFace Detection Resources |
| `wespeaker_zh_cnceleb_resnet34.onnx` | 25 MB | Speaker Embedding (WeSpeaker) |

Pre-Download Models (for offline environments)

```
# Download all models
python3 -m deeptalk_asd download-models

# Download to a specific directory
python3 -m deeptalk_asd download-models --cache-dir /path/to/models

# Check cache status
python3 -m deeptalk_asd info
```

Offline Mode

Set DEEPTALK_ASD_OFFLINE=1 to disable all network requests. Models must be pre-downloaded:

```
export DEEPTALK_ASD_OFFLINE=1
export DEEPTALK_ASD_CACHE_DIR=/path/to/models
```

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `DEEPTALK_ASD_OFFLINE` | (unset) | Set to `1` to enable offline mode |
| `DEEPTALK_ASD_CACHE_DIR` | `~/.cache/deeptalk_asd/` | Custom model cache directory |
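Before switching to offline mode it can help to check that the cache is complete. The sketch below is a standalone pre-flight check, not part of the package: the helper names are hypothetical, and it assumes that the models sit directly inside the cache directory under the names listed in the model table, which the README does not confirm.

```python
import os
from pathlib import Path
from typing import List

# File names taken from the model table; the flat layout is an assumption.
EXPECTED_MODELS = [
    "audio_frontend.onnx",
    "visual_frontend.onnx",
    "av_backend.onnx",
    "silero_vad.onnx",
    "Pikachu",
    "wespeaker_zh_cnceleb_resnet34.onnx",
]


def resolve_cache_dir() -> Path:
    """Return the cache directory, honoring DEEPTALK_ASD_CACHE_DIR if set."""
    raw = os.environ.get("DEEPTALK_ASD_CACHE_DIR", "~/.cache/deeptalk_asd/")
    return Path(raw).expanduser()


def offline_mode_enabled() -> bool:
    """True when DEEPTALK_ASD_OFFLINE is set to "1"."""
    return os.environ.get("DEEPTALK_ASD_OFFLINE") == "1"


def missing_models(cache_dir: Path) -> List[str]:
    """List the expected model entries that are not present in cache_dir."""
    return [name for name in EXPECTED_MODELS if not (cache_dir / name).exists()]


if __name__ == "__main__":
    cache_dir = resolve_cache_dir()
    missing = missing_models(cache_dir)
    if offline_mode_enabled() and missing:
        print(f"Offline mode is on, but {cache_dir} is missing: {missing}")
    else:
        print(f"Cache directory: {cache_dir}, missing entries: {missing}")
```

The official way to inspect the cache remains `python3 -m deeptalk_asd info`; this script only mirrors the documented environment variables.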

Configuration

Select each component and set its parameters through the factory:

```
from deeptalk_asd import ASDDetectorFactory

# Zero-configuration (recommended)
asd = ASDDetectorFactory().create()

# Custom configuration
config = {
    "face_detector": {
        "type": "inspireface",
        "model_dir": "weights"
    },
    "turn_detector": {
        "type": "silero-vad",
        "model_dir": "weights"
    },
    "speaker_detector": {
        "type": "LR-ASD-ONNX",
        "model_dir": "weights",
        "voiceprint_model_name": "wespeaker_zh_cnceleb_resnet34.onnx"
    }
}

asd = ASDDetectorFactory(**config).create()
```
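When all three components load their models from the same directory, as in the example above, the dictionary can be generated instead of written by hand. The helper `build_config` below is hypothetical and not part of the package; it only reproduces the keys and values shown in the custom configuration.

```python
def build_config(
    model_dir: str,
    voiceprint_model_name: str = "wespeaker_zh_cnceleb_resnet34.onnx",
) -> dict:
    """Build a factory configuration in which every component shares model_dir."""
    return {
        "face_detector": {"type": "inspireface", "model_dir": model_dir},
        "turn_detector": {"type": "silero-vad", "model_dir": model_dir},
        "speaker_detector": {
            "type": "LR-ASD-ONNX",
            "model_dir": model_dir,
            "voiceprint_model_name": voiceprint_model_name,
        },
    }


config = build_config("weights")
```

The result can then be passed on in the same way as the handwritten dictionary, that is `ASDDetectorFactory(**config).create()`.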

License

The code in this project is licensed under the MIT License. However, the integrated pre-trained models are subject to their respective licenses:

- InspireFace: Code is open-source, but weight files are typically restricted to non-commercial research. Please check the official terms for commercial use.
- Silero VAD: Licensed under the MIT License.
- LR-ASD: Licensed under the MIT License.
- Sherpa-ONNX: Licensed under the Apache-2.0 License.

Note: When releasing or commercializing applications based on this project, you must verify the license compliance for each model's weights.

Release history

- 0.3.1: Mar 20, 2026
- 0.3.0: Mar 19, 2026 (this version)

Download files

- Source Distribution: deeptalk_asd-0.3.0.tar.gz (46.1 kB), uploaded Mar 19, 2026
- Built Distribution: deeptalk_asd-0.3.0-py3-none-any.whl (58.3 kB), uploaded Mar 19, 2026, Python 3

File details

Details for the file deeptalk_asd-0.3.0.tar.gz.

File metadata

- Download URL: deeptalk_asd-0.3.0.tar.gz
- Upload date: Mar 19, 2026
- Size: 46.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for deeptalk_asd-0.3.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `c4f72ac83319bad4e8dfd7729ee73917f744ffb2468a04d00025d4eebd9f9295` |
| MD5 | `a895720b3f09e39499bbf2ed6107fd95` |
| BLAKE2b-256 | `780626d9d179c27886dbdd55075de004767c4363491c54d57fb30c0b42ab6e8c` |

File details

Details for the file deeptalk_asd-0.3.0-py3-none-any.whl.

File metadata

- Download URL: deeptalk_asd-0.3.0-py3-none-any.whl
- Upload date: Mar 19, 2026
- Size: 58.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for deeptalk_asd-0.3.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `e52ecd5f439571794ba8dd9f7392a5712203968518672708402fbe0cc4f3883f` |
| MD5 | `4073353d3c75e28cf200460208544fcb` |
| BLAKE2b-256 | `fe66ffd7942841ec1a785bae28c52a1156818d95ee0be523f552c8c845157eec` |
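A downloaded file can be checked against its published SHA256 digest with the Python standard library. The sketch below uses only `hashlib`; the helper names are illustrative, and the file path in the usage note is assumed to point at a local copy of the download.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("deeptalk_asd-0.3.0.tar.gz", "c4f72ac83319bad4e8dfd7729ee73917f744ffb2468a04d00025d4eebd9f9295")` should return `True` for an intact copy of the source distribution.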
