IISY ASR Pipeline
An automated speech recognition (ASR) pipeline with speech enhancement, transcription, and speaker identification capabilities.
Overview
The IISY ASR Pipeline is a comprehensive solution for processing audio input in real-time. It combines multiple processing stages:
- Speech Enhancement - Using DeepFilterNet to improve audio quality
- Speech Transcription - Converting speech to text with Faster Whisper
- Speaker Identification - Identifying speakers using SpeechBrain models
Installation
Requirements
- Python 3.11.10
- CUDA-compatible GPU (recommended for optimal performance)
Installation
You can install the package directly from PyPI:
# For CPU-only installation
pip install iisy-asr-pipeline
# For GPU support (CUDA)
pip install iisy-asr-pipeline[cuda]
For GPU support, you'll need to manually install the CUDA-compatible version of PyTorch first:
# Install CUDA-compatible PyTorch (example for CUDA 11.8)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
# Then install the package with CUDA support
pip install iisy-asr-pipeline[cuda]
You can adjust the CUDA version (cu117, cu118, cu121, etc.) based on your system's requirements.
Usage
Listing Available Audio Devices
Before running the pipeline, you may want to identify the correct audio input device:
python -m iisy.run_pipeline --list-devices
Basic Usage
Run the ASR pipeline with default settings:
python -m iisy.run_pipeline --input-device-index 1
Command Line Options
The pipeline can be customized with various command line arguments:
python -m iisy.run_pipeline [OPTIONS]
Device Settings
- `--device` - Device to run models on (`cuda` or `cpu`; default: `cuda` if available, otherwise `cpu`)
- `--input-device-index` - Input audio device index (default: `1`)
- `--list-devices` - List all available audio devices and exit
Audio Parameters
- `--chunk-size` - Number of audio frames per buffer (default: `2048`)
- `--channels` - Number of audio channels (1 = mono, 2 = stereo; default: `1`)
- `--buffer-size` - Size of the audio buffer (default: `1000`)
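For intuition, the default chunk size combined with the 16 kHz capture rate used later in this README determines the per-buffer latency. A small illustrative calculation (not part of the package):

```python
def chunk_duration_ms(chunk_size: int = 2048, sample_rate: int = 16000) -> float:
    """Duration of one audio buffer in milliseconds."""
    return 1000.0 * chunk_size / sample_rate

print(chunk_duration_ms())  # 128.0 -> each default chunk covers 128 ms of audio
```

Larger chunks reduce per-read overhead but increase latency before the pipeline sees new audio.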
Model Parameters
- `--whisper-model` - Whisper model size (`tiny`, `base`, `small`, `medium`, `large`, `turbo`; default: `medium`)
- `--speaker-model` - Speaker identification model path (default: `speechbrain/spkrec-resnet-voxceleb`)
Silence Detection Parameters
- `--silence-threshold` - Energy threshold for silence detection (default: `0.01`)
- `--min-silence-duration` - Minimum duration of silence for a sentence boundary, in seconds (default: `2.0`)
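These two options interact roughly as follows: each chunk is classified as silent or not by its energy, and enough consecutive silent chunks mark a sentence boundary. A hedged sketch of energy-based silence detection (the function names and exact normalization are illustrative, not the pipeline's internals):

```python
import math
import struct

def is_silence(chunk: bytes, threshold: float = 0.01) -> bool:
    """True if the normalized RMS energy of a 16-bit PCM chunk falls
    below the threshold (mirrors --silence-threshold)."""
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms / 32768.0 < threshold

def silence_duration(n_silent_chunks: int, chunk_size: int = 2048,
                     sample_rate: int = 16000) -> float:
    """Seconds of silence represented by a run of consecutive silent
    chunks (compared against --min-silence-duration)."""
    return n_silent_chunks * chunk_size / sample_rate
```

With the defaults, a run of about 16 silent chunks (~2.05 s) would exceed `--min-silence-duration` and close the current sentence.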
Other Parameters
- `--speaker-threshold` - Threshold for speaker identification (default: `0.55`)
- `--verbose` - Enable verbose logging
Example Commands
Run with a larger Whisper model for better transcription accuracy:
python -m iisy.run_pipeline --whisper-model large
Use a different microphone (device index 2) and enable verbose logging:
python -m iisy.run_pipeline --input-device-index 2 --verbose
Use ECAPA-TDNN model for speaker identification:
python -m iisy.run_pipeline --speaker-model speechbrain/spkrec-ecapa-voxceleb
Advanced Usage
Programmatic Integration
You can integrate the ASR pipeline into your own Python applications:
import threading

import pyaudio
import torch

from iisy.context_window import ContextWindow
from iisy.pipeline.asr_pipeline import AsrPipeline

# Initialize audio capture
p = pyaudio.PyAudio()
in_stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=1,
    frames_per_buffer=2048,
)

# Create audio buffer
audio_buffer = ContextWindow(1000)

# Configure pipeline
pipeline_config = {
    "speaker": {
        "model": "speechbrain/spkrec-resnet-voxceleb",
        "savedir": "spkrec-resnet-voxceleb",
        "speaker_threshold": 0.55,
    },
    "whisper": {
        "model_size": "medium",
        "device_index": 0,
        "compute_type": "float16",
    },
}

# Create pipeline
pipeline = AsrPipeline(
    input_sr=16000,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    min_silence_duration=2.0,
    verbose=True,
    **pipeline_config,
)

# Capture audio on a background thread
def audio_capture():
    while True:
        try:
            audio_data = in_stream.read(2048, exception_on_overflow=False)
            audio_buffer.add(audio_data)
        except Exception as e:
            print(f"Audio capture error: {e}")
            break

capture_thread = threading.Thread(target=audio_capture, daemon=True)
capture_thread.start()

# Run the pipeline; release audio resources on exit
try:
    pipeline.run(audio_buffer)
finally:
    in_stream.stop_stream()
    in_stream.close()
    p.terminate()
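The `ContextWindow(1000)` above acts as a bounded buffer between the capture thread and the pipeline. A minimal sketch of such a thread-safe bounded buffer, illustrative only and not the actual `ContextWindow` implementation:

```python
import threading
from collections import deque

class BoundedAudioBuffer:
    """Sketch of a thread-safe, bounded audio buffer: when full, the
    oldest chunks are dropped so capture never blocks indefinitely."""

    def __init__(self, maxlen: int):
        self._chunks = deque(maxlen=maxlen)  # oldest entries evicted first
        self._lock = threading.Lock()

    def add(self, chunk: bytes) -> None:
        with self._lock:
            self._chunks.append(chunk)

    def drain(self) -> bytes:
        """Return all buffered audio as one bytes object and clear the buffer."""
        with self._lock:
            data = b"".join(self._chunks)
            self._chunks.clear()
            return data
```

Dropping the oldest audio under backpressure trades completeness for liveness, which suits a real-time transcription loop better than an unbounded queue.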
Custom Processing Steps
You can customize each processing step of the pipeline:
from iisy.pipeline.speech_enhancement_step import SpeechEnhancementStep
from iisy.pipeline.speech_transcription_step import SpeechTranscriptionStep
from iisy.pipeline.speaker_identification_step import SpeakerIdentificationStep
# Create custom steps
enhancement_step = SpeechEnhancementStep(...)
transcription_step = SpeechTranscriptionStep(...)
identification_step = SpeakerIdentificationStep(...)
# Create pipeline with custom steps
pipeline = AsrPipeline(
enhancement_step=enhancement_step,
transcription_step=transcription_step,
identification_step=identification_step
)
Troubleshooting
Common Issues
- Audio device not found: Verify your input device with `--list-devices` and select the correct index.
- CUDA out of memory: Try a smaller Whisper model (`--whisper-model small` or `--whisper-model base`).
- Poor transcription quality: Consider the following:
  - Try a larger Whisper model
  - Ensure your microphone is positioned correctly
  - Adjust `--min-silence-duration` for better sentence boundaries
- Speaker identification issues: Try adjusting the `--speaker-threshold` value. Higher values require more confidence for speaker identification.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
This project utilizes several open-source libraries:
- DeepFilterNet for speech enhancement
- Faster-Whisper for speech transcription
- SpeechBrain for speaker identification
Authors
- Paul Roloff - paul.roloff@uni-bielefeld.de
- Felix Hostert - felix.hostert@uni-bielefeld.de
- Kai Titgens - kai.titgens@uni-bielefeld.de
File details
Details for the file iisy_asr_pipeline-1.0.0.post1.tar.gz.
File metadata
- Download URL: iisy_asr_pipeline-1.0.0.post1.tar.gz
- Upload date:
- Size: 16.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec763b8199a881270157b7bc86ebb70c3524fb4f58ab398043814400ba926b3c |
| MD5 | 82d3f79f56eae7c406578556c3627079 |
| BLAKE2b-256 | 45dbb5def6110590712d65569bc249a46c27210cdebe584ec6b7591d0faa3ae9 |
File details
Details for the file iisy_asr_pipeline-1.0.0.post1-py3-none-any.whl.
File metadata
- Download URL: iisy_asr_pipeline-1.0.0.post1-py3-none-any.whl
- Upload date:
- Size: 17.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8790c4415580002c5c90136b2dab93c661a12880fcc46200e0b77236d939e192 |
| MD5 | 82630524ce74b466795dd779f71fbcb7 |
| BLAKE2b-256 | c4af90bd0ee2c5047deee5490685706d847a4f79748705b15317e552147cc3e5 |