IISY ASR Pipeline

An automatic speech recognition (ASR) pipeline with speech enhancement, transcription, and speaker identification capabilities.

Overview

The IISY ASR Pipeline processes live audio input in real time by chaining three processing stages:

  1. Speech Enhancement - Using DeepFilterNet to improve audio quality
  2. Speech Transcription - Converting speech to text with Faster Whisper
  3. Speaker Identification - Identifying speakers using SpeechBrain models
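
Conceptually, the three stages form a linear chain: each captured audio segment is enhanced, then transcribed, then attributed to a speaker. A minimal sketch of that flow, with placeholder functions standing in for the real DeepFilterNet, Faster Whisper, and SpeechBrain wrappers (these function names are illustrative, not the package's API):

```python
# Illustrative stand-ins for the three pipeline stages; the real
# implementations wrap DeepFilterNet, Faster Whisper, and SpeechBrain.
def enhance(audio: list[float]) -> list[float]:
    """Stage 1: speech enhancement (noise suppression)."""
    return audio  # placeholder: pass audio through unchanged

def transcribe(audio: list[float]) -> str:
    """Stage 2: speech-to-text."""
    return "<transcript>"  # placeholder transcript

def identify_speaker(audio: list[float]) -> str:
    """Stage 3: speaker identification."""
    return "<speaker>"  # placeholder speaker label

def run_stages(audio: list[float]) -> dict:
    """Chain the three stages over one audio segment."""
    clean = enhance(audio)
    return {"text": transcribe(clean), "speaker": identify_speaker(clean)}

result = run_stages([0.0] * 16000)  # one second of audio at 16 kHz
```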

Installation

Requirements

  • Python 3.11.10
  • CUDA-compatible GPU (recommended for optimal performance)

Installing from PyPI

You can install the package directly from PyPI:

# For CPU-only installation
pip install iisy-asr-pipeline

# For GPU support (CUDA)
pip install iisy-asr-pipeline[cuda]

For GPU support, you'll need to manually install the CUDA-compatible version of PyTorch first:

# Install CUDA-compatible PyTorch (example for CUDA 11.8)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# Then install the package with CUDA support
pip install iisy-asr-pipeline[cuda]

You can adjust the CUDA version (cu117, cu118, cu121, etc.) based on your system's requirements.

Usage

Listing Available Audio Devices

Before running the pipeline, you may want to identify the correct audio input device:

python -m iisy.run_pipeline --list-devices

Basic Usage

Run the ASR pipeline with default settings:

python -m iisy.run_pipeline --input-device-index 1

Command Line Options

The pipeline can be customized with various command line arguments:

python -m iisy.run_pipeline [OPTIONS]

Device Settings

  • --device - Device to run models on (cuda or cpu, default: cuda if available, otherwise cpu)
  • --input-device-index - Input audio device index (default: 1)
  • --list-devices - List all available audio devices and exit

Audio Parameters

  • --chunk-size - Number of audio frames per buffer (default: 2048)
  • --channels - Number of audio channels (1=mono, 2=stereo, default: 1)
  • --buffer-size - Size of the audio buffer (default: 1000)
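
For scale: at the 16 kHz sample rate used in the examples below, one 2048-frame chunk covers 2048 / 16000 = 0.128 s of audio, so a 1000-chunk buffer spans roughly two minutes. The behavior of a fixed-size chunk buffer like the package's ContextWindow can be sketched with a deque (an assumption about its semantics, not its actual implementation):

```python
from collections import deque

class RingBuffer:
    """Fixed-capacity FIFO buffer of audio chunks, sketching what a
    bounded context window might look like. Once capacity is reached,
    the oldest chunk is discarded on each add."""

    def __init__(self, capacity: int):
        self._chunks = deque(maxlen=capacity)

    def add(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def __len__(self) -> int:
        return len(self._chunks)

buf = RingBuffer(3)
for i in range(5):
    buf.add(bytes([i]))
# capacity is 3, so only the 3 most recent chunks remain
```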

Model Parameters

  • --whisper-model - Whisper model size (tiny, base, small, medium, large, turbo, default: medium)
  • --speaker-model - Speaker identification model path (default: speechbrain/spkrec-resnet-voxceleb)

Silence Detection Parameters

  • --silence-threshold - Energy threshold for silence detection (default: 0.01)
  • --min-silence-duration - Minimum duration of silence for sentence boundary in seconds (default: 2.0)
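
A sketch of how these two parameters typically interact in energy-based endpointing: each chunk's RMS energy is compared against the silence threshold, and a sentence boundary is declared once consecutive quiet chunks accumulate to the minimum silence duration. This is a generic illustration, not the package's actual detector:

```python
def rms(chunk: list[float]) -> float:
    """Root-mean-square energy of one chunk of float samples."""
    return (sum(x * x for x in chunk) / len(chunk)) ** 0.5

def find_boundaries(chunks, chunk_duration=0.128,
                    silence_threshold=0.01, min_silence_duration=2.0):
    """Yield the index of each chunk at which accumulated silence
    first reaches the minimum duration (a sentence boundary)."""
    silent_time = 0.0
    for i, chunk in enumerate(chunks):
        if rms(chunk) < silence_threshold:
            silent_time += chunk_duration
            if silent_time >= min_silence_duration:
                yield i
                silent_time = 0.0  # reset after declaring a boundary
        else:
            silent_time = 0.0  # speech resets the silence counter

# 5 loud chunks followed by 20 quiet ones (~2.56 s of silence,
# which crosses the 2.0 s minimum exactly once)
chunks = [[0.5] * 4] * 5 + [[0.0] * 4] * 20
print(list(find_boundaries(chunks)))
```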

Other Parameters

  • --speaker-threshold - Threshold for speaker identification (default: 0.55)
  • --verbose - Enable verbose logging
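
Speaker verification models such as the SpeechBrain ones used here score pairs of speaker embeddings with cosine similarity, and --speaker-threshold decides when a score counts as a match. A toy illustration with 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_same_speaker(emb_a, emb_b, threshold=0.55):
    """Accept the match only when similarity clears the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

enrolled = [0.9, 0.1, 0.2]       # toy embedding of a known speaker
probe_close = [0.8, 0.2, 0.1]    # similar voice: high similarity
probe_far = [-0.5, 0.9, 0.0]     # different voice: low similarity
```

Raising the threshold makes matches stricter (fewer false accepts, more false rejects); lowering it does the opposite.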

Example Commands

Run with a larger Whisper model for better transcription accuracy:

python -m iisy.run_pipeline --whisper-model large

Use a different microphone (device index 2) and enable verbose logging:

python -m iisy.run_pipeline --input-device-index 2 --verbose

Use ECAPA-TDNN model for speaker identification:

python -m iisy.run_pipeline --speaker-model speechbrain/spkrec-ecapa-voxceleb

Advanced Usage

Programmatic Integration

You can integrate the ASR pipeline into your own Python applications:

import threading
import pyaudio
import torch
from iisy.context_window import ContextWindow
from iisy.pipeline.asr_pipeline import AsrPipeline

# Initialize audio capture
p = pyaudio.PyAudio()
in_stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=1,
    frames_per_buffer=2048
)

# Create audio buffer
audio_buffer = ContextWindow(1000)

# Configure pipeline
pipeline_config = {
    "speaker": {
        "model": "speechbrain/spkrec-resnet-voxceleb",
        "savedir": "spkrec-resnet-voxceleb",
        "speaker_threshold": 0.55
    },
    "whisper": {
        "model_size": "medium",
        "device_index": 0,
        "compute_type": "float16"
    }
}

# Create pipeline
pipeline = AsrPipeline(
    input_sr=16000,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    min_silence_duration=2.0,
    verbose=True,
    **pipeline_config
)

# Set up audio capture
def audio_capture():
    while True:
        try:
            audio_data = in_stream.read(2048, exception_on_overflow=False)
            audio_buffer.add(audio_data)
        except Exception as e:
            print(f"Audio capture error: {e}")
            break

# Start capture thread
capture_thread = threading.Thread(target=audio_capture, daemon=True)
capture_thread.start()

# Run pipeline
try:
    pipeline.run(audio_buffer)
finally:
    in_stream.stop_stream()
    in_stream.close()
    p.terminate()

Custom Processing Steps

You can customize each processing step of the pipeline:

from iisy.pipeline.speech_enhancement_step import SpeechEnhancementStep
from iisy.pipeline.speech_transcription_step import SpeechTranscriptionStep
from iisy.pipeline.speaker_identification_step import SpeakerIdentificationStep

# Create custom steps
enhancement_step = SpeechEnhancementStep(...)
transcription_step = SpeechTranscriptionStep(...)
identification_step = SpeakerIdentificationStep(...)

# Create pipeline with custom steps
pipeline = AsrPipeline(
    enhancement_step=enhancement_step,
    transcription_step=transcription_step,
    identification_step=identification_step
)

Troubleshooting

Common Issues

  1. Audio device not found: Verify your input device with --list-devices and select the correct index.

  2. CUDA out of memory: Try using a smaller Whisper model (--whisper-model small or --whisper-model base).

  3. Poor transcription quality: Try the following:

    • Use a larger Whisper model
    • Ensure your microphone is positioned correctly and close to the speaker
    • Adjust --min-silence-duration for better sentence boundaries

  4. Speaker identification issues: Adjust the --speaker-threshold value; higher values require more confidence before a speaker is matched.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This project builds on several open-source libraries, including DeepFilterNet, Faster Whisper, SpeechBrain, PyAudio, and PyTorch.
