
IISY ASR Pipeline

An automated speech recognition (ASR) pipeline with speech enhancement, transcription, and speaker identification capabilities.

Overview

The IISY ASR Pipeline is a comprehensive solution for processing audio input in real time. It chains multiple processing stages:

  1. Speech Enhancement - Using DeepFilterNet to improve audio quality
  2. Speech Transcription - Converting speech to text with Faster Whisper
  3. Speaker Identification - Identifying speakers using SpeechBrain models
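The three stages above can be pictured as a simple chain of functions. The following is an illustrative outline only, with placeholder bodies; the package's actual steps live in `iisy.pipeline` and wrap DeepFilterNet, Faster Whisper, and SpeechBrain:

```python
# Illustrative three-stage flow; the function bodies are placeholders,
# not the real DeepFilterNet / Faster Whisper / SpeechBrain implementations.

def enhance(audio: bytes) -> bytes:
    # Stage 1: speech enhancement (DeepFilterNet in the real pipeline)
    return audio

def transcribe(audio: bytes) -> str:
    # Stage 2: speech-to-text (Faster Whisper in the real pipeline)
    return "transcribed text"

def identify_speaker(audio: bytes) -> str:
    # Stage 3: speaker identification (SpeechBrain in the real pipeline)
    return "speaker_1"

def process_chunk(audio: bytes) -> dict:
    # Enhancement runs first so both downstream stages see cleaned audio.
    clean = enhance(audio)
    return {"text": transcribe(clean), "speaker": identify_speaker(clean)}

print(process_chunk(b"\x00\x00"))
```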

Installation

Requirements

  • Python 3.11.10
  • CUDA-compatible GPU (recommended for optimal performance)

Installing the Package

You can install the package directly from PyPI:

# For CPU-only installation
pip install iisy-asr-pipeline

# For GPU support (CUDA)
pip install iisy-asr-pipeline[cuda]

For GPU support, you'll need to manually install the CUDA-compatible version of PyTorch first:

# Install CUDA-compatible PyTorch (example for CUDA 11.8)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# Then install the package with CUDA support
pip install iisy-asr-pipeline[cuda]

You can adjust the CUDA version (cu117, cu118, cu121, etc.) based on your system's requirements.

Usage

Listing Available Audio Devices

Before running the pipeline, you may want to identify the correct audio input device:

python -m iisy.run_pipeline --list-devices

Basic Usage

Run the ASR pipeline with default settings:

python -m iisy.run_pipeline --input-device-index 1

Command Line Options

The pipeline can be customized with various command line arguments:

python -m iisy.run_pipeline [OPTIONS]

Device Settings

  • --device - Device to run models on (cuda or cpu, default: cuda if available, otherwise cpu)
  • --input-device-index - Input audio device index (default: 1)
  • --list-devices - List all available audio devices and exit

Audio Parameters

  • --chunk-size - Number of audio frames per buffer (default: 2048)
  • --channels - Number of audio channels (1=mono, 2=stereo, default: 1)
  • --buffer-size - Size of the audio buffer (default: 1000)
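The chunk size directly sets the per-buffer latency of the capture loop. As a quick sanity check (assuming the 16 kHz sample rate used in the programmatic example below), the default of 2048 frames corresponds to roughly 128 ms per buffer:

```python
# Relationship between --chunk-size and per-buffer latency.
# 16 kHz is an assumption here; match it to your input device's sample rate.
SAMPLE_RATE = 16_000  # Hz
CHUNK_SIZE = 2_048    # frames per buffer (the --chunk-size default)

latency_s = CHUNK_SIZE / SAMPLE_RATE
print(f"{latency_s * 1000:.0f} ms per buffer")  # 128 ms
```

Smaller chunks reduce latency but increase the number of read calls per second.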

Model Parameters

  • --whisper-model - Whisper model size (tiny, base, small, medium, large, turbo, default: medium)
  • --speaker-model - Speaker identification model path (default: speechbrain/spkrec-resnet-voxceleb)

Silence Detection Parameters

  • --silence-threshold - Energy threshold for silence detection (default: 0.01)
  • --min-silence-duration - Minimum duration of silence for sentence boundary in seconds (default: 2.0)
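As a rough illustration of how an energy threshold like `--silence-threshold` can flag silent frames (the package's actual detector is not shown here and may differ), here is a minimal RMS-based sketch:

```python
import math

def rms_energy(samples):
    """Root-mean-square energy of float samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_silent(samples, threshold=0.01):
    # 0.01 mirrors the --silence-threshold default; frames whose energy
    # stays below it for --min-silence-duration seconds would mark a
    # sentence boundary.
    return rms_energy(samples) < threshold

quiet = [0.001] * 512     # near-silent frame
loud = [0.5, -0.5] * 256  # speech-level frame
print(is_silent(quiet), is_silent(loud))  # True False
```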

Other Parameters

  • --speaker-threshold - Threshold for speaker identification (default: 0.55)
  • --verbose - Enable verbose logging
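The `--speaker-threshold` value is compared against a similarity score between speaker embeddings. A plausible sketch using cosine similarity (a metric commonly used with SpeechBrain verification models, though the package's exact scoring is not documented here; the embeddings below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.55):
    # 0.55 mirrors the --speaker-threshold default; raising it demands
    # higher similarity before two embeddings count as the same speaker.
    return cosine_similarity(emb_a, emb_b) >= threshold

emb1 = [0.9, 0.1, 0.3]  # hypothetical speaker embeddings
emb2 = [0.8, 0.2, 0.4]
print(same_speaker(emb1, emb2))  # True
```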

Example Commands

Run with a larger Whisper model for better transcription accuracy:

python -m iisy.run_pipeline --whisper-model large

Use a different microphone (device index 2) and enable verbose logging:

python -m iisy.run_pipeline --input-device-index 2 --verbose

Use ECAPA-TDNN model for speaker identification:

python -m iisy.run_pipeline --speaker-model speechbrain/spkrec-ecapa-voxceleb

Advanced Usage

Programmatic Integration

You can integrate the ASR pipeline into your own Python applications:

import threading
import pyaudio
import torch
from iisy.context_window import ContextWindow
from iisy.pipeline.asr_pipeline import AsrPipeline

# Initialize audio capture
p = pyaudio.PyAudio()
in_stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=1,
    frames_per_buffer=2048
)

# Create audio buffer
audio_buffer = ContextWindow(1000)

# Configure pipeline
pipeline_config = {
    "speaker": {
        "model": "speechbrain/spkrec-resnet-voxceleb",
        "savedir": "spkrec-resnet-voxceleb",
        "speaker_threshold": 0.55,
    },
    "whisper": {
        "model_size": "medium",
        "device_index": 0,
        "compute_type": "float16",
    },
}

# Create pipeline
pipeline = AsrPipeline(
    input_sr=16000,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    min_silence_duration=2.0,
    verbose=True,
    **pipeline_config
)

# Set up audio capture
def audio_capture():
    while True:
        try:
            audio_data = in_stream.read(2048, exception_on_overflow=False)
            audio_buffer.add(audio_data)
        except Exception as e:
            print(f"Audio capture error: {e}")
            break

# Start capture thread
capture_thread = threading.Thread(target=audio_capture, daemon=True)
capture_thread.start()

# Run pipeline
try:
    pipeline.run(audio_buffer)
finally:
    in_stream.stop_stream()
    in_stream.close()
    p.terminate()
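Since the capture stream above uses `pyaudio.paInt16`, any downstream code that expects normalized float samples needs the raw bytes converted first. A minimal stdlib sketch of that conversion (the pipeline may well handle this internally; this is shown only to clarify the data format):

```python
import struct

def int16_bytes_to_floats(raw: bytes) -> list:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}h", raw)
    return [s / 32768.0 for s in samples]

# Pack three known int16 samples and convert them back.
raw = struct.pack("<3h", 0, 16384, -32768)
print(int16_bytes_to_floats(raw))  # [0.0, 0.5, -1.0]
```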

Custom Processing Steps

You can customize each processing step of the pipeline:

from iisy.pipeline.asr_pipeline import AsrPipeline
from iisy.pipeline.speech_enhancement_step import SpeechEnhancementStep
from iisy.pipeline.speech_transcription_step import SpeechTranscriptionStep
from iisy.pipeline.speaker_identification_step import SpeakerIdentificationStep

# Create custom steps
enhancement_step = SpeechEnhancementStep(...)
transcription_step = SpeechTranscriptionStep(...)
identification_step = SpeakerIdentificationStep(...)

# Create pipeline with custom steps
pipeline = AsrPipeline(
    enhancement_step=enhancement_step,
    transcription_step=transcription_step,
    identification_step=identification_step
)

Troubleshooting

Common Issues

  1. Audio device not found: Verify your input device with --list-devices and select the correct index.

  2. CUDA out of memory: Try using a smaller Whisper model (--whisper-model small or --whisper-model base).

  3. Poor transcription quality: Consider the following:

    • Try a larger Whisper model
    • Ensure your microphone is positioned correctly
    • Adjust --min-silence-duration for better sentence boundaries

  4. Speaker identification issues: Try adjusting the --speaker-threshold value. Higher values require more confidence for speaker identification.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This project builds on several open-source libraries, including DeepFilterNet (speech enhancement), Faster Whisper (transcription), SpeechBrain (speaker identification), PyAudio (audio capture), and PyTorch.
