IISY ASR Pipeline

An automated speech recognition (ASR) pipeline with speech enhancement, transcription, and speaker identification capabilities.

Overview

The IISY ASR Pipeline processes live audio input in real time through three stages:

  1. Speech Enhancement - Using DeepFilterNet to improve audio quality
  2. Speech Transcription - Converting speech to text with Faster Whisper
  3. Speaker Identification - Identifying speakers using SpeechBrain models
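The three stages above form a simple chain, where each stage consumes the previous stage's output. The sketch below illustrates the flow with stand-in functions; the names and return values are hypothetical placeholders, not the package's actual API:

```python
# Stand-in functions illustrating the three-stage flow (hypothetical,
# not the package's real API).

def enhance(audio: bytes) -> bytes:
    # Stand-in for DeepFilterNet noise suppression.
    return audio

def transcribe(audio: bytes) -> str:
    # Stand-in for Faster Whisper transcription.
    return "hello world"

def identify(audio: bytes) -> str:
    # Stand-in for SpeechBrain speaker identification.
    return "speaker_1"

def run_pipeline(audio: bytes) -> tuple[str, str]:
    clean = enhance(audio)
    return transcribe(clean), identify(clean)

text, speaker = run_pipeline(b"\x00\x00" * 2048)
```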

Installation

Requirements

  • Python 3.11.10
  • CUDA-compatible GPU (recommended for optimal performance)

Installing the Package

You can install the package directly from PyPI:

# For CPU-only installation
pip install iisy-asr-pipeline

# For GPU support (CUDA)
pip install iisy-asr-pipeline[cuda]

Note that the [cuda] extra does not install a GPU build of PyTorch on its own; for GPU support, first install the CUDA-compatible version of PyTorch manually:

# Install CUDA-compatible PyTorch (example for CUDA 11.8)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# Then install the package with CUDA support
pip install iisy-asr-pipeline[cuda]

Adjust the CUDA version tag (cu117, cu118, cu121, etc.) to match the CUDA toolkit installed on your system.
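The tag in the index URL is simply the CUDA version with the dot removed (11.8 becomes cu118). A small helper, written here purely for illustration, makes the mapping explicit:

```python
def cuda_index_url(cuda_version: str) -> str:
    """Build the PyTorch wheel index URL for a CUDA version string,
    e.g. '11.8' -> 'https://download.pytorch.org/whl/cu118'.
    Hypothetical helper for illustration only."""
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://download.pytorch.org/whl/{tag}"

print(cuda_index_url("11.8"))  # https://download.pytorch.org/whl/cu118
```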

Usage

Listing Available Audio Devices

Before running the pipeline, you may want to identify the correct audio input device:

python -m iisy.run_pipeline --list-devices

Basic Usage

Run the ASR pipeline with default settings:

python -m iisy.run_pipeline --input-device-index 1

Command Line Options

The pipeline can be customized with various command line arguments:

python -m iisy.run_pipeline [OPTIONS]

Device Settings

  • --device - Device to run models on (cuda or cpu, default: cuda if available, otherwise cpu)
  • --input-device-index - Input audio device index (default: 1)
  • --list-devices - List all available audio devices and exit

Audio Parameters

  • --chunk-size - Number of audio frames per buffer (default: 2048)
  • --channels - Number of audio channels (1=mono, 2=stereo, default: 1)
  • --buffer-size - Size of the audio buffer (default: 1000)
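The default chunk size determines how much audio each buffer covers. Assuming the 16 kHz sample rate used in the programmatic example later in this document, the arithmetic is straightforward:

```python
CHUNK_SIZE = 2048      # frames per buffer (the --chunk-size default)
SAMPLE_RATE = 16_000   # Hz, as used in the programmatic example

chunk_duration_ms = CHUNK_SIZE / SAMPLE_RATE * 1000
print(f"{chunk_duration_ms:.0f} ms per chunk")  # 128 ms per chunk
```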

Model Parameters

  • --whisper-model - Whisper model size (tiny, base, small, medium, large, turbo, default: medium)
  • --speaker-model - Speaker identification model path (default: speechbrain/spkrec-resnet-voxceleb)

Silence Detection Parameters

  • --silence-threshold - Energy threshold for silence detection (default: 0.01)
  • --min-silence-duration - Minimum duration of silence for sentence boundary in seconds (default: 2.0)
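How an energy threshold like 0.01 behaves can be sketched in plain Python. The following is a generic RMS-energy check on 16-bit PCM audio, not necessarily the package's exact detection logic:

```python
import math
import struct

def is_silent(chunk: bytes, threshold: float = 0.01) -> bool:
    """Treat a chunk of little-endian 16-bit PCM audio as silence
    when its normalized RMS energy falls below the threshold."""
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    return rms < threshold

quiet = struct.pack("<4h", 10, -10, 5, -5)                # near-zero amplitude
loud = struct.pack("<4h", 20000, -20000, 20000, -20000)   # high amplitude
print(is_silent(quiet), is_silent(loud))  # True False
```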

Other Parameters

  • --speaker-threshold - Threshold for speaker identification (default: 0.55)
  • --verbose - Enable verbose logging
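Speaker-identification models such as the SpeechBrain ones used here typically compare speaker embeddings by cosine similarity. The sketch below shows how a 0.55 threshold would act on toy vectors (not real model output); the exact comparison inside the pipeline may differ:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def same_speaker(emb_a: list[float], emb_b: list[float],
                 threshold: float = 0.55) -> bool:
    # Accept the match only when similarity clears the threshold;
    # a higher threshold demands more confidence.
    return cosine_similarity(emb_a, emb_b) >= threshold

reference = [1.0, 1.0, 1.0]
close = [1.0, 0.9, 1.1]   # nearly parallel -> similarity near 1
far = [-1.0, 1.0, 0.0]    # orthogonal -> similarity 0
print(same_speaker(close, reference), same_speaker(far, reference))  # True False
```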

Example Commands

Run with a larger Whisper model for better transcription accuracy:

python -m iisy.run_pipeline --whisper-model large

Use a different microphone (device index 2) and enable verbose logging:

python -m iisy.run_pipeline --input-device-index 2 --verbose

Use ECAPA-TDNN model for speaker identification:

python -m iisy.run_pipeline --speaker-model speechbrain/spkrec-ecapa-voxceleb

Advanced Usage

Programmatic Integration

You can integrate the ASR pipeline into your own Python applications:

import threading
import pyaudio
import torch
from iisy.context_window import ContextWindow
from iisy.pipeline.asr_pipeline import AsrPipeline

# Initialize audio capture
p = pyaudio.PyAudio()
in_stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=1,
    frames_per_buffer=2048
)

# Create audio buffer
audio_buffer = ContextWindow(1000)

# Configure pipeline
pipeline_config = {
    'speaker': {
        'model': "speechbrain/spkrec-resnet-voxceleb",
        'savedir': "spkrec-resnet-voxceleb",
        'speaker_threshold': 0.55
    },
    'whisper': {
        "model_size": "medium",
        "device_index": 0,
        "compute_type": "float16"
    }
}

# Create pipeline
pipeline = AsrPipeline(
    input_sr=16000,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    min_silence_duration=2.0,
    verbose=True,
    **pipeline_config
)

# Set up audio capture
def audio_capture():
    while True:
        try:
            audio_data = in_stream.read(2048, exception_on_overflow=False)
            audio_buffer.add(audio_data)
        except Exception as e:
            print(f"Audio capture error: {e}")
            break

# Start capture thread
capture_thread = threading.Thread(target=audio_capture, daemon=True)
capture_thread.start()

# Run pipeline
try:
    pipeline.run(audio_buffer)
finally:
    in_stream.stop_stream()
    in_stream.close()
    p.terminate()

Custom Processing Steps

You can customize each processing step of the pipeline:

from iisy.pipeline.speech_enhancement_step import SpeechEnhancementStep
from iisy.pipeline.speech_transcription_step import SpeechTranscriptionStep
from iisy.pipeline.speaker_identification_step import SpeakerIdentificationStep

# Create custom steps
enhancement_step = SpeechEnhancementStep(...)
transcription_step = SpeechTranscriptionStep(...)
identification_step = SpeakerIdentificationStep(...)

# Create pipeline with custom steps
pipeline = AsrPipeline(
    enhancement_step=enhancement_step,
    transcription_step=transcription_step,
    identification_step=identification_step
)

Troubleshooting

Common Issues

  1. Audio device not found: Verify your input device with --list-devices and select the correct index.

  2. CUDA out of memory: Try using a smaller Whisper model (--whisper-model small or --whisper-model base).

  3. Poor transcription quality: Consider the following:

    • Try a larger Whisper model
    • Ensure your microphone is positioned correctly
    • Adjust --min-silence-duration for better sentence boundaries
  4. Speaker identification issues: Try adjusting the --speaker-threshold value. Higher values require more confidence for speaker identification.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This project builds on several open-source libraries, including:

  • DeepFilterNet - speech enhancement
  • Faster Whisper - speech transcription
  • SpeechBrain - speaker identification
  • PyAudio - audio capture
  • PyTorch - model execution


