Library for transcribing audio conversations with accurate speaker identification

PyHearingAI

The official library for transcribing audio conversations with accurate speaker identification.

Current Status

PyHearingAI follows Clean Architecture principles. The library provides a complete pipeline for audio transcription with speaker diarization and supports multiple output formats.

Features

  • Audio format conversion (supports mp3, wav, mp4, and more)
  • Transcription pipeline powered by OpenAI Whisper
  • Speaker diarization using Pyannote
  • Speaker assignment using GPT-4o
  • Support for multiple output formats:
    • TXT
    • JSON
    • SRT
    • VTT
    • Markdown
  • Clean Architecture design for maintainability and extensibility
  • End-to-end testing framework
  • Progress tracking for long-running processes
  • Comprehensive error handling
  • Command-line interface

Requirements

  • Python 3.8+
  • FFmpeg for audio conversion
  • API keys:
    • OpenAI API key (for Whisper transcription and GPT-4o speaker assignment)
    • Hugging Face API key (for Pyannote speaker diarization)

Installation

System Dependencies

First, install FFmpeg:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows (using Chocolatey)
choco install ffmpeg

Using Poetry (Recommended)

poetry add pyhearingai

Using pip

pip install pyhearingai

API Key Setup

Set up your API keys as environment variables:

# In your terminal or .env file
export OPENAI_API_KEY=your_openai_api_key
export HUGGINGFACE_API_KEY=your_huggingface_api_key

Or in your Python code:

import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
os.environ["HUGGINGFACE_API_KEY"] = "your_huggingface_api_key"
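
A missing key tends to surface only when the first API call fails, so it can help to check for both variables up front. The following is a minimal sketch using only the standard library; the helper name `missing_api_keys` is hypothetical and not part of PyHearingAI:

```python
import os

# Variable names taken from the setup instructions above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_API_KEY")


def missing_api_keys(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All API keys are set.")
```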

Quick Start

# Simple one-line usage
from pyhearingai import transcribe

# Process an audio file with default settings
result = transcribe("meeting.mp3")

# Print the full transcript with speaker labels
print(result.text)

# Save in different formats
result.save("transcript.txt")  # Plain text
result.save("transcript.json")  # JSON with segments, timestamps
result.save("transcript.srt")   # Subtitle format
result.save("transcript.md")    # Markdown format

Advanced Usage

Configuring the Transcription Process

from pyhearingai import transcribe

# Configure transcription with specific options
result = transcribe(
    "interview.mp3",
    transcriber="whisper_openai",  # Specify transcriber
    diarizer="pyannote",           # Specify diarizer
    verbose=True                   # Enable verbose output
)

Progress Tracking

from pyhearingai import transcribe

def progress_callback(progress_info):
    # progress_info is a dict; 'progress' is scaled by 100 to show a percentage
    stage = progress_info.get('stage', 'unknown')
    percent = progress_info.get('progress', 0) * 100
    print(f"Processing {stage}: {percent:.1f}% complete")

result = transcribe(
    "long_recording.mp3",
    progress_callback=progress_callback
)
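
For long recordings, a callback that prints on every update can flood the terminal. The sketch below builds a callback that only reports once progress has advanced by a chosen step. It assumes, as in the example above, that `progress_info` is a dict with `stage` and `progress` (0 to 1) keys; `make_throttled_callback` is a hypothetical helper, not a PyHearingAI function.

```python
def make_throttled_callback(step=0.1, emit=print):
    """Build a progress callback that reports at most once per `step` of progress."""
    last_reported = {}  # stage name -> last progress value that was reported

    def callback(progress_info):
        stage = progress_info.get("stage", "unknown")
        progress = progress_info.get("progress", 0)
        previous = last_reported.get(stage)
        # Report the first update of a stage, then only after `step` more
        # progress, or when the stage reaches completion.
        if (
            previous is None
            or progress - previous >= step
            or (progress >= 1 and previous < 1)
        ):
            last_reported[stage] = progress
            emit(f"Processing {stage}: {progress * 100:.1f}% complete")

    return callback


# Simulated updates, standing in for what the pipeline would send
callback = make_throttled_callback(step=0.25)
for value in (0.0, 0.1, 0.3, 0.4, 1.0):
    callback({"stage": "transcription", "progress": value})
```

The returned function can then be passed as `progress_callback` in place of the simpler one above.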

Working with Results

# `result` is the object returned by transcribe()
from pyhearingai import get_output_formatter, list_output_formatters

# Access the segments
for segment in result.segments:
    print(f"Speaker {segment.speaker_id}: {segment.text}")
    print(f"Time: {segment.start:.2f}s - {segment.end:.2f}s")

# List available formatters
formatters = list_output_formatters()  # ['txt', 'json', 'srt', 'vtt', 'md']

# Get a specific formatter and format output
json_formatter = get_output_formatter('json')
json_content = json_formatter.format(result)
with open("transcript.json", "w", encoding="utf-8") as f:
    f.write(json_content)
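
Segments can also be aggregated, for example to see how long each speaker talked. The sketch below uses a stand-in `Segment` dataclass that models only the attributes shown above (`speaker_id`, `text`, `start`, `end`); the real segment class and the type of `speaker_id` may differ, and `speaking_time` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in for a result segment; only the attributes used above are modelled."""
    speaker_id: str
    text: str
    start: float
    end: float


def speaking_time(segments):
    """Return the total number of seconds spoken per speaker_id."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment.speaker_id] += segment.end - segment.start
    return dict(totals)


# Made-up sample data for illustration only
sample = [
    Segment("0", "Hello.", 0.0, 1.5),
    Segment("1", "Hi there.", 1.5, 3.0),
    Segment("0", "Shall we start?", 3.0, 5.0),
]
for speaker, seconds in speaking_time(sample).items():
    print(f"Speaker {speaker}: {seconds:.1f}s")
```

With a real transcription, `speaking_time(result.segments)` would be the equivalent call, assuming the segments expose the same attributes.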

Command Line Interface

PyHearingAI includes a command-line interface:

# Basic usage
transcribe meeting.mp3

# Write the transcript to a specific file
transcribe meeting.mp3 --output transcript.txt

# Configure models
transcribe meeting.mp3 --transcriber whisper-openai --diarizer pyannote --speaker-assigner gpt-4o

# Get help
transcribe --help
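
The CLI can also be driven from a script, for instance to process several files in a row. This is a rough sketch that uses only the `--output` flag shown above; `build_command` and `run_batch` are hypothetical helpers, and the assumption that the command exits with a non-zero status on failure is not confirmed by this document.

```python
import shutil
import subprocess


def build_command(audio_path, output=None):
    """Assemble the argument list for the `transcribe` command."""
    command = ["transcribe", str(audio_path)]
    if output is not None:
        command += ["--output", str(output)]
    return command


def run_batch(audio_paths):
    """Run the CLI once per file; requires `transcribe` to be on PATH."""
    if shutil.which("transcribe") is None:
        raise RuntimeError("The 'transcribe' command was not found on PATH")
    for path in audio_paths:
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(build_command(path, output=f"{path}.txt"), check=True)


print(build_command("meeting.mp3", output="transcript.txt"))
```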

Testing

The library includes an end-to-end test that validates the complete pipeline:

# Install test dependencies
pip install -r requirements_test.txt

# Run the end-to-end test
python -m pytest tests/test_end_to_end.py -v

Repository

PyHearingAI is hosted on GitHub.

Architecture

PyHearingAI follows Clean Architecture principles, with clear separation of concerns:

  • Core (Domain Layer): Contains domain models and business rules
  • Application Layer: Implements use cases like transcription and speaker assignment
  • Infrastructure Layer: Provides concrete implementations of interfaces (OpenAI Whisper, Pyannote, GPT-4o)
  • Presentation Layer: Offers user interfaces (CLI, future REST API)
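
The layering above can be illustrated with a small, self-contained sketch. It is not PyHearingAI's actual code: every class and function name here is hypothetical, and the infrastructure piece is a canned fake rather than a real API client. The point is that the use case depends only on an interface, so implementations can be swapped freely.

```python
from dataclasses import dataclass
from typing import List, Protocol


# Core (domain layer): plain data with no external dependencies
@dataclass
class Segment:
    text: str
    start: float
    end: float


# Application layer: the use case depends only on an interface
class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> List[Segment]: ...


def transcribe_use_case(audio_path: str, transcriber: Transcriber) -> str:
    """Run a transcriber and join its segments into one transcript."""
    segments = transcriber.transcribe(audio_path)
    return " ".join(segment.text for segment in segments)


# Infrastructure layer: a concrete implementation of the interface
class FakeTranscriber:
    def transcribe(self, audio_path: str) -> List[Segment]:
        return [Segment("Hello", 0.0, 1.0), Segment("world", 1.0, 2.0)]


# Presentation layer: wires the pieces together for the user
print(transcribe_use_case("meeting.mp3", FakeTranscriber()))
```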

For more details on the solution design and architecture, see the project documentation.

Extending PyHearingAI

The library is designed for extensibility:

Custom Transcriber

from pyhearingai.extensions import register_transcriber
from pyhearingai.models import Transcriber

@register_transcriber("my-transcriber")
class MyTranscriber(Transcriber):
    def transcribe(self, audio_path, **kwargs):
        # Custom transcription logic goes here; populate the list
        # with the transcript segments produced for audio_path.
        segments = []
        return segments

Custom Diarizer

from pyhearingai.extensions import register_diarizer
from pyhearingai.models import Diarizer

@register_diarizer("my-diarizer")
class MyDiarizer(Diarizer):
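    def diarize(self, audio_path, **kwargs):
        # Custom diarization logic goes here; populate the list
        # with the speaker segments detected in audio_path.
        speaker_segments = []
        return speaker_segments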

Custom Speaker Assigner

from pyhearingai.extensions import register_speaker_assigner
from pyhearingai.models import SpeakerAssigner

@register_speaker_assigner("my-assigner")
class MySpeakerAssigner(SpeakerAssigner):
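    def assign_speakers(self, transcript_segments, diarization_segments, **kwargs):
        # Custom speaker assignment logic goes here; combine both inputs
        # into transcript segments that carry speaker labels.
        labeled_segments = []
        return labeled_segments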

Custom Output Format

from pyhearingai.extensions import register_output_formatter
from pyhearingai.models import OutputFormatter

@register_output_formatter("my-format")
class MyOutputFormatter(OutputFormatter):
    def format(self, result):
        # Custom formatting logic goes here; build the output text
        # from the transcription result.
        formatted_output = ""
        return formatted_output

Logging

Configure logging to control verbosity:

import logging
logging.basicConfig(level=logging.INFO)

# Set specific logger levels
logging.getLogger('pyhearingai.transcription').setLevel(logging.DEBUG)
logging.getLogger('pyhearingai.diarization').setLevel(logging.WARNING)

Directory Structure

The library creates the following directory structure for outputs:

content/
├── audio_conversion/    # Converted audio files
├── transcription/       # Transcription results
├── diarization/         # Speaker diarization results
└── speaker_assignment/  # Final output with speaker labels
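
To inspect what each stage has written, the directories can be walked with `pathlib`. This is a small sketch built on the layout shown above; `list_outputs` is a hypothetical helper, and the file names inside each directory are not specified by this document.

```python
from pathlib import Path

# Subdirectory names taken from the layout shown above
STAGE_DIRS = ("audio_conversion", "transcription", "diarization", "speaker_assignment")


def list_outputs(base="content"):
    """Map each stage directory to the sorted names of the files it contains."""
    root = Path(base)
    outputs = {}
    for name in STAGE_DIRS:
        stage_dir = root / name
        if stage_dir.is_dir():
            outputs[name] = sorted(p.name for p in stage_dir.iterdir() if p.is_file())
        else:
            # A stage that has not run yet may have no directory
            outputs[name] = []
    return outputs


for stage, files in list_outputs().items():
    print(f"{stage}: {len(files)} file(s)")
```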

Privacy and Data Handling

When using PyHearingAI, be aware that:

  • Audio data is sent to third-party APIs (OpenAI and Hugging Face)
  • OpenAI's data usage policies apply to audio sent for transcription
  • Hugging Face's data usage policies apply to audio sent for diarization
  • Consider data processing agreements when processing sensitive information

API Rate Limits and Quotas

Be aware of the following API limits:

  • OpenAI applies rate limits (requests per minute) to the Whisper API
  • GPT-4o has per-request token limits as well as rate limits
  • The Hugging Face API may impose usage quotas
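
A common way to cope with rate limits is to retry with exponential backoff. The sketch below is a generic helper, not part of PyHearingAI. Which exception the library raises when a limit is hit is not documented here, so the `retry_on` default of `Exception` is a placeholder that should be narrowed in real use.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponentially growing, jittered delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Up to 10% random jitter avoids synchronized retries
            sleep(delay + random.uniform(0, delay * 0.1))
```

A call such as `retry_with_backoff(lambda: transcribe("meeting.mp3"))` would then wrap the whole pipeline, at the cost of repeating already completed stages on each retry.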

Environment Variables

Required environment variables:

OPENAI_API_KEY=your_openai_api_key
HUGGINGFACE_API_KEY=your_huggingface_api_key

Optional environment variables:

PYHEARINGAI_DEFAULT_TRANSCRIBER=whisper-openai
PYHEARINGAI_DEFAULT_DIARIZER=pyannote
PYHEARINGAI_DEFAULT_SPEAKER_ASSIGNER=gpt-4o
PYHEARINGAI_OUTPUT_DIR=./content
PYHEARINGAI_LOG_LEVEL=INFO
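
Reading these settings in application code could look like the sketch below. Treating the values listed above as fallbacks when a variable is unset is an assumption, and `load_settings` is a hypothetical helper rather than a PyHearingAI function.

```python
import os

# Values copied from the optional variables listed above (assumed defaults)
DEFAULTS = {
    "PYHEARINGAI_DEFAULT_TRANSCRIBER": "whisper-openai",
    "PYHEARINGAI_DEFAULT_DIARIZER": "pyannote",
    "PYHEARINGAI_DEFAULT_SPEAKER_ASSIGNER": "gpt-4o",
    "PYHEARINGAI_OUTPUT_DIR": "./content",
    "PYHEARINGAI_LOG_LEVEL": "INFO",
}


def load_settings(environ=None):
    """Return each setting from the environment, falling back to its default."""
    env = os.environ if environ is None else environ
    # `or` also covers variables that are set but empty
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


settings = load_settings()
print(settings["PYHEARINGAI_OUTPUT_DIR"])
```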

License

Apache 2.0

Implemented Features

  • Multiple output formats (TXT, JSON, SRT, VTT, Markdown)
  • Transcription models:
    • OpenAI Whisper API (default)
  • Diarization models:
    • Pyannote
  • Speaker assignment models:
    • GPT-4o (using OpenAI API)

Features Under Development

  • 🎛️ Extended Model Support:
    • Local Whisper models
    • Faster Whisper
    • Additional diarization models
  • 🚀 Performance Features:
    • GPU Acceleration
    • Batch processing
    • Memory optimization

Contributing

We welcome contributions! Please check our GitHub repository for guidelines.

Acknowledgments

  • OpenAI for the Whisper and GPT models
  • Pyannote for the diarization technology
  • The open-source community for various contributions
