PyHearingAI
The official library for transcribing audio conversations with accurate speaker identification.
Current Status
PyHearingAI provides a complete pipeline for audio transcription with speaker diarization and supports multiple output formats. The codebase follows Clean Architecture principles.
Features
- Audio format conversion (supports mp3, wav, mp4, and more)
- Transcription pipeline powered by OpenAI Whisper
- Speaker diarization using Pyannote
- Speaker assignment using GPT-4o
- Support for multiple output formats:
  - TXT
  - JSON
  - SRT
  - VTT
  - Markdown
- Clean Architecture design for maintainability and extensibility
- End-to-end testing framework
- Progress tracking for long-running processes
- Comprehensive error handling
- Command-line interface
Requirements
- Python 3.8+
- FFmpeg for audio conversion
- API keys:
  - OpenAI API key (for Whisper transcription and GPT-4o speaker assignment)
  - Hugging Face API key (for Pyannote speaker diarization)
Installation
System Dependencies
First, install FFmpeg:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows (using Chocolatey)
choco install ffmpeg
Using Poetry (Recommended)
poetry add pyhearingai
Using pip
pip install pyhearingai
API Key Setup
Set up your API keys as environment variables:
# In your terminal or .env file
export OPENAI_API_KEY=your_openai_api_key
export HUGGINGFACE_API_KEY=your_huggingface_api_key
Or in your Python code:
import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
os.environ["HUGGINGFACE_API_KEY"] = "your_huggingface_api_key"
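Before starting a long run, it can help to confirm that both keys are actually present. The following is a minimal sketch using only the standard library; the helper name `missing_api_keys` is hypothetical and not part of PyHearingAI.

```python
import os

# Variable names taken from the setup instructions above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.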
Quick Start
# Simple one-line usage
from pyhearingai import transcribe
# Process an audio file with default settings
result = transcribe("meeting.mp3")
# Print the full transcript with speaker labels
print(result.text)
# Save in different formats
result.save("transcript.txt") # Plain text
result.save("transcript.json") # JSON with segments, timestamps
result.save("transcript.srt") # Subtitle format
result.save("transcript.md") # Markdown format
Advanced Usage
Configuring the Transcription Process
from pyhearingai import transcribe
# Configure transcription with specific options
result = transcribe(
    "interview.mp3",
    transcriber="whisper_openai",  # Specify transcriber
    diarizer="pyannote",           # Specify diarizer
    verbose=True                   # Enable verbose output
)
Progress Tracking
from pyhearingai import transcribe
# Called by the pipeline with a dict describing the current stage
def progress_callback(progress_info):
    stage = progress_info.get('stage', 'unknown')
    percent = progress_info.get('progress', 0) * 100
    print(f"Processing {stage}: {percent:.1f}% complete")
result = transcribe(
    "long_recording.mp3",
    progress_callback=progress_callback
)
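A callback does not have to be a plain function. The sketch below shows a callable object that remembers the latest value per stage and renders a text bar. It relies only on the `stage` and `progress` keys used above and assumes `progress` is a fraction between 0 and 1; the class name `ProgressTracker` is hypothetical and not part of PyHearingAI.

```python
class ProgressTracker:
    """Callable that remembers the latest progress reported for each stage."""

    def __init__(self):
        self.latest = {}

    def __call__(self, progress_info):
        # 'stage' and 'progress' are the keys used in the example above;
        # 'progress' is assumed to be a fraction between 0 and 1.
        stage = progress_info.get("stage", "unknown")
        fraction = min(max(progress_info.get("progress", 0), 0.0), 1.0)
        self.latest[stage] = fraction
        print(self.render(stage))

    def render(self, stage, width=20):
        """Return a text bar of `width` characters for one stage."""
        fraction = self.latest.get(stage, 0.0)
        filled = int(round(fraction * width))
        bar = "#" * filled + "." * (width - filled)
        return f"{stage:<12} [{bar}] {fraction * 100:5.1f}%"


tracker = ProgressTracker()
tracker({"stage": "transcription", "progress": 0.25})
tracker({"stage": "transcription", "progress": 1.0})
```

An instance could then be passed as `progress_callback=tracker`, and `tracker.latest` inspected afterwards.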
Working with Results
# Access the segments
for segment in result.segments:
    print(f"Speaker {segment.speaker_id}: {segment.text}")
    print(f"Time: {segment.start:.2f}s - {segment.end:.2f}s")
# Available output formats
from pyhearingai import list_output_formatters, get_output_formatter
# List available formatters
formatters = list_output_formatters() # ['txt', 'json', 'srt', 'vtt', 'md']
# Get a specific formatter and format output
json_formatter = get_output_formatter('json')
json_content = json_formatter.format(result)
with open("transcript.json", "w") as f:
    f.write(json_content)
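Segments lend themselves to simple post-processing, such as totalling speaking time or merging consecutive segments from the same speaker. The sketch below uses a stand-in `Segment` dataclass with the four attributes shown above (`speaker_id`, `text`, `start`, `end`); the dataclass and both helper functions are illustrative and not part of PyHearingAI.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in with the attributes used in the loop above."""
    speaker_id: str
    text: str
    start: float
    end: float


def speaking_time(segments):
    """Sum the duration in seconds of all segments for each speaker."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment.speaker_id] += segment.end - segment.start
    return dict(totals)


def merge_consecutive(segments):
    """Join adjacent segments from the same speaker into single turns."""
    turns = []
    for segment in segments:
        if turns and turns[-1].speaker_id == segment.speaker_id:
            last = turns[-1]
            turns[-1] = Segment(last.speaker_id, f"{last.text} {segment.text}",
                                last.start, segment.end)
        else:
            turns.append(Segment(segment.speaker_id, segment.text,
                                 segment.start, segment.end))
    return turns


demo = [
    Segment("0", "Hello.", 0.0, 1.5),
    Segment("0", "Shall we start?", 1.5, 3.0),
    Segment("1", "Yes, go ahead.", 3.0, 4.0),
]
print(speaking_time(demo))
for turn in merge_consecutive(demo):
    print(f"Speaker {turn.speaker_id}: {turn.text}")
```

Because both helpers only read attributes, they should work on any object exposing the same four fields.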
Command Line Interface
PyHearingAI includes a command-line interface:
# Basic usage
transcribe meeting.mp3
# Write the transcript to a specific output file
transcribe meeting.mp3 --output transcript.txt
# Configure models
transcribe meeting.mp3 --transcriber whisper-openai --diarizer pyannote --speaker-assigner gpt-4o
# Get help
transcribe --help
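When the CLI is driven from a script, building the argument list in one place avoids quoting mistakes. This is a minimal sketch that uses only the flags shown above (`--output`, `--transcriber`, `--diarizer`, `--speaker-assigner`); the function name `build_transcribe_command` is hypothetical.

```python
import shlex


def build_transcribe_command(audio_path, output=None, transcriber=None,
                             diarizer=None, speaker_assigner=None):
    """Build the argument list for the `transcribe` CLI shown above."""
    command = ["transcribe", str(audio_path)]
    options = {
        "--output": output,
        "--transcriber": transcriber,
        "--diarizer": diarizer,
        "--speaker-assigner": speaker_assigner,
    }
    # Only include flags that were explicitly set.
    for flag, value in options.items():
        if value is not None:
            command += [flag, str(value)]
    return command


command = build_transcribe_command("meeting.mp3", output="transcript.txt",
                                   diarizer="pyannote")
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing the list form to `subprocess.run` sidesteps shell interpolation, which matters when file names contain spaces.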
Testing
The library includes an end-to-end test that validates the complete pipeline:
# Install test dependencies
pip install -r requirements_test.txt
# Run the end-to-end test
python -m pytest tests/test_end_to_end.py -v
Repository
PyHearingAI is hosted on GitHub:
Architecture
PyHearingAI follows Clean Architecture principles, with clear separation of concerns:
- Core (Domain Layer): Contains domain models and business rules
- Application Layer: Implements use cases like transcription and speaker assignment
- Infrastructure Layer: Provides concrete implementations of interfaces (OpenAI Whisper, Pyannote, GPT-4o)
- Presentation Layer: Offers user interfaces (CLI, future REST API)
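The layering above can be illustrated with a toy example. This is a sketch of the general pattern under simplified assumptions, not PyHearingAI's actual code: every class name and signature here is invented for illustration, and the "infrastructure" classes are fakes that return canned data.

```python
from dataclasses import dataclass
from typing import List, Protocol


# Domain layer: plain data with no external dependencies.
@dataclass
class Segment:
    text: str
    start: float
    end: float
    speaker_id: str = ""


# Ports: the application layer depends on these interfaces only.
class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> List[Segment]: ...


class SpeakerAssigner(Protocol):
    def assign_speakers(self, segments: List[Segment]) -> List[Segment]: ...


# Application layer: a use case that wires the ports together.
class TranscribeConversation:
    def __init__(self, transcriber: Transcriber, assigner: SpeakerAssigner):
        self.transcriber = transcriber
        self.assigner = assigner

    def run(self, audio_path: str) -> List[Segment]:
        return self.assigner.assign_speakers(self.transcriber.transcribe(audio_path))


# Infrastructure layer: toy implementations standing in for real services.
class FakeTranscriber:
    def transcribe(self, audio_path: str) -> List[Segment]:
        return [Segment("Hi there.", 0.0, 1.0), Segment("Hello!", 1.0, 2.0)]


class AlternatingAssigner:
    def assign_speakers(self, segments: List[Segment]) -> List[Segment]:
        return [Segment(s.text, s.start, s.end, f"Speaker {i % 2 + 1}")
                for i, s in enumerate(segments)]


# Presentation layer: here just a print loop.
use_case = TranscribeConversation(FakeTranscriber(), AlternatingAssigner())
for seg in use_case.run("meeting.mp3"):
    print(f"{seg.speaker_id}: {seg.text}")
```

The use case never imports a concrete service, so swapping the fake transcriber for a real one requires no change to the application layer.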
For more details on the solution design and architecture, see the documentation:
Extending PyHearingAI
The library is designed for extensibility:
Custom Transcriber
from pyhearingai.extensions import register_transcriber
from pyhearingai.models import Transcriber
@register_transcriber("my-transcriber")
class MyTranscriber(Transcriber):
    def transcribe(self, audio_path, **kwargs):
        # Custom transcription logic: fill this list with transcript segments
        segments = []
        return segments
Custom Diarizer
from pyhearingai.extensions import register_diarizer
from pyhearingai.models import Diarizer
@register_diarizer("my-diarizer")
class MyDiarizer(Diarizer):
    def diarize(self, audio_path, **kwargs):
        # Custom diarization logic: fill this list with speaker segments
        speaker_segments = []
        return speaker_segments
Custom Speaker Assigner
from pyhearingai.extensions import register_speaker_assigner
from pyhearingai.models import SpeakerAssigner
@register_speaker_assigner("my-assigner")
class MySpeakerAssigner(SpeakerAssigner):
    def assign_speakers(self, transcript_segments, diarization_segments, **kwargs):
        # Custom speaker assignment logic: label each transcript segment
        labeled_segments = list(transcript_segments)
        return labeled_segments
Custom Output Format
from pyhearingai.extensions import register_output_formatter
from pyhearingai.models import OutputFormatter
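@register_output_formatter("my-format")
class MyOutputFormatter(OutputFormatter):
    def format(self, result):
        # Custom formatting logic: here, one line of text per segment
        formatted_output = "\n".join(segment.text for segment in result.segments)
        return formatted_output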
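For readers unfamiliar with decorator-based registration, the sketch below shows how such a registry commonly works. It is a generic illustration and not PyHearingAI's implementation: the module-level dictionary and all function bodies are assumptions, and the names merely mirror those used above.

```python
# Generic registry pattern; NOT the library's actual implementation.
_FORMATTERS = {}


def register_output_formatter(name):
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls):
        if name in _FORMATTERS:
            raise ValueError(f"formatter {name!r} is already registered")
        _FORMATTERS[name] = cls
        return cls
    return decorator


def get_output_formatter(name):
    """Instantiate the formatter registered under `name`."""
    if name not in _FORMATTERS:
        raise ValueError(f"unknown formatter {name!r}")
    return _FORMATTERS[name]()


def list_output_formatters():
    """Return the registered names in alphabetical order."""
    return sorted(_FORMATTERS)


@register_output_formatter("upper")
class UpperCaseFormatter:
    def format(self, result):
        # `result` is treated as a plain string in this toy example.
        return str(result).upper()


print(list_output_formatters())
print(get_output_formatter("upper").format("hello"))
```

The decorator returns the class unchanged, so a registered class can still be imported and used directly.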
Logging
Configure logging to control verbosity:
import logging
logging.basicConfig(level=logging.INFO)
# Set specific logger levels
logging.getLogger('pyhearingai.transcription').setLevel(logging.DEBUG)
logging.getLogger('pyhearingai.diarization').setLevel(logging.WARNING)
Directory Structure
The library creates the following directory structure for outputs:
content/
├── audio_conversion/ # Converted audio files
├── transcription/ # Transcription results
├── diarization/ # Speaker diarization results
└── speaker_assignment/ # Final output with speaker labels
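If other tooling needs to read or pre-create this layout, the paths can be derived with `pathlib`. The sketch below assumes the four subdirectory names from the tree above; the helper `ensure_output_dirs` is hypothetical and not part of PyHearingAI.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the tree above.
STAGE_DIRS = ("audio_conversion", "transcription", "diarization", "speaker_assignment")


def ensure_output_dirs(root="content"):
    """Create the stage subdirectories under `root` and return their paths."""
    root = Path(root)
    paths = {}
    for name in STAGE_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_output_dirs(Path(tmp) / "content")
    print(sorted(path.name for path in dirs.values()))
```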
Privacy and Data Handling
When using PyHearingAI, be aware that:
- Audio data is sent to third-party APIs (OpenAI and Hugging Face)
- OpenAI's data usage policies apply to audio sent for transcription
- Hugging Face's data usage policies apply to audio sent for diarization
- Consider data processing agreements when processing sensitive information
API Rate Limits and Quotas
Be aware of the following limits:
- The OpenAI Whisper API enforces rate limits (requests per minute)
- GPT-4o has per-request token limits as well as rate limits
- The Hugging Face API may have usage quotas
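A common way to cope with rate limits is to retry with exponential backoff. The following is a generic sketch, not a PyHearingAI feature: `call_with_backoff` and its parameters are hypothetical, and whether the library retries internally is not stated here.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `func()` and retry with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed, to simulate a temporary rate limit."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


print(call_with_backoff(flaky, base_delay=0.01))
```

In practice `retry_on` would be narrowed to the rate-limit exception raised by the client in use, and `func` could be something like `lambda: transcribe("meeting.mp3")`.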
Environment Variables
Required environment variables:
OPENAI_API_KEY=your_openai_api_key
HUGGINGFACE_API_KEY=your_huggingface_api_key
Optional environment variables:
PYHEARINGAI_DEFAULT_TRANSCRIBER=whisper-openai
PYHEARINGAI_DEFAULT_DIARIZER=pyannote
PYHEARINGAI_DEFAULT_SPEAKER_ASSIGNER=gpt-4o
PYHEARINGAI_OUTPUT_DIR=./content
PYHEARINGAI_LOG_LEVEL=INFO
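Application code that wants to honour the same variables can read them in one place. The sketch below assumes the values listed above are the defaults, which is not confirmed here; the `Settings` dataclass and `load_settings` function are hypothetical and not part of PyHearingAI.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    transcriber: str
    diarizer: str
    speaker_assigner: str
    output_dir: str
    log_level: str


def load_settings(env=None):
    """Read the optional PYHEARINGAI_* variables, falling back to the values listed above."""
    env = os.environ if env is None else env
    return Settings(
        transcriber=env.get("PYHEARINGAI_DEFAULT_TRANSCRIBER", "whisper-openai"),
        diarizer=env.get("PYHEARINGAI_DEFAULT_DIARIZER", "pyannote"),
        speaker_assigner=env.get("PYHEARINGAI_DEFAULT_SPEAKER_ASSIGNER", "gpt-4o"),
        output_dir=env.get("PYHEARINGAI_OUTPUT_DIR", "./content"),
        log_level=env.get("PYHEARINGAI_LOG_LEVEL", "INFO").upper(),
    )


print(load_settings({"PYHEARINGAI_LOG_LEVEL": "debug"}))
```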
License
Implemented Features
- Multiple output formats (TXT, JSON, SRT, VTT, Markdown)
- Transcription models:
  - OpenAI Whisper API (default)
- Diarization models:
  - Pyannote
- Speaker assignment models:
  - GPT-4o (using OpenAI API)
Features Under Development
- 🎛️ Extended Model Support:
  - Local Whisper models
  - Faster Whisper
  - Additional diarization models
- 🚀 Performance Features:
  - GPU Acceleration
  - Batch processing
  - Memory optimization
Contributing
We welcome contributions! Please check our GitHub repository for guidelines.
Acknowledgments
- OpenAI for the Whisper and GPT models
- Pyannote for the diarization technology
- The open-source community for various contributions