Rust-powered Speech-to-Text toolkit with Whisper, Distil-Whisper, and streaming transcription
Reason this release was yanked:
discontinued
Antenna
Rust-powered Speech-to-Text toolkit for Python with Whisper integration.
Status
Antenna is now at v0.3.0 (Whisper Integration), which provides speech-to-text transcription using OpenAI's Whisper models via the Candle ML framework.
Features
v0.3.0 - Whisper Integration
- Whisper Models: Load any Whisper model (tiny, base, small, medium, large, large-v2, large-v3)
- Transcription: Convert speech to text with timestamps
- Translation: Translate any language to English
- Language Detection: Automatic detection of 99 languages
- Model Caching: Automatic caching for faster subsequent loads
- GPU Acceleration: Full CUDA support for NVIDIA GPUs (3-10x faster than CPU)
- CPU Support: Full functionality on CPU for systems without GPU
v0.2.0 - Enhanced Audio Foundation
- Multi-Format Support: Load MP3, FLAC, OGG, M4A, and WAV files
- Audio Analysis: RMS, peak, zero-crossing rate, energy calculations
- Silence Detection: Detect, trim, and split audio on silence
- Audio Normalization: Peak, RMS, and LUFS normalization
- Save Functionality: Export processed audio to WAV format
- Format Conversion: Convert stereo to mono
- Resampling: High-quality resampling using sinc interpolation
- NumPy Integration: Seamless conversion to NumPy arrays
Planned
- v0.4.0: Real-time streaming and async API
- v0.5.0: Voice Activity Detection (VAD) and advanced preprocessing
Installation
Prerequisites
- Python 3.8+
- Rust (will be installed automatically by maturin if not present)
- uv package manager
GPU Prerequisites (Optional)
For CUDA GPU acceleration:
- NVIDIA GPU with CUDA support
- CUDA Toolkit installed (provides the nvcc compiler)
  - Ubuntu: sudo apt install nvidia-cuda-toolkit
  - Or download from NVIDIA CUDA Toolkit
Development Installation
# Clone the repository
cd antenna
# Create virtual environment with uv
uv venv
source .venv/bin/activate # or `.venv\Scripts\activate` on Windows
# Install in development mode
uv add --dev maturin
# CPU-only build
uv run maturin develop --release
# OR build with CUDA/GPU support
uv run maturin develop --release --features cuda
# Install dependencies
uv add numpy
uv add --dev pytest
Quick Start
Speech-to-Text Transcription (NEW in v0.3.0!)
import antenna
# Quick transcription with convenience function
result = antenna.transcribe("speech.wav", model_size="base")
print(result.text)
# Or step by step with more control:
# 1. Load and preprocess audio
audio = antenna.load_audio("podcast.mp3")
audio = antenna.preprocess_for_whisper(audio) # Converts to 16kHz mono
# 2. Load Whisper model
model = antenna.WhisperModel.from_size("base") # tiny, base, small, medium, large
# 3. Transcribe
result = model.transcribe(audio)
print(f"Language: {result.language}")
print(f"Text: {result.text}")
# 4. Access segments with timestamps
for segment in result.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
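The timed segments shown above map naturally onto subtitle formats. The following is a minimal sketch, not part of Antenna itself: `format_srt_time` and `segments_to_srt` are hypothetical helpers that only assume each segment exposes `start`, `end` (seconds) and `text`, as documented for TranscriptionSegment. The `demo_segments` list is a stand-in so the snippet runs without a model.

```python
from types import SimpleNamespace


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render objects with .start, .end and .text as SRT subtitle text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(seg.start)} --> {format_srt_time(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


# Stand-in segments (hypothetical data); with Antenna you would pass
# result.segments instead.
demo_segments = [
    SimpleNamespace(start=0.0, end=1.25, text=" Hello there. "),
    SimpleNamespace(start=1.25, end=3.0, text="Second line."),
]
demo_srt = segments_to_srt(demo_segments)
```

With a real transcription, `segments_to_srt(result.segments)` could be written to a `.srt` file next to the audio.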
Translation to English
import antenna
# Load audio in any language
audio = antenna.load_audio("spanish_speech.wav")
audio = antenna.preprocess_for_whisper(audio)
# Translate to English
model = antenna.WhisperModel.from_size("base")
result = model.translate(audio)
print(result.text) # English translation
Language Detection
import antenna
audio = antenna.load_audio("mystery_language.wav")
audio = antenna.preprocess_for_whisper(audio)
model = antenna.WhisperModel.from_size("base")
language = model.detect_language(audio)
print(f"Detected language: {language}") # e.g., "en", "es", "zh"
GPU Acceleration
import antenna
# Check if CUDA is available
if antenna.is_cuda_available():
    print(f"CUDA available with {antenna.cuda_device_count()} device(s)")
    # Load model on GPU (3-10x faster than CPU)
    model = antenna.WhisperModel.from_size("base", device="cuda")
    # Or specify a specific GPU
    model = antenna.WhisperModel.from_size("base", device="cuda:0")
    # "gpu" is an alias for "cuda"
    model = antenna.WhisperModel.from_size("base", device="gpu")
else:
    # Fallback to CPU
    model = antenna.WhisperModel.from_size("base", device="cpu")
# Transcription works the same way
audio = antenna.load_audio("speech.wav")
audio = antenna.preprocess_for_whisper(audio)
result = model.transcribe(audio)
print(result.text)
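The CPU fallback above can be folded into one small function so that calling code only deals with a device string. This is a sketch under stated assumptions: `pick_device` is a hypothetical helper, not an Antenna function, and it only builds the "cpu" / "cuda:N" strings that the example above already uses.

```python
def pick_device(cuda_available: bool, device_count: int, preferred_index: int = 0) -> str:
    """Return a device string: "cuda:<index>" when a GPU is usable, else "cpu".

    An out-of-range preferred_index falls back to GPU 0 rather than failing.
    """
    if not cuda_available or device_count <= 0:
        return "cpu"
    index = preferred_index if 0 <= preferred_index < device_count else 0
    return f"cuda:{index}"


# Intended use with Antenna (not executed here):
#   device = pick_device(antenna.is_cuda_available(), antenna.cuda_device_count())
#   model = antenna.WhisperModel.from_size("base", device=device)
```

Keeping the decision in a pure function makes it easy to unit-test on machines without a GPU.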
Audio Processing
import antenna
# Load audio file (supports WAV, MP3, FLAC, OGG, M4A)
audio = antenna.load_audio("podcast.mp3")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Channels: {audio.channels}")
print(f"Duration: {audio.duration:.2f}s")
# Analyze audio
stats = antenna.analyze_audio(audio)
print(f"RMS: {stats.rms:.4f}, Peak: {stats.peak:.4f}")
print(f"RMS (dB): {stats.rms_db:.2f}, Peak (dB): {stats.peak_db:.2f}")
# Clean up audio
audio = antenna.trim_silence(audio, threshold_db=-40)
audio = antenna.normalize_audio(audio, method="rms", target_db=-20)
# Preprocess for Whisper (16kHz, mono)
audio = antenna.preprocess_audio(audio, target_sample_rate=16000, mono=True)
# Save processed audio
antenna.save_audio(audio, "processed.wav")
# Access as NumPy array
samples = audio.to_numpy()
print(f"Shape: {samples.shape}, dtype: {samples.dtype}")
Running the Demo
# Run Whisper transcription demo
uv run python examples/whisper_demo.py [audio_file]
# Generate a test audio file
uv run python examples/generate_test_audio.py
# Run audio processing demo
uv run python examples/audio_processing_demo.py
API Reference
Whisper Speech-to-Text
WhisperModel.from_size(size: str, device: str = "cpu") -> WhisperModel
Load a Whisper model by size name.
Parameters:
- size: Model size - "tiny", "base", "small", "medium", "large", "large-v2", "large-v3"
- device: Device to run on - "cpu" or "cuda"
Returns:
WhisperModel: Loaded model ready for transcription
WhisperModel.from_pretrained(model_id: str, device: str = "cpu") -> WhisperModel
Load a Whisper model from HuggingFace Hub.
Parameters:
- model_id: HuggingFace model ID (e.g., "openai/whisper-tiny")
- device: Device to run on - "cpu" or "cuda"
Returns:
WhisperModel: Loaded model ready for transcription
WhisperModel.transcribe(audio, language=None, task=None, beam_size=5, timestamps=True) -> TranscriptionResult
Transcribe audio to text.
Parameters:
- audio: AudioData object (must be 16kHz mono, use preprocess_for_whisper())
- language: Language code (e.g., "en", "es"). None for auto-detection.
- task: "transcribe" or "translate" (translate to English)
- beam_size: Beam size for decoding (default: 5, use 1 for greedy)
- timestamps: Whether to include timestamps (default: True)
Returns:
TranscriptionResult: Object with text, language, and segments
WhisperModel.translate(audio) -> TranscriptionResult
Translate audio to English.
Parameters:
audio: AudioData object (must be 16kHz mono)
Returns:
TranscriptionResult: English translation with timestamps
WhisperModel.detect_language(audio) -> str
Detect the language of audio.
Parameters:
audio: AudioData object (must be 16kHz mono)
Returns:
str: Language code (e.g., "en", "es", "zh")
preprocess_for_whisper(audio: AudioData) -> AudioData
Preprocess audio for Whisper (convert to 16kHz mono).
Parameters:
audio: Input AudioData object
Returns:
AudioData: Audio ready for Whisper (16kHz, mono)
transcribe(audio_path, model_size="base", language=None, device="cpu") -> TranscriptionResult
Convenience function to transcribe an audio file in one call.
Parameters:
- audio_path: Path to audio file
- model_size: Whisper model size
- language: Language code (None for auto-detection)
- device: "cpu" or "cuda"
Returns:
TranscriptionResult: Transcription result
list_whisper_models() -> List[Tuple[str, str]]
List available Whisper model sizes and their HuggingFace IDs.
is_model_cached(model_id: str) -> bool
Check if a model is cached locally.
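These two functions can be combined to see which models would trigger a download before loading them. The sketch below is illustrative only: `split_by_cache` is a hypothetical helper that accepts any list of (size, model ID) pairs and any predicate, so it can be exercised without Antenna installed. The stand-in data at the bottom is made up for the demonstration.

```python
def split_by_cache(models, is_cached):
    """Partition (size, model_id) pairs into cached and not-yet-cached size names.

    models:    iterable of (size, model_id) tuples
    is_cached: callable taking a model_id and returning bool
    """
    cached, missing = [], []
    for size, model_id in models:
        (cached if is_cached(model_id) else missing).append(size)
    return cached, missing


# Stand-in data for demonstration; with Antenna the intended call is
#   split_by_cache(antenna.list_whisper_models(), antenna.is_model_cached)
demo_models = [("tiny", "openai/whisper-tiny"), ("base", "openai/whisper-base")]
demo_cached, demo_missing = split_by_cache(
    demo_models, lambda model_id: model_id == "openai/whisper-tiny"
)
```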
CUDA Utilities
is_cuda_available() -> bool
Check if CUDA GPU acceleration is available.
Returns:
bool: True if CUDA is available and the library was built with CUDA support
cuda_device_count() -> int
Get the number of available CUDA devices.
Returns:
int: Number of CUDA-capable GPUs (0 if CUDA is not available)
Data Types
TranscriptionResult
Transcription result container.
Properties:
- text: str - Full transcribed text
- language: str - Detected or specified language code
- segments: List[TranscriptionSegment] - List of timed segments
TranscriptionSegment
A single segment with timing.
Properties:
- start: float - Start time in seconds
- end: float - End time in seconds
- text: str - Segment text
Audio Loading & Saving
load_audio(path: str) -> AudioData
Load audio from any supported format.
save_audio(audio: AudioData, path: str) -> None
Save audio to WAV format.
Audio Analysis
analyze_audio(audio: AudioData) -> AudioStats
Analyze audio and return statistics.
AudioStats
Properties:
- rms: float - Root Mean Square amplitude
- peak: float - Peak amplitude
- rms_db: float - RMS level in dB
- peak_db: float - Peak level in dB
- zero_crossing_rate: float - Rate of zero crossings
- energy: float - Total energy
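To make these quantities concrete, here is a pure-Python sketch of how they are commonly defined. It is an assumption, not Antenna's implementation: the source does not state the dB reference, whether energy is a sum or a mean, or how the zero-crossing rate is normalized. Here dB is relative to full scale 1.0, energy is the sum of squares, and the zero-crossing rate is crossings per adjacent sample pair. `amplitude_to_db` and `basic_stats` are hypothetical names.

```python
import math


def amplitude_to_db(value: float, floor_db: float = -120.0) -> float:
    """Convert a linear amplitude to dB relative to 1.0, clamped at floor_db."""
    if value <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(value))


def basic_stats(samples):
    """Compute RMS, peak, their dB levels, zero-crossing rate and energy."""
    n = len(samples)
    if n == 0:
        raise ValueError("samples must not be empty")
    energy = sum(s * s for s in samples)      # sum of squared samples
    rms = math.sqrt(energy / n)
    peak = max(abs(s) for s in samples)
    # Count sign changes between neighbouring samples.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    zcr = crossings / (n - 1) if n > 1 else 0.0
    return {
        "rms": rms,
        "peak": peak,
        "rms_db": amplitude_to_db(rms),
        "peak_db": amplitude_to_db(peak),
        "zero_crossing_rate": zcr,
        "energy": energy,
    }


demo_stats = basic_stats([0.5, -0.5, 0.5, -0.5])
```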
Silence Detection
detect_silence(audio, threshold_db, min_duration) -> List[Tuple[float, float]]
Detect silence segments in audio.
trim_silence(audio, threshold_db) -> AudioData
Trim silence from beginning and end.
split_on_silence(audio, threshold_db, min_silence_duration) -> List[AudioData]
Split audio into chunks on silence regions.
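The idea behind threshold-based silence detection can be sketched in a few lines. This is a simplified illustration and an assumption about the general technique, not Antenna's algorithm: it splits mono samples into fixed frames, treats a frame as silent when its RMS falls below `threshold_db` (relative to full scale 1.0), and reports runs of silent frames that last at least `min_duration` seconds. The function name and the `frame_ms` parameter are hypothetical.

```python
import math


def detect_silence_sketch(samples, sample_rate, threshold_db=-40.0,
                          min_duration=0.1, frame_ms=10.0):
    """Return (start_s, end_s) tuples for silent runs in a list of mono samples."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    threshold = 10.0 ** (threshold_db / 20.0)   # dB -> linear amplitude
    regions = []
    run_start = None                            # sample index where silence began
    n = len(samples)
    for start in range(0, n, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms < threshold:
            if run_start is None:
                run_start = start
        elif run_start is not None:
            # A loud frame ends the current silent run.
            if (start - run_start) / sample_rate >= min_duration:
                regions.append((run_start / sample_rate, start / sample_rate))
            run_start = None
    # Close a silent run that reaches the end of the audio.
    if run_start is not None and (n - run_start) / sample_rate >= min_duration:
        regions.append((run_start / sample_rate, n / sample_rate))
    return regions


# 0.1 s of tone-level samples, 0.2 s of digital silence, 0.1 s of tone-level samples.
demo_signal = [0.5] * 100 + [0.0] * 200 + [0.5] * 100
demo_regions = detect_silence_sketch(demo_signal, sample_rate=1000)
```

Trimming and splitting follow from the same region list: trimming drops silent runs that touch the start or end, and splitting cuts at the interior ones.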
Audio Processing
preprocess_audio(audio, target_sample_rate=None, mono=None) -> AudioData
Preprocess audio data (resample, convert to mono).
normalize_audio(audio, method, target_db) -> AudioData
Normalize audio to target level.
Methods: "peak", "rms", "lufs"
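For the "peak" and "rms" methods, normalization reduces to a single gain factor derived from the difference between the current and target levels in dB. The sketch below shows that arithmetic for peak normalization only; it is an assumption about the usual technique rather than Antenna's code, and LUFS normalization (which needs perceptual weighting) is not covered. `gain_for_target` and `peak_normalize` are hypothetical names.

```python
import math


def gain_for_target(current_db: float, target_db: float) -> float:
    """Linear gain that moves a level from current_db to target_db."""
    return 10.0 ** ((target_db - current_db) / 20.0)


def peak_normalize(samples, target_db: float = -1.0):
    """Scale samples so the largest absolute value sits at target_db (re 1.0)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)          # silent input: nothing to scale
    gain = gain_for_target(20.0 * math.log10(peak), target_db)
    return [s * gain for s in samples]


demo_normalized = peak_normalize([0.25, -0.5], target_db=0.0)
```

RMS normalization uses the same gain formula with the RMS level in place of the peak level, though it may then need a limiter to avoid clipping.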
Data Types
AudioData
Properties:
- sample_rate: int - Sample rate in Hz
- channels: int - Number of audio channels
- duration: float - Duration in seconds
Methods:
to_numpy() -> np.ndarray- Convert to NumPy array
Testing
# Run all tests
cargo test && uv run pytest tests/ -v
# Run only Whisper tests
uv run pytest tests/test_whisper.py -v
# Run integration tests (downloads models)
uv run pytest tests/test_whisper.py -v -m slow
Development
Project Structure
antenna/
├── Cargo.toml              # Rust dependencies
├── pyproject.toml          # Python package config
├── src/
│   ├── lib.rs              # PyO3 bindings (Python interface)
│   ├── types.rs            # Core types (AudioData, TranscriptionResult)
│   ├── error.rs            # Error types
│   ├── audio/              # Audio processing
│   │   ├── mod.rs
│   │   ├── io.rs           # Audio I/O
│   │   ├── analysis.rs     # Audio analysis
│   │   ├── process.rs      # Preprocessing
│   │   └── silence.rs      # Silence detection
│   └── ml/                 # Machine learning (NEW in v0.3.0)
│       ├── mod.rs
│       ├── tokenizer.rs    # Whisper tokenizer
│       └── whisper/
│           ├── mod.rs
│           ├── model.rs    # Model loading
│           ├── config.rs   # Model configurations
│           ├── inference.rs # Transcription engine
│           └── decode.rs   # Beam search decoder
├── python/
│   └── antenna/
│       └── __init__.py     # Python package entry
├── tests/
│   ├── test_basic.py       # Audio processing tests
│   └── test_whisper.py     # Whisper tests (NEW)
└── examples/
    ├── whisper_demo.py     # Whisper transcription demo (NEW)
    ├── generate_test_audio.py
    └── audio_processing_demo.py
Building
# Development build
uv run maturin develop
# Release build (optimized)
uv run maturin develop --release
# Build wheel
uv run maturin build --release
Model Selection Guide
| Model | Parameters | Speed | Quality | Best For |
|---|---|---|---|---|
| tiny | ~39M | Fastest | Lower | Quick tests, low resources |
| base | ~74M | Fast | Good | General use, balanced |
| small | ~244M | Medium | Better | Better accuracy needed |
| medium | ~769M | Slower | High | High accuracy needed |
| large | ~1.55B | Slow | Highest | Best quality needed |
| large-v2 | ~1.55B | Slow | Higher | Improved large model |
| large-v3 | ~1.55B | Slow | Best | Latest, best quality |
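The approximate parameter counts in the table above give a rough lower bound on the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes 4 bytes per parameter (32-bit floats), which may not match how Antenna loads models, and it ignores activations, decoding state, and audio buffers. `APPROX_PARAMS`, `estimate_weight_mib`, and `largest_model_within` are hypothetical names.

```python
# Approximate Whisper parameter counts (large, large-v2 and large-v3 share one size).
APPROX_PARAMS = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "large": 1_550_000_000,
    "large-v2": 1_550_000_000,
    "large-v3": 1_550_000_000,
}


def estimate_weight_mib(size: str, bytes_per_param: int = 4) -> float:
    """Rough weight footprint in MiB: parameters x bytes per parameter."""
    return APPROX_PARAMS[size] * bytes_per_param / (1024 * 1024)


def largest_model_within(budget_mib: float, bytes_per_param: int = 4):
    """Largest size whose estimated weights fit the budget, or None.

    Ties between equally sized models resolve to the first one listed.
    """
    fitting = [
        size for size in APPROX_PARAMS
        if estimate_weight_mib(size, bytes_per_param) <= budget_mib
    ]
    return max(fitting, key=APPROX_PARAMS.get) if fitting else None
```

Real memory use will be higher than these figures, so treat the result as a starting point and leave headroom.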
Supported Languages
Whisper supports 99 languages including:
- English, Spanish, French, German, Italian, Portuguese
- Chinese, Japanese, Korean
- Arabic, Hindi, Russian
- And 80+ more...
Troubleshooting
Model Download Issues
Problem: Model download fails or is slow
Solution: Check your internet connection. Models are cached after first download.
Problem: Out of memory when loading large models
Solution: Use a smaller model (tiny, base, small) or ensure sufficient RAM.
Transcription Issues
Problem: Poor transcription quality
Solution:
- Ensure audio is clear without excessive background noise
- Try a larger model
- Use preprocess_for_whisper() to ensure correct format
Problem: Wrong language detected
Solution: Specify the language explicitly: model.transcribe(audio, language="en")
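When the set of plausible languages is known in advance, the explicit `language` argument can be applied automatically whenever detection lands outside that set. This is a sketch of one possible policy, not an Antenna feature: `transcribe_with_expected_language` is a hypothetical helper that relies only on the documented `detect_language(audio)` and `transcribe(audio, language=...)` methods, and `StubModel` is a stand-in used solely so the snippet runs without downloading a model.

```python
def transcribe_with_expected_language(model, audio, expected):
    """Transcribe audio, overriding detection when it falls outside `expected`.

    expected: non-empty list of language codes, most likely first.
    """
    if not expected:
        raise ValueError("expected must contain at least one language code")
    detected = model.detect_language(audio)
    language = detected if detected in expected else expected[0]
    return model.transcribe(audio, language=language)


class StubModel:
    """Stand-in for a loaded model; echoes back the language it was asked to use."""

    def __init__(self, detected):
        self._detected = detected

    def detect_language(self, audio):
        return self._detected

    def transcribe(self, audio, language=None):
        return language


# Detection says "cy", which is not expected, so "en" is forced.
demo_forced = transcribe_with_expected_language(StubModel("cy"), None, ["en", "es"])
# Detection says "es", which is expected, so it is kept.
demo_kept = transcribe_with_expected_language(StubModel("es"), None, ["en", "es"])
```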
Roadmap
v0.3.0 - Whisper Integration ✅
- Candle ML integration
- Whisper model loading from HuggingFace
- CPU transcription
- Model caching
- Language detection
- Translation to English
- Beam search decoding
- GPU support (CUDA)
v0.4.0 - Streaming & Async
- Async API
- Streaming API
- Real-time transcription
- Batch processing
v0.5.0 - Production Ready
- Voice Activity Detection (VAD)
- Advanced preprocessing options
- Metal support for macOS
License
MIT
Acknowledgments
- PyO3 - Rust-Python bindings
- maturin - Build tool
- Candle - ML framework
- HuggingFace Hub - Model hosting
- rubato - Audio resampling
- symphonia - Multi-format audio decoding
- OpenAI Whisper - Original Whisper model
File details
Details for the file antenna_stt-0.3.0.tar.gz.
File metadata
- Download URL: antenna_stt-0.3.0.tar.gz
- Upload date:
- Size: 393.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 111a4ed604b34a02105ec4f59bba32899618193f4f2ca3d1e3bfdc5a7265a6ad |
| MD5 | 2b305688636278b65e0016bb910ee83d |
| BLAKE2b-256 | a208e3f9125608ac3ed901e5889e4a0fc94bfcc5133833acc855cb3c48ba99f5 |
File details
Details for the file antenna_stt-0.3.0-cp313-cp313-manylinux_2_39_x86_64.whl.
File metadata
- Download URL: antenna_stt-0.3.0-cp313-cp313-manylinux_2_39_x86_64.whl
- Upload date:
- Size: 6.6 MB
- Tags: CPython 3.13, manylinux: glibc 2.39+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a8e97bd84b7346bc9b26e6fd39d133bc27c9c514e51a16317cb01c08294288e |
| MD5 | 32584aee6e2b3be01a3e662591c9c5a2 |
| BLAKE2b-256 | ad9cd1083abf01aa892a8b106cbd4e4a808d476143e2190917c64724e5815d9d |