
Rust-powered Speech-to-Text toolkit with Whisper, Distil-Whisper, and streaming transcription

Reason this release was yanked:

discontinued


Antenna

Rust-powered Speech-to-Text toolkit for Python with Whisper integration.

Status

Antenna is currently at v0.3.0 (Whisper Integration), which provides speech-to-text transcription using OpenAI's Whisper models via the Candle ML framework.

Features

v0.3.0 - Whisper Integration

  • Whisper Models: Load any Whisper model (tiny, base, small, medium, large, large-v2, large-v3)
  • Transcription: Convert speech to text with timestamps
  • Translation: Translate any language to English
  • Language Detection: Automatic detection of 99 languages
  • Model Caching: Automatic caching for faster subsequent loads
  • GPU Acceleration: Full CUDA support for NVIDIA GPUs (3-10x faster than CPU)
  • CPU Support: Full functionality on CPU for systems without GPU

v0.2.0 - Enhanced Audio Foundation

  • Multi-Format Support: Load MP3, FLAC, OGG, M4A, and WAV files
  • Audio Analysis: RMS, peak, zero-crossing rate, energy calculations
  • Silence Detection: Detect, trim, and split audio on silence
  • Audio Normalization: Peak, RMS, and LUFS normalization
  • Save Functionality: Export processed audio to WAV format
  • Format Conversion: Convert stereo to mono
  • Resampling: High-quality resampling using sinc interpolation
  • NumPy Integration: Seamless conversion to NumPy arrays

Planned

  • v0.4.0: Real-time streaming and async API
  • v0.5.0: Voice Activity Detection (VAD) and advanced preprocessing

Installation

Prerequisites

  • Python 3.8+
  • Rust (will be installed automatically by maturin if not present)
  • uv package manager

GPU Prerequisites (Optional)

For CUDA GPU acceleration:

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit installed (provides nvcc compiler)

Development Installation

# Clone the repository
cd antenna

# Create virtual environment with uv
uv venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows

# Install in development mode
uv add --dev maturin

# CPU-only build
uv run maturin develop --release

# OR build with CUDA/GPU support
uv run maturin develop --release --features cuda

# Install dependencies
uv add numpy
uv add --dev pytest

Quick Start

Speech-to-Text Transcription (NEW in v0.3.0!)

import antenna

# Quick transcription with convenience function
result = antenna.transcribe("speech.wav", model_size="base")
print(result.text)

# Or step by step with more control:

# 1. Load and preprocess audio
audio = antenna.load_audio("podcast.mp3")
audio = antenna.preprocess_for_whisper(audio)  # Converts to 16kHz mono

# 2. Load Whisper model
model = antenna.WhisperModel.from_size("base")  # tiny, base, small, medium, large

# 3. Transcribe
result = model.transcribe(audio)
print(f"Language: {result.language}")
print(f"Text: {result.text}")

# 4. Access segments with timestamps
for segment in result.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")

Translation to English

import antenna

# Load audio in any language
audio = antenna.load_audio("spanish_speech.wav")
audio = antenna.preprocess_for_whisper(audio)

# Translate to English
model = antenna.WhisperModel.from_size("base")
result = model.translate(audio)
print(result.text)  # English translation

Language Detection

import antenna

audio = antenna.load_audio("mystery_language.wav")
audio = antenna.preprocess_for_whisper(audio)

model = antenna.WhisperModel.from_size("base")
language = model.detect_language(audio)
print(f"Detected language: {language}")  # e.g., "en", "es", "zh"

GPU Acceleration

import antenna

# Check if CUDA is available
if antenna.is_cuda_available():
    print(f"CUDA available with {antenna.cuda_device_count()} device(s)")

    # Load model on GPU (3-10x faster than CPU)
    model = antenna.WhisperModel.from_size("base", device="cuda")

    # Alternative device strings (pick one; each from_size call loads the model again):
    #   device="cuda:0"  selects a specific GPU
    #   device="gpu"     is an alias for "cuda"
else:
    # Fallback to CPU
    model = antenna.WhisperModel.from_size("base", device="cpu")

# Transcription works the same way
audio = antenna.load_audio("speech.wav")
audio = antenna.preprocess_for_whisper(audio)
result = model.transcribe(audio)
print(result.text)

Audio Processing

import antenna

# Load audio file (supports WAV, MP3, FLAC, OGG, M4A)
audio = antenna.load_audio("podcast.mp3")
print(f"Sample rate: {audio.sample_rate} Hz")
print(f"Channels: {audio.channels}")
print(f"Duration: {audio.duration:.2f}s")

# Analyze audio
stats = antenna.analyze_audio(audio)
print(f"RMS: {stats.rms:.4f}, Peak: {stats.peak:.4f}")
print(f"RMS (dB): {stats.rms_db:.2f}, Peak (dB): {stats.peak_db:.2f}")

# Clean up audio
audio = antenna.trim_silence(audio, threshold_db=-40)
audio = antenna.normalize_audio(audio, method="rms", target_db=-20)

# Preprocess for Whisper (16kHz, mono)
audio = antenna.preprocess_audio(audio, target_sample_rate=16000, mono=True)

# Save processed audio
antenna.save_audio(audio, "processed.wav")

# Access as NumPy array
samples = audio.to_numpy()
print(f"Shape: {samples.shape}, dtype: {samples.dtype}")

Running the Demo

# Run Whisper transcription demo
uv run python examples/whisper_demo.py [audio_file]

# Generate a test audio file
uv run python examples/generate_test_audio.py

# Run audio processing demo
uv run python examples/audio_processing_demo.py

API Reference

Whisper Speech-to-Text

WhisperModel.from_size(size: str, device: str = "cpu") -> WhisperModel

Load a Whisper model by size name.

Parameters:

  • size: Model size - "tiny", "base", "small", "medium", "large", "large-v2", "large-v3"
  • device: Device to run on - "cpu", "cuda" (alias "gpu"), or a specific GPU such as "cuda:0"

Returns:

  • WhisperModel: Loaded model ready for transcription

WhisperModel.from_pretrained(model_id: str, device: str = "cpu") -> WhisperModel

Load a Whisper model from HuggingFace Hub.

Parameters:

  • model_id: HuggingFace model ID (e.g., "openai/whisper-tiny")
  • device: Device to run on - "cpu" or "cuda"

Returns:

  • WhisperModel: Loaded model ready for transcription

WhisperModel.transcribe(audio, language=None, task=None, beam_size=5, timestamps=True) -> TranscriptionResult

Transcribe audio to text.

Parameters:

  • audio: AudioData object (must be 16kHz mono, use preprocess_for_whisper())
  • language: Language code (e.g., "en", "es"). None for auto-detection.
  • task: "transcribe" or "translate" (translate to English)
  • beam_size: Beam size for decoding (default: 5, use 1 for greedy)
  • timestamps: Whether to include timestamps (default: True)

Returns:

  • TranscriptionResult: Object with text, language, and segments
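
To illustrate what beam_size controls, here is a toy sketch of beam search. It is not Antenna's decoder: toy_model and its probabilities are invented for this example. It only shows why beam_size=1 (greedy) can miss a sequence that a wider beam finds.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens decoded so far.
# These numbers are invented for illustration only.
TOY_PROBS = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_model(prefix):
    """Return log-probabilities of each possible next token."""
    return {tok: math.log(p) for tok, p in TOY_PROBS[prefix].items()}

def beam_search(model, beam_size, length):
    """Keep the `beam_size` best partial sequences at every step."""
    beams = [((), 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(length):
        candidates = [
            (prefix + (tok,), score + logp)
            for prefix, score in beams
            for tok, logp in model(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1, length=2)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2, length=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

In this toy, greedy decoding commits to "a" first and ends with a total probability of 0.33, while a beam of 2 keeps "b" alive and finds the 0.36 sequence. Wider beams cost more computation per step.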

WhisperModel.translate(audio) -> TranscriptionResult

Translate audio to English.

Parameters:

  • audio: AudioData object (must be 16kHz mono)

Returns:

  • TranscriptionResult: English translation with timestamps

WhisperModel.detect_language(audio) -> str

Detect the language of audio.

Parameters:

  • audio: AudioData object (must be 16kHz mono)

Returns:

  • str: Language code (e.g., "en", "es", "zh")

preprocess_for_whisper(audio: AudioData) -> AudioData

Preprocess audio for Whisper (convert to 16kHz mono).

Parameters:

  • audio: Input AudioData object

Returns:

  • AudioData: Audio ready for Whisper (16kHz, mono)

transcribe(audio_path, model_size="base", language=None, device="cpu") -> TranscriptionResult

Convenience function to transcribe an audio file in one call.

Parameters:

  • audio_path: Path to audio file
  • model_size: Whisper model size
  • language: Language code (None for auto-detection)
  • device: "cpu" or "cuda"

Returns:

  • TranscriptionResult: Transcription result

list_whisper_models() -> List[Tuple[str, str]]

List available Whisper model sizes and their HuggingFace IDs.

is_model_cached(model_id: str) -> bool

Check if a model is cached locally.

CUDA Utilities

is_cuda_available() -> bool

Check if CUDA GPU acceleration is available.

Returns:

  • bool: True if CUDA is available and the library was built with CUDA support

cuda_device_count() -> int

Get the number of available CUDA devices.

Returns:

  • int: Number of CUDA-capable GPUs (0 if CUDA is not available)
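
The two utilities above can feed a small device-selection helper. The sketch below is a hypothetical helper, not part of Antenna; it takes the two values as plain arguments so it runs without the library. It assumes device strings of the form "cuda:N" work for any valid index, which the README only shows for "cuda:0".

```python
def choose_device(cuda_available: bool, device_count: int, index: int = 0) -> str:
    """Return a device string: "cpu", or "cuda:N" for a valid GPU index."""
    if not cuda_available or device_count <= 0:
        return "cpu"
    if not 0 <= index < device_count:
        # Requested GPU does not exist; fall back to the first one.
        index = 0
    return f"cuda:{index}"

# With Antenna installed, the two arguments would come from
# antenna.is_cuda_available() and antenna.cuda_device_count().
print(choose_device(False, 0))
print(choose_device(True, 2, index=1))
```

The returned string could then be passed as the device argument of WhisperModel.from_size().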

Data Types

TranscriptionResult

Transcription result container.

Properties:

  • text: str - Full transcribed text
  • language: str - Detected or specified language code
  • segments: List[TranscriptionSegment] - List of timed segments

TranscriptionSegment

A single segment with timing.

Properties:

  • start: float - Start time in seconds
  • end: float - End time in seconds
  • text: str - Segment text
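
Because each segment carries start, end, and text, segments map naturally onto subtitle cues. The following sketch renders them in SubRip (SRT) layout. Segment is a stand-in dataclass with the same three fields as TranscriptionSegment, and both helper functions are hypothetical names for this example, not Antenna APIs.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Stand-in with the same fields as TranscriptionSegment."""
    start: float
    end: float
    text: str

def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SubRip timestamp layout)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments) -> str:
    """Render segments as numbered SubRip cues separated by blank lines."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        cues.append(
            f"{number}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)

demo = [Segment(0.0, 2.5, " Hello there."), Segment(2.5, 3661.25, " Goodbye.")]
print(segments_to_srt(demo))
```

With a real result, segments_to_srt(result.segments) should work unchanged as long as the segment objects expose the three properties listed above.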

Audio Loading & Saving

load_audio(path: str) -> AudioData

Load audio from any supported format.

save_audio(audio: AudioData, path: str) -> None

Save audio to WAV format.

Audio Analysis

analyze_audio(audio: AudioData) -> AudioStats

Analyze audio and return statistics.

AudioStats

Properties:

  • rms: float - Root Mean Square amplitude
  • peak: float - Peak amplitude
  • rms_db: float - RMS level in dB
  • peak_db: float - Peak level in dB
  • zero_crossing_rate: float - Rate of zero crossings
  • energy: float - Total energy
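
For readers unfamiliar with these measures, the sketch below computes them in plain Python using common textbook definitions. It assumes a non-empty list of mono samples in the range [-1.0, 1.0]. Antenna's exact definitions (for example, how the zero-crossing rate is normalized) are not stated in this README and may differ.

```python
import math

def audio_stats(samples):
    """Compute basic statistics for mono samples in the range [-1.0, 1.0]."""
    n = len(samples)
    energy = sum(s * s for s in samples)  # sum of squared samples
    rms = math.sqrt(energy / n)
    peak = max(abs(s) for s in samples)
    # Count sign changes between neighbouring samples.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )

    def to_db(x):
        # 20*log10(x) relative to full scale; silence maps to -inf.
        return 20 * math.log10(x) if x > 0 else float("-inf")

    return {
        "rms": rms,
        "peak": peak,
        "rms_db": to_db(rms),
        "peak_db": to_db(peak),
        "zero_crossing_rate": crossings / max(n - 1, 1),
        "energy": energy,
    }

square_wave = [0.5, -0.5] * 4
stats = audio_stats(square_wave)
print(stats)
```

For this square wave, RMS and peak are both 0.5 (about -6 dB), and every neighbouring pair changes sign, so the zero-crossing rate is 1.0.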

Silence Detection

detect_silence(audio, threshold_db, min_duration) -> List[Tuple[float, float]]

Detect silence segments in audio.

trim_silence(audio, threshold_db) -> AudioData

Trim silence from beginning and end.

split_on_silence(audio, threshold_db, min_silence_duration) -> List[AudioData]

Split audio into chunks on silence regions.
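
A rough sketch of how threshold-based silence detection can work, in plain Python. This is one common approach (frame-wise RMS compared against a dB threshold), not necessarily what Antenna does internally. find_silence, speech_spans, and the frame_ms parameter are hypothetical names for this example.

```python
import math

def find_silence(samples, sample_rate, threshold_db=-40.0,
                 min_duration=0.1, frame_ms=10):
    """Return (start, end) times in seconds of runs quieter than threshold_db."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    threshold = 10 ** (threshold_db / 20)  # dB -> linear amplitude
    regions, run_start = [], None
    n_frames = math.ceil(len(samples) / frame_len)
    # One extra iteration with an empty frame closes a run that reaches the end.
    for i in range(n_frames + 1):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        quiet = bool(frame) and math.sqrt(
            sum(s * s for s in frame) / len(frame)) < threshold
        t = min(i * frame_len, len(samples)) / sample_rate
        if quiet and run_start is None:
            run_start = t
        elif not quiet and run_start is not None:
            if t - run_start >= min_duration:
                regions.append((run_start, t))
            run_start = None
    return regions

def speech_spans(regions, total_duration):
    """Complement of the silent regions: the spans a splitter would keep."""
    spans, cursor = [], 0.0
    for start, end in regions:
        if start > cursor:
            spans.append((cursor, start))
        cursor = end
    if cursor < total_duration:
        spans.append((cursor, total_duration))
    return spans

rate = 1000
signal = [0.5] * 200 + [0.0] * 300 + [0.5] * 200  # loud, silent, loud
silent = find_silence(signal, rate)
print(silent)
print(speech_spans(silent, len(signal) / rate))
```

Trimming corresponds to dropping silent regions that touch the start or end, and splitting corresponds to cutting at the regions in between.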

Audio Processing

preprocess_audio(audio, target_sample_rate=None, mono=None) -> AudioData

Preprocess audio data (resample, convert to mono).
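
The two operations can be sketched in plain Python as follows. Note the swap: the features list says Antenna resamples with sinc interpolation, while this example uses linear interpolation because it fits in a few lines. The function names and the interleaved-sample layout are assumptions for illustration.

```python
def to_mono(interleaved, channels):
    """Average interleaved channel samples frame by frame."""
    return [
        sum(interleaved[i:i + channels]) / channels
        for i in range(0, len(interleaved) - channels + 1, channels)
    ]

def resample_linear(samples, src_rate, dst_rate):
    """Resample with linear interpolation (a simpler stand-in for sinc)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for j in range(out_len):
        pos = j * src_rate / dst_rate        # position in source samples
        i = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = min(pos - i, 1.0)             # clamp so the tail never extrapolates
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out

stereo = [1.0, 0.0, 0.5, 0.5]  # two frames of left/right samples
print(to_mono(stereo, channels=2))
print(resample_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 48000, 16000))
```

Linear interpolation applies no low-pass filter, so downsampling real audio this way can alias. That is one reason a sinc-based resampler is the better choice in practice.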

normalize_audio(audio, method, target_db) -> AudioData

Normalize audio to target level.

Methods: "peak", "rms", "lufs"
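
As a sketch of what the "peak" and "rms" methods typically mean, the example below computes a single gain that moves the measured level to target_db (relative to full scale). It is a plain-Python illustration under that assumption, not Antenna's implementation. The "lufs" method is left out because loudness measurement per ITU-R BS.1770 requires K-weighting filters and gating.

```python
import math

def normalize(samples, method="peak", target_db=-1.0):
    """Scale samples so their peak or RMS level lands on target_db (dBFS)."""
    if method == "peak":
        level = max(abs(s) for s in samples)
    elif method == "rms":
        level = math.sqrt(sum(s * s for s in samples) / len(samples))
    else:
        raise ValueError(f"unsupported method in this sketch: {method!r}")
    if level == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = 10 ** (target_db / 20) / level  # target amplitude / current level
    return [s * gain for s in samples]

quiet = [0.1, -0.2, 0.05]
louder = normalize(quiet, method="peak", target_db=-6.0)
print(max(abs(s) for s in louder))
```

RMS normalization can push individual peaks above 1.0 on material with a high crest factor, so a real pipeline may need a limiter or a clipping check afterwards.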

Data Types

AudioData

Properties:

  • sample_rate: int - Sample rate in Hz
  • channels: int - Number of audio channels
  • duration: float - Duration in seconds

Methods:

  • to_numpy() -> np.ndarray - Convert to NumPy array

Testing

# Run all tests
cargo test && uv run pytest tests/ -v

# Run only Whisper tests
uv run pytest tests/test_whisper.py -v

# Run integration tests (downloads models)
uv run pytest tests/test_whisper.py -v -m slow

Development

Project Structure

antenna/
├── Cargo.toml              # Rust dependencies
├── pyproject.toml          # Python package config
├── src/
│   ├── lib.rs              # PyO3 bindings (Python interface)
│   ├── types.rs            # Core types (AudioData, TranscriptionResult)
│   ├── error.rs            # Error types
│   ├── audio/              # Audio processing
│   │   ├── mod.rs
│   │   ├── io.rs           # Audio I/O
│   │   ├── analysis.rs     # Audio analysis
│   │   ├── process.rs      # Preprocessing
│   │   └── silence.rs      # Silence detection
│   └── ml/                 # Machine learning (NEW in v0.3.0)
│       ├── mod.rs
│       ├── tokenizer.rs    # Whisper tokenizer
│       └── whisper/
│           ├── mod.rs
│           ├── model.rs    # Model loading
│           ├── config.rs   # Model configurations
│           ├── inference.rs # Transcription engine
│           └── decode.rs   # Beam search decoder
├── python/
│   └── antenna/
│       └── __init__.py     # Python package entry
├── tests/
│   ├── test_basic.py       # Audio processing tests
│   └── test_whisper.py     # Whisper tests (NEW)
└── examples/
    ├── whisper_demo.py     # Whisper transcription demo (NEW)
    ├── generate_test_audio.py
    └── audio_processing_demo.py

Building

# Development build
uv run maturin develop

# Release build (optimized)
uv run maturin develop --release

# Build wheel
uv run maturin build --release

Model Selection Guide

Model      Parameters   Speed     Quality   Best For
tiny       ~39M         Fastest   Lower     Quick tests, low resources
base       ~74M         Fast      Good      General use, balanced
small      ~244M        Medium    Better    Better accuracy needed
medium     ~769M        Slower    High      High accuracy needed
large      ~1.5B        Slow      Highest   Best quality needed
large-v2   ~1.5B        Slow      Higher    Improved large model
large-v3   ~1.5B        Slow      Best      Latest, best quality
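
As a rough aid for choosing a model, the sketch below turns the table's sizes into a weights-only memory estimate. It assumes the sizes are parameter counts and that weights are stored as 32-bit floats (4 bytes each). The README does not state which precision Antenna uses, and activations and decoding buffers add to the total. The helper names are hypothetical.

```python
# Approximate parameter counts in millions, taken from the table above.
PARAMS_MILLIONS = {
    "tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1500,
}

def weight_memory_mb(model: str, bytes_per_param: int = 4) -> int:
    """Estimate memory for the weights alone, in megabytes (10**6 bytes)."""
    # millions of parameters * bytes per parameter = megabytes
    return PARAMS_MILLIONS[model] * bytes_per_param

def largest_model_within(budget_mb: float, bytes_per_param: int = 4):
    """Return the largest model whose weights fit the budget, or None."""
    fitting = [
        name for name in PARAMS_MILLIONS
        if weight_memory_mb(name, bytes_per_param) <= budget_mb
    ]
    return max(fitting, key=PARAMS_MILLIONS.get) if fitting else None

for name in PARAMS_MILLIONS:
    print(f"{name}: ~{weight_memory_mb(name)} MB")
print(largest_model_within(1000))
```

Treat the result as a lower bound: a model that barely fits on paper may still run out of memory during inference.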

Supported Languages

Whisper supports 99 languages including:

  • English, Spanish, French, German, Italian, Portuguese
  • Chinese, Japanese, Korean
  • Arabic, Hindi, Russian
  • And more than 80 others

Troubleshooting

Model Download Issues

Problem: Model download fails or is slow

Solution: Check your internet connection. Models are cached after first download.

Problem: Out of memory when loading large models

Solution: Use a smaller model (tiny, base, small) or ensure sufficient RAM.

Transcription Issues

Problem: Poor transcription quality

Solution:

  • Ensure audio is clear without excessive background noise
  • Try a larger model
  • Use preprocess_for_whisper() to ensure correct format

Problem: Wrong language detected

Solution: Specify the language explicitly: model.transcribe(audio, language="en")

Roadmap

v0.3.0 - Whisper Integration ✅

  • Candle ML integration
  • Whisper model loading from HuggingFace
  • CPU transcription
  • Model caching
  • Language detection
  • Translation to English
  • Beam search decoding
  • GPU support (CUDA)

v0.4.0 - Streaming & Async

  • Async API
  • Streaming API
  • Real-time transcription
  • Batch processing

v0.5.0 - Production Ready

  • Voice Activity Detection (VAD)
  • Advanced preprocessing options
  • Metal support for macOS

License

MIT

Acknowledgments

