A small Python transcription helper using OpenAI speech-to-text APIs.
whisper-smith
whisper-smith is a small Python CLI/app helper for transcribing audio files with OpenAI speech-to-text models.
Features
- Transcribe local audio files
- CLI-first workflow for quick terminal use
- Output as txt, json, srt, or vtt
- Automatically infer output format from output file extension
- Load environment variables from .env
Run on Google Colab (free GPU)
No local setup needed. Open the notebook directly in Colab and run the full speaker-aligned transcript pipeline on a free T4 GPU.
The notebook covers: install → set API keys → upload audio → run pipeline → download result. Use a GPU runtime for the best diarization performance. The notebook also includes an advanced GPU pipeline for explicitly moving the pyannote model to CUDA.
Requirements
- Python 3.10+
- An OpenAI API key (OPENAI_API_KEY)
- For large-file fallback: either system ffmpeg in PATH, or Python package imageio-ffmpeg
- For optional speaker diarization: a Hugging Face token (HUGGINGFACE_TOKEN) and pyannote.audio
Installation
Option 1: uv (recommended)
uv sync
Option 2: pip
pip install -e .
Optional speaker diarization dependencies
uv sync --extra diarize
or:
pip install -e ".[diarize]"
Configuration
Set your OpenAI API key (and, for diarization, your Hugging Face token) in the environment or in a .env file:
export OPENAI_API_KEY="your_api_key_here"
export HUGGINGFACE_TOKEN="your_huggingface_token_here"
Or create a .env file in the project root:
OPENAI_API_KEY=your_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here
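How whisper-smith reads .env internally is not described here. As a rough, standard-library-only sketch of the usual behaviour for such files, the hypothetical helper load_env_file below (not part of the package) reads KEY=VALUE lines and assumes that variables already exported in the shell take precedence over the file:

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Hypothetical helper: read KEY=VALUE lines and set missing env vars."""
    loaded: dict[str, str] = {}
    if not path.exists():
        return loaded
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("\"'")
        # Assumption: values already present in the environment win.
        if key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling load_env_file(Path(".env")) before transcription would then make OPENAI_API_KEY and HUGGINGFACE_TOKEN available through os.environ.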
CLI Usage Guide
Basic command:
whisper-smith <audio_path>
Show help:
whisper-smith --help
1) Print transcript to terminal (default txt)
whisper-smith data/sample.m4a
2) Save transcript to a file
whisper-smith data/sample.m4a --output data/sample.txt
3) Choose output format explicitly
whisper-smith data/sample.m4a --format json --output data/sample.json
Supported CLI formats: txt, json, srt, vtt
4) Let format be inferred from output extension
whisper-smith data/sample.m4a --output data/sample.srt
5) Overwrite existing file
whisper-smith data/sample.m4a --output data/sample.txt --overwrite
6) Run speaker diarization
whisper-smith data/sample.m4a --diarize --output data/sample.diarization.json
Diarization currently supports JSON output only. Optional speaker hints:
whisper-smith data/sample.m4a --diarize --format json --num-speakers 2
7) Create speaker-aligned transcript JSON
Run the full pipeline from one audio file:
whisper-smith data/sample.m4a --align --output data/sample.aligned.json
This writes the main aligned transcript JSON to data/sample.aligned.json and
also writes intermediate artifacts beside it:
data/sample.transcript.json
data/sample.diarization.json
To put the intermediate artifacts in a separate directory:
whisper-smith data/sample.m4a --align --output data/sample.aligned.json --artifacts-dir data/artifacts
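The alignment step combines the two intermediate artifacts: timed transcript segments and timed speaker turns. The exact method whisper-smith uses is not stated here; one common approach, sketched below under that caveat, is to give each transcript segment the speaker whose turn overlaps it the most. The Span type and the output keys are illustrative only and do not reflect the tool's actual JSON schema.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds
    end: float    # seconds
    label: str    # transcript text or speaker name


def overlap(a: Span, b: Span) -> float:
    """Length in seconds of the time shared by two spans (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(transcript: list[Span], turns: list[Span]) -> list[dict]:
    """Tag each transcript segment with the speaker turn it overlaps most."""
    aligned = []
    for seg in transcript:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        # Leave the speaker unset when no turn overlaps the segment at all.
        has_overlap = best is not None and overlap(seg, best) > 0
        aligned.append({
            "start": seg.start,
            "end": seg.end,
            "speaker": best.label if has_overlap else None,
            "text": seg.label,
        })
    return aligned
```

Maximum overlap is simple and robust to small boundary mismatches, but it assigns a single speaker per segment even when two people talk within it.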
Python Usage
from pathlib import Path
from whisper_smith.transcribe import transcribe_audio
from whisper_smith.exporters import export_transcript
# Transcribe a local audio file and print the plain-text transcript.
result = transcribe_audio(Path("data/sample.m4a"))
print(result.text)
# Render the same result as SRT subtitles and save it next to the audio.
srt = export_transcript(result, "srt")
Path("data/sample.srt").write_text(srt, encoding="utf-8")
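For readers unfamiliar with the SRT format, the sketch below shows what turning timed segments into SRT cues involves. It is not the package's export_transcript implementation; Segment, srt_timestamp, and to_srt are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        times = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{index}\n{times}\n{seg.text.strip()}\n")
    return "\n".join(cues)
```

VTT output is structurally similar, but starts with a WEBVTT header line and uses a period instead of a comma before the milliseconds.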
Speaker diarization
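from pathlib import Path
from whisper_smith.diarize import diarize_audio
# Requires the optional diarization dependencies and a Hugging Face token.
result = diarize_audio(Path("data/sample.m4a"))
# Each segment carries start and end times plus a speaker label.
for segment in result.segments:
    print(segment.start, segment.end, segment.speaker)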
diarize_audio uses HUGGINGFACE_TOKEN from the environment, or accepts
hf_token="..." explicitly.
The default local model is pyannote/speaker-diarization-3.1, which is compatible
with the Intel macOS dependency set. You may pass a different model explicitly
from Python or with --diarization-model when running on a newer platform.
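Once diarization segments are available, simple summaries are easy to derive from them. As a small illustration, the sketch below totals speaking time per speaker. SpeakerSegment is a stand-in that only mirrors the start, end, and speaker attributes used in the example above; it is not the package's own class.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    """Stand-in with the three attributes used in the diarization example."""
    start: float  # seconds
    end: float    # seconds
    speaker: str


def talk_time(segments: list[SpeakerSegment]) -> dict[str, float]:
    """Sum the duration of every segment per speaker label, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)
```

Because the function only reads those three attributes, it should also accept the objects in result.segments, assuming they expose them as shown.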
Notes
- If --output is omitted, transcript is printed to stdout.
- If --format is omitted, format is inferred from --output extension when possible.
- If an output file already exists, add --overwrite to replace it.
- Transcription uses a timestamp-capable OpenAI model by default so JSON, SRT, and VTT outputs have segment timestamps.
- For large audio files, whisper-smith automatically splits audio into chunks and merges transcript text.
- If diarization fails with torchaudio missing AudioMetaData, refresh the optional diarization dependencies with uv lock --upgrade-package torch --upgrade-package torchaudio and then uv sync --extra diarize.
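The large-file note above does not say how chunks are chosen or merged. The sketch below shows one plausible shape for that logic: fixed-length windows, and segment times shifted by each chunk's start when results are concatenated. Both helpers are hypothetical, and the timestamp shifting is an assumption, since the note only mentions merging transcript text.

```python
def plan_chunks(duration_s: float, chunk_s: float) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive windows of at most chunk_s seconds."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def merge_chunk_segments(chunks, per_chunk_segments):
    """Shift each chunk's (start, end, text) tuples by the chunk start and concatenate."""
    merged = []
    for (offset, _), segments in zip(chunks, per_chunk_segments):
        for start, end, text in segments:
            # Chunk-local times become times relative to the full recording.
            merged.append((start + offset, end + offset, text))
    return merged


def merge_text(per_chunk_texts: list[str]) -> str:
    """Join per-chunk transcript text with single spaces, skipping empty chunks."""
    return " ".join(t.strip() for t in per_chunk_texts if t.strip())
```

Splitting at fixed offsets can cut a word in half at a boundary; a real implementation might overlap windows or split on silence, which this sketch does not attempt.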
Development
Run tests:
pytest