A small Python transcription helper using OpenAI speech-to-text APIs.

whisper-smith

whisper-smith is a small Python command-line tool and helper library for transcribing audio files with OpenAI speech-to-text models.

Features

  • Transcribe local audio files
  • CLI-first workflow for quick terminal use
  • Output as txt, json, srt, or vtt
  • Automatically infer output format from output file extension
  • Load environment variables from .env

Run on Google Colab (free GPU)

No local setup needed. Open the notebook directly in Colab and run the full speaker-aligned transcript pipeline on a free T4 GPU:

Open in Colab

The notebook covers: install → set API keys → upload audio → run pipeline → download result. Use a GPU runtime for the best diarization performance. The notebook also includes an advanced GPU pipeline for explicitly moving the pyannote model to CUDA.

Requirements

  • Python 3.10+
  • An OpenAI API key (OPENAI_API_KEY)
  • For the large-file fallback: either a system ffmpeg on PATH or the Python package imageio-ffmpeg
  • For optional speaker diarization: a Hugging Face token (HUGGINGFACE_TOKEN) and pyannote.audio

Installation

Option 1: uv (recommended)

uv sync

Option 2: pip

pip install -e .

Optional speaker diarization dependencies

uv sync --extra diarize

or:

pip install -e ".[diarize]"

Configuration

Set your OpenAI API key (and, if you use diarization, your Hugging Face token) in the environment or in a .env file:

export OPENAI_API_KEY="your_api_key_here"
export HUGGINGFACE_TOKEN="your_huggingface_token_here"

Or create a .env file in the project root:

OPENAI_API_KEY=your_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here
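
To make the .env behaviour concrete, here is a minimal standard-library sketch of how KEY=VALUE lines could be read into the process environment. The function name load_env_file and the rule that variables already exported in the shell take precedence are assumptions for illustration; the loader whisper-smith actually uses is not shown here and may behave differently.

```python
import os
from pathlib import Path


def load_env_file(path: Path, override: bool = False) -> dict[str, str]:
    """Hypothetical helper: copy KEY=VALUE pairs from a .env-style file into os.environ."""
    loaded: dict[str, str] = {}
    if not path.is_file():
        return loaded
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop surrounding whitespace and one layer of simple quoting.
        value = value.strip().strip('"').strip("'")
        if not key:
            continue
        # Assumption: values already set in the shell win unless override=True.
        if override or key not in os.environ:
            os.environ[key] = value
        loaded[key] = os.environ[key]
    return loaded


if __name__ == "__main__":
    found = load_env_file(Path(".env"))
    print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in found)
```

Keeping shell exports ahead of file values is a common convention because it lets you override a checked-out .env for a single run without editing it.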

CLI Usage Guide

Basic command:

whisper-smith <audio_path>

Show help:

whisper-smith --help

1) Print transcript to terminal (default txt)

whisper-smith data/sample.m4a

2) Save transcript to a file

whisper-smith data/sample.m4a --output data/sample.txt

3) Choose output format explicitly

whisper-smith data/sample.m4a --format json --output data/sample.json

Supported CLI formats: txt, json, srt, vtt

4) Let format be inferred from output extension

whisper-smith data/sample.m4a --output data/sample.srt

5) Overwrite existing file

whisper-smith data/sample.m4a --output data/sample.txt --overwrite

6) Run speaker diarization

whisper-smith data/sample.m4a --diarize --output data/sample.diarization.json

Diarization currently supports JSON output only. You can optionally pass a speaker-count hint:

whisper-smith data/sample.m4a --diarize --format json --num-speakers 2

7) Create speaker-aligned transcript JSON

Run the full pipeline from one audio file:

whisper-smith data/sample.m4a --align --output data/sample.aligned.json

This writes the main aligned transcript JSON to data/sample.aligned.json and also writes intermediate artifacts beside it:

data/sample.transcript.json
data/sample.diarization.json

To put the intermediate artifacts in a separate directory:

whisper-smith data/sample.m4a --align --output data/sample.aligned.json --artifacts-dir data/artifacts
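
For intuition about what "speaker-aligned" means, the sketch below labels each transcript segment with the diarization turn that overlaps it the most in time. This maximum-overlap heuristic is a common approach, but it is an assumption here: the alignment logic whisper-smith uses is not described above, and the names Span, overlap, and assign_speakers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: float  # seconds
    end: float    # seconds
    label: str    # transcript text, or a speaker name for diarization turns


def overlap(a: Span, b: Span) -> float:
    """Length in seconds of the time shared by two spans (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(transcript: list[Span], turns: list[Span]) -> list[tuple[str, str]]:
    """Pair each transcript segment with the speaker whose turn overlaps it most."""
    aligned: list[tuple[str, str]] = []
    for seg in transcript:
        best = max(turns, key=lambda turn: overlap(seg, turn), default=None)
        # Segments that touch no speaker turn get a placeholder label.
        if best is None or overlap(seg, best) == 0.0:
            speaker = "UNKNOWN"
        else:
            speaker = best.label
        aligned.append((speaker, seg.label))
    return aligned


if __name__ == "__main__":
    transcript = [Span(0.0, 2.0, "Hello"), Span(2.0, 5.0, "Hi there")]
    turns = [Span(0.0, 2.2, "SPEAKER_00"), Span(2.2, 6.0, "SPEAKER_01")]
    for speaker, text in assign_speakers(transcript, turns):
        print(f"{speaker}: {text}")
```

The intermediate transcript and diarization artifacts are useful for exactly this kind of experiment, since they let you try a different alignment rule without re-running either model.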

Python Usage

from pathlib import Path
from whisper_smith.transcribe import transcribe_audio
from whisper_smith.exporters import export_transcript

result = transcribe_audio(Path("data/sample.m4a"))
print(result.text)

srt = export_transcript(result, "srt")
Path("data/sample.srt").write_text(srt, encoding="utf-8")
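
As background on what an SRT export contains, the following standalone sketch turns (start, end, text) tuples into SRT cues. It does not use or reproduce export_transcript; the function names srt_timestamp and to_srt and the tuple-based input are assumptions chosen to keep the example self-contained.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "Hello"), (1.25, 2.0, "World")]))
```

VTT cues look similar but use a period instead of a comma before the milliseconds and start with a WEBVTT header line.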

Speaker diarization

from pathlib import Path
from whisper_smith.diarize import diarize_audio

result = diarize_audio(Path("data/sample.m4a"))

for segment in result.segments:
    print(segment.start, segment.end, segment.speaker)

diarize_audio uses HUGGINGFACE_TOKEN from the environment, or accepts hf_token="..." explicitly.

The default local model is pyannote/speaker-diarization-3.1, which is compatible with the Intel macOS dependency set. You may pass a different model explicitly from Python or with --diarization-model when running on a newer platform.

Notes

  • If --output is omitted, the transcript is printed to stdout.
  • If --format is omitted, the format is inferred from the --output extension when possible.
  • If an output file already exists, add --overwrite to replace it.
  • Transcription uses a timestamp-capable OpenAI model by default so JSON, SRT, and VTT outputs have segment timestamps.
  • For large audio files, whisper-smith automatically splits the audio into chunks and merges the transcript text.
  • If diarization fails with torchaudio missing AudioMetaData, refresh the optional diarization dependencies with uv lock --upgrade-package torch --upgrade-package torchaudio and then uv sync --extra diarize.
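
The split-and-merge note above can be pictured with a small sketch that plans fixed-length chunks for a recording and joins the per-chunk text afterwards. Everything specific here is an assumption: the 600-second chunk length, splitting by time rather than by file size, and the names chunk_spans and merge_texts are illustrative only, and the real implementation cuts the audio with ffmpeg, which this example does not do.

```python
def chunk_spans(duration: float, chunk_seconds: float = 600.0) -> list[tuple[float, float]]:
    """Plan consecutive (start, end) windows in seconds that cover the whole recording."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    spans: list[tuple[float, float]] = []
    start = 0.0
    while start < duration:
        # The last window is shortened so it never runs past the end of the audio.
        end = min(start + chunk_seconds, duration)
        spans.append((start, end))
        start = end
    return spans


def merge_texts(parts: list[str]) -> str:
    """Join per-chunk transcripts in order, skipping chunks that produced no text."""
    return " ".join(part.strip() for part in parts if part.strip())


if __name__ == "__main__":
    for start, end in chunk_spans(1500.0):
        print(f"chunk {start:.0f}s to {end:.0f}s")
    print(merge_texts(["First part. ", "", " Second part."]))
```

If segment timestamps are needed after merging, each chunk's timestamps would also have to be shifted by that chunk's start time.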

Development

Run tests:

pytest
