
speakerscribe

Speech-to-text with speaker diarization — powered by faster-whisper + pyannote.audio, optimized for Google Colab Free Tier (T4 GPU).



What it does

speakerscribe takes any audio or video file and produces:

| Output | Description |
| --- | --- |
| `.txt` | Plain transcript with `[SPEAKER_XX]` labels per line |
| `.srt` | Subtitle file with timestamps and speaker labels |
| `.json` | Full structured metadata (segments, speakers, versions, RTF) |
| `.transcript.md` | Readable Markdown grouped by speaker turns, with a metadata header |
| `_1.txt`, `_2.txt` | Text chunks of ~1950 words for downstream LLM processing |
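The split files are plain word-count chunks. A minimal sketch of the idea (the helper name and boundary handling are illustrative, not the library's internals):

```python
def split_words(text: str, words_per_split: int = 1950) -> list[str]:
    """Split a transcript into chunks of at most `words_per_split` words."""
    words = text.split()
    return [
        " ".join(words[i : i + words_per_split])
        for i in range(0, len(words), words_per_split)
    ]
```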

Quick start (Google Colab — recommended)

Open the companion notebook:

Open In Colab

The notebook handles everything: installation, Google Drive mounting, HF token setup, and batch processing.


Installation

pip install speakerscribe

Requires Python ≥ 3.10. On Colab, restart the runtime after installation.

HuggingFace token (required for diarization)

pyannote.audio requires a free HuggingFace token with access to the diarization model:

  1. Create a Read token at https://huggingface.co/settings/tokens (do not use fine-grained tokens).
  2. Accept the model terms at https://huggingface.co/pyannote/speaker-diarization-community-1.
  3. Make the token available in one of three ways:
    • Colab Secrets (recommended): add HF_TOKEN in the Colab sidebar.
    • Environment variable: export HF_TOKEN=hf_...
    • Config parameter: TranscriptionConfig(hf_token="hf_...")

Python API

from speakerscribe import TranscriptionConfig, WorkspacePaths, process_batch

# 1. Configure the pipeline
config = TranscriptionConfig(
    model="large-v3-turbo",   # recommended: fast + accurate
    language="es",            # None = auto-detect
    beam_size=5,
    enable_diarization=True,
    # hf_token="hf_..."       # or set HF_TOKEN env var / Colab Secret
)

# 2. Set workspace (put your audio files in <workspace>/data/)
paths = WorkspacePaths(workspace="/content/drive/MyDrive/MyProject")

# 3. Run — processes all files in data/ and writes outputs to transcripts/
results = process_batch(paths, config)
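`process_batch` returns one result per processed file. Assuming each entry is shaped like the dict returned by `process_one` (which carries a `total_words` key, see "Process a single file"), a quick batch summary might look like:

```python
# Hypothetical summary helper; assumes each batch result carries a
# "total_words" key, like the dict returned by process_one.
def summarize(results: list[dict]) -> int:
    return sum(r.get("total_words", 0) for r in results)
```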

Process a single file

from speakerscribe import TranscriptionConfig, WorkspacePaths
from speakerscribe.pipeline import preflight_check, process_one
from speakerscribe.transcription import load_whisper_model, release_whisper_model

config = TranscriptionConfig(model="large-v3-turbo", language="en")
paths  = WorkspacePaths(workspace="/tmp/my_project")

preflight_check(paths, config)
model = load_whisper_model(config)
try:
    result = process_one(paths.data / "interview.mp4", paths, model, config)
    print(result["total_words"], "words transcribed")
finally:
    release_whisper_model(model)

Rename SPEAKER_XX labels after reviewing the transcript

from speakerscribe import WorkspacePaths, rename_speakers_in_outputs

paths = WorkspacePaths(workspace="/tmp/my_project")
rename_speakers_in_outputs(
    paths,
    base_name="interview_large-v3-turbo",
    mapping={"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"},
)

Command-line interface

# Process all audio files in a workspace
speakerscribe process --workspace /path/to/project --model large-v3-turbo --language es

# Run a fast smoke test (small model, first file only)
speakerscribe smoke-test --workspace /path/to/project

# Inspect a transcription JSON file
speakerscribe inspect /path/to/project/transcripts/interview.json

# Rename speaker labels
speakerscribe rename --workspace /path/to/project \
  --base-name "interview_large-v3-turbo" \
  --mapping mapping.json

# Delete outputs for a specific file (keep source audio)
speakerscribe clean --workspace /path/to/project --pattern "interview"

# Show version
speakerscribe --version
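The file passed to `--mapping` is a JSON object from diarization labels to display names, mirroring the `mapping` dict used in the Python API:

```json
{
  "SPEAKER_00": "Alice",
  "SPEAKER_01": "Bob"
}
```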

Workspace layout

my_project/
├── data/                  ← place your .mp4 / .mp3 / .wav / .m4a files here
├── transcripts/           ← outputs: .txt, .srt, .json, .transcript.md
├── splits/                ← chunked .txt files for LLM use
├── _audio_temp/           ← temporary 16 kHz WAVs (auto-deleted)
├── _diar_cache/           ← cached pyannote diarization results (reuse on reruns)
└── _logs/                 ← rotating log files

Supported models

| Model | VRAM (T4) | Speed | Quality |
| --- | --- | --- | --- |
| `tiny` | ~1.5 GB | fastest | basic |
| `base` | ~1.8 GB | very fast | fair |
| `small` | ~2.5 GB | fast | good |
| `medium` | ~3.5 GB | moderate | very good |
| `large-v3-turbo` | ~3.5 GB | fast | excellent |
| `large-v3` | ~5.0 GB | slower | best |

large-v3-turbo is the recommended default: ~3× faster than large-v3 with comparable accuracy.


Supported audio/video formats

.mp4 · .mp3 · .wav · .m4a · .mkv · .aac · .flac · .ogg · .webm

Audio is automatically converted to 16 kHz mono WAV via ffmpeg before processing.
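The conversion is the standard ffmpeg resample-and-downmix. A sketch of the equivalent invocation (the library's exact flags may differ; `ffmpeg_args` is a hypothetical helper):

```python
import subprocess

def ffmpeg_args(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",     # -y: overwrite an existing output file
        "-i", src,          # input: any supported audio/video container
        "-ar", "16000",     # resample to 16 kHz
        "-ac", "1",         # downmix to mono
        dst,
    ]

# To actually convert:
# subprocess.run(ffmpeg_args("interview.mp4", "interview.wav"), check=True)
```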


Configuration reference

TranscriptionConfig(
    model            = "large-v3-turbo",  # Whisper model name
    device           = "auto",            # "auto" | "cuda" | "cpu"
    compute_type     = "auto",            # "auto" | "float16" | "int8"
    beam_size        = 5,                 # 1 (greedy) to 10 (highest quality)
    language         = None,              # None = auto-detect; "en", "es", etc.
    initial_prompt   = None,              # glossary string for proper nouns / jargon
    use_vad          = True,              # Silero VAD — skip silences
    word_timestamps  = False,             # per-word timestamps (slower)
    enable_diarization = True,            # set False for transcription only
    num_speakers     = None,              # pin exact count if known
    min_speakers     = None,              # or use min/max range
    max_speakers     = None,
    words_per_split  = 1950,              # chunk size for split files
    force_reprocess  = False,             # True = ignore existing outputs
    evaluate_quality = True,              # heuristic quality check
    enable_runs_db   = False,             # SQLite run history (opt-in)
)

Quality checker

After each file, speakerscribe runs a heuristic quality check and logs any issues:

| Flag | Severity | Meaning |
| --- | --- | --- |
| `LOW_LANG_CONFIDENCE` | WARNING | Language detection < 85%; check whether the audio is noisy |
| `LOW_RTF` | WARNING | Processing faster than 2×; possible pipeline issue |
| `HIGH_RTF` | INFO | Audio is mostly silence |
| `LOW_WPM` | WARNING | < 60 words/min; aggressive VAD or very slow speech |
| `HIGH_WPM` | CRITICAL | > 250 words/min; likely Whisper hallucination |
| `SPEAKER_DOMINANCE` | WARNING | One speaker > 95%; poor diarization |
| `TOO_MANY_SPEAKERS` | WARNING | > 8 speakers detected; consider pinning `num_speakers` |
| `REPETITIONS` | CRITICAL | Consecutive repeated n-grams; Whisper hallucination loop |
| `EMPTY_SEGMENTS` | WARNING | > 10% empty segments; VAD too aggressive |
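The WPM thresholds above translate directly into code. A hypothetical re-creation of just those two checks (thresholds match the table; the real checker covers all the flags):

```python
def wpm_flags(total_words: int, audio_seconds: float) -> list[str]:
    """Flag implausible speech rates, per the documented thresholds."""
    wpm = total_words / (audio_seconds / 60)
    flags = []
    if wpm < 60:
        flags.append("LOW_WPM")    # aggressive VAD or very slow speech
    elif wpm > 250:
        flags.append("HIGH_WPM")   # likely Whisper hallucination
    return flags
```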

Requirements

  • Python ≥ 3.10
  • ffmpeg installed and available in PATH
  • NVIDIA GPU (T4 or better) recommended; CPU mode is also supported but slow

On Google Colab

ffmpeg is pre-installed. GPU runtime: Runtime → Change runtime type → T4 GPU.


Development

git clone https://github.com/EnriqueForero/speakerscribe
cd speakerscribe
pip install -e ".[dev]"
pre-commit install

# Run tests (no GPU needed for unit tests)
pytest tests/ -m "not integration and not gpu"

# Lint
ruff check speakerscribe/ tests/
ruff format speakerscribe/ tests/

License

MIT — © 2026 Néstor Enrique Forero Herrera
