
speakerscribe

Speech-to-text with speaker diarization — powered by faster-whisper + pyannote.audio, optimized for Google Colab Free Tier (T4 GPU).



What it does

speakerscribe takes any audio or video file and produces:

| Output | Description |
| --- | --- |
| `.txt` | Plain transcript with `[SPEAKER_XX]` labels per line |
| `.srt` | Subtitle file with timestamps and speaker labels |
| `.json` | Full structured metadata (segments, speakers, versions, RTF) |
| `.transcript.md` | Readable Markdown grouped by speaker turns, with a metadata header |
| `_1.txt`, `_2.txt` | Text chunks of ~1950 words for downstream LLM processing |
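The split files are plain word-count chunks. A minimal sketch of the idea (the helper name and boundary handling are illustrative, not the library's internals):

```python
def split_words(text: str, words_per_split: int = 1950) -> list[str]:
    """Split a transcript into chunks of at most `words_per_split` words."""
    words = text.split()
    return [
        " ".join(words[i : i + words_per_split])
        for i in range(0, len(words), words_per_split)
    ]
```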

Quick start (Google Colab — recommended)

Open the companion notebook:

Open In Colab

The notebook handles everything: installation, Google Drive mounting, HF token setup, and batch processing.


Installation

pip install speakerscribe

Requires Python ≥ 3.10. On Colab, restart the runtime after installation.

HuggingFace token (required for diarization)

pyannote.audio requires a free HuggingFace token with access to the diarization model:

  1. Create a Read token at https://huggingface.co/settings/tokens (do not use fine-grained tokens).
  2. Accept the model terms at https://huggingface.co/pyannote/speaker-diarization-community-1.
  3. Make the token available in one of three ways:
    • Colab Secrets (recommended): add HF_TOKEN in the Colab sidebar.
    • Environment variable: export HF_TOKEN=hf_...
    • Config parameter: TranscriptionConfig(hf_token="hf_...")

Python API

from speakerscribe import TranscriptionConfig, WorkspacePaths, process_batch

# 1. Configure the pipeline
config = TranscriptionConfig(
    model="large-v3-turbo",   # recommended: fast + accurate
    language="es",            # None = auto-detect
    beam_size=5,
    enable_diarization=True,
    # hf_token="hf_..."       # or set HF_TOKEN env var / Colab Secret
)

# 2. Set workspace (put your audio files in <workspace>/data/)
paths = WorkspacePaths(workspace="/content/drive/MyDrive/MyProject")

# 3. Run — processes all files in data/ and writes outputs to transcripts/
results = process_batch(paths, config)
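`process_batch` returns one result per processed file. Assuming each entry is shaped like the dict returned by `process_one` (which carries a `total_words` key, see "Process a single file"), a quick batch summary might look like:

```python
# Hypothetical summary helper; assumes each batch result carries a
# "total_words" key, like the dict returned by process_one.
def summarize(results: list[dict]) -> int:
    return sum(r.get("total_words", 0) for r in results)
```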

Process a single file

from speakerscribe import TranscriptionConfig, WorkspacePaths
from speakerscribe.pipeline import preflight_check, process_one
from speakerscribe.transcription import load_whisper_model, release_whisper_model

config = TranscriptionConfig(model="large-v3-turbo", language="en")
paths  = WorkspacePaths(workspace="/tmp/my_project")

preflight_check(paths, config)
model = load_whisper_model(config)
try:
    result = process_one(paths.data / "interview.mp4", paths, model, config)
    print(result["total_words"], "words transcribed")
finally:
    release_whisper_model(model)

Rename SPEAKER_XX labels after reviewing the transcript

from speakerscribe import WorkspacePaths, rename_speakers_in_outputs

paths = WorkspacePaths(workspace="/tmp/my_project")
rename_speakers_in_outputs(
    paths,
    base_name="interview_large-v3-turbo",
    mapping={"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"},
)

Command-line interface

# Process all audio files in a workspace
speakerscribe process --workspace /path/to/project --model large-v3-turbo --language es

# Run a fast smoke test (small model, first file only)
speakerscribe smoke-test --workspace /path/to/project

# Inspect a transcription JSON file
speakerscribe inspect /path/to/project/transcripts/interview.json

# Rename speaker labels
speakerscribe rename --workspace /path/to/project \
  --base-name "interview_large-v3-turbo" \
  --mapping mapping.json

# Delete outputs for a specific file (keep source audio)
speakerscribe clean --workspace /path/to/project --pattern "interview"

# Show version
speakerscribe --version
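The file passed to `--mapping` is a JSON object from diarization labels to display names, mirroring the `mapping` dict used in the Python API:

```json
{
  "SPEAKER_00": "Alice",
  "SPEAKER_01": "Bob"
}
```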

Workspace layout

my_project/
├── data/                  ← place your .mp4 / .mp3 / .wav / .m4a files here
├── transcripts/           ← outputs: .txt, .srt, .json, .transcript.md
├── splits/                ← chunked .txt files for LLM use
├── _audio_temp/           ← temporary 16 kHz WAVs (auto-deleted)
├── _diar_cache/           ← cached pyannote diarization results (reuse on reruns)
└── _logs/                 ← rotating log files

Supported models

| Model | VRAM (T4) | Speed | Quality |
| --- | --- | --- | --- |
| `tiny` | ~1.5 GB | fastest | basic |
| `base` | ~1.8 GB | very fast | fair |
| `small` | ~2.5 GB | fast | good |
| `medium` | ~3.5 GB | moderate | very good |
| `large-v3-turbo` | ~3.5 GB | fast | excellent |
| `large-v3` | ~5.0 GB | slower | best |

large-v3-turbo is the recommended default: ~3× faster than large-v3 with comparable accuracy.


Supported audio/video formats

.mp4 · .mp3 · .wav · .m4a · .mkv · .aac · .flac · .ogg · .webm

Audio is automatically converted to 16 kHz mono WAV via ffmpeg before processing.
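The conversion is the standard ffmpeg resample-and-downmix. A sketch of the equivalent invocation (the library's exact flags may differ; `ffmpeg_args` is a hypothetical helper):

```python
import subprocess

def ffmpeg_args(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",     # -y: overwrite an existing output file
        "-i", src,          # input: any supported audio/video container
        "-ar", "16000",     # resample to 16 kHz
        "-ac", "1",         # downmix to mono
        dst,
    ]

# To actually convert:
# subprocess.run(ffmpeg_args("interview.mp4", "interview.wav"), check=True)
```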


Configuration reference

TranscriptionConfig(
    model            = "large-v3-turbo",  # Whisper model name
    device           = "auto",            # "auto" | "cuda" | "cpu"
    compute_type     = "auto",            # "auto" | "float16" | "int8"
    beam_size        = 5,                 # 1 (greedy) to 10 (highest quality)
    language         = None,              # None = auto-detect; "en", "es", etc.
    initial_prompt   = None,              # glossary string for proper nouns / jargon
    use_vad          = True,              # Silero VAD — skip silences
    word_timestamps  = False,             # per-word timestamps (slower)
    enable_diarization = True,            # set False for transcription only
    num_speakers     = None,              # pin exact count if known
    min_speakers     = None,              # or use min/max range
    max_speakers     = None,
    words_per_split  = 1950,              # chunk size for split files
    force_reprocess  = False,             # True = ignore existing outputs
    evaluate_quality = True,              # heuristic quality check
    enable_runs_db   = False,             # SQLite run history (opt-in)
)

Quality checker

After each file, speakerscribe runs a heuristic quality check and logs any issues:

| Flag | Severity | Meaning |
| --- | --- | --- |
| `LOW_LANG_CONFIDENCE` | WARNING | Language detection < 85%; check whether the audio is noisy |
| `LOW_RTF` | WARNING | Processing faster than 2×; possible pipeline issue |
| `HIGH_RTF` | INFO | Audio is mostly silence |
| `LOW_WPM` | WARNING | < 60 words/min; aggressive VAD or very slow speech |
| `HIGH_WPM` | CRITICAL | > 250 words/min; likely Whisper hallucination |
| `SPEAKER_DOMINANCE` | WARNING | One speaker > 95%; poor diarization |
| `TOO_MANY_SPEAKERS` | WARNING | > 8 speakers detected; consider pinning `num_speakers` |
| `REPETITIONS` | CRITICAL | Consecutive repeated n-grams; Whisper hallucination loop |
| `EMPTY_SEGMENTS` | WARNING | > 10% empty segments; VAD too aggressive |
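The WPM thresholds above translate directly into code. A hypothetical re-creation of just those two checks (thresholds match the table; the real checker covers all the flags):

```python
def wpm_flags(total_words: int, audio_seconds: float) -> list[str]:
    """Flag implausible speech rates, per the documented thresholds."""
    wpm = total_words / (audio_seconds / 60)
    flags = []
    if wpm < 60:
        flags.append("LOW_WPM")    # aggressive VAD or very slow speech
    elif wpm > 250:
        flags.append("HIGH_WPM")   # likely Whisper hallucination
    return flags
```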

Requirements

  • Python ≥ 3.10
  • ffmpeg installed and available in PATH
  • NVIDIA GPU (T4 or better) recommended; CPU mode is also supported but slow

On Google Colab

ffmpeg is pre-installed. GPU runtime: Runtime → Change runtime type → T4 GPU.


Development

git clone https://github.com/EnriqueForero/speakerscribe
cd speakerscribe
pip install -e ".[dev]"
pre-commit install

# Run tests (no GPU needed for unit tests)
pytest tests/ -m "not integration and not gpu"

# Lint
ruff check speakerscribe/ tests/
ruff format speakerscribe/ tests/

License

MIT — © 2026 Néstor Enrique Forero Herrera
