
speakerscribe

Speech-to-text with speaker diarization — powered by faster-whisper + pyannote.audio, optimized for Google Colab Free Tier (T4 GPU).



What it does

speakerscribe takes any audio or video file and produces:

Output             Description
.txt               Plain transcript with [SPEAKER_XX] labels per line
.srt               Subtitle file with timestamps and speaker labels
.json              Full structured metadata (segments, speakers, versions, RTF, timings)
.transcript.md     Readable Markdown grouped by speaker turns with a metadata header
_1.txt, _2.txt     Text chunks of ~1950 words for downstream LLM processing
_full_for_llm.txt  Single-file full transcript for LLMs with large context windows
.segments.jsonl    Per-segment streaming output (opt-in via streaming_jsonl=True)
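The split files are produced by chunking the transcript on word boundaries. A minimal sketch of the idea (the real implementation lives inside speakerscribe; the exact boundary handling here is an assumption):

```python
def split_words(text: str, words_per_split: int = 1950) -> list[str]:
    """Chunk text into pieces of at most `words_per_split` words."""
    words = text.split()
    return [
        " ".join(words[i : i + words_per_split])
        for i in range(0, len(words), words_per_split)
    ]

chunks = split_words("one two three four five", words_per_split=2)
```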

Quick start (Google Colab — recommended)

Open the companion notebook:

Open In Colab

The notebook handles everything: installation, Google Drive mounting, HF token setup, and batch processing.


Installation

pip install speakerscribe

Requires Python ≥ 3.10. On Colab, restart the runtime after installation.

HuggingFace token (required for diarization)

pyannote.audio requires a free HuggingFace token with access to the diarization model:

  1. Create a Read token at https://huggingface.co/settings/tokens (do not use fine-grained tokens).
  2. Accept the model terms at https://huggingface.co/pyannote/speaker-diarization-community-1.
  3. Make the token available in one of three ways:
    • Colab Secrets (recommended): add HF_TOKEN in the Colab sidebar.
    • Environment variable: export HF_TOKEN=hf_...
    • Config parameter: TranscriptionConfig(hf_token="hf_...")

Python API

from speakerscribe import TranscriptionConfig, WorkspacePaths, process_batch

# 1. Configure the pipeline
config = TranscriptionConfig(
    model="large-v3-turbo",   # recommended: fast + accurate
    language="es",            # None = auto-detect
    beam_size=5,
    enable_diarization=True,
    # hf_token="hf_..."       # or set HF_TOKEN env var / Colab Secret
)

# 2. Set workspace (put your audio files in <workspace>/data/)
paths = WorkspacePaths(workspace="/content/drive/MyDrive/MyProject")

# 3. Run — processes all files in data/ and writes outputs to transcripts/
results = process_batch(paths, config)

Process a single file

from speakerscribe import TranscriptionConfig, WorkspacePaths
from speakerscribe.pipeline import preflight_check, process_one
from speakerscribe.transcription import load_whisper_model, release_whisper_model

config = TranscriptionConfig(model="large-v3-turbo", language="en")
paths  = WorkspacePaths(workspace="/tmp/my_project")

preflight_check(paths, config)
model = load_whisper_model(config)
try:
    result = process_one(paths.data / "interview.mp4", paths, model, config)
    print(result["total_words"], "words transcribed")
finally:
    release_whisper_model(model)

Rename SPEAKER_XX labels after reviewing the transcript

from speakerscribe import WorkspacePaths, rename_speakers_in_outputs

paths = WorkspacePaths(workspace="/tmp/my_project")
rename_speakers_in_outputs(
    paths,
    base_name="interview_large-v3-turbo",
    mapping={"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"},
)
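Conceptually the rename is a plain label-to-name substitution applied across the generated output files. A standalone sketch of that substitution (not the library's actual implementation):

```python
mapping = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}
transcript = "[SPEAKER_00] Hello there.\n[SPEAKER_01] Hi!"

# Replace each diarization label with its human-readable name
for label, name in mapping.items():
    transcript = transcript.replace(label, name)
```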

Command-line interface

# Process all audio files in a workspace
speakerscribe process --workspace /path/to/project --model large-v3-turbo --language es

# Enable debug logging
speakerscribe --verbose process --workspace /path/to/project

# Load config from a JSON file instead of CLI flags
speakerscribe process --workspace /path/to/project --config-file config.json

# Run a fast smoke test (small model, first file only)
speakerscribe smoke-test --workspace /path/to/project

# Inspect a transcription JSON file
speakerscribe inspect /path/to/project/transcripts/interview.json

# Show aggregate statistics from the runs database
speakerscribe stats --workspace /path/to/project

# List the most recent runs
speakerscribe list-runs --workspace /path/to/project --limit 20

# Rename speaker labels
speakerscribe rename --workspace /path/to/project \
  --base-name "interview_large-v3-turbo" \
  --mapping mapping.json

# Delete outputs for a specific file (keep source audio)
speakerscribe clean --workspace /path/to/project --pattern "interview"

# Delete everything (requires explicit confirmation)
speakerscribe clean --workspace /path/to/project --all --confirm "YES DELETE ALL"

# Show version
speakerscribe version
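The --mapping file is assumed to be a simple JSON object from diarization labels to display names, matching the mapping= argument of the Python API above:

```json
{
  "SPEAKER_00": "Alice",
  "SPEAKER_01": "Bob"
}
```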

Workspace layout

my_project/
├── data/                  ← place your .mp4 / .mp3 / .wav / .m4a files here
├── transcripts/           ← outputs: .txt, .srt, .json, .transcript.md
├── splits/                ← chunked .txt files and _full_for_llm.txt for LLM use
├── _audio_temp/           ← temporary 16 kHz WAVs (auto-deleted)
├── _audio_chunks/         ← chunked WAVs for long-audio splitting (auto-deleted)
├── _diar_cache/           ← cached pyannote diarization results (reuse on reruns)
├── _logs/                 ← rotating log files
└── _runs.db               ← SQLite run history (created when enable_runs_db=True)
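If you want to lay out a workspace by hand, only data/ needs to exist up front; the sketch below assumes WorkspacePaths creates the remaining directories on demand:

```python
from pathlib import Path

workspace = Path("/tmp/my_project")
# Input audio/video files go here; outputs land in sibling directories
(workspace / "data").mkdir(parents=True, exist_ok=True)
```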

Supported models

Model           VRAM (T4)  Speed      Quality
tiny            ~1.5 GB    fastest    basic
base            ~1.8 GB    very fast  fair
small           ~2.5 GB    fast       good
medium          ~3.5 GB    moderate   very good
large-v3-turbo  ~3.5 GB    fast       excellent
large-v3        ~5.0 GB    slower     best

large-v3-turbo is the recommended choice for most workflows: ~3× faster than large-v3 with comparable accuracy. The default when no model is specified is large-v3.


Supported audio/video formats

.mp4 · .mp3 · .wav · .m4a · .mkv · .aac · .flac · .ogg · .webm

Audio is automatically converted to 16 kHz mono WAV via ffmpeg before processing.
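The conversion is roughly equivalent to the ffmpeg invocation below; the exact flags speakerscribe passes internally are an assumption:

```python
import shutil
import subprocess
from pathlib import Path

# 16 kHz (-ar 16000) mono (-ac 1) WAV, hypothetical file names
cmd = ["ffmpeg", "-y", "-i", "interview.mp4", "-ac", "1", "-ar", "16000", "interview.wav"]

# Only run when ffmpeg is on PATH and the input file exists
if shutil.which("ffmpeg") and Path("interview.mp4").exists():
    subprocess.run(cmd, check=True)
```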


Configuration reference

TranscriptionConfig(
    # ── Model ──────────────────────────────────────────────────────
    model            = "large-v3",        # Whisper model name (default: large-v3)
    device           = "auto",            # "auto" | "cuda" | "cpu"
    compute_type     = "auto",            # "auto" | "float16" | "int8"

    # ── Decoding ───────────────────────────────────────────────────
    beam_size        = 5,                 # 1 (greedy) to 10 (highest quality)
    language         = None,              # None = auto-detect; "en", "es", etc.
    initial_prompt   = None,              # glossary string for proper nouns / jargon
                                          # per-file override: <stem>.prompt.txt next to audio

    # ── VAD ────────────────────────────────────────────────────────
    use_vad              = True,          # Silero VAD — skip silences
    vad_min_silence_ms   = 500,           # silences ≥ N ms split segments

    # ── Transcription ──────────────────────────────────────────────
    word_timestamps  = False,             # per-word timestamps (slower)

    # ── Diarization ────────────────────────────────────────────────
    enable_diarization   = True,          # set False for transcription only
    hf_token             = None,          # or set HF_TOKEN env var / Colab Secret
    num_speakers         = None,          # pin exact count if known
    min_speakers         = None,          # or use min/max range
    max_speakers         = None,
    diarization_model    = "pyannote/speaker-diarization-community-1",

    # ── Transcript output ──────────────────────────────────────────
    gap_max_s_transcript    = 2.0,        # seconds of silence to open a new speaker turn
    generate_transcript_md  = True,       # write the readable .transcript.md
    remove_fillers          = True,       # drop filler-only segments from .transcript.md
                                          # (.json and .srt keep ALL segments)

    # ── Splits and LLM output ──────────────────────────────────────
    words_per_split          = 1950,      # chunk size for split files
    produce_unified_for_llm  = True,      # also write a single _full_for_llm.txt

    # ── Long-audio chunking ────────────────────────────────────────
    long_audio_threshold_min = 120,       # audios longer than N min are chunked
    chunk_duration_min       = 30,        # target chunk length in minutes
    chunk_overlap_s          = 5,         # overlap between chunks in seconds

    # ── Streaming ─────────────────────────────────────────────────
    streaming_jsonl  = False,             # write segments to .segments.jsonl as produced

    # ── Run control ────────────────────────────────────────────────
    force_reprocess  = False,             # True = ignore existing outputs
    evaluate_quality = True,              # heuristic quality check after each file
    enable_runs_db   = True,              # SQLite run history for idempotency by hash

    # ── Disk safety ────────────────────────────────────────────────
    disk_margin_factor  = 0.5,            # required free space = input_size × factor
    disk_margin_min_mb  = 500,            # hard floor for required free space (MB)
)
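With the defaults above, a 2.5-hour file crosses the 120-minute threshold and is cut into ~30-minute chunks with 5 s of overlap. A sketch of the boundary arithmetic (the library's exact chunking scheme is an assumption):

```python
def chunk_bounds(duration_s: float, chunk_min: int = 30, overlap_s: int = 5):
    """Return (start, end) times in seconds for overlapping chunks."""
    step = chunk_min * 60 - overlap_s
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_min * 60, duration_s)))
        start += step
    return bounds

bounds = chunk_bounds(150 * 60)  # a 2.5-hour recording
```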

Quality checker

After each file, speakerscribe runs a heuristic quality check and logs any issues:

Flag                 Severity  Meaning
LOW_LANG_CONFIDENCE  WARNING   Language detection < 85% — check if audio is noisy
LOW_RTF              WARNING   Processing slower than 2× real-time — possible pipeline issue
HIGH_RTF             INFO      Very fast processing — audio is mostly silence
LOW_WPM              WARNING   < 60 words/min — aggressive VAD or very slow speech
HIGH_WPM             CRITICAL  > 250 words/min — likely Whisper hallucination
SPEAKER_DOMINANCE    WARNING   One speaker > 95% — poor diarization
TOO_MANY_SPEAKERS    WARNING   > 8 speakers detected — consider pinning num_speakers
TINY_SPEAKERS        WARNING   Speakers with ≤ 2 segments — likely false positives
REPETITIONS          CRITICAL  Consecutive repeated n-grams — Whisper hallucination loop
WORD_DOMINANCE       WARNING   One word dominates the transcript — possible hallucination
EMPTY_SEGMENTS       WARNING   > 10% empty segments — VAD too aggressive

Requirements

  • Python ≥ 3.10
  • ffmpeg installed and available in PATH
  • NVIDIA GPU (T4 or better) recommended; CPU mode is also supported but slow

On Google Colab

ffmpeg is pre-installed. GPU runtime: Runtime → Change runtime type → T4 GPU.


Development

git clone https://github.com/EnriqueForero/speakerscribe
cd speakerscribe
pip install -e ".[dev]"
pre-commit install

# Run tests (no GPU needed for unit tests)
pytest tests/ -m "not integration and not gpu"

# Lint
ruff check speakerscribe/ tests/
ruff format speakerscribe/ tests/

License

MIT — © 2026 Néstor Enrique Forero Herrera
