crawl4ai for video & audio — turn any YouTube video, podcast, or recording into clean timestamped LLM-ready markdown, or TTS/STT training datasets

hearsay

crawl4ai for video & audio. One command turns any YouTube video, podcast episode, or local recording into clean, timestamped, chunked, LLM-ready markdown — for RAG pipelines, notes, and AI agents.

Command line — for pipelines & automation Web UI — hearsay web, in your browser

Same engine, two front ends — use whichever fits. Paste a link, get back markdown a human and a model can read: readable paragraphs, real timestamps, chapter headings, and an optional JSON sidecar with a stable schema. Captions when they exist (fast, no download); local Whisper or Apple-Silicon Parakeet transcription when they don't.

uv tool install hearsay
hearsay "https://www.youtube.com/watch?v=VIDEO_ID"   # → ./VIDEO_ID.md

Two ways to use it · Why · Install · Transcription engines
Quickstart · Web UI · What you get
Build training datasets · How it compares
Give your agent ears (MCP) · CLI reference
Requirements · Contributing

Two ways to use it

Command line — for pipelines, batches, and automation:

hearsay "https://youtu.be/VIDEO_ID" --json        # markdown + JSON sidecar
hearsay "https://example.com/feed.xml" --all      # batch a whole podcast feed

Web UI — for a browser: run hearsay web, paste a URL or drop in a file, and watch the clean markdown appear with a live preview, copy, and download. It's a single self-contained page with no extra dependencies (Python standard library only). See it in action up top, and details below.

Why

Getting a transcript into your RAG pipeline usually means gluing together yt-dlp, Whisper, and a pile of timestamp-wrangling scripts — and you still end up with one line per caption fragment or an undifferentiated wall of text. hearsay does the whole thing in one command:

Captions-first. Uses YouTube captions when available — fast, no media download.
Falls back to transcription automatically (local Whisper, or Parakeet on Apple Silicon) when there are none.
Readable output. Caption fragments are regrouped into real paragraphs (40–120 words) split on pauses and sentence boundaries — never one line per fragment, never a wall of text.
Structured. Chapters become ## sections (or ~5-minute time windows), every paragraph keeps a [hh:mm:ss] timestamp, and --json emits a sidecar with a stable schema.
Scales. Single videos, whole YouTube playlists, and podcast RSS feeds — batch into a folder, continuing past per-item failures.
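The regrouping and timestamping described above can be pictured with a small sketch. This is not hearsay's actual implementation: the `Fragment` type, the `group_fragments` and `hhmmss` names, and the 1.5-second pause threshold are assumptions made for illustration. Only the 40 to 120 word bounds and the [hh:mm:ss] stamp come from the text.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One caption fragment (hypothetical shape, for illustration only)."""
    start_s: float
    end_s: float
    text: str


def hhmmss(seconds: float) -> str:
    """Format a number of seconds as hh:mm:ss."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def group_fragments(fragments, min_words=40, max_words=120, pause_s=1.5):
    """Merge fragments into paragraphs, each prefixed with its start time.

    A paragraph is closed when it has at least min_words and the current
    fragment ends a sentence or is followed by a long pause, or when it
    reaches max_words, or when the input runs out.
    """
    paragraphs = []
    current = []
    words = 0
    for i, frag in enumerate(fragments):
        current.append(frag)
        words += len(frag.text.split())
        nxt = fragments[i + 1] if i + 1 < len(fragments) else None
        ends_sentence = frag.text.rstrip().endswith((".", "?", "!"))
        long_pause = nxt is not None and nxt.start_s - frag.end_s >= pause_s
        natural_break = words >= min_words and (ends_sentence or long_pause)
        if nxt is None or words >= max_words or natural_break:
            text = " ".join(f.text.strip() for f in current)
            paragraphs.append(f"**[{hhmmss(current[0].start_s)}]** {text}")
            current, words = [], 0
    return paragraphs
```

Each returned string mirrors the paragraph shape shown under "What you get": a bold start timestamp followed by the merged text.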

Install

uv tool install hearsay          # recommended
# or
pipx install hearsay

Optional extras:

uv tool install "hearsay[mcp]"        # MCP server, for AI agents
uv tool install "hearsay[parakeet]"   # fast Apple-Silicon engine (macOS arm64)
uv tool install "hearsay[diarize]"    # speaker diarization for single-voice TTS datasets

System requirement: ffmpeg on your PATH.

From source (for development)

git clone https://github.com/mudassar531/hearsay
cd hearsay
uv sync && uv run hearsay --help    # or: uv tool install .

Transcription engines

When a video has no captions (or you pass --transcribe), hearsay transcribes locally with the fastest engine your machine has. The default --model auto picks:

| Engine | When | Speed | Notes |
| --- | --- | --- | --- |
| Parakeet (NVIDIA Parakeet-TDT on Apple MLX) | Apple Silicon + `parakeet` extra | ~24× realtime (M1 Pro) | `parakeet` is multilingual (25 European languages); `parakeet-en` is English-only |
| Whisper (faster-whisper, CPU int8) | everywhere else, or an explicit size | ~7× realtime | sizes `tiny`…`large-v3`; `large-v3` is the multilingual ceiling |

On Apple Silicon, Parakeet is about 3× faster than whisper-small at comparable accuracy. If the parakeet extra isn't installed, auto falls back to whisper-small, so hearsay behaves the same everywhere, just faster on a Mac. Models download once (Whisper: tens of MB to ~1.5 GB; Parakeet v3: ~2.5 GB) and are cached for offline use.
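The selection rule for --model auto can be sketched as follows. This is an illustration of the rule as described, not hearsay's code: the `pick_engine` name is hypothetical, and `parakeet_mlx` is only an assumed module name used to detect whether the extra is installed.

```python
import importlib.util
import platform
import sys


def pick_engine(requested="auto", parakeet_module="parakeet_mlx"):
    """Resolve a --model value to a concrete engine name.

    parakeet_module is an assumed import name for the optional extra;
    the real package may expose a different one.
    """
    if requested != "auto":
        # An explicit choice (parakeet, parakeet-en, or a Whisper size) wins.
        return requested
    on_apple_silicon = sys.platform == "darwin" and platform.machine() == "arm64"
    has_parakeet = importlib.util.find_spec(parakeet_module) is not None
    if on_apple_silicon and has_parakeet:
        return "parakeet"
    # Everywhere else, or without the extra, use the Whisper "small" size.
    return "small"
```

Checking for the module with `find_spec` avoids importing a heavy dependency just to decide which engine to use.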

Quickstart

# YouTube → markdown via captions (fast — no download)
hearsay "https://www.youtube.com/watch?v=VIDEO_ID"

# Local audio/video → markdown (fast Parakeet on Apple Silicon, else CPU Whisper)
hearsay talk.mp3

# Force local transcription on a YouTube URL, pick an engine, also emit JSON
hearsay "https://youtu.be/VIDEO_ID" --transcribe --model parakeet --json

# Music/song? Add --no-vad so the lyrics aren't filtered out as "non-speech"
hearsay "https://youtu.be/SONG_ID" --no-vad

# A podcast feed or YouTube playlist: list, then ingest a selection
hearsay "https://example.com/feed.xml"
hearsay "https://example.com/feed.xml" --all --limit 3 --output-dir ./out

No captions on a video? hearsay falls back to local transcription automatically.
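For a sense of what "listing" a podcast feed involves, here is a minimal sketch that reads episode titles and audio URLs from RSS using only the standard library. The `list_episodes` name and the returned dictionary keys are hypothetical, and hearsay's own ordering and numbering of items may differ.

```python
import xml.etree.ElementTree as ET


def list_episodes(rss_xml):
    """Return one record per <item> in an RSS feed, numbered from 1."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for number, item in enumerate(root.iter("item"), start=1):
        enclosure = item.find("enclosure")
        episodes.append({
            "number": number,
            "title": (item.findtext("title") or "").strip(),
            # Podcast audio normally lives in the enclosure's url attribute.
            "audio_url": enclosure.get("url") if enclosure is not None else None,
        })
    return episodes
```

Items without an enclosure are kept with `audio_url` set to `None`, so a caller can decide whether to skip them.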

Web UI

Prefer a browser? hearsay web starts a tiny local web UI: paste a YouTube URL or attach an audio/video file, pick the model, and get a live markdown preview with copy, download, and a history of past transcripts. It's a single self-contained page built on the Python standard library, with no extra dependencies. By default it binds to 127.0.0.1, so the UI is reachable only from your own machine.

hearsay web                      # → http://localhost:8756
hearsay web --port 9000          # custom port
hearsay web --host 0.0.0.0       # expose on your LAN (unauthenticated — careful)
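To make the difference between the default host and --host 0.0.0.0 concrete, here is a minimal standard-library server sketch. It is not hearsay's web UI: `PAGE`, `Handler`, and `make_server` are hypothetical names, and only the 127.0.0.1 default and port 8756 are taken from the text.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAGE = b"<!doctype html><title>demo</title><p>local only</p>"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same single page for every path.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        # Keep the console quiet.
        pass


def make_server(host="127.0.0.1", port=8756):
    """Bind to the loopback address by default.

    Passing host="0.0.0.0" listens on every interface instead, which
    makes the page reachable from other machines on the network.
    """
    return ThreadingHTTPServer((host, port), Handler)
```

Calling `make_server().serve_forever()` would serve the page until interrupted. A server bound to 127.0.0.1 accepts connections only from the same machine.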

1. Run hearsay web and open the printed URL.
2. Paste a YouTube link (or click Attach to upload a file).
3. Optionally tick Force transcription, toggle VAD, or pick a model.
4. Hit send; the transcript renders with Copy / Download buttons.

Tick Dataset to build a training dataset instead (set clip length and sample rate); it downloads as a .zip. Single video URLs and file uploads go through the UI; for playlists and podcast feeds, use the CLI (the UI shows a friendly hint).

What you get

---
title: "You Would Be a Terrible Leader"
source: "https://www.youtube.com/watch?v=rStL7niR7gs"
channel: "CGP Grey"
duration: "00:18:13"
ingested: "2026-06-13T10:00:00Z"
method: "captions"
language: "en"
---

# You Would Be a Terrible Leader

## [00:00:00 – 00:05:21]

**[00:00:00]** Do you want to rule? Do you see the problems in your country and
know how to fix them? If only you had the power to do so. Well. You've come to
the right place. But, before we begin this lesson in political power, ask
yourself, why don't rulers see as clearly as you...

The method field records exactly how the text was produced — captions, captions-auto, whisper-small, parakeet-tdt-0.6b-v3, etc. — so a downstream consumer can tell a human transcript from a machine one.
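A downstream consumer could read that field with a few lines of standard-library Python. This is a sketch under assumptions: `read_front_matter` and `is_human_transcript` are hypothetical helpers, the parser handles only flat `key: "value"` lines like the example above, and treating the exact value `captions` as the human-written case is an inference from the list of values, not documented behaviour.

```python
def read_front_matter(markdown):
    """Parse flat key: "value" lines between the leading --- markers."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so URLs and times stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields


def is_human_transcript(method):
    # Assumption: only "captions" marks uploader-provided captions;
    # captions-auto, whisper-*, and parakeet-* are machine-generated.
    return method == "captions"
```

For real-world files with nested or multi-line values, a full YAML parser would be the safer choice.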

Pass --json for a sidecar matching the Transcript schema: metadata plus chunks[], each with start_s, end_s, section, and text — ready to embed.
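As a rough illustration of consuming the sidecar, the sketch below turns its chunks into records for an embedding step. The chunk fields (`start_s`, `end_s`, `section`, `text`) come from the description above; the top-level `chunks` key layout, the `chunks_for_embedding` name, and the added `duration_s` field are assumptions.

```python
import json


def chunks_for_embedding(sidecar_text):
    """Flatten sidecar chunks into records an embedding step could consume."""
    doc = json.loads(sidecar_text)
    records = []
    for chunk in doc["chunks"]:
        records.append({
            "text": chunk["text"],
            "section": chunk["section"],
            "start_s": chunk["start_s"],
            "end_s": chunk["end_s"],
            # Derived convenience field, not part of the sidecar itself.
            "duration_s": round(chunk["end_s"] - chunk["start_s"], 3),
        })
    return records
```

Keeping `start_s` alongside each text lets a retrieval result link back to the exact moment in the source media.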

Build training datasets

The markdown above is for reading (RAG, humans). hearsay dataset is a second, separate mode that turns the same media into ML training datasets for TTS and STT — audio sliced into short clips paired with exact, verbatim transcripts, in standard layouts a training pipeline reads directly.

# A short video → a dataset folder (LJSpeech metadata.csv + NeMo manifest.jsonl + wavs/)
hearsay dataset "https://youtu.be/VIDEO_ID" --out ./voice-data

# A whole playlist / channel / podcast feed → one merged dataset
hearsay dataset "https://example.com/feed.xml" --out ./speech-data

# 16 kHz mono for ASR, custom clip length, also emit a HuggingFace audiofolder index
hearsay dataset talk.mp3 --sample-rate 16000 --segment-min 2 --segment-max 12 --format hf

You get a folder like:

voice-data/
  wavs/VIDEO_ID_0001.wav …     # mono 16-bit PCM, cut on sentence/pause boundaries, never mid-word
  metadata.csv                  # LJSpeech: id|text|text  (Coqui / Piper read this directly)
  manifest.jsonl                # NeMo/ESPnet: {"audio_filepath","duration","text","offset"}
  dataset_card.md               # provenance, counts, language + a rights/consent note
  dropped.jsonl                 # every filtered-out clip, with the reason

Word-accurate cuts. Clips are sliced on word-level timestamps (faster-whisper word_timestamps, or Parakeet on Apple Silicon) — never mid-word.
Quality filtering (on by default) drops junk: too-short/long, internal silence, wrong-script/odd speaking-rate text, repetition, and low ASR confidence. Each drop is logged; --no-filter keeps everything.
Single-voice TTS from multi-speaker audio needs diarization: uv tool install "hearsay[diarize]", accept the pyannote model conditions and set HF_TOKEN, then --dominant-speaker (keep the host) or --per-speaker (one index per speaker). Without it, datasets are mixed-speaker (fine for STT) and hearsay says so.
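Before training, a quick sanity check of manifest.jsonl can help. The sketch below totals clips and audio duration from the `duration` and `text` fields listed in the folder layout above; the `summarize_manifest` name and the words-per-second figure are illustrative additions, not hearsay output.

```python
import json


def summarize_manifest(lines):
    """Summarize an iterable of manifest.jsonl lines (for example, an open file)."""
    clips = 0
    seconds = 0.0
    words = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        clips += 1
        seconds += row["duration"]
        words += len(row["text"].split())
    return {
        "clips": clips,
        "hours": seconds / 3600,
        "words_per_second": words / seconds if seconds else 0.0,
    }
```

With a real dataset this would be called as `summarize_manifest(open("voice-data/manifest.jsonl", encoding="utf-8"))`.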

Accuracy & rights. Word boundaries from Whisper/Parakeet are good but not phonetically exact — clips are padded and snapped to pauses, and you should spot-check. You are responsible for the rights to any media you process and for voice consent (cloning a real person's voice may require it); extracting audio from YouTube may breach its Terms. hearsay is local and ships no datasets. Informational, not legal advice — see each generated dataset_card.md.

How it compares

| Feature | hearsay | DIY `yt-dlp` + Whisper | markitdown / docling |
| --- | --- | --- | --- |
| Input | video & audio | video & audio (you wire it) | documents (pdf/docx/pptx) |
| One command | ✅ | ❌ multi-step plumbing | ✅ (for docs) |
| Captions-first (no download) | ✅ | ✗ usually re-transcribes | n/a |
| Timestamps + paragraph grouping | ✅ readable | ✗ raw segments | n/a |
| Chapters → sections | ✅ | ✗ manual | n/a |
| Podcasts · playlists · batch | ✅ | ✗ manual | ✗ |
| Fast Apple-Silicon engine | ✅ Parakeet (MLX) | ✗ DIY | n/a |
| JSON sidecar for RAG | ✅ stable schema | ✗ manual | varies |
| TTS/STT dataset export | ✅ LJSpeech + JSONL, filtered, diarizable | ✗ DIY plumbing | ✗ |
| Browser UI + MCP server | ✅ | ✗ | varies |

hearsay does media; document tools like markitdown and docling do documents. Use both.

Give your agent ears

hearsay ships an MCP server so AI agents can ingest media themselves. It exposes two tools — ingest_url(url, transcribe?, lang?) and ingest_file(path) — that each return clean, timestamped markdown.

uv tool install "hearsay[mcp]"
hearsay mcp                      # stdio MCP server (Ctrl-C to stop)

Claude Code:

claude mcp add hearsay -- hearsay mcp

or add to .mcp.json (project) / ~/.claude.json (user):

{
  "mcpServers": {
    "hearsay": {
      "type": "stdio",
      "command": "hearsay",
      "args": ["mcp"]
    }
  }
}

Claude Desktop — add to claude_desktop_config.json (Settings → Developer → Edit Config; macOS: ~/Library/Application Support/Claude/, Windows: %APPDATA%\Claude\):

{
  "mcpServers": {
    "hearsay": {
      "type": "stdio",
      "command": "hearsay",
      "args": ["mcp"],
      "env": {
        "HEARSAY_MODEL": "auto"
      }
    }
  }
}

If hearsay is not on the host's PATH, use the absolute path (which hearsay), or "command": "python", "args": ["-m", "hearsay", "mcp"].
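The absolute-path fallback can be automated when generating the config. The sketch below builds the same `mcpServers` entry shown above and resolves the command with `shutil.which`; the `mcp_entry` helper is hypothetical and only produces the dictionary, leaving the choice of config file to you.

```python
import json
import shutil


def mcp_entry(command="hearsay", model=None):
    """Build an mcpServers config entry, preferring an absolute command path."""
    resolved = shutil.which(command)
    entry = {
        "type": "stdio",
        # Fall back to the bare name if the command is not on this PATH.
        "command": resolved or command,
        "args": ["mcp"],
    }
    if model is not None:
        entry["env"] = {"HEARSAY_MODEL": model}
    return {"mcpServers": {"hearsay": entry}}


def mcp_entry_json(command="hearsay", model=None):
    """Render the entry as indented JSON, ready to paste into a config file."""
    return json.dumps(mcp_entry(command, model), indent=2)
```

Writing the resolved path into the config avoids surprises when the host application starts with a different PATH than your shell.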

Server configuration (env vars, since MCP tool signatures are fixed):

| Variable | Default | Effect |
| --- | --- | --- |
| `HEARSAY_MODEL` | `auto` | `auto`, `parakeet`, `parakeet-en`, or a Whisper size (`tiny`…`large-v3`) |
| `HEARSAY_LANG` | (unset) | Language code; when unset, captions default to English and transcription auto-detects the language |
| `HEARSAY_VAD` | `1` | Voice-activity filter (Whisper); set `0` for music/songs |
| `HEARSAY_PARAKEET_MODEL` | (unset) | Override the Parakeet MLX repo id (advanced) |
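One way a server could read these variables is sketched below. The variable names and defaults are the ones in the table; the `ServerConfig` and `load_config` names are hypothetical, and treating any `HEARSAY_VAD` value other than `0` as enabled is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerConfig:
    model: str
    lang: Optional[str]
    vad: bool
    parakeet_model: Optional[str]


def load_config(env=None):
    """Read server settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return ServerConfig(
        model=env.get("HEARSAY_MODEL", "auto"),
        # Empty strings are treated the same as unset.
        lang=env.get("HEARSAY_LANG") or None,
        # Assumption: only the literal "0" disables the filter.
        vad=env.get("HEARSAY_VAD", "1") != "0",
        parakeet_model=env.get("HEARSAY_PARAKEET_MODEL") or None,
    )
```

Accepting a mapping makes the function easy to exercise without touching the real environment.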

Speech vs. music: hearsay is tuned for spoken audio (podcasts, talks, interviews, meetings), where transcription is accurate. For music, pass --no-vad so the vocals aren't discarded — but expect a rough, approximate lyric transcript, since these are speech models, not lyrics transcribers.

CLI reference

hearsay <SOURCE> [options]      SOURCE = YouTube video/playlist URL, podcast RSS, or local file

  -o, --output PATH    Output file for a single source (default ./<id>.md)
  --output-dir PATH    Output directory for batch (playlist/feed) ingestion (default ./hearsay-out)
  --lang CODE          Language: captions default to English; transcription auto-detects
  --transcribe         Force local transcription even when captions exist
  --model MODEL        auto (default) | parakeet | parakeet-en | tiny | base | small | medium | large-v3
  --no-vad             Disable voice-activity filtering (Whisper; use for music/songs)
  --json               Also write a .json sidecar (Transcript schema)
  --latest             Batch: ingest only the most recent item
  --episode N          Batch: ingest only item N (1-indexed)
  --all [--limit N]    Batch: ingest all items (optionally capped)

hearsay dataset <SOURCE> [options]   Build a TTS/STT training dataset
  --out PATH           Dataset output directory (default ./hearsay-dataset)
  --format FMT         ljspeech | jsonl | hf (repeatable; default ljspeech + jsonl)
  --sample-rate HZ     Output WAV rate (default 22050; 16000 for ASR)
  --segment-min/max S  Clip length bounds in seconds (default 1–15)
  --model / --lang / --vad / --no-vad    Transcription (as above)
  --normalize          EBU R128 loudness-normalize each clip
  --no-filter          Keep every clip (skip the quality filters)
  --diarize            Label speakers (needs hearsay[diarize] + HF_TOKEN)
  --per-speaker        Diarize and emit a per-speaker index
  --dominant-speaker   Diarize and keep only the most-spoken speaker
  --limit N            Batch: cap items from a playlist/feed

hearsay web            Run the local web UI (--host, --port)
hearsay mcp            Run the MCP stdio server
hearsay --version      Show the version
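When driving the CLI from a Python pipeline, building the argument list in one place keeps calls consistent. This sketch uses only flags from the reference above; the `hearsay_argv` and `run_hearsay` helpers are hypothetical, not part of the package.

```python
import subprocess


def hearsay_argv(source, *, output=None, transcribe=False, model=None,
                 no_vad=False, json_sidecar=False):
    """Build the argument list for a single-source hearsay call."""
    argv = ["hearsay", source]
    if output:
        argv += ["--output", output]
    if transcribe:
        argv.append("--transcribe")
    if model:
        argv += ["--model", model]
    if no_vad:
        argv.append("--no-vad")
    if json_sidecar:
        argv.append("--json")
    return argv


def run_hearsay(source, **options):
    """Run hearsay and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(hearsay_argv(source, **options), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with URLs that contain `?` or `&`.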

Requirements

Python 3.11+
ffmpeg on your PATH. hearsay decodes most audio/video directly (faster-whisper bundles its own decoder), but ffmpeg is the safe baseline and is used for some yt-dlp format merges.

| OS | Install ffmpeg |
| --- | --- |
| macOS (Homebrew) | `brew install ffmpeg` |
| Debian / Ubuntu | `sudo apt install ffmpeg` |
| Fedora | `sudo dnf install ffmpeg` |
| Arch | `sudo pacman -S ffmpeg` |
| Windows (winget) | `winget install Gyan.FFmpeg` |
| Windows (Chocolatey) | `choco install ffmpeg` |

The first transcription downloads the chosen model once (Whisper: tens of MB to ~1.5 GB; Parakeet v3: ~2.5 GB), then caches it for offline use.

Apple Silicon speed: the parakeet extra (uv tool install "hearsay[parakeet]") runs NVIDIA Parakeet on MLX, transcribing ~3× faster than CPU Whisper (~24× realtime on an M1 Pro). It requires macOS on arm64; on other platforms hearsay uses CPU Whisper automatically.

Contributing

See CONTRIBUTING.md and the good first issues. hearsay does one thing well — media → great markdown — and aims to keep doing exactly that.

License

MIT

Project details

Release history

0.3.0 (this version): Jun 14, 2026
0.2.0: Jun 14, 2026
0.1.0: Jun 13, 2026

Download files

Source Distribution

hearsay-0.3.0.tar.gz (84.9 kB)

Uploaded Jun 14, 2026 Source

Built Distribution

hearsay-0.3.0-py3-none-any.whl (96.3 kB)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file hearsay-0.3.0.tar.gz.

File metadata

Download URL: hearsay-0.3.0.tar.gz
Upload date: Jun 14, 2026
Size: 84.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.17

File hashes

Hashes for hearsay-0.3.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `3600b5018cb9798e7178fa4b5bc96e6b7c29382aa2ff142495e5d7e05d428991` |
| MD5 | `410218c88a7333bcaa8ed09282abcbaf` |
| BLAKE2b-256 | `48fe72c9c2b317be8429db45ead46690ebe8d9f9cde8670ef53cc821f34976fb` |

File details

Details for the file hearsay-0.3.0-py3-none-any.whl.

File metadata

Download URL: hearsay-0.3.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 96.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.17

File hashes

Hashes for hearsay-0.3.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `5807fdac66d267201a4e59c9ff6ca993a4c5807ca9a8e1338310705d1838fcd3` |
| MD5 | `f0df1874cd2c1f4f747703e5e8bb3b18` |
| BLAKE2b-256 | `dc9ff46b26e68e23312b133fdce76b3e04fba1a7f626be1f1090b26f0381e060` |
