YouTube video metadata, transcript, and media fetcher
yt-fetch
A Python CLI and library that fetches and extracts structured metadata and transcripts from YouTube videos, producing LLM-ready plain text, content hashes for change detection, and unified video bundles with batch processing, caching, and retry logic.
yt-fetch is a Python tool that extracts structured, AI-ready content from YouTube videos. Given one or more video IDs, URLs, playlists, or channels, it produces normalized metadata, transcripts, and optional media in formats optimized for downstream AI/LLM pipelines (summarization, fact-checking, RAG, search indexing, etc.). It provides content hashes for change detection, optional token count estimates, and unified video bundles. The tool supports both CLI and library usage with batch processing, intelligent caching, configurable retries via gentlify, and rate limiting.
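Content hashes make change detection cheap: if the hash of a freshly fetched transcript or metadata record matches the stored one, downstream work such as re-summarizing or re-indexing can be skipped. The sketch below illustrates the idea with SHA-256 over canonical JSON. The helper name `content_hash` and the normalization rules are assumptions for illustration, not necessarily the scheme yt-fetch uses.

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Return a stable SHA-256 hex digest for a JSON-serializable payload.

    Hypothetical helper: yt-fetch's real hashing scheme may differ.
    """
    # Sorted keys and fixed separators give a canonical serialization, so
    # logically equal payloads hash identically regardless of key order.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


old = content_hash({"title": "Example", "tags": ["a", "b"]})
same = content_hash({"tags": ["a", "b"], "title": "Example"})
changed = content_hash({"title": "Example (edited)", "tags": ["a", "b"]})

print(old == same)     # key order does not affect the hash
print(old == changed)  # an edited title produces a different hash
```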
Features
- Metadata — title, channel, duration, tags, upload date via yt-dlp (or YouTube Data API v3)
- Transcripts — fetched via youtube-transcript-api with language preference and fallback
- Media — optional video/audio download via yt-dlp
- Export formats — JSON, plain text, WebVTT (.vtt), SubRip (.srt)
- Batch processing — concurrent workers with per-video error isolation
- Caching — skip already-fetched data; selective --force overrides
- Retry — powered by gentlify with exponential backoff and jitter on transient errors
- Rate limiting — token bucket algorithm, shared across workers
- CLI + Library — use from the command line or import as a Python package
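To make the rate-limiting feature concrete, here is a minimal sketch of a thread-safe token bucket that several workers could share. The class name `TokenBucket`, its parameters, and the injectable clock are assumptions for illustration; yt-fetch's internal implementation may look different.

```python
import threading
import time


class TokenBucket:
    """Thread-safe token bucket (illustrative; not yt-fetch's internal class)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._tokens = capacity   # start with a full bucket
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False instead of blocking."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self._last
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._last = now
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False


# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # third request is refused
fake_now[0] += 0.5  # half a second at 2 tokens/s refills one token
after_wait = bucket.try_acquire()
```

Because the lock guards both the refill and the withdrawal, every worker thread sees the same budget, which is what sharing the limiter across workers requires.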
Installation
Requires Python 3.14+.
pip install tubefetch
For YouTube Data API v3 support (optional):
pip install tubefetch[youtube-api]
Note: The CLI command can be invoked as either yt_fetch or yt-fetch.
Quick Start
CLI
# Fetch metadata + transcript for a single video
yt_fetch fetch --id dQw4w9WgXcQ
# Fetch with media download
yt_fetch fetch --id dQw4w9WgXcQ --download video
# Batch from a file
yt_fetch fetch --file video_ids.txt --out ./output --workers 3
# Transcript only
yt_fetch transcript --id dQw4w9WgXcQ --languages en,fr
# Metadata only
yt_fetch metadata --id dQw4w9WgXcQ
# Media only
yt_fetch media --id dQw4w9WgXcQ
Library API
from yt_fetch import fetch_video, fetch_batch, FetchOptions
# Single video
result = fetch_video("dQw4w9WgXcQ")
print(result.metadata.title)
print(result.transcript.segments[0].text)
# With options
opts = FetchOptions(out="./output", languages=["en", "fr"], download="audio")
result = fetch_video("dQw4w9WgXcQ", opts)
# Batch
results = fetch_batch(["dQw4w9WgXcQ", "abc12345678"], opts)
print(f"{results.succeeded}/{results.total} succeeded")
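Per-video error isolation in a batch means one failing video does not abort the others. The following sketch shows one common way to get that behavior with a thread pool. `run_batch` and `fake_fetch` are hypothetical stand-ins, not part of the yt_fetch API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(video_ids, fetch_one, workers=3):
    """Run fetch_one over video_ids in parallel, isolating per-video failures.

    Hypothetical helper for illustration only.
    """
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, vid): vid for vid in video_ids}
        for future in as_completed(futures):
            vid = futures[future]
            try:
                successes[vid] = future.result()
            except Exception as exc:
                # Record the failure and keep going: one bad video
                # must not abort the rest of the batch.
                failures[vid] = exc
    return successes, failures


def fake_fetch(video_id):
    """Stand-in for a real fetch; fails for one specific ID."""
    if video_id == "bad":
        raise ValueError("video not found")
    return f"ok:{video_id}"


successes, failures = run_batch(["a", "bad", "b"], fake_fetch)
print(f"{len(successes)}/{len(successes) + len(failures)} succeeded")
```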
Output Structure
out/
├── <video_id>/
│ ├── metadata.json
│ ├── transcript.json
│ ├── transcript.txt
│ ├── transcript.vtt
│ ├── transcript.srt
│ └── media/
│ ├── video.mp4
│ └── audio.m4a
└── summary.json
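The transcript.srt and transcript.vtt files carry the same cues but differ in timestamp punctuation: SubRip uses a comma before the milliseconds, WebVTT uses a period. The sketch below formats timestamps for both and builds a single SRT cue. The function names and the start/duration/text segment fields are assumptions for illustration; check transcript.json for the actual schema.

```python
def format_timestamp(seconds: float, fmt: str = "srt") -> str:
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    sep = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{millis:03d}"


def srt_cue(index: int, start: float, duration: float, text: str) -> str:
    """Build one numbered SRT cue from an assumed start/duration/text segment."""
    begin = format_timestamp(start, "srt")
    end = format_timestamp(start + duration, "srt")
    return f"{index}\n{begin} --> {end}\n{text}\n"


print(format_timestamp(3661.5, "srt"))
print(format_timestamp(1.25, "vtt"))
print(srt_cue(1, 0.0, 2.5, "Hello"))
```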
Configuration
Options are resolved in this order (first wins):
- CLI flags
- Environment variables (prefix YT_FETCH_)
- YAML config file (yt_fetch.yaml)
- Defaults
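The "first wins" order above can be sketched with `collections.ChainMap`, which looks a key up in each layer in turn and returns the first match. The helper names `env_layer` and `resolve`, and the convention that unset CLI flags arrive as `None`, are assumptions for illustration rather than yt-fetch's actual code.

```python
from collections import ChainMap

# Defaults taken from the CLI flag table; only a few are shown here.
DEFAULTS = {"out": "./out", "languages": ["en"], "retries": 3, "workers": 3}


def env_layer(environ, prefix="YT_FETCH_"):
    """Map YT_FETCH_FOO=bar to {"foo": "bar"}, ignoring unrelated variables."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def resolve(cli, environ, yaml_cfg):
    """Layer the sources so that the first layer defining a key wins."""
    # Assumption: flags the user did not pass are None and must not mask
    # lower-priority layers.
    cli_set = {key: value for key, value in cli.items() if value is not None}
    return ChainMap(cli_set, env_layer(environ), yaml_cfg, DEFAULTS)


settings = resolve(
    cli={"out": None, "retries": 5},
    environ={"YT_FETCH_OUT": "./env-out", "HOME": "/home/user"},
    yaml_cfg={"out": "./yaml-out", "workers": 8},
)
print(settings["out"], settings["retries"], settings["workers"], settings["languages"])
```

Note that environment values arrive as strings, so a real implementation would still need to coerce types (for example, splitting a comma-separated language list).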
CLI Flags
| Flag | Description | Default |
|---|---|---|
| --id | Video ID or URL (repeatable) | — |
| --file | Text/CSV file with IDs | — |
| --jsonl | JSONL file with IDs | — |
| --id-field | Field name in CSV/JSONL | id |
| --out | Output directory | ./out |
| --languages | Comma-separated language codes | en |
| --allow-generated | Allow auto-generated transcripts | true |
| --allow-any-language | Fall back to any language | false |
| --download | none, video, audio, both | none |
| --max-height | Max video height (e.g. 720) | — |
| --format | Video format | best |
| --audio-format | Audio format | best |
| --force | Force re-fetch everything | false |
| --force-metadata | Force re-fetch metadata only | false |
| --force-transcript | Force re-fetch transcript only | false |
| --force-media | Force re-download media only | false |
| --retries | Max retries per request | 3 |
| --rate-limit | Requests per second | 2.0 |
| --workers | Parallel workers for batch | 3 |
| --fail-fast | Stop on first failure | false |
| --strict | Exit code 2 on partial failure | false |
| --verbose | Verbose output | false |
Environment Variables
All options can be set via environment variables with the YT_FETCH_ prefix:
export YT_FETCH_OUT=./output
export YT_FETCH_LANGUAGES=en,fr
export YT_FETCH_DOWNLOAD=video
export YT_FETCH_YT_API_KEY=your-api-key
YAML Config File
Create yt_fetch.yaml in the working directory:
out: ./output
languages:
- en
- fr
download: none
allow_generated: true
retries: 3
rate_limit: 2.0
workers: 3
Retry Configuration
yt-fetch uses gentlify for intelligent retry management with exponential backoff and jitter.
How Retries Work
- Transient errors (rate limits, network errors, service errors) are automatically retried
- Permanent errors (video not found, transcripts disabled) fail immediately without retry
- Configurable attempts: Set --retries N to control max retry attempts (default: 3)
- Disable retries: Set --retries 0 for external retry management (e.g., with your own gentlify configuration)
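The retry rules above can be sketched as a small loop: transient errors are retried with exponentially growing, jittered delays, while permanent errors propagate immediately. This is a generic illustration and does not use gentlify's API. The exception classes, `call_with_retries`, and its parameters are all hypothetical, and it assumes `retries` counts attempts after the first one.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying: rate limits, network errors, service errors."""


class PermanentError(Exception):
    """Not worth retrying: video not found, transcripts disabled."""


def call_with_retries(func, retries=3, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call func, retrying TransientError up to `retries` extra times."""
    attempt = 0
    while True:
        try:
            return func()
        except TransientError:
            if attempt >= retries:
                raise  # retry budget exhausted
            # Exponential backoff with full jitter: pick a random delay
            # between 0 and the capped exponential ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
            attempt += 1
        # PermanentError is deliberately not caught, so it fails immediately.


calls, delays = [], []


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("rate limited")
    return "ok"


# Passing delays.append as `sleep` records the delays instead of waiting.
result = call_with_retries(flaky, retries=3, sleep=delays.append)
```

With `retries=0` the loop makes a single attempt and re-raises on the first transient error, which matches the idea of delegating retries to an external layer.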
Examples
from yt_fetch import fetch_video, FetchOptions
# Default: 3 retry attempts
result = fetch_video("dQw4w9WgXcQ")
# Custom retry count
opts = FetchOptions(retries=5)
result = fetch_video("dQw4w9WgXcQ", opts)
# Disable internal retries (for external retry management)
opts = FetchOptions(retries=0)
result = fetch_video("dQw4w9WgXcQ", opts)
CLI:
# Custom retry count
yt_fetch fetch --id dQw4w9WgXcQ --retries 5
# Disable retries
yt_fetch fetch --id dQw4w9WgXcQ --retries 0
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success (or partial failure without --strict) |
| 1 | Generic error (e.g. no IDs provided) |
| 2 | Partial failure with --strict |
| 3 | All videos failed |
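For scripts that wrap the CLI, the table can be read as a small decision function. The sketch below is one interpretation: the function name `exit_code` is hypothetical, and the choice that "all videos failed" takes precedence over --strict is an assumption the table does not spell out.

```python
def exit_code(total: int, failed: int, strict: bool = False) -> int:
    """Map a batch outcome to an exit code per the table above (illustrative)."""
    if total == 0:
        return 1  # generic error, e.g. no IDs provided
    if failed == total:
        return 3  # all videos failed (assumed to win over --strict)
    if failed > 0 and strict:
        return 2  # partial failure with --strict
    return 0      # success, or partial failure without --strict


print(exit_code(total=3, failed=1))               # partial failure, not strict
print(exit_code(total=3, failed=1, strict=True))  # partial failure, strict
```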
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run unit tests
python -m pytest tests/
# Run with coverage
python -m pytest tests/ --cov=yt_fetch --cov-report=term-missing
# Run integration tests (requires network)
RUN_INTEGRATION=1 python -m pytest tests/integration/
License
MPL-2.0