YouTube video metadata, transcript, and media fetcher

tubefetch

yt-fetch

A Python CLI and library that fetches and extracts structured metadata and transcripts from YouTube videos, producing LLM-ready plain text, content hashes for change detection, and unified video bundles with batch processing, caching, and retry logic.

Given one or more video IDs, URLs, playlists, or channels, yt-fetch produces normalized metadata, transcripts, and optional media in formats suited to downstream AI/LLM pipelines (summarization, fact-checking, RAG, search indexing, and similar). Output includes content hashes for change detection, optional token count estimates, and unified video bundles. The tool works as both a CLI and a library, with batch processing, caching, rate limiting, and configurable retries via gentlify.
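
To illustrate how a content hash supports change detection, here is a minimal sketch. The function name `content_hash`, the choice of SHA-256, and the whitespace normalization are assumptions made for this example; the hashing scheme yt-fetch actually uses is not described here.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace-normalized text (hypothetical helper)."""
    # Normalize Unicode and collapse whitespace so cosmetic differences
    # (line wrapping, trailing spaces) do not change the hash.
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


old = content_hash("Never gonna give you up\n")
new = content_hash("Never  gonna give you up")
changed = old != new  # False here: the two strings differ only in whitespace
```

Comparing a stored digest with a freshly computed one is enough to decide whether a downstream step (re-summarizing, re-indexing) needs to run again.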

Features

  • Metadata — title, channel, duration, tags, upload date via yt-dlp (or YouTube Data API v3)
  • Transcripts — fetched via youtube-transcript-api with language preference and fallback
  • Media — optional video/audio download via yt-dlp
  • Export formats — JSON, plain text, WebVTT (.vtt), SubRip (.srt)
  • Batch processing — concurrent workers with per-video error isolation
  • Caching — skip already-fetched data; selective --force overrides
  • Retry — powered by gentlify with exponential backoff and jitter on transient errors
  • Rate limiting — token bucket algorithm, shared across workers
  • CLI + Library — use from the command line or import as a Python package
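
The token bucket idea named in the rate limiting bullet can be sketched as follows. This is a generic illustration, not yt-fetch's implementation; the class name `TokenBucket` and its parameters are assumptions.

```python
import threading
import time


class TokenBucket:
    """Thread-safe token bucket: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()
        self._lock = threading.Lock()  # lets several workers share one bucket

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            refill = (now - self._last) * self.rate
            self._tokens = min(self.capacity, self._tokens + refill)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def acquire(self, poll: float = 0.01) -> None:
        """Block (by polling) until a token is available."""
        while not self.try_acquire():
            time.sleep(poll)
```

With `TokenBucket(rate=2.0, capacity=2.0)`, workers that call `acquire()` before each request are held to roughly two requests per second in total, because they all draw from the same bucket.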

Installation

Requires Python 3.14+.

pip install tubefetch

For YouTube Data API v3 support (optional):

pip install tubefetch[youtube-api]

Note: The CLI command can be invoked as either yt_fetch or yt-fetch.

Quick Start

CLI

# Fetch metadata + transcript for a single video
yt_fetch fetch --id dQw4w9WgXcQ

# Fetch with media download
yt_fetch fetch --id dQw4w9WgXcQ --download video

# Batch from a file
yt_fetch fetch --file video_ids.txt --out ./output --workers 3

# Transcript only
yt_fetch transcript --id dQw4w9WgXcQ --languages en,fr

# Metadata only
yt_fetch metadata --id dQw4w9WgXcQ

# Media only
yt_fetch media --id dQw4w9WgXcQ

Library API

from yt_fetch import fetch_video, fetch_batch, FetchOptions

# Single video
result = fetch_video("dQw4w9WgXcQ")
print(result.metadata.title)
print(result.transcript.segments[0].text)

# With options
opts = FetchOptions(out="./output", languages=["en", "fr"], download="audio")
result = fetch_video("dQw4w9WgXcQ", opts)

# Batch
results = fetch_batch(["dQw4w9WgXcQ", "abc12345678"], opts)
print(f"{results.succeeded}/{results.total} succeeded")
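
For readers curious how concurrent workers with per-video error isolation can be arranged, here is a minimal sketch using only the standard library. `fake_fetch` and `run_batch` are hypothetical stand-ins invented for this example and are not part of the yt_fetch API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(video_id: str) -> dict:
    """Hypothetical stand-in for a per-video fetch; fails on malformed IDs."""
    if len(video_id) != 11:
        raise ValueError(f"invalid video ID: {video_id!r}")
    return {"id": video_id}


def run_batch(video_ids, fetch=fake_fetch, workers=3):
    """Fetch every ID concurrently; one failure never aborts the others."""
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, vid): vid for vid in video_ids}
        for future in as_completed(futures):
            vid = futures[future]
            try:
                succeeded[vid] = future.result()
            except Exception as exc:  # isolate the error to this video
                failed[vid] = str(exc)
    return succeeded, failed


ok, bad = run_batch(["dQw4w9WgXcQ", "too-short"])
print(f"{len(ok)}/{len(ok) + len(bad)} succeeded")  # prints "1/2 succeeded"
```

Catching the exception where each future's result is read is what keeps a single bad ID from taking down the rest of the batch.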

Output Structure

out/
├── <video_id>/
│   ├── metadata.json
│   ├── transcript.json
│   ├── transcript.txt
│   ├── transcript.vtt
│   ├── transcript.srt
│   └── media/
│       ├── video.mp4
│       └── audio.m4a
└── summary.json
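
As an illustration of what goes into transcript.srt, the sketch below renders transcript segments as SubRip text. The segment shape (dicts with `start`, `duration`, and `text` keys) and the helper names are assumptions for this example; the schema of transcript.json is not documented here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments (assumed keys: start, duration, text) as SubRip text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["start"] + seg["duration"])
        # Each cue: sequence number, time range, text, then a blank line.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text']}\n")
    return "\n".join(blocks)


print(to_srt([{"start": 0.0, "duration": 1.5, "text": "hi"}]))
```

WebVTT output differs mainly in starting with a `WEBVTT` header line and using a period rather than a comma before the milliseconds.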

Configuration

Options are resolved in the following order; the first source that sets a value wins:

  1. CLI flags
  2. Environment variables (prefix YT_FETCH_)
  3. YAML config file (yt_fetch.yaml)
  4. Defaults
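
The precedence order above can be sketched with `collections.ChainMap`, which looks a key up in each layer in turn. The function `resolve_options` and the `DEFAULTS` dict are hypothetical and only mirror a few of the documented defaults; this is not how yt-fetch is known to load its configuration.

```python
from collections import ChainMap

# A few defaults taken from the CLI flags table; illustrative only.
DEFAULTS = {"out": "./out", "languages": ["en"], "retries": 3, "workers": 3}


def resolve_options(cli: dict, env: dict, yaml_cfg: dict) -> dict:
    """Merge option layers; earlier layers take precedence over later ones."""
    # Drop unset (None) values so they fall through to the next layer.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli, env, yaml_cfg)
    ]
    return dict(ChainMap(*layers, DEFAULTS))


options = resolve_options(
    cli={"retries": 5, "out": None},
    env={"out": "./output", "retries": 1},
    yaml_cfg={"workers": 8, "out": "./from-yaml"},
)
# options == {"out": "./output", "languages": ["en"], "retries": 5, "workers": 8}
```

Here `retries` comes from the CLI layer, `out` from the environment (the CLI left it unset), `workers` from the YAML layer, and `languages` from the defaults.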

CLI Flags

Flag                  Description                          Default
--id                  Video ID or URL (repeatable)
--file                Text/CSV file with IDs
--jsonl               JSONL file with IDs
--id-field            Field name in CSV/JSONL              id
--out                 Output directory                     ./out
--languages           Comma-separated language codes       en
--allow-generated     Allow auto-generated transcripts     true
--allow-any-language  Fall back to any language            false
--download            none, video, audio, both             none
--max-height          Max video height (e.g. 720)
--format              Video format                         best
--audio-format        Audio format                         best
--force               Force re-fetch everything            false
--force-metadata      Force re-fetch metadata only         false
--force-transcript    Force re-fetch transcript only       false
--force-media         Force re-download media only         false
--retries             Max retries per request              3
--rate-limit          Requests per second                  2.0
--workers             Parallel workers for batch           3
--fail-fast           Stop on first failure                false
--strict              Exit code 2 on partial failure       false
--verbose             Verbose output                       false

Environment Variables

All options can be set via environment variables with the YT_FETCH_ prefix:

export YT_FETCH_OUT=./output
export YT_FETCH_LANGUAGES=en,fr
export YT_FETCH_DOWNLOAD=video
export YT_FETCH_YT_API_KEY=your-api-key

YAML Config File

Create yt_fetch.yaml in the working directory:

out: ./output
languages:
  - en
  - fr
download: none
allow_generated: true
retries: 3
rate_limit: 2.0
workers: 3

Retry Configuration

yt-fetch uses gentlify for intelligent retry management with exponential backoff and jitter.

How Retries Work

  • Transient errors (rate limits, network errors, service errors) are automatically retried
  • Permanent errors (video not found, transcripts disabled) fail immediately without retry
  • Configurable attempts: Set --retries N to control max retry attempts (default: 3)
  • Disable retries: Set --retries 0 for external retry management (e.g., with your own gentlify configuration)
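
The retry behavior described above can be sketched in plain Python. This is not gentlify's API; the exception classes, `call_with_retries`, and its parameters are hypothetical, and the sketch assumes that `retries` counts extra attempts after the first call.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limits, network)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def call_with_retries(func, retries=3, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call func, retrying transient errors up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                raise  # retry budget exhausted
            # Exponential backoff with full jitter: wait a random time
            # between 0 and min(max_delay, base_delay * 2**attempt).
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
        # PermanentError (and any other exception) propagates immediately.
```

With `retries=0` the function makes exactly one attempt, which matches the idea of handing retry control to an outer layer.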

Examples

from yt_fetch import fetch_video, FetchOptions

# Default: 3 retry attempts
result = fetch_video("dQw4w9WgXcQ")

# Custom retry count
opts = FetchOptions(retries=5)
result = fetch_video("dQw4w9WgXcQ", opts)

# Disable internal retries (for external retry management)
opts = FetchOptions(retries=0)
result = fetch_video("dQw4w9WgXcQ", opts)

CLI:

# Custom retry count
yt_fetch fetch --id dQw4w9WgXcQ --retries 5

# Disable retries
yt_fetch fetch --id dQw4w9WgXcQ --retries 0

Exit Codes

Code  Meaning
0     Success (or partial failure without --strict)
1     Generic error (e.g. no IDs provided)
2     Partial failure with --strict
3     All videos failed
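
The table can be read as a small decision function. The sketch below is one interpretation of it; the name `exit_code` and its arguments are hypothetical, and it assumes "no IDs provided" is the only generic error of interest.

```python
def exit_code(total: int, failed: int, strict: bool = False) -> int:
    """Map batch results to the documented exit codes (illustrative)."""
    if total == 0:
        return 1  # generic error, e.g. no IDs provided
    if failed == 0:
        return 0  # complete success
    if failed == total:
        return 3  # all videos failed
    return 2 if strict else 0  # partial failure


print(exit_code(total=10, failed=2, strict=True))  # prints 2
```

In a shell script, checking for a non-zero status therefore only catches partial failures when --strict is set.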

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run unit tests
python -m pytest tests/

# Run with coverage
python -m pytest tests/ --cov=yt_fetch --cov-report=term-missing

# Run integration tests (requires network)
RUN_INTEGRATION=1 python -m pytest tests/integration/

License

MPL-2.0
