
YouTube video metadata, transcript, and media fetcher

tubefetch

A Python CLI and library that fetches and extracts structured metadata and transcripts from YouTube videos, producing LLM-ready plain text, content hashes for change detection, and unified video bundles with batch processing, caching, and retry logic.

Given one or more video IDs, URLs, playlists, or channels, TubeFetch produces normalized metadata, transcripts, and optional media in formats suited to downstream AI/LLM pipelines (summarization, fact-checking, RAG, search indexing, and so on). Results include content hashes for change detection and optional token count estimates, and can be packaged as unified video bundles. Both the CLI and the library support batch processing, caching, rate limiting, and configurable retries via gentlify.
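
To illustrate how a content hash supports change detection, the sketch below hashes a normalized transcript and compares the digest with one stored from an earlier run. The normalization rules and the names `content_hash` and `has_changed` are assumptions made for this example; TubeFetch's actual hashing scheme is not described here.

```python
import hashlib
import unicodedata
from typing import Optional


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of whitespace- and Unicode-normalized text."""
    # Normalize first so cosmetic differences (extra spaces, composed vs.
    # decomposed characters) do not register as content changes.
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(text: str, previous_hash: Optional[str]) -> bool:
    """True when no hash is stored yet or the content no longer matches it."""
    return previous_hash is None or content_hash(text) != previous_hash
```

A pipeline could store the digest next to its own derived artifacts (summaries, embeddings) and recompute them only when `has_changed` returns True.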

Features

  • Metadata — title, channel, duration, tags, upload date via yt-dlp (or YouTube Data API v3)
  • Transcripts — fetched via youtube-transcript-api with language preference and fallback
  • Media — optional video/audio download via yt-dlp
  • Export formats — JSON, plain text, WebVTT (.vtt), SubRip (.srt)
  • Batch processing — concurrent workers with per-video error isolation
  • Caching — skip already-fetched data; selective --force overrides
  • Retry — powered by gentlify with exponential backoff and jitter on transient errors
  • Rate limiting — token bucket algorithm, shared across workers
  • CLI + Library — use from the command line or import as a Python package
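
The token bucket algorithm named above can be sketched in a few lines. This is a generic illustration, not TubeFetch's implementation: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import threading
import time


class TokenBucket:
    """Thread-safe token bucket: `rate` tokens per second, up to `capacity` stored."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self._last
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def acquire(self, poll: float = 0.01) -> None:
        """Block until a token is available."""
        while not self.try_acquire():
            time.sleep(poll)
```

Sharing a single bucket instance across all workers is what makes the limit apply to the whole batch rather than to each worker separately.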

Installation

Requires Python 3.14+.

pip install tubefetch

For YouTube Data API v3 support (optional):

pip install tubefetch[youtube-api]

Note: The CLI takes video IDs or URLs as positional arguments. Run tubefetch VIDEO_ID for the default behavior (metadata + transcript), or use a specialized command (metadata, transcript, or media) to fetch only that content.

Quick Start

CLI

# Fetch a single video
tubefetch dQw4w9WgXcQ

# Multiple videos
tubefetch VIDEO_ID_1 VIDEO_ID_2 VIDEO_ID_3

# From a file
tubefetch --file video_ids.txt

# With media download
tubefetch VIDEO_ID --download video

# Batch from a file with 3 parallel workers
tubefetch --file video_ids.txt --workers 3

# Transcript only
tubefetch transcript dQw4w9WgXcQ --languages en,fr

# Metadata only
tubefetch metadata dQw4w9WgXcQ

# Media only (downloads video+audio by default)
tubefetch media dQw4w9WgXcQ

Specialized Commands

When you need only one kind of data:

# Metadata only
tubefetch metadata VIDEO_ID

# Transcript only
tubefetch transcript VIDEO_ID

# Media only
tubefetch media VIDEO_ID

Library API

from tubefetch import fetch_video, fetch_batch, FetchOptions

# Single video
result = fetch_video("dQw4w9WgXcQ")
print(result.metadata.title)
print(result.transcript.segments[0].text)

# With options
opts = FetchOptions(out="./output", languages=["en", "fr"], download="audio")
result = fetch_video("dQw4w9WgXcQ", opts)

# Batch
results = fetch_batch(["dQw4w9WgXcQ", "abc12345678"], opts)
print(f"{results.succeeded}/{results.total} succeeded")
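
Per-video error isolation means one failing video does not abort the rest of the batch. The sketch below shows one way to get that behavior with a thread pool. It is not TubeFetch's code: `BatchSummary`, `run_batch`, and the `fetch_one` callable are hypothetical names, and only the `succeeded` and `total` fields mirror the example above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field


@dataclass
class BatchSummary:
    total: int = 0
    succeeded: int = 0
    errors: dict = field(default_factory=dict)  # video_id -> error message


def run_batch(video_ids, fetch_one, workers=3):
    """Run fetch_one(video_id) concurrently; record failures instead of raising."""
    summary = BatchSummary(total=len(video_ids))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, vid): vid for vid in video_ids}
        for future in as_completed(futures):
            vid = futures[future]
            try:
                future.result()
                summary.succeeded += 1
            except Exception as exc:  # keep the failure local to this video
                summary.errors[vid] = str(exc)
    return summary
```

Because exceptions are caught per future, the summary reports both the successes and the reason for each failure once every video has been attempted.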

Output Structure

out/
├── <video_id>/
│   ├── metadata.json
│   ├── transcript.json
│   ├── transcript.txt
│   ├── transcript.vtt
│   ├── transcript.srt
│   └── media/
│       ├── video.mp4
│       └── audio.m4a
└── summary.json
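
For readers unfamiliar with the SubRip format behind transcript.srt, here is a minimal sketch of how timed segments map to SRT cues. It assumes each segment is a (start, duration, text) tuple with times in seconds; that shape, and the function names `srt_timestamp` and `to_srt`, are assumptions for illustration, since the transcript.json schema is not shown here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, duration, text) tuples as SubRip cues numbered from 1."""
    cues = []
    for index, (start, duration, text) in enumerate(segments, start=1):
        end = start + duration
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

WebVTT is closely related: the file begins with a WEBVTT header line and timestamps use a period rather than a comma before the milliseconds.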

Configuration

Options are resolved in this order (first wins):

  1. CLI flags
  2. Environment variables (prefix TUBEFETCH_)
  3. YAML config file (tubefetch.yaml)
  4. Defaults
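
The "first wins" order above can be modeled with `collections.ChainMap`, which looks keys up in each mapping in turn. The sketch is illustrative only: `resolve_options` and the dictionary inputs are hypothetical, and the default values are copied from the CLI flags table.

```python
from collections import ChainMap

# Defaults taken from the CLI flags table in this README.
DEFAULTS = {"out": "./out", "retries": 3, "rate_limit": 2.0, "workers": 3}


def resolve_options(cli: dict, env: dict, yaml_cfg: dict) -> dict:
    """Merge option sources; earlier sources take precedence over later ones."""
    # Drop unset (None) values so they do not shadow lower-priority sources.
    layers = [
        {key: value for key, value in source.items() if value is not None}
        for source in (cli, env, yaml_cfg)
    ]
    return dict(ChainMap(*layers, DEFAULTS))
```

For example, a retries value given on the command line overrides one set in the environment, while an option absent from every source falls back to its default.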

CLI Flags

| Flag | Description | Default |
|------|-------------|---------|
| --id | Video ID or URL (repeatable) | |
| --file | Text/CSV file with IDs | |
| --jsonl | JSONL file with IDs | |
| --id-field | Field name in CSV/JSONL | id |
| --out | Output directory | ./out |
| --languages | Comma-separated language codes | en |
| --allow-generated | Allow auto-generated transcripts | true |
| --allow-any-language | Fall back to any language | false |
| --download | none, video, audio, both | none |
| --max-height | Max video height (e.g. 720) | |
| --format | Video format | best |
| --audio-format | Audio format | best |
| --force | Force re-fetch everything | false |
| --force-metadata | Force re-fetch metadata only | false |
| --force-transcript | Force re-fetch transcript only | false |
| --force-media | Force re-download media only | false |
| --retries | Max retries per request | 3 |
| --rate-limit | Requests per second | 2.0 |
| --workers | Parallel workers for batch | 3 |
| --fail-fast | Stop on first failure | false |
| --strict | Exit code 2 on partial failure | false |
| --verbose | Verbose output | false |
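
The interaction between caching and the --force family of flags can be pictured as a per-artifact decision: fetch when the cached file is missing, or when a matching force flag is set. The sketch below is an assumption about that logic, not TubeFetch's code; `should_fetch` and `plan` are hypothetical helpers, and only the file names come from the output structure above.

```python
from pathlib import Path


def should_fetch(path: Path, force_all: bool = False, force_this: bool = False) -> bool:
    """Fetch when the cached file is missing or a force flag applies."""
    return force_all or force_this or not path.exists()


def plan(video_dir: Path, force=False, force_metadata=False, force_transcript=False) -> dict:
    """Decide, per artifact, whether a fetch is needed for one video directory."""
    return {
        "metadata": should_fetch(video_dir / "metadata.json", force, force_metadata),
        "transcript": should_fetch(video_dir / "transcript.json", force, force_transcript),
    }
```

With a cached metadata.json and no transcript on disk, `plan` would request only the transcript unless --force or --force-metadata is given.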

Environment Variables

All options can be set via environment variables with the TUBEFETCH_ prefix:

export TUBEFETCH_OUT=./output
export TUBEFETCH_LANGUAGES=en,fr
export TUBEFETCH_DOWNLOAD=video
export TUBEFETCH_YT_API_KEY=your-api-key
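
As a rough picture of how prefixed variables could map to option names, the sketch below strips the TUBEFETCH_ prefix, lowercases the remainder, and splits the comma-separated language list. The function name `options_from_env` and the parsing rules are assumptions for illustration; TubeFetch's own parsing may differ.

```python
import os

PREFIX = "TUBEFETCH_"


def options_from_env(environ=None) -> dict:
    """Collect TUBEFETCH_* variables into lowercase option names."""
    environ = os.environ if environ is None else environ
    options = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue
        key = name[len(PREFIX):].lower()
        # Assumed convention: the languages option is a comma-separated list.
        options[key] = value.split(",") if key == "languages" else value
    return options
```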

YAML Config File

Create tubefetch.yaml in the working directory:

out: ./output
languages:
  - en
  - fr
download: none
allow_generated: true
retries: 3
rate_limit: 2.0
workers: 3

Retry Configuration

tubefetch uses gentlify to manage retries with exponential backoff and jitter.

How Retries Work

  • Transient errors (rate limits, network errors, service errors) are automatically retried
  • Permanent errors (video not found, transcripts disabled) fail immediately without retry
  • Configurable attempts: Set --retries N to control max retry attempts (default: 3)
  • Disable retries: Set --retries 0 for external retry management (e.g., with your own gentlify configuration)
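
The policy above (retry transient errors with exponential backoff and jitter, fail fast on permanent ones) can be sketched generically as follows. This is not gentlify's API or TubeFetch's code: the exception classes, `with_retries`, and the delay parameters are hypothetical names chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying: rate limits, network errors, service errors."""


class PermanentError(Exception):
    """Not worth retrying: video not found, transcripts disabled."""


def with_retries(func, retries=3, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); retry transient failures with exponential backoff and full jitter."""
    attempt = 0
    while True:
        try:
            return func()
        except TransientError:
            if attempt >= retries:
                raise  # retries exhausted (retries=0 means a single attempt)
            # Double the delay cap each attempt, then wait a random time up to it.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
            attempt += 1
        # PermanentError (and anything else) propagates immediately.
```

The random wait spreads out retries from concurrent workers so they do not all hit the service again at the same instant.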

Examples

from tubefetch import fetch_video, FetchOptions

# Default: 3 retry attempts
result = fetch_video("dQw4w9WgXcQ")

# Custom retry count
opts = FetchOptions(retries=5)
result = fetch_video("dQw4w9WgXcQ", opts)

# Disable internal retries (for external retry management)
opts = FetchOptions(retries=0)
result = fetch_video("dQw4w9WgXcQ", opts)

CLI:

# Custom retry count
tubefetch dQw4w9WgXcQ --retries 5

# Disable retries
tubefetch dQw4w9WgXcQ --retries 0

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success (or partial failure without --strict) |
| 1 | Generic error (e.g. no IDs provided) |
| 2 | Partial failure with --strict |
| 3 | All videos failed |
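
Read as a decision procedure, the exit codes above could be computed like this. The function `exit_code` and its parameters are hypothetical; only the code values and their meanings come from the table.

```python
def exit_code(total: int, failed: int, strict: bool = False) -> int:
    """Map batch results to the exit codes listed in the table."""
    if total == 0:
        return 1  # generic error, e.g. no IDs provided
    if failed == total:
        return 3  # all videos failed
    if failed > 0 and strict:
        return 2  # partial failure with --strict
    return 0  # success, or partial failure without --strict
```

Scripts that must treat any failed video as an error should pass --strict and check for a non-zero status.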

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run unit tests
python -m pytest tests/

# Run with coverage
python -m pytest tests/ --cov=tubefetch --cov-report=term-missing

# Run integration tests (requires network)
RUN_INTEGRATION=1 python -m pytest tests/integration/

License

MPL-2.0

