
Production-grade YouTube transcript extractor: single videos, batches, playlists, and entire channels. v2 adds IP-block-resistant multi-backend cascade.


yt-transcript-pro v2.0

Python · License: MIT · Tests: 130 passing

The most advanced, production-grade YouTube transcript extractor. Single videos, multi-video batches, full playlists, or entire channels, with concurrency, retries, checkpointing, six output formats, and (new in v2) four interchangeable extraction backends with an automatic cascade fallback that bypasses IP blocks without needing proxies.


🚀 What's new in v2.0

The old v1 used youtube-transcript-api exclusively, which hits YouTube's /api/timedtext endpoint directly. That endpoint is the first thing YouTube rate-limits: after ~250 rapid requests the IP gets RequestBlocked / IpBlocked errors for 1–24 hours.

v2 ships four independent backends that use completely different endpoints, and an auto backend that cascades through them per video:

| Backend | Endpoint | Resilience | Speed |
|---|---|---|---|
| auto (default) | cascades watch → ytdlp → api | ⭐⭐⭐⭐⭐ | fast |
| watch | GET /watch?v=<id> HTML scrape | ⭐⭐⭐⭐ | fastest |
| ytdlp | yt-dlp player API with 7-client rotation | ⭐⭐⭐⭐ | fast |
| api (legacy v1) | youtube-transcript-api | ⭐⭐ | fast |

Why this works without proxies: each backend hits a different YouTube surface with different rate-limit heuristics. When one backend is throttled, the cascade automatically switches to the next, and each one has its own independent adaptive back-off (bursts of rapid requests trigger per-backend slowdowns, so the other backends keep flowing).

Empirically verified on cloud IPs where youtube-transcript-api gets blocked within seconds: v2 keeps producing transcripts.
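The cascade-plus-per-backend-back-off idea can be pictured in a few lines. This is an illustrative sketch only: `Backend`, `cascade_fetch`, and the back-off constants are hypothetical, not the library's internals.

```python
import time

class Backend:
    """Toy backend with its own adaptive back-off state (hypothetical)."""
    def __init__(self, name, fetch):
        self.name = name
        self.fetch = fetch        # callable: video_id -> transcript text
        self.delay = 0.0          # grows when throttled, shrinks on success

    def try_fetch(self, video_id):
        time.sleep(self.delay)    # cooperative slowdown for this backend only
        try:
            text = self.fetch(video_id)
        except Exception:
            self.delay = min(60.0, self.delay * 2 + 1)  # exponential back-off
            raise
        self.delay = max(0.0, self.delay * 0.5)         # recover gradually
        return text

def cascade_fetch(backends, video_id):
    """Try each backend in order; return the first transcript, else None."""
    for backend in backends:
        try:
            return backend.try_fetch(video_id)
        except Exception:
            continue              # this surface is throttled; try the next
    return None
```

The key property is that each `Backend` carries its own `delay`, so a throttled backend slows itself down without stalling the others.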

Other v2 improvements

  • ๐Ÿ›ก๏ธ Windows Unicode fix โ€“ no more cp1252 crashes on the โœ“/โœ— progress glyphs.
  • ๐Ÿ“ Incremental combined-file writes โ€“ partial runs produce durable output; Ctrl-C doesn't lose data.
  • ๐Ÿ”„ Per-backend adaptive throttling โ€“ cooperative sleeps slow just the backend being throttled, not the whole pool.
  • ๐ŸŽฏ 7-client rotation for ytdlp: android โ†’ android_vr โ†’ tv_simply โ†’ tv_embedded โ†’ mweb โ†’ web โ†’ ios. Each client has its own rate-limit pool and user-agent fingerprint.
  • ๐Ÿช Cookies.txt support for age-restricted / private videos (--cookies cookies.txt).
  • ๐Ÿ•ต๏ธ Rotating modern User-Agents (Chrome, Firefox, Safari, Edge, Android Chrome).
  • ๐Ÿ“Š Non-TTY / nohup logging โ€“ progress is still visible in log files.
  • ๐Ÿงช 130 unit tests, zero network access required to run the suite.
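The incremental combined-file writes amount to an append-and-flush per finished video, so an interrupted run keeps everything written so far. A sketch under assumed file layout, not the actual writers.py implementation:

```python
import os

def append_combined(path, video_id, text):
    """Append one finished transcript and make it durable immediately,
    so a Ctrl-C mid-run loses at most the video currently in flight."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"===== {video_id} =====\n{text}\n\n")
        f.flush()
        os.fsync(f.fileno())   # durable even on abrupt termination
```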

⚡ Install

# Clone / unzip and install
cd yt-transcript-pro
pip install -e ".[dev]"

Requires Python ≥ 3.9. Dependencies: yt-dlp, youtube-transcript-api, pydantic, tenacity, typer, rich.


📖 Quickstart

Entire channel → one combined text file

# This is the exact command that extracts all 665 InnerCircleTrader videos:
yttp extract "https://www.youtube.com/@InnerCircleTrader" \
  --output-dir ./channel_extraction/ICT \
  --format txt \
  --combine \
  --combined-name InnerCircleTrader_all_transcripts \
  --concurrency 5 \
  --retries 5 \
  --resume

  • --backend defaults to auto, which cascades watch → ytdlp → api.
  • --resume skips videos already completed (re-run without re-downloading).
  • --concurrency 5 is a safe default. Bump to 10 for residential IPs; drop to 2 for cloud IPs.

Single video

yttp extract dQw4w9WgXcQ -o ./out
yttp extract "https://youtu.be/dQw4w9WgXcQ" -o ./out

Playlist

yttp extract "https://www.youtube.com/playlist?list=PLxxxx" -o ./out -f srt

All channel playlists → one playlist-organized text file

python extract_playlists.py

The playlist runner writes one consolidated file plus an index, report, and failure checkpoint under channel_extraction/ICT_playlists/. It reuses transcript blocks already present in ../InnerCircleTrader_all_transcripts.txt and caches each newly fetched video under channel_extraction/ICT_playlists/transcripts/ for clean resume behavior.

Batch from a file of URLs/IDs

cat > urls.txt <<EOF
# one URL or ID per line; # for comments
https://www.youtube.com/watch?v=AAAA
dQw4w9WgXcQ
https://youtu.be/XXXX
EOF

yttp extract urls.txt -o ./out --combine
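For reference, the batch-file format above (one URL or ID per line, # for comment lines) can be parsed in a few lines. A sketch; the project's real parser may differ:

```python
def read_batch_file(path):
    """Return non-empty, non-comment lines from a urls.txt-style file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]
```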

🧰 Full CLI reference

yttp extract --help

Key flags:

| Flag | Default | Description |
|---|---|---|
| -o/--output-dir | output/ | Where to write files |
| -f/--format | txt | txt \| json \| srt \| vtt \| md \| csv \| all |
| -C/--combine | off | Combine all transcripts into one file |
| --combined-name | combined | Filename stem for combined output |
| -c/--concurrency | 5 | Parallel fetchers (1–64) |
| -n/--max-videos | unlimited | Cap total videos processed |
| -l/--languages | en,en-US,en-GB | Preferred language list |
| --timestamps | off | Prefix each line with [HH:MM:SS] |
| --allow-generated | on | Fall back to auto-captions |
| --resume | on | Skip already-completed videos |
| --checkpoint | <out>/.yttp-checkpoint.json | Checkpoint location |
| --retries | 4 | Max per-video retries on transient errors |
| -b/--backend | auto | auto \| watch \| ytdlp \| api |
| --player-clients | (built-in) | Override yt-dlp client order |
| --cookies | none | Netscape cookies.txt (age-restricted videos) |
| --user-agent | rotating | Fixed HTTP User-Agent |
| --proxy | none | http://user:pass@host:port |
| --webshare-user/-pass | none | Webshare rotating residential proxy |
| -v/--verbose | off | Debug logging |

Windows users

If you were hitting UnicodeEncodeError: 'charmap' codec can't encode character '\u2717' in v1, that's fixed in v2. The CLI now forces UTF-8 output on Windows.

If you still see it (exotic terminal setup), set:

$env:PYTHONIOENCODING="utf-8"
chcp 65001
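The underlying fix is easy to replicate in your own scripts: Python 3.7+ text streams can be switched to UTF-8 at start-up via reconfigure(). A sketch, not the CLI's exact code:

```python
import sys

def force_utf8(*streams):
    """Switch text streams to UTF-8 where supported (Python 3.7+)."""
    for stream in streams:
        if hasattr(stream, "reconfigure"):   # guard: some redirected streams lack it
            stream.reconfigure(encoding="utf-8", errors="replace")

force_utf8(sys.stdout, sys.stderr)  # call early, before printing any ✓/✗ glyphs
```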

๐Ÿ Python API

import asyncio
from pathlib import Path

from yt_transcript_pro import (
    Config,
    SourceResolver,
    AutoTranscriptExtractor,   # ← new in v2, recommended
    FormatWriter,
)

async def main() -> None:
    cfg = Config(
        concurrency=5,
        output_format="txt",
        combine_into_single_file=True,
        output_dir=Path("out"),
    )
    videos = SourceResolver().resolve(["https://www.youtube.com/@InnerCircleTrader"])
    ext = AutoTranscriptExtractor(cfg)
    results = await ext.fetch_many(videos)

    writer = FormatWriter(cfg)
    for r in results:
        if r.success:
            writer.append_combined(r, "txt", filename="all")

asyncio.run(main())

You can also use any backend directly:

from yt_transcript_pro import YtDlpTranscriptExtractor, WatchPageTranscriptExtractor

watch = WatchPageTranscriptExtractor(cfg)          # scrape /watch HTML
ydl = YtDlpTranscriptExtractor(cfg)                # yt-dlp player API, default client order
ydl_custom = YtDlpTranscriptExtractor(
    cfg,
    player_clients=["android_vr", "tv_simply"],    # custom client order
)

๐Ÿ›ก๏ธ Troubleshooting IP blocks

If you still get blocks (very rare with auto backend):

  1. Wait out the cool-down. YouTube typically unblocks IPs after 1–24 hours. Your progress is saved in .yttp-checkpoint.json; just re-run with --resume and it picks up where it left off.

  2. Drop concurrency to 2 or 1. The adaptive throttler will further slow down automatically, but a lower starting concurrency is gentler on heavily-flagged IPs.

  3. Export your browser cookies (Netscape format) and pass --cookies cookies.txt. Authenticated requests have a much higher rate-limit ceiling.

  4. Use a proxy as a last resort (--proxy or --webshare-user/-pass). Residential proxies work best; datacenter proxies are often pre-blocked.
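The --resume flow in step 1 boils down to a set-difference against the checkpoint. A sketch only: the actual .yttp-checkpoint.json schema is assumed here to hold a "completed" list of video IDs.

```python
import json
import os

def load_completed(checkpoint_path):
    """Read the set of finished video IDs; an absent file means a fresh run."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        return set(json.load(f).get("completed", []))

def pending(all_ids, checkpoint_path):
    """Keep input order, skipping videos already completed."""
    done = load_completed(checkpoint_path)
    return [v for v in all_ids if v not in done]
```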


🧪 Running the tests

pip install -e ".[dev]"
pytest -v          # 130 tests, no network, < 1s

📦 Project layout

src/yt_transcript_pro/
├── __init__.py
├── auto_extractor.py        # ← new: cascade over all backends
├── checkpoint.py
├── cli.py                   # ← updated: --backend flag
├── config.py                # ← updated: cookies_file, user_agent
├── extractor.py             # legacy youtube-transcript-api backend
├── models.py
├── resolver.py              # channel/playlist/video URL resolution
├── watch_extractor.py       # ← new: /watch HTML scraper
├── writers.py               # ← updated: append_combined()
└── ytdlp_extractor.py       # ← new: yt-dlp backend w/ 7-client rotation
tests/                       # 130 unit tests

๐Ÿ“ License

MIT (same as v1). See LICENSE.


๐Ÿ™ Credits

Built on top of the excellent yt-dlp, youtube-transcript-api, typer, rich, and pydantic.
