Production-grade YouTube transcript extractor: single videos, batches, playlists, and entire channels. v2 adds IP-block-resistant multi-backend cascade.

yt-transcript-pro v2.0

Python · License: MIT · Tests: 130 passing

The most advanced, production-grade YouTube transcript extractor. Single videos, multi-video batches, full playlists, or entire channels, with concurrency, retries, checkpointing, six output formats, and (new in v2) four interchangeable extraction backends with automatic cascade fallback that bypass IP blocks without needing proxies.


🚀 What's new in v2.0

The old v1 used youtube-transcript-api exclusively, which hits YouTube's /api/timedtext endpoint directly. That endpoint is the first thing YouTube rate-limits: after ~250 rapid requests the IP gets RequestBlocked / IpBlocked errors for 1–24 hours.

v2 ships four independent backends that use completely different endpoints, and an auto backend that cascades through them per video:

| Backend | Endpoint | Resilience | Speed |
|---------|----------|------------|-------|
| auto (default) | cascades watch → ytdlp → api | ⭐⭐⭐⭐⭐ | fast |
| watch | GET /watch?v=<id> HTML scrape | ⭐⭐⭐⭐ | fastest |
| ytdlp | yt-dlp player API with 7-client rotation | ⭐⭐⭐⭐ | fast |
| api (legacy v1) | youtube-transcript-api | ⭐⭐ | fast |

Why this works without proxies: each backend hits a different YouTube surface with different rate-limit heuristics. When one backend is throttled, the cascade automatically switches to the next, and each backend has its own independent adaptive back-off (bursts of rapid requests trigger per-backend slowdowns, so the other backends keep flowing).

Empirically verified on cloud IPs where youtube-transcript-api gets blocked within seconds: v2 keeps producing transcripts.
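For illustration, the per-video cascade boils down to a simple fallback loop. This is a hypothetical sketch; fetch_with_cascade, BackendBlocked, and the stub backends below are stand-in names, not the package's actual internals.

```python
# Hypothetical sketch of per-video cascade fallback: try each backend in
# order, return the first transcript, raise only if every backend fails.

class BackendBlocked(Exception):
    """Raised when YouTube throttles or blocks a backend."""

def fetch_with_cascade(video_id, backends):
    """Try each backend in order; return the first transcript that works."""
    errors = []
    for backend in backends:              # e.g. [watch, ytdlp, api]
        try:
            return backend(video_id)
        except BackendBlocked as exc:
            errors.append((backend.__name__, str(exc)))
            continue                      # same video, next backend
    raise RuntimeError(f"all backends failed for {video_id}: {errors}")

# Stub backends: the first is "blocked", so the cascade falls through.
def watch(video_id):
    raise BackendBlocked("429 from /watch")

def ytdlp(video_id):
    return f"transcript for {video_id}"

print(fetch_with_cascade("dQw4w9WgXcQ", [watch, ytdlp]))
# → transcript for dQw4w9WgXcQ
```

The key property is that a block on one surface costs one extra request per video, not a failed run.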

Other v2 improvements

  • ๐Ÿ›ก๏ธ Windows Unicode fix โ€“ no more cp1252 crashes on the โœ“/โœ— progress glyphs.
  • ๐Ÿ“ Incremental combined-file writes โ€“ partial runs produce durable output; Ctrl-C doesn't lose data.
  • ๐Ÿ”„ Per-backend adaptive throttling โ€“ cooperative sleeps slow just the backend being throttled, not the whole pool.
  • ๐ŸŽฏ 7-client rotation for ytdlp: android โ†’ android_vr โ†’ tv_simply โ†’ tv_embedded โ†’ mweb โ†’ web โ†’ ios. Each client has its own rate-limit pool and user-agent fingerprint.
  • ๐Ÿช Cookies.txt support for age-restricted / private videos (--cookies cookies.txt).
  • ๐Ÿ•ต๏ธ Rotating modern User-Agents (Chrome, Firefox, Safari, Edge, Android Chrome).
  • ๐Ÿ“Š Non-TTY / nohup logging โ€“ progress is still visible in log files.
  • ๐Ÿงช 130 unit tests, zero network access required to run the suite.
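The per-backend throttling bullet above can be sketched as an exponential back-off keyed by backend name. AdaptiveThrottle here is illustrative only, not the package's actual class:

```python
import time

class AdaptiveThrottle:
    """Illustrative per-backend back-off: each backend keeps its own delay,
    so throttling one backend never slows the others."""

    def __init__(self, base=0.5, factor=2.0, cap=60.0):
        self.base, self.factor, self.cap = base, factor, cap
        self.delay = {}                      # backend name -> current sleep

    def wait(self, backend):
        """Cooperative sleep before a request on this backend."""
        time.sleep(self.delay.get(backend, 0.0))

    def throttled(self, backend):
        """Rate-limit response seen: back off exponentially, up to cap."""
        current = self.delay.get(backend, 0.0)
        self.delay[backend] = min(max(current * self.factor, self.base), self.cap)

    def succeeded(self, backend):
        """Successful request: decay this backend's delay back toward zero."""
        self.delay[backend] = self.delay.get(backend, 0.0) / self.factor

t = AdaptiveThrottle()
t.throttled("watch")      # watch backend backs off...
t.throttled("watch")
print(t.delay)            # {'watch': 1.0} -- ytdlp/api remain at full speed
```

Because the delay table is keyed per backend, a throttled watch scraper sleeps while the ytdlp and api backends continue at full speed.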

⚡ Install

# Clone / unzip and install
cd yt-transcript-pro
pip install -e ".[dev]"

Requires Python ≥ 3.9. Dependencies: yt-dlp, youtube-transcript-api, pydantic, tenacity, typer, rich.


📖 Quickstart

Entire channel → one combined text file

# This is the exact command that extracts all 665 InnerCircleTrader videos:
yttp extract "https://www.youtube.com/@InnerCircleTrader" \
  --output-dir ./channel_extraction/ICT \
  --format txt \
  --combine \
  --combined-name InnerCircleTrader_all_transcripts \
  --concurrency 5 \
  --retries 5 \
  --resume

  • --backend defaults to auto, which cascades watch → ytdlp → api.
  • --resume skips videos already completed (re-run without re-downloading).
  • --concurrency 5 is a safe default. Bump to 10 for residential IPs; drop to 2 for cloud IPs.

Single video

yttp extract dQw4w9WgXcQ -o ./out
yttp extract "https://youtu.be/dQw4w9WgXcQ" -o ./out

Playlist

yttp extract "https://www.youtube.com/playlist?list=PLxxxx" -o ./out -f srt

All channel playlists → one playlist-organized text file

python extract_playlists.py

The playlist runner writes one consolidated file plus an index, report, and failure checkpoint under channel_extraction/ICT_playlists/. It reuses transcript blocks already present in ../InnerCircleTrader_all_transcripts.txt and caches each newly fetched video under channel_extraction/ICT_playlists/transcripts/ for clean resume behavior.

Batch from a file of URLs/IDs

cat > urls.txt <<EOF
# one URL or ID per line; # for comments
https://www.youtube.com/watch?v=AAAA
dQw4w9WgXcQ
https://youtu.be/XXXX
EOF

yttp extract urls.txt -o ./out --combine
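Parsing that batch-file format (one URL or ID per line, blank lines and # comments ignored) is straightforward. parse_sources is a hypothetical helper shown for illustration, not part of the package's API:

```python
def parse_sources(path):
    """Read URLs/video IDs from a batch file, skipping blank lines
    and lines starting with '#'. Illustrative helper only."""
    sources = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                sources.append(line)
    return sources
```

Each surviving line is then resolved to a video ID, exactly as a URL passed on the command line would be.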

🧰 Full CLI reference

yttp extract --help

Key flags:

| Flag | Default | Description |
|------|---------|-------------|
| -o/--output-dir | output/ | Where to write files |
| -f/--format | txt | txt, json, srt, vtt, md, csv, or all |
| -C/--combine | off | Combine all transcripts into one file |
| --combined-name | combined | Filename stem for combined output |
| -c/--concurrency | 5 | Parallel fetchers (1-64) |
| -n/--max-videos | unlimited | Cap total videos processed |
| -l/--languages | en,en-US,en-GB | Preferred language list |
| --timestamps | off | Prefix each line with [HH:MM:SS] |
| --allow-generated | on | Fall back to auto-captions |
| --resume | on | Skip already-completed videos |
| --checkpoint | <out>/.yttp-checkpoint.json | Checkpoint location |
| --retries | 4 | Max per-video retries on transient errors |
| -b/--backend | auto | auto, watch, ytdlp, or api |
| --player-clients | (built-in) | Override yt-dlp client order |
| --cookies | none | Netscape cookies.txt (age-restricted videos) |
| --user-agent | rotating | Fixed HTTP User-Agent |
| --proxy | none | http://user:pass@host:port |
| --webshare-user/-pass | none | Webshare rotating residential proxy |
| -v/--verbose | off | Debug logging |
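For reference, turning a caption's start offset in seconds into the [HH:MM:SS] prefix used by --timestamps is a one-line computation. hms_prefix is a hypothetical helper illustrating the format, not the package's code:

```python
def hms_prefix(seconds: float) -> str:
    """Format a caption start offset as an [HH:MM:SS] prefix."""
    s = int(seconds)
    return f"[{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}]"

print(hms_prefix(3725.4))   # [01:02:05]
```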

Windows users

If you were hitting UnicodeEncodeError: 'charmap' codec can't encode character '\u2717' in v1, that's fixed in v2. The CLI now forces UTF-8 output on Windows.

If you still see it (exotic terminal setup), set:

$env:PYTHONIOENCODING="utf-8"
chcp 65001
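Under the hood, forcing UTF-8 on Python ≥ 3.7 amounts to reconfiguring the standard text streams. This sketch shows the general technique; the CLI's actual startup code may differ:

```python
import sys

# Force UTF-8 on consoles that default to cp1252, so glyphs like the
# check/cross progress marks encode cleanly. reconfigure() exists on
# standard text streams since Python 3.7; streams that lack it (e.g.
# replaced stdout in tests) are skipped.
for stream in (sys.stdout, sys.stderr):
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8", errors="replace")

print("\u2713 done \u2717 failed")   # safe even on a cp1252 console
```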

๐Ÿ Python API

import asyncio
from pathlib import Path

from yt_transcript_pro import (
    Config,
    SourceResolver,
    AutoTranscriptExtractor,   # ← new in v2, recommended
    FormatWriter,
)

async def main() -> None:
    cfg = Config(
        concurrency=5,
        output_format="txt",
        combine_into_single_file=True,
        output_dir=Path("out"),
    )
    videos = SourceResolver().resolve(["https://www.youtube.com/@InnerCircleTrader"])
    ext = AutoTranscriptExtractor(cfg)
    results = await ext.fetch_many(videos)

    writer = FormatWriter(cfg)
    for r in results:
        if r.success:
            writer.append_combined(r, "txt", filename="all")

asyncio.run(main())

You can also use any backend directly:

from yt_transcript_pro import YtDlpTranscriptExtractor, WatchPageTranscriptExtractor

watch = WatchPageTranscriptExtractor(cfg)          # scrape /watch HTML
ydl   = YtDlpTranscriptExtractor(cfg)              # yt-dlp player API

# Or with a custom client order:
ydl = YtDlpTranscriptExtractor(
    cfg,
    player_clients=["android_vr", "tv_simply"],
)

๐Ÿ›ก๏ธ Troubleshooting IP blocks

If you still get blocks (very rare with auto backend):

  1. Wait out the cool-down. YouTube typically unblocks IPs after 1–24 hours. Your progress is saved in .yttp-checkpoint.json; just re-run with --resume and it picks up where it left off.

  2. Drop concurrency to 2 or 1. The adaptive throttler will further slow down automatically, but a lower starting concurrency is gentler on heavily-flagged IPs.

  3. Export your browser cookies (Netscape format) and pass --cookies cookies.txt. Authenticated requests have a much higher rate-limit ceiling.

  4. Use a proxy as a last resort (--proxy or --webshare-user/-pass). Residential proxies work best; datacenter proxies are often pre-blocked.
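Conceptually, the resume support in step 1 is just a persisted set of completed video IDs. This sketch uses a hypothetical JSON shape, which may differ from the actual .yttp-checkpoint.json schema:

```python
import json
from pathlib import Path

# Hypothetical checkpoint helpers illustrating the resume idea.

def load_done(path: Path) -> set:
    """Return the set of completed video IDs (empty if no checkpoint yet)."""
    if path.exists():
        return set(json.loads(path.read_text())["completed"])
    return set()

def mark_done(path: Path, done: set, video_id: str) -> None:
    """Persist a finished video so --resume can skip it on the next run."""
    done.add(video_id)
    path.write_text(json.dumps({"completed": sorted(done)}))

# Resume: anything already in the checkpoint is filtered out up front.
ckpt = Path("out/.yttp-checkpoint.json")
done = load_done(ckpt)
todo = [v for v in ["AAAA", "dQw4w9WgXcQ"] if v not in done]
```

Writing the checkpoint after every completed video (rather than at exit) is what makes Ctrl-C and IP-block interruptions cheap to recover from.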


🧪 Running the tests

pip install -e ".[dev]"
pytest -v          # 130 tests, no network, < 1s

📦 Project layout

src/yt_transcript_pro/
├── __init__.py
├── auto_extractor.py        # ← new: cascade over all backends
├── checkpoint.py
├── cli.py                   # ← updated: --backend flag
├── config.py                # ← updated: cookies_file, user_agent
├── extractor.py             # legacy youtube-transcript-api backend
├── models.py
├── resolver.py              # channel/playlist/video URL resolution
├── watch_extractor.py       # ← new: /watch HTML scraper
├── writers.py               # ← updated: append_combined()
└── ytdlp_extractor.py       # ← new: yt-dlp backend w/ 7-client rotation
tests/                       # 130 unit tests

๐Ÿ“ License

MIT (same as v1). See LICENSE.


๐Ÿ™ Credits

Built on top of the excellent yt-dlp, youtube-transcript-api, typer, rich, and pydantic.
