Production-grade YouTube transcript extractor: single videos, batches, playlists, and entire channels. v2 adds IP-block-resistant multi-backend cascade.
yt-transcript-pro v2.0
The most advanced, production-grade YouTube transcript extractor. Single videos, multi-video batches, full playlists, or entire channels, with concurrency, retries, checkpointing, six output formats, and (new in v2) four interchangeable extraction backends with automatic cascade fallback that bypass IP blocks without needing proxies.
What's new in v2.0
The old v1 used youtube-transcript-api exclusively, which hits
YouTube's /api/timedtext endpoint directly. That endpoint is the first
thing YouTube rate-limits: after ~250 rapid requests the IP gets
RequestBlocked / IpBlocked errors for 1–24 hours.
v2 ships three independent backends that use completely different
endpoints, plus an auto backend that cascades through them per
video:
| Backend | Endpoint | Resilience | Speed |
|---|---|---|---|
| auto (default) | cascades watch → ytdlp → api | ⭐⭐⭐⭐⭐ | fast |
| watch | GET /watch?v=<id> HTML scrape | ⭐⭐⭐⭐ | fastest |
| ytdlp | yt-dlp player API with 7-client rotation | ⭐⭐⭐⭐ | fast |
| api (legacy v1) | youtube-transcript-api | ⭐⭐ | fast |
Why this works without proxies: each backend hits a different YouTube surface with different rate-limit heuristics. When one backend is throttled, the cascade automatically switches to the next, and each one has its own independent adaptive back-off (bursts of rapid requests trigger per-backend slowdowns so other backends keep flowing).
Empirically verified on cloud IPs where youtube-transcript-api gets
blocked within seconds: v2 keeps producing transcripts.
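The cascade idea can be sketched in a few lines. This is a minimal illustration and not the package's actual implementation: BackendBlocked, Backend, and fetch_with_cascade are hypothetical names, and the fake backends stand in for the real watch, ytdlp, and api extractors so the example runs without network access.

```python
from typing import Callable, List, Optional, Tuple


class BackendBlocked(Exception):
    """Hypothetical error raised when a backend is throttled or blocked."""


# A backend is modelled as a function from video ID to transcript text.
Backend = Callable[[str], str]


def fetch_with_cascade(
    video_id: str,
    backends: List[Tuple[str, Backend]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each backend in order and return (backend_name, transcript).

    Returns (None, None) when every backend is blocked.
    """
    for name, fetch in backends:
        try:
            return name, fetch(video_id)
        except BackendBlocked:
            continue  # this surface is throttled, fall through to the next
    return None, None


# Stand-in backends for illustration only.
def fake_watch(video_id: str) -> str:
    raise BackendBlocked("watch page throttled")


def fake_ytdlp(video_id: str) -> str:
    return f"transcript for {video_id}"


def fake_api(video_id: str) -> str:
    return f"legacy transcript for {video_id}"


CASCADE = [("watch", fake_watch), ("ytdlp", fake_ytdlp), ("api", fake_api)]
used, text = fetch_with_cascade("dQw4w9WgXcQ", CASCADE)
print(used, "->", text)
```

Because the order is just a list, a different cascade (or a single pinned backend) is a matter of passing a different list.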
Other v2 improvements
- Windows Unicode fix: no more cp1252 crashes on the ✓/✗ progress glyphs.
- Incremental combined-file writes: partial runs produce durable output; Ctrl-C doesn't lose data.
- Per-backend adaptive throttling: cooperative sleeps slow just the backend being throttled, not the whole pool.
- 7-client rotation for ytdlp: android → android_vr → tv_simply → tv_embedded → mweb → web → ios. Each client has its own rate-limit pool and user-agent fingerprint.
- Cookies.txt support for age-restricted / private videos (--cookies cookies.txt).
- Rotating modern User-Agents (Chrome, Firefox, Safari, Edge, Android Chrome).
- Non-TTY / nohup logging: progress is still visible in log files.
- 130 unit tests, zero network access required to run the suite.
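To make the per-backend throttling bullet concrete, here is a small sketch of one way such a throttle could work. It is an assumption for illustration, not the package's real class: AdaptiveThrottle, its method names, and the base, factor, and ceiling values are all hypothetical.

```python
from collections import defaultdict


class AdaptiveThrottle:
    """Hypothetical per-backend delay tracker (not the package's real class)."""

    def __init__(self, base: float = 0.5, factor: float = 2.0, ceiling: float = 60.0):
        self.base = base
        self.factor = factor
        self.ceiling = ceiling
        # Each backend name gets its own delay, starting at zero.
        self._delay = defaultdict(float)

    def on_throttled(self, backend: str) -> None:
        """Grow the delay for this backend only, capped at the ceiling."""
        current = self._delay[backend]
        grown = self.base if current == 0 else current * self.factor
        self._delay[backend] = min(grown, self.ceiling)

    def on_success(self, backend: str) -> None:
        """Shrink the delay after a success; snap to zero below the base."""
        reduced = self._delay[backend] / self.factor
        self._delay[backend] = reduced if reduced >= self.base else 0.0

    def delay_for(self, backend: str) -> float:
        """Seconds a worker should sleep before its next request."""
        return self._delay[backend]


throttle = AdaptiveThrottle()
throttle.on_throttled("api")
throttle.on_throttled("api")
print(throttle.delay_for("api"))    # delay grew for the throttled backend
print(throttle.delay_for("watch"))  # other backends are unaffected
```

A worker would sleep for delay_for(name) seconds (for example with asyncio.sleep) before each request to that backend, so a throttled backend slows down while the rest of the pool keeps its pace.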
Install
# Clone / unzip and install
cd yt-transcript-pro
pip install -e ".[dev]"
Requires Python ≥ 3.9. Dependencies: yt-dlp, youtube-transcript-api, pydantic, tenacity, typer, rich.
Quickstart
Entire channel → one combined text file
# This is the exact command that extracts all 665 InnerCircleTrader videos:
yttp extract "https://www.youtube.com/@InnerCircleTrader" \
--output-dir ./channel_extraction/ICT \
--format txt \
--combine \
--combined-name InnerCircleTrader_all_transcripts \
--concurrency 5 \
--retries 5 \
--resume
- --backend defaults to auto, which cascades watch → ytdlp → api.
- --resume skips videos already completed (re-run without re-downloading).
- --concurrency 5 is a safe default. Bump to 10 for residential IPs; drop to 2 for cloud IPs.
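The resume behaviour can be pictured as a set of completed video IDs that is consulted before each fetch. The sketch below assumes a JSON file with a single "completed" list; the real .yttp-checkpoint.json schema is not documented here, and load_completed, mark_completed, and pending are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def load_completed(checkpoint: Path) -> Set[str]:
    """Read completed video IDs; a missing file means nothing is done yet."""
    if not checkpoint.exists():
        return set()
    data = json.loads(checkpoint.read_text(encoding="utf-8"))
    return set(data.get("completed", []))  # "completed" key is an assumed schema


def mark_completed(checkpoint: Path, video_id: str) -> None:
    """Add one ID and rewrite the file via a temp file plus rename."""
    done = load_completed(checkpoint)
    done.add(video_id)
    tmp = checkpoint.with_name(checkpoint.name + ".tmp")
    tmp.write_text(json.dumps({"completed": sorted(done)}), encoding="utf-8")
    tmp.replace(checkpoint)  # rename so readers never see a half-written file


def pending(video_ids: Iterable[str], checkpoint: Path) -> List[str]:
    """Return IDs that still need fetching, preserving input order."""
    done = load_completed(checkpoint)
    return [v for v in video_ids if v not in done]


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = Path(tmpdir) / ".yttp-checkpoint.json"
    mark_completed(ckpt, "AAAA")
    remaining = pending(["AAAA", "BBBB", "CCCC"], ckpt)
    print(remaining)
```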
Single video
yttp extract dQw4w9WgXcQ -o ./out
yttp extract "https://youtu.be/dQw4w9WgXcQ" -o ./out
Playlist
yttp extract "https://www.youtube.com/playlist?list=PLxxxx" -o ./out -f srt
All channel playlists -> one playlist-organized text file
python extract_playlists.py
The playlist runner writes one consolidated file plus an index, report,
and failure checkpoint under channel_extraction/ICT_playlists/. It
reuses transcript blocks already present in
../InnerCircleTrader_all_transcripts.txt and caches each newly fetched
video under channel_extraction/ICT_playlists/transcripts/ for clean
resume behavior.
Batch from a file of URLs/IDs
cat > urls.txt <<EOF
# one URL or ID per line; # for comments
https://www.youtube.com/watch?v=AAAA
dQw4w9WgXcQ
https://youtu.be/XXXX
EOF
yttp extract urls.txt -o ./out --combine
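For readers who want to pre-validate such a file, the following sketch parses the same line format (one URL or ID per line, lines starting with # ignored). It is an assumption about how the input could be read, not the package's resolver: parse_line and video_id_of are hypothetical helpers, only whole-line comments are handled, and channel or playlist URLs simply return None here.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# YouTube video IDs are 11 characters from this alphabet.
VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")


def parse_line(line: str) -> Optional[str]:
    """Return the entry on this line, or None for blanks and # comments."""
    entry = line.strip()
    if not entry or entry.startswith("#"):
        return None
    return entry


def video_id_of(entry: str) -> Optional[str]:
    """Best-effort video ID for a bare ID, a youtu.be link, or a /watch URL."""
    if VIDEO_ID.match(entry):
        return entry
    url = urlparse(entry)
    if url.netloc.endswith("youtu.be"):
        return url.path.lstrip("/") or None
    if url.path == "/watch":
        return parse_qs(url.query).get("v", [None])[0]
    return None  # channel and playlist URLs are out of scope for this sketch


SAMPLE = """\
# one URL or ID per line; # for comments
https://www.youtube.com/watch?v=dQw4w9WgXcQ

dQw4w9WgXcQ
https://youtu.be/dQw4w9WgXcQ
"""

entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e is not None]
ids = [video_id_of(e) for e in entries]
print(ids)
```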
Full CLI reference
yttp extract --help
Key flags:
| Flag | Default | Description |
|---|---|---|
| -o/--output-dir | output/ | Where to write files |
| -f/--format | txt | One of txt, json, srt, vtt, md, csv, all |
| -C/--combine | off | Combine all transcripts into one file |
| --combined-name | combined | Filename stem for combined output |
| -c/--concurrency | 5 | Parallel fetchers (1-64) |
| -n/--max-videos | unlimited | Cap total videos processed |
| -l/--languages | en,en-US,en-GB | Preferred language list |
| --timestamps | off | Prefix each line with [HH:MM:SS] |
| --allow-generated | on | Fall back to auto-captions |
| --resume | on | Skip already-completed videos |
| --checkpoint | <out>/.yttp-checkpoint.json | Checkpoint location |
| --retries | 4 | Max per-video retries on transient errors |
| -b/--backend | auto | One of auto, watch, ytdlp, api |
| --player-clients | (built-in) | Override yt-dlp client order |
| --cookies | none | Netscape cookies.txt (age-restricted videos) |
| --user-agent | rotating | Fixed HTTP User-Agent |
| --proxy | none | http://user:pass@host:port |
| --webshare-user/-pass | none | Webshare rotating residential proxy |
| -v/--verbose | off | Debug logging |
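As an illustration of what the --timestamps prefix looks like, the sketch below formats segment start times as [HH:MM:SS]. The (start_seconds, text) segment shape, the function names, and the choice to truncate rather than round fractional seconds are assumptions, not the package's documented behaviour.

```python
from typing import Iterable, List, Tuple


def format_timestamp(seconds: float) -> str:
    """Render a start offset in seconds as [HH:MM:SS], truncating fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def with_timestamps(segments: Iterable[Tuple[float, str]]) -> List[str]:
    """Prefix each transcript line with its formatted start time."""
    return [f"{format_timestamp(start)} {text}" for start, text in segments]


# Hypothetical transcript segments: (start offset in seconds, caption text).
segments = [(0.0, "Hello"), (75.4, "and welcome"), (3725.9, "goodbye")]
lines = with_timestamps(segments)
for line in lines:
    print(line)
```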
Windows users
If you were hitting UnicodeEncodeError: 'charmap' codec can't encode character '\u2717' in v1, that's fixed in v2. The CLI now forces
UTF-8 output on Windows.
If you still see it (exotic terminal setup), set:
$env:PYTHONIOENCODING="utf-8"
chcp 65001
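For scripts that call the library directly instead of the CLI, one common way to get the same effect is to reconfigure the standard streams at startup. This is a sketch of that general technique, not necessarily how the CLI does it; force_utf8 is a hypothetical helper, and the in-memory stream only simulates a cp1252 console.

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    reconfigure = getattr(stream, "reconfigure", None)
    if callable(reconfigure):
        # Available on io.TextIOWrapper since Python 3.7.
        reconfigure(encoding="utf-8", errors="replace")
    return getattr(stream, "encoding", None)


# Typical use at program start on Windows.
if sys.platform == "win32":
    force_utf8(sys.stdout)
    force_utf8(sys.stderr)

# Demonstrate on an in-memory stream that starts out as cp1252,
# which cannot encode the check and cross glyphs.
buffer = io.BytesIO()
demo = io.TextIOWrapper(buffer, encoding="cp1252")
encoding = force_utf8(demo)
demo.write("\u2713 done, \u2717 failed\n")
demo.flush()
print(encoding, buffer.getvalue())
```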
Python API
import asyncio
from pathlib import Path

from yt_transcript_pro import (
    Config,
    SourceResolver,
    AutoTranscriptExtractor,  # new in v2, recommended
    FormatWriter,
)


async def main() -> None:
    cfg = Config(
        concurrency=5,
        output_format="txt",
        combine_into_single_file=True,
        output_dir=Path("out"),
    )
    # Resolve the channel URL into individual videos.
    videos = SourceResolver().resolve(["https://www.youtube.com/@InnerCircleTrader"])
    # Fetch transcripts through the cascading auto backend.
    ext = AutoTranscriptExtractor(cfg)
    results = await ext.fetch_many(videos)
    # Append each successful transcript to one combined file.
    writer = FormatWriter(cfg)
    for r in results:
        if r.success:
            writer.append_combined(r, "txt", filename="all")


asyncio.run(main())
You can also use any backend directly:
from yt_transcript_pro import YtDlpTranscriptExtractor, WatchPageTranscriptExtractor

# cfg is a Config instance, built as in the example above.
watch = WatchPageTranscriptExtractor(cfg)  # scrape /watch HTML
ydl = YtDlpTranscriptExtractor(cfg)  # yt-dlp player API
ydl = YtDlpTranscriptExtractor(
    cfg,
    player_clients=["android_vr", "tv_simply"],  # custom client order
)
Troubleshooting IP blocks
If you still get blocks (very rare with the auto backend):
- Wait out the cool-down. YouTube typically unblocks IPs after 1–24 hours. Your progress is saved in .yttp-checkpoint.json; just re-run with --resume and it picks up where it left off.
- Drop concurrency to 2 or 1. The adaptive throttler will further slow down automatically, but a lower starting concurrency is gentler on heavily-flagged IPs.
- Export your browser cookies (Netscape format) and pass --cookies cookies.txt. Authenticated requests have a much higher rate-limit ceiling.
- Use a proxy as a last resort (--proxy or --webshare-user/-pass). Residential proxies work best; datacenter proxies are often pre-blocked.
Running the tests
pip install -e ".[dev]"
pytest -v # 130 tests, no network, < 1s
Project layout
src/yt_transcript_pro/
├── __init__.py
├── auto_extractor.py      # ← new: cascade over all backends
├── checkpoint.py
├── cli.py                 # ← updated: --backend flag
├── config.py              # ← updated: cookies_file, user_agent
├── extractor.py           # legacy youtube-transcript-api backend
├── models.py
├── resolver.py            # channel/playlist/video URL resolution
├── watch_extractor.py     # ← new: /watch HTML scraper
├── writers.py             # ← updated: append_combined()
└── ytdlp_extractor.py     # ← new: yt-dlp backend w/ 7-client rotation
tests/                     # 130 unit tests
License
MIT (same as v1). See LICENSE.
Credits
Built on top of the excellent yt-dlp, youtube-transcript-api, typer, rich, and pydantic.