Extract transcripts from YouTube videos with multi-language support and various output formats
yt-ts-extract
A robust Python library and CLI tool for extracting YouTube video transcripts with multi-language support and proxy rotation capabilities.
✨ Key Features
- Extract transcripts from YouTube videos via video ID
- 26+ language support (English, Spanish, French, German, Japanese, Arabic, Chinese, etc.)
- Multiple output formats: plain text, SRT subtitles, timestamped segments, JSON
- Batch processing for multiple videos
- Anti-bot protection: Android client implementation bypasses detection
- Proxy rotation: Multiple proxy support with automatic rotation strategies
- Both CLI and Python library interfaces
🚀 Installation
# Install from PyPI
pip install yt-ts-extract
# Or install in development mode
git clone https://github.com/sinjab/yt-ts-extract.git
cd yt-ts-extract
pip install -e .
📖 Quick Start
Command Line Interface
# Basic transcript extraction
yt-transcript fR9ClX0egTc
# Export as SRT subtitles
yt-transcript -f srt -o video.srt fR9ClX0egTc
# List available languages
yt-transcript --list-languages fR9ClX0egTc
# Batch process multiple videos
yt-transcript --batch ids.txt --output-dir ./transcripts/
# Get help
yt-transcript --help
Python Library
from yt_ts_extract import (
    get_transcript,
    get_transcript_text,
    get_available_languages,
    YouTubeTranscriptExtractor,
)
from yt_ts_extract.utils import export_to_srt, get_transcript_stats
# Quick transcript extraction
transcript = get_transcript("fR9ClX0egTc")
print(f"Segments: {len(transcript)}")
# Export to SRT
srt_text = export_to_srt(transcript)
with open("video.srt", "w", encoding="utf-8") as f:
    f.write(srt_text)
# Plain text and languages
text = get_transcript_text("fR9ClX0egTc")
langs = get_available_languages("fR9ClX0egTc")
print(f"Languages available: {[l['code'] for l in langs]}")
# Using the class directly
extractor = YouTubeTranscriptExtractor(
    timeout=20,
    max_retries=5,
    backoff_factor=1.0,
    min_delay=1.5,
)
segments = extractor.get_transcript("fR9ClX0egTc", language="en")
stats = get_transcript_stats(segments)
print(stats)
🎛️ CLI Options
yt-transcript [OPTIONS] VIDEO_ID
Options:
-f, --format [text|srt|segments|stats] Output format (default: text)
-o, --output PATH Save output to file
-l, --language TEXT Language code (e.g., 'en', 'es', 'fr')
--list-languages Show available languages for video
--batch PATH Process video IDs from file (one per line)
--output-dir PATH Directory for batch output files
--search TEXT Search for specific text in transcript
--examples Show usage examples
--timeout FLOAT Per-request timeout in seconds (default: 30)
--retries INT Max HTTP retries on failure (default: 3)
--backoff FLOAT Exponential backoff factor (default: 0.75)
--min-delay FLOAT Minimum delay between requests (default: 2)
--proxy TEXT Proxy URL (e.g., "http://user:pass@host:port")
--proxy-list PATH Proxy list file for rotation
--rotation-strategy [random|round_robin|least_used] Proxy rotation strategy (default: random)
--health-check Perform health check on all proxies before starting
--help Show this message and exit
Network Tuning Examples
# Increase retries and timeout
yt-transcript fR9ClX0egTc --retries 5 --timeout 45
# Reduce delay for faster runs (use responsibly)
yt-transcript fR9ClX0egTc --min-delay 1.0 --backoff 0.5
# Single proxy
yt-transcript fR9ClX0egTc --proxy "http://user:pass@host:port"
# Proxy rotation with health check
yt-transcript fR9ClX0egTc --proxy-list proxies.txt --health-check
# Batch processing with proxy rotation
yt-transcript --batch ids.txt --proxy-list proxies.txt --output-dir transcripts/
🔄 Proxy Support
Single Proxy
# HTTP proxy with authentication
yt-transcript --proxy "http://username:password@proxy-host:8080" fR9ClX0egTc
# HTTPS proxy
yt-transcript --proxy "https://proxy-host:8443" fR9ClX0egTc
# SOCKS5 proxy
yt-transcript --proxy "socks5://user:pass@proxy-host:1080" fR9ClX0egTc
Proxy Rotation
Load multiple proxies from a file and automatically rotate between them:
# Basic proxy rotation
yt-transcript --proxy-list proxies.txt fR9ClX0egTc
# With rotation strategy
yt-transcript --proxy-list proxies.txt --rotation-strategy round_robin fR9ClX0egTc
# With health check
yt-transcript --proxy-list proxies.txt --health-check fR9ClX0egTc
Proxy List File Format (proxies.txt):
Address Port Username Password
23.95.150.145 6114 mhzbhrwb yj2veiaafrbu
198.23.239.134 6540 mhzbhrwb yj2veiaafrbu
45.38.107.97 6014 mhzbhrwb yj2veiaafrbu
64.137.96.74 6641 mhzbhrwb yj2veiaafrbu
216.10.27.159 6837 mhzbhrwb yj2veiaafrbu
136.0.207.84 6661 mhzbhrwb yj2veiaafrbu
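As a rough illustration of how a file in this layout could be turned into proxy URLs, the sketch below parses the whitespace-separated columns. The function name `parse_proxy_lines`, the default `http` scheme, and the rule for skipping the header row are assumptions made for this example; `ProxyManager.from_file` may handle the file differently.

```python
from urllib.parse import quote


def parse_proxy_lines(lines, scheme="http"):
    """Turn 'address port username password' rows into proxy URLs (hypothetical helper)."""
    urls = []
    for raw in lines:
        parts = raw.split()
        # Skip blank lines and the 'Address Port Username Password' header row.
        if len(parts) != 4 or not parts[1].isdigit():
            continue
        host, port, user, password = parts
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        user_q = quote(user, safe="")
        password_q = quote(password, safe="")
        urls.append(f"{scheme}://{user_q}:{password_q}@{host}:{port}")
    return urls


# Placeholder rows using documentation-only IP addresses.
sample = [
    "Address Port Username Password",
    "203.0.113.10 8080 alice s3cret",
    "203.0.113.11 8081 bob p@ss",
]
for url in parse_proxy_lines(sample):
    print(url)
```

Each resulting URL has the same shape as the value accepted by `--proxy`.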
Rotation Strategies:
- random: Random proxy selection (default)
- round_robin: Cycle through proxies in order
- least_used: Select least recently used proxy
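To make the three strategies concrete, here is a minimal sketch of how such a picker might work. `RotationSketch` is a hypothetical class written for this example and is not the library's `ProxyManager`. It also assumes that `least_used` means "fewest uses so far"; the library may track recency instead.

```python
import random
from collections import Counter
from itertools import count


class RotationSketch:
    """Illustrative proxy picker; not the library's ProxyManager."""

    def __init__(self, proxies, strategy="random", seed=None):
        if not proxies:
            raise ValueError("at least one proxy is required")
        if strategy not in {"random", "round_robin", "least_used"}:
            raise ValueError(f"unknown strategy: {strategy}")
        self.proxies = list(proxies)
        self.strategy = strategy
        self.uses = Counter({p: 0 for p in self.proxies})
        self._rng = random.Random(seed)
        self._ticks = count()

    def next_proxy(self):
        if self.strategy == "random":
            choice = self._rng.choice(self.proxies)
        elif self.strategy == "round_robin":
            # Walk the list in order and wrap around at the end.
            choice = self.proxies[next(self._ticks) % len(self.proxies)]
        else:
            # least_used (assumed meaning): fewest uses so far, ties broken by list order.
            choice = min(self.proxies, key=lambda p: self.uses[p])
        self.uses[choice] += 1
        return choice


picker = RotationSketch(["proxy-a", "proxy-b", "proxy-c"], strategy="round_robin")
print([picker.next_proxy() for _ in range(5)])
```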
Python Proxy Usage
from yt_ts_extract import YouTubeTranscriptExtractor, ProxyManager
# Single proxy
extractor = YouTubeTranscriptExtractor(
    proxy="http://user:pass@host:port",
    timeout=30,
    max_retries=3,
)
# Proxy rotation
proxy_manager = ProxyManager.from_file("proxies.txt", rotation_strategy="round_robin")
extractor = YouTubeTranscriptExtractor(
    proxy_manager=proxy_manager,
    timeout=30,
    max_retries=3,
)
# Convenience functions with proxy rotation
from yt_ts_extract import get_transcript_with_proxy_rotation
transcript = get_transcript_with_proxy_rotation("fR9ClX0egTc", "proxies.txt")
Proxy Best Practices:
- Use --health-check to verify proxy connectivity before processing
- Failed proxies are automatically deactivated and reactivated after cooldown
- Each proxy respects minimum delay between requests
- Monitor proxy health with extractor.get_proxy_stats()
📊 Output Formats
1. Plain Text (text)
Hello everyone and welcome to this tutorial.
In this video we'll be covering the basics of...
2. SRT Subtitles (srt)
1
00:00:00,000 --> 00:00:03,200
Hello everyone and welcome to this tutorial.
2
00:00:03,200 --> 00:00:07,840
In this video we'll be covering the basics of...
3. Timestamped Segments (segments)
[
{
"start": 0.0,
"end": 3.2,
"duration": 3.2,
"text": "Hello everyone and welcome to this tutorial."
},
{
"start": 3.2,
"end": 7.84,
"duration": 4.64,
"text": "In this video we'll be covering the basics of..."
}
]
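The segment and SRT formats carry the same information, so one can be derived from the other. The sketch below shows one way to do that conversion, assuming segments shaped like the example above. `srt_timestamp` and `segments_to_srt` are hypothetical helpers for illustration and are not the library's `export_to_srt`.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp form used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segment dicts as numbered SRT cues (hypothetical helper)."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text']}")
    # Cues are separated by a blank line.
    return "\n\n".join(cues) + "\n"


example_segments = [
    {"start": 0.0, "end": 3.2, "duration": 3.2,
     "text": "Hello everyone and welcome to this tutorial."},
    {"start": 3.2, "end": 7.84, "duration": 4.64,
     "text": "In this video we'll be covering the basics of..."},
]
print(segments_to_srt(example_segments))
```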
4. Statistics (stats)
{
"total_segments": 245,
"total_duration": 1823.4,
"word_count": 2156,
"average_words_per_segment": 8.8,
"languages_available": ["en", "es", "fr", "de"]
}
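For readers curious how numbers like these could be derived from a list of segments, here is a small sketch. `summarize_segments` is a hypothetical helper, not the library's `get_transcript_stats`. It assumes that the total duration is the end time of the last segment and that words are split on whitespace.

```python
def summarize_segments(segments):
    """Compute summary numbers for a list of segment dicts (hypothetical helper)."""
    total = len(segments)
    # Count whitespace-separated tokens as words.
    word_count = sum(len(seg["text"].split()) for seg in segments)
    # Assumption: the transcript's duration is the latest segment end time.
    total_duration = max((seg["end"] for seg in segments), default=0.0)
    return {
        "total_segments": total,
        "total_duration": total_duration,
        "word_count": word_count,
        "average_words_per_segment": round(word_count / total, 1) if total else 0.0,
    }


stats_example = summarize_segments([
    {"start": 0.0, "end": 3.2, "duration": 3.2,
     "text": "Hello everyone and welcome to this tutorial."},
    {"start": 3.2, "end": 7.84, "duration": 4.64,
     "text": "In this video we'll be covering the basics of..."},
])
print(stats_example)
```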
🌍 Language Support
Supports 26+ languages with automatic detection:
| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Spanish | es |
| French | fr | German | de |
| Italian | it | Portuguese | pt |
| Russian | ru | Japanese | ja |
| Korean | ko | Chinese (Simplified) | zh-Hans |
| Chinese (Traditional) | zh-Hant | Arabic | ar |
| Hindi | hi | Dutch | nl |
| Polish | pl | Turkish | tr |
Use --list-languages to see available languages for any video.
🔧 Advanced Usage
Batch Processing
Create an ids.txt file (one video ID per line):
fR9ClX0egTc
9bZkp7q19f0
wIwCTQZ_xFE
Process all videos:
yt-transcript --batch ids.txt --format srt --output-dir ./subtitles/
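A similar batch loop can be written in Python. The sketch below reads IDs from a file and writes one output file per video. `read_video_ids` and `run_batch` are hypothetical helpers, and the `fetch` argument stands in for whatever function produces the text, for example `get_transcript_text` from the Quick Start. The demo uses a stub so it runs without network access.

```python
import tempfile
from pathlib import Path


def read_video_ids(path):
    """Read one video ID per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def run_batch(ids_path, output_dir, fetch, suffix=".txt"):
    """Fetch each ID and write the result to <output_dir>/<id><suffix> (hypothetical helper)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for video_id in read_video_ids(ids_path):
        try:
            text = fetch(video_id)
        except Exception as exc:
            # Record the failure and continue with the remaining IDs.
            results[video_id] = f"failed: {exc}"
            continue
        (out / f"{video_id}{suffix}").write_text(text, encoding="utf-8")
        results[video_id] = "ok"
    return results


# Demo with a stub fetcher instead of a real transcript request.
with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / "ids.txt"
    ids_file.write_text("fR9ClX0egTc\n\n9bZkp7q19f0\n", encoding="utf-8")
    demo_results = run_batch(ids_file, Path(tmp) / "out",
                             fetch=lambda vid: f"transcript for {vid}")
    written = sorted(p.name for p in (Path(tmp) / "out").iterdir())
print(demo_results, written)
```

Catching the exception per video keeps one failing ID from aborting the whole run.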
Search Within Transcripts
# Find mentions of specific topics
yt-transcript --search "machine learning" VIDEO_ID
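The same kind of search can be approximated in Python over a list of segments. `search_segments` below is a hypothetical helper that does a case-insensitive substring match; how the CLI's `--search` option matches text is not documented here and may differ.

```python
def search_segments(segments, query):
    """Return segments whose text contains the query, ignoring case (hypothetical helper)."""
    needle = query.casefold()
    return [seg for seg in segments if needle in seg["text"].casefold()]


search_demo = [
    {"start": 0.0, "text": "Welcome to this Machine Learning course."},
    {"start": 4.1, "text": "First, some housekeeping."},
    {"start": 9.7, "text": "Why machine learning matters."},
]
for hit in search_segments(search_demo, "machine learning"):
    print(f"{hit['start']:.1f}s: {hit['text']}")
```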
Advanced Python Features
from yt_ts_extract import YouTubeTranscriptExtractor
from yt_ts_extract.utils import get_transcript_stats
extractor = YouTubeTranscriptExtractor()
# Get timestamped segments for an ID
segments = extractor.get_transcript("fR9ClX0egTc", language="en")
for seg in segments[:5]:
    print(f"{seg['start']:.1f}s: {seg['text']}")
# Get statistics about the transcript
stats = get_transcript_stats(segments)
print(f"Duration: {stats['duration_seconds']:.1f} seconds")
print(f"Word count: {stats['word_count']} words")
🏗️ Technical Architecture
Android Client Implementation
The extractor uses Android YouTube client headers to bypass anti-bot measures:
User-Agent: com.google.android.youtube/20.10.38 (Linux; U; Android 14) gzip
X-YouTube-Client-Name: 3
X-YouTube-Client-Version: 20.10.38
Content-Type: application/json
Dual XML Parser System
- Legacy format: Direct XML transcript data
- Current format: API-based JSON responses with embedded XML
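The README does not show the XML itself, so the following is only a sketch of what parsing the legacy format might look like. It assumes a payload made of `<text start="..." dur="...">` elements, and `parse_text_elements` is a hypothetical function rather than the library's parser.

```python
import html
import xml.etree.ElementTree as ET


def parse_text_elements(xml_string):
    """Parse <text start=... dur=...> elements into segment dicts (assumed layout)."""
    root = ET.fromstring(xml_string)
    segments = []
    for node in root.iter("text"):
        start = float(node.get("start", 0.0))
        duration = float(node.get("dur", 0.0))
        # Caption text may still contain HTML entities after XML decoding.
        text = html.unescape("".join(node.itertext())).strip()
        segments.append({
            "start": start,
            "end": round(start + duration, 3),
            "duration": duration,
            "text": text,
        })
    return segments


sample_xml = (
    "<transcript>"
    '<text start="0.0" dur="3.2">Hello everyone and welcome to this tutorial.</text>'
    '<text start="3.2" dur="4.64">In this video we&amp;#39;ll be covering the basics of...</text>'
    "</transcript>"
)
for seg in parse_text_elements(sample_xml):
    print(seg)
```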
Proxy Architecture
- Rotation strategies: random, round_robin, least_used
- Health monitoring: Automatic health checks and failed proxy deactivation
- Recovery: Reactivation after cooldown periods
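Deactivation followed by reactivation after a cooldown can be sketched with a small tracker like the one below. `CooldownTracker`, its 60-second default, and the injectable `clock` are assumptions made for this illustration; the library's actual bookkeeping is not described in this README.

```python
import time


class CooldownTracker:
    """Tracks which proxies are temporarily out of rotation (illustrative only)."""

    def __init__(self, cooldown_seconds=60.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failed_at = {}

    def mark_failed(self, proxy):
        """Take a proxy out of rotation starting now."""
        self._failed_at[proxy] = self._clock()

    def is_active(self, proxy):
        failed_at = self._failed_at.get(proxy)
        if failed_at is None:
            return True
        if self._clock() - failed_at >= self.cooldown_seconds:
            # Cooldown elapsed: forget the failure and reactivate the proxy.
            del self._failed_at[proxy]
            return True
        return False

    def active(self, proxies):
        """Return only the proxies currently eligible for selection."""
        return [p for p in proxies if self.is_active(p)]


# Demo with a fake clock so no real waiting is needed.
fake_now = [0.0]
tracker = CooldownTracker(cooldown_seconds=30.0, clock=lambda: fake_now[0])
tracker.mark_failed("proxy-a")
print(tracker.active(["proxy-a", "proxy-b"]))
fake_now[0] = 31.0
print(tracker.active(["proxy-a", "proxy-b"]))
```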
Error Handling & Recovery
- Exponential backoff: Prevents overwhelming servers during failures
- Retry mechanisms: Configurable retry logic with circuit breaking
- Graceful degradation: Falls back to alternative extraction methods
- Rate limiting: Built-in delays prevent IP-based blocking
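The retry-with-backoff idea can be shown in a few lines. This sketch assumes the delay before retry n is `backoff_factor * 2**n`, a common convention that the library may or may not follow exactly. `backoff_delays` and `call_with_retries` are hypothetical names, and circuit breaking is left out.

```python
import time


def backoff_delays(max_retries, backoff_factor):
    """Delay before each retry: factor * 2**attempt (assumed formula)."""
    return [backoff_factor * (2 ** attempt) for attempt in range(max_retries)]


def call_with_retries(func, max_retries=3, backoff_factor=0.75, sleep=time.sleep):
    """Call func(); on failure wait with growing delays, re-raising after the last retry."""
    delays = backoff_delays(max_retries, backoff_factor)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(delays[attempt])


# Demo: a function that fails twice and then succeeds; waits are recorded, not slept.
attempts = []
waits = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.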
🧪 Testing
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=yt_ts_extract --cov-report=term-missing
# Run specific test suites
uv run pytest tests/test_proxy_manager.py -v
uv run pytest tests/test_e2e_proxy.py -v
Test Categories
- Unit tests: Individual component testing
- Integration tests: CLI and API integration testing
- E2E tests: Full workflow testing with real YouTube videos
- Proxy tests: Proxy rotation and health check testing
- Network resilience: Timeout and retry behavior testing
🤝 Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes and add tests
- Run the test suite: uv run pytest
- Submit a pull request
Development Setup
git clone https://github.com/sinjab/yt-ts-extract.git
cd yt-ts-extract
uv sync # Install dependencies
uv run pytest # Run tests
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- Issues: GitHub Issues
- Documentation: This README and inline code documentation
- Examples: Check the examples/ directory for usage patterns
Made with ❤️ for the developer community. Happy transcript extracting!