Extract transcripts from YouTube videos with multi-language support and various output formats

These details have not been verified by PyPI

Project links

Project description

yt-ts-extract

A robust Python library and CLI tool for extracting YouTube video transcripts with multi-language support and proxy rotation capabilities.

✨ Key Features

Extract transcripts from YouTube videos via video ID
26+ language support (English, Spanish, French, German, Japanese, Arabic, Chinese, etc.)
Multiple output formats: plain text, SRT subtitles, timestamped segments, JSON
Batch processing for multiple videos
Anti-bot protection: Android client implementation bypasses detection
Proxy rotation: Multiple proxy support with automatic rotation strategies
Both CLI and Python library interfaces

🚀 Installation

# Install from PyPI
pip install yt-ts-extract

# Or install in development mode
git clone https://github.com/sinjab/yt-ts-extract.git
cd yt-ts-extract
pip install -e .

📖 Quick Start

Command Line Interface

# Basic transcript extraction
yt-transcript fR9ClX0egTc

# Export as SRT subtitles
yt-transcript -f srt -o video.srt fR9ClX0egTc

# List available languages
yt-transcript --list-languages fR9ClX0egTc

# Batch process multiple videos
yt-transcript --batch ids.txt --output-dir ./transcripts/

# Get help
yt-transcript --help

Python Library

from yt_ts_extract import (
    get_transcript,
    get_transcript_text,
    get_available_languages,
    YouTubeTranscriptExtractor,
)
from yt_ts_extract.utils import export_to_srt, get_transcript_stats

# Quick transcript extraction
transcript = get_transcript("fR9ClX0egTc")
print(f"Segments: {len(transcript)}")

# Export to SRT
srt_text = export_to_srt(transcript)
with open("video.srt", "w", encoding="utf-8") as f:
    f.write(srt_text)

# Plain text and languages
text = get_transcript_text("fR9ClX0egTc")
langs = get_available_languages("fR9ClX0egTc")
print(f"Languages available: {[l['code'] for l in langs]}")

# Using the class directly
extractor = YouTubeTranscriptExtractor(
    timeout=20,
    max_retries=5,
    backoff_factor=1.0,
    min_delay=1.5
)
segments = extractor.get_transcript("fR9ClX0egTc", language="en")
stats = get_transcript_stats(segments)
print(stats)

🎛️ CLI Options

yt-transcript [OPTIONS] VIDEO_ID

Options:
  -f, --format [text|srt|segments|stats]  Output format (default: text)
  -o, --output PATH                       Save output to file
  -l, --language TEXT                     Language code (e.g., 'en', 'es', 'fr')
  --list-languages                        Show available languages for video
  --batch PATH                            Process video IDs from file (one per line)
  --output-dir PATH                       Directory for batch output files
  --search TEXT                           Search for specific text in transcript
  --examples                              Show usage examples
  --timeout FLOAT                         Per-request timeout in seconds (default: 30)
  --retries INT                           Max HTTP retries on failure (default: 3)
  --backoff FLOAT                         Exponential backoff factor (default: 0.75)
  --min-delay FLOAT                       Minimum delay between requests (default: 2)
  --proxy TEXT                            Proxy URL (e.g., "http://user:pass@host:port")
  --proxy-list PATH                       Proxy list file for rotation
  --rotation-strategy [random|round_robin|least_used]  Proxy rotation strategy (default: random)
  --health-check                          Perform health check on all proxies before starting
  --help                                  Show this message and exit

Network Tuning Examples

# Increase retries and timeout
yt-transcript fR9ClX0egTc --retries 5 --timeout 45

# Reduce delay for faster runs (use responsibly)
yt-transcript fR9ClX0egTc --min-delay 1.0 --backoff 0.5

# Single proxy
yt-transcript fR9ClX0egTc --proxy "http://user:pass@host:port"

# Proxy rotation with health check
yt-transcript fR9ClX0egTc --proxy-list proxies.txt --health-check

# Batch processing with proxy rotation
yt-transcript --batch ids.txt --proxy-list proxies.txt --output-dir transcripts/

🔄 Proxy Support

Single Proxy

# HTTP proxy with authentication
yt-transcript --proxy "http://username:password@proxy-host:8080" fR9ClX0egTc

# HTTPS proxy
yt-transcript --proxy "https://proxy-host:8443" fR9ClX0egTc

# SOCKS5 proxy
yt-transcript --proxy "socks5://user:pass@proxy-host:1080" fR9ClX0egTc

Proxy Rotation

Load multiple proxies from a file and automatically rotate between them:

# Basic proxy rotation
yt-transcript --proxy-list proxies.txt fR9ClX0egTc

# With rotation strategy
yt-transcript --proxy-list proxies.txt --rotation-strategy round_robin fR9ClX0egTc

# With health check
yt-transcript --proxy-list proxies.txt --health-check fR9ClX0egTc

Proxy List File Format (proxies.txt):

Address Port Username Password
23.95.150.145 6114 mhzbhrwb yj2veiaafrbu
198.23.239.134 6540 mhzbhrwb yj2veiaafrbu
45.38.107.97 6014 mhzbhrwb yj2veiaafrbu
64.137.96.74 6641 mhzbhrwb yj2veiaafrbu
216.10.27.159 6837 mhzbhrwb yj2veiaafrbu
136.0.207.84 6661 mhzbhrwb yj2veiaafrbu

Rotation Strategies:

random: Random proxy selection (default)
round_robin: Cycle through proxies in order
least_used: Select least recently used proxy

Python Proxy Usage

from yt_ts_extract import YouTubeTranscriptExtractor, ProxyManager

# Single proxy
extractor = YouTubeTranscriptExtractor(
    proxy="http://user:pass@host:port",
    timeout=30,
    max_retries=3
)

# Proxy rotation
proxy_manager = ProxyManager.from_file("proxies.txt", rotation_strategy="round_robin")
extractor = YouTubeTranscriptExtractor(
    proxy_manager=proxy_manager,
    timeout=30,
    max_retries=3
)

# Convenience functions with proxy rotation
from yt_ts_extract import get_transcript_with_proxy_rotation
transcript = get_transcript_with_proxy_rotation("fR9ClX0egTc", "proxies.txt")

Proxy Best Practices:

Use --health-check to verify proxy connectivity before processing
Failed proxies are automatically deactivated and reactivated after cooldown
Each proxy respects minimum delay between requests
Monitor proxy health with extractor.get_proxy_stats()

📊 Output Formats

1. Plain Text (`text`)

Hello everyone and welcome to this tutorial.
In this video we'll be covering the basics of...

2. SRT Subtitles (`srt`)

1
00:00:00,000 --> 00:00:03,200
Hello everyone and welcome to this tutorial.

2
00:00:03,200 --> 00:00:07,840
In this video we'll be covering the basics of...

3. Timestamped Segments (`segments`)

[
  {
    "start": 0.0,
    "end": 3.2,
    "duration": 3.2,
    "text": "Hello everyone and welcome to this tutorial."
  },
  {
    "start": 3.2,
    "end": 7.84,
    "duration": 4.64,
    "text": "In this video we'll be covering the basics of..."
  }
]

4. Statistics (`stats`)

{
  "total_segments": 245,
  "total_duration": 1823.4,
  "word_count": 2156,
  "average_words_per_segment": 8.8,
  "languages_available": ["en", "es", "fr", "de"]
}

🌍 Language Support

Supports 26+ languages with automatic detection:

Language	Code	Language	Code
English	`en`	Spanish	`es`
French	`fr`	German	`de`
Italian	`it`	Portuguese	`pt`
Russian	`ru`	Japanese	`ja`
Korean	`ko`	Chinese (Simplified)	`zh-Hans`
Chinese (Traditional)	`zh-Hant`	Arabic	`ar`
Hindi	`hi`	Dutch	`nl`
Polish	`pl`	Turkish	`tr`

Use --list-languages to see available languages for any video.

🔧 Advanced Usage

Batch Processing

Create an ids.txt file (one video ID per line):

fR9ClX0egTc
9bZkp7q19f0
wIwCTQZ_xFE

Process all videos:

yt-transcript --batch ids.txt --format srt --output-dir ./subtitles/

Search Within Transcripts

# Find mentions of specific topics
yt-transcript --search "machine learning" VIDEO_ID

Advanced Python Features

from yt_ts_extract import YouTubeTranscriptExtractor
from yt_ts_extract.utils import get_transcript_stats

extractor = YouTubeTranscriptExtractor()

# Get timestamped segments for an ID
segments = extractor.get_transcript("fR9ClX0egTc", language="en")
for seg in segments[:5]:
    print(f"{seg['start']:.1f}s: {seg['text']}")

# Get statistics about the transcript
stats = get_transcript_stats(segments)
print(f"Duration: {stats['duration_seconds']:.1f} seconds")
print(f"Word count: {stats['word_count']} words")

🏗️ Technical Architecture

Android Client Implementation

The extractor uses Android YouTube client headers to bypass anti-bot measures:

User-Agent: com.google.android.youtube/20.10.38 (Linux; U; Android 14) gzip
X-YouTube-Client-Name: 3
X-YouTube-Client-Version: 20.10.38
Content-Type: application/json

Dual XML Parser System

Legacy format: Direct XML transcript data
Current format: API-based JSON responses with embedded XML

Proxy Architecture

Rotation strategies: random, round_robin, least_used
Health monitoring: Automatic health checks and failed proxy deactivation
Recovery: Reactivation after cooldown periods

Error Handling & Recovery

Exponential backoff: Prevents overwhelming servers during failures
Retry mechanisms: Configurable retry logic with circuit breaking
Graceful degradation: Falls back to alternative extraction methods
Rate limiting: Built-in delays prevent IP-based blocking

🧪 Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=yt_ts_extract --cov-report=term-missing

# Run specific test suites
uv run pytest tests/test_proxy_manager.py -v
uv run pytest tests/test_e2e_proxy.py -v

Test Categories

Unit tests: Individual component testing
Integration tests: CLI and API integration testing
E2E tests: Full workflow testing with real YouTube videos
Proxy tests: Proxy rotation and health check testing
Network resilience: Timeout and retry behavior testing

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes and add tests
Run the test suite: uv run pytest
Submit a pull request

Development Setup

git clone https://github.com/sinjab/yt-ts-extract.git
cd yt-ts-extract
uv sync  # Install dependencies
uv run pytest  # Run tests

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Issues: GitHub Issues
Documentation: This README and inline code documentation
Examples: Check the examples/ directory for usage patterns

Made with ❤️ for the developer community. Happy transcript extracting!

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.0

Aug 24, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yt_ts_extract-1.0.0.tar.gz (45.9 kB view details)

Uploaded Aug 24, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

yt_ts_extract-1.0.0-py3-none-any.whl (28.0 kB view details)

Uploaded Aug 24, 2025 Python 3

File details

Details for the file yt_ts_extract-1.0.0.tar.gz.

File metadata

Download URL: yt_ts_extract-1.0.0.tar.gz
Upload date: Aug 24, 2025
Size: 45.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for yt_ts_extract-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`6df9f8000f76bc89868694584c5a23c6db54a34a85488d2e3adec22096d83ee5`
MD5	`00526744d5770acb92ee27d619d84424`
BLAKE2b-256	`5b50aa40da1178ed298d79a2daf940369798aedb5c70123e6aca2500c06f2968`

See more details on using hashes here.

File details

Details for the file yt_ts_extract-1.0.0-py3-none-any.whl.

File metadata

Download URL: yt_ts_extract-1.0.0-py3-none-any.whl
Upload date: Aug 24, 2025
Size: 28.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for yt_ts_extract-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ef4747edba10b629a1d1a1471532e7b0de97b331b5b201df25759c4facb027df`
MD5	`bc06f83224298d52c9bf0b96b1aefcc4`
BLAKE2b-256	`9271786d412d6374b294c85fa77349fa351069035abc4085ccf7d2be81a6737a`

See more details on using hashes here.

yt-ts-extract 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

yt-ts-extract

✨ Key Features

🚀 Installation

📖 Quick Start

Command Line Interface

Python Library

🎛️ CLI Options

Network Tuning Examples

🔄 Proxy Support

Single Proxy

Proxy Rotation

Python Proxy Usage

📊 Output Formats

1. Plain Text (text)

2. SRT Subtitles (srt)

3. Timestamped Segments (segments)

4. Statistics (stats)

🌍 Language Support

🔧 Advanced Usage

Batch Processing

Search Within Transcripts

Advanced Python Features

🏗️ Technical Architecture

Android Client Implementation

Dual XML Parser System

Proxy Architecture

Error Handling & Recovery

🧪 Testing

Test Categories

🤝 Contributing

Development Setup

📄 License

🆘 Support

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

1. Plain Text (`text`)

2. SRT Subtitles (`srt`)

3. Timestamped Segments (`segments`)

4. Statistics (`stats`)