
Project description

subxx

YouTube transcript / subtitle fetching toolkit for Python - Download, extract, and process subtitles from video URLs with a simple CLI or HTTP API.

Python 3.9+ · License: CC BY 4.0 · Development Status: Alpha


Features

  • Download YouTube subtitles from videos and channels (powered by yt-dlp)
  • Multiple output formats: SRT, VTT, TXT, Markdown, PDF
  • Text extraction with automatic subtitle cleanup and optional timestamp markers
  • Language selection: Download specific languages or all available subtitles
  • Batch processing: Process multiple URLs from a file
  • Configuration files: Project and global settings via TOML
  • HTTP API: Optional FastAPI server for programmatic access
  • Dry-run mode: Preview operations without downloading
  • Filename sanitization: safe, nospaces, or slugify modes


Installation

Requirements

  • Python 3.9 or higher
  • uv package manager (recommended)

Install with uv (recommended)

# Clone or download the project
git clone https://gist.github.com/cprima/subxx
cd subxx

# Install core dependencies
uv sync

# Install with optional features
uv sync --extra extract      # Text extraction (txt/md/pdf)
uv sync --extra api          # HTTP API server
uv sync --extra dev          # Development tools (pytest)

# Install all features
uv sync --extra extract --extra api --extra dev

Using Make (Windows)

make install          # Core dependencies
make install-all      # All dependencies (extract + api + dev)

Quick Start

Basic Usage

# List available subtitles
uv run python __main__.py list https://youtu.be/VIDEO_ID

# Download English subtitle (SRT format, default)
uv run python __main__.py subs https://youtu.be/VIDEO_ID

# Extract to plain text
uv run python __main__.py subs https://youtu.be/VIDEO_ID --txt

# Extract to Markdown with 5-minute timestamps
uv run python __main__.py subs https://youtu.be/VIDEO_ID --md -t 300

# Extract to PDF
uv run python __main__.py subs https://youtu.be/VIDEO_ID --pdf

With Makefile

# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ

# With timestamps
make md VIDEO_ID=dQw4w9WgXcQ TIMESTAMPS=300

Usage

List Available Subtitles

Preview available subtitle languages without downloading:

uv run python __main__.py list https://youtu.be/VIDEO_ID

Output:

📹 Video: Example Video Title
🕒 Duration: 12:34

✅ Manual subtitles:
   - en
   - es

🤖 Auto-generated subtitles:
   - en, de, fr, ja, ko, pt, ru, zh-Hans, ...

Options:

  • -v, --verbose - Debug output
  • -q, --quiet - Errors only

Download Subtitles

Format Selection

Download subtitle files in SRT or VTT format:

# Download SRT (default)
uv run python __main__.py subs https://youtu.be/VIDEO_ID

# Download VTT
uv run python __main__.py subs https://youtu.be/VIDEO_ID --vtt

# Using the -f/--fmt flag explicitly
uv run python __main__.py subs https://youtu.be/VIDEO_ID -f srt

Behavior: Subtitle files (SRT/VTT) are downloaded and kept on disk.

Language Selection

# Download English (default)
uv run python __main__.py subs https://youtu.be/VIDEO_ID

# Download specific language
uv run python __main__.py subs https://youtu.be/VIDEO_ID -l de

# Download multiple languages
uv run python __main__.py subs https://youtu.be/VIDEO_ID -l "en,de,fr"

# Download all available languages
uv run python __main__.py subs https://youtu.be/VIDEO_ID -l all

Output Directory

# Save to specific directory
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o ~/Downloads/subs

# Use current directory (default)
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o .

Filename Sanitization

# Safe mode: Remove unsafe characters, keep spaces (default)
uv run python __main__.py subs URL --sanitize safe

# No spaces: Replace spaces with underscores
uv run python __main__.py subs URL --sanitize nospaces

# Slugify: Lowercase, hyphens, URL-safe
uv run python __main__.py subs URL --sanitize slugify

Examples:

  • safe: "My Video Title.srt" → "My Video Title.srt" (unchanged)
  • nospaces: "My Video Title.srt" → "My_Video_Title.srt"
  • slugify: "My Video Title.srt" → "my-video-title.srt"
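The three modes amount to simple string transforms. A minimal sketch of how they could be implemented (illustrative only, not subxx's actual internals; the function name is hypothetical and the extension is assumed to be appended afterwards):

```python
import re

def sanitize_filename(title: str, mode: str = "safe") -> str:
    """Apply one of the three sanitization modes to a video title (sketch)."""
    # Drop characters that are unsafe on common filesystems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", title)
    if mode == "safe":
        return cleaned                      # keep spaces as-is
    if mode == "nospaces":
        return cleaned.replace(" ", "_")    # spaces become underscores
    if mode == "slugify":
        # Lowercase, runs of non-alphanumerics become single hyphens.
        slug = re.sub(r"[^a-z0-9]+", "-", cleaned.lower())
        return slug.strip("-")
    raise ValueError(f"unknown sanitize mode: {mode}")
```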

Overwrite Handling

# Prompt before overwriting (default)
uv run python __main__.py subs URL

# Force overwrite without prompting
uv run python __main__.py subs URL --force

# Skip existing files
uv run python __main__.py subs URL --skip-existing

Auto-Generated Subtitles

# Include auto-generated subtitles (default)
uv run python __main__.py subs URL --auto

# Only manual subtitles
uv run python __main__.py subs URL --no-auto

Dry Run

Preview what would be downloaded without actually downloading:

uv run python __main__.py subs URL --dry-run

Output:

[DRY RUN] Would download subtitle: en

Text Extraction

Extract clean, readable text from subtitles by automatically removing timestamps and formatting.

Key behavior: When using text formats (txt/md/pdf), subxx:

  1. Downloads the subtitle as SRT
  2. Extracts the text content
  3. Automatically deletes the SRT file
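The cleanup in step 2 boils down to dropping cue indices, timing lines, and blanks from the SRT. A rough sketch of that idea (not subxx's actual extraction code; the function name is illustrative):

```python
import re

# SRT timing lines look like: 00:00:00,000 --> 00:00:02,000
SRT_TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")

def srt_to_text(srt: str) -> str:
    """Keep only the caption text lines of an SRT document (sketch)."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or SRT_TIMING.match(line):
            continue  # skip blanks, cue indices, and timing lines
        kept.append(line)
    return "\n".join(kept)
```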

Plain Text

# Extract to plain text
uv run python __main__.py subs URL --txt

Output: Video_Title.VIDEO_ID.en.txt

Example content:

Hello world.
This is a subtitle.
Welcome to the video.

Markdown

# Extract to Markdown
uv run python __main__.py subs URL --md

# Markdown with timestamp markers every 5 minutes
uv run python __main__.py subs URL --md -t 300

# Markdown with timestamp markers every 30 seconds
uv run python __main__.py subs URL --md -t 30

Output: Video_Title.VIDEO_ID.en.md

Example content (with timestamps):

## [0:00]

Hello world.
This is a subtitle.

## [5:00]

Welcome to the next section.
More content here.

## [10:00]

Final section of the video.

PDF

# Extract to PDF
uv run python __main__.py subs URL --pdf

# PDF with timestamp markers
uv run python __main__.py subs URL --pdf -t 300

Output: Video_Title.VIDEO_ID.en.pdf

Requirements: Install extraction dependencies:

uv sync --extra extract

Timestamp Intervals

Add timestamp markers at regular intervals for long-form content:

# Every 5 minutes (300 seconds)
uv run python __main__.py subs URL --md -t 300

# Every 30 seconds
uv run python __main__.py subs URL --txt -t 30

# Every 10 minutes
uv run python __main__.py subs URL --pdf -t 600

Format: Timestamps appear as ## [0:00], ## [5:00], ## [10:00], etc.
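The marker arithmetic is just minutes and seconds. A sketch of producing headings in the format shown above (illustrative; long videos past the hour mark may be formatted differently by subxx itself):

```python
def marker(seconds: int) -> str:
    """Format an interval boundary as a '## [M:SS]' heading (sketch)."""
    minutes, secs = divmod(seconds, 60)
    return f"## [{minutes}:{secs:02d}]"
```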


Batch Processing

Download subtitles for multiple URLs from a file:

# Create URLs file (one URL per line)
cat > urls.txt << EOF
https://youtu.be/VIDEO_ID_1
https://youtu.be/VIDEO_ID_2
# This is a comment
https://youtu.be/VIDEO_ID_3
EOF

# Process all URLs
uv run python __main__.py batch urls.txt

# With options
uv run python __main__.py batch urls.txt -l "en,de" -f srt -o ~/subs

Options:

  • -l, --langs - Language codes (default: en)
  • -f, --fmt - Output format (default: srt)
  • -o, --output-dir - Output directory (default: .)
  • --sanitize - Filename sanitization mode (default: safe)
  • -v, --verbose - Verbose output
  • -q, --quiet - Quiet mode

URL File Format (yt-dlp standard):

  • One URL per line
  • Lines starting with # are comments
  • Empty lines are ignored
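Those three rules reduce to a short filter. A sketch of parsing such a file (illustrative, not subxx's own parser):

```python
def parse_url_file(text: str) -> list[str]:
    """Parse a yt-dlp style batch file: one URL per line,
    '#' comment lines and empty lines ignored (sketch)."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```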

Extract from Files

Extract text from existing subtitle files:

# Extract SRT to plain text
uv run python __main__.py extract video.srt

# Extract to Markdown
uv run python __main__.py extract video.srt -f md

# Extract to PDF
uv run python __main__.py extract video.srt -f pdf

# With timestamp markers every 5 minutes
uv run python __main__.py extract video.srt -f md -t 300

# Specify output file
uv run python __main__.py extract video.srt -o output.txt

# Force overwrite
uv run python __main__.py extract video.srt --force

Supported input formats: SRT, VTT


Configuration

Config File Locations

Configuration files are loaded in priority order:

  1. ./.subxx.toml (project-specific, current directory)
  2. ~/.subxx.toml (user global, home directory)

Priority Chain

Settings are resolved in this order (highest to lowest):

  1. CLI flags (e.g., --langs en, --fmt srt)
  2. Config file (.subxx.toml)
  3. Hardcoded defaults
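One way to picture the resolution: start from the defaults, overlay the config file, then overlay any CLI flags that were actually given. A sketch (not subxx's actual load_config implementation; names and default values are illustrative):

```python
# Hardcoded defaults sit at the bottom of the chain (values illustrative).
DEFAULTS = {"langs": "en", "fmt": "srt", "output_dir": ".", "sanitize": "safe"}

def resolve_settings(cli_flags: dict, config_file: dict) -> dict:
    """Merge settings so CLI flags beat the config file, which beats defaults."""
    merged = dict(DEFAULTS)
    merged.update({k: v for k, v in config_file.items() if v is not None})
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```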

Example Configuration

Copy .subxx.toml.example to .subxx.toml or ~/.subxx.toml:

cp .subxx.toml.example ~/.subxx.toml

Example config:

[defaults]
# Language codes (comma-separated or "all")
langs = "en"

# Output format: srt, vtt, txt, md, pdf
fmt = "md"

# Include auto-generated subtitles
auto = true

# Output directory (supports ~)
output_dir = "~/Downloads/subtitles"

# Filename sanitization: safe, nospaces, slugify
sanitize = "safe"

# Timestamp interval (seconds) for txt/md/pdf
timestamps = 300  # 5-minute intervals

[logging]
# Log level: DEBUG, INFO, WARNING, ERROR
level = "INFO"

# Log file (optional)
log_file = "~/.subxx/subxx.log"

Use Case Configurations

Configuration 1: Download SRT files to dedicated directory

[defaults]
langs = "en"
fmt = "srt"
output_dir = "~/Downloads/subtitles"

Configuration 2: Auto-extract to Markdown with timestamps

[defaults]
langs = "en"
fmt = "md"
timestamps = 300
output_dir = "~/Documents/transcripts"

Configuration 3: Multiple languages, plain text

[defaults]
langs = "en,de,fr"
fmt = "txt"
sanitize = "slugify"
output_dir = "./subtitles"

Makefile Shortcuts

Available Targets

# Installation
make install          # Core dependencies
make install-all      # All dependencies (extract + api + dev)

# Testing
make test             # Run all tests
make test-unit        # Unit tests only
make test-integration # Integration tests only
make test-coverage    # Tests with coverage report

# Usage
make list VIDEO_URL=https://youtu.be/VIDEO_ID
make subs VIDEO_URL=https://youtu.be/VIDEO_ID
make md VIDEO_ID=VIDEO_ID                       # Quick Markdown extraction
make md VIDEO_ID=VIDEO_ID TIMESTAMPS=300        # With timestamps

# Utilities
make version          # Show version
make clean            # Clean cache files
make clean-all        # Clean everything including .venv

Examples

# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ

# With 5-minute timestamps
make md VIDEO_ID=lHuxDMMkGJ8 TIMESTAMPS=300

# List subtitles
make list VIDEO_URL=https://youtu.be/dQw4w9WgXcQ

# Download with languages
make subs VIDEO_URL=https://youtu.be/dQw4w9WgXcQ LANGS=en,de

HTTP API

Start an HTTP API server for programmatic access (requires API dependencies):

Installation

# Install API dependencies
uv sync --extra api

# Or with Make
make install-api

Start Server

# Start on localhost:8000 (default)
uv run python __main__.py serve

# Custom host/port
uv run python __main__.py serve --host 127.0.0.1 --port 8080

Security Warning: The API has NO authentication and should ONLY run on localhost (127.0.0.1).

API Endpoints

POST /subs

Fetch subtitles and return content directly.

Request:

{
  "url": "https://youtu.be/VIDEO_ID",
  "langs": "en",
  "fmt": "srt",
  "auto": true,
  "sanitize": "safe"
}

Response: Subtitle file content as plain text.

Example:

curl -X POST http://127.0.0.1:8000/subs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://youtu.be/dQw4w9WgXcQ",
    "langs": "en",
    "fmt": "srt"
  }'
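From Python, the same call needs only the standard library. A sketch against the endpoint above (field names taken from the request example; error handling omitted, and the helper names are illustrative):

```python
import json
import urllib.request

def build_subs_request(url: str, langs: str = "en", fmt: str = "srt",
                       base: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build the POST /subs request with a JSON body (sketch)."""
    body = json.dumps({"url": url, "langs": langs, "fmt": fmt}).encode()
    return urllib.request.Request(
        f"{base}/subs", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

def fetch_subs(url: str, **kwargs) -> str:
    """Send the request and return the subtitle content as text."""
    with urllib.request.urlopen(build_subs_request(url, **kwargs)) as resp:
        return resp.read().decode("utf-8")
```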

GET /health

Health check endpoint.

Response:

{
  "status": "ok",
  "service": "subxx"
}

API Documentation

Interactive API docs available at:

  • Swagger UI: http://127.0.0.1:8000/docs
  • ReDoc: http://127.0.0.1:8000/redoc

Development

Setup Development Environment

# Clone repository
git clone https://gist.github.com/cprima/subxx
cd subxx

# Install all dependencies (core + extract + api + dev)
uv sync --extra extract --extra api --extra dev

# Or with Make
make install-all

Project Structure

subxx/
├── __main__.py              # CLI entry point (Typer commands)
├── subxx.py                 # Core library functions
├── test_subxx.py            # Test suite (pytest)
├── conftest.py              # Pytest configuration
├── pyproject.toml           # Project metadata and dependencies
├── Makefile                 # Build and test automation
├── .subxx.toml.example      # Example configuration file
└── !README.md               # This file

Key Components

  • subxx.py: Core library

    • fetch_subs() - Download subtitles
    • extract_text() - Extract text from subtitles
    • load_config() - Configuration management
    • Helper functions for parsing, sanitization, logging
  • __main__.py: CLI application

    • list - List available subtitles
    • subs - Download subtitles
    • batch - Batch processing
    • extract - Extract from files
    • serve - HTTP API server
    • version - Version information

Testing

Run Tests

# All tests
make test

# Unit tests only (fast, no network)
make test-unit

# Integration tests only
make test-integration

# With coverage report
make test-coverage

# Verbose output
make test-verbose

Test Categories

  • Unit tests (@pytest.mark.unit): No external dependencies, mocked I/O
  • Integration tests (@pytest.mark.integration): May use files/network
  • E2E tests (@pytest.mark.e2e): Real YouTube API, requires internet
  • Slow tests (@pytest.mark.slow): Network I/O, real downloads

Running Specific Test Categories

# Run all tests except e2e (fast, for CI)
pytest -m "not e2e"

# Run only e2e tests (slow, requires internet)
pytest -m e2e

# Run unit tests only
pytest -m unit

Test Coverage

The suite currently comprises ~50 tests (unit, integration, and e2e).

Key areas tested:

  • Configuration loading and defaults
  • Language parsing
  • Filename sanitization
  • Text extraction (txt/md/pdf)
  • Timestamp markers
  • CLI commands
  • Overwrite protection
  • Real YouTube subtitle download (e2e)

Exit Codes

  • 0 - Success
  • 1 - User cancelled
  • 2 - No subtitles available
  • 3 - Network error
  • 4 - Invalid URL
  • 5 - Configuration error
  • 6 - File error
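In scripts, these codes can drive retry or reporting logic. A sketch that maps the documented codes to labels (the subprocess invocation mirrors the CLI examples above; the wrapper itself is illustrative):

```python
import subprocess

# The documented subxx exit codes.
EXIT_CODES = {
    0: "success",
    1: "user cancelled",
    2: "no subtitles available",
    3: "network error",
    4: "invalid URL",
    5: "configuration error",
    6: "file error",
}

def run_subxx(*args: str) -> int:
    """Run subxx via uv and print the documented meaning of its exit code."""
    result = subprocess.run(["uv", "run", "python", "__main__.py", *args])
    label = EXIT_CODES.get(result.returncode,
                           f"unknown exit code {result.returncode}")
    print(f"subxx exited {result.returncode}: {label}")
    return result.returncode
```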

Troubleshooting

Missing Dependencies for Text Extraction

Error:

❌ Error: Missing dependencies for text extraction

Solution:

uv sync --extra extract

Missing Dependencies for API

Error:

❌ Error: API dependencies not installed

Solution:

uv sync --extra api

Windows Console Encoding Issues

If you see encoding errors on Windows, the tool automatically attempts to reconfigure stdout/stderr to UTF-8. If issues persist, use:

# Set console to UTF-8
chcp 65001

yt-dlp Network Errors

If downloads fail with network errors:

  1. Update yt-dlp:

    uv sync --upgrade
    
  2. Check firewall/proxy settings

  3. Try with --verbose for debug output:

    uv run python __main__.py subs URL --verbose
    

Roadmap

Future enhancements planned:

  • Progress bars for downloads
  • Retry logic for network failures
  • Subtitle merging/combining
  • Translation support
  • Docker container
  • GitHub Actions CI/CD
  • Published package on PyPI
  • SRT/VTT format conversion
  • Subtitle editing/manipulation

Contributing

Contributions welcome! This is an alpha project under active development.

How to Contribute

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass: make test
  6. Submit a pull request

Guidelines

  • Follow existing code style
  • Add docstrings for new functions
  • Update tests for changes
  • Update README for new features
  • Keep commits focused and atomic

License

This project is licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International).

You are free to:

  • Share - Copy and redistribute the material
  • Adapt - Remix, transform, and build upon the material

Under the following terms:

  • Attribution - You must give appropriate credit

See LICENSE for full details.


Author

Christian Prior-Mamulyan



subxx - Simple, powerful YouTube transcript / subtitle fetching for Python.


Download files

Download the file for your platform.

Source Distribution

subxx-0.3.0.tar.gz (176.9 kB)


Built Distribution


subxx-0.3.0-py3-none-any.whl (32.9 kB)


File details

Details for the file subxx-0.3.0.tar.gz.

File metadata

  • Download URL: subxx-0.3.0.tar.gz
  • Upload date:
  • Size: 176.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for subxx-0.3.0.tar.gz:

  • SHA256: 977fd4a9d858d0c6d17962d1bc26f9d6880ed19569fd483d2add9a57188541d7
  • MD5: 006b0a005be1781fad665a2566299f20
  • BLAKE2b-256: 8f2383a33269b11d9419a15aa6e56aa1b3ece7a77731116f5821e7314d0e479e


File details

Details for the file subxx-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: subxx-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 32.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for subxx-0.3.0-py3-none-any.whl:

  • SHA256: 3d39828712b384b3a28ee3982d1911da9ff84efa87fb99db9ab963b4cee8263f
  • MD5: c3eb661d4da9274c3894a444ac3dd728
  • BLAKE2b-256: a8f9215033f9fb261c5aa1bb484a72c0fc872f597538fd69028eb038965cda65

