subxx
YouTube transcript / subtitle fetching toolkit for Python - Download, extract, and process subtitles from video URLs with a simple CLI or HTTP API.
Features
- Download YouTube subtitles from videos and channels (powered by yt-dlp)
- Multiple output formats: SRT, VTT, TXT, Markdown, PDF
- JSON output: Machine-readable output with `--json` and `--json-file` flags
- Importable module: Use as a Python library with dict-based return values
- Text extraction with automatic subtitle cleanup and optional timestamp markers
- Language selection: Download specific languages or all available subtitles
- Batch processing: Process multiple URLs from a file
- Configuration files: Project and global settings via TOML
- HTTP API: Optional FastAPI server for programmatic access
- Dry-run mode: Preview operations without downloading
- Filename sanitization: safe, nospaces, or slugify modes
Table of Contents
- Installation
- Quick Start
- Module Usage (Python Library)
- Usage
- Configuration
- Makefile Shortcuts
- HTTP API
- Development
- Testing
- License
Installation
Requirements
- Python 3.9 or higher
- uv package manager (recommended)
Install with uv (recommended)
# Clone or download the project
git clone https://gist.github.com/cprima/subxx
cd subxx
# Install core dependencies
uv sync
# Install with optional features
uv sync --extra extract # Text extraction (txt/md/pdf)
uv sync --extra api # HTTP API server
uv sync --extra dev # Development tools (pytest)
# Install all features
uv sync --extra extract --extra api --extra dev
Using Make (Windows)
make install # Core dependencies
make install-all # All dependencies (extract + api + dev)
Quick Start
Basic Usage
# List available subtitles
uv run subxx list https://youtu.be/VIDEO_ID
# Download English subtitle (SRT format, default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Extract to plain text
uv run subxx subs https://youtu.be/VIDEO_ID --txt
# Extract to Markdown with 5-minute timestamps
uv run subxx subs https://youtu.be/VIDEO_ID --md -t 300
# Extract to PDF
uv run subxx subs https://youtu.be/VIDEO_ID --pdf
# Get JSON output for automation
uv run subxx list https://youtu.be/VIDEO_ID --json
uv run subxx subs https://youtu.be/VIDEO_ID --json-file output.json
With Makefile
# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ
# With timestamps
make md VIDEO_ID=dQw4w9WgXcQ TIMESTAMPS=300
Module Usage (Python Library)
New in v0.4.0+: subxx can be imported and used as a Python library. Core functions now return structured data (dicts) instead of exit codes.
Installation
# From test.pypi
pip install -i https://test.pypi.org/simple/ subxx==0.4.1
# Or with uv
uv add subxx==0.4.1 --index https://test.pypi.org/simple/
Basic Example
from subxx import fetch_subs, extract_text

# Download subtitles
result = fetch_subs(
    url="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    langs="en",
    fmt="srt",
    output_dir="./subs",
    logger=None,  # Silent mode
)

if result["status"] == "success":
    print(f"Downloaded: {result['video_title']}")
    for file_info in result["files"]:
        print(f"  {file_info['language']}: {file_info['path']}")
else:
    print(f"Error: {result['error']}")
Return Structure
Functions return comprehensive dicts with all data:
{
  "status": "success" | "error" | "skipped",
  "video_id": "dQw4w9WgXcQ",
  "video_title": "Rick Astley - Never Gonna Give You Up...",
  "files": [
    {
      "path": "/path/to/video.en.srt",
      "language": "en",
      "format": "srt",
      "auto_generated": false
    }
  ],
  "metadata": {...},
  "available_languages": [...],
  "download_info": {...},
  "error": null
}
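For orientation, a caller might branch on the three documented status values like this (a minimal sketch based on the structure above; how a skipped result is populated is an assumption):

from subxx import fetch_subs

result = fetch_subs(
    url="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    langs="en",
    fmt="srt",
    logger=None,
)

if result["status"] == "success":
    for file_info in result["files"]:
        print(f"{file_info['language']} -> {file_info['path']}")
elif result["status"] == "skipped":
    print(f"Skipped {result['video_id']} (nothing downloaded)")
else:  # "error"
    print(f"Failed: {result['error']}")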
Complete Example
from subxx import fetch_subs, extract_text

# 1. Download subtitle
result = fetch_subs(
    url="https://youtube.com/watch?v=...",
    langs="en",
    fmt="srt",
    auto=True,
    output_dir="./transcripts",
    logger=None,
)

if result["status"] != "success":
    print(f"Error: {result['error']}")
    exit(1)

# 2. Extract to Markdown
subtitle_file = result["files"][0]["path"]
extract_result = extract_text(
    subtitle_file=subtitle_file,
    output_format="md",
    use_chapters=True,
    logger=None,
)

if extract_result["status"] == "success":
    print(f"Extracted to: {extract_result['output_files'][0]['path']}")
    print(f"Paragraphs: {len(extract_result['extracted_data']['paragraphs'])}")
Available Functions
from subxx import (
    fetch_subs,     # Download subtitles → dict
    extract_text,   # Extract text from SRT/VTT → dict
    load_config,    # Load .subxx.toml config → dict
    get_default,    # Get config default value
    setup_logging,  # Configure logging
)
Migration from CLI to Module
v0.3.x (not supported as module):
- Functions returned exit codes (int)
- CLI-focused design
v0.4.x (library-first):
- Functions return dicts with comprehensive data
- Optional `logger` parameter (None = silent)
- Clean separation: core functions vs CLI wrapper
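In practice, migrating mostly means swapping exit-code checks for dict checks. A minimal sketch of the new call style (the old behaviour is only paraphrased in the comments):

from subxx import fetch_subs

# v0.3.x (conceptual): functions printed to the console and returned an int exit code.
# v0.4.x: functions return a dict and stay silent when logger=None is passed.
result = fetch_subs(url="https://youtu.be/VIDEO_ID", langs="en", fmt="srt", logger=None)

succeeded = result["status"] == "success"  # replaces the old `exit_code == 0` check
if not succeeded:
    print(result["error"])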
Usage
List Available Subtitles
Preview available subtitle languages without downloading:
# Traditional output
uv run subxx list https://youtu.be/VIDEO_ID
# JSON output
uv run subxx list https://youtu.be/VIDEO_ID --json
# Save to file
uv run subxx list https://youtu.be/VIDEO_ID --json-file metadata.json
Output:
📹 Video: Example Video Title
🕒 Duration: 12:34
✅ Manual subtitles:
- en
- es
🤖 Auto-generated subtitles:
- en, de, fr, ja, ko, pt, ru, zh-Hans, ...
Options:
- `-v, --verbose` - Debug output
- `-q, --quiet` - Errors only
Download Subtitles
Format Selection
Download subtitle files in SRT or VTT format:
# Download SRT (default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Download VTT
uv run subxx subs https://youtu.be/VIDEO_ID --vtt
# Using --fmt flag
uv run subxx subs https://youtu.be/VIDEO_ID -f srt
Behavior: Subtitle files (SRT/VTT) are downloaded and kept on disk.
Language Selection
# Download English (default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Download specific language
uv run subxx subs https://youtu.be/VIDEO_ID -l de
# Download multiple languages
uv run subxx subs https://youtu.be/VIDEO_ID -l "en,de,fr"
# Download all available languages
uv run subxx subs https://youtu.be/VIDEO_ID -l all
Output Directory
# Save to specific directory
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o ~/Downloads/subs
# Use current directory (default)
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o .
Filename Sanitization
# Safe mode: Remove unsafe characters, keep spaces (default)
uv run python __main__.py subs URL --sanitize safe
# No spaces: Replace spaces with underscores
uv run python __main__.py subs URL --sanitize nospaces
# Slugify: Lowercase, hyphens, URL-safe
uv run python __main__.py subs URL --sanitize slugify
Examples:
safe:"My Video Title.srt"→"My Video Title.srt"nospaces:"My Video Title.srt"→"My_Video_Title.srt"slugify:"My Video Title.srt"→"my-video-title.srt"
Overwrite Handling
# Prompt before overwriting (default)
uv run python __main__.py subs URL
# Force overwrite without prompting
uv run python __main__.py subs URL --force
# Skip existing files
uv run python __main__.py subs URL --skip-existing
Auto-Generated Subtitles
# Include auto-generated subtitles (default)
uv run python __main__.py subs URL --auto
# Only manual subtitles
uv run python __main__.py subs URL --no-auto
Dry Run
Preview what would be downloaded without actually downloading:
uv run python __main__.py subs URL --dry-run
Output:
[DRY RUN] Would download subtitle: en
JSON Output
New in v0.4.0: Get machine-readable JSON output for automation and scripting.
Available Commands with JSON Support
- `list` - List available languages
- `subs` - Download subtitles
Output to stdout
# List command with JSON
uv run subxx list "https://youtu.be/dQw4w9WgXcQ" --json
# Subs command with JSON
uv run subxx subs "https://youtu.be/dQw4w9WgXcQ" --json
Example JSON output:
{
  "status": "success",
  "video_id": "dQw4w9WgXcQ",
  "video_title": "Rick Astley - Never Gonna Give You Up...",
  "files": [
    {
      "path": "Rick Astley - Never Gonna Give You Up.dQw4w9WgXcQ.NA.en.srt",
      "language": "en",
      "format": "srt",
      "auto_generated": false
    }
  ],
  "available_languages": [
    {"code": "en", "name": "en", "auto_generated": false}
  ],
  "metadata": {...}
}
Save to file
# Save JSON to file
uv run subxx list URL --json-file metadata.json
uv run subxx subs URL --json-file result.json
# Both stdout and file
uv run subxx subs URL --json --json-file result.json
Use in Scripts
#!/bin/bash

# Get video metadata
metadata=$(uv run subxx list "$VIDEO_URL" --json)
video_title=$(echo "$metadata" | jq -r '.video_title')
echo "Downloading: $video_title"

# Download with JSON output
uv run subxx subs "$VIDEO_URL" --json-file download.json

# Check if successful
if [ "$(jq -r '.status' download.json)" == "success" ]; then
    echo "Success! Downloaded $(jq -r '.files | length' download.json) files"
fi
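The same workflow can be driven from Python by capturing the CLI's JSON output (a sketch that assumes the commands are run via `uv run` as in the examples above; the JSON keys are the documented ones):

import json
import subprocess

video_url = "https://youtu.be/dQw4w9WgXcQ"

# Fetch metadata via `subxx list --json`
listing = subprocess.run(
    ["uv", "run", "subxx", "list", video_url, "--json"],
    check=True, capture_output=True, text=True,
)
metadata = json.loads(listing.stdout)
print("Downloading:", metadata["video_title"])

# Download and write the result summary to a JSON file
subprocess.run(
    ["uv", "run", "subxx", "subs", video_url, "--json-file", "download.json"],
    check=True,
)

with open("download.json", encoding="utf-8") as fh:
    result = json.load(fh)

if result["status"] == "success":
    print(f"Success! Downloaded {len(result['files'])} file(s)")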
Text Extraction
Extract clean, readable text from subtitles by automatically removing timestamps and formatting.
Key behavior: When using text formats (txt/md/pdf), subxx:
- Downloads the subtitle as SRT
- Extracts the text content
- Automatically deletes the SRT file
Plain Text
# Extract to plain text
uv run python __main__.py subs URL --txt
Output: Video_Title.VIDEO_ID.en.txt
Example content:
Hello world.
This is a subtitle.
Welcome to the video.
Markdown
# Extract to Markdown
uv run python __main__.py subs URL --md
# Markdown with timestamp markers every 5 minutes
uv run python __main__.py subs URL --md -t 300
# Markdown with timestamp markers every 30 seconds
uv run python __main__.py subs URL --md -t 30
Output: Video_Title.VIDEO_ID.en.md
Example content (with timestamps):
## [0:00]
Hello world.
This is a subtitle.
## [5:00]
Welcome to the next section.
More content here.
## [10:00]
Final section of the video.
PDF
# Extract to PDF
uv run python __main__.py subs URL --pdf
# PDF with timestamp markers
uv run python __main__.py subs URL --pdf -t 300
Output: Video_Title.VIDEO_ID.en.pdf
Requirements: Install extraction dependencies:
uv sync --extra extract
Timestamp Intervals
Add timestamp markers at regular intervals for long-form content:
# Every 5 minutes (300 seconds)
uv run python __main__.py subs URL --md -t 300
# Every 30 seconds
uv run python __main__.py subs URL --txt -t 30
# Every 10 minutes
uv run python __main__.py subs URL --pdf -t 600
Format: Timestamps appear as ## [0:00], ## [5:00], ## [10:00], etc.
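For reference, the marker text for a given offset can be reproduced in a couple of lines (an approximation of the documented `## [M:SS]` format, not the library's own code):

def timestamp_marker(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    return f"## [{minutes}:{secs:02d}]"

print(timestamp_marker(0), timestamp_marker(300), timestamp_marker(600))
# ## [0:00] ## [5:00] ## [10:00]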
Batch Processing
Download subtitles for multiple URLs from a file:
# Create URLs file (one URL per line)
cat > urls.txt << EOF
https://youtu.be/VIDEO_ID_1
https://youtu.be/VIDEO_ID_2
# This is a comment
https://youtu.be/VIDEO_ID_3
EOF
# Process all URLs
uv run python __main__.py batch urls.txt
# With options
uv run python __main__.py batch urls.txt -l "en,de" -f srt -o ~/subs
Options:
- `-l, --langs` - Language codes (default: en)
- `-f, --fmt` - Output format (default: srt)
- `-o, --output-dir` - Output directory (default: .)
- `--sanitize` - Filename sanitization mode (default: safe)
- `-v, --verbose` - Verbose output
- `-q, --quiet` - Quiet mode
URL File Format (yt-dlp standard):
- One URL per line
- Lines starting with `#` are comments
- Empty lines are ignored
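If you build URL lists programmatically, the format is straightforward to read back; a small sketch (not the library's own parser):

from pathlib import Path

def read_url_file(path: str) -> list[str]:
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        urls.append(line)
    return urls

print(read_url_file("urls.txt"))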
Extract from Files
Extract text from existing subtitle files:
# Extract SRT to plain text
uv run python __main__.py extract video.srt
# Extract to Markdown
uv run python __main__.py extract video.srt -f md
# Extract to PDF
uv run python __main__.py extract video.srt -f pdf
# With timestamp markers every 5 minutes
uv run python __main__.py extract video.srt -f md -t 300
# Specify output file
uv run python __main__.py extract video.srt -o output.txt
# Force overwrite
uv run python __main__.py extract video.srt --force
Supported input formats: SRT, VTT
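The same extraction is available from Python via `extract_text` (a sketch reusing the parameters shown in the Complete Example above; exact defaults may differ):

from subxx import extract_text

result = extract_text(
    subtitle_file="video.srt",
    output_format="md",
    logger=None,
)

if result["status"] == "success":
    for out in result["output_files"]:
        print("Wrote:", out["path"])
else:
    print("Extraction failed:", result["error"])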
Configuration
Config File Locations
Configuration files are loaded in priority order:
1. `./.subxx.toml` (project-specific, current directory)
2. `~/.subxx.toml` (user global, home directory)
Priority Chain
Settings are resolved in this order (highest to lowest):
1. CLI flags (e.g., `--langs en`, `--fmt srt`)
2. Config file (`.subxx.toml`)
3. Hardcoded defaults
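Conceptually the resolution is a chained lookup; an illustrative sketch (not the actual `get_default` implementation):

def resolve(option, cli_value, config, hardcoded):
    if cli_value is not None:           # 1. CLI flag wins
        return cli_value
    defaults = config.get("defaults", {})
    if option in defaults:              # 2. then .subxx.toml [defaults]
        return defaults[option]
    return hardcoded[option]            # 3. finally the built-in default

print(resolve("fmt", None, {"defaults": {"fmt": "md"}}, {"fmt": "srt"}))  # -> md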
Example Configuration
Copy .subxx.toml.example to .subxx.toml or ~/.subxx.toml:
cp .subxx.toml.example ~/.subxx.toml
Example config:
[defaults]
# Language codes (comma-separated or "all")
langs = "en"
# Output format: srt, vtt, txt, md, pdf
fmt = "md"
# Include auto-generated subtitles
auto = true
# Output directory (supports ~)
output_dir = "~/Downloads/subtitles"
# Filename sanitization: safe, nospaces, slugify
sanitize = "safe"
# Timestamp interval (seconds) for txt/md/pdf
timestamps = 300 # 5-minute intervals
[logging]
# Log level: DEBUG, INFO, WARNING, ERROR
level = "INFO"
# Log file (optional)
log_file = "~/.subxx/subxx.log"
Use Case Configurations
Configuration 1: Download SRT files to dedicated directory
[defaults]
langs = "en"
fmt = "srt"
output_dir = "~/Downloads/subtitles"
Configuration 2: Auto-extract to Markdown with timestamps
[defaults]
langs = "en"
fmt = "md"
timestamps = 300
output_dir = "~/Documents/transcripts"
Configuration 3: Multiple languages, plain text
[defaults]
langs = "en,de,fr"
fmt = "txt"
sanitize = "slugify"
output_dir = "./subtitles"
Makefile Shortcuts
Available Targets
# Installation
make install # Core dependencies
make install-all # All dependencies (extract + api + dev)
# Testing
make test # Run all tests
make test-unit # Unit tests only
make test-integration # Integration tests only
make test-coverage # Tests with coverage report
# Usage
make list VIDEO_URL=https://youtu.be/VIDEO_ID
make subs VIDEO_URL=https://youtu.be/VIDEO_ID
make md VIDEO_ID=VIDEO_ID # Quick Markdown extraction
make md VIDEO_ID=VIDEO_ID TIMESTAMPS=300 # With timestamps
# Utilities
make version # Show version
make clean # Clean cache files
make clean-all # Clean everything including .venv
Examples
# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ
# With 5-minute timestamps
make md VIDEO_ID=lHuxDMMkGJ8 TIMESTAMPS=300
# List subtitles
make list VIDEO_URL=https://youtu.be/dQw4w9WgXcQ
# Download with languages
make subs VIDEO_URL=https://youtu.be/dQw4w9WgXcQ LANGS=en,de
HTTP API
Start an HTTP API server for programmatic access (requires API dependencies):
Installation
# Install API dependencies
uv sync --extra api
# Or with Make
make install-api
Start Server
# Start on localhost:8000 (default)
uv run python __main__.py serve
# Custom host/port
uv run python __main__.py serve --host 127.0.0.1 --port 8080
Security Warning: The API has NO authentication and should ONLY run on localhost (127.0.0.1).
API Endpoints
POST /subs
Fetch subtitles and return content directly.
Request:
{
  "url": "https://youtu.be/VIDEO_ID",
  "langs": "en",
  "fmt": "srt",
  "auto": true,
  "sanitize": "safe"
}
Response: Subtitle file content as plain text.
Example:
curl -X POST http://127.0.0.1:8000/subs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://youtu.be/dQw4w9WgXcQ",
    "langs": "en",
    "fmt": "srt"
  }'
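The same request from Python using only the standard library (a sketch against the documented request body; the server must already be running locally):

import json
import urllib.request

payload = {
    "url": "https://youtu.be/dQw4w9WgXcQ",
    "langs": "en",
    "fmt": "srt",
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/subs",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    srt_text = resp.read().decode("utf-8")  # subtitle content as plain text

print(srt_text[:200])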
GET /health
Health check endpoint.
Response:
{
  "status": "ok",
  "service": "subxx"
}
API Documentation
Interactive API docs available at:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
Development
Setup Development Environment
# Clone repository
git clone https://gist.github.com/cprima/subxx
cd subxx
# Install all dependencies (core + extract + api + dev)
uv sync --extra extract --extra api --extra dev
# Or with Make
make install-all
Project Structure
Updated in v0.4.1 - Restructured for Python best practices:
subxx/
├── subxx.py # Core library functions (returns dicts)
├── cli.py # CLI + API implementation (Typer/FastAPI)
├── __main__.py # Minimal entry point (3 lines)
├── test_subxx.py # Test suite (pytest)
├── conftest.py # Pytest configuration
├── pyproject.toml # Project metadata and dependencies
├── Makefile # Build and test automation
├── .subxx.toml.example # Example configuration file
└── !README.md # This file
Key Components
- `subxx.py`: Core library (library-first design)
  - `fetch_subs()` → dict - Download subtitles, return structured data
  - `extract_text()` → dict - Extract text from subtitles, return structured data
  - `load_config()` → dict - Configuration management
  - Helper functions for parsing, sanitization, logging
  - Importable as a Python module
- `cli.py`: CLI + API implementation
  - Typer commands: `list`, `subs`, `batch`, `extract`, `serve`, `version`
  - FastAPI HTTP server
  - JSON output handling (`--json`, `--json-file`)
  - Traditional console output with emojis
- `__main__.py`: Minimal entry point (Python best practice)
  - 3 lines: import and run CLI
  - Enables `python -m subxx` usage
Testing
Run Tests
# All tests
make test
# Unit tests only (fast, no network)
make test-unit
# Integration tests only
make test-integration
# With coverage report
make test-coverage
# Verbose output
make test-verbose
Test Categories
- Unit tests (`@pytest.mark.unit`): No external dependencies, mocked I/O
- Integration tests (`@pytest.mark.integration`): May use files/network
- E2E tests (`@pytest.mark.e2e`): Real YouTube API, requires internet
- Slow tests (`@pytest.mark.slow`): Network I/O, real downloads
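A new test typically carries one of these markers; a hypothetical example (the `parse_langs` helper name is illustrative, not a confirmed function):

import pytest

from subxx import parse_langs  # hypothetical helper; adjust to the real function name

@pytest.mark.unit
def test_parse_langs_splits_commas():
    assert parse_langs("en,de,fr") == ["en", "de", "fr"]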
Running Specific Test Categories
# Run all tests except e2e (fast, for CI)
pytest -m "not e2e"
# Run only e2e tests (slow, requires internet)
pytest -m e2e
# Run unit tests only
pytest -m unit
Test Coverage
Current coverage: ~50 tests (unit, integration, and e2e)
Key areas tested:
- Configuration loading and defaults
- Language parsing
- Filename sanitization
- Text extraction (txt/md/pdf)
- Timestamp markers
- CLI commands
- Overwrite protection
- Real YouTube subtitle download (e2e)
Exit Codes
- `0` - Success
- `1` - User cancelled
- `2` - No subtitles available
- `3` - Network error
- `4` - Invalid URL
- `5` - Configuration error
- `6` - File error
Troubleshooting
Missing Dependencies for Text Extraction
Error:
❌ Error: Missing dependencies for text extraction
Solution:
uv sync --extra extract
Missing Dependencies for API
Error:
❌ Error: API dependencies not installed
Solution:
uv sync --extra api
Windows Console Encoding Issues
If you see encoding errors on Windows, the tool automatically attempts to reconfigure stdout/stderr to UTF-8. If issues persist, use:
# Set console to UTF-8
chcp 65001
yt-dlp Network Errors
If downloads fail with network errors:
- Update yt-dlp: `uv sync --upgrade`
- Check firewall/proxy settings
- Try `--verbose` for debug output: `uv run python __main__.py subs URL --verbose`
Roadmap
Completed (v0.4.x)
- JSON output support (`--json`, `--json-file`)
- Published package on test.pypi.org
- Pythonic project structure (cli.py, minimal __main__.py)
Future Enhancements
- Publish to PyPI (production)
- Progress bars for downloads
- Retry logic for network failures
- Subtitle merging/combining
- Translation support
- Docker container
- GitHub Actions CI/CD
- SRT/VTT format conversion
- Subtitle editing/manipulation
- Batch command JSON support
- Extract command JSON support
Contributing
Contributions welcome! This is an alpha project under active development.
How to Contribute
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass: `make test`
- Submit a pull request
Guidelines
- Follow existing code style
- Add docstrings for new functions
- Update tests for changes
- Update README for new features
- Keep commits focused and atomic
License
This project is licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International).
You are free to:
- Share - Copy and redistribute the material
- Adapt - Remix, transform, and build upon the material
Under the following terms:
- Attribution - You must give appropriate credit
See LICENSE for full details.
Credits
- Built with yt-dlp for video subtitle extraction
- CLI powered by Typer
- API built with FastAPI
- Text extraction using srt and fpdf2
Author
Christian Prior-Mamulyan
- Email: cprior@gmail.com
- GitHub: @cprima
Support
- Report issues: GitHub Issues
- Documentation: GitHub Gist
subxx - Simple, powerful YouTube transcript / subtitle fetching for Python.