subxx
YouTube transcript / subtitle fetching toolkit for Python - Download, extract, and process subtitles from video URLs with a simple CLI or HTTP API.
Features
- Download YouTube subtitles from videos and channels (powered by yt-dlp)
- Multiple output formats: SRT, VTT, TXT, Markdown, PDF
- JSON output: Machine-readable output with `--json` and `--json-file` flags
- Importable module: Use as a Python library with dict-based return values
- Text extraction with automatic subtitle cleanup and optional timestamp markers
- Language selection: Download specific languages or all available subtitles
- Batch processing: Process multiple URLs from a file
- Configuration files: Project and global settings via TOML
- HTTP API: Optional FastAPI server for programmatic access
- Dry-run mode: Preview operations without downloading
- Filename sanitization: safe, nospaces, or slugify modes
Table of Contents
- Installation
- Quick Start
- Module Usage (Python Library)
- Usage
- Configuration
- Makefile Shortcuts
- HTTP API
- Development
- Testing
- License
Installation
Requirements
- Python 3.9 or higher
- uv package manager (recommended)
Install with uv (recommended)
# Clone or download the project
git clone https://gist.github.com/cprima/subxx
cd subxx
# Install core dependencies
uv sync
# Install with optional features
uv sync --extra extract # Text extraction (txt/md/pdf)
uv sync --extra api # HTTP API server
uv sync --extra dev # Development tools (pytest)
# Install all features
uv sync --extra extract --extra api --extra dev
Using Make (Windows)
make install # Core dependencies
make install-all # All dependencies (extract + api + dev)
Quick Start
Basic Usage
# List available subtitles
uv run subxx list https://youtu.be/VIDEO_ID
# Download English subtitle (SRT format, default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Extract to plain text
uv run subxx subs https://youtu.be/VIDEO_ID --txt
# Extract to Markdown with 5-minute timestamps
uv run subxx subs https://youtu.be/VIDEO_ID --md -t 300
# Extract to PDF
uv run subxx subs https://youtu.be/VIDEO_ID --pdf
# Get JSON output for automation
uv run subxx list https://youtu.be/VIDEO_ID --json
uv run subxx subs https://youtu.be/VIDEO_ID --json-file output.json
With Makefile
# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ
# With timestamps
make md VIDEO_ID=dQw4w9WgXcQ TIMESTAMPS=300
Module Usage (Python Library)
New in v0.4.0+: subxx can be imported and used as a Python library. Core functions now return structured data (dicts) instead of exit codes.
Installation
# From test.pypi
pip install -i https://test.pypi.org/simple/ subxx==0.4.1
# Or with uv
uv add subxx==0.4.1 --index https://test.pypi.org/simple/
Basic Example
from subxx import fetch_subs, extract_text

# Download subtitles
result = fetch_subs(
    url="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    langs="en",
    fmt="srt",
    output_dir="./subs",
    logger=None,  # Silent mode
)

if result["status"] == "success":
    print(f"Downloaded: {result['video_title']}")
    for file_info in result["files"]:
        print(f"  {file_info['language']}: {file_info['path']}")
else:
    print(f"Error: {result['error']}")
Return Structure
Functions return comprehensive dicts with all data:
{
  "status": "success" | "error" | "skipped",
  "video_id": "dQw4w9WgXcQ",
  "video_title": "Rick Astley - Never Gonna Give You Up...",
  "files": [
    {
      "path": "/path/to/video.en.srt",
      "language": "en",
      "format": "srt",
      "auto_generated": false
    }
  ],
  "metadata": {...},
  "available_languages": [...],
  "download_info": {...},
  "error": null
}
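For orientation, a caller might branch on the three documented status values like this (a minimal sketch based on the structure above; how a skipped result is populated is an assumption):

from subxx import fetch_subs

result = fetch_subs(
    url="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    langs="en",
    fmt="srt",
    logger=None,
)

if result["status"] == "success":
    for file_info in result["files"]:
        print(f"{file_info['language']} -> {file_info['path']}")
elif result["status"] == "skipped":
    print(f"Skipped {result['video_id']} (nothing downloaded)")
else:  # "error"
    print(f"Failed: {result['error']}")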
Complete Example
from subxx import fetch_subs, extract_text

# 1. Download subtitle
result = fetch_subs(
    url="https://youtube.com/watch?v=...",
    langs="en",
    fmt="srt",
    auto=True,
    output_dir="./transcripts",
    logger=None,
)

if result["status"] != "success":
    print(f"Error: {result['error']}")
    exit(1)

# 2. Extract to Markdown
subtitle_file = result["files"][0]["path"]
extract_result = extract_text(
    subtitle_file=subtitle_file,
    output_format="md",
    use_chapters=True,
    logger=None,
)

if extract_result["status"] == "success":
    print(f"Extracted to: {extract_result['output_files'][0]['path']}")
    print(f"Paragraphs: {len(extract_result['extracted_data']['paragraphs'])}")
Available Functions
from subxx import (
    fetch_subs,     # Download subtitles → dict
    extract_text,   # Extract text from SRT/VTT → dict
    load_config,    # Load .subxx.toml config → dict
    get_default,    # Get config default value
    setup_logging,  # Configure logging
)
Migration from CLI to Module
v0.3.x (not supported as module):
- Functions returned exit codes (int)
- CLI-focused design
v0.4.x (library-first):
- Functions return dicts with comprehensive data
- Optional `logger` parameter (None = silent)
- Clean separation: core functions vs CLI wrapper
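In practice, migrating mostly means swapping exit-code checks for dict checks. A minimal sketch of the new call style (the old behaviour is only paraphrased in the comments):

from subxx import fetch_subs

# v0.3.x (conceptual): functions printed to the console and returned an int exit code.
# v0.4.x: functions return a dict and stay silent when logger=None is passed.
result = fetch_subs(url="https://youtu.be/VIDEO_ID", langs="en", fmt="srt", logger=None)

succeeded = result["status"] == "success"  # replaces the old `exit_code == 0` check
if not succeeded:
    print(result["error"])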
Usage
List Available Subtitles
Preview available subtitle languages without downloading:
# Traditional output
uv run subxx list https://youtu.be/VIDEO_ID
# JSON output
uv run subxx list https://youtu.be/VIDEO_ID --json
# Save to file
uv run subxx list https://youtu.be/VIDEO_ID --json-file metadata.json
Output:
📹 Video: Example Video Title
🕒 Duration: 12:34
✅ Manual subtitles:
- en
- es
🤖 Auto-generated subtitles:
- en, de, fr, ja, ko, pt, ru, zh-Hans, ...
Options:
- `-v, --verbose` - Debug output
- `-q, --quiet` - Errors only
Download Subtitles
Format Selection
Download subtitle files in SRT or VTT format:
# Download SRT (default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Download VTT
uv run subxx subs https://youtu.be/VIDEO_ID --vtt
# Using --fmt flag
uv run subxx subs https://youtu.be/VIDEO_ID -f srt
Behavior: Subtitle files (SRT/VTT) are downloaded and kept on disk.
Language Selection
# Download English (default)
uv run subxx subs https://youtu.be/VIDEO_ID
# Download specific language
uv run subxx subs https://youtu.be/VIDEO_ID -l de
# Download multiple languages
uv run subxx subs https://youtu.be/VIDEO_ID -l "en,de,fr"
# Download all available languages
uv run subxx subs https://youtu.be/VIDEO_ID -l all
Output Directory
# Save to specific directory
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o ~/Downloads/subs
# Use current directory (default)
uv run python __main__.py subs https://youtu.be/VIDEO_ID -o .
Filename Sanitization
# Safe mode: Remove unsafe characters, keep spaces (default)
uv run python __main__.py subs URL --sanitize safe
# No spaces: Replace spaces with underscores
uv run python __main__.py subs URL --sanitize nospaces
# Slugify: Lowercase, hyphens, URL-safe
uv run python __main__.py subs URL --sanitize slugify
Examples:
safe:"My Video Title.srt"→"My Video Title.srt"nospaces:"My Video Title.srt"→"My_Video_Title.srt"slugify:"My Video Title.srt"→"my-video-title.srt"
Overwrite Handling
# Prompt before overwriting (default)
uv run python __main__.py subs URL
# Force overwrite without prompting
uv run python __main__.py subs URL --force
# Skip existing files
uv run python __main__.py subs URL --skip-existing
Auto-Generated Subtitles
# Include auto-generated subtitles (default)
uv run python __main__.py subs URL --auto
# Only manual subtitles
uv run python __main__.py subs URL --no-auto
Dry Run
Preview what would be downloaded without actually downloading:
uv run python __main__.py subs URL --dry-run
Output:
[DRY RUN] Would download subtitle: en
JSON Output
New in v0.4.0: Get machine-readable JSON output for automation and scripting.
Available Commands with JSON Support
- `list` - List available languages
- `subs` - Download subtitles
Output to stdout
# List command with JSON
uv run subxx list "https://youtu.be/dQw4w9WgXcQ" --json
# Subs command with JSON
uv run subxx subs "https://youtu.be/dQw4w9WgXcQ" --json
Example JSON output:
{
  "status": "success",
  "video_id": "dQw4w9WgXcQ",
  "video_title": "Rick Astley - Never Gonna Give You Up...",
  "files": [
    {
      "path": "Rick Astley - Never Gonna Give You Up.dQw4w9WgXcQ.NA.en.srt",
      "language": "en",
      "format": "srt",
      "auto_generated": false
    }
  ],
  "available_languages": [
    {"code": "en", "name": "en", "auto_generated": false}
  ],
  "metadata": {...}
}
Save to file
# Save JSON to file
uv run subxx list URL --json-file metadata.json
uv run subxx subs URL --json-file result.json
# Both stdout and file
uv run subxx subs URL --json --json-file result.json
Use in Scripts
#!/bin/bash

# Get video metadata
metadata=$(uv run subxx list "$VIDEO_URL" --json)
video_title=$(echo "$metadata" | jq -r '.video_title')
echo "Downloading: $video_title"

# Download with JSON output
uv run subxx subs "$VIDEO_URL" --json-file download.json

# Check if successful
if [ "$(jq -r '.status' download.json)" == "success" ]; then
    echo "Success! Downloaded $(jq -r '.files | length' download.json) files"
fi
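The same workflow can be driven from Python by capturing the CLI's JSON output (a sketch that assumes the commands are run via `uv run` as in the examples above; the JSON keys are the documented ones):

import json
import subprocess

video_url = "https://youtu.be/dQw4w9WgXcQ"

# Fetch metadata via `subxx list --json`
listing = subprocess.run(
    ["uv", "run", "subxx", "list", video_url, "--json"],
    check=True, capture_output=True, text=True,
)
metadata = json.loads(listing.stdout)
print("Downloading:", metadata["video_title"])

# Download and write the result summary to a JSON file
subprocess.run(
    ["uv", "run", "subxx", "subs", video_url, "--json-file", "download.json"],
    check=True,
)

with open("download.json", encoding="utf-8") as fh:
    result = json.load(fh)

if result["status"] == "success":
    print(f"Success! Downloaded {len(result['files'])} file(s)")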
Text Extraction
Extract clean, readable text from subtitles by automatically removing timestamps and formatting.
Key behavior: When using text formats (txt/md/pdf), subxx:
- Downloads the subtitle as SRT
- Extracts the text content
- Automatically deletes the SRT file
Plain Text
# Extract to plain text
uv run python __main__.py subs URL --txt
Output: Video_Title.VIDEO_ID.en.txt
Example content:
Hello world.
This is a subtitle.
Welcome to the video.
Markdown
# Extract to Markdown
uv run python __main__.py subs URL --md
# Markdown with timestamp markers every 5 minutes
uv run python __main__.py subs URL --md -t 300
# Markdown with timestamp markers every 30 seconds
uv run python __main__.py subs URL --md -t 30
Output: Video_Title.VIDEO_ID.en.md
Example content (with timestamps):
## [0:00]
Hello world.
This is a subtitle.
## [5:00]
Welcome to the next section.
More content here.
## [10:00]
Final section of the video.
PDF
# Extract to PDF
uv run python __main__.py subs URL --pdf
# PDF with timestamp markers
uv run python __main__.py subs URL --pdf -t 300
Output: Video_Title.VIDEO_ID.en.pdf
Requirements: Install extraction dependencies:
uv sync --extra extract
Timestamp Intervals
Add timestamp markers at regular intervals for long-form content:
# Every 5 minutes (300 seconds)
uv run python __main__.py subs URL --md -t 300
# Every 30 seconds
uv run python __main__.py subs URL --txt -t 30
# Every 10 minutes
uv run python __main__.py subs URL --pdf -t 600
Format: Timestamps appear as ## [0:00], ## [5:00], ## [10:00], etc.
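For reference, the marker text for a given offset can be reproduced in a couple of lines (an approximation of the documented `## [M:SS]` format, not the library's own code):

def timestamp_marker(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    return f"## [{minutes}:{secs:02d}]"

print(timestamp_marker(0), timestamp_marker(300), timestamp_marker(600))
# ## [0:00] ## [5:00] ## [10:00]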
Batch Processing
Download subtitles for multiple URLs from a file:
# Create URLs file (one URL per line)
cat > urls.txt << EOF
https://youtu.be/VIDEO_ID_1
https://youtu.be/VIDEO_ID_2
# This is a comment
https://youtu.be/VIDEO_ID_3
EOF
# Process all URLs
uv run python __main__.py batch urls.txt
# With options
uv run python __main__.py batch urls.txt -l "en,de" -f srt -o ~/subs
Options:
- `-l, --langs` - Language codes (default: en)
- `-f, --fmt` - Output format (default: srt)
- `-o, --output-dir` - Output directory (default: .)
- `--sanitize` - Filename sanitization mode (default: safe)
- `-v, --verbose` - Verbose output
- `-q, --quiet` - Quiet mode
URL File Format (yt-dlp standard):
- One URL per line
- Lines starting with `#` are comments
- Empty lines are ignored
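If you build URL lists programmatically, the format is straightforward to read back; a small sketch (not the library's own parser):

from pathlib import Path

def read_url_file(path: str) -> list[str]:
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        urls.append(line)
    return urls

print(read_url_file("urls.txt"))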
Extract from Files
Extract text from existing subtitle files:
# Extract SRT to plain text
uv run python __main__.py extract video.srt
# Extract to Markdown
uv run python __main__.py extract video.srt -f md
# Extract to PDF
uv run python __main__.py extract video.srt -f pdf
# With timestamp markers every 5 minutes
uv run python __main__.py extract video.srt -f md -t 300
# Specify output file
uv run python __main__.py extract video.srt -o output.txt
# Force overwrite
uv run python __main__.py extract video.srt --force
Supported input formats: SRT, VTT
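The same extraction is available from Python via `extract_text` (a sketch reusing the parameters shown in the Complete Example above; exact defaults may differ):

from subxx import extract_text

result = extract_text(
    subtitle_file="video.srt",
    output_format="md",
    logger=None,
)

if result["status"] == "success":
    for out in result["output_files"]:
        print("Wrote:", out["path"])
else:
    print("Extraction failed:", result["error"])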
Configuration
Config File Locations
Configuration files are loaded in priority order:
1. `./.subxx.toml` (project-specific, current directory)
2. `~/.subxx.toml` (user global, home directory)
Priority Chain
Settings are resolved in this order (highest to lowest):
1. CLI flags (e.g., `--langs en`, `--fmt srt`)
2. Config file (`.subxx.toml`)
3. Hardcoded defaults
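Conceptually the resolution is a chained lookup; an illustrative sketch (not the actual `get_default` implementation):

def resolve(option, cli_value, config, hardcoded):
    if cli_value is not None:           # 1. CLI flag wins
        return cli_value
    defaults = config.get("defaults", {})
    if option in defaults:              # 2. then .subxx.toml [defaults]
        return defaults[option]
    return hardcoded[option]            # 3. finally the built-in default

print(resolve("fmt", None, {"defaults": {"fmt": "md"}}, {"fmt": "srt"}))  # -> md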
Example Configuration
Copy .subxx.toml.example to .subxx.toml or ~/.subxx.toml:
cp .subxx.toml.example ~/.subxx.toml
Example config:
[defaults]
# Language codes (comma-separated or "all")
langs = "en"
# Output format: srt, vtt, txt, md, pdf
fmt = "md"
# Include auto-generated subtitles
auto = true
# Output directory (supports ~)
output_dir = "~/Downloads/subtitles"
# Filename sanitization: safe, nospaces, slugify
sanitize = "safe"
# Timestamp interval (seconds) for txt/md/pdf
timestamps = 300 # 5-minute intervals
[logging]
# Log level: DEBUG, INFO, WARNING, ERROR
level = "INFO"
# Log file (optional)
log_file = "~/.subxx/subxx.log"
Use Case Configurations
Configuration 1: Download SRT files to dedicated directory
[defaults]
langs = "en"
fmt = "srt"
output_dir = "~/Downloads/subtitles"
Configuration 2: Auto-extract to Markdown with timestamps
[defaults]
langs = "en"
fmt = "md"
timestamps = 300
output_dir = "~/Documents/transcripts"
Configuration 3: Multiple languages, plain text
[defaults]
langs = "en,de,fr"
fmt = "txt"
sanitize = "slugify"
output_dir = "./subtitles"
Makefile Shortcuts
Available Targets
# Installation
make install # Core dependencies
make install-all # All dependencies (extract + api + dev)
# Testing
make test # Run all tests
make test-unit # Unit tests only
make test-integration # Integration tests only
make test-coverage # Tests with coverage report
# Usage
make list VIDEO_URL=https://youtu.be/VIDEO_ID
make subs VIDEO_URL=https://youtu.be/VIDEO_ID
make md VIDEO_ID=VIDEO_ID # Quick Markdown extraction
make md VIDEO_ID=VIDEO_ID TIMESTAMPS=300 # With timestamps
# Utilities
make version # Show version
make clean # Clean cache files
make clean-all # Clean everything including .venv
Examples
# Quick Markdown extraction (just paste video ID)
make md VIDEO_ID=dQw4w9WgXcQ
# With 5-minute timestamps
make md VIDEO_ID=lHuxDMMkGJ8 TIMESTAMPS=300
# List subtitles
make list VIDEO_URL=https://youtu.be/dQw4w9WgXcQ
# Download with languages
make subs VIDEO_URL=https://youtu.be/dQw4w9WgXcQ LANGS=en,de
HTTP API
Start an HTTP API server for programmatic access (requires API dependencies):
Installation
# Install API dependencies
uv sync --extra api
# Or with Make
make install-api
Start Server
# Start on localhost:8000 (default)
uv run python __main__.py serve
# Custom host/port
uv run python __main__.py serve --host 127.0.0.1 --port 8080
Security Warning: The API has NO authentication and should ONLY run on localhost (127.0.0.1).
API Endpoints
POST /subs
Fetch subtitles and return content directly.
Request:
{
  "url": "https://youtu.be/VIDEO_ID",
  "langs": "en",
  "fmt": "srt",
  "auto": true,
  "sanitize": "safe"
}
Response: Subtitle file content as plain text.
Example:
curl -X POST http://127.0.0.1:8000/subs \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://youtu.be/dQw4w9WgXcQ",
    "langs": "en",
    "fmt": "srt"
  }'
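The same request from Python using only the standard library (a sketch against the documented request body; the server must already be running locally):

import json
import urllib.request

payload = {
    "url": "https://youtu.be/dQw4w9WgXcQ",
    "langs": "en",
    "fmt": "srt",
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/subs",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    srt_text = resp.read().decode("utf-8")  # subtitle content as plain text

print(srt_text[:200])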
GET /health
Health check endpoint.
Response:
{
  "status": "ok",
  "service": "subxx"
}
API Documentation
Interactive API docs available at:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
Development
Setup Development Environment
# Clone repository
git clone https://gist.github.com/cprima/subxx
cd subxx
# Install all dependencies (core + extract + api + dev)
uv sync --extra extract --extra api --extra dev
# Or with Make
make install-all
Project Structure
Updated in v0.4.1 - Restructured for Python best practices:
subxx/
├── subxx.py # Core library functions (returns dicts)
├── cli.py # CLI + API implementation (Typer/FastAPI)
├── __main__.py # Minimal entry point (3 lines)
├── test_subxx.py # Test suite (pytest)
├── conftest.py # Pytest configuration
├── pyproject.toml # Project metadata and dependencies
├── Makefile # Build and test automation
├── .subxx.toml.example # Example configuration file
└── !README.md # This file
Key Components
- `subxx.py`: Core library (library-first design)
  - `fetch_subs()` → dict - Download subtitles, return structured data
  - `extract_text()` → dict - Extract text from subtitles, return structured data
  - `load_config()` → dict - Configuration management
  - Helper functions for parsing, sanitization, logging
  - Importable as a Python module
- `cli.py`: CLI + API implementation
  - Typer commands: `list`, `subs`, `batch`, `extract`, `serve`, `version`
  - FastAPI HTTP server
  - JSON output handling (`--json`, `--json-file`)
  - Traditional console output with emojis
- `__main__.py`: Minimal entry point (Python best practice)
  - 3 lines: import and run CLI
  - Enables `python -m subxx` usage
Testing
Run Tests
# All tests
make test
# Unit tests only (fast, no network)
make test-unit
# Integration tests only
make test-integration
# With coverage report
make test-coverage
# Verbose output
make test-verbose
Test Categories
- Unit tests (`@pytest.mark.unit`): No external dependencies, mocked I/O
- Integration tests (`@pytest.mark.integration`): May use files/network
- E2E tests (`@pytest.mark.e2e`): Real YouTube API, requires internet
- Slow tests (`@pytest.mark.slow`): Network I/O, real downloads
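A new test typically carries one of these markers; a hypothetical example (the `parse_langs` helper name is illustrative, not a confirmed function):

import pytest

from subxx import parse_langs  # hypothetical helper; adjust to the real function name

@pytest.mark.unit
def test_parse_langs_splits_commas():
    assert parse_langs("en,de,fr") == ["en", "de", "fr"]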
Running Specific Test Categories
# Run all tests except e2e (fast, for CI)
pytest -m "not e2e"
# Run only e2e tests (slow, requires internet)
pytest -m e2e
# Run unit tests only
pytest -m unit
Test Coverage
Current coverage: ~50 tests (unit, integration, and e2e)
Key areas tested:
- Configuration loading and defaults
- Language parsing
- Filename sanitization
- Text extraction (txt/md/pdf)
- Timestamp markers
- CLI commands
- Overwrite protection
- Real YouTube subtitle download (e2e)
Exit Codes
- `0` - Success
- `1` - User cancelled
- `2` - No subtitles available
- `3` - Network error
- `4` - Invalid URL
- `5` - Configuration error
- `6` - File error
Troubleshooting
Missing Dependencies for Text Extraction
Error:
❌ Error: Missing dependencies for text extraction
Solution:
uv sync --extra extract
Missing Dependencies for API
Error:
❌ Error: API dependencies not installed
Solution:
uv sync --extra api
Windows Console Encoding Issues
If you see encoding errors on Windows, the tool automatically attempts to reconfigure stdout/stderr to UTF-8. If issues persist, use:
# Set console to UTF-8
chcp 65001
yt-dlp Network Errors
If downloads fail with network errors:
- Update yt-dlp: `uv sync --upgrade`
- Check firewall/proxy settings
- Try `--verbose` for debug output: `uv run python __main__.py subs URL --verbose`
Roadmap
Completed (v0.4.x)
- JSON output support (`--json`, `--json-file`)
- Published package on test.pypi.org
- Pythonic project structure (cli.py, minimal __main__.py)
Future Enhancements
- Publish to PyPI (production)
- Progress bars for downloads
- Retry logic for network failures
- Subtitle merging/combining
- Translation support
- Docker container
- GitHub Actions CI/CD
- SRT/VTT format conversion
- Subtitle editing/manipulation
- Batch command JSON support
- Extract command JSON support
Contributing
Contributions welcome! This is an alpha project under active development.
How to Contribute
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass: `make test`
- Submit a pull request
Guidelines
- Follow existing code style
- Add docstrings for new functions
- Update tests for changes
- Update README for new features
- Keep commits focused and atomic
License
This project is licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International).
You are free to:
- Share - Copy and redistribute the material
- Adapt - Remix, transform, and build upon the material
Under the following terms:
- Attribution - You must give appropriate credit
See LICENSE for full details.
Credits
- Built with yt-dlp for video subtitle extraction
- CLI powered by Typer
- API built with FastAPI
- Text extraction using srt and fpdf2
Author
Christian Prior-Mamulyan
- Email: cprior@gmail.com
- GitHub: @cprima
Support
- Report issues: GitHub Issues
- Documentation: GitHub Gist
subxx - Simple, powerful YouTube transcript / subtitle fetching for Python.