A modular podcast episode downloader with RSS feed parsing and progress tracking

Podcast Tracker

A modular Python package for downloading podcast episodes from RSS feeds and transcribing them using AI. Features progress tracking, metadata management, duplicate detection, and WhisperX-powered transcription with speaker diarization.

Python Version Requirements

This package requires Python 3.10, 3.11, or 3.12. Python 3.13+ is not supported due to dependency limitations with the WhisperX library.
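
A script that depends on this package could check the interpreter version up front rather than failing later during dependency import. The following is a minimal sketch of such a guard; the helper name `is_supported` and the constants are illustrative and not part of the package.

```python
import sys

# Supported range taken from the requirement above: 3.10 <= version < 3.13.
MIN_VERSION = (3, 10)
MAX_EXCLUSIVE = (3, 13)


def is_supported(version: tuple[int, int]) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    return MIN_VERSION <= version < MAX_EXCLUSIVE


current = sys.version_info[:2]
status = "supported" if is_supported(current) else "not supported"
print(f"Python {current[0]}.{current[1]} is {status}")
```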

Features

  • RSS Feed Parsing: Download and parse podcast RSS feeds
  • Episode Management: Track downloaded episodes with JSONL metadata
  • Progress Tracking: Visual progress bars for downloads
  • AI Transcription: WhisperX-powered transcription with speaker diarization
  • Duplicate Detection: Automatically skip already downloaded episodes
  • Type Safety: Comprehensive type hints throughout

Installation

Standard Installation

git clone https://github.com/falahat/podcast.git
cd podcast
pip install -e .

Installation with Transcription Support

git clone https://github.com/falahat/podcast.git
cd podcast
# Quoting keeps shells such as zsh from interpreting the brackets
pip install -e ".[transcribe]"

Development Installation

git clone https://github.com/falahat/podcast.git
cd podcast
pip install -e ".[dev,notebook,transcribe]"

Quick Start

Command Line Interface

# Download episodes from an RSS feed
podcast_downloader "https://example.com/podcast/rss.xml"

# Specify custom data directory  
podcast_downloader "https://example.com/podcast/rss.xml" --data-dir ./my_podcasts

# List episodes without downloading
podcast_downloader "https://example.com/podcast/rss.xml" --list-only

# Disable progress bars
podcast_downloader "https://example.com/podcast/rss.xml" --no-progress
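
For readers who want to wrap or test these options, the flags shown above map naturally onto `argparse`. This is only a sketch of how such a command line could be declared, not the package's actual CLI code; the `data` default and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(prog="podcast_downloader")
    parser.add_argument("rss_url", help="URL of the podcast RSS feed")
    # The "data" default is an assumption made for this sketch.
    parser.add_argument("--data-dir", default="data",
                        help="directory where podcast data is stored")
    parser.add_argument("--list-only", action="store_true",
                        help="list episodes without downloading")
    parser.add_argument("--no-progress", action="store_true",
                        help="disable progress bars")
    return parser


args = build_parser().parse_args(
    ["https://example.com/podcast/rss.xml", "--data-dir", "./my_podcasts", "--list-only"]
)
print(args.data_dir, args.list_only, args.no_progress)
```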

Python API

from easy_podcast.manager import PodcastManager

# Create manager from RSS URL (downloads and parses automatically)
manager = PodcastManager.from_rss_url("https://example.com/podcast/rss.xml")

if manager:
    podcast = manager.get_podcast()
    print(f"Podcast: {podcast.title}")
    
    # Get new episodes to download
    new_episodes = manager.get_new_episodes()
    print(f"Found {len(new_episodes)} new episodes")
    
    # Download episodes with progress tracking
    successful, skipped, failed = manager.download_episodes(new_episodes)
    print(f"Downloaded: {successful}, Skipped: {skipped}, Failed: {failed}")

Working with Existing Podcast Data

from easy_podcast.manager import PodcastManager

# Load manager from existing podcast folder
manager = PodcastManager.from_podcast_folder("data/My Podcast/")

if manager:
    # Continue downloading new episodes
    new_episodes = manager.get_new_episodes()
    manager.download_episodes(new_episodes)
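
The idea behind picking only new episodes can be illustrated without the package: compare the IDs found in the feed against the IDs already recorded. The sketch below is a standalone illustration of that technique; `FeedEpisode` and `select_new` are hypothetical names and do not reflect how `get_new_episodes()` is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedEpisode:
    """Hypothetical stand-in for an episode parsed from a feed."""
    id: str
    title: str


def select_new(feed: list[FeedEpisode], known_ids: set[str]) -> list[FeedEpisode]:
    """Keep feed order, dropping episodes whose id is already recorded."""
    seen = set(known_ids)
    fresh = []
    for episode in feed:
        if episode.id not in seen:
            seen.add(episode.id)  # also guards against repeats within the feed
            fresh.append(episode)
    return fresh


feed = [
    FeedEpisode("101", "Pilot"),
    FeedEpisode("102", "Follow-up"),
    FeedEpisode("102", "Follow-up"),
]
new = select_new(feed, known_ids={"101"})
print([episode.id for episode in new])
```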

Audio Transcription

The package includes AI-powered audio transcription using WhisperX with GPU acceleration and speaker diarization. Transcription is an optional extra and is not part of the standard installation.

Installation: To use transcription features, install with the [transcribe] option:

pip install -e ".[transcribe]"

Prerequisites for Transcription

  1. NVIDIA GPU with CUDA support
  2. Hugging Face Token (for speaker diarization models)
  3. PyTorch with GPU support

Note: PyTorch is installed automatically as part of the easy-whisperx dependency; no manual installation is required.

Setting up Transcription Environment

  1. Get a Hugging Face Token:

    • Go to Hugging Face Settings
    • Create a token with "read" permissions
    • Accept user agreements for segmentation and diarization models
  2. Set Environment Variable:

    # Windows PowerShell
    $env:HF_TOKEN="your_token_here"
    
    # Linux/macOS
    export HF_TOKEN="your_token_here"
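
Before starting a long transcription run, it can help to confirm from Python that the variable is actually visible to the process. A small sketch, assuming only the `HF_TOKEN` name given above; the helper `get_hf_token` is illustrative and not part of either package.

```python
import os


def get_hf_token(env=None) -> str:
    """Return the HF_TOKEN value, or raise a clear error if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; speaker diarization models need it.")
    return token


# Demonstrated with an explicit mapping; call get_hf_token() to read the real environment.
token = get_hf_token({"HF_TOKEN": "your_token_here"})
print(bool(token))
```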
    

Using Transcription in Python

from easy_whisperx.transcriber import Transcriber

# Initialize transcriber
transcriber = Transcriber(
    model_size="base",
    device="cuda",  # or "cpu"
    compute_type="float16",  # half precision generally needs a GPU; "int8" is the usual choice on CPU
    batch_size=16
)

# Transcribe audio file
with transcriber:
    result = transcriber("path/to/audio.mp3")
    print(result["text"])

Data Storage Structure

Podcast data is organized in a clear directory structure:

data/
└── [Sanitized Podcast Name]/
    ├── episodes.jsonl      # Episode metadata (one JSON object per line)
    ├── rss.xml            # Cached RSS feed
    └── downloads/         # Downloaded audio files
        ├── episode1.mp3
        ├── episode2.mp3
        └── ...
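
Because `episodes.jsonl` holds one JSON object per line, it can be appended to and read back with the standard library alone. The sketch below shows that round trip in a temporary directory; the field names `id` and `audio_filename` are assumptions for illustration, not the package's documented schema.

```python
import json
import tempfile
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path: Path) -> list[dict]:
    """Read every non-empty line back as a dict; a missing file yields []."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    jsonl = Path(tmp) / "episodes.jsonl"
    append_record(jsonl, {"id": "727175", "audio_filename": "727175.mp3"})
    append_record(jsonl, {"id": "727176", "audio_filename": "727176.mp3"})
    records = read_records(jsonl)

print([record["id"] for record in records])
```

Appending a line per episode means existing entries never need to be rewritten when new episodes arrive.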

Important: Episode objects store filenames only (e.g., "727175.mp3"), not full paths. Use manager.get_episode_audio_path(episode) to get complete file paths.
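
To make the filename-versus-path distinction concrete, here is a rough sketch of the kind of join such a helper might perform, based on the directory layout shown above. The function `audio_path` is hypothetical and is not the package's implementation of `get_episode_audio_path`.

```python
from pathlib import Path


def audio_path(podcast_dir: Path, filename: str) -> Path:
    """Join a bare filename onto <podcast_dir>/downloads, rejecting path components."""
    if not filename or Path(filename).name != filename:
        raise ValueError(f"expected a bare filename, got {filename!r}")
    return podcast_dir / "downloads" / filename


path = audio_path(Path("data") / "My Podcast", "727175.mp3")
print(path.as_posix())
```

Storing only the filename keeps the metadata valid if the data directory is moved or renamed.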

Development

Setting up Development Environment

git clone https://github.com/falahat/podcast.git
cd podcast

# Create virtual environment (note the .venv name)
python -m venv .venv

# Activate virtual environment
# Windows PowerShell:
.\.venv\Scripts\Activate.ps1
# Linux/macOS:
source .venv/bin/activate

# Install in development mode
pip install -e ".[dev,notebook]"

Running Tests

# Run all tests
pytest

# Run with coverage report
pytest --cov=easy_podcast --cov-report=html

# Run specific test file
pytest tests/test_manager.py -v

Code Quality Tools

The project uses:

  • Black for code formatting
  • mypy for type checking
  • flake8 for linting
  • pytest for testing

# Format code
black src/ tests/

# Type checking
mypy src/easy_podcast/

# Linting
flake8 src/easy_podcast/

Core Components

The package is built with a modular architecture:

  • PodcastManager - Main orchestrator for the complete workflow
  • Episode/Podcast - Data models with computed properties
  • EpisodeTracker - JSONL-based metadata persistence
  • PodcastParser - RSS feed parsing with custom episode ID extraction
  • PodcastDownloader - HTTP downloads with progress tracking
  • Transcription - WhisperX-based transcription module
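
As an illustration of the parsing step, the sketch below extracts the channel title and audio enclosures from a small inline RSS 2.0 document using only the standard library. It is not the package's PodcastParser; the sample feed, the `parse_feed` name, and the returned dictionary keys are all assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <guid>ep-1</guid>
      <enclosure url="https://example.com/audio/ep1.mp3" length="123" type="audio/mpeg"/>
    </item>
    <item>
      <title>Text-only post</title>
      <guid>post-2</guid>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text: str) -> tuple[str, list[dict]]:
    """Return the channel title and one dict per item that carries an audio enclosure."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: missing <channel>")
    title = channel.findtext("title", default="")
    episodes = []
    for item in channel.findall("item"):
        enclosure = item.find("enclosure")
        if enclosure is None or not enclosure.get("url"):
            continue  # items without an audio enclosure are skipped
        episodes.append({
            "guid": item.findtext("guid", default=""),
            "title": item.findtext("title", default=""),
            "audio_url": enclosure.get("url"),
        })
    return title, episodes


title, episodes = parse_feed(SAMPLE_FEED)
print(title, [episode["guid"] for episode in episodes])
```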

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Ensure all tests pass (pytest)
  5. Check code quality (black src/ tests/, mypy src/easy_podcast/, and flake8 src/easy_podcast/)
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

easy_podcast-0.0.1.tar.gz (40.8 kB)

Uploaded Source

Built Distribution

easy_podcast-0.0.1-py3-none-any.whl (16.2 kB)

Uploaded Python 3

File details

Details for the file easy_podcast-0.0.1.tar.gz.

File metadata

  • Download URL: easy_podcast-0.0.1.tar.gz
  • Upload date:
  • Size: 40.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for easy_podcast-0.0.1.tar.gz
Algorithm Hash digest
SHA256 69ea5db9174d8d6dc8387f19036e59f7a54a6a9450c5426ebc9e8ebd6b2b5768
MD5 54e574ac8c93a6ddbd949b94111bc3d6
BLAKE2b-256 adc0d48a94dd908fde3235c2285068ad8f126c719565dae181fffd2dfd0b411e
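
A downloaded archive can be checked against a published SHA256 digest with the standard library. The sketch below streams a file through `hashlib` and compares the result; the helper names are illustrative, and the demo runs on a throwaway file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare case-insensitively against a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a temporary file; substitute the downloaded archive and its published digest.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    ok = matches(sample, expected)

print(ok)
```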

Provenance

The following attestation bundles were made for easy_podcast-0.0.1.tar.gz:

Publisher: python-publish.yml on falahat/easy-podcast

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file easy_podcast-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: easy_podcast-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for easy_podcast-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 572cde6220a3297c9da8592c7931848191dcab9b686590edbe1ca002a402dd15
MD5 51cf74d93f80307e91392bf26ba2ea0a
BLAKE2b-256 7067a2ded323eed765473b0cae3224e5d083ea1bef586acc0e8b474e8367d963

Provenance

The following attestation bundles were made for easy_podcast-0.0.1-py3-none-any.whl:

Publisher: python-publish.yml on falahat/easy-podcast

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
