SPORC: Structured Podcast Open Research Corpus

A Python package for working with the SPORC (Structured Podcast Open Research Corpus) dataset from Hugging Face.

Overview

SPORC is a large multimodal dataset for the study of the podcast ecosystem. This package provides easy-to-use Python classes and functions to interact with the dataset, including:

  • Podcast class: Collection of episodes and metadata about a podcast
  • Episode class: Single episode with information about its contents
  • Turn class: Individual conversation turns with speaker information
  • Search functionality for podcasts and episodes
  • Conversation turn analysis and filtering
  • Sliding windows for processing large episodes in manageable chunks
  • Streaming support for memory-efficient processing of large datasets
  • Selective loading for filtering and loading specific podcast subsets into memory
  • Lazy loading for efficient turn data access
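
Lazy loading in the sense listed above generally means deferring work until a value is first read. A minimal sketch of that idea follows, using only the standard library. LazyEpisodeSketch and load_turns are hypothetical names for illustration and are not part of the sporc API, whose internals are not described here.

```python
from functools import cached_property


class LazyEpisodeSketch:
    """Hypothetical episode whose turns are loaded on first access only."""

    def __init__(self, title, turn_loader):
        self.title = title
        self._turn_loader = turn_loader  # callable that returns the list of turns

    @cached_property
    def turns(self):
        # Runs once per instance; later reads reuse the cached list.
        return self._turn_loader()


load_calls = []


def load_turns():
    """Hypothetical loader; records each call so the caching is observable."""
    load_calls.append(1)
    return ["turn 1", "turn 2", "turn 3"]


lazy_episode = LazyEpisodeSketch("Example Episode", load_turns)
calls_before_access = len(load_calls)  # nothing has been loaded yet
first = lazy_episode.turns             # triggers the single real load
second = lazy_episode.turns            # served from the cache
print(calls_before_access, len(load_calls))
```

Creating the object is cheap, and the cost of loading turns is paid only by code that actually reads them.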

Installation

Prerequisites

Before installing this package, you need to:

  1. Accept the SPORC dataset terms on the dataset's Hugging Face page.

  2. Set up Hugging Face credentials on your local machine:

    pip install huggingface_hub
    huggingface-cli login
    

Install the Package

pip install sporc

Or install from source:

git clone https://github.com/yourusername/sporc.git
cd sporc
pip install -e .

Quick Start

from sporc import SPORCDataset

# Initialize the dataset
sporc = SPORCDataset()

# Search for a specific podcast
podcast = sporc.search_podcast("SingOut SpeakOut")

# Get all episodes for this podcast
for episode in podcast.episodes:
    print(f"Episode: {episode.title}")
    print(f"Duration: {episode.duration_seconds} seconds")
    print(f"Hosts: {episode.host_names}")

# Search for episodes with specific criteria
episodes = sporc.search_episodes(
    min_duration=300,  # At least 5 minutes
    max_speakers=3,    # Maximum 3 speakers
    host_name="Simon Shapiro"
)

# Get conversation turns for a specific episode
episode = episodes[0]
turns = episode.get_turns_by_time_range(0, 180)  # First 3 minutes
for turn in turns:
    print(f"Speaker: {turn.speaker}")
    print(f"Text: {turn.text[:100]}...")
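
The time-range lookup at the end of the Quick Start can be pictured as an interval-overlap filter. The sketch below is a self-contained illustration, not the package's implementation: SketchTurn and turns_in_time_range are hypothetical names, and it assumes each turn carries start and end times in seconds and that a turn counts if it overlaps the range at all. How get_turns_by_time_range treats partially overlapping turns is not stated in this README.

```python
from dataclasses import dataclass


@dataclass
class SketchTurn:
    """Hypothetical stand-in for a conversation turn (not sporc's Turn class)."""
    speaker: str
    text: str
    start_time: float  # seconds from the start of the episode
    end_time: float    # seconds from the start of the episode


def turns_in_time_range(turns, start, end):
    """Return turns whose [start_time, end_time) interval overlaps [start, end)."""
    return [t for t in turns if t.start_time < end and t.end_time > start]


# Made-up sample data for illustration only.
sample_turns = [
    SketchTurn("SPEAKER_00", "Welcome to the show.", 0.0, 12.5),
    SketchTurn("SPEAKER_01", "Thanks for having me.", 12.5, 30.0),
    SketchTurn("SPEAKER_00", "Let's get started.", 170.0, 200.0),
    SketchTurn("SPEAKER_01", "Sounds good.", 200.0, 215.0),
]

first_three_minutes = turns_in_time_range(sample_turns, 0, 180)
for turn in first_three_minutes:
    print(f"{turn.speaker}: {turn.text}")
```

The third sample turn starts before the 180-second mark and ends after it, so this overlap rule keeps it. A stricter rule would require the whole turn to fall inside the range.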

Core Classes

SPORCDataset

The main class for interacting with the SPORC dataset.

from sporc import SPORCDataset

# Memory mode (default)
sporc = SPORCDataset()

# Streaming mode for memory efficiency
sporc = SPORCDataset(streaming=True)

# Selective mode: start in streaming mode, then load a chosen subset into memory
sporc = SPORCDataset(streaming=True)
sporc.load_podcast_subset(categories=['education'])

Podcast

Represents a podcast with its episodes and metadata.

podcast = sporc.search_podcast("Example Podcast")
print(f"Title: {podcast.title}")
print(f"Category: {podcast.category}")
print(f"Number of episodes: {len(podcast.episodes)}")

Episode

Represents a single podcast episode.

episode = podcast.episodes[0]
print(f"Title: {episode.title}")
print(f"Duration: {episode.duration_seconds} seconds")
print(f"Hosts: {episode.host_names}")

Turn

Represents a single conversation turn in an episode.

turn = episode.get_all_turns()[0]
print(f"Speaker: {turn.speaker}")
print(f"Text: {turn.text}")
print(f"Duration: {turn.duration} seconds")

Key Features

Memory Modes

The package supports three modes for different use cases:

  • Memory Mode: Fast access, high memory usage (default)
  • Streaming Mode: Memory efficient, slower access
  • Selective Mode: Best of both worlds - load specific subsets into memory
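
The trade-off between the three modes can be illustrated with plain Python iterables. This is a conceptual sketch only: iter_records is a hypothetical record source with made-up data, and nothing here reflects how the package reads the dataset internally.

```python
def iter_records():
    """Hypothetical record source; stands in for reading the dataset row by row."""
    for i in range(6):
        category = "education" if i % 2 == 0 else "comedy"
        yield {"podcast": f"Podcast {i}", "category": category}


# Memory mode: materialize everything up front, so any record is an index away.
in_memory = list(iter_records())

# Streaming mode: hold one record at a time; each query needs a fresh pass.
streamed_count = sum(1 for _ in iter_records())

# Selective mode: stream once, keeping only the matching subset in memory.
education_only = [r for r in iter_records() if r["category"] == "education"]

print(len(in_memory), streamed_count, len(education_only))
```

The selective list is smaller than the full list but supports the same fast, repeated access, which is the balance the third mode aims for.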

Sliding Windows

Process large episodes in manageable chunks with configurable overlap:

# Process episode in 10-turn windows with 2-turn overlap
for window in episode.sliding_window(window_size=10, overlap=2):
    print(f"Window: {window.size} turns")
    print(f"Time range: {window.time_range[0]/60:.1f}-{window.time_range[1]/60:.1f}min")
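
To make the window_size and overlap arithmetic concrete, here is a minimal standalone sketch of overlapping windows over a list. The sliding_windows function is a hypothetical helper, not the package's sliding_window method, and the handling of the final, shorter window is an assumption. The README does not say how the package treats it.

```python
def sliding_windows(items, window_size, overlap):
    """Yield lists of up to `window_size` items; neighbors share `overlap` items."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap  # how far each window advances
    for start in range(0, len(items), step):
        yield items[start:start + window_size]
        if start + window_size >= len(items):
            break  # this window already reaches the end of the list


turn_indices = list(range(25))  # stand-in for an episode with 25 turns
windows = list(sliding_windows(turn_indices, window_size=10, overlap=2))
for w in windows:
    print(f"turns {w[0]}-{w[-1]} ({len(w)} turns)")
```

With 10-turn windows and a 2-turn overlap, each window advances by 8 turns, so 25 turns produce three windows, the last of which is shorter.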

Search Capabilities

Search podcasts and episodes by various criteria:

# Search by duration, speakers, hosts, categories, etc.
episodes = sporc.search_episodes(
    min_duration=1800,  # 30+ minutes
    category="education",
    host_name="Simon Shapiro"
)
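
Combining criteria like these amounts to keeping the episodes that satisfy every filter that was supplied. The sketch below shows that logic on made-up data. SketchEpisode and filter_episodes are hypothetical names, and details such as case-insensitive category matching are assumptions rather than documented behavior of search_episodes.

```python
from dataclasses import dataclass, field


@dataclass
class SketchEpisode:
    """Hypothetical stand-in for an episode (not sporc's Episode class)."""
    title: str
    duration_seconds: float
    category: str
    host_names: list = field(default_factory=list)


def filter_episodes(episodes, min_duration=None, category=None, host_name=None):
    """Keep episodes matching every criterion that is not None."""
    matches = []
    for ep in episodes:
        if min_duration is not None and ep.duration_seconds < min_duration:
            continue
        if category is not None and ep.category.lower() != category.lower():
            continue
        if host_name is not None and host_name not in ep.host_names:
            continue
        matches.append(ep)
    return matches


# Made-up catalog for illustration only.
catalog = [
    SketchEpisode("Intro to Phonetics", 2400, "education", ["Simon Shapiro"]),
    SketchEpisode("Quick Tips", 600, "education", ["Simon Shapiro"]),
    SketchEpisode("Late Night Laughs", 3600, "comedy", ["Another Host"]),
]

long_education = filter_episodes(
    catalog, min_duration=1800, category="education", host_name="Simon Shapiro"
)
print([ep.title for ep in long_education])
```

Omitted criteria are simply skipped, so calling the helper with no filters returns the whole catalog.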

Documentation

For comprehensive documentation and examples, see the project Wiki.

Performance Considerations

  • Memory Mode: Requires 8GB+ RAM, fast access to all data
  • Streaming Mode: Works with 4GB+ RAM, slower but memory efficient
  • Selective Mode: Best balance for working with specific subsets

Error Handling

The package exports a SPORCError exception that you can catch around dataset operations:

from sporc import SPORCDataset, SPORCError

try:
    sporc = SPORCDataset()
    podcast = sporc.search_podcast("Example Podcast")
except SPORCError as e:
    print(f"Error: {e}")

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this package in your research, please cite the original SPORC paper:

@inproceedings{blitt2025sporc,
  title={SPORC: the Structured Podcast Open Research Corpus},
  author={Litterer, Ben and Jurgens, David and Card, Dallas},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}

Support

For questions, issues, or feature requests, please:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue if your problem isn't already addressed

Acknowledgments

  • Hugging Face for hosting the dataset
  • The open-source community for the tools that made this package possible

