SPORC: Structured Podcast Open Research Corpus
A Python package for working with the SPORC (Structured Podcast Open Research Corpus) dataset from Hugging Face.
Overview
SPORC is a large multimodal dataset for the study of the podcast ecosystem. This package provides easy-to-use Python classes and functions to interact with the dataset, including:
- Podcast class: Collection of episodes and metadata about a podcast
- Episode class: Single episode with information about its contents
- Turn class: Individual conversation turns with speaker information
- Search functionality for podcasts and episodes
- Conversation turn analysis and filtering
- Sliding windows for processing large episodes in manageable chunks
- Streaming support for memory-efficient processing of large datasets
- Selective loading for filtering and loading specific podcast subsets into memory
- Lazy loading for efficient turn data access
Installation
Prerequisites
Before installing this package, you need to:
- Accept the SPORC dataset terms on Hugging Face:
  - Visit https://huggingface.co/datasets/blitt/SPoRC
  - Log in to your Hugging Face account
  - Click "I agree" to accept the dataset terms
- Set up Hugging Face credentials on your local machine:
pip install huggingface_hub
huggingface-cli login
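Before loading the dataset, it can help to confirm that a token is actually visible to your Python process. The sketch below is not part of the sporc package: `find_hf_token` is a hypothetical helper, and it assumes the common Hugging Face defaults (the `HF_TOKEN` environment variable, or a `token` file under `HF_HOME`, which usually falls back to `~/.cache/huggingface`). Your setup may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token(
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Optional[str]:
    """Hypothetical helper: look for a Hugging Face token in common places.

    Assumes the usual defaults (HF_TOKEN, then <HF_HOME>/token); it does not
    validate the token against the Hugging Face API.
    """
    token = env.get("HF_TOKEN")
    if token:
        return token.strip()

    hf_home = env.get("HF_HOME")
    if hf_home:
        base = Path(hf_home)
    else:
        base = (home or Path.home()) / ".cache" / "huggingface"

    token_file = base / "token"
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("No token found; run `huggingface-cli login` first.")
    else:
        print("A Hugging Face token is available.")
```

A found token only shows that you are logged in; you still need to have accepted the dataset terms on the dataset page.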
Install the Package
pip install sporc
Or install from source:
git clone https://github.com/yourusername/sporc.git
cd sporc
pip install -e .
Quick Start
from sporc import SPORCDataset
# Initialize the dataset
sporc = SPORCDataset()
# Search for a specific podcast
podcast = sporc.search_podcast("SingOut SpeakOut")
# Get all episodes for this podcast
for episode in podcast.episodes:
    print(f"Episode: {episode.title}")
    print(f"Duration: {episode.duration_seconds} seconds")
    print(f"Hosts: {episode.host_names}")
# Search for episodes with specific criteria
episodes = sporc.search_episodes(
    min_duration=300,  # At least 5 minutes
    max_speakers=3,  # Maximum 3 speakers
    host_name="Simon Shapiro"
)
# Get conversation turns for a specific episode
episode = episodes[0]
turns = episode.get_turns_by_time_range(0, 180)  # First 3 minutes
for turn in turns:
    print(f"Speaker: {turn.speaker}")
    print(f"Text: {turn.text[:100]}...")
Core Classes
SPORCDataset
The main class for interacting with the SPORC dataset.
from sporc import SPORCDataset
# Memory mode (default)
sporc = SPORCDataset()
# Streaming mode for memory efficiency
sporc = SPORCDataset(streaming=True)
# Selective mode to load specific podcasts
sporc = SPORCDataset(streaming=True)
sporc.load_podcast_subset(categories=['education'])
Podcast
Represents a podcast with its episodes and metadata.
podcast = sporc.search_podcast("Example Podcast")
print(f"Title: {podcast.title}")
print(f"Category: {podcast.category}")
print(f"Number of episodes: {len(podcast.episodes)}")
Episode
Represents a single podcast episode.
episode = podcast.episodes[0]
print(f"Title: {episode.title}")
print(f"Duration: {episode.duration_seconds} seconds")
print(f"Hosts: {episode.host_names}")
Turn
Represents a single conversation turn in an episode.
turn = episode.get_all_turns()[0]
print(f"Speaker: {turn.speaker}")
print(f"Text: {turn.text}")
print(f"Duration: {turn.duration} seconds")
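To make the idea of time-based turn filtering concrete, here is a minimal sketch that runs without the dataset. `SketchTurn` and `turns_in_range` are hypothetical stand-ins, not the package's classes; the `start_time` and `end_time` field names and the overlap rule (a turn counts if any part of it falls inside the range) are assumptions, and `get_turns_by_time_range` may behave differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchTurn:
    """Hypothetical stand-in for a turn; field names are assumptions."""
    speaker: str
    text: str
    start_time: float  # seconds from the start of the episode
    end_time: float

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


def turns_in_range(turns: List[SketchTurn], start: float, end: float) -> List[SketchTurn]:
    """Keep turns that overlap the half-open interval [start, end)."""
    return [t for t in turns if t.start_time < end and t.end_time > start]


demo = [
    SketchTurn("SPEAKER_00", "Welcome to the show.", 0.0, 4.5),
    SketchTurn("SPEAKER_01", "Thanks for having me.", 4.5, 7.0),
    SketchTurn("SPEAKER_00", "Let's get started.", 200.0, 203.0),
]

# Mirrors the "first 3 minutes" query from the Quick Start.
first_three_minutes = turns_in_range(demo, 0, 180)
print([t.speaker for t in first_three_minutes])
```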
Key Features
Memory Modes
The package supports three modes for different use cases:
- Memory Mode: Fast access, high memory usage (default)
- Streaming Mode: Memory efficient, slower access
- Selective Mode: Best of both worlds - load specific subsets into memory
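The trade-off between the three modes can be illustrated with plain Python generators. This is a conceptual sketch only: `stream_records` and `load_subset` are hypothetical names, the records are made up, and the package's internals may work differently.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def stream_records() -> Iterator[Record]:
    """Hypothetical record source that yields one record at a time."""
    yield {"podcast": "A", "category": "education"}
    yield {"podcast": "B", "category": "comedy"}
    yield {"podcast": "C", "category": "education"}


def load_subset(records: Iterable[Record], categories: List[str]) -> List[Record]:
    """Consume a stream once and keep only the matching records in memory."""
    wanted = {c.lower() for c in categories}
    return [r for r in records if r["category"].lower() in wanted]


# Memory mode: materialize everything up front (fast repeated access).
in_memory = list(stream_records())

# Streaming mode: look at one record at a time and keep nothing.
education_count = sum(1 for r in stream_records() if r["category"] == "education")

# Selective mode: stream once, then work with the filtered subset in memory.
subset = load_subset(stream_records(), categories=["education"])

print(len(in_memory), education_count, [r["podcast"] for r in subset])
```

The selective pattern pays the streaming cost once, after which repeated queries over the subset are as fast as in memory mode.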
Sliding Windows
Process large episodes in manageable chunks with configurable overlap:
# Process episode in 10-turn windows with 2-turn overlap
for window in episode.sliding_window(window_size=10, overlap=2):
    print(f"Window: {window.size} turns")
    print(f"Time range: {window.time_range[0]/60:.1f}-{window.time_range[1]/60:.1f}min")
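For readers who want to see how window size and overlap interact, the following standalone sketch applies the same idea to a plain list. `sliding_windows` is a hypothetical function written for illustration, not the package's implementation; it assumes each window advances by `window_size - overlap` items and that the final window may be shorter.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(items: Sequence[T], window_size: int, overlap: int) -> Iterator[List[T]]:
    """Yield consecutive windows of up to `window_size` items sharing `overlap` items."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    step = window_size - overlap  # how far each window advances
    for start in range(0, len(items), step):
        yield list(items[start:start + window_size])
        if start + window_size >= len(items):
            break  # this window already reached the end of the sequence


turns = [f"turn-{i}" for i in range(25)]  # stand-in for an episode's turns
windows = list(sliding_windows(turns, window_size=10, overlap=2))
print([len(w) for w in windows])  # [10, 10, 9]
```

With 25 items, a window size of 10, and an overlap of 2, each window starts 8 items after the previous one, so the last two items of one window reappear as the first two of the next.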
Search Capabilities
Search podcasts and episodes by various criteria:
# Search by duration, speakers, hosts, categories, etc.
episodes = sporc.search_episodes(
    min_duration=1800,  # 30+ minutes
    category="education",
    host_name="Simon Shapiro"
)
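The way several criteria combine can be sketched with a small filter over plain dictionaries. `filter_episodes` and the record keys below are hypothetical and chosen to mirror the parameters shown above; the sketch assumes criteria are combined with AND and that unset criteria are ignored, which may not match `search_episodes` exactly.

```python
from typing import Dict, List, Optional


def filter_episodes(
    episodes: List[Dict],
    min_duration: Optional[float] = None,
    max_speakers: Optional[int] = None,
    host_name: Optional[str] = None,
    category: Optional[str] = None,
) -> List[Dict]:
    """Return the episodes that satisfy every criterion that is not None."""
    results = []
    for ep in episodes:
        if min_duration is not None and ep["duration_seconds"] < min_duration:
            continue
        if max_speakers is not None and ep["num_speakers"] > max_speakers:
            continue
        if host_name is not None and host_name not in ep["host_names"]:
            continue
        if category is not None and ep["category"].lower() != category.lower():
            continue
        results.append(ep)
    return results


# Made-up records for illustration only.
catalog = [
    {"title": "Ep 1", "duration_seconds": 2400, "num_speakers": 2,
     "host_names": ["Simon Shapiro"], "category": "Education"},
    {"title": "Ep 2", "duration_seconds": 600, "num_speakers": 2,
     "host_names": ["Simon Shapiro"], "category": "Education"},
    {"title": "Ep 3", "duration_seconds": 3600, "num_speakers": 4,
     "host_names": ["Someone Else"], "category": "Comedy"},
]

matches = filter_episodes(
    catalog, min_duration=1800, category="education", host_name="Simon Shapiro"
)
print([ep["title"] for ep in matches])
```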
Documentation
For comprehensive documentation and examples, see the Wiki:
- Installation Guide: Detailed setup instructions
- Basic Usage: Simple examples to get started
- Search Examples: How to search for podcasts and episodes
- Conversation Analysis: Analyzing conversation turns and patterns
- Sliding Windows: Process large episodes in manageable chunks
- Streaming Mode: Memory-efficient processing
- Selective Loading: Filtered subset processing
- Lazy Loading: Efficient turn data loading
- API Reference: Complete API documentation
Performance Considerations
- Memory Mode: Requires 8GB+ RAM, fast access to all data
- Streaming Mode: Works with 4GB+ RAM, slower but memory efficient
- Selective Mode: Best balance for working with specific subsets
Error Handling
The package exposes a SPORCError exception that you can catch around dataset calls:
from sporc import SPORCDataset, SPORCError
try:
    sporc = SPORCDataset()
    podcast = sporc.search_podcast("Example Podcast")
except SPORCError as e:
    print(f"Error: {e}")
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this package in your research, please cite the original SPORC paper:
@inproceedings{blitt2025sporc,
  title={SPORC: the Structured Podcast Open Research Corpus},
  author={Litterer, Ben and Jurgens, David and Card, Dallas},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}
Support
For questions, issues, or feature requests, please:
- Check the documentation
- Search existing issues
- Create a new issue if your problem isn't already addressed
Acknowledgments
- Hugging Face for hosting the dataset
- The open-source community for the tools that made this package possible