Audiobook scraper — search and stream from Librivox, LoyalBooks, and more
AudioBooker
Python library for searching and streaming free audiobooks from multiple sources.
Parallel search across all sources, fuzzy matching, relevance scoring, and a unified
AudioBook dataclass — one API regardless of where the book comes from.
Supported Sources
| Source | Site | Catalogue | Native Search | Genres / Tags |
|---|---|---|---|---|
| Librivox | librivox.org | ~18 000 books | title, author, narrator, tag (REST API) | 30+ |
| LoyalBooks | loyalbooks.com | ~3 500 books | title, author (sitemap), tag (genre pages) | 41 |
| StephenKingAudioBooks | stephenkingaudiobooks.com | ~113 books | full-text site search | - |
| GoldenAudioBooks | goldenaudiobook.co | ~6 500 books | title, author, tag (linear scan) | - |
| AudioAnarchy | audioanarchy.org | ~11 books | title, author, tag (linear scan) | Anarchy, Radio Drama |
| DarkerProjects | darkerprojects.com | ~244 episodes | title, author, tag (linear scan) | Audio Drama |
| HPTalesAudioBooks | hpaudiotales.com | ~20 books | title, author, tag (linear scan) | Harry Potter |
YouTube sources (optional, requires pip install audiobooker[youtube]):
| Source | Channel | Content | Tags |
|---|---|---|---|
| TheCybrarian | @TheCybrarian | Robert E. Howard fiction (Conan, Solomon Kane, Kull…) | Fantasy, Sword and Sorcery, Robert E. Howard |
| HorrorBabble | @HorrorBabble | Horror short fiction narrated by Ian Gordon | Horror, Lovecraft, Weird Fiction |
Total indexed: ~28 000+ titles across 7 web sources + 2 YouTube channels.
LoyalBooks genres (41): Action and Adventure, Ancient Texts, Animals, Art Design and Architecture, Biography and Memoir, Children in Fiction, Children Non-fiction, Classics (Antiquity), Comedy and Humour, Drama, Early Modern, Fantasy, General Fiction, Historical Fiction, History, Horror and Supernatural Fiction, Humor, Instruction and How-To, Language, Literary Fiction, Love Romance and Marriage, Modern (19th C), Music and Theatre, Myths Legends and Fairy Tales, Nature and Wildlife, Non-fiction, Philosophy, Poetry, Politics and Economics, Psychology, Religion, Science, Science Fiction, Short Stories, Short Works, Spiritual and Inspirational, Sport and Recreation, Tragedy, Travel and Geography, War and Military, Westerns.
YouTube support
Install the optional YouTube extra:
pip install audiobooker[youtube]
# or
pip install tutubo
Use the pre-configured channel sources or define your own:
from audiobooker.scrappers.youtube import HorrorBabble, TheCybrarian, YoutubeChannelSource
from audiobooker.base import BookAuthor

# Pre-configured channels
for book in HorrorBabble().iterate_all():
    print(book.title, book.streams)  # streams = YouTube watch URLs
for book in TheCybrarian().search_by_title("Conan"):
    print(book.title, book.runtime)

# Custom channel
my_channel = YoutubeChannelSource(
    channel_url="https://www.youtube.com/@SomeChannel/videos",
    authors=[BookAuthor(last_name="Unknown")],
    tags=["Audiobook"],
    language="en",
    min_runtime=300,  # skip anything under 5 minutes
)
for book in my_channel.iterate_all():
    print(book.title)

# Custom playlist
from audiobooker.scrappers.youtube import YoutubePlaylistSource
playlist = YoutubePlaylistSource(
    playlist_url="https://www.youtube.com/playlist?list=PLxxxxxx",
    authors=[BookAuthor(last_name="Various")],
    tags=["Horror"],
)
for book in playlist.iterate_all():
    print(book.title, book.runtime)
When tutubo is installed, TheCybrarian and HorrorBabble are automatically included
in ALL_SOURCES and participate in all unified search*() calls.
Install
pip install audiobooker
Unified search
Search all sources in parallel — results arrive sorted by relevance score.
from audiobooker import search, search_by_author, search_by_title, search_by_tag, search_by_narrator

# Search all sources, deduplicated, scored, timeout 30s
for book in search("Lovecraft", max_per_source=5, timeout=30):
    print(f"[{book.score:.2f}] [{book.source}] {book.title}")
    print(f" author={book.authors} streams={len(book.streams)}")

# Targeted searches
for book in search_by_author("Dickens", max_per_source=5):
    print(book.title)
for book in search_by_title("Sherlock Holmes", max_per_source=5):
    print(book.title, book.language)
for book in search_by_tag("horror", max_per_source=5):
    print(book.title)
for book in search_by_narrator("Frank Muller", max_per_source=5):
    print(book.title, book.narrator)
Search parameters
| Parameter | Default | Description |
|---|---|---|
| sources | all 7 web sources (plus the YouTube channels when tutubo is installed) | list of instantiated AudioBookSource objects to restrict the search to |
| max_per_source | 10 | max results collected per source before stopping that thread |
| timeout | 30.0 | seconds before slow sources are cancelled |
| deduplicate | True | skip books with identical title+author from a second source |
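As a rough illustration of how these four parameters could interact, the sketch below fans a query out to several sources on a thread pool, caps each source at max_per_source, drops sources that miss the timeout, and deduplicates on title plus author. It is not audiobooker's implementation: parallel_search, the (title, author, score) tuples, and the mapping of source name to callable are hypothetical stand-ins for the library's search() and AudioBook objects.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from itertools import islice
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record: (title, author, score). The real library yields AudioBook objects.
Result = Tuple[str, str, float]


def parallel_search(
    sources: Dict[str, Callable[[str], Iterable[Result]]],
    query: str,
    max_per_source: int = 10,
    timeout: float = 30.0,
    deduplicate: bool = True,
) -> List[Result]:
    """Query every source in its own thread and merge the results by score."""

    def collect(fn: Callable[[str], Iterable[Result]]) -> List[Result]:
        # Stop pulling from a source once it has produced max_per_source items.
        return list(islice(fn(query), max_per_source))

    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = [pool.submit(collect, fn) for fn in sources.values()]
    done, _not_done = wait(futures, timeout=timeout)
    # Do not block on slow sources; whatever they produce later is ignored.
    pool.shutdown(wait=False, cancel_futures=True)

    merged: List[Result] = []
    seen = set()
    for fut in futures:  # submission order, so an earlier source wins a duplicate
        if fut not in done or fut.exception() is not None:
            continue  # timed out, or the source raised
        for title, author, score in fut.result():
            key = (title.casefold(), author.casefold())
            if deduplicate and key in seen:
                continue
            seen.add(key)
            merged.append((title, author, score))
    return sorted(merged, key=lambda r: r[2], reverse=True)
```

Python cannot forcibly stop a running thread, so in this sketch a slow source is abandoned rather than truly cancelled; its thread finishes in the background.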
Scoring
Each result carries a score field (0.0–1.0) computed by score_book().
Weights depend on the search method so cross-field contamination is avoided:
| Method | Title | Author | Tag | Narrator |
|---|---|---|---|---|
| search_by_title | 100% | - | - | - |
| search_by_author | - | 100% | - | - |
| search_by_tag | - | - | 100% | - |
| search_by_narrator | - | - | - | 100% |
| search | 55% | 30% | 10% | 5% |
Matching uses rapidfuzz's WRatio, which handles token reordering, typos, and partial matches. Title scoring adds a containment bonus when all query words appear verbatim in the title.
Results scoring below 0.45 are filtered out automatically.
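The weighting scheme can be pictured with a small stdlib-only sketch. It is an approximation under stated assumptions, not score_book() itself: difflib.SequenceMatcher is swapped in for rapidfuzz's WRatio so the example has no third-party dependency, the size of the containment bonus (CONTAINMENT_BONUS) is a made-up value, and weighted_score, title_similarity, and the plain dict of fields are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict

# Weights taken from the scoring table; unused fields are simply omitted.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "search": {"title": 0.55, "author": 0.30, "tag": 0.10, "narrator": 0.05},
    "search_by_title": {"title": 1.0},
    "search_by_author": {"author": 1.0},
    "search_by_tag": {"tag": 1.0},
    "search_by_narrator": {"narrator": 1.0},
}
MIN_SCORE = 0.45          # cut-off stated in the text
CONTAINMENT_BONUS = 0.10  # hypothetical value; the real bonus is not documented here


def similarity(query: str, text: str) -> float:
    """Stand-in for rapidfuzz WRatio, returning a value in 0..1."""
    if not query or not text:
        return 0.0
    return SequenceMatcher(None, query.casefold(), text.casefold()).ratio()


def title_similarity(query: str, title: str) -> float:
    """Similarity plus a bonus when every query word appears in the title."""
    base = similarity(query, title)
    query_words = query.casefold().split()
    title_words = set(title.casefold().split())
    if query_words and all(word in title_words for word in query_words):
        base = min(1.0, base + CONTAINMENT_BONUS)
    return base


def weighted_score(query: str, fields: Dict[str, str], method: str = "search") -> float:
    """Combine per-field similarities using the weights for the given method."""
    total = 0.0
    for name, weight in WEIGHTS[method].items():
        text = fields.get(name, "")
        part = title_similarity(query, text) if name == "title" else similarity(query, text)
        total += weight * part
    return min(1.0, total)


def passes(query: str, fields: Dict[str, str], method: str = "search") -> bool:
    """True when the result would survive the MIN_SCORE filter."""
    return weighted_score(query, fields, method) >= MIN_SCORE
```

Because search_by_author puts all of its weight on the author field, a query such as "Dickens" is not diluted by an unrelated title like "Oliver Twist", which is the cross-field contamination the per-method weights are meant to avoid.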
Per-source usage
All scrapers share the same interface via AudioBookSource.
from audiobooker.scrappers.librivox import Librivox
from audiobooker.scrappers.loyalbooks import LoyalBooks
from audiobooker.scrappers.goldenaudiobooks import GoldenAudioBooks
from audiobooker.scrappers.audioanarchy import AudioAnarchy
from audiobooker.scrappers.darkerprojects import DarkerProjects
from audiobooker.scrappers.hpaudiotales import HPTalesAudioBooks
from audiobooker.scrappers.stephenkingaudiobooks import StephenKingAudioBooks
# Common interface
source = Librivox()
source.search(query) # title + author + tag
source.search_by_title(query)
source.search_by_author(query)
source.search_by_tag(query)
source.search_by_narrator(query)
source.iterate_all() # every book in the catalogue
source.iterate_popular() # front-page / curated selection
source.iterate_by_author(author)
source.iterate_by_tag(tag)
Librivox — REST API, fastest source
lv = Librivox()
for book in lv.search_by_author("Lovecraft", max_per_source=5):
    print(book.title, book.runtime, "s")
for book in lv.search_by_narrator("LibriVox"):
    print(book.title, book.narrator)
LoyalBooks — sitemap + genre pages
lb = LoyalBooks()
for book in lb.search_by_tag("Horror and Supernatural Fiction"):
    print(book.title)  # uses genre page, not linear scan
for book in lb.iterate_popular():
    print(book.title)  # front-page featured books
Linear-scan sources
GoldenAudioBooks, AudioAnarchy, DarkerProjects, HPTalesAudioBooks, and
StephenKingAudioBooks all support iterate_all(). StephenKingAudioBooks also
has a native site search for title/author queries.
for book in AudioAnarchy().iterate_all():
    print(book.title, book.tags)  # tags: ["Anarchy"] or ["Anarchy", "Radio Drama"]
for book in DarkerProjects().iterate_popular():
    print(book.title)  # front-page shows
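For readers unfamiliar with the term, a "linear scan" search amounts to walking the whole catalogue and keeping the entries that match. The sketch below shows the idea with plain dicts and a case-insensitive substring test; linear_scan and its parameters are hypothetical, and the library itself presumably applies its fuzzy scoring rather than a substring check.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Optional


def linear_scan(
    catalogue: Iterable[Dict[str, str]],
    query: str,
    field: str = "title",
    limit: Optional[int] = None,
) -> Iterator[Dict[str, str]]:
    """Lazily yield entries whose field contains the query, ignoring case."""
    needle = query.casefold()
    matches = (book for book in catalogue if needle in book.get(field, "").casefold())
    # islice stops consuming the catalogue as soon as `limit` matches are found.
    return islice(matches, limit)


catalogue = [
    {"title": "Conan the Barbarian", "author": "Howard"},
    {"title": "Dagon", "author": "Lovecraft"},
    {"title": "The Conan Saga", "author": "Howard"},
]
for book in linear_scan(catalogue, "conan"):
    print(book["title"])
```

The cost grows with catalogue size, which is why sources with a native search (a REST API or site search) are faster than those that have to be scanned.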
AudioBook dataclass
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AudioBookChapter:
    title: str = ""
    offset: float = 0.0   # seconds from start of book
    runtime: float = 0.0  # seconds
    stream: str = ""      # per-chapter audio URL
    image: str = ""

@dataclass
class AudioBook:
    title: str = ""
    description: str = ""
    image: str = ""     # cover art URL
    language: str = ""  # ISO 639-1 code (normalised from source)
    authors: List[BookAuthor] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    streams: List[str] = field(default_factory=list)  # direct audio URLs
    narrator: Optional[AudiobookNarrator] = None  # primary reader
    narrators: List[AudiobookNarrator] = field(default_factory=list)  # full reader cast
    chapters: List[AudioBookChapter] = field(default_factory=list)
    genres: List[str] = field(default_factory=list)  # taxonomy genres
    year: int = 0
    runtime: int = 0    # seconds (where available)
    source: str = ""    # e.g. "Librivox", "LoyalBooks"
    score: float = 0.0  # relevance score from last search (0..1)
    codec: str = ""     # e.g. "mp3"
    bitrate: str = ""   # e.g. "128"
    external_ids: dict = field(default_factory=dict)  # e.g. {"librivox_id": "47"}

    def has_live_streams(self) -> bool: ...  # HEAD-checks stream URLs
AudioBook supports == and hash() based on (title, sorted authors) — use a
set to deduplicate across sources.
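A minimal sketch of that equality rule, assuming authors can be compared as plain strings: Book below is a hypothetical stand-in with only the fields needed to show the idea, not the real AudioBook class.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(eq=False)  # eq=False so the hand-written __eq__/__hash__ below are used
class Book:
    """Hypothetical stand-in for AudioBook."""
    title: str = ""
    authors: List[str] = field(default_factory=list)
    source: str = ""

    def _key(self) -> Tuple[str, Tuple[str, ...]]:
        # Identity is (title, sorted authors); the source is deliberately ignored.
        return (self.title, tuple(sorted(self.authors)))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Book):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self) -> int:
        return hash(self._key())


books = [
    Book("Dagon", ["Lovecraft"], source="Librivox"),
    Book("Dagon", ["Lovecraft"], source="LoyalBooks"),
    Book("Polaris", ["Lovecraft"], source="Librivox"),
]
unique = set(books)
print(len(unique))
```

Sorting the authors before hashing means two sources that list co-authors in a different order still collapse to one entry.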
Utilities
from audiobooker import score_book, iter_sitemap_urls, check_url_availability, normalize_language

# Score a book against a query manually
score = score_book("Lovecraft", book, method="search_by_author")

# Walk any sitemap or sitemap index recursively
for url in iter_sitemap_urls("https://example.com/sitemap.xml"):
    print(url)

# Check if a stream URL is reachable
if check_url_availability("https://example.com/book.mp3"):
    print("live")

# Normalise language strings to ISO 639-1
normalize_language("English")  # → "en"
normalize_language("en-US")    # → "en"
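To make the "sitemap or sitemap index, recursively" behaviour of iter_sitemap_urls concrete, here is a hedged sketch of how such a walker could be written with the standard library. walk_sitemap and its injected fetch callable are hypothetical; injecting the fetcher keeps the example free of network access, and the real utility may differ in how it downloads and parses.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set


def _local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def walk_sitemap(
    url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, following nested sitemap indexes."""
    seen = set() if _seen is None else _seen
    if url in seen:  # guard against indexes that reference each other
        return
    seen.add(url)
    try:
        root = ET.fromstring(fetch(url))
    except Exception:
        return  # unreachable or malformed sitemap: yield nothing
    is_index = _local(root.tag) == "sitemapindex"
    for entry in root:           # <sitemap> or <url> elements
        for child in entry:      # look for the <loc> child
            text = (child.text or "").strip()
            if _local(child.tag) != "loc" or not text:
                continue
            if is_index:
                yield from walk_sitemap(text, fetch, seen)
            else:
                yield text


# In-memory pages standing in for HTTP responses.
NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
PAGES = {
    "index.xml": f"<sitemapindex {NS}><sitemap><loc>a.xml</loc></sitemap>"
                 f"<sitemap><loc>missing.xml</loc></sitemap></sitemapindex>",
    "a.xml": f"<urlset {NS}><url><loc>https://example.com/1</loc></url>"
             f"<url><loc>https://example.com/2</loc></url></urlset>",
}
for page_url in walk_sitemap("index.xml", PAGES.__getitem__):
    print(page_url)
```

A missing child sitemap is skipped rather than raised, in line with the library's stated policy of never letting one bad page abort a run.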
mediavocab integration
mediavocab is a hard runtime dependency. Every AudioBook can be projected
into the typed mediavocab.Release schema via audiobook_to_release():
from audiobooker import search, audiobook_to_release

# Search → typed mediavocab Release with parsed_license filtering
for book in search("Lovecraft", max_per_source=3):
    release = audiobook_to_release(book)
    if release.parsed_license and release.parsed_license.is_open():
        # public domain / CC-licensed: free to redistribute
        print(release.work.title, release.parsed_license.identifier)
The converter populates a wide swath of the Release / Work schema:
| mediavocab field | Source data |
|---|---|
| Work.title, Work.year, Work.runtime, Work.language | direct |
| Work.content_genres | AudioBook.genres (e.g. LibriVox genres) |
| Work.credits | authors → RelationRole.CREATOR, every reader → RelationRole.PERFORMER |
| Work.external_ids | librivox_id and any other typed ID the source supplied |
| Release.chapters | AudioBook.chapters → Chapter(offset, end, title) |
| Release.codec, Release.bitrate | LibriVox publishes 128 kbps MP3 by policy |
| Release.audio_language | mirrors Work.language |
| Release.license | public_domain for LibriVox / LoyalBooks |
| Release.release_date | IsoDate-compatible YYYY from AudioBook.year |
LibriVox emits one Release per book with full per-section chapters and a
deduplicated reader cast. Other sources populate whatever subset their public
data exposes — fields are only set when the source actually carries the data.
Error handling
Network failures and malformed pages are swallowed per-item — a bad page never
aborts an iterate_all() run. If a source site is down or has restructured its
HTML, that scraper silently yields nothing.
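The per-item policy described above can be sketched as a generator that isolates each parse step. This is an illustration only: iterate_safely, the logger name, and the choice to log at debug level are assumptions, not the library's code.

```python
import logging
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")
log = logging.getLogger("scraper_sketch")  # hypothetical logger name


def iterate_safely(items: Iterable[T], parse: Callable[[T], R]) -> Iterator[R]:
    """Yield parse(item) for each item, skipping any item that raises."""
    for item in items:
        try:
            parsed = parse(item)
        except Exception as exc:
            # One bad page must not abort the whole run; note it and move on.
            log.debug("skipping %r: %s", item, exc)
            continue
        yield parsed


# "x" cannot be parsed as an integer, so it is skipped.
print(list(iterate_safely(["1", "x", "3"], int)))
```

Keeping the yield outside the try block means only parsing errors are swallowed, not exceptions raised by the code consuming the generator. Callers who need to tell "no results" apart from "site down" would have to check the source separately, since both look like an empty iteration.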
License
MIT