Media indexer for YouTube, YouTube Music, Internet Archive, Bandcamp and SoundCloud — index streams, download on demand
media_archivist
Cross-source media indexer. Builds a local JSON database of stream metadata from YouTube, YouTube Music, Internet Archive, Bandcamp and SoundCloud.
| Backend | Library | What you can index |
|---|---|---|
| YouTube | tutubo | channels, playlists, videos (no API key) |
| YouTube Music | tutubo.ytmus (via ytmusicapi) | tracks, albums, artists, playlists |
| Internet Archive | internetarchive | items, collections |
| Bandcamp | py_bandcamp | tracks, albums, artists, tag/search |
| SoundCloud | nuvem_de_som | tracks, sets, profiles, search |
media_archivist is metadata-only: it indexes streams; it does not
download them. Pair it with yt-dlp (or
SoundCloud's resolve_stream, Bandcamp's track.stream) for on-demand
extraction, or use the JSON DB to drive dataset-collection scripts, recommender
experiments, OVOS skills, etc.
Ships as both a Python library and a media-archivist CLI.
Install
pip install media_archivist # core (YouTube + IA + YT Music)
pip install media_archivist[bandcamp] # + py_bandcamp
pip install media_archivist[soundcloud] # + nuvem_de_som
pip install media_archivist[all] # everything
CLI
Every subcommand takes either:
- --db-file PATH: explicit path to a .json file (recommended for datasets you want to commit alongside scripts), or
- --db NAME: auto-place under XDG at ~/.local/share/media_archivist/<NAME>.json.
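The two options above can be mirrored in a script that needs to locate the same file. The following is a minimal sketch, not the tool's own resolution code: the function name `resolve_db_path` is hypothetical, and honoring `XDG_DATA_HOME` is an assumption based on the XDG Base Directory convention (the text above only states the default `~/.local/share` location).

```python
import os
from pathlib import Path


def resolve_db_path(db_file=None, db_name=None):
    """Return the JSON file implied by --db-file PATH or --db NAME."""
    if db_file is not None:
        # Explicit path wins; only expand a leading "~".
        return Path(db_file).expanduser()
    if db_name is None:
        raise ValueError("pass either db_file or db_name")
    # Assumption: XDG_DATA_HOME is honored, falling back to ~/.local/share
    # when it is unset or empty.
    data_home = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(data_home) / "media_archivist" / f"{db_name}.json"


if __name__ == "__main__":
    print(resolve_db_path(db_name="talks"))
```

Keeping this logic in one helper lets a dataset script and the CLI agree on which file is being read.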
# Index a channel, a playlist, or individual videos
media-archivist add --db-file talks.json https://www.youtube.com/@LinusTechTips
media-archivist add --db-file talks.json --blacklist "#shorts" \
https://www.youtube.com/playlist?list=PL...
# Browse the DB
media-archivist list --db-file talks.json --limit 20
media-archivist list --db-file talks.json --grep "review" --json
media-archivist stats --db-file talks.json
# Pair with yt-dlp — index once, download on demand
media-archivist urls --db-file talks.json --grep "tutorial" | yt-dlp -a -
# Drop dead videos / unwanted titles
media-archivist prune --db-file talks.json --unavailable --blacklist sponsor
# Background-monitor a set of URLs (re-syncs every --interval seconds)
media-archivist monitor --db-file talks.json --interval 600 \
https://www.youtube.com/@LinusTechTips \
https://www.youtube.com/@SomeOtherChannel
# Internet Archive
media-archivist add --db-file ia_movies.json --ia classic_cartoons
media-archivist urls --db-file ia_movies.json | xargs -n1 -P4 wget
# YouTube Music — rich track metadata (artist, album, year, duration, explicit)
media-archivist add --db-file songs.json --music --skip-explicit "lo-fi beats"
media-archivist add --db-file songs.json --music \
"https://music.youtube.com/playlist?list=PL..."
# Bandcamp — tracks have direct stream URLs in the entry
media-archivist add --db-file bandcamp.json --bandcamp \
"https://artistname.bandcamp.com/album/some-album"
media-archivist add --db-file bandcamp.json --bandcamp "ambient drone"
# SoundCloud — search, profile, or set URLs
media-archivist add --db-file sc.json --soundcloud \
"https://soundcloud.com/some-artist"
media-archivist add --db-file sc.json --soundcloud "footwork"
Pick the backend with --ia, --music, --bandcamp, or --soundcloud
(default: YouTube). Every other subcommand (list, export, urls, prune,
merge, stats, …) works the same way against any backend's DB.
DBs are plain JSON — edit, back up, version-control, share. With --db NAME the
file is managed under XDG via
json_database.
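Because the DB is plain JSON, it can be read with nothing but the standard library. The sketch below is an illustration only: the exact top-level layout of the file is not documented here, so the code assumes it is either a list of entry dicts or a dict keyed by entry id, and the helper names `load_entries` and `titles_matching` are hypothetical.

```python
import json
from pathlib import Path


def load_entries(db_path):
    """Load a JSON DB file and return its entries as a list of dicts."""
    data = json.loads(Path(db_path).read_text(encoding="utf-8"))
    # Assumption: top level is a dict keyed by entry id, or a list of entries.
    if isinstance(data, dict):
        return [v for v in data.values() if isinstance(v, dict)]
    return [e for e in data if isinstance(e, dict)]


def titles_matching(entries, needle):
    """Case-insensitive substring match on the title field."""
    needle = needle.lower()
    return [e["title"] for e in entries if needle in str(e.get("title", "")).lower()]


if __name__ == "__main__":
    for title in titles_matching(load_entries("talks.json"), "review"):
        print(title)
```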
Homelab / HTTP service
media-archivist serve exposes a FastAPI HTTP API on port 8000. The Docker
image includes yt-dlp and stores everything under /data.
# One command brings up the service with a persistent named volume,
# automatic restart-on-reboot, and a /healthz healthcheck.
docker compose -f deploy/docker-compose.yml up -d
The service is single-tenant and has no authentication. It is designed to run on your LAN or behind your existing reverse proxy (Caddy, Traefik, nginx). Do not expose port 8000 directly to the internet.
Integration endpoints
| Endpoint | Purpose |
|---|---|
| GET /strm/{id} | Returns the playable URL as text/plain; drop it into .strm files for Jellyfin / Kodi. |
| GET /m3u | M3U playlist of stream URLs. Accepts source, where, has_stream, limit. |
| GET /feed.rss | RSS feed for podcast clients or FreshRSS. Accepts limit. |
| GET /healthz | Liveness check for Uptime Kuma, Docker, k8s. Returns {status, version, db_path}. |
| GET /providers | Inspect which metadatarr providers are active (available, media, modality, genre_filter). |
| POST /canonicalize | Run the resolver against the DB. Body: {providers?, stamp_rows?, max_workers?}. |
| GET /quarantine | List entries the resolver could not confidently match. |
| POST /quarantine/{id}/accept | Accept a quarantined row (optional ?canonical_id= to link). |
| POST /quarantine/{id}/reject | Reject and force a fresh canonical_id. |
| GET /docs | Auto-generated OpenAPI / Swagger UI. |
See docs/deploy.md for the full route table, systemd
unit, and reverse-proxy tips. For Jellyfin .strm export, see
docs/jellyfin.md.
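As a rough client-side illustration of the /strm/{id} and /m3u endpoints, the sketch below builds the request URLs and writes a .strm file. Several things here are assumptions rather than documented behavior: the base URL, the helper names, the example query values, and the reading that the text returned by /strm/{id} is what goes inside the .strm file. docs/jellyfin.md remains the reference for the supported export flow.

```python
from pathlib import Path
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: where `media-archivist serve` is reachable


def strm_url(entry_id, base_url=BASE_URL):
    """URL of the endpoint that returns one playable URL as text/plain."""
    return f"{base_url.rstrip('/')}/strm/{quote(str(entry_id), safe='')}"


def m3u_url(base_url=BASE_URL, **params):
    """URL of the M3U playlist; parameters set to None are omitted."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/m3u"
    return f"{url}?{query}" if query else url


def fetch_playable_url(entry_id, base_url=BASE_URL, timeout=10):
    """Ask the running service for the playable URL of one entry."""
    with urlopen(strm_url(entry_id, base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def write_strm(folder, name, playable_url):
    """Write <folder>/<name>.strm containing a single URL line."""
    path = Path(folder) / f"{name}.strm"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(playable_url + "\n", encoding="utf-8")
    return path
```

A library folder could then be populated by calling `write_strm(folder, title, fetch_playable_url(entry_id))` for each entry while the service is running.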
Building datasets
media_archivist is metadata-only: it indexes streams; downloads happen on
demand via yt-dlp (or any other tool that reads URLs). The export,
import, merge, and stats subcommands turn the JSON DB into a workable
dataset.
# Build an index of three channels into one explicit file
media-archivist add --db-file documentaries.json \
https://www.youtube.com/@FreeDocumentary \
https://www.youtube.com/@FDSpace \
https://www.youtube.com/@FreeDocumentaryOcean
# Project specific fields → CSV (great for pandas / sklearn)
media-archivist export --db-file documentaries.json --format csv \
--fields videoId,title,url,published,tags,description \
-o documentaries.csv
# JSONL is the canonical "one-row-per-line" format for ML pipelines
media-archivist export --db-file documentaries.json --format jsonl \
-o documentaries.jsonl
# Just URLs (txt) for downstream tools
media-archivist export --db-file documentaries.json --format txt \
-o urls.txt
# Inspect coverage before training
media-archivist stats --db-file documentaries.json
# Merge per-topic indexes into a master dataset
media-archivist merge --db-file all_docs.json \
space.json ocean.json nature.json --overwrite
# Round-trip: import an existing JSONL produced elsewhere
media-archivist import --db-file talks.json talks.jsonl --overwrite
Output formats
| --format | Use case |
|---|---|
| jsonl (default) | streaming pipelines, HuggingFace datasets, jq |
| json | small datasets, human inspection |
| csv | pandas, spreadsheets; list/dict fields are auto-serialized to JSON strings |
| txt | flat URL list for yt-dlp -a - / wget -i / xargs |
Combine with --fields to project only what you need, --grep to filter by
title substring, and --limit N to cap row count.
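On the consuming side, the jsonl and csv exports can be read back with the standard library. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the `tags` column is used as the example of a list field that the CSV export stores as a JSON string.

```python
import csv
import io
import json


def read_jsonl(lines, fields=None):
    """Yield one dict per non-empty JSONL line, optionally projected to `fields`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        yield {k: row.get(k) for k in fields} if fields else row


def read_csv(text, json_columns=("tags",)):
    """Parse CSV text and decode columns that hold JSON-encoded lists/dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in json_columns:
            if row.get(col):
                row[col] = json.loads(row[col])
        rows.append(row)
    return rows


if __name__ == "__main__":
    with open("documentaries.jsonl", encoding="utf-8") as fh:
        for row in read_jsonl(fh, fields=["videoId", "title"]):
            print(row)
```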
Stored fields per video
| field | source |
|---|---|
| videoId, url, title, thumbnail | tutubo Video |
| tags | union of Video.keywords and inferred Video.tags |
| is_live, published, views, description | tutubo channel-grid metadata |
| playlist | only set when archived from a playlist |
See examples/ for end-to-end dataset-creation scripts.
YouTube (library)
from media_archivist import YoutubeArchivist
archivist = YoutubeArchivist(
db_path="./talks.json", # explicit file (or use db_name="..." for XDG)
blacklisted_kwords=["#shorts", "trailer"],
required_kwords=[], # all must appear in the title
)
# Channel — handles /channel/, /c/, /@handle, /user/
archivist.archive("https://www.youtube.com/@LinusTechTips")
# Playlist
archivist.archive("https://www.youtube.com/playlist?list=PL...")
# Single video (watch / youtu.be / shorts URLs)
archivist.archive("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
# All playlists of a channel
archivist.archive_channel_playlists("https://www.youtube.com/@LinusTechTips")
# Drop entries whose videos are no longer reachable
archivist.remove_unavailable()
for entry in archivist.sorted_entries():
print(entry["title"], entry["url"])
Note on duration: tutubo's bare Channel.videos / Playlist.videos iterators don't expose track length, so --min-duration is a no-op for plain channel scrapes. It does apply when length is available, i.e. with --music (YT Music tracks), --bandcamp, --soundcloud, --ia, and YouTube search-result previews. published is a relative string ("2 days ago") rather than a timestamp.
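Since published is a relative string, downstream code that needs a sortable value has to approximate one. The sketch below is not part of media_archivist: the function name is hypothetical, it only handles English "N units ago" strings, and months and years are approximated as 30 and 365 days.

```python
import re
from datetime import datetime, timedelta, timezone

# Approximate unit lengths in seconds (month = 30 days, year = 365 days).
_UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}
_PATTERN = re.compile(
    r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago", re.IGNORECASE
)


def approx_published(text, now=None):
    """Turn a string like '2 days ago' into an approximate UTC datetime, or None."""
    match = _PATTERN.search(text or "")
    if match is None:
        return None
    now = now or datetime.now(timezone.utc)
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2).lower()]
    return now - timedelta(seconds=seconds)
```

The result is relative to the moment of parsing, so it is only meaningful if computed close to when the entry was indexed.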
Background monitor
from media_archivist import YoutubeMonitor
mon = YoutubeMonitor(db_name="my_channels")
mon.start()
mon.monitor("https://www.youtube.com/@LinusTechTips") # re-syncs every sync_interval
mon.sync("https://www.youtube.com/@SomeOtherChannel") # one-shot
YoutubeMonitor.bootstrap_from_url(url) seeds an empty database from a remote
JSON dump — handy for distributing pre-built indexes.
YouTube Music (library)
from media_archivist import YoutubeMusicArchivist
m = YoutubeMusicArchivist(db_path="./songs.json", skip_explicit=True)
m.archive_search("lo-fi beats")
m.archive_playlist("https://music.youtube.com/playlist?list=PL...")
m.archive_album("MPREb_xxx") # browseId
m.archive_artist("UCxxx") # channelId
Each entry includes artist, album, year, duration (seconds), explicit,
video_type (MUSIC_VIDEO_TYPE_ATV etc.), audio_only, music_video.
Bandcamp (library)
from media_archivist import BandcampArchivist
bc = BandcampArchivist(db_path="./bandcamp.json")
bc.archive("https://artist.bandcamp.com/album/some-album")
bc.archive_artist("https://artist.bandcamp.com")
bc.archive_search("ambient drone")
Each entry stores artist, album, track_number, duration (seconds),
thumbnail, and stream (a direct audio URL when Bandcamp exposes one).
SoundCloud (library)
from media_archivist import SoundCloudArchivist
sc = SoundCloudArchivist(db_path="./sc.json", resolve_streams=True)
sc.archive("https://soundcloud.com/some-artist") # profile
sc.archive("https://soundcloud.com/some-artist/sets/some-set") # set
sc.archive_search("footwork")
resolve_streams=True calls nuvem_de_som's stream resolver per track and
stores the resulting MP3/HLS URL under stream.
Internet Archive (library)
from media_archivist import IAArchivist
ia = IAArchivist(db_path="./ia_movies.json")
ia.archive("classic_cartoons") # collection or single item id
ia.archive_item("Popeye_forPresident")
Stream URLs are filtered to formats in IAArchivist.VALID_FORMATS
(MPEG2, Ogg Video, 512Kb MPEG4, h.264).
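The format filter described above can be pictured as follows. This is a sketch rather than IAArchivist's actual code: the format tuple is copied from the list above, the function name is hypothetical, and the file dicts are assumed to carry the `name` and `format` keys that Internet Archive file metadata provides.

```python
from urllib.parse import quote

# Copied from the list above; the real constant lives on IAArchivist.VALID_FORMATS.
VALID_FORMATS = ("MPEG2", "Ogg Video", "512Kb MPEG4", "h.264")


def stream_urls(identifier, files, valid_formats=VALID_FORMATS):
    """Build download URLs for the files of one item whose format is accepted."""
    return [
        f"https://archive.org/download/{identifier}/{quote(f['name'])}"
        for f in files
        if f.get("format") in valid_formats and f.get("name")
    ]


if __name__ == "__main__":
    files = [
        {"name": "popeye.mp4", "format": "h.264"},
        {"name": "popeye.gif", "format": "Animated GIF"},
    ]
    print(stream_urls("Popeye_forPresident", files))
```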
Filtering helpers
All archivists inherit from JsonArchivist:
- remove_keyword(kwords): drop entries whose title matches any keyword
- remove_missing(keys): drop entries missing any of the given fields
- remove_below_duration(minutes): drop entries shorter than N minutes
- sorted_entries(): entries sorted by upload_ts (descending)
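To make the semantics of these helpers concrete, here is a standalone approximation that works on a plain list of entry dicts (for example, rows loaded from an export). It is not the JsonArchivist implementation: case-insensitive substring matching, treating empty strings as missing, and keeping entries that have no duration are all assumptions.

```python
def remove_keyword(entries, kwords):
    """Drop entries whose title contains any keyword (case-insensitive)."""
    lowered = [k.lower() for k in kwords]
    return [
        e for e in entries
        if not any(k in str(e.get("title", "")).lower() for k in lowered)
    ]


def remove_missing(entries, keys):
    """Drop entries where any of the given fields is absent or empty."""
    return [e for e in entries if all(e.get(k) not in (None, "") for k in keys)]


def remove_below_duration(entries, minutes):
    """Drop entries shorter than `minutes`; duration is stored in seconds."""
    threshold = minutes * 60
    # Assumption: entries without a duration are kept, not dropped.
    return [e for e in entries if e.get("duration") is None or e["duration"] >= threshold]


def sorted_entries(entries):
    """Newest first, by upload_ts; entries without one sort last."""
    return sorted(entries, key=lambda e: e.get("upload_ts", 0), reverse=True)
```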
Metadata providers
media-archivist canonicalize enriches indexed entries with external IDs
and structured metadata via the cross-source resolver in
metadatarr. The provider
registry, dispatcher, and ~24 built-in providers (MusicBrainz, Wikidata,
TMDB, AniList, Jikan, Google Books, LibriVox, Apple Podcasts, *arr family,
Discogs, Blu-ray.com, DVDCompare, OpenLibrary, Anna's Archive, Bandcamp,
SoundCloud, YouTube / YouTube Music, Metal Archives, …) all live in
metadatarr and self-register on import. See
docs/metadatarr.md for the full table.
All resolver providers — including metal_archives — live in metadatarr.
There are no media-archivist-specific resolver providers.
The resolver gates providers on three independent axes: media (MediaType),
modality (PlaybackModality — AUDIO / VIDEO / TEXT / INTERACTIVE / UNKNOWN),
and genre_filter (genre tag set). Callers constructing Signals directly can
pass modality=PlaybackModality.AUDIO to restrict resolution to audio-only
providers. See docs/metadatarr.md for details.
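The idea of gating on three independent axes can be illustrated with a toy model. Everything in this sketch is a stand-in: the `PlaybackModality` enum is redefined locally with the member names listed above, `ProviderSpec` and `eligible` are hypothetical, and the matching rules (an empty set means "no restriction", UNKNOWN modality passes) are assumptions, not metadatarr's documented behavior.

```python
from dataclasses import dataclass
from enum import Enum


class PlaybackModality(Enum):
    # Local stand-in mirroring the member names listed above.
    AUDIO = "audio"
    VIDEO = "video"
    TEXT = "text"
    INTERACTIVE = "interactive"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ProviderSpec:
    """Hypothetical description of what one provider accepts."""
    name: str
    media: frozenset = frozenset()         # accepted media type names
    modality: frozenset = frozenset()      # accepted PlaybackModality values
    genre_filter: frozenset = frozenset()  # required genre tags


def eligible(provider, media, modality, genres):
    """A provider runs only if all three axes pass independently."""
    media_ok = not provider.media or media in provider.media
    modality_ok = (
        not provider.modality
        or modality is PlaybackModality.UNKNOWN
        or modality in provider.modality
    )
    genre_ok = not provider.genre_filter or bool(provider.genre_filter & set(genres))
    return media_ok and modality_ok and genre_ok
```

Under these assumed rules, passing an AUDIO modality excludes a provider that declares only VIDEO, even when the media type and genre would otherwise match.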
License
Apache-2.0