Media indexer for YouTube, YouTube Music, Internet Archive, Bandcamp and SoundCloud — index streams, download on demand

media_archivist

Cross-source media indexer. Builds a local JSON database of stream metadata from YouTube, YouTube Music, Internet Archive, Bandcamp and SoundCloud.

Backend           Library                          What you can index
YouTube           tutubo                           channels, playlists, videos (no API key)
YouTube Music     tutubo.ytmus (via ytmusicapi)    tracks, albums, artists, playlists
Internet Archive  internetarchive                  items, collections
Bandcamp          py_bandcamp                      tracks, albums, artists, tag/search
SoundCloud        nuvem_de_som                     tracks, sets, profiles, search

media_archivist is metadata-only: it indexes streams; it does not download them. Pair it with yt-dlp (or SoundCloud's resolve_stream, Bandcamp's track.stream) for on-demand extraction, or use the JSON DB to drive dataset-collection scripts, recommender experiments, OVOS skills, etc.

Ships as both a Python library and a media-archivist CLI.

Install

pip install media_archivist                   # core (YouTube + IA + YT Music)
pip install "media_archivist[bandcamp]"       # + py_bandcamp
pip install "media_archivist[soundcloud]"     # + nuvem_de_som
pip install "media_archivist[all]"            # everything (quotes keep zsh from globbing the brackets)

CLI

Every subcommand takes either:

  • --db-file PATH: explicit path to a .json file (recommended for datasets you want to commit alongside scripts), or
  • --db NAME: auto-place under XDG at ~/.local/share/media_archivist/<NAME>.json.

# Index a channel, a playlist, or individual videos
media-archivist add --db-file talks.json https://www.youtube.com/@LinusTechTips
media-archivist add --db-file talks.json --blacklist "#shorts" \
    "https://www.youtube.com/playlist?list=PL..."

# Browse the DB
media-archivist list  --db-file talks.json --limit 20
media-archivist list  --db-file talks.json --grep "review" --json
media-archivist stats --db-file talks.json

# Pair with yt-dlp — index once, download on demand
media-archivist urls --db-file talks.json --grep "tutorial" | yt-dlp -a -

# Drop dead videos / unwanted titles
media-archivist prune --db-file talks.json --unavailable --blacklist sponsor

# Background-monitor a set of URLs (re-syncs every --interval seconds)
media-archivist monitor --db-file talks.json --interval 600 \
    https://www.youtube.com/@LinusTechTips \
    https://www.youtube.com/@SomeOtherChannel

# Internet Archive
media-archivist add --db-file ia_movies.json --ia classic_cartoons
media-archivist urls --db-file ia_movies.json | xargs -n1 -P4 wget

# YouTube Music — rich track metadata (artist, album, year, duration, explicit)
media-archivist add --db-file songs.json --music --skip-explicit "lo-fi beats"
media-archivist add --db-file songs.json --music \
    "https://music.youtube.com/playlist?list=PL..."

# Bandcamp — tracks have direct stream URLs in the entry
media-archivist add --db-file bandcamp.json --bandcamp \
    "https://artistname.bandcamp.com/album/some-album"
media-archivist add --db-file bandcamp.json --bandcamp "ambient drone"

# SoundCloud — search, profile, or set URLs
media-archivist add --db-file sc.json --soundcloud \
    "https://soundcloud.com/some-artist"
media-archivist add --db-file sc.json --soundcloud "footwork"

Pick the backend with --ia, --music, --bandcamp, or --soundcloud (default: YouTube). Every other subcommand (list, export, urls, prune, merge, stats, …) works the same way against any backend's DB.

DBs are plain JSON — edit, back up, version-control, share. With --db NAME the file is managed under XDG via json_database.
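
To locate a --db NAME database from your own scripts, the path pattern above can be rebuilt with the standard library. This is a minimal sketch: the helper name xdg_db_path is hypothetical, and honouring the XDG_DATA_HOME environment variable is an assumption based on the XDG convention, not something confirmed for json_database.

```python
import os
from pathlib import Path


def xdg_db_path(name: str) -> Path:
    """Return the location where a DB created with --db NAME is expected to live."""
    # Assumption: XDG_DATA_HOME is honoured, with ~/.local/share as the fallback.
    data_home = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(data_home) / "media_archivist" / f"{name}.json"


print(xdg_db_path("my_channels"))
```

If the printed file does not exist on your machine, check where the CLI actually wrote the DB before relying on this path.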

Building datasets

Because the index holds only metadata, downloads happen on demand via yt-dlp (or any other tool that reads URLs). The export, import, merge, and stats subcommands turn the JSON DB into a workable dataset.
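
The same index-first, download-later flow can be driven from Python. The sketch below is illustrative only: write_batch_file and ytdlp_command are hypothetical helpers, the sample rows are made up, and the case-insensitive title match is an assumption. The only external fact it relies on is that yt-dlp accepts a URL list through --batch-file (the long form of -a).

```python
import tempfile
from pathlib import Path


def write_batch_file(entries, grep, path):
    """Write the URLs of entries whose title contains `grep`, one per line."""
    needle = grep.lower()
    urls = [
        entry["url"]
        for entry in entries
        if entry.get("url") and needle in entry.get("title", "").lower()
    ]
    Path(path).write_text("".join(f"{url}\n" for url in urls), encoding="utf-8")
    return urls


def ytdlp_command(batch_file):
    """Build (but do not run) a yt-dlp invocation that reads URLs from a file."""
    return ["yt-dlp", "--batch-file", str(batch_file)]


# Hypothetical rows, shaped like the title/url fields the index stores.
entries = [
    {"title": "Soldering tutorial", "url": "https://example.invalid/watch?v=1"},
    {"title": "Channel trailer", "url": "https://example.invalid/watch?v=2"},
]
batch = Path(tempfile.mkdtemp()) / "urls.txt"
selected = write_batch_file(entries, "tutorial", batch)
print(selected, ytdlp_command(batch))
```

With yt-dlp installed, the returned list can be handed to subprocess.run(..., check=True) to start the download.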

# Build an index of three channels into one explicit file
media-archivist add --db-file documentaries.json \
    https://www.youtube.com/@FreeDocumentary \
    https://www.youtube.com/@FDSpace \
    https://www.youtube.com/@FreeDocumentaryOcean

# Project specific fields → CSV (great for pandas / sklearn)
media-archivist export --db-file documentaries.json --format csv \
    --fields videoId,title,url,published,tags,description \
    -o documentaries.csv

# JSONL is the canonical "one-row-per-line" format for ML pipelines
media-archivist export --db-file documentaries.json --format jsonl \
    -o documentaries.jsonl

# Just URLs (txt) for downstream tools
media-archivist export --db-file documentaries.json --format txt \
    -o urls.txt

# Inspect coverage before training
media-archivist stats --db-file documentaries.json

# Merge per-topic indexes into a master dataset
media-archivist merge --db-file all_docs.json \
    space.json ocean.json nature.json --overwrite

# Round-trip: import an existing JSONL produced elsewhere
media-archivist import --db-file talks.json talks.jsonl --overwrite

Output formats

--format          Use case
jsonl (default)   streaming pipelines, HuggingFace datasets, jq
json              small datasets, human inspection
csv               pandas, spreadsheets (list/dict fields are auto-serialized to JSON strings)
txt               flat URL list for yt-dlp -a - / wget -i / xargs

Combine with --fields to project only what you need, --grep to filter by title substring, and --limit N to cap row count.
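
If you would rather post-process a full JSONL export than re-run the CLI, the three options map onto a few lines of Python. This is a sketch under assumptions: select_rows is a hypothetical helper, each line is taken to be one JSON object with a title key, and the match is case-insensitive (the CLI's actual matching rules may differ).

```python
import io
import json
from itertools import islice


def select_rows(lines, fields=None, grep=None, limit=None):
    """Filter by title substring, project the chosen fields, cap the row count."""
    def rows():
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the export
            row = json.loads(line)
            if grep and grep.lower() not in str(row.get("title", "")).lower():
                continue
            yield {key: row.get(key) for key in fields} if fields else row

    # islice stops early, so a large export is never read past `limit` matches.
    return list(islice(rows(), limit))


# Hypothetical export content; a real file would be opened with open(path).
sample = io.StringIO(
    '{"videoId": "a1", "title": "Ocean tutorial", "url": "https://example.invalid/a1"}\n'
    '{"videoId": "b2", "title": "Space review", "url": "https://example.invalid/b2"}\n'
    '{"videoId": "c3", "title": "Tutorial: reefs", "url": "https://example.invalid/c3"}\n'
)
picked = select_rows(sample, fields=["videoId", "url"], grep="tutorial", limit=1)
print(picked)
```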

Stored fields per video

field                                    source
videoId, url, title, thumbnail           tutubo Video
tags                                     union of Video.keywords and inferred Video.tags
is_live, published, views, description   tutubo channel-grid metadata
playlist                                 only set when archived from a playlist
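
Because the CSV export serializes list and dict fields such as tags to JSON strings, a reader has to decode them again. The sketch below shows one way to do that with the standard library; read_export_csv is a hypothetical helper and the sample row is made up for illustration.

```python
import csv
import io
import json


def read_export_csv(handle, json_fields=("tags",)):
    """Read CSV rows and decode the columns that hold JSON-encoded values."""
    rows = []
    for row in csv.DictReader(handle):
        for field in json_fields:
            raw = row.get(field)
            if raw:
                try:
                    row[field] = json.loads(raw)
                except json.JSONDecodeError:
                    pass  # not valid JSON: keep the cell as plain text
        rows.append(row)
    return rows


# Build a tiny CSV in memory that mimics a list field stored as a JSON string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["videoId", "title", "tags"])
writer.writeheader()
writer.writerow(
    {"videoId": "a1", "title": "Deep sea", "tags": json.dumps(["ocean", "documentary"])}
)
buffer.seek(0)

rows = read_export_csv(buffer)
print(rows[0]["tags"])
```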

See examples/ for end-to-end dataset-creation scripts.

YouTube (library)

from media_archivist import YoutubeArchivist

archivist = YoutubeArchivist(
    db_path="./talks.json",       # explicit file (or use db_name="..." for XDG)
    blacklisted_kwords=["#shorts", "trailer"],
    required_kwords=[],           # all must appear in the title
)

# Channel — handles /channel/, /c/, /@handle, /user/
archivist.archive("https://www.youtube.com/@LinusTechTips")

# Playlist
archivist.archive("https://www.youtube.com/playlist?list=PL...")

# Single video (watch / youtu.be / shorts URLs)
archivist.archive("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

# All playlists of a channel
archivist.archive_channel_playlists("https://www.youtube.com/@LinusTechTips")

# Drop entries whose videos are no longer reachable
archivist.remove_unavailable()

for entry in archivist.sorted_entries():
    print(entry["title"], entry["url"])

Note on duration: tutubo's bare Channel.videos and Playlist.videos iterators don't expose track length, so --min-duration has no effect on plain channel or playlist scrapes. It applies wherever a length is available: with --music (YT Music tracks), --bandcamp, --soundcloud, --ia, and YouTube search-result previews.

Note on published: the value is a relative string ("2 days ago") rather than a timestamp.
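
When a dataset needs sortable dates, a relative published string can be turned into an approximate timestamp at indexing time. This is a rough sketch, not part of the library: approximate_timestamp is a hypothetical helper, it assumes English strings of the form "N units ago", and it treats a month as 30 days and a year as 365 days.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,   # approximation
    "year": 365 * 86400,   # approximation
}
_PATTERN = re.compile(
    r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago", re.IGNORECASE
)


def approximate_timestamp(published, now=None):
    """Convert a string like '2 days ago' to a datetime, or None if it doesn't match."""
    match = _PATTERN.search(published or "")
    if match is None:
        return None
    now = now or datetime.now(timezone.utc)
    amount, unit = int(match.group(1)), match.group(2).lower()
    return now - timedelta(seconds=amount * _UNIT_SECONDS[unit])


reference = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(approximate_timestamp("2 days ago", now=reference))
```

The result drifts further from the true upload date the coarser the unit is, so store it alongside the original string rather than replacing it.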

Background monitor

from media_archivist import YoutubeMonitor

mon = YoutubeMonitor(db_name="my_channels")
mon.start()
mon.monitor("https://www.youtube.com/@LinusTechTips")  # re-syncs every sync_interval
mon.sync("https://www.youtube.com/@SomeOtherChannel")  # one-shot

YoutubeMonitor.bootstrap_from_url(url) seeds an empty database from a remote JSON dump — handy for distributing pre-built indexes.

YouTube Music (library)

from media_archivist import YoutubeMusicArchivist

m = YoutubeMusicArchivist(db_path="./songs.json", skip_explicit=True)
m.archive_search("lo-fi beats")
m.archive_playlist("https://music.youtube.com/playlist?list=PL...")
m.archive_album("MPREb_xxx")          # browseId
m.archive_artist("UCxxx")             # channelId

Each entry includes artist, album, year, duration (seconds), explicit, video_type (MUSIC_VIDEO_TYPE_ATV etc.), audio_only, music_video.
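
Since duration is stored in seconds and explicit as a flag, simple coverage checks need no extra tooling. The sketch below totals playtime per artist over plain dicts; playtime_by_artist is a hypothetical helper and the sample entries are invented, using only the field names listed above.

```python
from collections import defaultdict


def playtime_by_artist(entries, include_explicit=False):
    """Sum track durations (seconds) per artist, optionally skipping explicit tracks."""
    totals = defaultdict(int)
    for entry in entries:
        if entry.get("explicit") and not include_explicit:
            continue
        totals[entry.get("artist", "unknown")] += int(entry.get("duration") or 0)
    return dict(totals)


# Hypothetical entries using the field names described for YT Music tracks.
sample_entries = [
    {"title": "Track A", "artist": "Artist One", "duration": 180, "explicit": False},
    {"title": "Track B", "artist": "Artist One", "duration": 200, "explicit": True},
    {"title": "Track C", "artist": "Artist Two", "duration": 240, "explicit": False},
]
print(playtime_by_artist(sample_entries))
```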

Bandcamp (library)

from media_archivist import BandcampArchivist

bc = BandcampArchivist(db_path="./bandcamp.json")
bc.archive("https://artist.bandcamp.com/album/some-album")
bc.archive_artist("https://artist.bandcamp.com")
bc.archive_search("ambient drone")

Each entry stores artist, album, track_number, duration (seconds), thumbnail, and stream (a direct audio URL when Bandcamp exposes one).

SoundCloud (library)

from media_archivist import SoundCloudArchivist

sc = SoundCloudArchivist(db_path="./sc.json", resolve_streams=True)
sc.archive("https://soundcloud.com/some-artist")     # profile
sc.archive("https://soundcloud.com/some-artist/sets/some-set")  # set
sc.archive_search("footwork")

resolve_streams=True calls nuvem_de_som's stream resolver per track and stores the resulting MP3/HLS URL under stream.

Internet Archive (library)

from media_archivist import IAArchivist

ia = IAArchivist(db_path="./ia_movies.json")
ia.archive("classic_cartoons")           # collection or single item id
ia.archive_item("Popeye_forPresident")

Stream URLs are filtered to formats in IAArchivist.VALID_FORMATS (MPEG2, Ogg Video, 512Kb MPEG4, h.264).
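
The format filter can be pictured as follows. This is a sketch rather than the library's code: stream_urls is a hypothetical helper, the file dicts assume the name and format keys found in Internet Archive item metadata, the exact-match comparison is an assumption, and the sample item is invented.

```python
from urllib.parse import quote

# Formats named in the text above.
VALID_FORMATS = ("MPEG2", "Ogg Video", "512Kb MPEG4", "h.264")


def stream_urls(identifier, files, valid_formats=VALID_FORMATS):
    """Return download URLs for the files whose format is in the allow-list."""
    return [
        f"https://archive.org/download/{identifier}/{quote(item['name'])}"
        for item in files
        if item.get("format") in valid_formats
    ]


# Hypothetical file listing for a made-up item identifier.
files = [
    {"name": "movie.mpeg", "format": "MPEG2"},
    {"name": "movie_thumb.jpg", "format": "Thumbnail"},
    {"name": "movie 512kb.mp4", "format": "512Kb MPEG4"},
]
print(stream_urls("example_item", files))
```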

Filtering helpers

All archivists inherit from JsonArchivist:

  • remove_keyword(kwords) — drop entries whose title matches any keyword
  • remove_missing(keys) — drop entries missing any of the given fields
  • remove_below_duration(minutes) — drop entries shorter than N minutes
  • sorted_entries() — entries sorted by upload_ts (descending)
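
To make the intent of these helpers concrete, here are stand-alone stand-ins that operate on a plain list of dicts. They are assumptions about behaviour, not JsonArchivist's implementation: the real methods act on the archivist's own database, and details such as case-insensitive matching or keeping entries that lack a duration are guesses (the latter chosen to match the note that duration filtering is a no-op when length is unavailable).

```python
def remove_keyword(entries, kwords):
    """Drop entries whose title contains any keyword (case-insensitive here)."""
    lowered = [kword.lower() for kword in kwords]
    return [
        entry
        for entry in entries
        if not any(kword in entry.get("title", "").lower() for kword in lowered)
    ]


def remove_missing(entries, keys):
    """Drop entries that lack a value for any of the given fields."""
    return [entry for entry in entries if all(entry.get(key) is not None for key in keys)]


def remove_below_duration(entries, minutes):
    """Drop entries shorter than `minutes`; duration is assumed to be in seconds."""
    threshold = minutes * 60
    return [
        entry
        for entry in entries
        if entry.get("duration") is None or entry["duration"] >= threshold
    ]


# Hypothetical entries for illustration.
entries = [
    {"title": "Full documentary", "url": "u1", "duration": 3600},
    {"title": "Sponsor message", "url": "u2", "duration": 90},
    {"title": "Short clip", "url": "u3", "duration": 45},
    {"title": "Untimed upload", "url": None},
]
kept = remove_below_duration(
    remove_missing(remove_keyword(entries, ["sponsor"]), ["url"]), minutes=1
)
print([entry["title"] for entry in kept])
```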

Metadata providers

media-archivist canonicalize enriches indexed entries with external IDs and structured metadata via the cross-source resolver in metadatarr. The provider registry, dispatcher, and ~24 built-in providers (MusicBrainz, Wikidata, TMDB, AniList, Jikan, Google Books, LibriVox, Apple Podcasts, *arr family, Discogs, Blu-ray.com, DVDCompare, OpenLibrary, Anna's Archive, Bandcamp, SoundCloud, YouTube / YouTube Music, Metal Archives, …) all live in metadatarr and self-register on import. See docs/metadatarr.md for the full table.

media-archivist itself ships no resolver providers of its own; even metal_archives is registered by metadatarr.

The resolver gates providers on three independent axes: media (MediaType), modality (PlaybackModality — AUDIO / VIDEO / TEXT / INTERACTIVE / UNKNOWN), and genre_filter (genre tag set). Callers constructing Signals directly can pass modality=PlaybackModality.AUDIO to restrict resolution to audio-only providers. See docs/metadatarr.md for details.
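
The idea of gating on three independent axes can be illustrated with a toy model. Everything below is hypothetical and is not metadatarr's API: Modality only mirrors the member names listed above, SketchProvider and SketchSignals are invented stand-ins, media types are plain strings, and the rule that an unset axis matches anything is an assumption. See docs/metadatarr.md for the real behaviour.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import FrozenSet, Optional


class Modality(Enum):
    """Stand-in mirroring the PlaybackModality names mentioned in the text."""
    AUDIO = auto()
    VIDEO = auto()
    TEXT = auto()
    INTERACTIVE = auto()
    UNKNOWN = auto()


@dataclass(frozen=True)
class SketchProvider:  # hypothetical, for illustration only
    name: str
    media: Optional[str] = None
    modality: Optional[Modality] = None
    genre_filter: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class SketchSignals:  # hypothetical, for illustration only
    media: Optional[str] = None
    modality: Optional[Modality] = None
    genres: FrozenSet[str] = frozenset()


def is_eligible(provider, signals):
    """A provider runs only if it passes all three gates independently."""
    media_ok = None in (provider.media, signals.media) or provider.media == signals.media
    modality_ok = (
        None in (provider.modality, signals.modality)
        or provider.modality == signals.modality
    )
    # Assumed rule: a provider with a genre filter needs at least one matching genre.
    genre_ok = not provider.genre_filter or bool(provider.genre_filter & signals.genres)
    return media_ok and modality_ok and genre_ok


providers = [
    SketchProvider("musicbrainz", media="music", modality=Modality.AUDIO),
    SketchProvider("tmdb", media="movie", modality=Modality.VIDEO),
    SketchProvider("metal_archives", media="music", modality=Modality.AUDIO,
                   genre_filter=frozenset({"metal"})),
]
audio_only = SketchSignals(modality=Modality.AUDIO)
print([p.name for p in providers if is_eligible(p, audio_only)])
```

In this toy model, adding genres=frozenset({"metal"}) to the signals also lets the genre-filtered provider through, which is the sense in which the axes are independent.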

License

Apache-2.0
