mediavocab

Reference vocabulary and pydantic data model for cataloguing media works: movies, music, books, comics, games, podcasts, audio dramas, radio, sound effects, and procedural ambient streams — all in a single shared schema.

mediavocab is a foundation library. It defines the vocabulary (enums, genre constants) and the structural models (Work, Release, Entity, Credit, Membership, Appearance). Application logic — provider clients, resolvers, playback, UI — lives outside this package.

Install

pip install mediavocab

The only runtime dependency is pydantic>=2. The taxonomy/ and text/ subpackages import nothing beyond the stdlib, so they are safe to vendor in minimal environments.

Quickstart

from mediavocab import (
    Credit, CreditSection, EntityKind, EntityRef, MediaType,
    RelationRole, Release, VariantKind, Work, WorkRelation, WorkRelationKind,
)
from mediavocab.text import score, work_hash

# Each cut is its own Work (spec §3.4); director's cut links via WorkRelation.
theatrical = Work(
    title="Blade Runner", media_type=MediaType.MOVIE,
    year=1982, runtime=117 * 60.0, production_country="US",
    variant_kind=VariantKind.THEATRICAL,
    credits=[Credit(
        entity=EntityRef(name="Ridley Scott", kind=EntityKind.PERSON),
        role="Director", relation_role=RelationRole.DIRECTOR,
        section=CreditSection.PRINCIPAL,
    )],
)
directors = Work(
    title="Blade Runner", media_type=MediaType.MOVIE,
    year=1992, runtime=116 * 60.0, production_country="US",
    variant_kind=VariantKind.DIRECTORS,
    relations=[WorkRelation(kind=WorkRelationKind.DERIVED_FROM, target=theatrical)],
)

# A Release manifests a Work — many formats, mirrors, packages per Work.
bluray = Release(work=theatrical, container="Blu-ray", region="US",
                 uri="file:///library/blade-runner.mkv")

print(work_hash(theatrical))            # stable SHA-256 identity hash
print(score(theatrical, theatrical))    # 1.0 (self-match)

More worked examples live in examples/, covering albums, band lineups, radio stations, IoT device routing, work comparison, and the pipeline-sentinel NOT_MEDIA / CONTROL flow.

What's in the box

| Module | Contents |
| --- | --- |
| mediavocab.taxonomy | MediaType (+ PIPELINE_SENTINELS), VariantKind, ReleasePackaging, EntityKind, OrganisationKind, RelationRole, CreditSection, MembershipKind, TemporalState, ReleaseStatus, StreamMode, WorkRelationKind, ReleaseRelationKind, ContentForm, ProgrammeFormat, AccessibilityKind, PlaybackType, plus GENRE_* string constants. Zero deps. |
| mediavocab.models | Work, Release, Appearance, Chapter, AccessibilityTrack, AvailabilityWindow, LocalizedTitle, WorkRelation, ReleaseRelation, Entity, EntityRef, Membership, Credit, ExternalIds, License, Signals. Pydantic v2. |
| mediavocab.text | Normalisation, fuzzy matching, work / release comparison and scoring, SHA-256 identity hashes (work_hash / release_hash), merge with MergeStrategy / IdentityConflict, title parser, content classifier, ISO 639 / 3166 / 8601 / ISBN helpers. Stdlib only. |
| mediavocab.helpers | Classifier predicates (is_not_media, is_device_entity, is_continuous_release), credit lookups (director, author, performers, filmography_of, episodes_of), and release availability / rights predicates (is_available, release_is_open, release_allows_commercial). Non-normative. |
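
To make the mediavocab.text row more concrete, the following is a stdlib-only sketch of the general technique it names: normalise a title, derive a SHA-256 identity hash from a few identity fields, and score two titles with a fuzzy ratio. The function names (normalise_title, identity_hash, title_score), the field choice, and the separator are assumptions made for illustration. They are not the signatures or the hash rule of mediavocab.text, which the spec (§6.3) defines.

```python
import hashlib
import re
import unicodedata
from difflib import SequenceMatcher


def normalise_title(title: str) -> str:
    """Case-fold, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.casefold())
    return " ".join(no_punct.split())


def identity_hash(title: str, media_type: str, year: int) -> str:
    """Hypothetical identity hash: SHA-256 over normalised identity fields."""
    # The unit-separator character keeps field boundaries unambiguous.
    payload = "\x1f".join([normalise_title(title), media_type, str(year)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def title_score(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two normalised titles."""
    return SequenceMatcher(None, normalise_title(a), normalise_title(b)).ratio()
```

Because the hash is taken over normalised fields, cosmetic differences in spelling (case, accents, punctuation) leave it unchanged, while a different year or media type produces a different digest.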

Design highlights

  • A type earns its place by changing the schema (A1). SOUND_EFFECT, PROCEDURAL_AMBIENT, AUDIO_DRAMA, MUSIC_VIDEO, etc. each catalogue against different external databases or with different runtime tolerances.
  • Devices are entities, not works (A3). EntityKind.DEVICE represents physical playback endpoints. The Work is still a RADIO/MOVIE/MUSIC; the device is how the consumer routes playback. A receiver-class device additionally has a Work counterpart for "turn on the radio" invocation.
  • Pipeline sentinels never reach a canonical Work (T8). MediaType.GENERIC, NOT_MEDIA, and CONTROL live on the resolver bag and are rejected at Work construction.
  • Each cut is its own Work (§3.4). Theatrical, director's, extended, remaster, fanedit — restructurings of the canonical artefact each get a new Work linked by WorkRelation. ReleasePackaging (deluxe / reissue / box-set / bootleg) is independent — that's how an edition ships.
  • PlaybackType is derived from MediaType (A6). AUDIO / VIDEO / PAGED / INTERACTIVE route resolver dispatch by playback intent. The value is never persisted on Work or Release; instead, declare playback_type: ClassVar[Set[PlaybackType]] on each provider.
  • Genre is a free List[str] with canonical spellings in mediavocab.taxonomy.genre. ASMR, ambient, anime, adult, etc. are genre tags applied across multiple media types — not types of their own (T1). Programme formats (documentary, concert, talk show) live in ProgrammeFormat, not in genres.
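
The sentinel rule in the list above (T8) can be illustrated with a small stand-in. This sketch uses a stdlib dataclass rather than the package's pydantic model, and the enum below is an illustrative subset with assumed string values; only the member names MOVIE, RADIO, GENERIC, NOT_MEDIA, and CONTROL and the PIPELINE_SENTINELS name come from the text. How mediavocab itself performs the rejection is not shown here.

```python
from dataclasses import dataclass
from enum import Enum


class MediaType(str, Enum):
    # Illustrative subset; the real enum lives in mediavocab.taxonomy.
    # The string values are assumptions.
    MOVIE = "movie"
    RADIO = "radio"
    GENERIC = "generic"
    NOT_MEDIA = "not_media"
    CONTROL = "control"


# Resolver-side values that must never appear on a canonical Work.
PIPELINE_SENTINELS = frozenset(
    {MediaType.GENERIC, MediaType.NOT_MEDIA, MediaType.CONTROL}
)


@dataclass(frozen=True)
class Work:
    """Stand-in for the real model: rejects sentinels at construction."""

    title: str
    media_type: MediaType

    def __post_init__(self) -> None:
        if self.media_type in PIPELINE_SENTINELS:
            raise ValueError(
                f"{self.media_type.name} is a pipeline sentinel, "
                "not a canonical media type"
            )
```

Under these assumptions, Work("Blade Runner", MediaType.MOVIE) constructs normally, while Work("volume up", MediaType.CONTROL) raises ValueError before any instance exists.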

See docs/ for full reference and pattern guides.
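
As a rough illustration of the PlaybackType highlight (A6), the sketch below derives a playback type from a media type and lets a provider declare which playback types it serves through a class-level set. The mapping table, the playback_type_of function, the Provider base class, and the handles method are all hypothetical; only the enum member names and the playback_type: ClassVar[Set[PlaybackType]] declaration come from the text above.

```python
from enum import Enum
from typing import ClassVar, Dict, Set


class MediaType(Enum):
    # Illustrative subset with assumed values.
    MOVIE = "movie"
    MUSIC = "music"
    RADIO = "radio"


class PlaybackType(Enum):
    AUDIO = "audio"
    VIDEO = "video"
    PAGED = "paged"
    INTERACTIVE = "interactive"


# Assumed mapping for illustration; the real derivation is fixed by the spec.
_PLAYBACK_BY_MEDIA: Dict[MediaType, PlaybackType] = {
    MediaType.MOVIE: PlaybackType.VIDEO,
    MediaType.MUSIC: PlaybackType.AUDIO,
    MediaType.RADIO: PlaybackType.AUDIO,
}


def playback_type_of(media_type: MediaType) -> PlaybackType:
    """Derive the playback type on demand instead of storing it."""
    return _PLAYBACK_BY_MEDIA[media_type]


class Provider:
    """Hypothetical provider base class."""

    playback_type: ClassVar[Set[PlaybackType]] = set()

    @classmethod
    def handles(cls, media_type: MediaType) -> bool:
        # Dispatch by playback intent: compare derived type to the declared set.
        return playback_type_of(media_type) in cls.playback_type


class AudioProvider(Provider):
    playback_type: ClassVar[Set[PlaybackType]] = {PlaybackType.AUDIO}
```

Keeping the value derived means there is a single place to change if a media type's playback intent is ever revised, and nothing stale can be left behind on stored records.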

Traceability — code cites the spec

The formal specification (docs/mediavocab_spec.md, SPEC_VERSION) is prescriptive: it is the source of truth, and the code implements it. Its §2 axioms (A1–A9) and theorems (T1–T9), together with §3–§8, admit and justify every axis, enum value, model field, and operation.

To keep the two bidirectionally traceable, every public enum, model, and operation carries a greppable clause citation in its docstring, naming the axiom, theorem, or section that admits it, in the form (spec: <clause>): for example (spec: A1, §3.2/§4.1) on MediaType, (spec: A8a, §3.3/§4.2) on ContentForm, and (spec: §6.3, A6) on work_hash. Field-level comments cite the §1.5 identity / routing / description family and, for identity-hash inputs, the hash rule (e.g. content_form → A8b enters work_hash §6.3; routing fields → excluded by A6). Validators cite the axiom they enforce (sentinel rejection → T8; one-MediaType-for-life → A4; org_kind warn → §4.5/A9).

List every citation with grep -rn "(spec:" mediavocab/. To find which clause governs a symbol, open its docstring; to find the code for a clause, grep the clause id (e.g. grep -rn "A8b" mediavocab/).
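
The same lookup can be done programmatically. The sketch below extracts (spec: ...) citations from docstring text and builds a clause-to-symbol index; the helper names clauses_in and index_by_clause are hypothetical, and splitting the clause list on commas and slashes is an assumption based on the citation examples above.

```python
import re
from typing import Dict, List

# Matches "(spec: A1, §3.2/§4.1)" and captures the clause list.
SPEC_CITATION = re.compile(r"\(spec:\s*([^)]+)\)")
# Clause lists are assumed to be separated by commas or slashes.
CLAUSE_SEPARATOR = re.compile(r"[,/]")


def clauses_in(docstring: str) -> List[str]:
    """Return every clause id cited in a docstring, in order of appearance."""
    clauses: List[str] = []
    for match in SPEC_CITATION.finditer(docstring):
        for part in CLAUSE_SEPARATOR.split(match.group(1)):
            part = part.strip()
            if part:
                clauses.append(part)
    return clauses


def index_by_clause(docstrings: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each clause id to the symbols whose docstrings cite it."""
    index: Dict[str, List[str]] = {}
    for symbol, doc in docstrings.items():
        for clause in clauses_in(doc):
            index.setdefault(clause, []).append(symbol)
    return index
```

In practice the docstrings mapping could be filled from each public object's __doc__ attribute, which gives the clause-to-code direction without leaving Python.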

Workspace position

mediavocab sits at the bottom of the stack. Every other package in this workspace depends on it:

                          mediavocab
                              ▲
        ┌───────────┬─────────┼─────────┬───────────┐
        │           │         │         │           │
      tutubo   pyfanedit   pymetal   pyo*…       py_bandcamp / nuvem-de-som
        ▲           ▲         ▲                       ▲
        └────────┬──┴─────────┴───────────────────────┘
                 │
              metadatarr  ◄── canonical resolver, ships every provider above
                 ▲
                 │
           media-archivist  ◄── source-DB orchestrator + sidecars + CLI/server

  • mediavocab: vocabulary + structural models (this package).
  • tutubo, pyfanedit, pymetal, py_bandcamp, nuvem_de_som, radiosoma, tunein, audiobooker: API clients / scrapers. Each emits mediavocab.Work / Release / Entity directly.
  • metadatarr: cross-source resolver framework. Bundles every first-party scraper as a hard runtime dep (no extras juggling) and ships ~24 providers under metadatarr.resolve.providers.
  • media-archivist: local source-DB indexer / canonicalizer / CLI / web server. Consumes metadatarr's resolver.
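
The layering described above can be checked mechanically. This sketch encodes the dependency edges as drawn in the diagram (a subset of the client packages, chosen for illustration) and uses the standard library's graphlib to compute an order in which every package comes after its dependencies. The DEPENDS_ON table is hand-written from the diagram and is not read from any package metadata.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Client packages named in the diagram (illustrative subset).
CLIENTS = ["tutubo", "pyfanedit", "pymetal", "py_bandcamp", "nuvem_de_som"]

# Maps each package to the packages it depends on.
DEPENDS_ON = {
    "mediavocab": set(),
    **{client: {"mediavocab"} for client in CLIENTS},
    "metadatarr": set(CLIENTS) | {"mediavocab"},
    "media-archivist": {"metadatarr"},
}

# static_order() yields dependencies before the packages that need them.
install_order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

With these edges, mediavocab is always first and media-archivist always last; the relative order of the client packages in between is not significant.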

Testing

pip install -e ".[test]"
pytest -q

License

Apache 2.0. See LICENSE.
