A high-fidelity multimodal AI pipeline for J! data extraction

 ╺┳╸┏━┓┏━╸┏┓ ┏━╸┃┏╸
  ┃ ┣┳┛┣╸ ┣┻┓┣╸ ┣┻┓
  ╹ ╹┗╸┗━╸┗━┛┗━╸╹ ╹

The definitive multimodal AI pipeline for extracting structured game data from J! episodes.

From casual trivia lovers to ML engineers — one dataset to rule them all.

Requires Python 3.11+.

What is Trebek?

Trebek is a fault-tolerant pipeline that processes raw J! video recordings, not scraped web pages, and produces an event-sourced relational dataset of every game event that occurred on screen. It combines local GPU compute (WhisperX, Pyannote), cloud LLMs (Google Gemini 3.1), and a deterministic Python state machine in a single, continuously running daemon.

The resulting dataset doesn't just capture questions and answers. It captures the full cognitive fingerprint of each game:

  • Millisecond-precision buzzer latencies — cross-referenced from visual podium illumination and acoustic buzz detection
  • 🗣️ Speech disfluency tracking — ums, uhs, and stutters from WhisperX logprobs, not LLM hallucinations
  • 🎲 Game-theory optimal wager analysis — calculated wagers compared against actual contestant choices
  • 🧠 Semantic lateral distance — cosine distance on embeddings distinguishing wordplay from direct recall
  • 🏗️ Board control & Forrest Bounce detection — strategic selection pattern analysis
  • 📊 Coryat scores — calculated deterministically per contestant per episode
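
As a rough illustration of the semantic lateral distance idea, the sketch below computes cosine distance between two embedding vectors in pure Python. The function name `cosine_distance` and the three-dimensional toy vectors are assumptions made for this example; the embedding model and vector size Trebek actually uses are not stated here.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
clue = [0.9, 0.1, 0.0]
direct_response = [0.8, 0.2, 0.0]    # points in nearly the same direction as the clue
wordplay_response = [0.1, 0.2, 0.9]  # points in a very different direction

near = cosine_distance(clue, direct_response)
far = cosine_distance(clue, wordplay_response)
print(f"direct recall: {near:.3f}, wordplay: {far:.3f}")
```

A small distance suggests the response sits close to the clue in embedding space (direct recall), while a large distance suggests a lateral jump such as wordplay.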

Trebek vs. J-Archive

Existing J! datasets are static text scrapes — frozen lists of clues and responses with no temporal, behavioral, or strategic context. Trebek extracts from the raw video, producing an entirely different class of dataset.

| Dimension | J-Archive / Scraped Data | Trebek |
|---|---|---|
| Source | Web scraping | Raw video processing |
| Buzzer timing | ❌ Not available | ✅ True ms-precision latency |
| Speech patterns | ❌ Not available | ✅ Disfluency counts, acoustic confidence |
| Wager analysis | Partial (raw numbers only) | ✅ Game-theory optimal + irrationality delta |
| Board control | ❌ Not available | ✅ Full selection order + Forrest Bounce index |
| Score adjustments | Sometimes noted | ✅ Chronologically anchored to exact clue index |
| Visual clues | Text description | ✅ Multimodal extraction from video frames |
| Semantic analysis | ❌ Not available | ✅ Clue-response embedding distances |
| Data format | Flat HTML / CSV | ✅ Normalized relational DB (9 tables) |
| Freshness | Depends on scraper maintenance | ✅ Process your own recordings on demand |
| Coryat scores | Manual fan calculation | ✅ Deterministic, per-contestant |

Who Is This For?

🎯 Trivia Enthusiasts

Explore your favorite episodes with deep analytics. Query buzzer speeds, track contestant strategies, and discover board control patterns across seasons.

📊 Data Scientists

A richly normalized relational dataset designed for analytical queries. 9 tables, foreign keys, embeddings, and temporal data — ready for your notebooks.

🤖 ML Engineers

Train predictive models on human decision-making under televised pressure. Buzzer latency, wager irrationality, disfluency signals — features you can't get anywhere else.


✨ Feature Highlights

🔄 True Crash Immunity

Database-backed queueing via SQLite pipeline_state. Kill the daemon at any point — SIGINT, SIGTERM, crash, power failure — and it resumes exactly where it left off. Zero data loss. Zero re-processing.
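
The general pattern behind a database-backed queue can be sketched with the standard-library `sqlite3` module. Only the table name `pipeline_state` and the PENDING and COMPLETED status labels appear in this README; the column names, the `IN_PROGRESS` status, and every function below are hypothetical. This sketch restarts an interrupted episode from its beginning, so resuming mid-episode would need extra per-stage bookkeeping that is not shown.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: only the table name pipeline_state comes from the README.
DDL = """
CREATE TABLE IF NOT EXISTS pipeline_state (
    episode_path TEXT PRIMARY KEY,
    status       TEXT NOT NULL DEFAULT 'PENDING'
)
"""


def init_queue(conn: sqlite3.Connection) -> None:
    """Create the table and requeue work left unfinished by a previous run."""
    with conn:
        conn.execute(DDL)
        conn.execute(
            "UPDATE pipeline_state SET status = 'PENDING' WHERE status = 'IN_PROGRESS'"
        )


def enqueue(conn: sqlite3.Connection, episode_path: str) -> None:
    """Add an episode once; re-adding a known path is a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO pipeline_state (episode_path) VALUES (?)",
            (episode_path,),
        )


def claim_next(conn: sqlite3.Connection) -> Optional[str]:
    """Mark the next pending episode IN_PROGRESS and return its path."""
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT episode_path FROM pipeline_state "
            "WHERE status = 'PENDING' ORDER BY episode_path LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE pipeline_state SET status = 'IN_PROGRESS' WHERE episode_path = ?",
            (row[0],),
        )
        return row[0]


def mark_completed(conn: sqlite3.Connection, episode_path: str) -> None:
    with conn:
        conn.execute(
            "UPDATE pipeline_state SET status = 'COMPLETED' WHERE episode_path = ?",
            (episode_path,),
        )
```

Because every state change is committed to SQLite before work proceeds, calling `init_queue` again after a crash is enough to put the interrupted episode back in line.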

🧠 Multi-Pass LLM Architecture

  • Pass 1 (Flash-Lite): Speaker anchoring from host interview audio
  • Pass 2 (Pro): Map-reduce structured extraction with Pydantic self-healing retry
  • Pass 3 (Pro): Multimodal visual clue reconstruction + podium illumination detection
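
The self-healing retry in Pass 2 can be pictured as a validate-and-reprompt loop. The sketch below is an assumption-heavy illustration: it swaps Pydantic for a small standard-library validator so it runs without dependencies, and `REQUIRED_FIELDS`, `validate`, `extract_with_retry`, and the `call_model` callable are all hypothetical names rather than Trebek's or Gemini's API.

```python
import json
from typing import Callable

# Hypothetical output schema for one extracted clue.
REQUIRED_FIELDS = {"clue": str, "response": str, "value": int}


def validate(raw: str) -> dict:
    """Parse model output as JSON and check field types.

    Raises ValueError with a readable message on any problem.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} must be of type {expected.__name__}")
    return data


def extract_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model, and on invalid output re-prompt with the validation error."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            # Feed the error back so the model can correct its own output.
            current_prompt = (
                f"{prompt}\n\nYour previous output was rejected: {exc}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

With Pydantic, the `validate` step would typically be a model validation call whose error message is passed back to the LLM in the same way.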

⚙️ Deterministic State Machine

Pure Python TrebekStateMachine replays game events chronologically. LLMs extract facts; the state machine does all arithmetic. Running scores, True Daily Double resolution, Coryat scores, and game-theory optimal wagers — all calculated deterministically.
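
To make the split between extraction and arithmetic concrete, here is a minimal replay sketch. `ClueEvent` and `replay` are hypothetical names, not the actual `TrebekStateMachine` interface. The Coryat rules used are the commonly cited ones: wagers are ignored, a correct Daily Double counts at the clue's face value, an incorrect Daily Double costs nothing, and the final round (not modeled here) is excluded.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ClueEvent:
    """One extracted fact: who responded, on which clue value, and the outcome."""
    contestant: str
    clue_value: int               # face value of the clue on the board
    correct: bool
    wager: Optional[int] = None   # set only for Daily Doubles


def replay(events: Iterable[ClueEvent]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Replay events in order and return (running scores, Coryat scores)."""
    scores: Dict[str, int] = defaultdict(int)
    coryat: Dict[str, int] = defaultdict(int)
    for event in events:
        is_daily_double = event.wager is not None
        # Actual score: Daily Doubles move by the wager, other clues by face value.
        stake = event.wager if is_daily_double else event.clue_value
        scores[event.contestant] += stake if event.correct else -stake
        # Coryat score: face value only, and no penalty for a missed Daily Double.
        if event.correct:
            coryat[event.contestant] += event.clue_value
        elif not is_daily_double:
            coryat[event.contestant] -= event.clue_value
    return dict(scores), dict(coryat)


events = [
    ClueEvent("Ann", 400, True),
    ClueEvent("Ann", 800, True, wager=1000),   # Daily Double, correct
    ClueEvent("Bob", 600, False),
    ClueEvent("Bob", 1000, False, wager=500),  # Daily Double, incorrect
]
scores, coryat = replay(events)
print(scores)  # {'Ann': 1400, 'Bob': -1100}
print(coryat)  # {'Ann': 1200, 'Bob': -600}
```

Keeping the arithmetic in plain Python means the same list of events always yields the same scores, regardless of which model extracted them.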

🔥 Warm Worker GPU Architecture

PyTorch/WhisperX model weights stay resident in VRAM. No cold starts. Automatic OOM recovery with pool restarts. Explicit memory management for multi-day inference runs.

🎯 Physics Engine

Cross-references visual podium illumination (Gemini Vision) with WhisperX acoustic boundaries to compute true contestant reaction speeds. Also calculates acoustic confidence scores and semantic lateral distance.
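
One plausible way to combine the two signals is to subtract the acoustic end of the clue reading from the time the podium lights up. The sketch below assumes exactly that reference point, and the names `frame_to_ms` and `buzzer_latency_ms` are hypothetical; the actual Physics Engine may anchor and refine these timestamps differently. In this sketch the visual timestamp is only as fine as one frame interval (about 33 ms at 30 fps).

```python
from typing import Optional


def frame_to_ms(frame_index: int, fps: float) -> int:
    """Convert a video frame index to a timestamp in milliseconds."""
    return round(frame_index * 1000 / fps)


def buzzer_latency_ms(clue_end_s: float, podium_lit_frame: int, fps: float) -> Optional[int]:
    """Latency between the end of the clue reading and podium illumination.

    clue_end_s:        acoustic boundary in seconds (e.g. end of the last transcribed word)
    podium_lit_frame:  first frame in which the podium is seen lit
    Returns None when the light precedes the clue end, which this sketch
    treats as a detection error rather than a real buzz.
    """
    latency = frame_to_ms(podium_lit_frame, fps) - round(clue_end_s * 1000)
    return latency if latency >= 0 else None


# Clue reading ends at 125.0 s; the podium lights on frame 3756 of 30 fps video.
print(buzzer_latency_ms(125.0, 3756, 30.0))  # 200
```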

🗄️ Actor-Pattern Database

All SQLite writes are serialized through a single DatabaseWriter actor (asyncio.Queue + Future), which avoids "database is locked" exceptions. Atomic transactions support high-throughput batched commits.
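
The actor pattern can be sketched in a few lines: one coroutine owns the only connection, callers enqueue a statement together with a Future, and the actor resolves that Future once the write commits. The class below borrows the name `DatabaseWriter` from the text, but its methods and behavior are assumptions for illustration, not the project's real interface. It also runs SQLite calls directly on the event loop, which a production writer might offload to a thread.

```python
import asyncio
import sqlite3
from typing import Any, Sequence


class DatabaseWriter:
    """Owns the only SQLite connection; every write passes through one queue."""

    def __init__(self, db_path: str) -> None:
        self._conn = sqlite3.connect(db_path)
        self._queue: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        """Process queued writes one at a time until a None sentinel arrives."""
        while True:
            item = await self._queue.get()
            if item is None:
                break
            sql, params, future = item
            try:
                with self._conn:  # one atomic transaction per request
                    cursor = self._conn.execute(sql, params)
                future.set_result(cursor.rowcount)
            except Exception as exc:
                future.set_exception(exc)
        self._conn.close()

    async def execute(self, sql: str, params: Sequence[Any] = ()) -> int:
        """Enqueue a write and wait until the actor has committed it."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((sql, params, future))
        return await future

    async def stop(self) -> None:
        await self._queue.put(None)


async def demo() -> int:
    writer = DatabaseWriter(":memory:")
    actor = asyncio.create_task(writer.run())
    await writer.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, note TEXT)")
    # Ten concurrent callers; the actor still applies the writes one by one.
    rowcounts = await asyncio.gather(
        *(writer.execute("INSERT INTO events (note) VALUES (?)", (f"event {i}",))
          for i in range(10))
    )
    await writer.stop()
    await actor
    return sum(rowcounts)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # 10
```

Since only one coroutine ever touches the connection, concurrent producers cannot contend for SQLite's write lock, and each caller still learns whether its own write succeeded.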


🚀 Quick Start

The fastest way to get Trebek running is using the official Docker image via Hybrid Mode. The lightweight CLI runs on your host, while the heavy GPU workloads (PyTorch, WhisperX) are safely delegated to the ghcr.io container.

# 1. Install lightweight CLI
pip install trebek

# 2. Configure (requires a free Gemini API key)
echo "GEMINI_API_KEY=your_key_here" > .env

# 3. Run with Docker GPU delegation
trebek run --input-dir /path/to/your/videos --docker

📖 Full installation guide: See SETUP.md for docker-compose deployments, native installations (no Docker), HuggingFace token configuration, and detailed CLI usage.

🏗️ Architecture deep-dive: See DESIGN.md for the complete system architecture, data model, pipeline stages, and safety invariants.


📊 Stats Dashboard

Run trebek stats for a live analytics dashboard showing pipeline health, cost tracking, stage timing, and recent episode status:

┌─ Pipeline Health ─────────────────────────────────────────┐
│  ✅ COMPLETED  42    ⏳ PENDING  3    ❌ FAILED  1       │
│  ████████████████████████████████████░░░░  91.3%          │
├─ Cost & Performance ──────────────────────────────────────┤
│  Tokens: 12.4M in / 2.1M out    Cost: $4.82 USD          │
│  Peak VRAM: 14.2 GB    Avg GPU: 87%                      │
├─ Stage Timing (avg) ──────────────────────────────────────┤
│  transcribe: 4m 12s    extract: 2m 38s    verify: 0.4s   │
└───────────────────────────────────────────────────────────┘

🧪 Development

make all          # Full quality gate (test + lint + typecheck)
make test         # pytest with coverage
make lint         # ruff check
make typecheck    # mypy strict mode

| Tool | Purpose |
|---|---|
| pytest | Test runner (pytest-asyncio for async) |
| ruff | Linter + formatter (line-length 120) |
| mypy | Static type checker (strict mode) |
| pre-commit | Git hook enforcement |

📁 Project Structure

trebek/
├── trebek/
│   ├── cli.py              # CLI parser + Docker orchestration
│   ├── config.py           # Pydantic Settings + validators
│   ├── schemas.py          # Pydantic v2 data contracts
│   ├── schema.sql          # SQLite DDL (9 tables)
│   ├── state_machine.py    # Deterministic game state replay
│   ├── database/           # Actor-pattern writer + pipeline ops
│   ├── gpu/                # Warm Worker pool + VRAM management
│   ├── llm/                # Multi-pass Gemini extraction pipeline
│   ├── pipeline/           # Async orchestrator + stage workers
│   ├── analysis/           # Post-extraction analytics (embeddings)
│   └── ui/                 # Rich console dashboard + rendering
├── tests/                  # Comprehensive test suite
├── docs/                   # Design documents and plans
├── Dockerfile              # GPU-enabled container
├── docker-compose.yml      # One-command deployment
├── Makefile                # Developer shortcuts
└── pyproject.toml          # Build system + tool config

📄 License

AGPL-3.0


Built for anyone who believes the best datasets are the ones you extract yourself.
