A high-fidelity multimodal AI pipeline for J! data extraction
```
╺┳╸┏━┓┏━╸┏┓ ┏━╸┃┏╸
 ┃ ┣┳┛┣╸ ┣┻┓┣╸ ┣┻┓
 ╹ ╹┗╸┗━╸┗━┛┗━╸╹ ╹
```
The definitive multimodal AI pipeline for extracting structured game data from J! episodes.
From casual trivia lovers to ML engineers — one dataset to rule them all.
What is Trebek?
Trebek is an advanced, fault-tolerant pipeline that processes raw J! video recordings — not scraped web pages — and produces a surgically clean, event-sourced relational dataset of every game event that occurred on screen. It bridges local GPU compute (WhisperX, Pyannote), cloud LLMs (Google Gemini 3.1), and a deterministic Python state machine into a single, continuously running daemon.
The resulting dataset doesn't just capture questions and answers. It captures the full cognitive fingerprint of each game:
- ⚡ Millisecond-precision buzzer latencies — cross-referenced from visual podium illumination and acoustic buzz detection
- 🗣️ Speech disfluency tracking — ums, uhs, and stutters from WhisperX logprobs, not LLM hallucinations
- 🎲 Game-theory optimal wager analysis — calculated wagers compared against actual contestant choices
- 🧠 Semantic lateral distance — cosine distance on embeddings distinguishing wordplay from direct recall
- 🏗️ Board control & Forrest Bounce detection — strategic selection pattern analysis
- 📊 Coryat scores — calculated deterministically per contestant per episode
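The semantic lateral distance above reduces to a cosine distance between clue and response embedding vectors. A minimal sketch of the metric itself (the embedding model is out of scope here; any sentence-embedding vectors would do):

```python
from math import sqrt

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 for identical direction (direct recall),
    approaching 2.0 for opposite vectors (maximal lateral leap)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

A higher distance between a clue's embedding and its correct response's embedding suggests wordplay or indirection rather than a straight factual lookup.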
Trebek vs. J-Archive
Existing J! datasets are static text scrapes — frozen lists of clues and responses with no temporal, behavioral, or strategic context. Trebek extracts from the raw video, producing an entirely different class of dataset.
| Dimension | J-Archive / Scraped Data | Trebek |
|---|---|---|
| Source | Web scraping | Raw video processing |
| Buzzer timing | ❌ Not available | ✅ True ms-precision latency |
| Speech patterns | ❌ Not available | ✅ Disfluency counts, acoustic confidence |
| Wager analysis | Partial (raw numbers only) | ✅ Game-theory optimal + irrationality delta |
| Board control | ❌ Not available | ✅ Full selection order + Forrest Bounce index |
| Score adjustments | Sometimes noted | ✅ Chronologically anchored to exact clue index |
| Visual clues | Text description | ✅ Multimodal extraction from video frames |
| Semantic analysis | ❌ Not available | ✅ Clue-response embedding distances |
| Data format | Flat HTML / CSV | ✅ Normalized relational DB (9 tables) |
| Freshness | Depends on scraper maintenance | ✅ Process your own recordings on demand |
| Coryat scores | Manual fan calculation | ✅ Deterministic, per-contestant |
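For concreteness, the "game-theory optimal" baseline in the wager row can be illustrated with the classic leader's cover bet for Final Jeopardy! — a simplified two-player heuristic, not necessarily the exact rule Trebek implements:

```python
def leader_cover_wager(leader_score: int, second_score: int) -> int:
    """Leader's cover bet: wager just enough to finish one dollar ahead
    even if second place doubles up and responds correctly. A runaway
    game (leader > 2x second) needs no wager at all."""
    return max(0, 2 * second_score - leader_score + 1)

def irrationality_delta(actual_wager: int, optimal_wager: int) -> int:
    """How far the contestant's real bet strayed from the baseline."""
    return abs(actual_wager - optimal_wager)
```

Comparing each contestant's actual wager against such a baseline yields the "irrationality delta" in the table.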
Who Is This For?
| 🎯 Trivia Enthusiasts | 📊 Data Scientists | 🤖 ML Engineers |
|---|---|---|
| Explore your favorite episodes with deep analytics. Query buzzer speeds, track contestant strategies, and discover board control patterns across seasons. | A richly normalized relational dataset designed for analytical queries. 9 tables, foreign keys, embeddings, and temporal data — ready for your notebooks. | Train predictive models on human decision-making under televised pressure. Buzzer latency, wager irrationality, disfluency signals — features you can't get anywhere else. |
✨ Feature Highlights
🔄 True Crash Immunity
Database-backed queueing via SQLite pipeline_state. Kill the daemon at any point — SIGINT, SIGTERM, crash, power failure — and it resumes exactly where it left off. Zero data loss. Zero re-processing.
🧠 Multi-Pass LLM Architecture
- Pass 1 (Flash-Lite): Speaker anchoring from host interview audio
- Pass 2 (Pro): Map-reduce structured extraction with Pydantic self-healing retry
- Pass 3 (Pro): Multimodal visual clue reconstruction + podium illumination detection
⚙️ Deterministic State Machine
Pure Python TrebekStateMachine replays game events chronologically. LLMs extract facts; the state machine does all arithmetic. Running scores, True Daily Double resolution, Coryat scores, and game-theory optimal wagers — all calculated deterministically.
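As one concrete example of "the state machine does all arithmetic," a Coryat score can be replayed deterministically from extracted events. The event-dict shape here is illustrative, not Trebek's actual schema:

```python
def coryat(events: list[dict]) -> int:
    """Coryat score: the contestant's score with all wagering ignored.
    A correct Daily Double counts for the clue's face value, a missed
    Daily Double costs nothing, and Final Jeopardy! is excluded."""
    score = 0
    for ev in events:
        if ev["round"] == "FINAL":
            continue  # Final Jeopardy! never affects Coryat
        if ev["daily_double"]:
            if ev["correct"]:
                score += ev["clue_value"]  # face value, not the wager
        else:
            score += ev["clue_value"] if ev["correct"] else -ev["clue_value"]
    return score
```

Given the same event stream, this replay always produces the same number — the LLM only supplies the facts (who answered, correctly or not, on which clue).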
🔥 Warm Worker GPU Architecture
PyTorch/WhisperX model weights stay resident in VRAM. No cold starts. Automatic OOM recovery with pool restarts. Explicit memory management for multi-day inference runs.
🎯 Physics Engine
Cross-references visual podium illumination (Gemini Vision) with WhisperX acoustic boundaries to compute true contestant reaction speeds. Also calculates acoustic confidence scores and semantic lateral distance.
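The cross-referencing idea can be sketched as follows. Visual detection is quantized to frame boundaries, while the acoustic buzz has sub-frame resolution, so the acoustic estimate wins when the two agree within one frame period. This is a simplified illustration, not Trebek's actual fusion logic:

```python
def reaction_ms(clue_end_s: float, light_frame_s: float, buzz_audio_s: float,
                frame_period_s: float = 1 / 30) -> float:
    """Fuse visual podium illumination and acoustic buzz detection into
    one reaction-time estimate, in milliseconds after the clue ends."""
    visual = light_frame_s - clue_end_s      # quantized to video frames
    acoustic = buzz_audio_s - clue_end_s     # sub-frame precision
    # Trust the acoustic signal only when it corroborates the visual one.
    estimate = acoustic if abs(visual - acoustic) <= frame_period_s else visual
    return estimate * 1000.0
```

When the microphone picks up a spurious sound, the two estimates disagree by more than a frame and the fusion falls back to the visual signal.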
🗄️ Actor-Pattern Database
All SQLite writes are serialized through a single DatabaseWriter actor (asyncio.Queue + Future). No "database is locked" exceptions. Atomic transactions for high-throughput batched commits.
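The actor pattern here is compact enough to sketch in full. A single consumer task owns the connection, so SQLite never sees concurrent writers; callers await a Future that resolves only once their write has committed. A minimal version (details differ from Trebek's real DatabaseWriter):

```python
import asyncio
import sqlite3

class DatabaseWriter:
    """Serialize all writes through one task so the async workers can
    never trigger a 'database is locked' error."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queue: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        while True:
            sql, params, fut = await self.queue.get()
            try:
                with self.conn:  # one atomic transaction per write
                    self.conn.execute(sql, params)
                fut.set_result(None)
            except Exception as exc:
                fut.set_exception(exc)

    async def write(self, sql: str, params: tuple = ()) -> None:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((sql, params, fut))
        await fut  # caller resumes only after the commit lands
```

Batched commits then amount to draining several queued writes inside one transaction instead of one each.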
🚀 Quick Start
The fastest way to get Trebek running is using the official Docker image via Hybrid Mode. The lightweight CLI runs on your host, while the heavy GPU workloads (PyTorch, WhisperX) are safely delegated to the ghcr.io container.
```sh
# 1. Install lightweight CLI
pip install trebek

# 2. Configure (requires a free Gemini API key)
echo "GEMINI_API_KEY=your_key_here" > .env

# 3. Run with Docker GPU delegation
trebek run --input-dir /path/to/your/videos --docker
```
📖 Full installation guide: See SETUP.md for docker-compose deployments, native installations (no Docker), HuggingFace token configuration, and detailed CLI usage.
🏗️ Architecture deep-dive: See DESIGN.md for the complete system architecture, data model, pipeline stages, and safety invariants.
📊 Stats Dashboard
Run trebek stats for a live analytics dashboard showing pipeline health, cost tracking, stage timing, and recent episode status:
```
┌─ Pipeline Health ─────────────────────────────────────────┐
│ ✅ COMPLETED 42   ⏳ PENDING 3   ❌ FAILED 1              │
│ ████████████████████████████████████░░░░ 91.3%            │
├─ Cost & Performance ──────────────────────────────────────┤
│ Tokens: 12.4M in / 2.1M out          Cost: $4.82 USD      │
│ Peak VRAM: 14.2 GB                   Avg GPU: 87%         │
├─ Stage Timing (avg) ──────────────────────────────────────┤
│ transcribe: 4m 12s   extract: 2m 38s   verify: 0.4s       │
└───────────────────────────────────────────────────────────┘
```
🧪 Development
```sh
make all        # Full quality gate (test + lint + typecheck)
make test       # pytest with coverage
make lint       # ruff check
make typecheck  # mypy strict mode
```
| Tool | Purpose |
|---|---|
| pytest | Test runner (pytest-asyncio for async) |
| ruff | Linter + formatter (line-length 120) |
| mypy | Static type checker (strict mode) |
| pre-commit | Git hook enforcement |
📁 Project Structure
```
trebek/
├── trebek/
│   ├── cli.py            # CLI parser + Docker orchestration
│   ├── config.py         # Pydantic Settings + validators
│   ├── schemas.py        # Pydantic v2 data contracts
│   ├── schema.sql        # SQLite DDL (9 tables)
│   ├── state_machine.py  # Deterministic game state replay
│   ├── database/         # Actor-pattern writer + pipeline ops
│   ├── gpu/              # Warm Worker pool + VRAM management
│   ├── llm/              # Multi-pass Gemini extraction pipeline
│   ├── pipeline/         # Async orchestrator + stage workers
│   ├── analysis/         # Post-extraction analytics (embeddings)
│   └── ui/               # Rich console dashboard + rendering
├── tests/                # Comprehensive test suite
├── docs/                 # Design documents and plans
├── Dockerfile            # GPU-enabled container
├── docker-compose.yml    # One-command deployment
├── Makefile              # Developer shortcuts
└── pyproject.toml        # Build system + tool config
```
📄 License
Download files
File details
Details for the file trebek-1.1.3.tar.gz.
File metadata
- Download URL: trebek-1.1.3.tar.gz
- Upload date:
- Size: 104.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 81654c0364c4a97b43951a9ad7188d285a9b18c65685056eb78d34789764d3ff |
| MD5 | 31d4c05346d25a5a51ed7a676d95a79a |
| BLAKE2b-256 | 6274cd3c0c5285a9a9b0e9f2224fabca93faebfa1e0bf9928af385de65ef7b8c |
Provenance

The following attestation bundles were made for trebek-1.1.3.tar.gz:

Publisher: release.yml on arvarik/trebek

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: trebek-1.1.3.tar.gz
- Subject digest: 81654c0364c4a97b43951a9ad7188d285a9b18c65685056eb78d34789764d3ff
- Sigstore transparency entry: 1399214269
- Permalink: arvarik/trebek@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Branch / Tag: refs/tags/v1.1.3
- Owner: https://github.com/arvarik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Trigger Event: push
File details
Details for the file trebek-1.1.3-py3-none-any.whl.
File metadata
- Download URL: trebek-1.1.3-py3-none-any.whl
- Upload date:
- Size: 123.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7cae31e90a876ed8872ec781c7fe749802af2f3d87f1615adf1fe7f9e972b7d1 |
| MD5 | 23b2b0ae55907847e434c3cc37631008 |
| BLAKE2b-256 | cdee144d94c6f0dd23661c5bb6c66cb36c41e01fd6475deba539245bf597d910 |
Provenance

The following attestation bundles were made for trebek-1.1.3-py3-none-any.whl:

Publisher: release.yml on arvarik/trebek

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: trebek-1.1.3-py3-none-any.whl
- Subject digest: 7cae31e90a876ed8872ec781c7fe749802af2f3d87f1615adf1fe7f9e972b7d1
- Sigstore transparency entry: 1399214275
- Permalink: arvarik/trebek@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Branch / Tag: refs/tags/v1.1.3
- Owner: https://github.com/arvarik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Trigger Event: push