
A high-fidelity multimodal AI pipeline for J! data extraction


 ╺┳╸┏━┓┏━╸┏┓ ┏━╸┏┓╻
  ┃ ┣┳┛┣╸ ┣┻┓┣╸ ┣┻┓
  ╹ ╹┗╸┗━╸┗━┛┗━╸╹ ╹

The definitive multimodal AI pipeline for extracting structured game data from J! episodes.

From casual trivia lovers to ML engineers — one dataset to rule them all.

CI | Python 3.11+ | SQLite | Pydantic v2 | Google Gemini | WhisperX | Ruff | Mypy | License

What is Trebek?

Trebek is an advanced, fault-tolerant pipeline that processes raw J! video recordings — not scraped web pages — and produces a surgically clean, event-sourced relational dataset of every game event that occurred on screen. It bridges local GPU compute (WhisperX, Pyannote), cloud LLMs (Google Gemini 3.1), and a deterministic Python state machine into a single, continuously running daemon.

The resulting dataset doesn't just capture questions and answers. It captures the full cognitive fingerprint of each game:

  • Millisecond-precision buzzer latencies — cross-referenced from visual podium illumination and acoustic buzz detection
  • 🗣️ Speech disfluency tracking — ums, uhs, and stutters from WhisperX logprobs, not LLM hallucinations
  • 🎲 Game-theory optimal wager analysis — calculated wagers compared against actual contestant choices
  • 🧠 Semantic lateral distance — cosine distance on embeddings distinguishing wordplay from direct recall
  • 🏗️ Board control & Forrest Bounce detection — strategic selection pattern analysis
  • 📊 Coryat scores — calculated deterministically per contestant per episode
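To make the semantic lateral distance bullet concrete, here is a minimal sketch of cosine distance between a clue embedding and a response embedding. The function name and the toy 3-dimensional vectors are illustrative assumptions; the embedding model and the exact metric Trebek uses are not specified here.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
clue = [1.0, 0.0, 0.0]
direct_response = [2.0, 0.0, 0.0]      # same direction: small distance
wordplay_response = [0.0, 1.0, 0.0]    # orthogonal: large distance

if __name__ == "__main__":
    print(cosine_distance(clue, direct_response))
    print(cosine_distance(clue, wordplay_response))
```

Under this reading, a response that points the same way as its clue scores near 0 (direct recall), while a response that is semantically far from the clue text scores closer to 1 (a lateral leap such as wordplay).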

Trebek vs. J-Archive

Existing J! datasets are static text scrapes — frozen lists of clues and responses with no temporal, behavioral, or strategic context. Trebek extracts from the raw video, producing an entirely different class of dataset.

| Dimension | J-Archive / Scraped Data | Trebek |
|---|---|---|
| Source | Web scraping | Raw video processing |
| Buzzer timing | ❌ Not available | ✅ True ms-precision latency |
| Speech patterns | ❌ Not available | ✅ Disfluency counts, acoustic confidence |
| Wager analysis | Partial (raw numbers only) | ✅ Game-theory optimal + irrationality delta |
| Board control | ❌ Not available | ✅ Full selection order + Forrest Bounce index |
| Score adjustments | Sometimes noted | ✅ Chronologically anchored to exact clue index |
| Visual clues | Text description | ✅ Multimodal extraction from video frames |
| Semantic analysis | ❌ Not available | ✅ Clue-response embedding distances |
| Data format | Flat HTML / CSV | ✅ Normalized relational DB (9 tables) |
| Freshness | Depends on scraper maintenance | ✅ Process your own recordings on demand |
| Coryat scores | Manual fan calculation | ✅ Deterministic, per-contestant |

Who Is This For?

🎯 Trivia Enthusiasts

Explore your favorite episodes with deep analytics. Query buzzer speeds, track contestant strategies, and discover board control patterns across seasons.

📊 Data Scientists

A richly normalized relational dataset designed for analytical queries. 9 tables, foreign keys, embeddings, and temporal data — ready for your notebooks.

🤖 ML Engineers

Train predictive models on human decision-making under televised pressure. Buzzer latency, wager irrationality, disfluency signals — features you can't get anywhere else.


✨ Feature Highlights

🔄 True Crash Immunity

Database-backed queueing via SQLite pipeline_state. Kill the daemon at any point — SIGINT, SIGTERM, crash, power failure — and it resumes exactly where it left off. Zero data loss. Zero re-processing.
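The idea of a database-backed queue can be sketched with the standard library alone. Only the table name pipeline_state comes from this README; the columns, the PROCESSING status, and every function name below are assumptions made for illustration, and the real schema may differ.

```python
import sqlite3


def open_queue(path):
    """Open (or create) the queue database. Column layout is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pipeline_state ("
        " episode TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'PENDING')"
    )
    conn.commit()
    return conn


def enqueue(conn, episode):
    """Register an episode once; re-adding a known episode is a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO pipeline_state (episode) VALUES (?)", (episode,)
        )


def recover(conn):
    """On startup, return work interrupted mid-flight to the queue."""
    with conn:
        conn.execute(
            "UPDATE pipeline_state SET status = 'PENDING' WHERE status = 'PROCESSING'"
        )


def claim_next(conn):
    """Atomically mark the next pending episode as in progress and return it."""
    with conn:
        row = conn.execute(
            "SELECT episode FROM pipeline_state"
            " WHERE status = 'PENDING' ORDER BY episode LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE pipeline_state SET status = 'PROCESSING' WHERE episode = ?",
            (row[0],),
        )
        return row[0]


def complete(conn, episode):
    """Record that an episode finished; it will never be claimed again."""
    with conn:
        conn.execute(
            "UPDATE pipeline_state SET status = 'COMPLETED' WHERE episode = ?",
            (episode,),
        )
```

Because every state change is committed to disk before the next step starts, a restart only needs to call recover() and keep claiming work. In this sketch the interrupted item is retried from its last committed status, while completed items are never touched again.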

🧠 Multi-Pass LLM Architecture

  • Pass 1 (Flash-Lite): Speaker anchoring from host interview audio
  • Pass 2 (Pro): Map-reduce structured extraction with Pydantic self-healing retry
  • Pass 3 (Pro): Multimodal visual clue reconstruction + podium illumination detection

⚙️ Deterministic State Machine

Pure Python TrebekStateMachine replays game events chronologically. LLMs extract facts; the state machine does all arithmetic. Running scores, True Daily Double resolution, Coryat scores, and game-theory optimal wagers — all calculated deterministically.
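As an illustration of keeping arithmetic out of the LLM, the sketch below replays already-extracted clue events and derives both running scores and Coryat scores. It is not the TrebekStateMachine itself: ClueEvent, its fields, and replay are hypothetical names, and the sketch covers board clues only (the final round, which Coryat scoring ignores, is left out). The Coryat rule used is the standard one: wagers are disregarded, a correct Daily Double earns the clue's face value, and a missed Daily Double costs nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClueEvent:
    """One resolved board clue. All field names here are hypothetical."""
    contestant: str
    clue_value: int                # face value of the clue on the board
    correct: bool
    is_daily_double: bool = False
    wager: int = 0                 # only meaningful for Daily Doubles


def replay(events):
    """Replay events in order; return (running_scores, coryat_scores)."""
    scores, coryat = {}, {}
    for ev in events:
        scores.setdefault(ev.contestant, 0)
        coryat.setdefault(ev.contestant, 0)
        sign = 1 if ev.correct else -1
        if ev.is_daily_double:
            # The real score moves by the wager.
            scores[ev.contestant] += sign * ev.wager
            # Coryat: correct earns face value, incorrect costs nothing.
            if ev.correct:
                coryat[ev.contestant] += ev.clue_value
        else:
            scores[ev.contestant] += sign * ev.clue_value
            coryat[ev.contestant] += sign * ev.clue_value
    return scores, coryat
```

Since the function is a pure fold over an ordered event list, replaying the same events always yields the same totals, which is the property the deterministic design relies on.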

🔥 Warm Worker GPU Architecture

PyTorch/WhisperX model weights stay resident in VRAM. No cold starts. Automatic OOM recovery with pool restarts. Explicit memory management for multi-day inference runs.

🎯 Physics Engine

Cross-references visual podium illumination (Gemini Vision) with WhisperX acoustic boundaries to compute true contestant reaction speeds. Also calculates acoustic confidence scores and semantic lateral distance.
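One plausible way to combine the two signals is sketched below, under a stated assumption: reaction time is measured from the end of the last word of the clue read (taken from word-level timestamps) to the moment a podium lights up. The function names and the word dictionary shape ("word", "start", "end" in seconds) are assumptions for illustration, not Trebek's actual interface.

```python
def clue_end_time(words):
    """Return the end time (seconds) of the last spoken word of the clue."""
    if not words:
        raise ValueError("no words to anchor on")
    return max(w["end"] for w in words)


def buzzer_latency_ms(words, podium_lit_s):
    """Milliseconds between the end of the clue read and podium illumination.

    Returns None when the light precedes the end of the read, which this
    sketch treats as a misaligned measurement rather than a real buzz.
    """
    delta = podium_lit_s - clue_end_time(words)
    if delta < 0:
        return None
    return round(delta * 1000)


# Hypothetical word timings for a short clue read.
words = [
    {"word": "This", "start": 10.0, "end": 10.2},
    {"word": "river", "start": 10.2, "end": 10.5},
]

if __name__ == "__main__":
    print(buzzer_latency_ms(words, 10.75))
```

Returning None for negative deltas keeps obviously misaligned pairs out of downstream statistics instead of recording them as impossibly fast buzzes.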

🗄️ Actor-Pattern Database

All SQLite writes are serialized through a single DatabaseWriter actor (asyncio.Queue + Future). No "database is locked" exceptions. Atomic transactions for high-throughput batched commits.
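The single-writer pattern can be sketched as follows. QueueWriter and demo are hypothetical names chosen for this example, not the project's DatabaseWriter; the sketch commits one statement per transaction and runs SQLite on the event-loop thread, whereas a production writer would likely batch statements and move blocking I/O off the loop.

```python
import asyncio
import sqlite3


class QueueWriter:
    """Hypothetical single-writer actor: one task owns the connection."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._queue = asyncio.Queue()
        self._task = None

    def start(self):
        """Start the consumer task; must be called inside a running loop."""
        self._task = asyncio.create_task(self._run())

    async def execute(self, sql, params=()):
        """Enqueue a statement and wait for its result via a Future."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((sql, params, fut))
        return await fut

    async def _run(self):
        while True:
            item = await self._queue.get()
            if item is None:          # shutdown sentinel
                break
            sql, params, fut = item
            try:
                with self._conn:      # commit on success, roll back on error
                    rows = self._conn.execute(sql, params).fetchall()
            except Exception as exc:
                if not fut.cancelled():
                    fut.set_exception(exc)
            else:
                if not fut.cancelled():
                    fut.set_result(rows)

    async def close(self):
        await self._queue.put(None)
        await self._task
        self._conn.close()


async def demo():
    writer = QueueWriter(":memory:")
    writer.start()
    await writer.execute("CREATE TABLE events (n INTEGER)")
    # 100 concurrent callers, but only the actor task touches the connection.
    await asyncio.gather(
        *(writer.execute("INSERT INTO events VALUES (?)", (i,)) for i in range(100))
    )
    rows = await writer.execute("SELECT COUNT(*) FROM events")
    await writer.close()
    return rows[0][0]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because callers never hold the connection themselves, two writes cannot contend for SQLite's write lock; each caller simply awaits the Future attached to its queued statement.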


🚀 Quick Start

# 1. Install
pip install trebek

# 2. Configure (just need a free Gemini API key)
echo "GEMINI_API_KEY=your_key_here" > .env

# 3. Run
trebek run --input-dir /path/to/your/videos

📖 Full installation guide: See SETUP.md for Docker setup (recommended), GPU dependencies, HuggingFace token configuration, and detailed usage instructions.

🏗️ Architecture deep-dive: See DESIGN.md for the complete system architecture, data model, pipeline stages, and safety invariants.


📊 Stats Dashboard

Run trebek stats for a live analytics dashboard showing pipeline health, cost tracking, stage timing, and recent episode status:

┌─ Pipeline Health ─────────────────────────────────────────┐
│  ✅ COMPLETED  42    ⏳ PENDING  3    ❌ FAILED  1       │
│  ████████████████████████████████████░░░░  91.3%          │
├─ Cost & Performance ──────────────────────────────────────┤
│  Tokens: 12.4M in / 2.1M out    Cost: $4.82 USD          │
│  Peak VRAM: 14.2 GB    Avg GPU: 87%                      │
├─ Stage Timing (avg) ──────────────────────────────────────┤
│  transcribe: 4m 12s    extract: 2m 38s    verify: 0.4s   │
└───────────────────────────────────────────────────────────┘

🧪 Development

make all          # Full quality gate (test + lint + typecheck)
make test         # pytest with coverage
make lint         # ruff check
make typecheck    # mypy strict mode

| Tool | Purpose |
|---|---|
| pytest | Test runner (pytest-asyncio for async) |
| ruff | Linter + formatter (line-length 120) |
| mypy | Static type checker (strict mode) |
| pre-commit | Git hook enforcement |

📁 Project Structure

trebek/
├── trebek/
│   ├── cli.py              # CLI parser + Docker orchestration
│   ├── config.py           # Pydantic Settings + validators
│   ├── schemas.py          # Pydantic v2 data contracts
│   ├── schema.sql          # SQLite DDL (9 tables)
│   ├── state_machine.py    # Deterministic game state replay
│   ├── database/           # Actor-pattern writer + pipeline ops
│   ├── gpu/                # Warm Worker pool + VRAM management
│   ├── llm/                # Multi-pass Gemini extraction pipeline
│   ├── pipeline/           # Async orchestrator + stage workers
│   ├── analysis/           # Post-extraction analytics (embeddings)
│   └── ui/                 # Rich console dashboard + rendering
├── tests/                  # Comprehensive test suite
├── docs/                   # Design documents and plans
├── Dockerfile              # GPU-enabled container
├── docker-compose.yml      # One-command deployment
├── Makefile                # Developer shortcuts
└── pyproject.toml          # Build system + tool config

📄 License

AGPL-3.0


Built for anyone who believes the best datasets are the ones you extract yourself.
