Trebek
A high-fidelity multimodal AI pipeline for data extraction.
A highly resilient, fault-tolerant data extraction pipeline for transcribing and extracting structured game events from Jeopardy! episodes.
Trebek is an advanced orchestration system that bridges local GPU compute (WhisperX, Pyannote), Cloud LLMs (Google Gemini 3.1 Pro, Gemini 3.1 Flash-Lite), and a deterministic Python state machine into a single, continuously running pipeline daemon. It extracts highly accurate, chronological, and structurally validated data from raw Jeopardy! video episodes into a normalized relational format designed for RAG semantic searches and game-theoretic analysis.
The resulting dataset captures not just the questions and answers, but the full cognitive fingerprint of each game: true buzzer reaction times, speech disfluency counts, wager irrationality deltas, board control patterns, and semantic lateral distances between clues and responses.
Table of Contents
- Why Trebek?
- Core Features
- System Architecture
- Pipeline Stages
- Data Model
- ML/AI Integration
- Installation
- Configuration
- Usage
- Development
- Project Structure
- Safety Invariants
- Design Philosophy
Why Trebek?
Existing Jeopardy! datasets are typically scraped text archives: static lists of clues and responses with no temporal, behavioral, or strategic context. Trebek fills this gap by processing the raw video, producing a dataset that includes:
- Millisecond-precision buzzer latencies calculated from cross-referencing visual podium illumination timestamps with acoustic buzz detection.
- Disfluency tracking (ums, uhs, stutters) via WhisperX word-level logprobs, not LLM hallucinations.
- Game-theory optimal wager calculations compared against actual contestant wagers to quantify irrationality.
- Semantic lateral distance between clues and responses, distinguishing wordplay from direct factual recall.
- Forrest Bounce detection and board control analysis for strategic game modeling.
The target audience is ML engineers, data scientists, and researchers who need a surgically clean, event-sourced dataset of human decision-making under televised pressure for predictive modeling.
Core Features
Database-Backed Queueing (True Resumability)
Uses a persistent SQLite `pipeline_state` table to manage jobs across all stages of execution. The daemon can be interrupted at any point (SIGINT, SIGTERM, or a crash) and will resume exactly where it left off. No data is lost; no re-processing is required.
VRAM Fragmentation Immunity
Local GPU operations (PyTorch/WhisperX) are sandboxed in a `ProcessPoolExecutor` with `max_tasks_per_child=1`. Each worker process exits after every episode, releasing all of its VRAM back to the driver. This makes the system immune to PyTorch's internal memory fragmentation during multi-day inference runs, a problem `torch.cuda.empty_cache()` alone cannot solve.
Multi-Pass LLM Architecture
- Pass 1 (Gemini 3.1 Flash-Lite): Fast speaker anchoring. Extracts a rigid `{SPEAKER_XX: "Name"}` mapping from the host interview segment to prevent hallucinations in later passes.
- Pass 2 (Gemini 3.1 Pro): Massive structured extraction of clues, buzzes, and wagers into strict JSON. Includes a Pydantic self-healing retry loop: if the LLM output fails schema validation, the `ValidationError` is injected back into the prompt for automatic correction (up to 2 retries).
- Pass 3 (Gemini 3.1 Pro): Multimodal augmentation for visual clue reconstruction and exact podium lockout illumination frame detection.
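The Pass 2 retry loop is a simple mechanic. This dependency-free sketch replaces the Pydantic model and the Gemini client with hypothetical stand-ins (`validate_episode`, `fake_llm`) purely to show the error-feedback shape:

```python
import json

def validate_episode(raw: str) -> dict:
    """Stand-in for Episode.model_validate_json: raises on bad shape."""
    data = json.loads(raw)
    if "clues" not in data or not isinstance(data["clues"], list):
        raise ValueError("field 'clues': list required")
    return data

def fake_llm(prompt: str) -> str:
    # First call returns malformed output; once the error text is fed
    # back, the "model" corrects itself. A real client would call Gemini.
    if "field 'clues'" in prompt:
        return '{"clues": []}'
    return '{"cluez": []}'

def extract_with_retries(prompt: str, max_retries: int = 2) -> dict:
    for _attempt in range(max_retries + 1):
        raw = fake_llm(prompt)
        try:
            return validate_episode(raw)
        except ValueError as err:
            # Self-healing: inject the validation error into the prompt.
            prompt += f"\nYour previous output was invalid: {err}. Fix it."
    raise RuntimeError("schema validation failed after retries")

result = extract_with_retries("Extract the game events as JSON.")
print(result)  # {'clues': []}
```

The key property is that the retry prompt carries the exact validation error, so the model corrects the specific field that failed rather than regenerating blindly.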
Deterministic State Machine
A pure Python TrebekStateMachine replays extracted atomic game events chronologically to:
- Calculate perfect running scores (never trusting LLMs to do arithmetic).
- Resolve "True Daily Double" wagers at runtime against current scores.
- Apply chronologically anchored score adjustments (judge reversals) at exactly the right moment.
- Track board control shifts and detect Forrest Bounce patterns.
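The replay idea can be sketched in a few lines. The event dicts below are illustrative only, not the real extraction schema, and the Daily Double wager cap is simplified to "the larger of the running score and a supplied round maximum":

```python
def replay(events: list[dict]) -> dict[str, int]:
    """Replay extracted events in selection order; all math lives here."""
    scores: dict[str, int] = {}
    for ev in sorted(events, key=lambda e: e["selection_order"]):
        name = ev["contestant"]
        scores.setdefault(name, 0)
        if ev["kind"] == "clue":
            delta = ev["value"] if ev["correct"] else -ev["value"]
        elif ev["kind"] == "daily_double":
            # Wager legality depends on the running score at selection
            # time, which only a chronological replay can know.
            cap = max(scores[name], ev["round_max"])
            wager = min(ev["wager"], cap)
            delta = wager if ev["correct"] else -wager
        else:  # "adjustment": judge reversal anchored at this exact point
            delta = ev["points"]
        scores[name] += delta
    return scores

events = [
    {"selection_order": 1, "kind": "clue", "contestant": "Ken", "value": 400, "correct": True},
    {"selection_order": 2, "kind": "daily_double", "contestant": "Ken",
     "wager": 1000, "round_max": 1000, "correct": True},
    {"selection_order": 3, "kind": "adjustment", "contestant": "Ken", "points": -400},
]
print(replay(events))  # {'Ken': 1000}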
Physics Engine (True Buzzer Latency)
Cross-references visual podium illumination timestamps (from Gemini Vision) with WhisperX's acoustic word-level boundaries to calculate true contestant reaction speeds, independent of host cadence variance. Also computes:
- Acoustic confidence scores from raw WhisperX logprobs.
- Deterministic disfluency counts (ums/uhs) from acoustic data, not LLM guesses.
- Semantic lateral distance via cosine distance on text embeddings.
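Semantic lateral distance is plain cosine distance over embedding vectors. The three-dimensional vectors below are toy values standing in for real text-embedding outputs:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

# Toy embeddings (real ones come from a text-embedding model):
clue = [0.9, 0.1, 0.0]
direct_response = [0.88, 0.12, 0.05]   # factual recall: small distance
wordplay_response = [0.1, 0.2, 0.95]   # lateral leap: large distance

print(cosine_distance(clue, direct_response) <
      cosine_distance(clue, wordplay_response))  # True
```

A clue/response pair with a large distance is flagged as wordplay rather than direct recall; the threshold itself is a modeling choice left to downstream analysis.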
Actor-Pattern Database Writer
All SQLite writes are routed through a single `DatabaseWriter` actor, an asyncio task owning an internal `asyncio.Queue`. This serializes concurrent write requests, preventing `database is locked` exceptions. Every enqueued operation returns an `asyncio.Future` protected by `asyncio.wait_for()` to prevent silent deadlocks.
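The actor pattern can be sketched compactly. This is a minimal stand-in, not the real `core_database.py`: one task owns the connection, callers get a Future, and `wait_for` bounds how long they will wait:

```python
import asyncio
import sqlite3

class DatabaseWriter:
    """All writes funnel through one task, so SQLite sees one writer."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queue: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        while True:
            sql, params, fut = await self.queue.get()
            try:
                cur = self.conn.execute(sql, params)
                self.conn.commit()
                fut.set_result(cur.rowcount)
            except Exception as exc:
                fut.set_exception(exc)

    async def execute(self, sql: str, params: tuple = ()) -> int:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((sql, params, fut))
        # Timeout guards against silent deadlocks if the actor dies.
        return await asyncio.wait_for(fut, timeout=5.0)

async def main() -> int:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    writer = DatabaseWriter(conn)
    actor = asyncio.create_task(writer.run())
    n = await writer.execute("INSERT INTO t VALUES (?)", (1,))
    actor.cancel()
    return n

print(asyncio.run(main()))  # 1
```

Callers never touch the connection directly; they only ever await a Future, which is what makes concurrent workers safe against lock contention.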
System Architecture
```
┌────────────────────────────────────────────────┐
│           TrebekPipelineOrchestrator           │
│              (asyncio event loop)              │
├──────────┬───────────┬──────────┬──────────────┤
│ Ingest   │ GPU       │ LLM      │ State Machine│
│ Worker   │ Worker    │ Worker   │ Worker       │
│          │           │          │              │
│ polls    │ FFmpeg +  │ Flash-   │ Score verify │
│ input/   │ WhisperX  │ Lite +   │ Board ctrl   │
│ dir      │ Pyannote  │ Pro      │ Wager math   │
└────┬─────┴─────┬─────┴────┬─────┴──────┬───────┘
     │           │          │            │
     ▼           ▼          ▼            ▼
┌────────────────────────────────────────────────┐
│             DatabaseWriter (Actor)             │
│       asyncio.Queue → sqlite3.Connection       │
│   journal_mode=WAL | foreign_keys=ON           │
│   busy_timeout=5000 | auto_vacuum=INCREMENTAL  │
└────────────────────────────────────────────────┘
                        │
                        ▼
┌────────────────────────────────────────────────┐
│                SQLite Database                 │
│  pipeline_state │ episodes │ contestants      │
│  clues │ buzz_attempts │ wagers               │
│  score_adjustments │ episode_performances     │
│  job_telemetry                                │
└────────────────────────────────────────────────┘
```
Concurrency Model
| Layer | Technology | Purpose |
|---|---|---|
| I/O Orchestration | `asyncio` event loop | State polling, signal handling, worker coordination |
| GPU Isolation | `ProcessPoolExecutor` (spawn) | Subprocess dies after every task, reclaiming 100% of VRAM |
| Write Serialization | Actor pattern (Queue + Future) | Prevents SQLite `database is locked` errors |
| CPU Offloading | `asyncio.to_thread()` | Keeps Pydantic JSON validation off the event loop |
| IPC Optimization | Filepath strings over `.json.gz` | Avoids pickling large JSON across process boundaries |
Pipeline Stages
The pipeline processes each episode through a rigorous sequence of stages, with the pipeline_state table acting as a persistent, crash-safe queue:
| Stage | Name | Engine | Description |
|---|---|---|---|
| 1 | Ingestion | Filesystem polling | New video files registered as `PENDING` |
| 2–3 | GPU Extraction | FFmpeg + WhisperX | Audio extraction, transcription, diarization |
| 4 | Commercial Filtering | Gemini Flash-Lite | Ad removal preserving word-level timings |
| 5 | Structured Extraction | Flash-Lite + Pro | Speaker anchoring, then full event extraction |
| 6 | Multimodal Augmentation | Gemini Pro | Visual clue + podium illumination detection |
| 7 | State Verification | `TrebekStateMachine` | Deterministic score/adjustment validation |
| 8–9 | Relational & Semantic Commit | `DatabaseWriter` | Normalized INSERT + vector embeddings |
If any stage fails, the episode status is set to FAILED and logged for manual review. The daemon continues processing other episodes.
Data Model
The SQLite schema is designed as a normalized relational model optimized for analytical queries:
Core Tables
```
pipeline_state
├── episode_id (PK)
├── status              PENDING → TRANSCRIBING →
│                       TRANSCRIPT_READY → CLEANED →
│                       SAVING → VECTORIZING → COMPLETED
├── transcript_path     Filepath to .json.gz output
├── created_at
└── updated_at

episodes
├── episode_id (PK)
├── air_date
├── host_name
└── is_tournament

contestants
├── contestant_id (PK)
├── name
├── occupational_category
└── is_returning_champion

episode_performances
├── episode_id (FK)
├── contestant_id (FK)
├── podium_position     1 (left), 2 (center), 3 (right)
├── coryat_score
├── final_score
└── forrest_bounce_index

clues
├── clue_id (PK)
├── episode_id (FK)
├── round               Jeopardy / Double / Final
├── category / board_row / board_col
├── selection_order
├── clue_text / correct_response
├── is_daily_double / daily_double_wager
├── host_start_timestamp_ms
├── host_finish_timestamp_ms
├── clue_syllable_count
├── requires_visual_context
├── clue_embedding (BLOB)
├── response_embedding (BLOB)
└── semantic_lateral_distance

buzz_attempts
├── attempt_id (PK)
├── clue_id (FK) / contestant_id (FK)
├── attempt_order
├── buzz_timestamp_ms
├── podium_light_timestamp_ms
├── true_buzzer_latency_ms
├── is_lockout_inferred
├── response_given / is_correct
├── brain_freeze_duration_ms
├── true_acoustic_confidence_score
├── disfluency_count
└── phonetic_similarity_score

wagers
├── wager_id (PK)
├── clue_id (FK) / contestant_id (FK)
├── running_score_at_time
├── actual_wager
├── game_theory_optimal_wager
└── wager_irrationality_delta

score_adjustments
├── adjustment_id (PK)
├── episode_id (FK) / contestant_id (FK)
├── points_adjusted
├── reason
└── effective_after_clue_selection_order

job_telemetry
├── episode_id (FK)
├── peak_vram_mb / avg_gpu_utilization_pct
├── gemini_total_input_tokens
├── gemini_total_output_tokens
├── gemini_total_cached_tokens
├── gemini_total_cost_usd
├── stage_ingestion_ms
├── stage_gpu_extraction_ms
├── stage_structured_extraction_ms
├── stage_vectorization_ms
├── gemini_api_latency_ms
└── pydantic_retry_count
```
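The schema's integrity constraints are enforced by SQLite itself when `foreign_keys=ON` is set, as the pipeline does. The cut-down DDL below is illustrative (two columns only, not the real `schema.sql`) and shows an orphaned buzz attempt being rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # same pragma the pipeline sets
conn.executescript("""
CREATE TABLE clues (clue_id INTEGER PRIMARY KEY);
CREATE TABLE buzz_attempts (
    attempt_id INTEGER PRIMARY KEY,
    clue_id INTEGER NOT NULL REFERENCES clues(clue_id),
    true_buzzer_latency_ms INTEGER CHECK (true_buzzer_latency_ms >= 0)
);
""")

# A buzz attached to a real clue is accepted:
conn.execute("INSERT INTO clues (clue_id) VALUES (1)")
conn.execute(
    "INSERT INTO buzz_attempts (clue_id, true_buzzer_latency_ms) VALUES (1, 150)"
)

# A buzz pointing at a nonexistent clue is rejected at the database layer:
try:
    conn.execute("INSERT INTO buzz_attempts (clue_id) VALUES (999)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

Note that `foreign_keys` is a per-connection pragma, so every connection the pipeline opens must set it explicitly.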
Pydantic Data Contracts
All LLM extraction outputs are validated against strict Pydantic v2 models defined in trebek/schemas.py. Key models include:
| Model | Description |
|---|---|
| `Episode` | Top-level container: contestants, clues, final jeopardy, score adjustments |
| `Clue` | Board position, temporal bounds, Daily Double metadata, buzz attempts |
| `BuzzAttempt` | Per-buzz reaction data: timestamps, lockout inference, response text |
| `Contestant` | Name, podium position, occupation category, champion status |
| `FinalJeopardy` | Category, clue text, per-contestant wagers and responses |
| `ScoreAdjustment` | Chronologically anchored point corrections with reasons |
| `JobTelemetry` | Hardware signatures, token usage, latency, and cost tracking |
ML/AI Integration
| Provider | Model | Stage | Application |
|---|---|---|---|
| Local GPU | WhisperX / Pyannote | 2–3 | Large-v3 float16 transcription, diarization |
| Google | Gemini 3.1 Flash-Lite | 4–5 | Speaker anchoring, commercial filtering |
| Google | Gemini 3.1 Pro | 5 | Structured extraction + Pydantic self-healing |
| Google | Gemini 3.1 Pro | 6 | Visual clue + podium illumination detection |
| Local/API | Text Embeddings | 9 | Cosine distance for `semantic_lateral_distance` |
Installation
Prerequisites
| Requirement | Notes |
|---|---|
| Python | 3.11 or higher |
| FFmpeg | Required for audio extraction from video files |
| NVIDIA GPU | 16GB VRAM recommended (RTX 4060 Ti / 5060 Ti) |
| CUDA Toolkit | Required for WhisperX GPU acceleration |
| SQLite | 3.35+ (for RETURNING clause support) |
| Gemini API Key | Required for LLM extraction stages |
Quick Start
```shell
# 1. Install the package
pip install trebek

# 2. Create your config
cp .env.example .env
# Edit .env with your GEMINI_API_KEY
# Get a free key at https://aistudio.google.com/apikey

# 3. Run the pipeline
trebek
```
GPU Dependencies (Optional)
For native GPU processing without Docker:
```shell
pip install torch torchaudio \
    --index-url https://download.pytorch.org/whl/cu121
pip install whisperx pyannote.audio
```
If you prefer not to install these heavy dependencies, use the built-in Docker wrapper instead (see below).
Docker Hybrid Execution (Recommended)
To completely bypass complex PyTorch and CUDA dependency issues on your host, Trebek includes a seamless Docker orchestrator.
Prerequisites:
- Docker and the NVIDIA Container Toolkit installed on your host.
Usage:
Append the `--docker` flag to any `trebek` command. The CLI automatically spins up the official GPU-enabled container, mapping your working directory and `.env` variables:

```shell
trebek --docker
trebek --docker --once --input-dir ./videos
```
WARNING: SQLite WAL Mode & Network Drives. Trebek uses SQLite Write-Ahead Logging (WAL), which requires strict POSIX advisory locking. Your `trebek.db` volume must be mounted on a local disk (ext4, NTFS, APFS). Mapping it to a network share (NFS, SMB, CIFS) will result in database corruption.
Configuration
Trebek uses Pydantic Settings for configuration, automatically loading values from environment variables or a .env file in the project root.
Create a .env file:
```
# ─── Core Paths ───
db_path=trebek.db
output_dir=gpu_outputs
input_dir=input_videos

# ─── API Keys ───
# Get your key at https://aistudio.google.com/apikey
GEMINI_API_KEY=your_api_key_here

# ─── Logging ───
log_level=INFO

# ─── GPU Constraints ───
gpu_vram_target_gb=16
whisper_batch_size=8
whisper_compute_type=float16
```
Configuration Validation
The Settings class enforces runtime constraints via Pydantic field validators:
| Setting | Constraint | Default |
|---|---|---|
| `gpu_vram_target_gb` | Between 4 and 24 (inclusive) | 16 |
| `whisper_compute_type` | `float16` or `float32` | `float16` |
| `whisper_batch_size` | Must be greater than 0 | 8 |
Invalid configurations will raise a ValidationError at startup, preventing the daemon from running with unsafe GPU parameters.
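The same fail-fast behavior can be sketched without Pydantic; this dataclass is a dependency-free stand-in for the real `Settings` class, enforcing the constraints from the table above:

```python
from dataclasses import dataclass

@dataclass
class Settings:
    """Dependency-free stand-in for the Pydantic Settings constraints."""
    gpu_vram_target_gb: int = 16
    whisper_compute_type: str = "float16"
    whisper_batch_size: int = 8

    def __post_init__(self) -> None:
        # Reject unsafe GPU parameters before the daemon ever starts.
        if not 4 <= self.gpu_vram_target_gb <= 24:
            raise ValueError("gpu_vram_target_gb must be between 4 and 24")
        if self.whisper_compute_type not in ("float16", "float32"):
            raise ValueError("whisper_compute_type must be float16 or float32")
        if self.whisper_batch_size <= 0:
            raise ValueError("whisper_batch_size must be > 0")

print(Settings().gpu_vram_target_gb)  # 16
try:
    Settings(gpu_vram_target_gb=64)
except ValueError as err:
    print("rejected:", err)
```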
Usage
Trebek is designed to run as a continuous daemon. Once started, it will recursively scan the configured input_dir (and all subdirectories) for video files and orchestrate the full extraction pipeline automatically. Trebek supports 12 video formats natively: MP4, TS, MKV, AVI, MOV, WebM, MPG, MPEG, FLV, WMV, M2TS, and VOB.
Start the Pipeline
```shell
# Point at a media library with nested folders
trebek --input-dir /path/to/TV/Jeopardy

# Docker mode (recommended)
trebek --docker --input-dir /path/to/TV/Jeopardy

# Process current queue then exit
trebek --once

# Preview discovered files without processing
trebek --dry-run

# View database analytics dashboard
trebek --stats
```
Process Episodes
- Point `--input-dir` at any directory (nested season folders work automatically).
- The ingestion worker recursively discovers all video files and registers them as `PENDING`.
- Each episode flows through the pipeline stages automatically.
- Monitor progress via the Rich console output, or run `trebek --stats` to view aggregate metrics.
Graceful Shutdown
Send SIGINT (Ctrl+C) or SIGTERM to the process. The daemon will:
- Stop accepting new work.
- Cancel all running async tasks.
- Wait for the GPU subprocess to complete its current task.
- Flush and close the database connection.
- Render a final session summary with telemetry.
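The signal-handling core of this sequence can be sketched as follows (a simplified stand-in for the real orchestrator: `add_signal_handler` is Unix-only, and a timer stands in for a real Ctrl+C):

```python
import asyncio
import signal

async def daemon() -> str:
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    # Both signals flip the same Event; workers check it between jobs.
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)
    # Stand-in for a real Ctrl+C arriving mid-run:
    loop.call_later(0.01, stop.set)
    await stop.wait()  # step 1: stop accepting new work
    # Steps 2-5 (cancel tasks, drain the GPU subprocess, flush the
    # database, render the telemetry summary) would run here.
    return "shutdown complete"

print(asyncio.run(daemon()))
```

Routing both signals through a single Event keeps the shutdown path deterministic: every worker observes the same flag at its next checkpoint.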
Querying Results
After processing, query the SQLite database directly:
```sql
-- Find the fastest buzzers
SELECT c.name, ba.true_buzzer_latency_ms
FROM buzz_attempts ba
JOIN contestants c
  ON ba.contestant_id = c.contestant_id
WHERE ba.is_correct = 1
ORDER BY ba.true_buzzer_latency_ms ASC
LIMIT 5;

-- Identify irrational Daily Double wagers
SELECT c.name,
       w.actual_wager,
       w.game_theory_optimal_wager,
       w.wager_irrationality_delta
FROM wagers w
JOIN contestants c
  ON w.contestant_id = c.contestant_id
WHERE ABS(w.wager_irrationality_delta) > 500
ORDER BY ABS(w.wager_irrationality_delta) DESC;

-- Wordplay-heavy categories
SELECT category,
       AVG(semantic_lateral_distance) AS avg_dist
FROM clues
GROUP BY category
ORDER BY avg_dist DESC
LIMIT 10;
```
Development
Toolchain
| Tool | Purpose | Configuration |
|---|---|---|
| pytest | Test runner (pytest-asyncio for async) | `pyproject.toml` |
| ruff | Linter + formatter | Line length 120, target py311 |
| mypy | Static type checker | Strict mode enabled |
| pre-commit | Git hook enforcement | `.pre-commit-config.yaml` |
Commands
```shell
# Run the full quality gate
make all

# Individual commands
make test       # pytest with coverage
make lint       # ruff check
make typecheck  # mypy
make format     # ruff auto-format

# Or directly:
pytest tests/ -v
ruff check .
mypy trebek/
```
Test Coverage
The test suite validates critical system contracts:
| Test Module | Coverage Area |
|---|---|
| `test_state_machine.py` | Score calculation, board control, chronological adjustments |
| `test_core_database.py` | Actor-pattern write execution, atomic polling |
| `test_schema_integrity.py` | Foreign key, CHECK, and NOT NULL constraints |
| `test_config_validation.py` | GPU VRAM bounds, compute type, batch size validation |
| `test_schemas.py` | Pydantic model constraints, podium positions, wager types |
| `test_gpu_orchestrator.py` | Subprocess lifecycle, `.json.gz` output, mock binaries |
| `test_llm_pipeline.py` | Speaker anchoring Pass 1 with mocked Gemini client |
| `test_job_telemetry.py` | Telemetry schema, validation rules, upsert logic |
Project Structure
```
trebek/
├── trebek/
│   ├── cli.py                  # CLI parser + Docker orchestration
│   ├── main.py                 # Pipeline orchestrator daemon
│   ├── config.py               # Pydantic Settings + validators
│   ├── console.py              # Rich UI: banners, diagnostics, stats
│   ├── schemas.py              # Pydantic v2 data contracts
│   ├── schema.sql              # SQLite DDL (9 tables)
│   ├── core_database.py        # Actor-pattern DatabaseWriter
│   ├── gpu_orchestrator.py     # ProcessPoolExecutor + VRAM mgmt
│   ├── llm_pipeline.py         # Multi-pass Gemini extraction
│   ├── state_machine.py        # Deterministic game state replay
│   └── physics_engine.py       # Buzzer latency + semantic distance
├── tests/
│   ├── conftest.py             # Shared fixtures
│   ├── mock_bin/               # Mock ffmpeg/whisperx binaries
│   ├── test_state_machine.py
│   ├── test_core_database.py
│   ├── test_schema_integrity.py
│   ├── test_config_validation.py
│   ├── test_schemas.py
│   ├── test_gpu_orchestrator.py
│   ├── test_llm_pipeline.py
│   └── test_job_telemetry.py
├── docs/                       # Design documents and plans
├── Makefile                    # Developer shortcuts
├── pyproject.toml              # Build system + tool config
├── .pre-commit-config.yaml     # Git hook enforcement
├── .env.example                # Template configuration
├── .gitignore
└── README.md
```
Safety Invariants
These are non-negotiable constraints that must be preserved across all contributions:
- GPU Subprocess Isolation. All PyTorch/WhisperX operations must execute inside a `ProcessPoolExecutor` with `max_tasks_per_child=1`. Workers must die after every task to guarantee VRAM defragmentation. Never use `torch.cuda.empty_cache()` as a substitute.
- Database Write Serialization. All SQLite write operations must be routed through the `DatabaseWriter` actor queue. Direct `conn.execute()` calls from workers will cause `database is locked` errors under concurrent load.
- Event Loop Protection. Heavy CPU-bound operations (specifically `Episode.model_validate_json`) must be offloaded to a background thread via `asyncio.to_thread()`. Blocking the main event loop will trigger watchdog heartbeat timeouts.
- IPC Boundary Hygiene. Never pass large JSON structures across process boundaries (IPC pickling). Write data to disk as compressed `.json.gz` and pass the filepath string instead.
- LLM Fact Extraction Only. LLMs must never perform running score math or wager calculations. They extract facts; the `TrebekStateMachine` executes all arithmetic deterministically.
- Chronological Score Adjustments. Score adjustments must be applied at exactly the correct `selection_order` index, not before and not after.
- Persistent Queue Only. The SQLite `pipeline_state` table must act as the inter-stage queue. Never use `asyncio.Queue` for passing work between pipeline stages.
Design Philosophy
Database-Driven State Machine over Memory
True resumability and crash immunity are paramount. Zero data loss during multi-day inference runs requires database-backed queueing, not fragile in-memory queues. The pipeline can be killed at any point and will resume cleanly.
Deterministic Math over LLM Approximations
LLMs are hallucination-prone when performing arithmetic. They extract pure facts from transcripts; deterministic Python state machines execute the score tracking, True Daily Double resolution, and game-theory optimal wager calculations.
Hardware Isolation is Safety
VRAM fragmentation is inevitable in long-running PyTorch processes. Forceful memory reclamation via ephemeral subprocesses (max_tasks_per_child=1) guarantees stability over multi-day batch runs processing hundreds of episodes.
What Trebek Is NOT
- Not a real-time application. This is a batch-processing daemon pipeline, not an interactive or real-time streaming service.
- Not an API server. It operates via filesystem polling and SQLite state management, not over HTTP endpoints.
- Not a keyword matcher. The dataset relies on vectorized embeddings (`sqlite-vec`) for semantic evaluation of clues, isolating wordplay from direct factual recall.