
A high-fidelity multimodal AI pipeline for data extraction

Trebek 🎙️

A highly resilient, fault-tolerant data extraction pipeline daemon for transcribing and extracting structured game events from Jeopardy! episodes.


Trebek is an advanced orchestration system that bridges local GPU compute (WhisperX, Pyannote), Cloud LLMs (Google Gemini 3.1 Pro, Gemini 3.1 Flash-Lite), and a deterministic Python state machine into a single, continuously running pipeline daemon. Its core purpose is to extract highly accurate, chronological, and structurally validated data from raw Jeopardy! video episodes into a normalized relational format designed for RAG semantic searches and game-theoretic analysis.

The resulting dataset captures not just the questions and answers, but the full cognitive fingerprint of each game: true buzzer reaction times, speech disfluency counts, wager irrationality deltas, board control patterns, and semantic lateral distances between clues and responses.


Why Trebek?

Existing Jeopardy! datasets are typically scraped text archives: static lists of clues and responses with no temporal, behavioral, or strategic context. Trebek fills this gap by processing the raw video, producing a dataset that includes:

  • Millisecond-precision buzzer latencies calculated from cross-referencing visual podium illumination timestamps with acoustic buzz detection.
  • Disfluency tracking (ums, uhs, stutters) via WhisperX word-level logprobs, not LLM hallucinations.
  • Game-theory optimal wager calculations compared against actual contestant wagers to quantify irrationality.
  • Semantic lateral distance between clues and responses, distinguishing wordplay from direct factual recall.
  • Forrest Bounce detection and board control analysis for strategic game modeling.

The target audience is ML engineers, data scientists, and researchers who need a surgically clean, event-sourced dataset of human decision-making under televised pressure for predictive modeling.


✨ Core Features

Database-Backed Queueing (True Resumability)

Uses a persistent SQLite pipeline_state table to manage jobs across all stages of execution. The daemon can be interrupted at any point (via SIGINT, SIGTERM, or a crash) and will seamlessly resume exactly where it left off. No data is lost. No re-processing is required.
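The claim-next-job step that makes this crash-safe can be sketched as follows. This is a minimal illustration against a simplified pipeline_state schema (column names from the data model below; the claim query itself is an assumption), relying on SQLite 3.35+ for the RETURNING clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pipeline_state (
        episode_id TEXT PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'PENDING'
    );
    INSERT INTO pipeline_state (episode_id) VALUES ('S40E01'), ('S40E02');
""")

def claim_next_job(conn):
    """Atomically flip the oldest PENDING row to TRANSCRIBING and return its id.

    Because the claim is a single UPDATE, a crash before it leaves the row
    PENDING, and a crash after it leaves a resumable TRANSCRIBING row.
    RETURNING requires SQLite 3.35+.
    """
    row = conn.execute("""
        UPDATE pipeline_state
        SET status = 'TRANSCRIBING'
        WHERE episode_id = (
            SELECT episode_id FROM pipeline_state
            WHERE status = 'PENDING'
            ORDER BY episode_id LIMIT 1
        )
        RETURNING episode_id
    """).fetchone()
    conn.commit()
    return row[0] if row else None
```

Because the row transition and the job hand-off are one statement, there is no window where a job is "in memory only".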

VRAM Fragmentation Immunity

Local GPU operations (PyTorch/WhisperX) are sandboxed in a ProcessPoolExecutor with max_tasks_per_child=1. Worker processes exit after every episode, releasing all of their VRAM. This makes the system immune to PyTorch's internal memory fragmentation during multi-day inference runs, a problem torch.cuda.empty_cache() alone cannot solve.
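The process-per-task pattern can be observed with a tiny stdlib sketch. Note two assumptions: Trebek's own ProcessPoolExecutor form with max_tasks_per_child needs Python 3.11+, so this demo uses the older multiprocessing.Pool equivalent (maxtasksperchild=1) under the "fork" start method, which runs without a __main__ guard on Unix; and the worker is a stand-in, not the real WhisperX job:

```python
import multiprocessing as mp
import os

def transcribe_episode(path: str) -> int:
    """Stand-in for a WhisperX job; returns the worker PID so we can see
    that every task ran in a brand-new process."""
    return os.getpid()

# Trebek's pattern is ProcessPoolExecutor(mp_context=spawn, max_tasks_per_child=1);
# multiprocessing.Pool's maxtasksperchild=1 is the older equivalent.
ctx = mp.get_context("fork")
with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
    pid_a = pool.apply(transcribe_episode, ("ep1.mp4",))
    pid_b = pool.apply(transcribe_episode, ("ep2.mp4",))
# Different PIDs: the worker exited after one task, taking its entire
# CUDA context (and any fragmented VRAM) with it.
```

The two tasks run in different processes even though the pool size is 1, which is exactly the property that guarantees full VRAM reclamation between episodes.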

Multi-Pass LLM Architecture

  • Pass 1 (Gemini 3.1 Flash-Lite): Fast speaker anchoring. Extracts a rigid {SPEAKER_XX: "Name"} mapping from the host interview segment to prevent hallucinations in later passes.
  • Pass 2 (Gemini 3.1 Pro): Massive structured extraction of clues, buzzes, and wagers into strict JSON. Includes a Pydantic self-healing retry loop: if the LLM output fails schema validation, the ValidationError is injected back into the prompt for automatic correction (up to 2 retries).
  • Pass 3 (Gemini 3.1 Pro): Multimodal augmentation for visual clue reconstruction and exact podium lockout illumination frame detection.
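The Pass 2 self-healing loop can be sketched with stdlib stand-ins. Here validate_episode plays the role of Episode.model_validate_json and fake_llm stands in for the Gemini client; both are illustrative, not Trebek's actual API:

```python
import json

def validate_episode(raw: str) -> dict:
    """Stand-in for Pydantic validation; raises ValueError on invalid payloads."""
    data = json.loads(raw)
    if "clues" not in data:
        raise ValueError("Field required: clues")
    return data

def extract_with_self_healing(call_llm, prompt: str, max_retries: int = 2) -> dict:
    """On validation failure, feed the error text back to the model and retry."""
    last_error = None
    for _ in range(max_retries + 1):
        full_prompt = prompt if last_error is None else (
            f"{prompt}\n\nYour previous output failed validation:\n"
            f"{last_error}\nReturn corrected JSON only."
        )
        raw = call_llm(full_prompt)
        try:
            return validate_episode(raw)
        except ValueError as exc:   # json.JSONDecodeError is a ValueError too
            last_error = exc
    raise RuntimeError(f"Extraction failed after {max_retries} retries: {last_error}")

# Fake model: fails schema validation once, then corrects itself
# when shown its own validation error.
_responses = iter(['{"contestants": []}', '{"contestants": [], "clues": []}'])
def fake_llm(prompt: str) -> str:
    return next(_responses)
```

The key design point is that the retry prompt contains the exact validation error, so the model is correcting a named defect rather than regenerating blindly.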

Deterministic State Machine

A pure Python TrebekStateMachine replays extracted atomic game events chronologically to:

  • Calculate perfect running scores (never trusting LLMs to do arithmetic).
  • Resolve "True Daily Double" wagers at runtime against current scores.
  • Apply chronologically anchored score adjustments (judge reversals) at exactly the right moment.
  • Track board control shifts and detect Forrest Bounce patterns.
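The replay idea reduces to a fold over chronologically ordered events. This is a minimal sketch; the event dict shapes and field names here are illustrative, not Trebek's actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GameState:
    scores: Dict[str, int] = field(default_factory=dict)
    board_controller: str = ""

def replay(events: List[dict]) -> GameState:
    """All arithmetic happens here, never in the LLM."""
    state = GameState()
    for ev in events:
        who = ev["who"]
        if ev["type"] == "clue":
            # Resolve a "True Daily Double" against the *current* running score.
            value = (state.scores.get(who, 0)
                     if ev.get("wager") == "everything" else ev["value"])
            delta = value if ev["is_correct"] else -value
            state.scores[who] = state.scores.get(who, 0) + delta
            if ev["is_correct"]:
                state.board_controller = who  # winner selects the next clue
        elif ev["type"] == "adjustment":
            # Judge reversal applied exactly at this chronological position.
            state.scores[who] = state.scores.get(who, 0) + ev["points"]
    return state
```

Because the wager is resolved at replay time, a "True Daily Double" is always computed from the verified running score rather than from whatever number the LLM transcribed.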

Physics Engine (True Buzzer Latency)

Cross-references visual podium illumination timestamps (from Gemini Vision) with WhisperX's acoustic word-level boundaries to calculate true contestant reaction speeds, independent of host cadence variance. Also computes:

  • Acoustic confidence scores from raw WhisperX logprobs.
  • Deterministic disfluency counts (ums/uhs) from acoustic data, not LLM guesses.
  • Semantic lateral distance via cosine distance on text embeddings.
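The per-buzz arithmetic is simple once the timestamps and word metadata exist. A minimal sketch, assuming WhisperX-style word dicts with a per-word score derived from logprobs (output keys mirror the buzz_attempts columns described below; the exact formulas are illustrative):

```python
DISFLUENCIES = {"um", "uh", "er", "hmm"}

def buzz_metrics(podium_light_ms: int, buzz_ms: int, words: list) -> dict:
    """Compute behavioral metrics for one buzz attempt."""
    latency = buzz_ms - podium_light_ms
    return {
        "true_buzzer_latency_ms": latency,
        # A buzz "before" the light implies the contestant hit the
        # 0.25 s early-buzz lockout penalty.
        "is_lockout_inferred": latency < 0,
        "disfluency_count": sum(
            1 for w in words if w["word"].strip(".,?!").lower() in DISFLUENCIES
        ),
        "true_acoustic_confidence_score": sum(w["score"] for w in words) / len(words),
    }
```

Because the disfluency count comes from the transcript's own word list, it is reproducible from the acoustic data alone.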

Actor-Pattern Database Writer

All SQLite writes are routed through a single DatabaseWriter actor: an asyncio task owning an internal asyncio.Queue. This serializes concurrent write requests, preventing database is locked exceptions. Every enqueued operation returns an asyncio.Future protected by asyncio.wait_for() to prevent silent deadlocks.
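The actor pattern can be sketched in a few dozen lines. This is a simplified illustration (method names and the shutdown sentinel are assumptions, not Trebek's actual interface):

```python
import asyncio
import sqlite3

class DatabaseWriter:
    """Single owner of the SQLite connection; all writes go through its queue."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queue: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        while True:
            sql, params, fut = await self.queue.get()
            if sql is None:              # shutdown sentinel
                fut.set_result(None)
                return
            try:
                cur = self.conn.execute(sql, params)
                self.conn.commit()
                fut.set_result(cur.rowcount)
            except Exception as exc:
                fut.set_exception(exc)

    async def execute(self, sql, params=(), timeout: float = 5.0):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((sql, params, fut))
        # wait_for turns a dead actor into a loud TimeoutError, not a hang.
        return await asyncio.wait_for(fut, timeout)

async def main() -> int:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE episodes (episode_id TEXT PRIMARY KEY)")
    writer = DatabaseWriter(conn)
    actor = asyncio.create_task(writer.run())
    inserted = await writer.execute(
        "INSERT INTO episodes (episode_id) VALUES (?)", ("S40E01",))
    await writer.execute(None)           # drain and stop the actor
    await actor
    return inserted
```

Since only the actor task ever touches the connection, no two writes can interleave, regardless of how many workers call execute() concurrently.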


๐Ÿ—๏ธ System Architecture

┌────────────────────────────────────────────────────────────────┐
│                   TrebekPipelineOrchestrator                   │
│                      (asyncio event loop)                      │
├─────────┬─────────────┬───────────────┬────────────────────────┤
│ Ingest  │  Extractor  │  LLM Worker   │ State Machine Worker   │
│ Worker  │   Worker    │               │                        │
│         │             │               │                        │
│ polls   │ dispatches  │ Gemini Flash  │ TrebekStateMachine     │
│ input/  │ to GPU      │ Gemini Pro    │ Score verification     │
│ dir     │ subprocess  │ Self-healing  │ Board control          │
│         │             │ retry loop    │ Wager math             │
└────┬────┴──────┬──────┴───────┬───────┴──────────┬─────────────┘
     │           │              │                  │
     ▼           ▼              ▼                  ▼
┌────────────────────────────────────────────────────────────────┐
│                     DatabaseWriter (Actor)                     │
│              asyncio.Queue → sqlite3.Connection                │
│         PRAGMA foreign_keys=ON | journal_mode=WAL              │
│       PRAGMA busy_timeout=5000 | auto_vacuum=INCREMENTAL       │
└────────────────────────────────────────────────────────────────┘
     │
     ▼
┌────────────────────────────────────────────────────────────────┐
│                        SQLite Database                         │
│ pipeline_state │ episodes │ contestants │ clues │ buzz_attempts│
│ wagers │ score_adjustments │ episode_performances              │
└────────────────────────────────────────────────────────────────┘

Concurrency Model

| Layer | Technology | Purpose |
|---|---|---|
| I/O Orchestration | asyncio event loop | Manages state polling, signal handling, and worker coordination |
| GPU Isolation | ProcessPoolExecutor (spawn, max_tasks_per_child=1) | Subprocess dies after every task → 100% VRAM reclamation |
| Write Serialization | Actor pattern (asyncio.Queue + Future) | Prevents SQLite "database is locked" errors |
| CPU Offloading | asyncio.to_thread() | Keeps heavy Pydantic JSON validation off the event loop |
| IPC Optimization | Filepath strings over .json.gz | Avoids pickling overhead of massive JSON structures across process boundaries |

📊 Pipeline Stages

The pipeline processes each episode through a rigorous sequence of stages, with the pipeline_state table acting as a persistent, crash-safe queue:

| Stage | Name | Status Transition | Engine | Description |
|---|---|---|---|---|
| 1 | Ingestion | → PENDING | Filesystem polling | .mp4 files detected in input_dir are registered in pipeline_state |
| 2–3 | GPU Extraction | PENDING → TRANSCRIBING → TRANSCRIPT_READY | FFmpeg + WhisperX + Pyannote | Audio extraction, Large-v3 float16 transcription, forced-alignment diarization. Output: .json.gz |
| 4 | Commercial Filtering | TRANSCRIPT_READY → CLEANED | Gemini 3.1 Flash-Lite | High-speed advertisement removal while preserving exact word-level timings |
| 5 | Structured Extraction | CLEANED → SAVING | Gemini 3.1 Flash-Lite + Pro | Pass 1: speaker anchoring. Pass 2: full game event extraction with Pydantic self-healing |
| 6 | Multimodal Augmentation | (inline) | Gemini 3.1 Pro | Visual clue reconstruction and podium illumination timestamp extraction |
| 7 | State Verification | SAVING → VECTORIZING | TrebekStateMachine | Deterministic replay validates score sequences, adjustments, and board control logic |
| 8–9 | Relational & Semantic Commit | VECTORIZING → COMPLETED | DatabaseWriter + sqlite-vec | Normalized INSERT into relational tables + vector embedding for semantic search |

If any stage fails, the episode status is set to FAILED and logged for manual review. The daemon continues processing other episodes.


๐Ÿ—‚๏ธ Data Model

The SQLite schema is designed as a normalized relational model optimized for analytical queries:

Core Tables

pipeline_state          The persistent job queue
├── episode_id (PK)
├── status              PENDING → TRANSCRIBING → TRANSCRIPT_READY → CLEANED → SAVING → VECTORIZING → COMPLETED
├── transcript_path     Filepath to .json.gz GPU output
├── created_at
└── updated_at

episodes                High-level episode metadata
├── episode_id (PK)
├── air_date
├── host_name
└── is_tournament

contestants             Unique contestant profiles
├── contestant_id (PK)
├── name
├── occupational_category    LLM-classified (e.g., 'Academia', 'STEM', 'Law')
└── is_returning_champion

episode_performances    Per-episode contestant stats
├── episode_id (FK)
├── contestant_id (FK)
├── podium_position     1 (left), 2 (center), 3 (right)
├── coryat_score        Score without Daily Doubles and Final Jeopardy
├── final_score
└── forrest_bounce_index

clues                   The board matrix with temporal and semantic markers
├── clue_id (PK)
├── episode_id (FK)
├── round               CHECK('Jeopardy', 'Double', 'Final', 'Tiebreaker')
├── category / board_row / board_col / selection_order
├── clue_text / correct_response
├── is_daily_double / daily_double_wager / wagerer_name
├── host_start_timestamp_ms / host_finish_timestamp_ms
├── clue_syllable_count / host_speech_rate_wpm
├── requires_visual_context
├── clue_embedding (BLOB)     Vector for semantic search
├── response_embedding (BLOB)
└── semantic_lateral_distance  Cosine distance: wordplay vs. factual recall

buzz_attempts           Behavioral physics per buzz-in
├── attempt_id (PK)
├── clue_id (FK) / contestant_id (FK)
├── attempt_order       1st buzz, 2nd/3rd for rebounds
├── buzz_timestamp_ms / podium_light_timestamp_ms
├── true_buzzer_latency_ms   Reaction time (visual - acoustic)
├── is_lockout_inferred      0.25s penalty detection
├── response_given / is_correct
├── brain_freeze_duration_ms
├── true_acoustic_confidence_score   From WhisperX logprobs
├── disfluency_count
└── phonetic_similarity_score

wagers                  Game theory analysis
├── wager_id (PK)
├── clue_id (FK) / contestant_id (FK)
├── running_score_at_time / opponent scores
├── actual_wager
├── game_theory_optimal_wager
└── wager_irrationality_delta

score_adjustments       Chronological host/judge corrections
├── adjustment_id (PK)
├── episode_id (FK) / contestant_id (FK)
├── points_adjusted
├── reason
└── effective_after_clue_selection_order

Pydantic Data Contracts

All LLM extraction outputs are validated against strict Pydantic v2 models defined in src/schemas.py. Key models include:

| Model | Description |
|---|---|
| Episode | Top-level container: contestants, clues, Final Jeopardy, score adjustments |
| Clue | Board position, temporal bounds, Daily Double metadata, buzz attempts |
| BuzzAttempt | Per-buzz reaction data: timestamps, lockout inference, response text |
| Contestant | Name, podium position, occupation category, champion status |
| FinalJeopardy | Category, clue text, per-contestant wagers and responses |
| ScoreAdjustment | Chronologically anchored point corrections with reasons |

🤖 ML/AI Integration

| Provider | Model | Stage | Application |
|---|---|---|---|
| Local GPU | WhisperX / Pyannote | 2–3 | Large-v3 float16 transcription, forced alignment, speaker diarization |
| Google | Gemini 3.1 Flash-Lite | 4–5 | Speaker anchoring and commercial filtering (high-speed structured mapping) |
| Google | Gemini 3.1 Pro | 5 | Massive structured extraction with Pydantic self-healing retry loop |
| Google | Gemini 3.1 Pro | 6 | Visual clue reconstruction, podium lockout illumination frame detection |
| Local GPU | Ollama / Llama-3-8B | 5 (fallback) | Offline structured extraction for environments without Gemini API access |
| Local/API | Text Embeddings | 9 | Cosine distance calculation for semantic_lateral_distance |

๐Ÿ› ๏ธ Installation

Prerequisites

| Requirement | Notes |
|---|---|
| Python | 3.9 or higher |
| FFmpeg | Required for audio extraction from video files |
| NVIDIA GPU | Minimum 16GB VRAM recommended (optimized for RTX 4060 Ti / 5060 Ti) |
| CUDA Toolkit | Required for WhisperX GPU acceleration |
| SQLite | 3.35+ (for RETURNING clause support in atomic polling) |
| Gemini API Key | Required for LLM extraction stages |

Setup

Trebek is published to PyPI and can be installed globally.

  1. Install the package:

    pip install trebek
    
  2. (Optional) Install GPU dependencies for native processing:

    pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
    pip install whisperx pyannote.audio
    

    Note: If you do not wish to install these heavy dependencies, you can use the built-in Docker wrapper (see below).

๐Ÿณ Docker Hybrid Execution (Recommended)

To completely bypass complex PyTorch and CUDA dependency issues on your host, Trebek includes a seamless Docker orchestrator.

Usage: Simply append the --docker flag to any trebek command. The CLI will automatically spin up the official GPU-enabled container, mapping your current working directory and .env variables seamlessly:

trebek --docker
trebek --docker --once --input-dir ./videos

โš ๏ธ WARNING - SQLite WAL Mode & Network Drives Trebek uses SQLite Write-Ahead Logging (WAL) which requires strict POSIX advisory locking. Your trebek.db volume must be mounted to a local disk (ext4, NTFS, APFS). Mapping it to a network share (NFS, SMB, CIFS) will result in immediate database corruption or locking failures.


โš™๏ธ Configuration

Trebek uses Pydantic Settings for configuration, automatically loading values from environment variables or a .env file in the project root.

Create a .env file:

# ─── Core Paths ───
db_path=trebek.db                     # Path to the SQLite database
output_dir=gpu_outputs                # Directory for intermediate pipeline outputs (.json.gz)
input_dir=input_videos                # Directory to poll for new .mp4 files

# ─── API Keys ───
GEMINI_API_KEY=your_api_key_here      # Google Gemini API key

# ─── Logging ───
log_level=INFO                        # DEBUG, INFO, WARNING, ERROR

# ─── GPU Constraints ───
gpu_vram_target_gb=16                 # Target VRAM ceiling (4–24 GB)
whisper_batch_size=16                 # WhisperX batch size (tuned for 16GB VRAM)
whisper_compute_type=float16          # float16 or float32

Configuration Validation

The Settings class enforces runtime constraints via Pydantic field validators:

| Setting | Constraint | Default |
|---|---|---|
| gpu_vram_target_gb | Must be between 4 and 24 (inclusive) | 16 |
| whisper_compute_type | Must be float16 or float32 | float16 |
| whisper_batch_size | Must be greater than 0 | 16 |

Invalid configurations will raise a ValidationError at startup, preventing the daemon from running with unsafe GPU parameters.
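The same constraints can be sketched with a stdlib dataclass stand-in (the real Settings class uses Pydantic v2 field validators; only the field names and bounds below come from the table above):

```python
from dataclasses import dataclass

@dataclass
class Settings:
    """Stdlib stand-in for the Pydantic Settings constraints."""
    gpu_vram_target_gb: int = 16
    whisper_compute_type: str = "float16"
    whisper_batch_size: int = 16

    def __post_init__(self) -> None:
        # Fail fast at construction time, like Pydantic does at startup.
        if not 4 <= self.gpu_vram_target_gb <= 24:
            raise ValueError("gpu_vram_target_gb must be between 4 and 24")
        if self.whisper_compute_type not in ("float16", "float32"):
            raise ValueError("whisper_compute_type must be float16 or float32")
        if self.whisper_batch_size <= 0:
            raise ValueError("whisper_batch_size must be greater than 0")
```

Constructing Settings(gpu_vram_target_gb=32) raises immediately, which is the behavior the daemon relies on to avoid launching with unsafe GPU parameters.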


🚀 Usage

Trebek is designed to run as a continuous daemon. Once started, it will poll the configured input_dir for new .mp4 files and orchestrate the full extraction pipeline automatically.

Start the Pipeline

trebek                   # Native mode (requires GPU dependencies)
trebek --docker          # Docker mode (recommended)

Process Episodes

  1. Place .mp4 video files into the input_videos/ directory (or your configured input_dir).
  2. The ingestion worker will detect new files within 5 seconds and register them as PENDING.
  3. Each episode flows through the pipeline stages automatically.
  4. Monitor progress via structured JSON logs (stdout) or query the pipeline_state table directly.

Graceful Shutdown

Send SIGINT (Ctrl+C) or SIGTERM to the process. The daemon will:

  1. Stop accepting new work.
  2. Cancel all running async tasks.
  3. Wait for the GPU subprocess to complete its current task.
  4. Flush and close the database connection.
  5. Log a clean shutdown confirmation.
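The signal wiring behind this sequence can be sketched as follows (a Unix-only illustration; the handler names and return value are assumptions, not Trebek's internals):

```python
import asyncio
import os
import signal

async def daemon() -> str:
    """Run until SIGINT/SIGTERM arrives, then shut down in order."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)   # step 1: stop accepting work
    # ... workers would run here until the stop event fires ...
    await stop.wait()
    # steps 2-4: cancel tasks, await the GPU subprocess, flush and close the DB
    return "shutdown complete"                   # step 5: clean confirmation
```

Using an asyncio.Event rather than raising KeyboardInterrupt lets every worker observe the shutdown cooperatively instead of being torn down mid-write.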

Querying Results

After processing, query the normalized SQLite database directly:

-- Find the fastest buzzer in a specific episode
SELECT c.name, ba.true_buzzer_latency_ms
FROM buzz_attempts ba
JOIN contestants c ON ba.contestant_id = c.contestant_id
WHERE ba.is_correct = 1
ORDER BY ba.true_buzzer_latency_ms ASC
LIMIT 5;

-- Identify irrational Daily Double wagers
SELECT c.name, w.actual_wager, w.game_theory_optimal_wager, w.wager_irrationality_delta
FROM wagers w
JOIN contestants c ON w.contestant_id = c.contestant_id
WHERE ABS(w.wager_irrationality_delta) > 500
ORDER BY ABS(w.wager_irrationality_delta) DESC;

-- Semantic search for wordplay-heavy categories
SELECT category, AVG(semantic_lateral_distance) as avg_distance
FROM clues
GROUP BY category
ORDER BY avg_distance DESC
LIMIT 10;

🧪 Development

Toolchain

| Tool | Purpose | Configuration |
|---|---|---|
| pytest | Test runner (with pytest-asyncio for async tests) | pyproject.toml |
| ruff | Linter and import sorter | Line length: 120, target: Python 3.9 |
| mypy | Static type checker | Strict mode enabled |

Commands

# Run the full test suite
pytest tests/

# Run with verbose output
pytest tests/ -v

# Run the linter
ruff check .

# Run the type checker
mypy trebek/

Test Coverage

The test suite validates critical system contracts:

| Test Module | Coverage Area |
|---|---|
| test_state_machine.py | Score calculation, board control, chronological adjustments, True Daily Double resolution |
| test_core_database.py | Actor-pattern write execution, atomic polling (RETURNING clause) |
| test_schema_integrity.py | Foreign key enforcement, CHECK constraints, NOT NULL constraints |
| test_config_validation.py | GPU VRAM bounds, compute type validation, batch size validation |
| test_schemas.py | Pydantic model constraints: podium positions, Daily Double wager union types |
| test_gpu_orchestrator.py | Subprocess lifecycle, .json.gz output generation, mock binary integration |
| test_llm_pipeline.py | Speaker anchoring Pass 1 with mocked Gemini client |

๐Ÿ“ Project Structure

trebek/
├── trebek/
│   ├── cli.py                 # Pipeline entrypoint (CLI parser and Docker orchestration)
│   ├── main.py                # Pipeline orchestrator daemon (asyncio event loop, workers, signal handling)
│   ├── config.py              # Pydantic Settings with GPU constraint validators
│   ├── schemas.py             # Pydantic v2 data contracts (Episode, Clue, BuzzAttempt, etc.)
│   ├── schema.sql             # SQLite DDL: 8 tables with foreign keys, CHECK constraints, PRAGMAs
│   ├── core_database.py       # Actor-pattern DatabaseWriter with deadlock protection
│   ├── gpu_orchestrator.py    # ProcessPoolExecutor with spawn context and SIGKILL safety
│   ├── llm_pipeline.py        # Multi-pass Gemini extraction with self-healing retry loop
│   ├── state_machine.py       # Deterministic game state replay (scores, adjustments, board control)
│   └── physics_engine.py      # Buzzer latency, acoustic metrics, semantic distance, Vision client
├── tests/
│   ├── conftest.py            # Shared fixtures (in-memory SQLite with schema)
│   ├── mock_bin/              # Mock ffmpeg/whisperx binaries for GPU orchestrator tests
│   ├── test_state_machine.py
│   ├── test_core_database.py
│   ├── test_schema_integrity.py
│   ├── test_config_validation.py
│   ├── test_schemas.py
│   ├── test_gpu_orchestrator.py
│   └── test_llm_pipeline.py
├── docs/                      # Design documents, plans, explorations, and archived specs
├── .agent/                    # Agent lifecycle metadata (architecture, philosophy, status, style)
├── pyproject.toml             # Build system, dependencies, tool configuration
├── .gitignore
└── README.md

🔒 Safety Invariants

These are non-negotiable constraints that must be preserved across all contributions:

  1. GPU Subprocess Isolation. All PyTorch/WhisperX operations must execute inside a ProcessPoolExecutor with max_tasks_per_child=1. Workers must die after every task to guarantee VRAM defragmentation. Never use torch.cuda.empty_cache() as a substitute.

  2. Database Write Serialization. All SQLite write operations must be routed through the DatabaseWriter actor queue. Direct conn.execute() calls from workers will cause database is locked errors under concurrent load.

  3. Event Loop Protection. Heavy CPU-bound operations (specifically Episode.model_validate_json) must be offloaded to a background thread via asyncio.to_thread(). Blocking the main event loop will trigger watchdog heartbeat timeouts.

  4. IPC Boundary Hygiene. Never pass large JSON structures across process boundaries (IPC pickling). Write data to disk as compressed .json.gz and pass the filepath string instead.

  5. LLM Fact Extraction Only. LLMs must never perform running score math or wager calculations. They extract facts; the TrebekStateMachine executes all arithmetic deterministically.

  6. Chronological Score Adjustments. Score adjustments must be applied at exactly the correct selection_order index, not before, not after.

  7. Persistent Queue Only. The SQLite pipeline_state table must act as the inter-stage queue. Never use asyncio.Queue for passing work between pipeline stages.
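Invariant 3 in practice: a minimal sketch of offloading CPU-bound parsing, with json.loads standing in for the real Episode.model_validate_json call (the function name here is hypothetical):

```python
import asyncio
import json

def validate_payload(raw: str) -> dict:
    """Stand-in for the heavy Pydantic validation step."""
    return json.loads(raw)

async def handle_llm_output(raw: str) -> dict:
    # The CPU-bound parse runs in a worker thread, so the event loop
    # keeps servicing heartbeats, signals, and the other workers.
    return await asyncio.to_thread(validate_payload, raw)
```

asyncio.to_thread (Python 3.9+) is the simplest way to honor the invariant without introducing a second process pool.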


💡 Design Philosophy

Database-Driven State Machine over Memory

True resumability and crash immunity are paramount. Zero data loss during multi-day inference runs requires database-backed queueing, not fragile in-memory queues. The pipeline can be killed at any point and will resume cleanly.

Deterministic Math over LLM Approximations

LLMs are hallucination-prone when performing arithmetic. In Trebek they extract pure facts from transcripts; a deterministic Python state machine executes the score tracking, True Daily Double resolution, and game-theory optimal wager calculations.

Hardware Isolation is Safety

VRAM fragmentation is inevitable in long-running PyTorch processes. Forceful memory reclamation via ephemeral subprocesses (max_tasks_per_child=1) guarantees stability over multi-day batch runs processing hundreds of episodes.

What Trebek Is NOT

  • Not a real-time application. This is a batch-processing, heavy-compute daemon pipeline, not an interactive or real-time streaming service.
  • Not an API server. It operates via filesystem polling and SQLite state management, not over HTTP endpoints.
  • Not a keyword matcher. The dataset relies on vectorized embeddings (sqlite-vec) for semantic evaluation of clues, isolating wordplay from direct factual recall.

Built for ML researchers who believe the best datasets are the ones you extract yourself.



Download files


Source Distribution

trebek-0.1.0.tar.gz (59.1 kB)

Uploaded Source

Built Distribution


trebek-0.1.0-py3-none-any.whl (50.8 kB)

Uploaded Python 3

File details

Details for the file trebek-0.1.0.tar.gz.

File metadata

  • Download URL: trebek-0.1.0.tar.gz
  • Upload date:
  • Size: 59.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for trebek-0.1.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | c506c4f13bf5cf6aadd34b16c75bd07aca2d8a3434b6f513de1b12cc63ee454d |
| MD5 | ad4807cc02cbd476bcfe80dedfacfdaf |
| BLAKE2b-256 | 88febc17b0f23f255d5e0128efde68e1692bf2bc704c6de1936733a46a43f6f8 |


File details

Details for the file trebek-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: trebek-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 50.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for trebek-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8c91774def3aa95402e7c1e203ab245a4a9895fa2d4b56bae50aa223b06a4de |
| MD5 | 809be8d2dfc7cf470be22ac136504754 |
| BLAKE2b-256 | 1028d4b8fc551f3219cae8df39687c25bed8cea93e6e2e8c5c02ac0b5e37b1c3 |

