A high-fidelity multimodal AI pipeline for J! data extraction
```
╺┳╸┏━┓┏━╸┏┓ ┏━╸┃┏╸
 ┃ ┣┳┛┣╸ ┣┻┓┣╸ ┣┻┓
 ╹ ╹┗╸┗━╸┗━┛┗━╸╹ ╹
```
The definitive multimodal AI pipeline for extracting structured game data from J! episodes.
From casual trivia lovers to ML engineers — one dataset to rule them all.
What is Trebek?
Trebek is an advanced, fault-tolerant pipeline that processes raw J! video recordings — not scraped web pages — and produces a surgically clean, event-sourced relational dataset of every game event that occurred on screen. It bridges local GPU compute (WhisperX, Pyannote), cloud LLMs (Google Gemini 3.1), and a deterministic Python state machine into a single, continuously running daemon.
The resulting dataset doesn't just capture questions and answers. It captures the full cognitive fingerprint of each game:
- ⚡ Millisecond-precision buzzer latencies — cross-referenced from visual podium illumination and acoustic buzz detection
- 🗣️ Speech disfluency tracking — ums, uhs, and stutters from WhisperX logprobs, not LLM hallucinations
- 🎲 Game-theory optimal wager analysis — calculated wagers compared against actual contestant choices
- 🧠 Semantic lateral distance — cosine distance on embeddings distinguishing wordplay from direct recall
- 🏗️ Board control & Forrest Bounce detection — strategic selection pattern analysis
- 📊 Coryat scores — calculated deterministically per contestant per episode
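The semantic lateral distance above reduces to a cosine distance between clue and response embedding vectors. A minimal sketch of the metric itself (the embedding model is out of scope here; any sentence-embedding vectors would do):

```python
from math import sqrt

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 for identical direction (direct recall),
    approaching 2.0 for opposite vectors (maximal lateral leap)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

A higher distance between a clue's embedding and its correct response's embedding suggests wordplay or indirection rather than a straight factual lookup.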
Trebek vs. J-Archive
Existing J! datasets are static text scrapes — frozen lists of clues and responses with no temporal, behavioral, or strategic context. Trebek extracts from the raw video, producing an entirely different class of dataset.
| Dimension | J-Archive / Scraped Data | Trebek |
|---|---|---|
| Source | Web scraping | Raw video processing |
| Buzzer timing | ❌ Not available | ✅ True ms-precision latency |
| Speech patterns | ❌ Not available | ✅ Disfluency counts, acoustic confidence |
| Wager analysis | Partial (raw numbers only) | ✅ Game-theory optimal + irrationality delta |
| Board control | ❌ Not available | ✅ Full selection order + Forrest Bounce index |
| Score adjustments | Sometimes noted | ✅ Chronologically anchored to exact clue index |
| Visual clues | Text description | ✅ Multimodal extraction from video frames |
| Semantic analysis | ❌ Not available | ✅ Clue-response embedding distances |
| Data format | Flat HTML / CSV | ✅ Normalized relational DB (9 tables) |
| Freshness | Depends on scraper maintenance | ✅ Process your own recordings on demand |
| Coryat scores | Manual fan calculation | ✅ Deterministic, per-contestant |
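For concreteness, the "game-theory optimal" baseline in the wager row can be illustrated with the classic leader's cover bet for Final Jeopardy! — a simplified two-player heuristic, not necessarily the exact rule Trebek implements:

```python
def leader_cover_wager(leader_score: int, second_score: int) -> int:
    """Leader's cover bet: wager just enough to finish one dollar ahead
    even if second place doubles up and responds correctly. A runaway
    game (leader > 2x second) needs no wager at all."""
    return max(0, 2 * second_score - leader_score + 1)

def irrationality_delta(actual_wager: int, optimal_wager: int) -> int:
    """How far the contestant's real bet strayed from the baseline."""
    return abs(actual_wager - optimal_wager)
```

Comparing each contestant's actual wager against such a baseline yields the "irrationality delta" in the table.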
Who Is This For?
| 🎯 Trivia Enthusiasts | 📊 Data Scientists | 🤖 ML Engineers |
|---|---|---|
| Explore your favorite episodes with deep analytics. Query buzzer speeds, track contestant strategies, and discover board control patterns across seasons. | A richly normalized relational dataset designed for analytical queries. 9 tables, foreign keys, embeddings, and temporal data — ready for your notebooks. | Train predictive models on human decision-making under televised pressure. Buzzer latency, wager irrationality, disfluency signals — features you can't get anywhere else. |
✨ Feature Highlights
🔄 True Crash Immunity
Database-backed queueing via SQLite pipeline_state. Kill the daemon at any point — SIGINT, SIGTERM, crash, power failure — and it resumes exactly where it left off. Zero data loss. Zero re-processing.
🧠 Multi-Pass LLM Architecture
- Pass 1 (Flash-Lite): Speaker anchoring from host interview audio
- Pass 2 (Pro): Map-reduce structured extraction with Pydantic self-healing retry
- Pass 3 (Pro): Multimodal visual clue reconstruction + podium illumination detection
⚙️ Deterministic State Machine
Pure Python TrebekStateMachine replays game events chronologically. LLMs extract facts; the state machine does all arithmetic. Running scores, True Daily Double resolution, Coryat scores, and game-theory optimal wagers — all calculated deterministically.
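As one concrete example of "the state machine does all arithmetic," a Coryat score can be replayed deterministically from extracted events. The event-dict shape here is illustrative, not Trebek's actual schema:

```python
def coryat(events: list[dict]) -> int:
    """Coryat score: the contestant's score with all wagering ignored.
    A correct Daily Double counts for the clue's face value, a missed
    Daily Double costs nothing, and Final Jeopardy! is excluded."""
    score = 0
    for ev in events:
        if ev["round"] == "FINAL":
            continue  # Final Jeopardy! never affects Coryat
        if ev["daily_double"]:
            if ev["correct"]:
                score += ev["clue_value"]  # face value, not the wager
        else:
            score += ev["clue_value"] if ev["correct"] else -ev["clue_value"]
    return score
```

Given the same event stream, this replay always produces the same number — the LLM only supplies the facts (who answered, correctly or not, on which clue).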
🔥 Warm Worker GPU Architecture
PyTorch/WhisperX model weights stay resident in VRAM. No cold starts. Automatic OOM recovery with pool restarts. Explicit memory management for multi-day inference runs.
🎯 Physics Engine
Cross-references visual podium illumination (Gemini Vision) with WhisperX acoustic boundaries to compute true contestant reaction speeds. Also calculates acoustic confidence scores and semantic lateral distance.
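The cross-referencing idea can be sketched as follows. Visual detection is quantized to frame boundaries, while the acoustic buzz has sub-frame resolution, so the acoustic estimate wins when the two agree within one frame period. This is a simplified illustration, not Trebek's actual fusion logic:

```python
def reaction_ms(clue_end_s: float, light_frame_s: float, buzz_audio_s: float,
                frame_period_s: float = 1 / 30) -> float:
    """Fuse visual podium illumination and acoustic buzz detection into
    one reaction-time estimate, in milliseconds after the clue ends."""
    visual = light_frame_s - clue_end_s      # quantized to video frames
    acoustic = buzz_audio_s - clue_end_s     # sub-frame precision
    # Trust the acoustic signal only when it corroborates the visual one.
    estimate = acoustic if abs(visual - acoustic) <= frame_period_s else visual
    return estimate * 1000.0
```

When the microphone picks up a spurious sound, the two estimates disagree by more than a frame and the fusion falls back to the visual signal.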
🗄️ Actor-Pattern Database
All SQLite writes are serialized through a single DatabaseWriter actor (asyncio.Queue + Future). No "database is locked" exceptions. Atomic transactions for high-throughput batched commits.
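The actor pattern here is compact enough to sketch in full. A single consumer task owns the connection, so SQLite never sees concurrent writers; callers await a Future that resolves only once their write has committed. A minimal version (details differ from Trebek's real DatabaseWriter):

```python
import asyncio
import sqlite3

class DatabaseWriter:
    """Serialize all writes through one task so the async workers can
    never trigger a 'database is locked' error."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queue: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        while True:
            sql, params, fut = await self.queue.get()
            try:
                with self.conn:  # one atomic transaction per write
                    self.conn.execute(sql, params)
                fut.set_result(None)
            except Exception as exc:
                fut.set_exception(exc)

    async def write(self, sql: str, params: tuple = ()) -> None:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((sql, params, fut))
        await fut  # caller resumes only after the commit lands
```

Batched commits then amount to draining several queued writes inside one transaction instead of one each.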
🚀 Quick Start
The fastest way to get Trebek running is using the official Docker image via Hybrid Mode. The lightweight CLI runs on your host, while the heavy GPU workloads (PyTorch, WhisperX) are safely delegated to the ghcr.io container.
```sh
# 1. Install lightweight CLI
pip install trebek

# 2. Configure (requires a free Gemini API key)
echo "GEMINI_API_KEY=your_key_here" > .env

# 3. Run with Docker GPU delegation
trebek run --input-dir /path/to/your/videos --docker
```
📖 Full installation guide: See SETUP.md for docker-compose deployments, native installations (no Docker), HuggingFace token configuration, and detailed CLI usage.
🏗️ Architecture deep-dive: See DESIGN.md for the complete system architecture, data model, pipeline stages, and safety invariants.
📊 Stats Dashboard
Run trebek stats for a live analytics dashboard showing pipeline health, cost tracking, stage timing, and recent episode status:
```
┌─ Pipeline Health ─────────────────────────────────────────┐
│ ✅ COMPLETED 42   ⏳ PENDING 3   ❌ FAILED 1              │
│ ████████████████████████████████████░░░░ 91.3%            │
├─ Cost & Performance ──────────────────────────────────────┤
│ Tokens: 12.4M in / 2.1M out          Cost: $4.82 USD      │
│ Peak VRAM: 14.2 GB                   Avg GPU: 87%         │
├─ Stage Timing (avg) ──────────────────────────────────────┤
│ transcribe: 4m 12s   extract: 2m 38s   verify: 0.4s       │
└───────────────────────────────────────────────────────────┘
```
🧪 Development
```sh
make all        # Full quality gate (test + lint + typecheck)
make test       # pytest with coverage
make lint       # ruff check
make typecheck  # mypy strict mode
```
| Tool | Purpose |
|---|---|
| pytest | Test runner (pytest-asyncio for async) |
| ruff | Linter + formatter (line-length 120) |
| mypy | Static type checker (strict mode) |
| pre-commit | Git hook enforcement |
📁 Project Structure
```
trebek/
├── trebek/
│   ├── cli.py            # CLI parser + Docker orchestration
│   ├── config.py         # Pydantic Settings + validators
│   ├── schemas.py        # Pydantic v2 data contracts
│   ├── schema.sql        # SQLite DDL (9 tables)
│   ├── state_machine.py  # Deterministic game state replay
│   ├── database/         # Actor-pattern writer + pipeline ops
│   ├── gpu/              # Warm Worker pool + VRAM management
│   ├── llm/              # Multi-pass Gemini extraction pipeline
│   ├── pipeline/         # Async orchestrator + stage workers
│   ├── analysis/         # Post-extraction analytics (embeddings)
│   └── ui/               # Rich console dashboard + rendering
├── tests/                # Comprehensive test suite
├── docs/                 # Design documents and plans
├── Dockerfile            # GPU-enabled container
├── docker-compose.yml    # One-command deployment
├── Makefile              # Developer shortcuts
└── pyproject.toml        # Build system + tool config
```
📄 License
Download files
File details
Details for the file trebek-1.1.3.tar.gz.
File metadata
- Download URL: trebek-1.1.3.tar.gz
- Upload date:
- Size: 104.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 81654c0364c4a97b43951a9ad7188d285a9b18c65685056eb78d34789764d3ff |
| MD5 | 31d4c05346d25a5a51ed7a676d95a79a |
| BLAKE2b-256 | 6274cd3c0c5285a9a9b0e9f2224fabca93faebfa1e0bf9928af385de65ef7b8c |
Provenance

The following attestation bundles were made for trebek-1.1.3.tar.gz:

Publisher: release.yml on arvarik/trebek

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: trebek-1.1.3.tar.gz
- Subject digest: 81654c0364c4a97b43951a9ad7188d285a9b18c65685056eb78d34789764d3ff
- Sigstore transparency entry: 1399214269
- Permalink: arvarik/trebek@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Branch / Tag: refs/tags/v1.1.3
- Owner: https://github.com/arvarik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Trigger Event: push
File details
Details for the file trebek-1.1.3-py3-none-any.whl.
File metadata
- Download URL: trebek-1.1.3-py3-none-any.whl
- Upload date:
- Size: 123.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7cae31e90a876ed8872ec781c7fe749802af2f3d87f1615adf1fe7f9e972b7d1 |
| MD5 | 23b2b0ae55907847e434c3cc37631008 |
| BLAKE2b-256 | cdee144d94c6f0dd23661c5bb6c66cb36c41e01fd6475deba539245bf597d910 |
Provenance

The following attestation bundles were made for trebek-1.1.3-py3-none-any.whl:

Publisher: release.yml on arvarik/trebek

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: trebek-1.1.3-py3-none-any.whl
- Subject digest: 7cae31e90a876ed8872ec781c7fe749802af2f3d87f1615adf1fe7f9e972b7d1
- Sigstore transparency entry: 1399214275
- Permalink: arvarik/trebek@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Branch / Tag: refs/tags/v1.1.3
- Owner: https://github.com/arvarik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@72c76a82de1a0cf114b32bcbba3356d6a3037cfd
- Trigger Event: push