
Pipeline to transform webinar videos into searchable slide and transcript PDFs.

Video Lectures to Searchable PDFs

Pipeline for turning webinar-style videos into searchable lecture artifacts:

  • OCR-driven slide PDF
  • Whisper transcript PDF
  • Slide-aligned combined PDF

Requirements

  • Python: 3.10+
  • System binaries:
    • ffmpeg (for audio + frame extraction)
  • Hardware:
    • CPU-only is supported (the default device setting, cuda, falls back to CPU when no GPU is available).
    • GPU (CUDA) is recommended for faster Whisper + OCR if available.

On Ubuntu/Debian, install FFmpeg with:

sudo apt-get update
sudo apt-get install -y ffmpeg
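
To confirm the binary is visible before running the pipeline, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper name find_ffmpeg is hypothetical, and honouring FFMPEG_BINARY this way is an assumption about how the override described under Configuration behaves.

```python
import os
import shutil


def find_ffmpeg(env=None):
    """Return the resolved path of the ffmpeg executable, or None if not found."""
    env = os.environ if env is None else env
    # Assumption: FFMPEG_BINARY holds an executable name or path;
    # without it, fall back to plain "ffmpeg" so the PATH lookup applies.
    candidate = env.get("FFMPEG_BINARY", "ffmpeg")
    return shutil.which(candidate)
```

Calling find_ffmpeg() returns a path string when the executable resolves and None otherwise, in which case installing FFmpeg or setting FFMPEG_BINARY is the likely fix.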

Quickstart

python -m venv .venv && source .venv/bin/activate
pip install -e .
vlsp --help

CLI Usage

vlsp run --type local --source /path/to/webinar.mp4
vlsp run --type youtube --source https://youtu.be/xxxx
vlsp run --type gdrive --source https://drive.google.com/file/d/ID/view

Outputs land in data/processed/<video_id>/.
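
After a run, the generated PDFs can be collected with a few lines of Python. This is a rough sketch: the helper name list_output_pdfs is hypothetical, and since the README does not state the output file names, it simply globs for *.pdf in the output directory.

```python
from pathlib import Path


def list_output_pdfs(video_id, processed_dir="data/processed"):
    """Return the PDF files found under <processed_dir>/<video_id>/, sorted by name."""
    out_dir = Path(processed_dir) / video_id
    if not out_dir.is_dir():
        # Nothing has been written for this video yet.
        return []
    return sorted(out_dir.glob("*.pdf"))
```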

API Server

uvicorn app.server:app --reload --port 8080

POST payload:

{
  "source_type": "youtube",
  "source": "https://youtu.be/..."
}
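
A client could build that request with the standard library as sketched below. The README does not name the route, so the URL is left to the caller; the helper names are hypothetical, and restricting source_type to the three CLI values (local, youtube, gdrive) is an assumption that the server accepts the same set.

```python
import json
import urllib.request

# Assumption: the API accepts the same source types as the CLI's --type option.
SOURCE_TYPES = {"local", "youtube", "gdrive"}


def build_request(source_type, source, url):
    """Build a JSON POST request carrying the payload shape shown above."""
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unsupported source_type: {source_type!r}")
    body = json.dumps({"source_type": source_type, "source": source}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return (status code, decoded response body)."""
    with urllib.request.urlopen(request) as resp:
        return resp.status, resp.read().decode("utf-8")
```

With the server running on port 8080, the call would look like send(build_request("local", "/path/to/webinar.mp4", "http://localhost:8080/<route>")), where <route> is a placeholder for the actual endpoint path; FastAPI's interactive docs page (served at /docs unless disabled) lists the real routes.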

Architecture Overview

flowchart LR
    subgraph Ingestion
        SRC[(Video Source)]
        SRC -->|local / youtube / gdrive| DL[Downloader]
    end

    DL --> FF[FFmpeg Extractor]
    FF -->|audio| WHISPER[faster-whisper]
    FF -->|frames| OCR[PaddleOCR and optional VLM captions]

    WHISPER --> ALIGN[Slide/Text Aligner]
    OCR --> ALIGN

    ALIGN --> PDFGEN[ReportLab / PyPDF Builder]
    PDFGEN --> OUT[Searchable PDFs]

    OUT -->|persist| STORE[data/processed/<video_id>]
    ALIGN -->|serve| API[(FastAPI + Typer CLI)]

The CLI (vlsp) and FastAPI server share the same pipeline, so you can drive the workflow via command line, HTTP, or by importing the pipeline directly in Python.

End-to-End Workflow

  1. Ingestion: The video is pulled from the specified source (local, youtube, or gdrive). Metadata such as ID, title, and duration is captured for downstream file naming.
  2. Media Extraction: FFmpeg splits the video into a high-quality WAV track and evenly spaced video frames with timestamps.
  3. Speech + Slide Text Understanding:
    • faster-whisper produces bilingual-friendly transcripts and per-segment timestamps.
    • PaddleOCR extracts slide text from frames.
    • (Optional) A vision-language model (e.g. BLIP / LLaVA) can generate rich slide captions; this is disabled by default to keep VRAM usage modest.
  4. Alignment: Transcript chunks are matched to their corresponding slide frames using temporal overlap and cosine similarity on embeddings.
  5. PDF Generation:
    • OCR-driven slide PDF for crisp slide reproduction with searchable overlays.
    • Whisper transcript PDF containing time-linked dialogues.
    • Combined PDF merges slides and transcripts per page for study-ready notes.
  6. Delivery: Artifacts are written to data/processed/<video_id>/ and optionally surfaced via the FastAPI endpoint.
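
The alignment step (step 4) can be illustrated with a small, dependency-free sketch. It is not the project's implementation: the names Span, overlap_ratio, cosine, and align are hypothetical, the equal weighting of the two scores is an assumption, and word-count vectors stand in for the embeddings the pipeline would use.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds
    end: float    # seconds
    text: str     # transcript text, or OCR text for a slide


def overlap_ratio(seg, slide):
    """Fraction of the segment's duration that falls inside the slide's time window."""
    inter = min(seg.end, slide.end) - max(seg.start, slide.start)
    dur = seg.end - seg.start
    return max(0.0, inter) / dur if dur > 0 else 0.0


def cosine(a, b):
    """Cosine similarity of word-count vectors (a stand-in for real embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def align(segments, slides, weight=0.5):
    """Return, for each transcript segment, the index of the best-scoring slide."""
    if not slides:
        raise ValueError("at least one slide is required")
    result = []
    for seg in segments:
        scores = [
            weight * overlap_ratio(seg, s) + (1 - weight) * cosine(seg.text, s.text)
            for s in slides
        ]
        result.append(max(range(len(slides)), key=scores.__getitem__))
    return result
```

Blending the two signals means a segment that straddles a slide change can still be assigned by content, while segments with little lexical overlap fall back on timing.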

Component Details

  1. Multi-source ingestion (local path, YouTube URL, Google Drive URL)
  2. Media extraction via FFmpeg (audio WAV + timestamped frames)
  3. GPU-friendly AI models:
    • faster-whisper (configurable checkpoint)
    • PaddleOCR for slide OCR
    • Optional VLM (BLIP / LLaVA via 🤗 Transformers) for dense slide captions
  4. PDF creation using ReportLab + PyPDF
  5. Slide-by-slide synchronization with transcript blocks
  6. FastAPI service & Typer CLI orchestrating the workflow
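
For the media extraction component, the two FFmpeg invocations might look roughly like the commands built below. This is a sketch under assumptions: the helper names, the 16 kHz mono WAV settings, the 5-second frame interval, and the frame_%06d.jpg naming are illustrative choices, not values taken from the project.

```python
from pathlib import Path


def audio_cmd(video, wav, ffmpeg="ffmpeg", sample_rate=16000):
    """Command that drops the video stream and writes mono 16-bit PCM WAV."""
    return [
        ffmpeg, "-y", "-i", str(video),
        "-vn",                    # no video in the output
        "-ac", "1",               # mono
        "-ar", str(sample_rate),  # sample rate in Hz (assumed value)
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        str(wav),
    ]


def frames_cmd(video, out_dir, ffmpeg="ffmpeg", every_seconds=5):
    """Command that writes one frame every `every_seconds` seconds."""
    pattern = Path(out_dir) / "frame_%06d.jpg"
    return [
        ffmpeg, "-y", "-i", str(video),
        "-vf", f"fps=1/{every_seconds}",  # evenly spaced frames
        str(pattern),
    ]


def frame_timestamp(index, every_seconds=5):
    """Approximate timestamp in seconds of the index-th frame (1-based numbering)."""
    return (index - 1) * every_seconds
```

Each list can be passed to subprocess.run(cmd, check=True). Because the frames are evenly spaced, a frame's approximate timestamp follows from its index, which is what the later alignment step needs.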

See docs/models.md for recommended checkpoints and VRAM needs.

Configuration

All runtime settings are driven by a Pydantic Settings model and can be overridden via environment variables:

  • Model selection:
    • MODELS__whisper_model – e.g. small, medium, large-v3 (default: medium).
    • MODELS__vlm_model – set to a HF model id (e.g. Salesforce/blip-image-captioning-base) to enable captions, or "none" (default) to skip VLM entirely.
    • MODELS__device – cuda or cpu (default: cuda, will fall back to CPU if GPU is not available).
  • Storage paths:
    • PATHS__root – project root (default: cwd).
    • PATHS__raw_dir, PATHS__processed_dir, PATHS__temp_dir – override data directories if needed.
  • Binaries:
    • FFMPEG_BINARY – override the ffmpeg executable name/path if it is not on PATH.

By default the system runs with VLM captions off, uses ffmpeg from your PATH, and writes results under data/processed/<video_id>/.
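
The double-underscore names above follow the nested-delimiter convention, where SECTION__field overrides one field inside a settings section. The standard-library sketch below only illustrates that mapping; the project's actual Pydantic model is not shown in this README, so DEFAULTS and load_settings are hypothetical, and only the defaults the README states are filled in.

```python
import os

# Defaults as documented above; path defaults are left empty because the
# README does not list them beyond "root defaults to the current directory".
DEFAULTS = {
    "models": {"whisper_model": "medium", "vlm_model": "none", "device": "cuda"},
    "paths": {},
    "ffmpeg_binary": "ffmpeg",
}


def load_settings(env=None):
    """Overlay SECTION__field environment variables onto a copy of DEFAULTS."""
    env = os.environ if env is None else env
    settings = {k: (dict(v) if isinstance(v, dict) else v) for k, v in DEFAULTS.items()}
    for name, value in env.items():
        section, sep, field = name.partition("__")
        key = section.lower()
        if sep and isinstance(settings.get(key), dict):
            # e.g. MODELS__whisper_model=small -> settings["models"]["whisper_model"]
            settings[key][field.lower()] = value
        elif name == "FFMPEG_BINARY":
            settings["ffmpeg_binary"] = value
    return settings
```

For example, load_settings({"MODELS__whisper_model": "small"}) changes only the Whisper checkpoint and leaves every other value at its default.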
