Pipeline to transform webinar videos into searchable slide and transcript PDFs.
Video Lectures to Searchable PDFs
Pipeline for turning webinar-style videos into searchable lecture artifacts:
- OCR-driven slide PDF
- Whisper transcript PDF
- Slide-aligned combined PDF
Requirements
- Python: 3.10+
- System binaries:
  - ffmpeg (for audio + frame extraction)
- Hardware:
  - CPU-only is supported (the default device setting falls back to CPU when no GPU is found).
  - GPU (CUDA) is recommended for faster Whisper + OCR if available.
On Ubuntu/Debian, install FFmpeg with:
sudo apt-get update
sudo apt-get install -y ffmpeg
Quickstart
python -m venv .venv && source .venv/bin/activate
pip install -e .
vlsp --help
CLI Usage
vlsp run --type local --source /path/to/webinar.mp4
vlsp run --type youtube --source https://youtu.be/xxxx
vlsp run --type gdrive --source https://drive.google.com/file/d/ID/view
Outputs land in data/processed/<video_id>/.
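For batch runs, the three invocations above can be generated from a list of sources. The following is a minimal sketch, not part of the package: `infer_source_type`, `build_command`, and `run_batch` are hypothetical helper names, and the hostname heuristic is an assumption about which URLs should map to which `--type` value. Only the `vlsp run --type ... --source ...` form comes from the usage shown above.

```python
import shlex
import subprocess
from urllib.parse import urlparse

# Hostnames treated as YouTube links (assumed list; extend as needed).
YOUTUBE_HOSTS = {"youtu.be", "youtube.com", "www.youtube.com"}


def infer_source_type(source: str) -> str:
    """Guess the --type value from a source string (heuristic, not part of vlsp)."""
    host = (urlparse(source).hostname or "").lower()
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if host == "drive.google.com":
        return "gdrive"
    # Anything without a recognised hostname is treated as a local path.
    return "local"


def build_command(source: str) -> list[str]:
    """Build the argument list for one `vlsp run` invocation."""
    return ["vlsp", "run", "--type", infer_source_type(source), "--source", source]


def run_batch(sources: list[str], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for source in sources:
        cmd = build_command(source)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_batch(["/path/to/webinar.mp4", "https://youtu.be/xxxx"])` only prints the commands; pass `dry_run=False` to execute them one after another.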
API Server
uvicorn app.server:app --reload --port 8080
POST payload:
{
  "source_type": "youtube",
  "source": "https://youtu.be/..."
}
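A client for this payload could look roughly like the sketch below, using only the standard library. The route is not stated in this README, so `ENDPOINT = "/process"` is a placeholder assumption; FastAPI serves interactive docs at `/docs` by default, which is one place to look up the real path. The sketch also assumes the server replies with JSON.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # port taken from the uvicorn command above
ENDPOINT = "/process"               # hypothetical route; confirm against the server's /docs page


def build_request(source_type: str, source: str) -> urllib.request.Request:
    """Build a JSON POST request carrying the payload shown above."""
    body = json.dumps({"source_type": source_type, "source": source}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(source_type: str, source: str, timeout: float = 30.0):
    """Send the request and decode the reply, assuming a JSON response body."""
    request = build_request(source_type, source)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Processing a long video may take longer than the default timeout used here, so raise `timeout` if the server handles the request synchronously.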
Architecture Overview
flowchart LR
subgraph Ingestion
SRC[(Video Source)]
SRC -->|local / youtube / gdrive| DL[Downloader]
end
DL --> FF[FFmpeg Extractor]
FF -->|audio| WHISPER[faster-whisper]
FF -->|frames| OCR[PaddleOCR and optional VLM captions]
WHISPER --> ALIGN[Slide/Text Aligner]
OCR --> ALIGN
ALIGN --> PDFGEN[ReportLab / PyPDF Builder]
PDFGEN --> OUT[Searchable PDFs]
OUT -->|persist| STORE[data/processed/<video_id>]
ALIGN -->|serve| API[(FastAPI + Typer CLI)]
The CLI (vlsp) and FastAPI server share the same pipeline, so you can drive the workflow via command line, HTTP, or by importing the pipeline directly in Python.
End-to-End Workflow
- Ingestion: Video is pulled from the specified target (local, youtube, or gdrive). Metadata such as ID, title, and duration is captured for downstream file naming.
- Media Extraction: FFmpeg splits the video into a high-quality WAV track and evenly spaced video frames with timestamps.
- Speech + Slide Text Understanding:
  - faster-whisper produces bilingual-friendly transcripts and per-segment timestamps.
  - PaddleOCR extracts slide text from frames.
  - (Optional) A vision-language model (e.g. BLIP / LLaVA) can generate rich slide captions; this is disabled by default to keep VRAM usage modest.
- Alignment: Transcript chunks are matched to their corresponding slide frames using temporal overlap and cosine similarity on embeddings.
- PDF Generation:
  - OCR-driven slide PDF for crisp slide reproduction with searchable overlays.
  - Whisper transcript PDF containing time-linked dialogues.
  - Combined PDF merges slides and transcripts per page for study-ready notes.
- Delivery: Artifacts are written to data/processed/<video_id>/ and optionally surfaced via the FastAPI endpoint.
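The alignment step can be illustrated with a small, dependency-free sketch. It is not the package's implementation: the `Span` type, the equal weighting of the two signals, and the toy embeddings are all assumptions made for illustration. The README only states that temporal overlap and cosine similarity on embeddings are combined, not how.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    """A transcript segment or a slide interval with a text embedding (hypothetical type)."""
    start: float
    end: float
    embedding: list[float]


def overlap_ratio(segment: Span, slide: Span) -> float:
    """Fraction of the segment's duration that falls inside the slide interval."""
    intersection = min(segment.end, slide.end) - max(segment.start, slide.start)
    duration = segment.end - segment.start
    return max(0.0, intersection) / duration if duration > 0 else 0.0


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(segments: list[Span], slides: list[Span], weight: float = 0.5) -> list[int]:
    """Return, for each segment, the index of the best-scoring slide.

    The score is weight * overlap + (1 - weight) * cosine; the 0.5 default is an
    arbitrary choice for this sketch.
    """
    if not slides:
        raise ValueError("at least one slide is required")
    assignments = []
    for segment in segments:
        scores = [
            weight * overlap_ratio(segment, slide)
            + (1.0 - weight) * cosine(segment.embedding, slide.embedding)
            for slide in slides
        ]
        assignments.append(max(range(len(slides)), key=scores.__getitem__))
    return assignments
```

A segment that straddles a slide change overlaps both slides equally in time, so the embedding term decides which slide it is attached to. That is the case where combining the two signals is useful.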
Component Details
- Multi-source ingestion (local path, YouTube URL, Google Drive URL)
- Media extraction via FFmpeg (audio WAV + timestamped frames)
- GPU-friendly AI models:
  - faster-whisper (configurable checkpoint)
  - PaddleOCR for slide OCR
  - Optional VLM (BLIP / LLaVA via 🤗 Transformers) for dense slide captions
- PDF creation using ReportLab + PyPDF
- Slide-by-slide synchronization with transcript blocks
- FastAPI service & Typer CLI orchestrating the workflow
See docs/models.md for recommended checkpoints and VRAM needs.
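As an illustration of the FFmpeg media extraction listed above, the sketch below builds the two command lines such a step might use. The flags are standard FFmpeg options, but the specific values (16 kHz mono PCM, one frame every five seconds, the `frame_%06d.jpg` naming) are assumptions chosen for this example and are not taken from the package. The `FFMPEG_BINARY` override mirrors the environment variable described under Configuration.

```python
import os
from pathlib import Path


def ffmpeg_binary() -> str:
    """Use FFMPEG_BINARY when set, otherwise rely on ffmpeg being on PATH."""
    return os.environ.get("FFMPEG_BINARY", "ffmpeg")


def audio_command(video: str, wav: str, sample_rate: int = 16000) -> list[str]:
    """Command that drops the video stream and writes mono 16-bit PCM WAV."""
    return [
        ffmpeg_binary(), "-y", "-i", str(video),
        "-vn",                      # ignore the video stream
        "-ac", "1",                 # mono
        "-ar", str(sample_rate),    # resample (16 kHz is an assumed default)
        "-c:a", "pcm_s16le",        # uncompressed 16-bit PCM
        str(wav),
    ]


def frames_command(video: str, out_dir: str, every_seconds: float = 5.0) -> list[str]:
    """Command that samples evenly spaced frames using the fps filter."""
    pattern = Path(out_dir) / "frame_%06d.jpg"
    rate = 1.0 / every_seconds      # frames per second, e.g. 0.2 for one frame per 5 s
    return [ffmpeg_binary(), "-y", "-i", str(video), "-vf", f"fps={rate:.6g}", str(pattern)]


def frame_timestamp(index: int, every_seconds: float = 5.0) -> float:
    """Approximate timestamp in seconds of the 1-based frame index."""
    return (index - 1) * every_seconds
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Deriving timestamps from the frame index is only approximate; a stricter approach would read the presentation timestamps that FFmpeg reports for each frame.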
Configuration
All runtime settings are driven by a Pydantic Settings model and can be overridden via environment variables:
- Model selection:
  - MODELS__whisper_model – e.g. small, medium, large-v3 (default: medium).
  - MODELS__vlm_model – set to a HF model id (e.g. Salesforce/blip-image-captioning-base) to enable captions, or "none" (default) to skip VLM entirely.
  - MODELS__device – cuda or cpu (default: cuda; falls back to CPU if no GPU is available).
- Storage paths:
  - PATHS__root – project root (default: cwd).
  - PATHS__raw_dir, PATHS__processed_dir, PATHS__temp_dir – override data directories if needed.
- Binaries:
  - FFMPEG_BINARY – override the ffmpeg executable name/path if it is not on PATH.
By default the system runs with VLM captions off, uses ffmpeg from your PATH, and writes results under data/processed/<video_id>/.
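The double underscore in names such as MODELS__whisper_model suggests a nested-settings delimiter of the kind Pydantic Settings supports. The stdlib-only sketch below shows how such variables would group into nested sections. It illustrates the naming scheme under that assumption and is not the package's settings code; `nested_settings` is a hypothetical helper, and the lowercased group names are a choice made for this example.

```python
import os


def nested_settings(environ, groups=("MODELS", "PATHS"), delimiter="__"):
    """Group GROUP__field variables into {"group": {"field": value}} dictionaries."""
    result = {}
    for key, value in environ.items():
        group, separator, field = key.partition(delimiter)
        # Keep only variables of the form GROUP__field for the known groups.
        if separator and field and group in groups:
            result.setdefault(group.lower(), {})[field] = value
    return result


if __name__ == "__main__":
    # Overrides set in the shell before launching, for example:
    #   export MODELS__whisper_model=small
    #   export MODELS__device=cpu
    print(nested_settings(os.environ))
```

Variables without the delimiter, such as FFMPEG_BINARY, are top-level settings and are skipped by this helper.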