Video content distillation CLI tool

vidistill

A toolset to download, parse and analyse social-media video content.

Install

uv sync
uv run playwright install chromium   # one-time: downloads ~150 MB browser binary

playwright install chromium is required by the collect subcommand, which drives a headless browser against Bilibili pages.

Environment

Copy .env.example to .env and fill in:

  • OPENROUTER_API_KEY — LLM provider for analyse.
  • VOLC_ACCESS_KEY_ID / VOLC_SECRET_ACCESS_KEY / VOLC_TOS_BUCKET — Volcengine TOS + 豆包 ASR credentials used by the default transcriber.
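
Before a long run, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch, not part of vidistill: the helper name missing_env_vars is hypothetical, and it assumes the values from .env have already been loaded into the process environment.

```python
import os

# Variable names come from the list above; the helper itself is hypothetical.
REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "VOLC_ACCESS_KEY_ID",
    "VOLC_SECRET_ACCESS_KEY",
    "VOLC_TOS_BUCKET",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() with no argument inspects os.environ; an empty list means every credential is present.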

CLI

vidistill analyse <input>  [--output-dir …] [--model …] [--transcriber …] [--diarize]
vidistill collect <page-path> [--rule …] [--num …] [--sort …] [--max-scrolls …] [--headful]

vidistill analyse

Runs the full pipeline on a single video: download → audio extraction → transcription → LLM analysis → markdown report.

vidistill analyse https://www.bilibili.com/video/BV1xx411c7mu
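
Because analyse handles one video per invocation, processing several videos means calling the CLI repeatedly. A rough sketch of a Python wrapper follows. The function names are hypothetical, and it assumes that --output-dir takes its value as a separate argument and that --diarize is a plain switch, as the synopsis above suggests.

```python
import subprocess


def build_analyse_cmd(video, output_dir=None, diarize=False):
    """Build the argument list for one `vidistill analyse` call."""
    cmd = ["vidistill", "analyse", video]
    if output_dir is not None:
        # Assumption: the flag and its value are passed as two arguments.
        cmd += ["--output-dir", output_dir]
    if diarize:
        cmd.append("--diarize")
    return cmd


def analyse_all(videos, output_dir=None):
    """Run the pipeline on each video in turn, stopping at the first failure."""
    for video in videos:
        subprocess.run(build_analyse_cmd(video, output_dir), check=True)
```

Passing the command as a list avoids shell quoting problems with URLs that contain ? or &.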

vidistill collect

Scrapes a Bilibili listing page, filters videos by a rule expression, and writes the top-N matches as JSON. Supported page types:

Type      URL shape
homepage  www.bilibili.com/ (the recommended feed)
user      space.bilibili.com/<uid>
search    search.bilibili.com/all?keyword=…
channel   www.bilibili.com/c/<slug>
popular   www.bilibili.com/v/popular/<column> (all, weekly, …)
topic     www.bilibili.com/v/topic/detail?topic_id=<id>

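
The URL shapes in the table are distinct enough to be told apart by host and path prefix. The sketch below shows one way that classification might look. It is an illustration only, not vidistill's actual detection logic, and classify_page is a hypothetical name.

```python
from urllib.parse import urlparse


def classify_page(url):
    """Map a Bilibili listing URL to a page type from the table, or None."""
    # Accept scheme-less input such as "space.bilibili.com/12345".
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.rstrip("/")

    if host == "space.bilibili.com" and path.strip("/"):
        return "user"
    if host == "search.bilibili.com" and path.startswith("/all"):
        return "search"
    if host == "www.bilibili.com":
        if path == "":
            return "homepage"
        if path.startswith("/c/"):
            return "channel"
        if path.startswith("/v/popular"):
            return "popular"
        if path.startswith("/v/topic/detail"):
            return "topic"
    return None
```

A single-video URL such as www.bilibili.com/video/<id> matches none of the prefixes and yields None, which is the kind of input analyse expects rather than collect.
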
vidistill collect https://space.bilibili.com/12345 \
    --rule='play>10000 and like>=500' --num=20

vidistill collect "https://search.bilibili.com/all?keyword=LLM" \
    -r 'play>50000 and duration<=600' -n 10 --sort=play

Rule DSL (--rule) supports:

  • Comparisons: == != > >= < <=
  • Boolean operators: and, or, not
  • Parentheses for grouping
  • Fields: play, like, coin, favorite (alias star), share, danmaku, comment (alias reply), duration (seconds), publish_days_ago, title, author

If a video lacks a field that the rule references, the whole rule evaluates to False for that video.
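
To make the semantics concrete, here is a small evaluator sketch for rules of this shape. It is not vidistill's parser. It borrows Python's own ast module because the rule syntax happens to be valid Python expression syntax, and it assumes string values are written as quoted Python literals. The names matches and MissingField are hypothetical.

```python
import ast
import operator

ALIASES = {"star": "favorite", "reply": "comment"}
COMPARATORS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
}


class MissingField(Exception):
    """Raised when a rule references a field the video does not have."""


def _eval(node, video):
    if isinstance(node, ast.BoolOp):
        # Evaluate every operand (no short-circuit) so that a missing
        # field anywhere in the rule is always detected.
        values = [_eval(v, video) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _eval(node.operand, video)
    if isinstance(node, ast.Compare):
        left, result = _eval(node.left, video), True
        for op, right_node in zip(node.ops, node.comparators):
            if type(op) not in COMPARATORS:
                raise ValueError("unsupported comparison operator in rule")
            right = _eval(right_node, video)
            if not COMPARATORS[type(op)](left, right):
                result = False
            left = right
        return result
    if isinstance(node, ast.Name):
        field = ALIASES.get(node.id, node.id)
        if video.get(field) is None:
            raise MissingField(field)
        return video[field]
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
        return node.value
    raise ValueError("unsupported syntax in rule")


def matches(rule, video):
    """Return True if the video dict satisfies the rule, False otherwise."""
    tree = ast.parse(rule.strip(), mode="eval")
    try:
        return bool(_eval(tree.body, video))
    except MissingField:
        return False
```

With this sketch, a video that has play but no coin value fails the rule play>10000 or coin>5 even though the first clause is true, which mirrors the missing-field behaviour described above.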

Output lands in <output-dir>/collect_<page-type>_<identifier>_<ts>.json.
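
Downstream scripts often need the newest result file. The sketch below finds it by the collect_ prefix from the pattern above. Since the format of <ts> is not specified here, it orders files by modification time instead of parsing the name. The JSON layout is also undocumented in this README, so the loader returns whatever json.load produces. Both function names are hypothetical.

```python
import json
from pathlib import Path


def latest_collect_output(output_dir):
    """Return the most recently modified collect_*.json file, or None."""
    files = list(Path(output_dir).glob("collect_*.json"))
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


def load_collect_output(path):
    """Parse one collect result file; the structure is not assumed."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```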

Bilibili login & storage state

Bilibili's anti-bot layer rejects guest scraping, so collect needs a logged-in session. The --storage-state flag points to a Playwright storage-state file (cookies + localStorage) that persists the login across runs. It defaults to default.storage_state in the current directory.

First run (no login yet):

vidistill collect https://space.bilibili.com/<uid> -n 10

If default.storage_state doesn't exist, a browser window opens. Log in to Bilibili there. When you close the browser, storage state is saved automatically and scraping proceeds. Subsequent runs reuse that state.
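
The flow above maps onto Playwright's storage-state API roughly as follows. This is a sketch under assumptions, not vidistill's code: the function names and LOGIN_URL are hypothetical, and instead of detecting a closed browser window it waits for Enter in the terminal, because the state has to be saved while the browser context is still open.

```python
from pathlib import Path

LOGIN_URL = "https://www.bilibili.com/"  # assumed landing page for logging in


def needs_login(state_path):
    """True when no saved storage state exists yet."""
    return not Path(state_path).exists()


def open_logged_in_context(playwright, state_path="default.storage_state"):
    """Return (browser, context) with a saved Bilibili login applied.

    `playwright` is the object yielded by `sync_playwright()`.
    """
    if needs_login(state_path):
        # First run: show a real window so the user can log in by hand.
        browser = playwright.chromium.launch(headless=False)
        context = browser.new_context()
        context.new_page().goto(LOGIN_URL)
        input("Log in to Bilibili in the browser window, then press Enter here... ")
        # Save cookies + localStorage while the context is still open.
        context.storage_state(path=str(state_path))
        browser.close()
    # Later runs (and the rest of the first run): headless, with saved state.
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context(storage_state=str(state_path))
    return browser, context
```

To try it, import sync_playwright from playwright.sync_api, open it in a with block, and pass the yielded object to open_logged_in_context. The caller is responsible for closing the returned browser.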

Explicit path:

vidistill collect <url> --storage-state ~/my-session.storage_state -n 10

Development

uv run pytest
