clipyard

A lightweight wrapper around yt-dlp that aims to make downloading large video datasets easier and faster through a simple CLI.

Features

  • Multiple Input Sources: Support for text files, CSV files, and HuggingFace datasets. For files, you can provide the actual file or a publicly accessible URL
  • Input Types: Provide either a list of IDs plus the platform (typically YouTube or Vimeo), or a list of URLs
  • Parallel Downloads: Configurable parallel workers for fast batch downloads
  • Metadata Tracking: Saves download summaries and failure reasons, so you can quickly inspect results and easily re-launch runs
  • Good Defaults: Takes the pain out of configuring yt-dlp by putting good defaults in place. Videos are downloaded with the H.264 video and M4A audio codecs, which typically requires no re-encoding. This has two benefits: fast decoding, and the ability to preview files quickly from the filesystem (see the sketch after this list)
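
For reference, here is a rough yt-dlp invocation approximating these defaults. This is an illustrative sketch only, not the exact format string clipyard passes to yt-dlp:

# Approximation of clipyard's defaults (illustrative, not clipyard's exact flags):
# prefer H.264 (avc1) video up to 720p plus M4A audio, merged into an .mp4
yt-dlp \
  -f "bestvideo[vcodec^=avc1][height<=720]+bestaudio[ext=m4a]/best[height<=720]" \
  --merge-output-format mp4 \
  "https://www.youtube.com/watch?v=VIDEO_ID"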

PS: This library is in early development, and is an evolution of this script.

Installation

pip install clipyard
pip install clipyard[datasets]  # For HuggingFace dataset support

clipyard -h
clipyard download -h

Or, if you use uv (recommended):

uv tool install clipyard[datasets]
uv tool run clipyard download -h

Examples (w/ Best Practices)

Refer to this doc to set up cookies for downloading. Note that some of the args presented here are not mandatory, but they are the suggested way to use this tool for the best quality of life.

SF20K Dataset (HuggingFace)

This command downloads the test_expert split of the SF20K dataset at 720p (the default resolution).

uv tool run clipyard download \
  --input-type huggingface \
  --hf-dataset "rghermi/sf20k" \
  --hf-split test_expert \
  --id-column video_id \
  --url-column video_url \
  --output-dir /mnt/DataSSD/datasets/sf20k/test_expert/720p/ \
  --metadata-dir /mnt/DataSSD/datasets/sf20k/test_expert/metadata/ \
  --cookies cookies.txt

VUE-TR-V2 Dataset (Text File)

The first command below downloads the V2 video list; the second downloads the original V1 list into the same video directory, with metadata kept separate.

uv tool run clipyard download \
  --input-type txt \
  --input "https://raw.githubusercontent.com/bytedance/vidi/refs/heads/main/VUE_TR_V2/video_id.txt" \
  --output-dir /mnt/DataSSD/datasets/vue-tr-v2/videos/ \
  --metadata-dir /mnt/DataSSD/datasets/vue-tr-v2/metadata/ \
  --cookies cookies.txt

uv tool run clipyard download \
  --input-type txt \
  --input "https://raw.githubusercontent.com/bytedance/vidi/refs/heads/main/VUE_TR/video_id.txt" \
  --output-dir /mnt/DataSSD/datasets/vue-tr-v2/videos/ \
  --metadata-dir /mnt/DataSSD/datasets/vue-tr-v2/metadata-v1/ \
  --cookies cookies.txt

Re-Launching Runs

If you ran a download with --metadata-dir, a download_config.json file is saved. You can relaunch the same download with:

clipyard download --config ./metadata/download_config.json

You can also override specific settings:

clipyard download --config ./metadata/download_config.json --workers 8  # More workers
clipyard download --config ./metadata/download_config.json --output-dir ./new-downloads  # Different output dir
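
The exact schema of download_config.json is not documented here; assuming its keys simply mirror the CLI flags, it might look roughly like this (all field names and values are illustrative):

{
  "input_type": "txt",
  "input": "videos.txt",
  "output_dir": "./downloads",
  "resolution": 720,
  "workers": 4,
  "cookies": "cookies.txt"
}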

Command-Line Arguments

Required Arguments

  • --input-type: Type of input source (required unless --config is provided)

    • Options: txt, csv, huggingface
  • --output-dir: Directory to save downloaded videos (required unless --config is provided)

Config File

  • --config: Path to a saved config JSON file (from a previous run's --metadata-dir)
    • When provided, loads settings from the config file
    • CLI arguments can override config file values
    • Allows relaunching a download with the same settings

Input Arguments

  • --input: Input file path (for txt/csv) or dataset name (for huggingface)
  • --platform: Default platform for video IDs when not specified in URLs
    • Options: youtube, vimeo
    • Default: youtube
  • --id-column: Column name containing video IDs
    • Default: video_id
    • At least one of --id-column or --url-column must be provided (see the example after this list)
  • --url-column: Column name containing video URLs (optional)
    • If only URL column is provided: video IDs are extracted from URLs
    • If only ID column is provided: URLs are built from video IDs
    • If both are provided: URLs are preferred (IDs extracted from URLs)
  • --hf-dataset: HuggingFace dataset identifier (e.g., "rghermi/sf20k")
  • --hf-split: HuggingFace dataset split to use
    • Default: train
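
For example, a hypothetical run over a local CSV with non-default column names (videos.csv, clip_id, and clip_url are illustrative):

clipyard download \
  --input-type csv \
  --input videos.csv \
  --id-column clip_id \
  --url-column clip_url \
  --output-dir ./downloads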

Output Arguments

  • --metadata-dir: Directory to save download metadata
    • Creates three files:
      • download_summary.csv: All download results (success, failed, skipped)
      • failed_videos.csv: Only failed downloads (for retry)
      • download_config.json: Configuration for re-running the download

Download Arguments

  • --resolution: Video resolution to download
    • Options: 144, 240, 360, 480, 720, 1080
    • Default: 720
  • --workers: Number of parallel workers for downloading videos
    • Default: 4
    • Recommended: 4-8 for faster downloads (see the tuning example after this list)
  • --threads: Number of threads per download (passed to yt-dlp --concurrent-fragments)
    • Default: 1
    • Recommended: 2-4 for faster individual downloads
  • --max-videos: Maximum number of videos to download (useful for testing)
    • If not specified, downloads all videos
  • --cookies: Path to cookies file (.txt) for yt-dlp
    • Required for some restricted/private videos
  • --replace-existing: Replace videos if they've already been downloaded previously
  • --silence-errors: Silence yt-dlp errors and warnings
  • --sleep-interval: Number of seconds to sleep before each download (passed to yt-dlp)
    • Default: 5
  • --max-sleep-interval: Maximum number of seconds to sleep before each download (passed to yt-dlp)
    • Default: 10
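
Putting these together, a hypothetical tuned run (paths are illustrative; every flag is documented above):

clipyard download \
  --input-type txt \
  --input videos.txt \
  --output-dir ./downloads \
  --workers 8 \
  --threads 4 \
  --max-videos 100 \
  --cookies cookies.txt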

Input Formats

Text files: One video ID or URL per line. Empty lines and lines starting with # are ignored.
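
For example, a minimal videos.txt (IDs and URL are placeholders):

# comments are ignored, as are blank lines
abc123XYZ_0
https://www.youtube.com/watch?v=def456UVW_1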

CSV files: At least one of the following columns must be provided (see the example after this list):

  • Video ID column (default: video_id): IDs are used to construct URLs
  • Video URL column: IDs are extracted from URLs
  • Both columns: URLs are preferred (IDs extracted from URLs)
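
For example (values are placeholders):

video_id,video_url
abc123XYZ_0,https://www.youtube.com/watch?v=abc123XYZ_0
def456UVW_1,

Here the first row's URL is used directly, while the second row's URL is built from its ID.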

HuggingFace datasets: Specify dataset name and split. At least one of the following must be provided:

  • Video ID column (default: video_id): IDs are used to construct URLs
  • Video URL column: IDs are extracted from URLs
  • Both columns: URLs are preferred (IDs extracted from URLs)

Python API

You can also use clipyard programmatically:

from pathlib import Path

from clipyard import (
    parse_txt_file,
    download_videos,
    DownloadConfig,
    save_summary_csv,
)

# Parse input
sources = parse_txt_file(Path("videos.txt"), platform="youtube")

# Configure download
config = DownloadConfig(
    output_dir=Path("./downloads"),
    resolution=1080,
    workers=4,
    threads=2,
)

# Download
results = download_videos(sources, config)

# Save summary
save_summary_csv(results, Path("summary.csv"))

Context For LLMs

See development.md for developer documentation tailored for LLM use.
