clipyard

A lightweight wrapper around yt-dlp, with good defaults, that aims to make downloading large video datasets easier and faster through a simple CLI.
Features
- Multiple Input Sources: Support for text files, CSV files, and HuggingFace datasets. For files, you can provide a local path or a publicly accessible URL.
- Flexible Input Types: Provide either a list of video IDs plus a platform (typically YouTube or Vimeo), or a list of URLs.
- Parallel Downloads: Configurable parallel workers for fast batch downloads.
- Metadata Tracking: Saves download summaries and failure reasons, so you can quickly inspect results and easily re-launch runs.
- Good Defaults: Takes the pain out of configuring yt-dlp by putting good defaults in place. Videos are downloaded with h264 video and m4a audio codecs (typically no re-encoding required). This has two benefits: fast decoding, and the ability to preview files quickly from the filesystem.
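To give a sense of what those defaults save you from typing, a roughly equivalent manual yt-dlp invocation might look like the following. This is an illustrative assumption, not the exact format string clipyard uses:

```shell
# Illustrative only: prefer h264 (avc1) video up to 720p plus m4a audio,
# falling back to the best available mp4. clipyard's actual selector may differ.
yt-dlp -f "bestvideo[vcodec^=avc1][height<=720]+bestaudio[ext=m4a]/best[ext=mp4]" "VIDEO_URL"
```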
PS: This library is in early development, and is an evolution of this script
Installation

```shell
pip install clipyard
pip install clipyard[datasets]  # For HuggingFace dataset support

clipyard -h
clipyard download -h
```

Or, if you use uv (recommended):

```shell
uv tool install clipyard[datasets]
uv tool run clipyard download -h
```
Examples (w/ Best Practices)

Refer to this doc to set up cookies for downloading. Note that some of the args presented here are not mandatory, but they are the suggested way to use this tool for the best quality of life.

SF20K Dataset (HuggingFace)

This command downloads the test_expert split of the SF20K dataset in 720p.
```shell
uv tool run clipyard download \
    --input-type huggingface \
    --hf-dataset "rghermi/sf20k" \
    --hf-split test_expert \
    --id-column video_id \
    --url-column video_url \
    --output-dir /mnt/DataSSD/datasets/sf20k/test_expert/720p/ \
    --metadata-dir /mnt/DataSSD/datasets/sf20k/test_expert/metadata/ \
    --cookies cookies.txt
```
VUE-TR Datasets (Text File via URL)

```shell
uv tool run clipyard download \
    --input-type txt \
    --input "https://raw.githubusercontent.com/bytedance/vidi/refs/heads/main/VUE_TR_V2/video_id.txt" \
    --output-dir /mnt/DataSSD/datasets/vue-tr-v2/videos/ \
    --metadata-dir /mnt/DataSSD/datasets/vue-tr-v2/metadata/ \
    --cookies cookies.txt
```

The same pattern works for the original VUE_TR split:

```shell
uv tool run clipyard download \
    --input-type txt \
    --input "https://raw.githubusercontent.com/bytedance/vidi/refs/heads/main/VUE_TR/video_id.txt" \
    --output-dir /mnt/DataSSD/datasets/vue-tr-v2/videos/ \
    --metadata-dir /mnt/DataSSD/datasets/vue-tr-v2/metadata-v1/ \
    --cookies cookies.txt
```
Re-Launching Runs

If you ran a download with `--metadata-dir`, a `download_config.json` file is saved. You can relaunch the same download with:

```shell
clipyard download --config ./metadata/download_config.json
```

You can also override specific settings:

```shell
clipyard download --config ./metadata/download_config.json --workers 8                   # More workers
clipyard download --config ./metadata/download_config.json --output-dir ./new-downloads  # Different output dir
```
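The exact keys in `download_config.json` depend on the clipyard version; purely as an illustration, a saved config mirroring the CLI flags might look something like:

```json
{
  "input_type": "txt",
  "input": "videos.txt",
  "output_dir": "./downloads",
  "resolution": 720,
  "workers": 4,
  "threads": 1,
  "cookies": "cookies.txt"
}
```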
Command-Line Arguments

Required Arguments
- `--input-type`: Type of input source (required unless `--config` is provided)
  - Options: `txt`, `csv`, `huggingface`
- `--output-dir`: Directory to save downloaded videos (required unless `--config` is provided)

Config File
- `--config`: Path to a saved config JSON file (from a previous run's `--metadata-dir`)
  - When provided, loads settings from the config file
  - CLI arguments can override config file values
  - Allows relaunching a download with the same settings
Input Arguments
- `--input`: Input file path (for `txt`/`csv`) or dataset name (for `huggingface`)
- `--platform`: Default platform for video IDs when not specified in URLs
  - Options: `youtube`, `vimeo`
  - Default: `youtube`
- `--id-column`: Column name containing video IDs
  - Default: `video_id`
  - At least one of `--id-column` or `--url-column` must be provided
- `--url-column`: Column name containing video URLs (optional)
  - If only the URL column is provided: video IDs are extracted from URLs
  - If only the ID column is provided: URLs are built from video IDs
  - If both are provided: URLs are preferred (IDs extracted from URLs)
- `--hf-dataset`: HuggingFace dataset identifier (e.g., `"rghermi/sf20k"`)
- `--hf-split`: HuggingFace dataset split to use
  - Default: `train`
Output Arguments
- `--metadata-dir`: Directory to save download metadata
  - Creates three files:
    - `download_summary.csv`: All download results (success, failed, skipped)
    - `failed_videos.csv`: Only failed downloads (for retry)
    - `download_config.json`: Configuration for re-running the download
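Since `failed_videos.csv` is itself a CSV, one convenient retry pattern (assuming it keeps the same ID/URL columns as your input, and with hypothetical paths) is to feed it back in as a `csv` input so only the failures are re-attempted:

```shell
# Retry only the videos that failed in a previous run (paths are illustrative)
clipyard download \
    --input-type csv \
    --input ./metadata/failed_videos.csv \
    --output-dir ./downloads \
    --metadata-dir ./metadata-retry
```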
Download Arguments
- `--resolution`: Video resolution to download
  - Options: `144`, `240`, `360`, `480`, `720`, `1080`
  - Default: `720`
- `--workers`: Number of parallel workers for downloading videos
  - Default: `4`
  - Recommended: 4-8 for faster downloads
- `--threads`: Number of threads per download (passed to yt-dlp `--concurrent-fragments`)
  - Default: `1`
  - Recommended: 2-4 for faster individual downloads
- `--max-videos`: Maximum number of videos to download (useful for testing)
  - If not specified, downloads all videos
- `--cookies`: Path to a cookies file (`.txt`) for yt-dlp
  - Required for some restricted/private videos
- `--replace-existing`: Replace videos that have already been downloaded
- `--silence-errors`: Silence yt-dlp errors and warnings
- `--sleep-interval`: Number of seconds to sleep before each download (passed to yt-dlp)
  - Default: `5`
- `--max-sleep-interval`: Maximum number of seconds to sleep before each download (passed to yt-dlp)
  - Default: `10`
Input Formats
Text files: One video ID or URL per line. Empty lines and lines starting with # are ignored.
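For example, a hypothetical `videos.txt` mixing IDs and URLs might look like:

```text
# Comment lines and blank lines are ignored
dQw4w9WgXcQ
https://www.youtube.com/watch?v=9bZkp7q19f0

jNQXAC9IVRw
```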
CSV files: At least one of the following must be provided:
- Video ID column (default: `video_id`): IDs are used to construct URLs
- Video URL column: IDs are extracted from URLs
- Both columns: URLs are preferred (IDs extracted from URLs)
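An illustrative CSV with both columns (the column names here match the defaults; yours may differ if you pass `--id-column`/`--url-column`):

```text
video_id,video_url
dQw4w9WgXcQ,https://www.youtube.com/watch?v=dQw4w9WgXcQ
jNQXAC9IVRw,https://www.youtube.com/watch?v=jNQXAC9IVRw
```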
HuggingFace datasets: Specify the dataset name and split. At least one of the following must be provided:
- Video ID column (default: `video_id`): IDs are used to construct URLs
- Video URL column: IDs are extracted from URLs
- Both columns: URLs are preferred (IDs extracted from URLs)
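The "IDs are extracted from URLs" behavior can be sketched with a small standard-library helper. This is an illustrative stand-in, not clipyard's actual implementation, and `extract_youtube_id` is a hypothetical name:

```python
from urllib.parse import urlparse, parse_qs

def extract_youtube_id(url: str) -> str:
    """Pull the video ID out of common YouTube URL shapes (illustrative sketch)."""
    parsed = urlparse(url)
    # Short links: https://youtu.be/<id>
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    # Standard watch links: https://www.youtube.com/watch?v=<id>
    if parsed.path == "/watch":
        return parse_qs(parsed.query)["v"][0]
    # Path-style links such as /shorts/<id> or /embed/<id>
    return parsed.path.rstrip("/").rsplit("/", 1)[-1]

print(extract_youtube_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
print(extract_youtube_id("https://youtu.be/jNQXAC9IVRw"))                 # jNQXAC9IVRw
```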
Python API
You can also use clipyard programmatically:
```python
from pathlib import Path

from clipyard import (
    parse_txt_file,
    download_videos,
    DownloadConfig,
    save_summary_csv,
)

# Parse input
sources = parse_txt_file(Path("videos.txt"), platform="youtube")

# Configure download
config = DownloadConfig(
    output_dir=Path("./downloads"),
    resolution=1080,
    workers=4,
    threads=2,
)

# Download
results = download_videos(sources, config)

# Save summary
save_summary_csv(results, Path("summary.csv"))
```
Context For LLMs
Use development.md for developer documentation, tailored for LLM use.