Crawl and download videos from the CDVL (Consumer Digital Video Library) research repository
CDVL Crawler
Python tools for crawling and downloading videos from the CDVL (Consumer Digital Video Library) research video repository.
Contents:
- What This Does
- Requirements
- Installation
- Usage
- Output Format
- Configuration File (config.json)
- API
- Privacy & Ethics
- License
What This Does
This package provides a unified command-line tool for working with CDVL, cdvl-crawler, with two subcommands:
- crawl - Crawls and extracts metadata from all videos and datasets on CDVL
- download - Downloads individual videos by their ID
Features:
- Automatic Login: Handles authentication automatically with username/password
- Parallel Crawling: Concurrent requests for efficient data collection
- Smart Enumeration: Automatically discovers videos and datasets by ID
- Auto-Resume: Continues from the last crawled ID if interrupted
- Structured Output: JSONL format for easy data processing
- Progress Tracking: Real-time progress bars showing success/empty/failed counts
- Download Management: Handles large file downloads with progress indicators
Requirements
- Python 3.9 or higher
- An active CDVL account with username and password
Installation
Run with uvx (no separate installation step needed):
uvx cdvl-crawler --help
uvx cdvl-crawler crawl --help
uvx cdvl-crawler download --help
Or install with pipx:
pipx install cdvl-crawler
Or, with pip:
pip3 install --user cdvl-crawler
The examples below assume you are using uvx. If you installed with pipx or pip instead, run cdvl-crawler directly, without the uvx prefix.
Usage
Before using the tool, you need to provide your CDVL credentials. The tool supports three methods for providing credentials (in order of priority):
- Config file: Create a config.json file in your working directory (automatically detected), or specify one with --config, that contains username and password
- Environment variables: Set CDVL_USERNAME and CDVL_PASSWORD
- Interactive prompt: The tool will ask for credentials if not found via other methods
Note: If a config.json file exists in your current directory, it will be automatically loaded. You don't need to specify --config unless you want to use a different file.
Choose the method that best suits your workflow. For example:
# Using environment variables (no config file needed)
export CDVL_USERNAME="your.email@example.com"
export CDVL_PASSWORD="your_password"
uvx cdvl-crawler crawl
# Using config.json (automatically detected if in current directory)
# Just create config.json:
# {
# "username": "your.email@example.com",
# "password": "your_password_here"
# }
# and run:
uvx cdvl-crawler crawl
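The priority order described above (config file, then environment variables, then prompt) can be sketched in Python as follows. This is only an illustration of the lookup order, not the tool's actual implementation; the function name resolve_credentials and its parameters are hypothetical.

```python
import json
import os
from getpass import getpass
from pathlib import Path
from typing import Callable, Mapping, Optional, Tuple


def resolve_credentials(
    config_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    prompt: Optional[Callable[[str], str]] = None,
) -> Tuple[str, str]:
    """Return (username, password): config file first, then env vars, then a prompt.

    Hypothetical helper; it only mirrors the priority order described in the text.
    """
    env = os.environ if env is None else env

    # 1. Config file: an explicit path, or config.json in the working directory.
    path = Path(config_path) if config_path else Path("config.json")
    if path.is_file():
        data = json.loads(path.read_text(encoding="utf-8"))
        if data.get("username") and data.get("password"):
            return data["username"], data["password"]

    # 2. Environment variables.
    username = env.get("CDVL_USERNAME")
    password = env.get("CDVL_PASSWORD")
    if username and password:
        return username, password

    # 3. Interactive prompt as the last resort.
    if prompt is None:
        return input("CDVL username: "), getpass("CDVL password: ")
    return prompt("CDVL username: "), prompt("CDVL password: ")
```

Passing env and prompt as arguments keeps the sketch easy to exercise without touching the real environment or terminal.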
Crawling Metadata
To crawl all videos and datasets:
# Basic usage (outputs to current directory)
uvx cdvl-crawler crawl
# Save to specific directory
uvx cdvl-crawler crawl --output-dir ./data
Available options:
- --output-dir DIR - Directory to save output files (default: current directory)
- --start-video-id N - Starting video ID for crawling (default: 1 or resume from last)
- --start-dataset-id N - Starting dataset ID for crawling (default: 1 or resume from last)
- --max-concurrent N - Maximum number of concurrent requests (default: 5)
- --max-failures N - Stop after N consecutive empty/failed responses (default: 10)
- --delay SECONDS - Delay between request batches in seconds (default: 0.1)
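To illustrate how the concurrency, failure, and delay options interact, here is a minimal sketch of ID enumeration in batches that stops after a run of consecutive empty or failed responses. It is an assumption about the general technique, not the crawler's real code; enumerate_ids and the fetch callable are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical fetcher type: returns a record dict, or None for an empty page.
Fetch = Callable[[int], Awaitable[Optional[dict]]]


async def enumerate_ids(
    fetch: Fetch,
    start_id: int = 1,
    max_concurrent: int = 5,
    max_failures: int = 10,
    delay: float = 0.1,
) -> List[dict]:
    """Fetch IDs in batches until max_failures consecutive IDs are empty or fail."""
    records: List[dict] = []
    consecutive_misses = 0
    next_id = start_id

    while consecutive_misses < max_failures:
        batch = range(next_id, next_id + max_concurrent)
        # gather() preserves input order, so results line up with ascending IDs.
        results = await asyncio.gather(
            *(fetch(i) for i in batch), return_exceptions=True
        )
        for result in results:
            if isinstance(result, dict):
                records.append(result)
                consecutive_misses = 0  # any success resets the counter
            else:
                consecutive_misses += 1  # None or an exception counts as a miss
        next_id += max_concurrent
        await asyncio.sleep(delay)  # pause between batches

    return records
```

In this sketch the stop condition is only checked between batches, so up to one extra batch of IDs may be requested past the threshold.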
The crawler will automatically:
- Log in with your credentials (from config, env vars, or prompt)
- Crawl videos and datasets in parallel
- Save metadata to videos.jsonl and datasets.jsonl in the output directory
- Resume from the last ID if run again
Example output:
2025-10-09 15:30:03 - INFO - ✓ Login successful!
2025-10-09 15:30:03 - INFO - Starting crawlers in parallel...
Videos: 12543 | Success: 8432 | Empty: 3891 | Failed: 220
Datasets: 142 | Success: 98 | Empty: 34 | Failed: 10
To start fresh, delete the output files before running:
rm ./data/videos.jsonl ./data/datasets.jsonl
uvx cdvl-crawler crawl --output-dir ./data
To resume, just run it again - it will automatically continue from where it left off.
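One plausible way to derive a resume point from an existing output file is to scan it for the highest recorded ID. The sketch below assumes that approach; the tool may track its position differently (for example, it may also account for empty IDs), and next_start_id is a hypothetical helper.

```python
import json
from pathlib import Path


def next_start_id(jsonl_path: str, default: int = 1) -> int:
    """Return one past the highest "id" in a JSONL file, or default if none is found."""
    path = Path(jsonl_path)
    if not path.is_file():
        return default

    highest = None
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written line after an interruption
            record_id = record.get("id") if isinstance(record, dict) else None
            if isinstance(record_id, int) and (highest is None or record_id > highest):
                highest = record_id

    return default if highest is None else highest + 1
```

The result could then be passed to --start-video-id or --start-dataset-id if you want to control the starting point yourself.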
Downloading Videos
Download videos by their ID:
# Download a single video (to current directory)
uvx cdvl-crawler download 42
# Download to specific directory
uvx cdvl-crawler download 42 --output-dir ./downloads
# Download multiple videos (comma-separated)
uvx cdvl-crawler download 1,5,10,20 --output-dir ./videos
# Get download URL without downloading
uvx cdvl-crawler download 42 --dry-run
# Download to specific filename (single video only)
uvx cdvl-crawler download 42 --output my_video.avi
For more options:
uvx cdvl-crawler download --help
Output Format
Output files use JSON Lines format (one JSON object per line).
Video Records
{
"id": 5,
"url": "https://www.cdvl.org/members-section/view-file/?videoid=5",
"title": "Introduction to Video Quality",
"text": "Full text content...",
"paragraphs": ["Paragraph 1...", "Paragraph 2..."],
"links": [{"text": "Download", "href": "/path/to/file"}],
"media": [{"type": "video", "src": "/path/to/video.mp4"}],
"html": "<div>Raw HTML...</div>",
"extracted_at": "2025-10-09T15:30:00+00:00",
"content_type": "video"
}
Dataset Records
{
"id": 7,
"url": "https://www.cdvl.org/members-section/search?dataset=7",
"title": "Mobile Quality Dataset",
"text": "Full text content...",
"paragraphs": ["Description..."],
"links": [{"text": "Download", "href": "/download/dataset7.zip"}],
"tables_count": 2,
"html": "<div>Raw HTML...</div>",
"extracted_at": "2025-10-09T15:30:00+00:00",
"content_type": "dataset"
}
Here are some processing examples using jq.
Count records:
wc -l videos.jsonl datasets.jsonl
View first record:
head -n 1 videos.jsonl | jq .
Extract all titles:
jq -r '.title' videos.jsonl
Filter by keyword:
jq 'select(.text | contains("codec"))' videos.jsonl
Convert to CSV:
jq -r '[.id, .title, .url] | @csv' videos.jsonl > videos.csv
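If jq is not available, the same kind of processing can be done with Python's standard library. The following sketch reads a JSONL file, filters records by a keyword in the text field, and writes id, title, and url to CSV. The function names are illustrative, and unlike the jq command above, this version writes a header row.

```python
import csv
import json
from typing import Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def filter_by_keyword(records: Iterable[dict], keyword: str) -> Iterator[dict]:
    """Keep records whose "text" field contains keyword (case-sensitive)."""
    for record in records:
        if keyword in (record.get("text") or ""):
            yield record


def write_csv(records: Iterable[dict], path: str) -> int:
    """Write id, title, url columns to a CSV file and return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "title", "url"])  # header row
        for record in records:
            writer.writerow([record.get("id"), record.get("title"), record.get("url")])
            count += 1
    return count
```

For example, write_csv(filter_by_keyword(read_jsonl("videos.jsonl"), "codec"), "codec_videos.csv") would export only the matching records.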
Configuration File (config.json)
Configuration is optional. The tool has sensible defaults built-in, and you can use:
- Environment variables for authentication (see Usage above)
- Command-line options for crawling parameters (see --help)
- Interactive prompts if credentials are not found
Auto-detection: If a file named config.json exists in your current directory, it will be automatically loaded. You can override this with --config path/to/other.json.
If you want to customize settings permanently or override defaults, create a config.json file:
- Download config.example.json from the repository
- Rename it to config.json
- Edit config.json with your settings:
{
"username": "your.email@example.com",
"password": "your_password_here",
"endpoints": {
"video_base_url": "https://www.cdvl.org/members-section/view-file/",
"dataset_base_url": "https://www.cdvl.org/members-section/search"
},
"output": {
"videos_file": "videos.jsonl",
"datasets_file": "datasets.jsonl"
},
"start_video_id": 1,
"start_dataset_id": 1,
"max_concurrent_requests": 5,
"max_consecutive_failures": 10,
"request_delay": 0.1
}
Configuration Options
All settings are optional with sensible defaults. CLI options override config file values.
| Setting | Default | CLI Option | Description |
|---|---|---|---|
| username | (from env CDVL_USERNAME or prompt) | - | Your CDVL account email |
| password | (from env CDVL_PASSWORD or prompt) | - | Your CDVL account password |
| start_video_id | 1 | --start-video-id | Starting video ID for crawling |
| start_dataset_id | 1 | --start-dataset-id | Starting dataset ID for crawling |
| max_concurrent_requests | 5 | --max-concurrent | Number of parallel requests |
| max_consecutive_failures | 10 | --max-failures | Stop after N consecutive empty responses |
| request_delay | 0.1 | --delay | Delay between request batches (seconds) |
| videos_file | videos.jsonl | - | Output filename for video metadata |
| datasets_file | datasets.jsonl | - | Output filename for dataset metadata |
| endpoints.video_base_url | cdvl.org members section | - | Base URL for video pages |
| endpoints.dataset_base_url | cdvl.org members section | - | Base URL for dataset pages |
| headers | Browser-like headers | - | HTTP headers (User-Agent, Accept, etc.) |
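The precedence rule stated above (CLI options override config file values, which override built-in defaults) can be sketched as a simple layered merge. The default values are taken from the table; the function effective_settings and the assumption that unset CLI options arrive as None are illustrative, not part of the package.

```python
from typing import Any, Dict, Mapping, Optional

# Defaults as listed in the configuration table.
DEFAULTS: Dict[str, Any] = {
    "start_video_id": 1,
    "start_dataset_id": 1,
    "max_concurrent_requests": 5,
    "max_consecutive_failures": 10,
    "request_delay": 0.1,
}


def effective_settings(
    file_config: Optional[Mapping[str, Any]] = None,
    cli_options: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge defaults, then config file values, then CLI options (highest priority)."""
    settings = dict(DEFAULTS)
    # Config file values replace defaults for known keys.
    settings.update({k: v for k, v in (file_config or {}).items() if k in DEFAULTS})
    # CLI options win, but only when actually provided (assumed None when unset).
    settings.update(
        {k: v for k, v in (cli_options or {}).items() if k in DEFAULTS and v is not None}
    )
    return settings
```

For instance, a config file that sets request_delay to 0.5 combined with a CLI value of 1.0 would yield 1.0, while a CLI value left unset would leave 0.5 in place.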
API
You can also use the package programmatically:
import asyncio
from cdvl_crawler import CDVLCrawler, CDVLDownloader

# Crawl videos and datasets
async def crawl():
    # Config file is optional - will use env vars or prompt
    crawler = CDVLCrawler(config_path=None, output_dir="./data")
    await crawler.crawl()

# Download a specific video
async def download():
    # Config file is optional - will use env vars or prompt
    downloader = CDVLDownloader(config_path=None, output_dir="./downloads")
    await downloader._init_session()
    await downloader._login()
    url = await downloader.get_download_link(42)
    if url:
        await downloader.download_file(url, "output.avi")
    await downloader._close_session()

# Run
asyncio.run(crawl())
asyncio.run(download())
Privacy & Ethics
- Rate Limiting: Use reasonable delays to avoid server strain (default: 0.1s between batches)
- Credentials: Keep your config.json secure and never share credentials
- Usage Policies: Respect CDVL's terms of service and usage policies
- Personal Use: Only use your own account credentials
License
The MIT License (MIT)
Copyright (c) 2025 Werner Robitza
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.