Skip to main content

Crawl and download videos from the CDVL (Consumer Digital Video Library) research repository

Project description

CDVL Crawler

PyPI version

Python tools for crawling and downloading videos from the CDVL (Consumer Digital Video Library) research video repository.

Contents:

Disclaimer

[!CAUTION]

This software is provided as-is as a general-purpose tool for interacting with the CDVL web content. The author provides this tool solely as software code and has no control over how it is used, no relationship with CDVL, and no obligation to enforce any third-party terms of service or licenses. You, the user, are solely and independently responsible for:

  • Obtaining proper authorization to access CDVL
  • Complying with all applicable terms of service, license agreements, and usage policies
  • Understanding and accepting the CDVL Database Content User License Agreement
  • Your own actions and use of any content accessed through this tool
  • Any legal consequences arising from your use of this tool

While this tool displays the CDVL license agreement for user convenience, this does not constitute legal advice, license enforcement, or any guarantee of compliance. Displaying the license is informational only and does not create any legal relationship between the author and CDVL or the user. Under no circumstances shall the author be held liable for any direct, indirect, incidental, special, consequential, or exemplary damages arising from your use or misuse of this tool. This includes, but is not limited to, any violation of terms of service, breach of license agreements, unauthorized access, or any other legal issues.

Do not download more content than you would reasonably access manually through a web browser. Only use your own account credentials – never use credentials that do not belong to you. Respect the intellectual property rights and usage restrictions of content providers. Specifically, have a look at the License page before use.

By using this tool, you acknowledge that you have read, understood, and agree to this disclaimer. If you do not agree, do not use this software.

What This Does

This package provides a unified command-line tool for working with CDVL:

cdvl-crawler with three subcommands:

  • crawl - Crawls and extracts metadata from all videos and datasets on CDVL
  • download - Downloads individual videos by their ID
  • generate-site - Generates a searchable, interactive HTML site from crawled metadata

Features:

  • Automatic Login: Handles authentication automatically with username/password
  • Parallel Crawling: Concurrent requests for efficient data collection
  • Smart Enumeration: Automatically discovers videos and datasets by ID
  • Auto-Resume: Continues from the last crawled ID if interrupted
  • Structured Output: JSONL format for easy data processing
  • Progress Tracking: Real-time progress bars showing success/empty/failed counts
  • Download Management: Handles large file downloads with progress indicators
  • Static Site Generator: Creates a beautiful, searchable HTML interface for browsing videos

Requirements

  • Python 3.9 or higher
  • An active CDVL account with username and password

Installation

Install with uvx:

uvx cdvl-crawler --help
uvx cdvl-crawler crawl --help
uvx cdvl-crawler download --help
uvx cdvl-crawler generate-site --help

Or install with pipx:

pipx install cdvl-crawler

Or, with pip:

pip3 install --user cdvl-crawler

We assume you will be using uvx, otherwise just run cdvl-crawler directly without uvx after installing from pipx or pip.

Usage

Before using the tool, you need to provide your CDVL credentials. The tool supports three methods for providing credentials (in order of priority):

  1. Config file: Create a config.json file in your working directory (automatically detected) or specify with --config that contains username and password
  2. Environment variables: Set CDVL_USERNAME and CDVL_PASSWORD
  3. Interactive prompt: The tool will ask for credentials if not found via other methods

Note: If a config.json file exists in your current directory, it will be automatically loaded. You don't need to specify --config unless you want to use a different file.

Choose the method that best suits your workflow. For example:

# Using environment variables (no config file needed)
export CDVL_USERNAME="your.email@example.com"
export CDVL_PASSWORD="your_password"
uvx cdvl-crawler crawl

# Using config.json (automatically detected if in current directory)
# Just create config.json:
# {
#   "username": "your.email@example.com",
#   "password": "your_password_here"
# }

# and run:
uvx cdvl-crawler crawl

Crawling Metadata

To crawl all videos and datasets:

# Basic usage (outputs to current directory)
uvx cdvl-crawler crawl

# Save to specific directory
uvx cdvl-crawler crawl --output-dir ./data

For more options, run:

uvx cdvl-crawler crawl --help

The crawler will automatically:

  1. Log in with your credentials (from config, env vars, or prompt)
  2. Crawl videos and datasets in parallel
  3. Save metadata to videos.jsonl and datasets.jsonl in the output directory
  4. Resume from the last ID if run again

Example output:

2025-10-09 15:30:03 - INFO - ✓ Login successful!
2025-10-09 15:30:03 - INFO - Starting crawlers in parallel...
Videos:   12543 | Success: 8432 | Empty: 3891 | Failed: 220
Datasets:   142 | Success:   98 | Empty:   34 | Failed:  10

To start fresh, delete the output files before running:

rm ./data/videos.jsonl ./data/datasets.jsonl
uvx cdvl-crawler crawl --output-dir ./data

To resume, just run it again - it will automatically continue from where it left off.

Downloading Videos

Download videos by their ID:

# Download a single video (to current directory)
uvx cdvl-crawler download 42

# Download to specific directory
uvx cdvl-crawler download 42 --output-dir ./downloads

# Download multiple videos (comma-separated)
uvx cdvl-crawler download 1,5,10,20 --output-dir ./videos

# Get download URL without downloading
uvx cdvl-crawler download 42 --dry-run

# Download to specific filename (single video only)
uvx cdvl-crawler download 42 --output my_video.avi

For more options:

uvx cdvl-crawler download --help

Generating Static Site

After crawling metadata, you can generate a beautiful, interactive HTML site to browse the video library:

# Generate site from videos.jsonl to index.html
uvx cdvl-crawler generate-site

# Specify custom input and output files
uvx cdvl-crawler generate-site -i ./data/videos.jsonl -o ./website/index.html

For more options:

uvx cdvl-crawler generate-site --help

Output Format

Output files use JSON Lines format (one JSON object per line).

Video Records

{
  "id": 5,
  "url": "https://www.cdvl.org/members-section/view-file/?videoid=5",
  "paragraphs": [
    "Description:Pan over a children's ball pit, seen from above using a crane.",
    "Dataset:NTIA Source Scenes",
    "Audio Specifications:16-bit stereo PCM. Talk in English with balls rustling.",
    "Video Specification:The camera was a professional HDTV camera..."
  ],
  "extracted_at": "2025-10-09T15:30:00+00:00",
  "content_type": "video",
  "title": "NTIA children's ball pit from above, part 1, 525-line",
  "links": [{"text": "NTIA T1.801.01", "href": "/members-section/search?dataset=3"}],
  "media": [{"type": "img", "src": "/uploads/thumbnails/thumb_1.jpg"}],
  "filename": "ntia_bpit1-525_original.avi",
  "file_size": "242.50 MB"
}

Required fields:

  • id: Video ID number
  • url: Source URL on CDVL
  • paragraphs: Structured list of text content from the page
  • extracted_at: Timestamp when data was extracted (ISO 8601 format)
  • content_type: Always "video" for videos

Optional fields:

  • title: Video title
  • links: Related links found on the page
  • media: Images and media elements
  • filename: Download filename (if available)
  • file_size: File size (if available)

Dataset Records

{
  "id": 7,
  "url": "https://www.cdvl.org/members-section/search?dataset=7",
  "paragraphs": ["Description of the dataset...", "Additional information..."],
  "extracted_at": "2025-10-09T15:30:00+00:00",
  "content_type": "dataset",
  "title": "Mobile Quality Dataset",
  "links": [{"text": "Video 123", "href": "/members-section/view-file/?videoid=123"}],
  "tables_count": 2
}

Required fields:

  • id: Dataset ID number
  • url: Source URL on CDVL
  • paragraphs: Structured list of text content from the page
  • extracted_at: Timestamp when data was extracted (ISO 8601 format)
  • content_type: Always "dataset" for datasets

Optional fields:

  • title: Dataset title
  • links: Related links found on the page
  • tables_count: Number of tables in the dataset page

Here are some processing examples using jq.

Count records:

wc -l videos.jsonl datasets.jsonl

View first record:

head -n 1 videos.jsonl | jq .

Extract all titles:

jq -r '.title' videos.jsonl

Filter by keyword in paragraphs:

jq 'select(.paragraphs | join(" ") | contains("codec"))' videos.jsonl

Convert to CSV:

jq -r '[.id, .title, .url] | @csv' videos.jsonl > videos.csv

Configuration File (config.json)

Configuration is optional. The tool has sensible defaults built-in, and you can use:

  • Environment variables for authentication (see Usage above)
  • Command-line options for crawling parameters (see --help)
  • Interactive prompts if credentials are not found

Auto-detection: If a file named config.json exists in your current directory, it will be automatically loaded. You can override this with --config path/to/other.json.

If you want to customize settings permanently or override defaults, create a config.json file:

  1. Download config.example.json from the repository
  2. Rename it to config.json
  3. Edit config.json with your settings:
{
  "username": "your.email@example.com",
  "password": "your_password_here",
  "endpoints": {
    "video_base_url": "https://www.cdvl.org/members-section/view-file/",
    "dataset_base_url": "https://www.cdvl.org/members-section/search"
  },
  "output": {
    "videos_file": "videos.jsonl",
    "datasets_file": "datasets.jsonl"
  },
  "start_video_id": 1,
  "start_dataset_id": 1,
  "max_concurrent_requests": 5,
  "max_consecutive_failures": 10,
  "request_delay": 0.1
}

Configuration Options

All settings are optional with sensible defaults. CLI options override config file values.

Setting Default CLI Option Description
username (from env CDVL_USERNAME or prompt) - Your CDVL account email
password (from env CDVL_PASSWORD or prompt) - Your CDVL account password
start_video_id 1 --start-video-id Starting video ID for crawling
start_dataset_id 1 --start-dataset-id Starting dataset ID for crawling
max_concurrent_requests 5 --max-concurrent Number of parallel requests
max_consecutive_failures 10 --max-failures Stop after N consecutive empty responses
request_delay 0.1 --delay Delay between request batches (seconds)
videos_file videos.jsonl - Output filename for video metadata
datasets_file datasets.jsonl - Output filename for dataset metadata
endpoints.video_base_url cdvl.org members section - Base URL for video pages
endpoints.dataset_base_url cdvl.org members section - Base URL for dataset pages
headers Browser-like headers - HTTP headers (User-Agent, Accept, etc.)

API

You can also use the package programmatically:

import asyncio
from cdvl_crawler import CDVLCrawler, CDVLDownloader, CDVLSiteGenerator

# Crawl videos and datasets
async def crawl():
    # Config file is optional - will use env vars or prompt
    crawler = CDVLCrawler(config_path=None, output_dir="./data")
    await crawler.crawl()

# Download a specific video
async def download():
    # Config file is optional - will use env vars or prompt
    downloader = CDVLDownloader(config_path=None, output_dir="./downloads")
    await downloader._init_session()
    await downloader._login()

    url = await downloader.get_download_link(42)
    if url:
        await downloader.download_file(url, "output.avi")

    await downloader._close_session()

# Generate static site
def generate_site():
    generator = CDVLSiteGenerator(
        input_file="./data/videos.jsonl",
        output_file="./website/index.html"
    )
    success = generator.generate()
    print(f"Site generated: {success}")

# Run
asyncio.run(crawl())
asyncio.run(download())
generate_site()

License

The MIT License (MIT)

Copyright (c) 2025 Werner Robitza

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cdvl_crawler-0.3.0.tar.gz (28.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cdvl_crawler-0.3.0-py3-none-any.whl (31.8 kB view details)

Uploaded Python 3

File details

Details for the file cdvl_crawler-0.3.0.tar.gz.

File metadata

  • Download URL: cdvl_crawler-0.3.0.tar.gz
  • Upload date:
  • Size: 28.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for cdvl_crawler-0.3.0.tar.gz
Algorithm Hash digest
SHA256 ceada8c48dbca387bd6ca276a99a4a4a9efd5454afe38f1b771da609ed05b393
MD5 66a365671853066985044d45776bcf47
BLAKE2b-256 1d1e22450ca7f1a6354c083c8365ba82735e5a8fc16700dd33575f5f4a7b2ecb

See more details on using hashes here.

File details

Details for the file cdvl_crawler-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: cdvl_crawler-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 31.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for cdvl_crawler-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0daa476edd9271475fbe55a96dc7ca1dc1f19bdfe353225a9bf671a72200a31c
MD5 1f372a1942683a695383db9605374202
BLAKE2b-256 39f2586da66201e670f1340ae6ca3114d604e1801cc1b4d6ad66a155d5ae2e75

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page