Sample YouTube channels and retrieve their historical Wayback Machine metadata

TubeCensus

A Python library for sampling YouTube channels and retrieving their historical Wayback Machine metadata.

Installation

  • Requirements: ~20GB of storage.

    • Data is stored in ~/.tubecensus by default. This can be overridden with TubeCensus(data_dir=...) or the TUBECENSUS_DIR environment variable.
  • pip install tubecensus
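
The data-directory options above can be sketched as a small resolution helper. This is only an illustration, not the library's code: `resolve_data_dir` and `free_gb` are hypothetical names, and the precedence shown (explicit argument, then TUBECENSUS_DIR, then ~/.tubecensus) is an assumption, since the README does not say which override wins.

```python
import os
import shutil
from pathlib import Path
from typing import Optional

DEFAULT_DIR = "~/.tubecensus"  # default location stated in the README


def resolve_data_dir(data_dir: Optional[str] = None) -> Path:
    """Hypothetical helper. Assumed precedence: argument, env var, default."""
    if data_dir is not None:
        return Path(os.path.expanduser(data_dir))
    env_dir = os.environ.get("TUBECENSUS_DIR")
    if env_dir:
        return Path(os.path.expanduser(env_dir))
    return Path(os.path.expanduser(DEFAULT_DIR))


def free_gb(path: Path) -> float:
    """Free space (in GB) on the volume that would hold `path`."""
    probe = path
    # Walk up to the nearest existing directory, since the target may not exist yet.
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


if __name__ == "__main__":
    target = resolve_data_dir()
    print(f"data dir: {target}, free space: {free_gb(target):.1f} GB")
```

Checking `free_gb(...)` against the ~20GB requirement before the first download may avoid a partially written dataset.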

Features

sample(n, by={"usernames","ids","customs", "handles"})

  • Sample YouTube channels from the URLs collected from the Wayback Machine indices.
  • The current version includes unique URLs up to 2023. They cover the four YouTube channel URL formats:
    1. Username (/profile?user=, /user/): 34.8M channels
    2. ID (/channel/UC): 106M channels
    3. Custom Page (/c/): 5.9M channels
    4. Handle (/@): 25.4M channels
  • See our paper for more discussion.
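
To make the four URL formats concrete, here is a minimal sketch that maps a channel URL to one of the `by` categories listed above. The function name `classify_channel_url` and the regular expressions are illustrative assumptions based only on the URL prefixes in this list. They are not taken from the library.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Path-based formats, keyed by the `by` names used in sample().
_PATH_PATTERNS = [
    ("ids", re.compile(r"^/channel/(UC[^/]+)")),
    ("usernames", re.compile(r"^/user/([^/]+)")),
    ("customs", re.compile(r"^/c/([^/]+)")),
    ("handles", re.compile(r"^/@([^/]+)")),
]


def classify_channel_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (format, identifier) for a channel URL, or None if unrecognized."""
    parsed = urlparse(url)
    # Legacy username format: /profile?user=NAME
    if parsed.path.rstrip("/") == "/profile":
        user = parse_qs(parsed.query).get("user")
        return ("usernames", user[0]) if user else None
    for kind, pattern in _PATH_PATTERNS:
        match = pattern.match(parsed.path)
        if match:
            return kind, match.group(1)
    return None


if __name__ == "__main__":
    for example in (
        "https://www.youtube.com/profile?user=example",
        "https://www.youtube.com/channel/UCexample123",
        "https://www.youtube.com/c/Example",
        "https://www.youtube.com/@example",
    ):
        print(example, "->", classify_channel_url(example))
```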

sample_until(n, by, condition)

  • Construct a conditional sample by repeatedly drawing channels and keeping those for which the condition function returns true.
  • Can be combined with the YouTube API / Innertube to construct samples conditioned on API metadata (e.g. country, join date, channel topic), or with our own metadata (subscriber count at a given timestamp).
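
The draw-and-keep loop described above is a form of rejection sampling. The sketch below shows the idea on an in-memory list. It is not the library's implementation: `sample_until_sketch`, the `pool` argument, and the `max_draws` safety cap are all hypothetical, and a real condition would typically call the YouTube API instead of inspecting a string.

```python
import random
from typing import Callable, Hashable, List, Optional, Sequence, TypeVar

T = TypeVar("T", bound=Hashable)


def sample_until_sketch(
    pool: Sequence[T],
    n: int,
    condition: Callable[[T], bool],
    max_draws: int = 100_000,
    seed: Optional[int] = None,
) -> List[T]:
    """Draw from `pool` until `n` distinct items satisfy `condition`.

    Stops early after `max_draws` draws so a rare condition cannot loop forever.
    """
    rng = random.Random(seed)
    kept: List[T] = []
    seen = set()
    for _ in range(max_draws):
        if len(kept) >= n:
            break
        candidate = rng.choice(pool)
        if candidate in seen:
            continue  # already evaluated, do not count it twice
        seen.add(candidate)
        if condition(candidate):
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    # Stand-in pool of fake channel IDs; the condition is a toy predicate.
    fake_pool = [f"UC{i:04d}" for i in range(1000)]
    picked = sample_until_sketch(fake_pool, 5, lambda c: c.endswith("0"), seed=1)
    print(picked)
```

Because rejected draws still cost one condition call each, a selective condition backed by an API can consume far more quota than `n` requests.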

fetch(channels, by, from_ts, to_ts, closest)

  • Retrieve subscriber counts at a given point in time using the Wayback Machine.
  • Requires either a timestamp range (from_ts, to_ts) or a single target timestamp (closest).
  • Returns a pandas DataFrame that includes additional channel identifier metadata extracted from the page (username / id fields).
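
To illustrate the two lookup modes (a timestamp range versus a closest capture), the sketch below builds the corresponding public Wayback Machine query URLs without making any network request. This is an assumption about how such lookups can be expressed, not a description of what fetch does internally. `build_wayback_query` is a hypothetical helper, and timestamps are assumed to use the Wayback YYYYMMDDhhmmss style (shorter prefixes such as YYYYMMDD are accepted by these endpoints).

```python
from typing import Optional
from urllib.parse import urlencode

# Public Internet Archive endpoints (CDX capture index and availability API).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_wayback_query(
    channel_url: str,
    from_ts: Optional[str] = None,
    to_ts: Optional[str] = None,
    closest: Optional[str] = None,
) -> str:
    """Return a query URL for either a capture range or the closest capture."""
    has_range = from_ts is not None or to_ts is not None
    if has_range and closest is not None:
        raise ValueError("pass either (from_ts, to_ts) or closest, not both")
    if has_range:
        if from_ts is None or to_ts is None:
            raise ValueError("a range needs both from_ts and to_ts")
        # List every capture of the page between the two timestamps.
        params = {"url": channel_url, "from": from_ts, "to": to_ts, "output": "json"}
        return f"{CDX_ENDPOINT}?{urlencode(params)}"
    if closest is None:
        raise ValueError("pass either (from_ts, to_ts) or closest")
    # Ask for the single capture nearest to the requested timestamp.
    params = {"url": channel_url, "timestamp": closest}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_wayback_query("youtube.com/@example", from_ts="20200101", to_ts="20201231"))
    print(build_wayback_query("youtube.com/@example", closest="20150601"))
```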

Citation

@article{tubecensus, 
    title={TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time}, 
    volume={20}, 
    number={1}, 
    journal={Proceedings of the International AAAI Conference on Web and Social Media}, 
    author={Eggleston, Chloe and Handler, Abram and Pacheco, Maria Leonor}, 
    year={2026}, 
    month={May}, 
}

TO-DOs

  • Early channel IDs via CDN URLs
    • Before the YouTube channel ID was standardized (c. 2012), channel IDs occasionally appeared in the URLs of custom channel page content (such as profile pictures and custom CSS). These URLs can be used to map additional usernames to channel IDs.
  • Scrape channel hubs / related channels
    • Subscriber counts for additional channels are sometimes accessible in the related channels tab. When paired with identifiers extracted from profile pictures or subscriber button HTML attributes, they can add upwards of 10 subscriber counts per page scrape.
  • Caching
    • We redistribute the data collected for our paper as part of our dataset, which is downloaded with this library. We plan to integrate this data into the library so that URLs already in the cache are not re-scraped.
