Sample YouTube channels and retrieve their historical Wayback Machine metadata
TubeCensus
A Python library for sampling YouTube channels and retrieving their historical Wayback Machine metadata.
Installation
- Requirements: ~20GB of storage.
  - Data is stored in ~/.tubecensus by default, but this can be overridden with TubeCensus(data_dir=...) or the TUBECENSUS_DIR environment variable.
- pip install tubecensus
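To illustrate how the data directory options above could interact, here is a minimal sketch of a resolution helper. It does not use the library itself: `resolve_data_dir` and `free_gb` are hypothetical names, and the precedence shown (explicit argument, then `TUBECENSUS_DIR`, then `~/.tubecensus`) is an assumption, since the order is not stated above.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_DIR = Path.home() / ".tubecensus"


def resolve_data_dir(data_dir: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper. Assumed precedence: argument > env var > default."""
    env = os.environ if env is None else env
    if data_dir:
        return Path(data_dir).expanduser()
    from_env = env.get("TUBECENSUS_DIR")
    if from_env:
        return Path(from_env).expanduser()
    return DEFAULT_DIR


def free_gb(path: Path) -> float:
    """Free space (in GB) on the filesystem that would hold `path`."""
    probe = path
    # Walk up to the nearest existing ancestor, since the directory may not exist yet.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


if __name__ == "__main__":
    target = resolve_data_dir()
    print(f"data dir: {target}, free space: {free_gb(target):.1f} GB (about 20 GB needed)")
```

Checking free space before the first download can avoid a partially written dataset on a small disk.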
Features
sample(n, by={"usernames","ids","customs", "handles"})
- Sample YouTube channels from the URLs collected from the Wayback Machine indices.
- The current version includes unique URLs up to 2023, covering the four YouTube channel URL formats:
  - Username (/profile?user=, /user/): 34.8M channels
  - ID (/channel/UC): 106M channels
  - Custom Page (/c/): 5.9M channels
  - Handle (/@): 25.4M channels
- See our paper for more discussion.
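The four URL formats listed above can be told apart from the URL path alone. The following is a rough sketch of such a classifier; `classify_channel_url` is a hypothetical helper written for illustration, not a function of this library, and it only handles the patterns named above.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def classify_channel_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (format, identifier) for a YouTube channel URL, or None.

    The format labels reuse the `by` values: usernames, ids, customs, handles.
    """
    # Allow scheme-less inputs such as "youtube.com/user/foo".
    parsed = urlparse(url if "://" in url else "https://" + url)
    path = parsed.path

    # Legacy username form: /profile?user=<name>
    if path.rstrip("/") == "/profile":
        user = parse_qs(parsed.query).get("user")
        return ("usernames", user[0]) if user else None

    parts = [p for p in path.split("/") if p]
    if not parts:
        return None
    if parts[0] == "user" and len(parts) > 1:
        return ("usernames", parts[1])
    if parts[0] == "channel" and len(parts) > 1 and parts[1].startswith("UC"):
        return ("ids", parts[1])
    if parts[0] == "c" and len(parts) > 1:
        return ("customs", parts[1])
    if parts[0].startswith("@") and len(parts[0]) > 1:
        return ("handles", parts[0][1:])
    return None


if __name__ == "__main__":
    for u in ["youtube.com/user/foo/videos", "https://www.youtube.com/@handle"]:
        print(u, classify_channel_url(u))
```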
sample_until(n, by, condition)
- Construct a conditional sample by repeatedly drawing channels and keeping those for which the condition function returns true, until n channels have been kept.
- Can be used with the YouTube API / InnerTube to construct samples conditioned on API metadata (e.g. country, join date, channel topic), or alternatively on our metadata (subscribers at a given timestamp).
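The draw-and-keep loop described above is a form of rejection sampling. Here is a minimal, library-independent sketch of the idea. `sample_until_sketch`, its `pool` argument, the `max_draws` cap, and the skipping of repeated draws are all assumptions made for illustration; the library's own behavior may differ.

```python
import random
from typing import Callable, List, Optional, Sequence


def sample_until_sketch(pool: Sequence[str],
                        n: int,
                        condition: Callable[[str], bool],
                        rng: Optional[random.Random] = None,
                        max_draws: int = 100_000) -> List[str]:
    """Draw uniformly from `pool`, keeping candidates that satisfy `condition`.

    Stops once `n` candidates are kept or `max_draws` draws have been made.
    """
    rng = rng or random.Random()
    kept: List[str] = []
    seen = set()
    for _ in range(max_draws):
        if len(kept) >= n:
            break
        candidate = rng.choice(pool)
        if candidate in seen:
            continue  # assumption: each channel is evaluated at most once
        seen.add(candidate)
        if condition(candidate):
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    channels = ["a1", "b2", "a3", "b4", "a5"]
    # Stand-in condition; in practice this might call an API for channel metadata.
    print(sample_until_sketch(channels, 2, lambda c: c.startswith("a"),
                              rng=random.Random(0)))
```

Because every draw is uniform, the kept channels form a uniform sample of the channels that satisfy the condition. A cap such as `max_draws` keeps the loop from running indefinitely when the condition is rarely met.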
fetch(channels, by, from_ts, to_ts, closest)
- Retrieve the subscriber counts for a given timestamp using the Wayback Machine.
- Requires either a timestamp range, (from_ts, to_ts), or closest.
- Returns the output as a pandas DataFrame, including additional channel identifier metadata extracted from the page (username / id fields).
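To make the "range or closest" requirement concrete, the sketch below validates the two modes and builds a query URL for two public Wayback Machine endpoints: the CDX server for a timestamp range and the availability API for the closest snapshot. Whether this library uses these endpoints is an assumption, as is the Wayback-style `YYYYMMDD` timestamp format; `build_wayback_query` is a hypothetical helper and makes no network request.

```python
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_wayback_query(page_url: str,
                        from_ts: Optional[str] = None,
                        to_ts: Optional[str] = None,
                        closest: Optional[str] = None) -> str:
    """Build a Wayback Machine query URL for exactly one of the two modes."""
    has_range = from_ts is not None and to_ts is not None
    has_closest = closest is not None
    if has_range == has_closest:
        raise ValueError("specify exactly one of (from_ts, to_ts) or closest")
    if has_range:
        # List every capture of the page within the range.
        params = {"url": page_url, "from": from_ts, "to": to_ts, "output": "json"}
        return f"{CDX_ENDPOINT}?{urlencode(params)}"
    # Ask for the single snapshot nearest to the requested timestamp.
    params = {"url": page_url, "timestamp": closest}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_wayback_query("youtube.com/user/foo",
                              from_ts="20150101", to_ts="20151231"))
    print(build_wayback_query("youtube.com/user/foo", closest="20150601"))
```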
Citation
@article{tubecensus,
title={TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time},
volume={20},
number={1},
journal={Proceedings of the International AAAI Conference on Web and Social Media},
author={Eggleston, Chloe and Handler, Abram and Pacheco, Maria Leonor},
year={2026},
month={May},
}
TO-DOs
- Early channel IDs via CDN URLs
- Before the standardization of the YouTube channel ID (c. 2012), channel IDs occasionally appeared in the URLs of custom channel page content (such as profile pictures and custom CSS). These URLs can be used to map additional usernames to channel IDs.
- Scrape channel hubs / related channels
- Subscriber counts for additional channels are sometimes accessible in the related channels tab. When paired with identifiers extracted from profile pictures or subscribe button HTML attributes, they can add roughly 10 or more subscriber counts per page scrape.
- Caching
- We redistribute the data collected in our paper as a part of our dataset, which is downloaded with this library. We plan to integrate these into the library such that URLs in the cache are not re-scraped.