YTFetcher lets you fetch YouTube transcripts in bulk with metadata like titles, publish dates, and thumbnails. Great for ML, NLP, and dataset generation.
YTFetcher
⚡ Build structured YouTube datasets for NLP, ML, sentiment analysis & RAG in minutes.
A Python tool for quickly fetching thousands of videos from a YouTube channel, along with structured transcripts and additional metadata. Export data easily as CSV, TXT, or JSON.
📚 Table of Contents
- Installation
- Quick CLI Usage
- Basic Usage (Python API)
- Features
- Fetching Specific Channel Tabs (Videos / Shorts / Streams)
- Using Different Fetchers
- Retrieve Different Languages
- Filtering
- Converting ChannelData to Rows
- SQLite Cache
- Fetching Only Manually Created Transcripts
- Exporting
- Comments
- Other Methods
- Proxy Configuration
- Advanced HTTP Configuration (Optional)
- CLI (Advanced)
- Docker Quick Start
- Contributing
- Running Tests
- Related Projects
- License
- Contributors
Installation
Install from PyPI:
pip install ytfetcher
Quick CLI Usage
Fetch 50 video transcripts + metadata from a channel and save as JSON:
ytfetcher channel TheOffice -m 50 -f json
Basic Usage (Python API)
Here's how to get transcripts and metadata such as channel name, description, and published date from a single channel with the from_channel method:
from ytfetcher import YTFetcher
fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=2
)
channel_data = fetcher.fetch_youtube_data()
for video in channel_data:
    print(video.metadata.title)
    print(video.metadata.description)
    print(video.transcripts)
This will return a list of ChannelData with metadata in DLSnippet objects:
[
    ChannelData(
        video_id='video1',
        transcripts=[
            Transcript(
                text="Hey there",
                start=0.0,
                duration=1.54
            ),
            Transcript(
                text="Happy coding!",
                start=1.56,
                duration=4.46
            )
        ],
        metadata=DLSnippet(
            video_id='video1',
            title='VideoTitle',
            description='VideoDescription',
            url='https://youtu.be/video1',
            duration=120,
            view_count=1000,
            thumbnails=[{'url': 'thumbnail_url'}]
        )
    ),
    # Other ChannelData objects...
]
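Because each transcript arrives as a list of timed segments, a common first step is to merge them into one string. The sketch below uses a stand-in `Transcript` dataclass that mirrors only the fields shown above (`text`, `start`, `duration`); it is not imported from ytfetcher, and `join_segments` is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Stand-in mirroring the fields shown above; not the ytfetcher class."""
    text: str
    start: float
    duration: float


def join_segments(segments: list[Transcript]) -> str:
    """Hypothetical helper: merge segments into one string, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start)
    # Skip segments that are empty after stripping whitespace.
    return " ".join(s.text.strip() for s in ordered if s.text.strip())


segments = [
    Transcript(text="Happy coding!", start=1.56, duration=4.46),
    Transcript(text="Hey there", start=0.0, duration=1.54),
]
full_text = join_segments(segments)
print(full_text)  # Hey there Happy coding!
```

With real data, the same idea would apply to `video.transcripts` for each item in `channel_data`, assuming the objects expose the attributes shown in the structure above.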
You can also preview this data using the PreviewRenderer class from ytfetcher.services.
from ytfetcher.services import PreviewRenderer
channel_data = fetcher.fetch_with_comments(max_comments=10)
#print(channel_data)
preview = PreviewRenderer()
preview.render(data=channel_data, limit=4)
This will preview the first 4 results of the data in a beautifully formatted terminal view, including metadata, transcript snippets, and comments.
Features
- Fetch full transcripts from a YouTube channel.
- Get video metadata: title, description, thumbnails, published date.
- Support for fetching by channel handle, playlist ID, custom video IDs, or a search query.
- Fetch comments in bulk.
- Concurrent fetching for high performance.
- Built-in cache support.
- Export fetched data as txt, csv or json.
- CLI support.
Fetching Specific Channel Tabs (Videos / Shorts / Streams)
Use the tab parameter in from_channel() to select which section of a channel to fetch.
Available options:
- 'videos' (default)
- 'shorts'
- 'streams'
If not specified, the fetcher defaults to the Videos tab.
# Fetch regular videos (default)
YTFetcher.from_channel(channel_handle="handle")
# Fetch Shorts
YTFetcher.from_channel(channel_handle="handle", tab="shorts")
# Fetch live streams
YTFetcher.from_channel(channel_handle="handle", tab="streams")
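As a rough illustration of what the three tab values correspond to, the sketch below maps a handle and tab to the matching public channel page. The URL shape is an assumption based on how YouTube channel pages are addressed in a browser; `channel_tab_url` and `VALID_TABS` are hypothetical names, and nothing here describes how ytfetcher resolves tabs internally.

```python
VALID_TABS = ("videos", "shorts", "streams")


def channel_tab_url(channel_handle: str, tab: str = "videos") -> str:
    """Hypothetical helper: build the public URL of a channel tab."""
    if tab not in VALID_TABS:
        raise ValueError(f"tab must be one of {VALID_TABS}, got {tab!r}")
    # Accept handles with or without the leading "@".
    handle = channel_handle.lstrip("@")
    return f"https://www.youtube.com/@{handle}/{tab}"


print(channel_tab_url("TheOffice"))
print(channel_tab_url("@TheOffice", tab="shorts"))
```

Validating the tab value up front gives a clear error for typos such as "short" before any network request is made.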
Using Different Fetchers
ytfetcher supports several other fetching options:
- Fetching from a playlist ID with the from_playlist_id method.
- Fetching from video IDs with the from_video_ids method.
- Fetching from a search query with the from_search method.
Fetching from Playlist ID
Use from_playlist_id to retrieve metadata and transcripts for every video within a public or unlisted YouTube playlist.
from ytfetcher import YTFetcher
fetcher = YTFetcher.from_playlist_id(
playlist_id="playlistid1254"
)
# Rest is same ...
Fetching With Custom Video IDs
If you already have specific video identifiers, from_video_ids allows you to target them directly.
This is the most efficient way to fetch data when you have an external list of URLs or IDs.
from ytfetcher import YTFetcher
fetcher = YTFetcher.from_video_ids(
video_ids=['video1', 'video2', 'video3']
)
# Rest is same ...
Fetching With Search Query
The from_search method allows you to discover videos based on a keyword or phrase, similar to using the YouTube search bar. You can control the breadth of the search using the max_results parameter.
from ytfetcher import YTFetcher
# Searches for the top 10 videos matching 'Artificial Intelligence'
fetcher = YTFetcher.from_search(
query="Artificial Intelligence",
max_results=10
)
YTFetcher Options
YTFetcher provides a simple interface for customizing your fetching process with several optional parameters:
- languages: Specify preferred transcript languages (e.g., ["en", "tr"]).
- filters: Apply filters to video metadata before transcripts are fetched.
- manually_created: Fetch only manually created transcripts for more precise transcripts.
- proxy_config: Provide custom proxy settings to help avoid bans.
- http_config: Define custom HTTP headers.
- cache_enabled: Enable or disable the SQLite transcript cache. Enabled by default.
- cache_path: Choose where the cache file (cache.sqlite3) is stored.
These options can be passed to any of the fetcher methods (from_channel, from_video_ids, from_playlist_id, or from_search) to tailor the fetching process to your needs. You can use the FetchOptions dataclass from ytfetcher.config to configure them.
See below for usage examples.
Retrieve Different Languages
You can use the languages parameter to choose which transcript languages to retrieve. (Default: en)
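from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
options = FetchOptions(
    languages=['tr', 'en']  # Tried in order: Turkish first, then English
)
video_ids = ['video1', 'video2']  # Placeholder video IDs
fetcher = YTFetcher.from_video_ids(video_ids=video_ids, options=options)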
Also here's a quick CLI command for languages param.
ytfetcher channel TheOffice -m 50 -f csv --languages tr en
ytfetcher first tries to fetch the Turkish transcript. If it's not available, it falls back to English.
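The fallback order described above can be sketched as a simple first-match lookup. This is an illustration of the idea only: `pick_language` is a hypothetical helper, and the set of available languages is made-up sample data rather than something returned by ytfetcher.

```python
from typing import Optional


def pick_language(preferred: list[str], available: set[str]) -> Optional[str]:
    """Hypothetical helper: return the first preferred language that is available."""
    for code in preferred:
        if code in available:
            return code
    return None  # None of the preferred languages has a transcript


# Turkish is requested first, but only English and German exist for this video.
print(pick_language(["tr", "en"], {"en", "de"}))  # en
print(pick_language(["tr", "en"], {"tr", "en"}))  # tr
print(pick_language(["tr", "en"], {"de"}))        # None
```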
Filtering
ytfetcher allows you to filter videos before fetching transcripts, which helps you focus on specific content and save processing time. Filters are applied to video metadata (duration, view count, title) and work with all fetcher methods.
Available Filter Functions
The following filter functions are available in ytfetcher.filters:
- min_duration(sec: float) - Filter videos with duration greater than or equal to specified seconds
- max_duration(sec: float) - Filter videos with duration less than or equal to specified seconds
- min_views(n: int) - Filter videos with view count greater than or equal to specified number
- max_views(n: int) - Filter videos with view count less than or equal to specified number
- filter_by_title(search_query: str) - Filter videos whose title contains the search query (case-insensitive)
Using Filters in Python API
Pass a list of filter functions to the filters parameter when creating a fetcher:
from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
from ytfetcher.filters import min_duration, min_views, filter_by_title
options = FetchOptions(
filters=[
min_views(5000),
min_duration(600), # At least 10 minutes
filter_by_title("tutorial")
]
)
fetcher = YTFetcher.from_channel(
channel_handle="TheOffice",
max_results=50,
options=options
)
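One way to picture how such filters behave is as predicates over video metadata that must all pass. The sketch below re-implements three of them locally under that assumption; these functions only share names with the ones in ytfetcher.filters and are not the library's code, `Snippet` is a stand-in for the metadata object, and combining filters with a logical AND is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Snippet:
    """Stand-in for video metadata; not the ytfetcher DLSnippet class."""
    title: str
    duration: float
    view_count: int


Filter = Callable[[Snippet], bool]


def min_duration(sec: float) -> Filter:
    return lambda s: s.duration >= sec


def min_views(n: int) -> Filter:
    return lambda s: s.view_count >= n


def filter_by_title(search_query: str) -> Filter:
    needle = search_query.lower()
    return lambda s: needle in s.title.lower()  # case-insensitive substring match


def apply_filters(snippets: list[Snippet], filters: list[Filter]) -> list[Snippet]:
    """Keep only snippets that satisfy every filter (assumed AND semantics)."""
    return [s for s in snippets if all(f(s) for f in filters)]


videos = [
    Snippet("Python Tutorial for Beginners", duration=900, view_count=12000),
    Snippet("Quick tutorial", duration=120, view_count=50000),
    Snippet("Vlog", duration=1200, view_count=8000),
]
kept = apply_filters(videos, [min_views(5000), min_duration(600), filter_by_title("tutorial")])
print([s.title for s in kept])  # ['Python Tutorial for Beginners']
```

Filtering on metadata first means transcripts only need to be fetched for the videos that survive, which is where the time saving comes from.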
Using Filters in CLI
You can use filter arguments directly in the CLI:
# Filter by minimum views
ytfetcher channel TheOffice -m 50 -f json --min-views 1000
# Filter by minimum duration (in seconds)
ytfetcher channel TheOffice -m 50 -f csv --min-duration 300
# Filter by title substring
ytfetcher channel TheOffice -m 50 -f json --includes-title "episode"
# Combine multiple filters
ytfetcher channel TheOffice -m 50 -f json --min-views 1000 --min-duration 300 --includes-title "tutorial"
Converting ChannelData to Rows
If you want a flat, row-based structure for ML workflows (Pandas, HuggingFace datasets, JSON/Parquet), you can use the helper in ytfetcher.utils to join transcript segments. Comments are only included if you fetched them with fetch_with_comments or fetch_comments.
from ytfetcher import YTFetcher
from ytfetcher.utils import channel_data_to_rows
fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)
channel_data = fetcher.fetch_with_comments(max_comments=5)
rows = channel_data_to_rows(channel_data, include_comments=True)
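To show what a flat, row-based structure might look like, here is a minimal sketch that flattens stand-in video records into dictionaries and serializes them as JSON Lines. The row schema (`video_id`, `title`, `text`) is hypothetical; the actual columns produced by channel_data_to_rows are not documented here, and `Video` and `to_row` are illustrative names only.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Video:
    """Stand-in record; not the ytfetcher ChannelData class."""
    video_id: str
    title: str
    segments: list[str] = field(default_factory=list)


def to_row(video: Video) -> dict:
    """Hypothetical flattening: one row per video with the joined transcript text."""
    return {
        "video_id": video.video_id,
        "title": video.title,
        "text": " ".join(video.segments),
    }


videos = [
    Video("video1", "VideoTitle", ["Hey there", "Happy coding!"]),
    Video("video2", "AnotherTitle", ["Second video"]),
]
rows = [to_row(v) for v in videos]

# One JSON object per line, a format many dataset loaders can read.
jsonl = "\n".join(json.dumps(r) for r in rows)
print(jsonl)
```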
SQLite Cache
ytfetcher now uses a local SQLite cache for transcripts. This significantly speeds up repeated fetches by reusing transcripts that were already fetched with the same transcript options.
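The idea of reusing a transcript only when it was fetched with the same options can be sketched with the standard sqlite3 module. The table layout, the `cache_key` and `get_or_fetch` helpers, and the choice of options included in the key are all assumptions for illustration; ytfetcher's actual cache schema is not shown in this document.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # In-memory database, for illustration only
conn.execute("CREATE TABLE transcripts (key TEXT PRIMARY KEY, payload TEXT NOT NULL)")


def cache_key(video_id: str, languages: list[str], manually_created: bool) -> str:
    """Hypothetical key: the video ID plus the options that affect the transcript."""
    return json.dumps([video_id, list(languages), manually_created])


def get_or_fetch(video_id, languages, manually_created, fetch):
    key = cache_key(video_id, languages, manually_created)
    row = conn.execute("SELECT payload FROM transcripts WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return json.loads(row[0])  # Cache hit: no fetch needed
    payload = fetch(video_id)      # Cache miss: fetch and store
    conn.execute("INSERT INTO transcripts (key, payload) VALUES (?, ?)", (key, json.dumps(payload)))
    conn.commit()
    return payload


calls = []


def fake_fetch(video_id):
    """Stand-in for a network fetch; records each call."""
    calls.append(video_id)
    return [{"text": "Hey there", "start": 0.0, "duration": 1.54}]


first = get_or_fetch("video1", ["en"], False, fake_fetch)
second = get_or_fetch("video1", ["en"], False, fake_fetch)       # Same options: served from cache
third = get_or_fetch("video1", ["tr", "en"], False, fake_fetch)  # Different options: fetched again
print(len(calls))  # 2
```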
Python API cache options
from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
options = FetchOptions(
cache_enabled=True,
cache_path="./.ytfetcher_cache"
)
fetcher = YTFetcher.from_channel(
channel_handle="TheOffice",
max_results=20,
options=options,
)
Disable cache when needed:
from ytfetcher.config import FetchOptions
options = FetchOptions(cache_enabled=False)
Control cache expiration with TTL (days):
from ytfetcher.config import FetchOptions
# Keep cached transcripts for 3 days
options = FetchOptions(cache_ttl=3)
# Disable expiration entirely
options = FetchOptions(cache_ttl=0)
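The TTL rule (a number of days, with 0 meaning entries never expire) can be expressed as a small check. This is a sketch of the rule as described above, not the library's code; `is_expired` and the fixed timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired(cached_at: datetime, ttl_days: int, now: Optional[datetime] = None) -> bool:
    """Hypothetical check: True if a cached entry is older than ttl_days."""
    if ttl_days == 0:
        return False  # 0 disables expiration entirely
    current = now if now is not None else datetime.now(timezone.utc)
    return current - cached_at > timedelta(days=ttl_days)


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
fresh = datetime(2024, 1, 8, tzinfo=timezone.utc)  # 2 days old
stale = datetime(2024, 1, 5, tzinfo=timezone.utc)  # 5 days old

print(is_expired(fresh, ttl_days=3, now=now))  # False
print(is_expired(stale, ttl_days=3, now=now))  # True
print(is_expired(stale, ttl_days=0, now=now))  # False
```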
CLI cache options
Use --no-cache to skip reading/writing cache for a command:
ytfetcher channel TheOffice -m 20 --no-cache -f json
Set a custom cache directory:
ytfetcher channel TheOffice -m 20 --cache-path ./my_cache -f json
Set cache TTL in days (0 disables expiration):
ytfetcher channel TheOffice -m 20 --cache-ttl 3 -f json
Clear cached transcripts:
ytfetcher cache --clean
Or clear a custom cache path:
ytfetcher cache --clean --cache-path ./my_cache
Fetching Only Manually Created Transcripts
ytfetcher can fetch only manually created transcripts from a channel, which gives you more precise transcripts than auto-generated ones.
from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
options = FetchOptions(
manually_created=True
)
fetcher = YTFetcher.from_channel(channel_handle="TEDx", options=options)
You can also enable this feature with the --manually-created argument in the CLI.
ytfetcher channel TEDx -f csv --manually-created
Exporting
Use an exporter class (JSONExporter, CSVExporter, or TXTExporter) to export ChannelData as JSON, CSV, or TXT:
from ytfetcher.services import JSONExporter # OR you can import other exporters: TXTExporter, CSVExporter
channel_data = fetcher.fetch_youtube_data()
exporter = JSONExporter(
channel_data=channel_data,
allowed_metadata_list=['title'], # You can customize this
timing=True, # Include transcript start/duration
filename='my_export', # Base filename
output_dir='./exports' # Optional output directory
)
exporter.write()
Exporting With CLI
You can also specify arguments when exporting which allows you to decide whether to exclude timings and choose desired metadata.
ytfetcher channel TheOffice -m 20 -f json --no-timing --metadata title description
This command will exclude timings from transcripts and keep only title and description as metadata.
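The effect of selecting metadata fields and dropping timings can be sketched on plain dictionaries. The output layout below is hypothetical and is not the format the ytfetcher exporters write; `build_export` is an illustrative helper operating on stand-in data.

```python
import json


def build_export(videos, allowed_metadata, timing=True):
    """Hypothetical helper: keep chosen metadata keys and optionally drop timings."""
    exported = []
    for video in videos:
        entry = {"video_id": video["video_id"]}
        # Copy only the metadata fields that were explicitly allowed.
        entry.update({k: v for k, v in video["metadata"].items() if k in allowed_metadata})
        if timing:
            entry["transcripts"] = video["transcripts"]
        else:
            # Keep the text but drop start/duration.
            entry["transcripts"] = [{"text": t["text"]} for t in video["transcripts"]]
        exported.append(entry)
    return exported


videos = [
    {
        "video_id": "video1",
        "metadata": {"title": "VideoTitle", "description": "VideoDescription", "view_count": 1000},
        "transcripts": [{"text": "Hey there", "start": 0.0, "duration": 1.54}],
    }
]

exported = build_export(videos, allowed_metadata=["title", "description"], timing=False)
print(json.dumps(exported, indent=2))
```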
Fetching Comments
ytfetcher lets you fetch comments in bulk, either together with metadata and transcripts or on their own.
Performance: Comment fetching is a resource-intensive process. Its speed depends largely on your internet connection and the total number of comments being retrieved.
Fetch Comments With Transcripts And Metadata
To fetch comments alongside transcripts and metadata, use the fetch_with_comments method.
fetcher = YTFetcher.from_channel("TheOffice", max_results=5)
channel_data_with_comments = fetcher.fetch_with_comments(max_comments=10)
This fetches the top 10 comments for every video along with the transcript data.
Here's an example structure:
[
ChannelData(
video_id='id1',
transcripts=list[Transcript(...)],
metadata=DLSnippet(...),
comments=list[Comment(
text='Comment one.',
like_count=20,
author='@author',
time_text='8 days ago'
)]
)
]
Fetch Only Comments
To fetch comments without transcripts, use the fetch_comments method.
fetcher = YTFetcher.from_channel("TheOffice", max_results=5)
comments = fetcher.fetch_comments(max_comments=20)
This returns a list of Comment objects like this:
[
    Comment(
        text='Comment one.',
        like_count=20,
        author='@author',
        time_text='8 days ago'
    ),
    # Other Comment objects...
]
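Once comments are in hand, a typical next step is ranking them by likes. The sketch below uses a stand-in `Comment` dataclass with the four fields shown above; it is not imported from ytfetcher, and `top_comments` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Stand-in mirroring the fields shown above; not the ytfetcher class."""
    text: str
    like_count: int
    author: str
    time_text: str


def top_comments(comments: list[Comment], n: int) -> list[Comment]:
    """Hypothetical helper: return the n most-liked comments, highest first."""
    return sorted(comments, key=lambda c: c.like_count, reverse=True)[:n]


comments = [
    Comment("Comment one.", 20, "@author", "8 days ago"),
    Comment("Comment two.", 5, "@other", "2 days ago"),
    Comment("Comment three.", 42, "@third", "1 month ago"),
]
best = top_comments(comments, 2)
print([c.text for c in best])  # ['Comment three.', 'Comment one.']
```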
Fetching Comments With CLI
Comments can also be fetched from the CLI.
To fetch comments with transcripts, use the --comments argument:
ytfetcher channel TheOffice -m 20 --comments 10 -f json
To fetch only comments with metadata, use the --comments-only argument:
ytfetcher channel TheOffice -m 20 --comments-only 10 -f json
Other Methods
You can also fetch only transcript data or metadata with video IDs using fetch_transcripts and fetch_snippets.
Fetch Transcripts
fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)
data = fetcher.fetch_transcripts()
print(data)
Fetch Snippets
data = fetcher.fetch_snippets()
print(data)
Proxy Configuration
YTFetcher supports proxy usage for fetching YouTube transcripts:
from ytfetcher import YTFetcher
from ytfetcher.config import GenericProxyConfig, WebshareProxyConfig, FetchOptions
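# Pass exactly one proxy config instance: GenericProxyConfig or WebshareProxyConfig
# (constructor arguments are omitted here; supply your own proxy details).
options = FetchOptions(
    proxy_config=WebshareProxyConfig()  # or: GenericProxyConfig()
)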
fetcher = YTFetcher.from_channel(
channel_handle="TheOffice",
max_results=3,
options=options
)
Advanced HTTP Configuration (Optional)
YTFetcher already uses custom headers to mimic real browser behavior, but if you want to change them, you can pass a custom HTTPConfig.
from ytfetcher import YTFetcher
from ytfetcher.config import HTTPConfig, FetchOptions
custom_config = HTTPConfig(
headers={"User-Agent": "ytfetcher/1.0"}
)
options = FetchOptions(
http_config=custom_config
)
fetcher = YTFetcher.from_channel(
channel_handle="TheOffice",
max_results=10,
options=options
)
CLI (Advanced)
CLI Overview
YTFetcher comes with a simple CLI so you can fetch data directly from your terminal.
ytfetcher -h
usage: ytfetcher [-h] {channel,playlist,video,search} ...
Fetch YouTube transcripts for a channel
positional arguments:
{channel,playlist,video,search}
channel Fetch data from channel handle with max_results.
playlist Fetch data from a specific playlist id.
video Fetch data from your custom video ids.
search Fetch data from youtube with search query.
options:
-h, --help show this help message and exit
Basic Usage
ytfetcher channel <CHANNEL_HANDLE> -m <MAX_RESULTS> -f <FORMAT>
Fetching Different Channel Tabs (Videos / Shorts / Streams)
Use --tab to choose which channel feed should be fetched.
# Default: videos
ytfetcher channel TheOffice -m 20 --tab videos -f json
# Fetch from the Shorts tab
ytfetcher channel TheOffice -m 20 --tab shorts -f json
# Fetch from the Live/Streams tab
ytfetcher channel TheOffice -m 20 --tab streams -f json
Fetching by Video IDs
ytfetcher video video_id1 video_id2 ... -f json
Fetching From Playlist Id
ytfetcher playlist playlistid123 -f csv -m 25
Fetching with Search Method
ytfetcher search "AI Getting Jobs" -f json -m 25
Using Webshare Proxy
ytfetcher channel <CHANNEL_HANDLE> -f json --webshare-proxy-username "<USERNAME>" --webshare-proxy-password "<PASSWORD>"
Using Custom Proxy
ytfetcher channel <CHANNEL_HANDLE> -f json --http-proxy "http://user:pass@host:port" --https-proxy "https://user:pass@host:port"
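Proxy URLs in the user:pass@host:port form break if the credentials contain characters such as "@" or ":". A minimal sketch of building a safe URL with the standard library follows; `build_proxy_url` is a hypothetical helper and the host name is a placeholder, neither of which comes from ytfetcher.

```python
from urllib.parse import quote


def build_proxy_url(scheme: str, user: str, password: str, host: str, port: int) -> str:
    """Hypothetical helper: percent-encode credentials before embedding them in a URL."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"{scheme}://{safe_user}:{safe_password}@{host}:{port}"


proxy_url = build_proxy_url("http", "user", "p@ss:word", "proxy.example.com", 8080)
print(proxy_url)  # http://user:p%40ss%3Aword@proxy.example.com:8080
```

The resulting string can then be passed as the value of --http-proxy or --https-proxy.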
Docker Quick Start
The recommended way to run or develop YTFetcher is using Docker to ensure a clean, stable environment without needing local Python or dependency management.
docker-compose build
Use docker-compose run to execute your desired command inside the container.
docker-compose run ytfetcher poetry run ytfetcher channel TheOffice -m 20 -f json
Contributing
git clone https://github.com/kaya70875/ytfetcher.git
cd ytfetcher
poetry install
Running Tests
poetry run pytest
Running Type Check
All type checks must pass before you contribute to ytfetcher.
poetry run mypy ytfetcher
Related Projects
License
This project is licensed under the MIT License — see the LICENSE file for details.
Contributors
Thanks to everyone who has contributed to ytfetcher ❤️
⭐ If you find this useful, please star the repo or open an issue with feedback!