YTFetcher lets you fetch YouTube transcripts in bulk with metadata like titles, publish dates, and thumbnails. Great for ML, NLP, and dataset generation.

YTFetcher

⚡ Build structured YouTube datasets for NLP, ML, sentiment analysis & RAG in minutes.

A Python tool for quickly fetching thousands of videos from a YouTube channel, along with structured transcripts and additional metadata. Export data easily as CSV, TXT, or JSON.


Installation

Install from PyPI:

pip install ytfetcher

Quick CLI Usage

Fetch 50 video transcripts + metadata from a channel and save as JSON:

ytfetcher channel TheOffice -m 50 -f json

Basic Usage (Python API)

Here’s how to get transcripts and metadata such as channel name, description, and publish date from a single channel with the from_channel method:

from ytfetcher import YTFetcher

fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=2
)

channel_data = fetcher.fetch_youtube_data()
for video in channel_data:
    print(video.metadata.title)
    print(video.metadata.description)
    print(video.transcripts)

This returns a list of ChannelData objects, each carrying its metadata in a DLSnippet object:

[
ChannelData(
    video_id='video1',
    transcripts=[
        Transcript(
            text="Hey there",
            start=0.0,
            duration=1.54
        ),
        Transcript(
            text="Happy coding!",
            start=1.56,
            duration=4.46
        )
    ],
    metadata=DLSnippet(
        video_id='video1',
        title='VideoTitle',
        description='VideoDescription',
        url='https://youtu.be/video1',
        duration=120,
        view_count=1000,
        thumbnails=[{'url': 'thumbnail_url'}]
    )
),
# Other ChannelData objects...
]
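
To make the shape above concrete, here is a minimal sketch that joins the transcript segments of one video into plain text. The Transcript and ChannelData dataclasses below are simplified stand-ins defined only for this example; they mirror the field names shown above and are not imported from ytfetcher. The full_text helper is hypothetical as well.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Stand-in for the Transcript shape shown above (not the library class).
    text: str
    start: float
    duration: float


@dataclass
class ChannelData:
    # Stand-in for ChannelData; metadata is omitted to keep the sketch short.
    video_id: str
    transcripts: list[Transcript] = field(default_factory=list)


def full_text(video: ChannelData) -> str:
    """Join every transcript segment of one video into a single string."""
    return " ".join(segment.text for segment in video.transcripts)


video = ChannelData(
    video_id="video1",
    transcripts=[
        Transcript(text="Hey there", start=0.0, duration=1.54),
        Transcript(text="Happy coding!", start=1.56, duration=4.46),
    ],
)
print(full_text(video))  # Hey there Happy coding!
```

With real data, the same loop should work on the objects returned by fetch_youtube_data(), provided they expose the transcripts and text attributes shown above.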

You can also preview this data using the PreviewRenderer class from ytfetcher.services.

from ytfetcher.services import PreviewRenderer

channel_data = fetcher.fetch_with_comments(max_comments=10)
#print(channel_data)
preview = PreviewRenderer()
preview.render(data=channel_data, limit=4)

This will preview the first 4 results of the data in a beautifully formatted terminal view, including metadata, transcript snippets, and comments.


Features

  • Fetch full transcripts from a YouTube channel.
  • Get video metadata: title, description, thumbnails, published date.
  • Support for fetching by channel handle, playlist ID, custom video IDs, or search query.
  • Fetch comments in bulk.
  • Concurrent fetching for high performance.
  • Built-in cache support.
  • Export fetched data as txt, csv or json.
  • CLI support.

Fetching Specific Channel Tabs (Videos / Shorts / Streams)

Use the tab parameter in from_channel() to select which section of a channel to fetch.

Available options:

  • 'videos' (default)
  • 'shorts'
  • 'streams'

If not specified, the fetcher defaults to the Videos tab.

# Fetch regular videos (default)
YTFetcher.from_channel(channel_handle="handle")

# Fetch Shorts
YTFetcher.from_channel(channel_handle="handle", tab="shorts")

# Fetch live streams
YTFetcher.from_channel(channel_handle="handle", tab="streams")

Using Different Fetchers

ytfetcher also supports several other fetching options:

  • Fetching from a playlist ID with the from_playlist_id method.
  • Fetching from video IDs with the from_video_ids method.
  • Fetching from a search query with the from_search method.

Fetching from Playlist ID

Use from_playlist_id to retrieve metadata and transcripts for every video within a public or unlisted YouTube playlist.

from ytfetcher import YTFetcher

fetcher = YTFetcher.from_playlist_id(
    playlist_id="playlistid1254"
)

# Rest is same ...

Fetching With Custom Video IDs

If you already have specific video identifiers, from_video_ids allows you to target them directly. This is the most efficient way to fetch data when you have an external list of URLs or IDs.

from ytfetcher import YTFetcher

fetcher = YTFetcher.from_video_ids(
    video_ids=['video1', 'video2', 'video3']
)

# Rest is same ...

Fetching With Search Query

The from_search method allows you to discover videos based on a keyword or phrase, similar to using the YouTube search bar. You can control the breadth of the search using the max_results parameter.

from ytfetcher import YTFetcher

# Searches for the top 10 videos matching 'Artificial Intelligence'
fetcher = YTFetcher.from_search(
    query="Artificial Intelligence",
    max_results=10
)

YTFetcher Options

YTFetcher provides a simple interface for customizing your fetching process with several optional parameters:

  • languages: Specify preferred transcript languages (e.g., ["en", "tr"]).
  • filters: Apply filters to video metadata before transcripts are fetched.
  • manually_created: Fetch only manually created transcripts, which tend to be more accurate.
  • proxy_config: Provide custom proxy settings to reduce the risk of bans.
  • http_config: Define custom HTTP headers.
  • cache_enabled: Enable or disable the SQLite transcript cache. Enabled by default.
  • cache_path: Choose where the cache file (cache.sqlite3) is stored.

These options can be passed to any of the fetcher methods (from_channel, from_video_ids, from_playlist_id, or from_search) to tailor the fetching process to your needs. Use the FetchOptions dataclass from ytfetcher.config to configure them.

See below for usage examples.

Retrieve Different Languages

Use the languages parameter to request transcripts in your preferred languages (default: en).

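from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions

options = FetchOptions(
    languages=['tr', 'en']
)

video_ids = ['video1', 'video2']  # Placeholder IDs; replace with your own

fetcher = YTFetcher.from_video_ids(video_ids=video_ids, options=options)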

The same option is available in the CLI:

ytfetcher channel TheOffice -m 50 -f csv --languages tr en

ytfetcher first tries to fetch the Turkish transcript. If it's not available, it falls back to English.
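
The fallback order described above can be pictured with a small sketch. The pick_language helper is hypothetical and written only for illustration; it is not part of ytfetcher and does not show how the library resolves languages internally.

```python
from typing import Optional


def pick_language(available: list[str], preferred: list[str]) -> Optional[str]:
    """Return the first preferred language code that is available, else None."""
    for code in preferred:
        if code in available:
            return code
    return None


# Turkish is preferred, but this video only offers English and German.
print(pick_language(available=["en", "de"], preferred=["tr", "en"]))  # en
```

The order of the list matters: ['tr', 'en'] and ['en', 'tr'] can select different transcripts for a video that offers both.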


Filtering

ytfetcher allows you to filter videos before fetching transcripts, which helps you focus on specific content and save processing time. Filters are applied to video metadata (duration, view count, title) and work with all fetcher methods.

Available Filter Functions

The following filter functions are available in ytfetcher.filters:

  • min_duration(sec: float) - Filter videos with duration greater than or equal to specified seconds
  • max_duration(sec: float) - Filter videos with duration less than or equal to specified seconds
  • min_views(n: int) - Filter videos with view count greater than or equal to specified number
  • max_views(n: int) - Filter videos with view count less than or equal to specified number
  • filter_by_title(search_query: str) - Filter videos whose title contains the search query (case-insensitive)
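
To illustrate the semantics listed above, the sketch below re-creates three of the filters as plain closures over a dict. This is not the library's implementation: the dict-based video shape, the apply_filters helper, and the assumption that multiple filters are combined with logical AND are all introduced here for illustration.

```python
from typing import Callable

# Hypothetical stand-in: a plain dict holding the metadata fields used below.
Video = dict
VideoFilter = Callable[[Video], bool]


def min_duration(sec: float) -> VideoFilter:
    # Pass videos whose duration is at least `sec` seconds.
    return lambda video: video["duration"] >= sec


def min_views(n: int) -> VideoFilter:
    # Pass videos whose view count is at least `n`.
    return lambda video: video["view_count"] >= n


def filter_by_title(search_query: str) -> VideoFilter:
    # Case-insensitive substring match on the title.
    needle = search_query.lower()
    return lambda video: needle in video["title"].lower()


def apply_filters(videos: list[Video], filters: list[VideoFilter]) -> list[Video]:
    """Keep only the videos that pass every filter (logical AND, an assumption)."""
    return [video for video in videos if all(f(video) for f in filters)]


videos = [
    {"title": "Python Tutorial", "duration": 900, "view_count": 12000},
    {"title": "Short clip", "duration": 45, "view_count": 50000},
    {"title": "Advanced TUTORIAL", "duration": 1200, "view_count": 300},
]
kept = apply_filters(videos, [min_views(5000), min_duration(600), filter_by_title("tutorial")])
print([video["title"] for video in kept])  # ['Python Tutorial']
```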

Using Filters in Python API

Pass a list of filter functions to the filters parameter when creating a fetcher:

from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions
from ytfetcher.filters import min_duration, min_views, filter_by_title

options = FetchOptions(
    filters=[
        min_views(5000),
        min_duration(600),  # At least 10 minutes
        filter_by_title("tutorial")
    ]
)

fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=50,
    options=options
)

Using Filters in CLI

You can use filter arguments directly in the CLI:

# Filter by minimum views
ytfetcher channel TheOffice -m 50 -f json --min-views 1000

# Filter by minimum duration (in seconds)
ytfetcher channel TheOffice -m 50 -f csv --min-duration 300

# Filter by title substring
ytfetcher channel TheOffice -m 50 -f json --includes-title "episode"

# Combine multiple filters
ytfetcher channel TheOffice -m 50 -f json --min-views 1000 --min-duration 300 --includes-title "tutorial"

Converting ChannelData to Rows

If you want a flat, row-based structure for ML workflows (Pandas, HuggingFace datasets, JSON/Parquet), you can use the helper in ytfetcher.utils to join transcript segments. Comments are only included if you fetched them with fetch_with_comments or fetch_comments.

from ytfetcher import YTFetcher
from ytfetcher.utils import channel_data_to_rows

fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)
channel_data = fetcher.fetch_with_comments(max_comments=5)

rows = channel_data_to_rows(channel_data, include_comments=True)
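
Once you have rows, one common next step is writing them as JSON Lines. The sketch below assumes each row is a dict of JSON-serializable values; the sample keys are hypothetical, since the exact row layout depends on what channel_data_to_rows returns. The rows_to_jsonl helper is not part of ytfetcher.

```python
import json


def rows_to_jsonl(rows: list[dict]) -> str:
    """Serialize each row dict as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Hypothetical rows; the real keys depend on what channel_data_to_rows returns.
sample_rows = [
    {"video_id": "video1", "title": "VideoTitle", "text": "Hey there Happy coding!"},
    {"video_id": "video2", "title": "Another", "text": "Second transcript"},
]

jsonl = rows_to_jsonl(sample_rows)
print(jsonl)
```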

SQLite Cache

ytfetcher now uses a local SQLite cache for transcripts. This significantly speeds up repeated fetches by reusing transcripts that were already fetched with the same transcript options.

Python API cache options

from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions

options = FetchOptions(
    cache_enabled=True,
    cache_path="./.ytfetcher_cache"
)

fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=20,
    options=options,
)

Disable cache when needed:

from ytfetcher.config import FetchOptions

options = FetchOptions(cache_enabled=False)

Control cache expiration with TTL (days):

from ytfetcher.config import FetchOptions

# Keep cached transcripts for 3 days
options = FetchOptions(cache_ttl=3)

# Disable expiration entirely
options = FetchOptions(cache_ttl=0)
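
The TTL rule above (entries expire after a number of days, and 0 disables expiration) can be sketched as follows. The is_expired helper is hypothetical and only illustrates the rule; it makes no claim about how ytfetcher stores timestamps or checks expiry internally.

```python
from datetime import datetime, timedelta


def is_expired(cached_at: datetime, ttl_days: int, now: datetime) -> bool:
    """Return True when a cache entry is older than ttl_days; 0 never expires."""
    if ttl_days == 0:
        return False
    return now - cached_at > timedelta(days=ttl_days)


now = datetime(2024, 1, 10)
print(is_expired(datetime(2024, 1, 8), ttl_days=3, now=now))  # False: 2 days old
print(is_expired(datetime(2024, 1, 5), ttl_days=3, now=now))  # True: 5 days old
print(is_expired(datetime(2020, 1, 1), ttl_days=0, now=now))  # False: no expiration
```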

CLI cache options

Use --no-cache to skip reading/writing cache for a command:

ytfetcher channel TheOffice -m 20 --no-cache -f json

Set a custom cache directory:

ytfetcher channel TheOffice -m 20 --cache-path ./my_cache -f json

Set cache TTL in days (0 disables expiration):

ytfetcher channel TheOffice -m 20 --cache-ttl 3 -f json

Clear cached transcripts:

ytfetcher cache --clean

Or clear a custom cache path:

ytfetcher cache --clean --cache-path ./my_cache

Fetching Only Manually Created Transcripts

ytfetcher can restrict fetching to manually created transcripts, which are usually more accurate than auto-generated ones.

from ytfetcher import YTFetcher
from ytfetcher.config import FetchOptions

options = FetchOptions(
    manually_created=True
)
fetcher = YTFetcher.from_channel(channel_handle="TEDx", options=options)

You can also enable this feature with the --manually-created argument in the CLI.

ytfetcher channel TEDx -f csv --manually-created

Exporting

Use one of the exporter classes (JSONExporter, CSVExporter, or TXTExporter) to export ChannelData as json, csv, or txt:

from ytfetcher.services import JSONExporter # OR you can import other exporters: TXTExporter, CSVExporter

channel_data = fetcher.fetch_youtube_data()

exporter = JSONExporter(
    channel_data=channel_data,
    allowed_metadata_list=['title'],   # You can customize this
    timing=True,                       # Include transcript start/duration
    filename='my_export',              # Base filename
    output_dir='./exports'             # Optional output directory
)

exporter.write()

Exporting With CLI

When exporting from the CLI, you can pass arguments to exclude timings and to choose which metadata fields to keep.

ytfetcher channel TheOffice -m 20 -f json --no-timing --metadata title description

This command will exclude timings from transcripts and keep only title and description as metadata.


Fetching Comments

ytfetcher lets you fetch comments in bulk, either together with transcripts and metadata or on their own.

Performance: Comment fetching is a resource-intensive process. The speed of extraction depends significantly on the user's internet connection and the total volume of comments being retrieved.

Fetch Comments With Transcripts And Metadata

To fetch comments alongside transcripts and metadata, use the fetch_with_comments method.

fetcher = YTFetcher.from_channel("TheOffice", max_results=5)

channel_data_with_comments = fetcher.fetch_with_comments(max_comments=10)

This fetches the top 10 comments for every video, alongside the transcript data.

Here's an example structure:

[
    ChannelData(
        video_id='id1',
        transcripts=list[Transcript(...)],
        metadata=DLSnippet(...),
        comments=list[Comment(
            text='Comment one.',
            like_count=20,
            author='@author',
            time_text='8 days ago'
        )]
    )
]

Fetch Only Comments

To fetch comments without transcripts, use the fetch_comments method.

fetcher = YTFetcher.from_channel("TheOffice", max_results=5)

comments = fetcher.fetch_comments(max_comments=20)

This returns a list of Comment objects like this:

[
    Comment(
        text='Comment one.',
        like_count=20,
        author='@author',
        time_text='8 days ago'
    ),

    # Other Comment objects...
]
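
As an example of working with this list, the sketch below ranks comments by like count. The Comment dataclass is a stand-in that mirrors the fields shown above and is not imported from ytfetcher; top_comments is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    # Stand-in mirroring the Comment fields shown above (not the library class).
    text: str
    like_count: int
    author: str
    time_text: str


def top_comments(comments: list[Comment], n: int) -> list[Comment]:
    """Return the n most-liked comments, most liked first."""
    return sorted(comments, key=lambda c: c.like_count, reverse=True)[:n]


comments = [
    Comment(text="Comment one.", like_count=20, author="@author", time_text="8 days ago"),
    Comment(text="Comment two.", like_count=75, author="@other", time_text="2 days ago"),
    Comment(text="Comment three.", like_count=3, author="@third", time_text="1 day ago"),
]
for comment in top_comments(comments, n=2):
    print(comment.like_count, comment.text)
```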

Fetching Comments With CLI

Comments can also be fetched from the CLI.

To fetch comments together with transcripts, use the --comments argument:

ytfetcher channel TheOffice -m 20 --comments 10 -f json

To fetch only comments with metadata, use the --comments-only argument:

ytfetcher channel TheOffice -m 20 --comments-only 10 -f json

Other Methods

You can also fetch only transcript data or metadata with video IDs using fetch_transcripts and fetch_snippets.

Fetch Transcripts

fetcher = YTFetcher.from_channel(channel_handle="TheOffice", max_results=2)
data = fetcher.fetch_transcripts()

print(data)

Fetch Snippets

data = fetcher.fetch_snippets()
print(data)

Proxy Configuration

YTFetcher supports proxy usage for fetching YouTube transcripts:

from ytfetcher import YTFetcher
from ytfetcher.config import GenericProxyConfig, WebshareProxyConfig, FetchOptions

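options = FetchOptions(
    # Pass exactly one proxy config: GenericProxyConfig(...) or WebshareProxyConfig(...),
    # filled in with your own proxy settings.
    proxy_config=GenericProxyConfig()
)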

fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=3,
    options=options
)

Advanced HTTP Configuration (Optional)

YTFetcher already sends custom headers that mimic real browser behavior. If you want to change them, pass a custom HTTPConfig.

from ytfetcher import YTFetcher
from ytfetcher.config import HTTPConfig, FetchOptions

custom_config = HTTPConfig(
    headers={"User-Agent": "ytfetcher/1.0"}
)

options = FetchOptions(
    http_config=custom_config
)

fetcher = YTFetcher.from_channel(
    channel_handle="TheOffice",
    max_results=10,
    options=options
)

CLI (Advanced)

CLI Overview

YTFetcher comes with a simple CLI so you can fetch data directly from your terminal.

ytfetcher -h
usage: ytfetcher [-h] {channel,playlist,video,search} ...

Fetch YouTube transcripts for a channel

positional arguments:
  {channel,playlist,video,search}
    channel     Fetch data from channel handle with max_results.
    playlist    Fetch data from a specific playlist id.
    video       Fetch data from your custom video ids.
    search      Fetch data from youtube with search query.

options:
  -h, --help            show this help message and exit

Basic Usage

ytfetcher channel <CHANNEL_HANDLE> -m <MAX_RESULTS> -f <FORMAT>

Fetching Different Channel Tabs (Videos / Shorts / Streams)

Use --tab to choose which channel feed should be fetched.

# Default: videos
ytfetcher channel TheOffice -m 20 --tab videos -f json

# Fetch from the Shorts tab
ytfetcher channel TheOffice -m 20 --tab shorts -f json

# Fetch from the Live/Streams tab
ytfetcher channel TheOffice -m 20 --tab streams -f json

Fetching by Video IDs

ytfetcher video video_id1 video_id2 ... -f json

Fetching From Playlist Id

ytfetcher playlist playlistid123 -f csv -m 25

Fetching with Search Method

ytfetcher search "AI Getting Jobs" -f json -m 25

Using Webshare Proxy

ytfetcher channel <CHANNEL_HANDLE> -f json --webshare-proxy-username "<USERNAME>" --webshare-proxy-password "<PASSWORD>"

Using Custom Proxy

ytfetcher channel <CHANNEL_HANDLE> -f json --http-proxy "http://user:pass@host:port" --https-proxy "https://user:pass@host:port"

Docker Quick Start

The recommended way to run or develop YTFetcher is using Docker to ensure a clean, stable environment without needing local Python or dependency management.

docker-compose build

Use docker-compose run to execute your desired command inside the container.

docker-compose run ytfetcher poetry run ytfetcher channel TheOffice -m 20 -f json

Contributing

git clone https://github.com/kaya70875/ytfetcher.git
cd ytfetcher
poetry install

Running Tests

poetry run pytest

Running Type Check

Your changes must pass all type checks before you contribute to ytfetcher.

poetry run mypy ytfetcher

License

This project is licensed under the MIT License — see the LICENSE file for details.

Contributors

Thanks to everyone who has contributed to ytfetcher ❤️

⭐ If you find this useful, please star the repo or open an issue with feedback!
