A package to crawl a YouTube channel

Crawl YouTube Channel

This Python package provides tools to crawl and extract data from YouTube channels.

Features

  • Crawl an entire YouTube channel for video information.
  • Extract metadata, comments, transcripts, audio, and video for each video.
  • Provides a base class to easily implement your own video processing and storage logic.
  • Includes a Sqlite3YouTubeVideoProcessor for storing data in a local SQLite database.
  • Provides data classes for easy access to crawled data.

Prerequisites

  • Python 3.10+
  • Google Cloud YouTube API Key

Installation

  1. Install the package:

    pip install crawl-youtube-channel
    
  2. Set up your environment:

    Create a .env file in your project root and add your Google Cloud YouTube API key:

    GOOGLE_CLOUD_YOUTUBE_API_KEY=your_api_key
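The key then needs to be available at runtime. How this package loads it internally is not shown here; as an illustration only, a minimal stdlib-only `.env` reader might look like the sketch below (the `python-dotenv` package is a more robust, commonly used alternative, and `load_dotenv_minimal` is a hypothetical helper name):

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    # Minimal .env loader for illustration: reads KEY=value lines,
    # skips blanks and comments, and never overwrites existing vars.
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_dotenv_minimal()
api_key = os.environ.get("GOOGLE_CLOUD_YOUTUBE_API_KEY")
```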
    

Usage

To use the crawler, subclass the YouTubeVideoProcessorBase abstract class. It defines two methods: one to check whether a video has already been processed, and one to process (for example, store) a newly crawled video.

Here is a basic skeleton for a custom processor:

import asyncio
from crawl_youtube_channel import YouTubeVideoProcessorBase, YouTubeVideo

class MyVideoProcessor(YouTubeVideoProcessorBase):
    async def check_video(self, video_id: str) -> bool:
        # Implement logic to check if the video has already been processed.
        # Return True if it exists, False otherwise.
        ...

    async def process_video(self, video: YouTubeVideo) -> None:
        # Implement logic to save or process the video data.
        # For example, save it to a database, a file, or another service.
        ...

async def main():
    # Initialize your custom processor
    processor = MyVideoProcessor()

    # Start crawling the channel
    await processor.process_channel(channel_url='https://www.youtube.com/@YourFavoriteChannel/videos')

if __name__ == '__main__':
    asyncio.run(main())

For a concrete implementation example, see the Sqlite3YouTubeVideoProcessor class in the source code, which stores video data in a SQLite database.
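The two abstract methods map naturally onto a small SQLite schema. The sketch below shows that check/insert logic in isolation using only the standard sqlite3 module; the table name, columns, and function names are illustrative and not necessarily what Sqlite3YouTubeVideoProcessor actually uses:

```python
import sqlite3

def make_db(path: str = ":memory:") -> sqlite3.Connection:
    # One row per video, keyed by the YouTube video ID.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS videos (
               video_id TEXT PRIMARY KEY,
               title    TEXT,
               payload  BLOB
           )"""
    )
    return conn

def video_exists(conn: sqlite3.Connection, video_id: str) -> bool:
    # Mirrors check_video(): True means the crawler can skip this video.
    row = conn.execute(
        "SELECT 1 FROM videos WHERE video_id = ?", (video_id,)
    ).fetchone()
    return row is not None

def save_video(conn: sqlite3.Connection, video_id: str,
               title: str, payload: bytes) -> None:
    # Mirrors process_video(): persist whichever fields you care about.
    conn.execute(
        "INSERT OR REPLACE INTO videos VALUES (?, ?, ?)",
        (video_id, title, payload),
    )
    conn.commit()
```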

Data Models

The following data classes are used to structure the crawled data:

  • YouTubeVideo: The main container for all video-related data.
  • YouTubeThumbnail: Basic information about a video thumbnail.
  • YouTubeData: Contains detailed information about a video, including:
    • Meta: Video metadata (title, description, tags, etc.).
    • Comment: A YouTube comment, including replies.
    • Transcript: The video's transcript.
    • audio: The audio file in M4A format (as bytes).
    • video: The video file in MP4 format (as bytes).

License

This project is licensed under the MIT License - see the LICENSE file for details.
