
A package to crawl a YouTube channel

Project description

Crawl YouTube Channel

This Python package provides tools to crawl and extract data from YouTube channels.

Features

  • Crawl an entire YouTube channel for video information.
  • Extract metadata, comments, transcripts, audio, and video for each video.
  • Provides a base class to easily implement your own video processing and storage logic.
  • Includes a Sqlite3YouTubeVideoProcessor for storing data in a local SQLite database.
  • Provides data classes for easy access to crawled data.

Prerequisites

  • Python 3.10+
  • Google Cloud YouTube API Key

Installation

  1. Install the package:

    pip install crawl-youtube-channel
    
  2. Set up your environment:

    Create a .env file in your project root and add your Google Cloud YouTube API key:

    GOOGLE_CLOUD_YOUTUBE_API_KEY=your_api_key
    
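The crawler is expected to read GOOGLE_CLOUD_YOUTUBE_API_KEY from the environment. If you would rather not add a dependency such as python-dotenv, a minimal hand-rolled loader can do the job — this is a sketch, not part of the package, and the load_env helper below is a name chosen for illustration:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines; blanks and # comments are skipped."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Never overwrite variables that are already set in the real environment.
        os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("GOOGLE_CLOUD_YOUTUBE_API_KEY")
```

python-dotenv's load_dotenv() does the same thing with more edge cases handled; either way, the key ends up in os.environ where the package can find it.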

Usage

To use the crawler, you need to implement the YouTubeVideoProcessorBase abstract class. This class defines how to check for existing videos and how to process new ones.

Here is a basic skeleton for a custom processor:

import asyncio
from crawl_youtube_channel import YouTubeVideoProcessorBase, YouTubeVideo

class MyVideoProcessor(YouTubeVideoProcessorBase):
    async def check_video(self, video_id: str) -> bool:
        # Implement logic to check if the video has already been processed.
        # Return True if it exists, False otherwise.
        ...

    async def process_video(self, video: YouTubeVideo) -> None:
        # Implement logic to save or process the video data.
        # For example, save it to a database, a file, or another service.
        ...

async def main():
    # Initialize your custom processor
    processor = MyVideoProcessor()

    # Start crawling the channel
    await processor.process_channel(channel_url='https://www.youtube.com/@YourFavoriteChannel/videos')

if __name__ == '__main__':
    asyncio.run(main())

For a concrete implementation example, see the Sqlite3YouTubeVideoProcessor class in the source code, which stores video data in a SQLite database.
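To make the check-then-process pattern concrete without depending on the package, here is a standalone synchronous sketch of a SQLite-backed store. It is not the package's Sqlite3YouTubeVideoProcessor — the table layout and method shapes below are assumptions chosen to illustrate the idea:

```python
import sqlite3

class VideoStore:
    """Illustrative SQLite store: skip videos that were already processed."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS videos (video_id TEXT PRIMARY KEY, title TEXT)"
        )

    def check_video(self, video_id: str) -> bool:
        # True if this video was already processed.
        row = self.conn.execute(
            "SELECT 1 FROM videos WHERE video_id = ?", (video_id,)
        ).fetchone()
        return row is not None

    def process_video(self, video_id: str, title: str) -> None:
        # INSERT OR IGNORE keys on video_id, so reprocessing is a no-op.
        self.conn.execute(
            "INSERT OR IGNORE INTO videos (video_id, title) VALUES (?, ?)",
            (video_id, title),
        )
        self.conn.commit()
```

The real processor's methods are async and receive YouTubeVideo objects rather than bare strings, but the core idea — key on video_id and make process_video idempotent — carries over.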

Data Models

The following data classes are used to structure the crawled data:

  • YouTubeVideo: The main container for all video-related data.
  • YouTubeThumbnail: Basic information about a video thumbnail.
  • YouTubeData: Contains detailed information about a video, including:
    • Meta: Video metadata (title, description, tags, etc.).
    • Comment: A YouTube comment, including replies.
    • Transcript: The video's transcript.
    • audio: The audio file in M4A format (as bytes).
    • video: The video file in MP4 format (as bytes).
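To give a feel for how these classes nest, here is a rough sketch with plain dataclasses. The field names and defaults are illustrative guesses, not the package's exact definitions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Meta:
    title: str
    description: str = ""
    tags: list[str] = field(default_factory=list)

@dataclass
class Comment:
    author: str
    text: str
    replies: list["Comment"] = field(default_factory=list)  # nested replies

@dataclass
class YouTubeThumbnail:
    url: str
    width: int
    height: int

@dataclass
class YouTubeData:
    meta: Meta
    comments: list[Comment] = field(default_factory=list)
    transcript: str = ""
    audio: bytes = b""  # M4A bytes
    video: bytes = b""  # MP4 bytes

@dataclass
class YouTubeVideo:
    video_id: str
    thumbnail: Optional[YouTubeThumbnail] = None
    data: Optional[YouTubeData] = None
```

Consult the package source for the authoritative definitions; the point here is only the nesting, with YouTubeVideo as the outer container.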

License

This project is licensed under the MIT License - see the LICENSE file for details.



Download files

Download the file for your platform.

Source Distribution

crawl_youtube_channel-0.0.1a2.tar.gz (12.2 kB)

Uploaded Source

Built Distribution


crawl_youtube_channel-0.0.1a2-py3-none-any.whl (13.9 kB)

Uploaded Python 3

File details

Details for the file crawl_youtube_channel-0.0.1a2.tar.gz.

File metadata

  • Download URL: crawl_youtube_channel-0.0.1a2.tar.gz
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for crawl_youtube_channel-0.0.1a2.tar.gz
Algorithm Hash digest
SHA256 f4758ca254c9ef03f33abbca0f5fbb7153dc191c88f4c8177d38b6123321da17
MD5 38816431eb1eae7388d2eb8f66bd39bf
BLAKE2b-256 19f4289a7da9ce45d4d301de25ec9a66350b5587a6ae3f0cf252ced3f8c3b0ca


File details

Details for the file crawl_youtube_channel-0.0.1a2-py3-none-any.whl.


File hashes

Hashes for crawl_youtube_channel-0.0.1a2-py3-none-any.whl
Algorithm Hash digest
SHA256 8824550725f9e713deb07979c16a50e073157980b14ec80e4de27c85e52b5b4e
MD5 6b2db9ed73d211aafc1146515d95c5df
BLAKE2b-256 8524eca50177bb1bd69060cb5f3281dcc741e11b114d3059e2b0ca62cafe3c76

