Skip to main content

Video Crawler

Project description

vidcrawler

Crawls major videos sites like YouTube/Rumble/Bitchute/Brighteon for video content and outputs a json feed of all the videos that were found.

Platform Unit Tests

Actions Status Actions Status Actions Status

Scraper Tests

Actions Status Actions Status Actions Status Actions Status

Note that bitchute doesn't like the github runner's IP and will fail with a 403 forbidden. Actions Status

API

Command line

vidcrawler --input_crawl_json "fetch_list.json" --output_json "out_list.json"

Python

import json
from vidcrawler import crawl_video_sites
crawl_list = [
    [
        "Computing Forever",  # Can be whatever you want.
        "bitchute",  # Must be "youtube", "rumble", "bitchute" (and others).
        "hybm74uihjkf"  # The channel id on the service.
    ]
]
output = crawl_video_sites(crawl_list)
print(json.dumps(output))

"source" and "channel_id" are used to generate the video-platform-specific urls to fetch data. The "channel name" is echo'd back in the generated json feeds, but doesn't not affect the fetching process in any way.

Testing

Install vidcrawler and then the command vidcralwer_test will become available.

> pip install vidcrawler
> vidcrawler_test

youtube-pull-channel

This new command will a channel and all of it's files as mp3s. Great for transcribing and putting into an LLM.

Example input fetch_list.json

[
    [
        "Health Ranger Report",
        "brighteon",
        "hrreport"
    ],
    [
        "Sydney Watson",
        "youtube",
        "UCSFy-1JrpZf0tFlRZfo-Rvw"
    ],
    [
        "Computing Forever",
        "bitchute",
        "hybm74uihjkf"
    ],
    [
        "ThePeteSantilliShow",
        "rumble",
        "ThePeteSantilliShow"
    ],
    [
        "Macroaggressions",
        "odysee",
        "Macroaggressions"
    ]
]

Example Output:

[
  {
    "channel_name": "ThePeteSantilliShow",
    "title": "The damage this caused is now being totaled up",
    "date_published": "2022-05-17T05:00:11+00:00",
    "date_lastupdated": "2022-05-17T05:17:18.540084",
    "channel_url": "https://www.youtube.com/channel/UCXIJgqnII2ZOINSWNOGFThA",
    "source": "youtube.com",
    "url": "https://www.youtube.com/watch?v=bwqBudCzDrQ",
    "duration": 254,
    "description": "",
    "img_src": "https://i3.ytimg.com/vi/bwqBudCzDrQ/hqdefault.jpg",
    "iframe_src": "https://youtube.com/embed/bwqBudCzDrQ",
    "views": 1429
  },
  {
     "channel_name": "ThePeteSantilliShow",
     "title": "..."
  }
]

Releases

  • 1.0.13: New brighteon-pull-channel command
  • 1.0.11: Improves youtube-pull-channel
  • 1.0.10: Adds youtube-pull-channel which pulls all files down as mp3s for a channel.
  • 1.0.9: Fixes crawler for rumble and minor fixes + linting fixes.
  • 1.0.8: Readme correction.
  • 1.0.7: Fixes Odysee scraper by including image/webp thumbnail format.
  • 1.0.4: Fixes local_now() to be local timezone aware.
  • 1.0.3: Bump
  • 1.0.2: Updates testing
  • 1.0.1: improves command line
  • 1.0.0: Initial release

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vidcrawler-1.0.14.tar.gz (37.0 kB view details)

Uploaded Source

Built Distribution

vidcrawler-1.0.14-py2.py3-none-any.whl (47.6 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file vidcrawler-1.0.14.tar.gz.

File metadata

  • Download URL: vidcrawler-1.0.14.tar.gz
  • Upload date:
  • Size: 37.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.10

File hashes

Hashes for vidcrawler-1.0.14.tar.gz
Algorithm Hash digest
SHA256 17212a6c5170d2531bd66dcb26d0e2812c78137e0c5833af7e632e16da7728ce
MD5 62bc5139e91105b5ccca8c1a40a66158
BLAKE2b-256 23a0eec4853d99ca545108f314ca608fdd8784c34d88a0ea0e1571b0b834b133

See more details on using hashes here.

File details

Details for the file vidcrawler-1.0.14-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for vidcrawler-1.0.14-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 5787ac7cc5f97a6902021b6da526d4db6fab3d7849ed5020bd37074ba2324bde
MD5 653863c10dd5905abbbef0bb97e0cf93
BLAKE2b-256 7d9eac28fe9c93170d4f8b7084357f0d88da97e7e9075addc8e871f02dcfa2bb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page