A Python package for web scraping and YouTube video scraping

CrawlPY

CrawlPY is a Python package for web scraping and YouTube video scraping.

Features

  • Web Scraping: Easily scrape web pages using requests and BeautifulSoup.
  • YouTube Scraping: Scrape YouTube videos and download them.
  • Audio Transcription: Transcribe audio from videos using the Deepgram API.
  • Selenium Support: Support for websites that require JavaScript rendering.

Installation

You can install CrawlPY using pip:

pip install crawlpy

Web Scraper

Initializing the WebScraper

To start using the WebScraper, you need to create an instance of it. You can customize headers, timeout, retries, and other settings.

from crawlpy import WebScraper

scraper = WebScraper(headers={'User-Agent': 'Mozilla/5.0'}, timeout=15, retries=5, use_selenium=False)
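
This README does not describe how the timeout and retries settings behave internally. As a rough illustration of what a retry setting usually means, the sketch below wraps any fetch callable in a retry loop with exponential backoff. The names fetch_with_retries, fetch, backoff, and sleep are hypothetical and not part of CrawlPY, and the backoff schedule is an assumption.

```python
import time


def fetch_with_retries(fetch, url, retries=5, backoff=0.5, sleep=time.sleep):
    """Call fetch(url), giving up after `retries` failed attempts.

    Hypothetical helper for illustration only; not part of CrawlPY.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < retries - 1:
                # Wait backoff, 2*backoff, 4*backoff, ... between attempts.
                sleep(backoff * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Passing the fetch function in as an argument keeps the retry logic independent of the HTTP client, so the same loop could wrap a requests call or a Selenium page load.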

Fetching Page Text

Pass a single URL or a list of URLs to get_page_text to fetch the plain text content of the corresponding web pages.

# Single URL
text = scraper.get_page_text("https://google.com")
print(text)

# Multiple URLs
texts = scraper.get_page_text(["https://google.com", "https://wikipedia.com"])
for text in texts:
    print(text)

Saving Scraped Content to a File

You can save the scraped content to a file in different formats: txt, json, or csv. You can also provide column names for the CSV format.

# Save as plain text
scraper.save_to_file("https://google.com", "output.txt", file_type='txt')

# Save as JSON
scraper.save_to_file(["https://google.com", "https://wikipedia.com"], "output.json", file_type='json')

# Save as CSV with column names
scraper.save_to_file(["https://google.com", "https://wikipedia.com"], "output.csv", file_type='csv', column_names=['Content'])
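
The exact layout CrawlPY writes for each format is not documented here. To make the three options concrete, the following sketch writes a list of already-scraped texts with only the standard library. The function save_texts and the layout it produces (blank-line-separated text, a JSON array, one CSV row per page) are assumptions, not CrawlPY's actual behavior.

```python
import csv
import json


def save_texts(texts, path, file_type="txt", column_names=None):
    """Write page texts to `path` as txt, json, or csv.

    Hypothetical stand-in for illustration; CrawlPY's real output may differ.
    """
    if isinstance(texts, str):
        texts = [texts]  # accept a single page as well as a list
    if file_type == "txt":
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n\n".join(texts))
    elif file_type == "json":
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(texts, handle, ensure_ascii=False, indent=2)
    elif file_type == "csv":
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", encoding="utf-8", newline="") as handle:
            writer = csv.writer(handle)
            if column_names:
                writer.writerow(column_names)  # optional header row
            writer.writerows([text] for text in texts)
    else:
        raise ValueError(f"unsupported file_type: {file_type!r}")
    return path
```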

Extracting Specific HTML Tags

You can extract content from specific HTML tags. The function returns the text content of all occurrences of the specified tag.

tags_content = scraper.get_tag_content("https://google.com", "p")
for content in tags_content:
    print(content)
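
To show what "the text content of all occurrences of the specified tag" can look like in practice, here is a minimal standard-library sketch that works on an HTML string instead of a live URL. The names TagTextCollector and tag_content are hypothetical, and this is not how CrawlPY is implemented (the feature list says it uses BeautifulSoup).

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper).

    Assumes the tag name is lowercase and that each occurrence is explicitly closed.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0     # how many target tags are currently open
        self.buffer = []   # text pieces of the current outermost occurrence
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def tag_content(html, tag):
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.results
```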

Extracting Links

You can extract all the links (<a> tags) from a web page.

links = scraper.extract_links("https://example.com")
for link in links:
    print(link)
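
The README does not say whether extract_links returns relative or absolute URLs, or whether duplicates are removed. If you need absolute, de-duplicated links, one option is to post-process the page HTML yourself. The sketch below does this with the standard library; LinkCollector and extract_links_from_html are hypothetical names, not CrawlPY functions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative paths such as "/about" into full URLs.
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # keep first occurrence only
                self.links.append(absolute)


def extract_links_from_html(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```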

Taking Screenshots

You can take a screenshot of a web page and save it as an image file. This feature uses Selenium.

screenshot_file = scraper.take_screenshot("https://google.com", filename="screenshot.png")

Using Selenium for JavaScript-Heavy Websites

If you need to scrape content from websites that require JavaScript rendering, enable Selenium when initializing the WebScraper.

scraper = WebScraper(use_selenium=True)

# Now all scraping functions will use Selenium
text = scraper.get_page_text("https://example.com")
print(text)

YouTube Scraper

The YouTube scraper in CrawlPY lets you download YouTube videos and transcribe their audio using the Deepgram API. Before using it, make sure your environment is set up with the required API key.

Prerequisites

Install Dependencies:

pip install crawlpy

Set Up Environment Variables:

You need to set up your Deepgram API key as an environment variable. Create a .env file in your project directory and add your API key:

DEEPGRAM_API_KEY=your_deepgram_api_key

Alternatively, set the environment variable directly in your code:

import os

os.environ["DEEPGRAM_API_KEY"] = "your_deepgram_api_key"
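
The README does not state whether CrawlPY reads the .env file automatically. If it does not, a small loader such as the one sketched below can copy the file's entries into the process environment and confirm the key is present before a YouTubeScraper is created. The helpers load_env_file and require_env are hypothetical, and the parser only handles simple KEY=VALUE lines.

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper; variables that are already set are not overwritten.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def require_env(name):
    """Return the variable's value, or fail early with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value
```

Calling require_env("DEEPGRAM_API_KEY") at startup surfaces a missing key immediately instead of at the first transcription request.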

Initializing the YouTube Scraper

To use the YouTube scraper, you need to create an instance of the YouTubeScraper class.

from crawlpy import YouTubeScraper

youtube_scraper = YouTubeScraper()

Downloading YouTube Videos

You can download a YouTube video by providing its URL.

video_url = "https://www.youtube.com/watch?v=oHg5SJYRHA0"
file_path = youtube_scraper.download_video(video_url)
print(f"Video downloaded to {file_path}")

Transcribing YouTube Videos

You can transcribe the audio content of a YouTube video. The transcribe_video method accepts either a video URL or the path to a previously downloaded video file.

# Transcribe using a video URL
transcript = youtube_scraper.transcribe_video(video_url)
print("Transcript:", transcript)

# Optionally, save the transcript to a file
transcript = youtube_scraper.transcribe_video(video_url, save=True, filename="transcript.txt")
print(f"Transcript saved to {transcript}")

License

This project is licensed under the MIT License - see the LICENSE file for details.
