web-graze

A web-scraping library that scrapes and gathers data from multiple sources on the internet.

Introduction

This repository contains a collection of scripts to scrape content from various sources like YouTube, Wikipedia, and Britannica. It includes functionality to download video captions from YouTube, scrape Wikipedia articles, and fetch content from Britannica.

Table of Contents

  • Installation
  • Usage
  • Configuration
  • Logging
  • Contribution
  • License

Installation

  1. Clone the repository:

    git clone https://github.com/shivendrra/web-graze.git
    cd web-graze

  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate   # On Windows: venv\Scripts\activate

  3. Install the required packages:

    pip install -r requirements.txt
    

Usage

For working examples, see run.py, which contains an example for each type of scraper.

1. Queries

The library ships with predefined topics, keywords, search queries, and channel IDs that you can load and use with the respective scrapers.

Channel Ids

from webgraze.queries import Queries

queries = Queries(category="channel")

Search Queries

from webgraze.queries import Queries

queries = Queries(category="search")

Image Topics

from webgraze.queries import Queries

queries = Queries(category="images")
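Each Queries instance is callable. A minimal sketch of inspecting the bundled values before handing them to a scraper, assuming (as the examples below do) that the call returns a plain list of strings:

from webgraze.queries import Queries

# assumption: calling a Queries instance returns the bundled list of strings
channels = Queries(category="channel")
print(channels())  # e.g. a list of YouTube channel IDs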

2. YouTube Scraper

The YouTube scraper fetches video captions from a list of channels.

Configuration

  • Add your YouTube API key to a .env file:

    yt_key=YOUR_API_KEY
    
  • Create a channelIds.json file with the list of channel IDs (loaded directly in the sketch below):

    [
      "UC_x5XG1OV2P6uZZ5FSM9Ttw",
      "UCJ0-OtVpF0wOKEqT2Z1HEtA"
    ]
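To drive the scraper from this file instead of the bundled queries, a minimal sketch, assuming youtube() accepts any plain list of channel-ID strings:

import os
import json
from dotenv import load_dotenv
from webgraze import Youtube

load_dotenv()

# assumption: channelIds.json sits next to this script, and youtube()
# accepts any plain list of channel-ID strings
with open("channelIds.json") as f:
    channel_ids = json.load(f)

youtube = Youtube(api_key=os.getenv('yt_key'), filepath='../transcripts', max_results=50)
youtube(channel_ids=channel_ids, videoUrls=True)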
    

Running the Scraper

import os
from dotenv import load_dotenv

load_dotenv()

# run relative to this script's location
current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)

api_key = os.getenv('yt_key')

from webgraze import Youtube
from webgraze.queries import Queries

queries = Queries(category="channel")

youtube = Youtube(api_key=api_key, filepath='../transcripts', max_results=50)
youtube(channel_ids=queries(), videoUrls=True)

3. Wikipedia Scraper

The Wikipedia scraper generates target URLs from the provided queries, fetches each complete page, and writes the content to a file.

Running the Scraper

from webgraze import Wikipedia
from webgraze.queries import Queries

queries = Queries(category="search")
wiki = Wikipedia(filepath='../data.txt', metrics=True)

wiki(queries=queries(), extra_urls=True)
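The bundled search queries are not mandatory. A minimal sketch, assuming wiki() accepts any plain list of search strings (the topics here are illustrative):

from webgraze import Wikipedia

# assumption: wiki() accepts any list of search strings,
# not only the bundled Queries; these topics are examples
wiki = Wikipedia(filepath='../data.txt', metrics=True)
wiki(queries=["Web scraping", "Python (programming language)"], extra_urls=True)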

4. Unsplash Scraper

The Unsplash image scraper fetches images for the given topics and saves them in per-topic folders.

Configuration

  • Define your search queries like this:

    search_queries = ["topic1", "topic2", "topic3"]
    

Running the Scraper

from webgraze import Unsplash
from webgraze.queries import Queries

topics = Queries("images")

image = Unsplash(directory='../images', metrics=True)
image(topics=topics())
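To use your own topics from the Configuration step above instead of the bundled ones, a sketch assuming image() accepts any plain list of topic strings:

from webgraze import Unsplash

# assumption: image() accepts any list of topic strings
search_queries = ["topic1", "topic2", "topic3"]

image = Unsplash(directory='../images', metrics=True)
image(topics=search_queries)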

Output:

Downloading 'american football' images:
Downloading : 100%|██████████████████████████| 176/176 [00:30<00:00,  5.72it/s]

Downloading 'indian festivals' images:
Downloading : 100%|██████████████████████████| 121/121 [00:30<00:00,  7.29it/s]

5. Britannica Scraper

The Britannica scraper generates target URLs from the provided queries, fetches each complete page, and writes the content to a file.

Running the scraper

from webgraze import Britannica
from webgraze.queries import Queries

queries = Queries(category="search")
scraper = Britannica(filepath='../data.txt', metrics=True)

scraper(queries=queries())

6. Freesound Scraper

Downloads and saves audio files from freesound.org using its official API, saving them in separate directories per topic.
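The script below reads the API key from the environment; matching the os.getenv("freesound_key") call it makes, the .env entry would look like this (the value is a placeholder):

    freesound_key=YOUR_API_KEY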

Running the scraper

import os
from dotenv import load_dotenv

current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)
load_dotenv()

API_KEY = os.getenv("freesound_key")

from webgraze import Freesound

sound = Freesound(api_key=API_KEY, download_dir="audios", metrics=True)
sound(topics=["clicks", "background", "nature"])

Output

Downloading 'clicks' audio files:
Response status code: 200
Downloading 'clicks' audio files: 100%|██████████████████████████████| 10/10 [00:20<00:00,  2.01s/it]

Downloading 'background' audio files:
Response status code: 200
Downloading 'background' audio files: 100%|██████████████████████████████| 10/10 [00:53<00:00,  5.37s/it]

Downloading 'nature' audio files:
Response status code: 200
Downloading 'nature' audio files: 100%|██████████████████████████████| 10/10 [01:57<00:00, 11.78s/it]

Freesound Scraper Metrics:
-------------------------------------------
Total topics fetched: 3
Total audio files downloaded: 30
Total time taken: 3.26 minutes
-------------------------------------------

7. Pexels Scraper

Scrapes and downloads pictures from pexels.com, saving them in a separate directory per topic.

Running the scraper

from webgraze import Pexels
from webgraze.queries import Queries

queries = Queries("images")
scraper = Pexels(directory="./images", metrics=True)
scraper(topics=queries())

Output

Downloading 'american football' images:
Downloading: 100%|████████████████████████████████| 24/24 [00:03<00:00,  7.73it/s]

Downloading 'india' images:
Downloading: 100%|████████████████████████████████| 27/27 [00:04<00:00,  5.99it/s]

Downloading 'europe' images:
Downloading: 100%|████████████████████████████████| 24/24 [00:06<00:00,  3.55it/s]
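A quick way to sanity-check the results, assuming (as the output above suggests) that Pexels creates one subdirectory per topic under the directory you pass it:

from pathlib import Path

# assumption: one subdirectory per topic under ./images
for topic_dir in sorted(Path("./images").iterdir()):
    if topic_dir.is_dir():
        count = sum(1 for f in topic_dir.iterdir() if f.is_file())
        print(f"{topic_dir.name}: {count} images")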

Configuration

  • API Keys and other secrets: Store your API keys and other sensitive data securely, e.g. in a .env file as sketched after this list, rather than hard-coding them into your scripts.

  • Search Queries: The search queries for Wikipedia and Britannica scrapers are defined in queries.py.
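For reference, a single .env file covering both API keys used in this README might look like this (the values are placeholders):

    yt_key=YOUR_YOUTUBE_API_KEY
    freesound_key=YOUR_FREESOUND_API_KEY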

Logging

Each scraper logs errors to its own .log file. Check these files for detailed error messages and troubleshooting information.

Contribution

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

Check out CONTRIBUTING.md for more details.

License

This project is licensed under the MIT License.
