Web scraping library that gathers data from multiple sources on the internet.

web-graze

Introduction

This repository contains a collection of scrapers for sources such as YouTube, Wikipedia, Britannica, Unsplash, Freesound, and Pexels. It can download video captions from YouTube, scrape Wikipedia and Britannica articles, and fetch images and audio from Unsplash, Pexels, and Freesound.

Table of Contents

Installation
Usage
Configuration
Logging

Installation

Clone the repository:

git clone https://github.com/shivendrra/web-graze.git
cd web-graze

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

Install the required packages:
```
pip install -r requirements.txt
```

Usage

For working samples, see run.py, which contains an example for each type of scraper.

1. Queries

This library ships with a set of topics, keywords, search queries, and channel IDs that you can load and pass to the respective scrapers.

Channel Ids

from webgraze.queries import Queries

queries = Queries(category="channel")

Search Queries

from webgraze.queries import Queries

queries = Queries(category="search")

Image Topics

from webgraze.queries import Queries

queries = Queries(category="images")

2. YouTube Scraper

The YouTube scraper fetches video captions from a list of channels.

Configuration

Add your YouTube API key to a .env file:
```
yt_key=YOUR_API_KEY
```

Create a channelIds.json file with the list of channel IDs:

[
  "UC_x5XG1OV2P6uZZ5FSM9Ttw",
  "UCJ0-OtVpF0wOKEqT2Z1HEtA"
]
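
If you keep channel IDs in channelIds.json rather than using the bundled Queries, a small loader can read and sanity-check the file before the list is passed as channel_ids. This is a minimal sketch: load_channel_ids is a hypothetical helper, not part of webgraze, and the rule that every ID must be a non-empty string is an assumption about what the scraper expects.

```python
import json
from pathlib import Path


def load_channel_ids(path="channelIds.json"):
    """Read a JSON array of channel IDs and return it as a list of strings.

    Hypothetical helper for illustration; not part of webgraze.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path} must contain a JSON array of channel IDs")
    ids = []
    for item in data:
        # Reject anything that is not a non-empty string.
        if not isinstance(item, str) or not item.strip():
            raise ValueError(f"invalid channel ID in {path}: {item!r}")
        ids.append(item.strip())
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(ids))
```

The returned list could then be passed where the examples use queries(), for instance as the channel_ids argument.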

Running the Scraper

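import os
from dotenv import load_dotenv

# Load variables such as yt_key from the .env file.
load_dotenv()

# Run relative to this script so that '../transcripts' resolves predictably.
current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)

api_key = os.getenv('yt_key')

from webgraze import Youtube
from webgraze.queries import Queries

# Load the bundled channel IDs.
queries = Queries(category="channel")

# Fetch captions for each channel and write them under '../transcripts'.
youtube = Youtube(api_key=api_key, filepath='../transcripts', max_results=50)
youtube(channel_ids=queries(), videoUrls=True)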

3. Wikipedia Scraper

The Wikipedia scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.

Running the Scraper

from webgraze import Wikipedia
from webgraze.queries import Queries

queries = Queries(category="search")
wiki = Wikipedia(filepath='../data.txt', metrics=True)

wiki(queries=queries(), extra_urls=True)
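
How a query becomes a target URL is not documented here. As a rough illustration only, the sketch below assumes the common English Wikipedia article pattern (https://en.wikipedia.org/wiki/ followed by the title with spaces replaced by underscores). build_wiki_url is a hypothetical helper, and the library may construct its URLs differently.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # assumed base URL


def build_wiki_url(query):
    """Turn a free-text query into an article-style URL (illustrative only)."""
    # Collapse whitespace and join words with underscores, as article titles do.
    title = "_".join(query.split())
    # Percent-encode characters that are not safe in a URL path.
    return WIKI_BASE + quote(title, safe="_()")


urls = [build_wiki_url(q) for q in ["machine learning", "Ada Lovelace"]]
```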

4. Unsplash Scraper

The Unsplash scraper fetches images for the given topics and saves them in a separate folder per topic.

Configuration

Define your search queries like this:

search_queries = ["topic1", "topic2", "topic3"]

Running the Scraper

from webgraze import Unsplash
from webgraze.queries import Queries

topics = Queries("images")

image = Unsplash(directory='../images', metrics=True)
image(topics=topics())

Output:

Downloading 'american football' images:
Downloading : 100%|██████████████████████████| 176/176 [00:30<00:00,  5.72it/s]

Downloading 'indian festivals' images:
Downloading : 100%|██████████████████████████| 121/121 [00:30<00:00,  7.29it/s]
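
Each topic's images are saved in their own folder. A minimal sketch of that layout, assuming folder names are derived from the topic text; topic_dir is a hypothetical helper, and the naming webgraze actually uses may differ.

```python
import re
from pathlib import Path


def topic_dir(base, topic):
    """Create and return a per-topic folder under base (illustrative only)."""
    # Replace runs of characters that are awkward in file names with "_".
    safe_name = re.sub(r"[^\w\-]+", "_", topic.strip()).strip("_")
    path = Path(base) / safe_name
    # parents=True creates the base directory too; exist_ok allows reruns.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling topic_dir('../images', 'american football') would create and return the folder ../images/american_football.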

5. Britannica Scraper

The Britannica scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.

Running the scraper

from webgraze import Britannica
from webgraze.queries import Queries

queries = Queries(category="search")
scraper = Britannica(filepath='../data.txt', metrics=True)

scraper(queries=queries())

6. Freesound Scraper

The Freesound scraper downloads audio files from freesound.org through its official API and saves them in a separate directory per topic.

Running the scraper

import os
current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)
from dotenv import load_dotenv
load_dotenv()

API_KEY = os.getenv("freesound_key")

from webgraze import Freesound

sound = Freesound(api_key=API_KEY, download_dir="audios", metrics=True)
sound(topics=["clicks", "background", "nature"])

Output

Downloading 'clicks' audio files:
Response status code: 200
Downloading 'clicks' audio files: 100%|██████████████████████████████| 10/10 [00:20<00:00,  2.01s/it] 

Downloading 'background' audio files:
Response status code: 200
Downloading 'background' audio files: 100%|██████████████████████████████| 10/10 [00:53<00:00,  5.37s/it] 

Downloading 'nature' audio files:
Response status code: 200
Downloading 'nature' audio files: 100%|██████████████████████████████| 10/10 [01:57<00:00, 11.78s/it] 

Freesound Scraper Metrics:

-------------------------------------------
Total topics fetched: 3
Total audio files downloaded: 30
Total time taken: 3.26 minutes
-------------------------------------------
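
The summary above can be derived from three numbers: topics processed, files saved, and elapsed time. The sketch below shows one way to produce a report of a similar shape; format_metrics and its parameters are hypothetical names, not the library's API.

```python
def format_metrics(topics, files, elapsed_seconds):
    """Build a metrics report similar to the sample output (illustrative only)."""
    rule = "-" * 43
    lines = [
        rule,
        f"Total topics fetched: {topics}",
        f"Total audio files downloaded: {files}",
        # Elapsed time is reported in minutes with two decimals.
        f"Total time taken: {elapsed_seconds / 60:.2f} minutes",
        rule,
    ]
    return "\n".join(lines)


print(format_metrics(topics=3, files=30, elapsed_seconds=195.6))
```

The elapsed time could be measured by recording time.perf_counter() before and after the run and passing the difference.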

7. Pexels Scraper

The Pexels scraper downloads pictures from pexels.com and saves them in a separate directory per topic.

Running the scraper

from webgraze import Pexels
from webgraze.queries import Queries

queries = Queries("images")
scraper = Pexels(directory="./images", metrics=True)
scraper(topics=queries())

Output

Downloading 'american football' images:
Downloading: 100%|████████████████████████████████| 24/24 [00:03<00:00,  7.73it/s]

Downloading 'india' images:
Downloading: 100%|████████████████████████████████| 27/27 [00:04<00:00,  5.99it/s]

Downloading 'europe' images:
Downloading: 100%|████████████████████████████████| 24/24 [00:06<00:00,  3.55it/s]

Configuration

API Keys and other secrets: Ensure that your API keys and other sensitive data are stored securely and not hard-coded into your scripts.
Search Queries: The search queries for Wikipedia and Britannica scrapers are defined in queries.py.
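
One way to follow the advice on API keys is to read each key from the environment and fail early when it is missing. This is a minimal sketch using only the standard library; require_env is a hypothetical helper, while yt_key and freesound_key are the variable names used in the examples above. If the keys live in a .env file, call load_dotenv() first, as those examples do.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of webgraze.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file"
        )
    return value
```

For example, api_key = require_env('yt_key') stops the script with a readable message instead of passing None to a scraper.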

Logging

Each scraper logs errors to its own .log file. Check the relevant file for detailed error messages and troubleshooting information.
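
The log file names and format are not specified here. As an illustration of the general pattern, the sketch below uses Python's standard logging module to give each scraper its own .log file; make_logger and the file naming are assumptions, not the library's code.

```python
import logging
from pathlib import Path


def make_logger(name, log_dir="."):
    """Return a logger that appends errors to <log_dir>/<name>.log."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.ERROR)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A scraper could then call make_logger('wikipedia').error('...') inside its exception handlers so that failures end up in wikipedia.log.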

Contribution

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

See CONTRIBUTING.md for more details.

License

This project is licensed under the MIT License.

Project details

Release history

1.1.3 (this version): Jul 2, 2025
1.1.2: Sep 3, 2024

Download files

Source Distribution: webgraze-1.1.3.tar.gz (22.9 kB), uploaded Jul 2, 2025, Source
Built Distribution: webgraze-1.1.3-py3-none-any.whl (23.1 kB), uploaded Jul 2, 2025, Python 3

File details

Details for the file webgraze-1.1.3.tar.gz.

File metadata

Download URL: webgraze-1.1.3.tar.gz
Upload date: Jul 2, 2025
Size: 22.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for webgraze-1.1.3.tar.gz
Algorithm	Hash digest
SHA256	`bce59ec625631e164ac70b0122f7d3c48d3fee76f91118c4a636ae9c93ffc1aa`
MD5	`7bac8742e4e1aca89c91256322a836be`
BLAKE2b-256	`c18c4ce21d37c1798bc4e6eb2719b1f609e61d0110661b4e9a51c81e1e8bc3eb`

File details

Details for the file webgraze-1.1.3-py3-none-any.whl.

File metadata

Download URL: webgraze-1.1.3-py3-none-any.whl
Upload date: Jul 2, 2025
Size: 23.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for webgraze-1.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d9e9a10d383cd132383091b96e644d2ea0ffbe0224308890bc32763bc40e04fd`
MD5	`ad6ec799bffa2c9878da369d7ef1448d`
BLAKE2b-256	`e9c0bbb78934fdb83e3af5c514a4463b1a48db898ec7ff445f5d3b3460a2381a`
