Web scraping library that scrapes & gathers data from multiple sources on the internet
web-graze
Introduction
This repository contains a collection of scripts to scrape content from various sources like YouTube, Wikipedia, and Britannica. It includes functionality to download video captions from YouTube, scrape Wikipedia articles, and fetch content from Britannica.
Installation
- Clone the repository:
git clone https://github.com/shivendrra/web-graze.git
cd web-graze
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the required packages:
pip install -r requirements.txt
Usage
For working examples, see run.py, which contains an example for each type of scraper.
1. Queries
This library ships with predefined topics, keywords, search queries & channel IDs that you can load & pass to the respective scrapers.
Channel IDs
from webgraze.queries import Queries
queries = Queries(category="channel")
Search Queries
from webgraze.queries import Queries
queries = Queries(category="search")
Image Topics
from webgraze.queries import Queries
queries = Queries(category="images")
2. YouTube Scraper
The YouTube scraper fetches video captions from a list of channels.
Configuration
- Add your YouTube API key to a .env file:
yt_key=YOUR_API_KEY
- Create a channelIds.json file with the list of channel IDs:
["UC_x5XG1OV2P6uZZ5FSM9Ttw", "UCJ0-OtVpF0wOKEqT2Z1HEtA"]
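If you keep your own channelIds.json, it can be read with the standard library before handing the IDs to the scraper. The following is a minimal sketch, not part of webgraze: `load_channel_ids` is a hypothetical helper, and whether the scraper accepts a plain Python list for `channel_ids` is an assumption based on the example in this README.

```python
import json
from pathlib import Path


def load_channel_ids(path):
    """Hypothetical helper: read a JSON array of channel IDs from `path`."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)

    # The file is expected to hold a flat JSON array of strings.
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError(f"{path} must contain a JSON array of strings")

    # Drop blank entries and duplicates while keeping the original order.
    seen = set()
    unique = []
    for channel_id in (x.strip() for x in data):
        if channel_id and channel_id not in seen:
            seen.add(channel_id)
            unique.append(channel_id)
    return unique
```

The returned list could then be passed in place of `queries()` in the example under "Running the Scraper", assuming the scraper treats both the same way.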
Running the Scraper
import os
from dotenv import load_dotenv
from webgraze import Youtube
from webgraze.queries import Queries

# Load the variables defined in the .env file into the environment.
load_dotenv()

# Run relative to this script so that '../transcripts' resolves against its location.
current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)

api_key = os.getenv('yt_key')

queries = Queries(category="channel")
youtube = Youtube(api_key=api_key, filepath='../transcripts', max_results=50)
youtube(channel_ids=queries(), videoUrls=True)
3. Wikipedia Scraper
The Wikipedia scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.
Running the Scraper
from webgraze import Wikipedia
from webgraze.queries import Queries
queries = Queries(category="search")
wiki = Wikipedia(filepath='../data.txt', metrics=True)
wiki(queries=queries(), extra_urls=True)
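To illustrate the "generate target URLs from queries" step described above, here is a rough sketch using only the standard library. It is not the library's actual implementation: `build_wiki_urls`, the `WIKI_BASE` constant, and the choice of the English Wikipedia `/wiki/<Title>` URL pattern are all assumptions made for this example.

```python
from urllib.parse import quote

# Assumed base URL for article pages; the library may target a different one.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def build_wiki_urls(queries):
    """Hypothetical helper: turn search queries into article URLs."""
    urls = []
    for query in queries:
        # Wikipedia titles use underscores in place of spaces.
        title = "_".join(query.strip().split())
        if title:
            # Percent-encode anything that is not URL-safe (e.g. accents).
            urls.append(WIKI_BASE + quote(title))
    return urls
```

Each resulting URL would then be fetched and its page content written to the output file.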
4. Unsplash Scraper
The Unsplash image scraper fetches images for the given topics & saves each topic's images in its own folder.
Configuration
- Define your search queries like this:
search_queries = ["topic1", "topic2", "topic3"]
Running the Scraper
from webgraze import Unsplash
from webgraze.queries import Queries
topics = Queries("images")
image = Unsplash(directory='../images', metrics=True)
image(topics=topics())
Output:
Downloading 'american football' images:
Downloading : 100%|██████████████████████████| 176/176 [00:30<00:00, 5.72it/s]
Downloading 'indian festivals' images:
Downloading : 100%|██████████████████████████| 121/121 [00:30<00:00, 7.29it/s]
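The per-topic folder layout can be pictured with a small sketch. This is only an illustration under stated assumptions: `topic_directory` is a hypothetical helper, and the folder-naming rule used here (lowercase, non-alphanumeric runs replaced by underscores) is not taken from the library, which may name folders differently.

```python
import re
from pathlib import Path


def topic_directory(base_dir, topic):
    """Hypothetical helper: create and return a folder for one topic."""
    # Build a filesystem-friendly name, e.g. "American Football" -> "american_football".
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")

    path = Path(base_dir) / slug
    # Create the folder (and any missing parents); do nothing if it already exists.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Images downloaded for a topic would then be written inside the returned folder.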
5. Britannica Scraper
The Britannica scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.
Running the scraper
from webgraze import Britannica
from webgraze.queries import Queries
queries = Queries(category="search")
scraper = Britannica(filepath='../data.txt', metrics=True)
scraper(queries=queries())
6. Freesound Scraper
Downloads & saves audio files from freesound.org using its official API. Audio files are saved in separate directories, one per topic.
Running the scraper
import os
from dotenv import load_dotenv
from webgraze import Freesound

# Run relative to this script so that the "audios" directory is created next to it.
current_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(current_directory)

# Load the variables defined in the .env file, then read the Freesound key.
load_dotenv()
API_KEY = os.getenv("freesound_key")

sound = Freesound(api_key=API_KEY, download_dir="audios", metrics=True)
sound(topics=["clicks", "background", "nature"])
Output
Downloading 'clicks' audio files:
Response status code: 200
Downloading 'clicks' audio files: 100%|██████████████████████████████| 10/10 [00:20<00:00, 2.01s/it]
Downloading 'background' audio files:
Response status code: 200
Downloading 'background' audio files: 100%|██████████████████████████████| 10/10 [00:53<00:00, 5.37s/it]
Downloading 'nature' audio files:
Response status code: 200
Downloading 'nature' audio files: 100%|██████████████████████████████| 10/10 [01:57<00:00, 11.78s/it]
Freesound Scraper Metrics:
-------------------------------------------
Total topics fetched: 3
Total audio files downloaded: 30
Total time taken: 3.26 minutes
-------------------------------------------
7. Pexels Scraper
Scrapes & downloads pictures from pexels.com & saves them in a separate directory for each topic.
Running the scraper
from webgraze import Pexels
from webgraze.queries import Queries
queries = Queries("images")
scraper = Pexels(directory="./images", metrics=True)
scraper(topics=queries())
Output
Downloading 'american football' images:
Downloading: 100%|████████████████████████████████| 24/24 [00:03<00:00, 7.73it/s]
Downloading 'india' images:
Downloading: 100%|████████████████████████████████| 27/27 [00:04<00:00, 5.99it/s]
Downloading 'europe' images:
Downloading: 100%|████████████████████████████████| 24/24 [00:06<00:00, 3.55it/s]
Configuration
- API keys and other secrets: Ensure that your API keys and other sensitive data are stored securely and not hard-coded into your scripts.
- Search queries: The search queries for the Wikipedia and Britannica scrapers are defined in queries.py.
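One way to keep keys out of your scripts is to read them from the environment and fail early when one is missing. The sketch below is a suggestion rather than something the library provides: `require_env` is a hypothetical helper, and it assumes the variable has already been exported or loaded (for example with `load_dotenv()` as in the examples above).

```python
import os


def require_env(name):
    """Hypothetical helper: return an environment variable or fail loudly."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Failing here gives a clearer message than a later authentication error.
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

For instance, `api_key = require_env("yt_key")` could replace `os.getenv('yt_key')` in the YouTube example, so that a missing key is reported before any request is made.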
Logging
Each scraper logs errors to its own .log file. Check that file for detailed error messages & troubleshooting information.
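If you want the same kind of error log in your own scripts, Python's standard `logging` module is enough. This is a minimal sketch and not the library's internal setup: `make_file_logger`, the logger name, the log path, and the message format are all assumptions chosen for illustration.

```python
import logging


def make_file_logger(name, log_path):
    """Hypothetical helper: return a logger that writes errors to `log_path`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.ERROR)

    # Avoid stacking duplicate handlers if this is called twice for one name.
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A call such as `make_file_logger("my_scraper", "my_scraper.log").error("request failed: %s", url)` would append one timestamped line to the file, while messages below the ERROR level are ignored.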
Contribution
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.
See CONTRIBUTING.md for more details.
License
This project is licensed under the MIT License.