# Web Raider

## Overview
Web Raider is a powerful web scraping and data extraction tool designed to help you gather information from various websites efficiently. It provides a simple interface to configure and run web scraping tasks, making it easy to collect and process data for your projects.
## Setup Guide

1. Clone this repository from GitHub.
2. Open a terminal in the repository root and run the following commands:

```shell
pip install poetry  # don't create a venv through Python yourself; it does not go well
poetry lock
poetry install      # creates a venv for you
```
## Setup for Raider Backend

Run `pip install -e .` from the git root directory. Raider Backend will call Web Raider using `pipeline_main(user_query: str)` from `web_raider/pipeline.py`.
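The call from Raider Backend can be sketched as follows. This is a hypothetical wrapper, not code from the repository: it assumes the editable install above has been done so `web_raider` is importable, and the shape of `pipeline_main`'s return value is an assumption.

```python
# Hypothetical sketch of how Raider Backend invokes Web Raider.
# Assumes `pip install -e .` was run from the git root; the query string
# and the wrapper name are illustrative.

def call_web_raider(user_query: str):
    # Imported lazily so the backend only needs web_raider at call time.
    from web_raider.pipeline import pipeline_main
    return pipeline_main(user_query)

if __name__ == "__main__":
    print(call_web_raider("open source python library for parsing PDFs"))
```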
## Usage

- Configure your scraping tasks by editing the configuration files in the `config` directory.
- Run the scraper using the command: `poetry run python main.py`
- The scraped data will be saved in the `output` directory.
## How the Repository Works

- `web-raider/`: Contains the core logic of the application.
  - `article.py`: Handles the extraction of codebase URLs from articles.
  - `codebase.py`: Defines the `Codebase` class and its subclasses for different code hosting platforms.
  - `connection_manager.py`: Manages WebSocket connections and message buffering.
  - `evaluate.py`: Evaluates codebases based on a query.
  - `model_calls.py`: Handles calls to external models for query simplification, relevance, scoring, and ranking.
  - `pipeline.py`: Defines the main pipeline for processing user queries.
  - `search.py`: Handles Google search queries and filters results.
  - `shortlist.py`: Shortlists codebases based on a query.
  - `url_classifier.py`: Classifies URLs into different categories.
  - `utils.py`: Contains utility functions.
  - `constants.py`: Defines constants used across the application.
  - `__init__.py`: Initializes the web-raider package.
- `assets/`: Contains auxiliary files and configurations.
  - `key_import.py`: Handles the import of API keys.
  - `prompts.py`: Defines various prompts used in model calls.
  - `__init__.py`: Initializes the assets package.
- `tests/`: Contains unit tests for the application. Run them with `pytest` to ensure everything is working correctly.
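As a rough illustration of the kind of decision `url_classifier.py` makes, here is a hypothetical rule-based sketch. The host lists, function name, and category labels (Codebase, Article, Forum, matching the categories mentioned under Future Implementations) are assumptions for illustration, not the actual implementation:

```python
# Hypothetical sketch of rule-based URL classification; the real
# url_classifier.py may work very differently.
from urllib.parse import urlparse

CODEBASE_HOSTS = {"github.com", "gitlab.com", "bitbucket.org"}
FORUM_HOSTS = {"stackoverflow.com", "reddit.com"}

def classify_url(url: str) -> str:
    # Normalize the hostname, then bucket it by known host lists.
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in CODEBASE_HOSTS:
        return "Codebase"
    if host in FORUM_HOSTS:
        return "Forum"
    return "Article"

print(classify_url("https://github.com/psf/requests"))  # Codebase
```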
## Tasklist to complete before Wallaby

- Fix the relative/absolute import problem; don't rely on `-m`.
- Make the code runnable from any directory.
## Future Implementations/Improvements

- Use machine learning classification algorithms to classify URLs by type (Codebase, Article, Forum).
- Find a way to handle Forum URLs (right now they are not processed).
- Find a way to scrape code directly from Articles and Forum URLs (right now only links are scraped).
- Properly implement main query breakdown instead of simply delegating everything to an LLM.
## File details

### web_raider-0.0.2.tar.gz

- Download URL: web_raider-0.0.2.tar.gz
- Upload date:
- Size: 16.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.9.13 Windows/10

| Algorithm | Hash digest |
|---|---|
| SHA256 | fa90b4551240d4c972fba1a1844fe03ba70d6beda81b4f515020cb089caaad73 |
| MD5 | 5773bb9bd97971f9a4bb1dd8e793502c |
| BLAKE2b-256 | 43a3d6f71a5fa1f880d0aeae8e674e0cf39f3e880d68be1720d7d7ab95d83c83 |
### web_raider-0.0.2-py3-none-any.whl

- Download URL: web_raider-0.0.2-py3-none-any.whl
- Upload date:
- Size: 19.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.9.13 Windows/10

| Algorithm | Hash digest |
|---|---|
| SHA256 | f1e612972a07feedf857daff48604dbafd6270f99deeab96eb43f6eda6f8272d |
| MD5 | f8081f85fabcb05d8d27bac1d9c0a572 |
| BLAKE2b-256 | 178df148abf9846abdde0bad259665fd409cc446654c1eeedb89b4d3d5079e5b |