
Web Raider

Project description

web-raider

Overview

Web Raider is a powerful web scraping and data extraction tool designed to help you gather information from various websites efficiently. It provides a simple interface to configure and run web scraping tasks, making it easy to collect and process data for your projects.

Setup Guide

  1. Clone this repository from GitHub.

  2. Open a terminal in the cloned repository and run the following commands:

    • pip install poetry (do not create a virtual environment manually with python; a pre-existing venv does not play well with Poetry)
    • poetry lock (resolves and pins the project's dependencies)
    • poetry install (creates the virtual environment and installs the dependencies)

Setup for Raider Backend

Run pip install -e . from the git root directory. Raider Backend will call Web Raider using pipeline_main(user_query: str) from web_raider/pipeline.py.
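
A minimal sketch of that integration, assuming only the entry point named above (the structure of the returned value is defined inside web_raider/pipeline.py and is not documented here):

```python
from web_raider.pipeline import pipeline_main

# Forward a user query from the Raider Backend to Web Raider.
results = pipeline_main("open source vector database written in Rust")
print(results)  # inspect whatever pipeline_main returns for this query
```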

Usage

  1. Configure your scraping tasks by editing the configuration files in the config directory (an illustrative sketch of a config-driven run follows this list).
  2. Run the scraper using the command: poetry run python main.py
  3. The scraped data will be saved in the output directory.
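
The exact layout of the configuration files and of main.py is defined by the repository itself; as a purely illustrative sketch (the file name, keys, and output format below are assumptions, not the real schema), a config-driven run could look like this:

```python
import json
from pathlib import Path

from web_raider.pipeline import pipeline_main  # documented entry point

# Hypothetical config file and keys; the real files live in the config directory.
tasks = json.loads(Path("config/tasks.json").read_text())["tasks"]

Path("output").mkdir(exist_ok=True)
for task in tasks:
    results = pipeline_main(task["query"])               # run one scraping task
    out_path = Path("output") / f"{task['name']}.json"   # mirror the output directory convention
    out_path.write_text(json.dumps(results, indent=2, default=str))
```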

How the Repository Works

  • web_raider/: Contains the core logic of the application.
    • article.py: Handles the extraction of codebase URLs from articles.
    • codebase.py: Defines the Codebase class and its subclasses for different code hosting platforms.
    • connection_manager.py: Manages WebSocket connections and message buffering.
    • evaluate.py: Evaluates codebases based on a query.
    • model_calls.py: Handles calls to external models for query simplification, relevance, scoring, and ranking.
    • pipeline.py: Defines the main pipeline for processing user queries (a rough, purely illustrative sketch of this flow appears after this list).
    • search.py: Handles Google search queries and filters results.
    • shortlist.py: Shortlists codebases based on a query.
    • url_classifier.py: Classifies URLs into different categories.
    • utils.py: Contains utility functions.
    • constants.py: Defines constants used across the application.
    • __init__.py: Initializes the web_raider package.
  • assets/: Contains auxiliary files and configurations.
    • key_import.py: Handles the import of API keys.
    • prompts.py: Defines various prompts used in model calls.
    • __init__.py: Initializes the assets package.
  • tests/: Contains unit tests for the application. Run the tests using pytest to ensure everything is working correctly.
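
Based only on the module descriptions above, the query flow might look roughly like the outline below. Only pipeline_main in web_raider/pipeline.py is a documented entry point; every helper defined here is a hypothetical stand-in, not the real API.

```python
# Purely illustrative outline of how the modules above might fit together.
from typing import Dict, List

def simplify_query(query: str) -> str:
    # stand-in for model_calls.py query simplification
    return query

def google_search(query: str) -> List[str]:
    # stand-in for search.py (Google search plus result filtering)
    return ["https://github.com/example/repo", "https://example.com/blog-post"]

def classify_urls(urls: List[str]) -> Dict[str, List[str]]:
    # stand-in for url_classifier.py; forum URLs are currently left unprocessed
    return {"codebase": urls[:1], "article": urls[1:], "forum": []}

def outline(user_query: str) -> List[str]:
    simplified = simplify_query(user_query)
    grouped = classify_urls(google_search(simplified))
    candidates = grouped["codebase"]
    # article.py would add codebase URLs extracted from each article result here,
    # then shortlist.py narrows the candidates, evaluate.py scores them against
    # the query, and model_calls.py produces the final ranking.
    return candidates

print(outline("open source web scraping framework"))
```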

Tasklist to complete before Wallaby

  1. Fix the relative/absolute import problem so the package does not rely on being run with python -m.
  2. Make the code runnable from any directory.

Future Implementations/Improvements

  • Use machine learning classification algorithms to classify URLs by type (Codebase, Article, Forum)
  • Find a way to handle Forum URLs (right now they are not processed)
  • Find a way to scrape code directly from Articles and Forum URLs (right now only links are scraped)
  • Properly implement main-query breakdown instead of simply delegating it to an LLM

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

web_raider-0.0.2.tar.gz (16.8 kB)

Uploaded Source

Built Distribution

web_raider-0.0.2-py3-none-any.whl (19.8 kB)

Uploaded Python 3

File details

Details for the file web_raider-0.0.2.tar.gz.

File metadata

  • Download URL: web_raider-0.0.2.tar.gz
  • Upload date:
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.9.13 Windows/10

File hashes

Hashes for web_raider-0.0.2.tar.gz

  • SHA256: fa90b4551240d4c972fba1a1844fe03ba70d6beda81b4f515020cb089caaad73
  • MD5: 5773bb9bd97971f9a4bb1dd8e793502c
  • BLAKE2b-256: 43a3d6f71a5fa1f880d0aeae8e674e0cf39f3e880d68be1720d7d7ab95d83c83

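As an illustration, the SHA256 digest above can be checked against a downloaded copy of the sdist with Python's standard library (adjust the path to wherever you saved the file):

```python
import hashlib
from pathlib import Path

expected = "fa90b4551240d4c972fba1a1844fe03ba70d6beda81b4f515020cb089caaad73"
actual = hashlib.sha256(Path("web_raider-0.0.2.tar.gz").read_bytes()).hexdigest()
print("OK" if actual == expected else f"hash mismatch: {actual}")
```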

File details

Details for the file web_raider-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: web_raider-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 19.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.9.13 Windows/10

File hashes

Hashes for web_raider-0.0.2-py3-none-any.whl

  • SHA256: f1e612972a07feedf857daff48604dbafd6270f99deeab96eb43f6eda6f8272d
  • MD5: f8081f85fabcb05d8d27bac1d9c0a572
  • BLAKE2b-256: 178df148abf9846abdde0bad259665fd409cc446654c1eeedb89b4d3d5079e5b

