A web scraper that downloads tables, images, and text from a webpage

Project description

generic-crawler

This project contains a Python script for a web scraper that extracts tables, images, and text from a given website.
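The kind of extraction described above can be sketched as follows. This is not the project's implementation: it swaps in the standard library's html.parser for Beautiful Soup and Pandas so that it runs without extra dependencies. The class name PageContentParser and the sample HTML are hypothetical. The sketch is deliberately simple: it does not handle nested tables, and it treats script or style content as text.

```python
from html.parser import HTMLParser


class PageContentParser(HTMLParser):
    """Collects table rows, image URLs, and text from an HTML string."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row is a list of cell strings
        self.images = []   # values of <img src="...">
        self.text = []     # non-empty text fragments found outside table cells
        self._row = None   # row being filled, or None when outside a <tr>
        self._cell = None  # cell fragments being filled, or None when outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
        elif data.strip():
            self.text.append(data.strip())


# Hypothetical sample page used only for illustration
sample_html = """
<html><body>
  <h1>Prices</h1>
  <img src="/static/logo.png" alt="logo">
  <table>
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Apple</td><td>1.20</td></tr>
  </table>
  <p>Updated daily.</p>
</body></html>
"""

parser = PageContentParser()
parser.feed(sample_html)
parser.close()  # flush any buffered trailing text

print(parser.tables)
print(parser.images)
print(parser.text)
```

In a real scraper the HTML would come from the browser session or an HTTP response rather than from an inline string.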

Requirements

  • Python 3.7 or later
  • Selenium
  • Beautiful Soup
  • Pandas
  • tqdm
  • Requests
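A quick way to check these requirements before running the scraper might look like the sketch below. It uses only the standard library. The function name missing_requirements is hypothetical, and the module names are the import names of the listed packages (Beautiful Soup, for example, is imported as bs4).

```python
import importlib.util
import sys

# Import names for the packages listed above (Beautiful Soup imports as bs4)
REQUIRED_MODULES = ["selenium", "bs4", "pandas", "tqdm", "requests"]


def missing_requirements(modules=REQUIRED_MODULES, min_python=(3, 7)):
    """Return a list of human-readable problems; the list is empty if all checks pass."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or later is required" % min_python)
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is None:
            problems.append("missing module: " + name)
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```

An empty result means the interpreter version and all five modules were found; it does not verify package versions.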

Installation

pip

pip install SiteCrawler

gitlab

  1. Clone the asaid branch of this repository:

git clone -b asaid https://gitlab.kaisens.fr/kaisensdata/apps/4inshield/drivers/generic-crawler.git

  2. Install the required Python packages from the repository root:

cd generic-crawler
pip install -r requirements.txt

Usage

  1. Download the chromedriver build that matches your installed Chrome version and operating system, then either add it to your system's PATH or pass its path when initializing the WebScraper class.

  2. Use the following example code to run the scraper:

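from scraper import WebScraper

# Path to the chromedriver executable downloaded in step 1
chrome_driver_path = "<path_to_your_chromedriver>"
url = "https://example.com"

# Start the scraper and extract tables, images, and text from the page
scraper = WebScraper(chrome_driver_path)
scraper.process_website(url)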
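Step 1 allows the chromedriver to be found either on PATH or at an explicit path. A minimal sketch of that lookup, using only the standard library, is shown below. The helper name resolve_chromedriver is hypothetical and is not part of this package.

```python
import shutil
from pathlib import Path


def resolve_chromedriver(explicit_path=None):
    """Return a chromedriver path, preferring an explicit path over a PATH lookup."""
    if explicit_path:
        if Path(explicit_path).is_file():
            return str(explicit_path)
        raise FileNotFoundError("chromedriver not found at " + str(explicit_path))
    # shutil.which returns None when the executable is not on PATH
    found = shutil.which("chromedriver")
    if found is None:
        raise FileNotFoundError("chromedriver is not on PATH")
    return found
```

The returned string could then be passed as chrome_driver_path in the example above, so a missing driver is reported before the browser is started.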

This version

0.1

Download files

Source Distribution

SiteCrawler-0.1.tar.gz (4.1 kB), source

Built Distribution

SiteCrawler-0.1-py3-none-any.whl (4.3 kB), Python 3