Web scraper for scraping single pages and/or lists of pages
Project description
General-purpose web scraping bot
The purpose of the component is to provide flexibility in Selenium WebDriver configuration, allowing preset configurations to be reused later, and to offer supporting functions that turn the Selenium WebDriver into a bot designed to extract data from web pages where the content is rendered after the HTML document is downloaded (i.e., data loaded with JavaScript).
Installation Options
Requirements:
- Python 3.7 or newer;
- Selenium library version 4.9.1;
- geckodriver (Firefox) or chromedriver (Chrome) files must be placed in the ./webdrivers/ directory;
- Tor service configuration is required for Tor proxy settings to work.
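Before running the component, it may help to confirm that a driver binary is actually present where the component expects it. A minimal pre-flight sketch, assuming the ./webdrivers/ location from the requirements above and the usual binary names (geckodriver/chromedriver; the names are assumptions and differ on Windows, where they end in .exe):

from pathlib import Path

# Hypothetical pre-flight check: confirm a webdriver binary exists in ./webdrivers/
# before constructing a scraper. Binary names are assumptions; adjust for your platform.
webdrivers_dir = Path('./webdrivers')
expected = ['geckodriver', 'chromedriver', 'geckodriver.exe', 'chromedriver.exe']

found = [name for name in expected if (webdrivers_dir / name).exists()]
if not found:
    raise FileNotFoundError(
        f"No webdriver binaries found in {webdrivers_dir.resolve()}; "
        "download geckodriver or chromedriver and place it there."
    )
print(f"Available webdriver binaries: {found}")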
Installation options:
Install as a Python module
Install using pip:
pip install loopies-scraper
Install the component as source code
- Download the project from the GitHub repository:
git clone https://github.com/tmspsk/loopies-scraper.git
- Navigate to the downloaded directory:
cd loopies-scraper
- Install all required components:
pip install -r requirements.txt
The component uses the Selenium 4.9.1 package, and this specific version must be used for the component to function as intended.
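If a different Selenium version is already present in your environment, the pinned version can be installed explicitly:
pip install selenium==4.9.1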
Quickstart
Example 1: Using MultiprocessScraper Class Without URL List Function
from loopies_scraper.multiprocess_scraper import MultiprocessScraper
from selenium.webdriver.common.by import By

def scrape_page(scraper):
    scraper.navigate_to_page('http://example.com')
    data = scraper.find_elements(By.CSS_SELECTOR, "p")
    return data

def main():
    scraper = MultiprocessScraper(driver_name='firefox', proxy='default', processes_count=3)
    scraper.basic_task(task_function=scrape_page)

if __name__ == "__main__":
    main()
In this example:
- A MultiprocessScraper instance is created with 3 processes;
- The scrape_page function navigates to http://example.com and extracts data;
- The collected data is saved to the default ./data.json file.
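The output location can be changed through the file_path constructor parameter (documented in the MultiprocessScraper section below). A minimal sketch, reusing the scrape_page function defined in Example 1; the ./pages.json path is only an illustration:

from loopies_scraper.multiprocess_scraper import MultiprocessScraper

def main():
    # Same setup as Example 1, but the collected data is written to a custom file
    # instead of the default ./data.json (see the file_path parameter below).
    scraper = MultiprocessScraper(driver_name='firefox', proxy='default',
                                  processes_count=3, file_path='./pages.json')
    scraper.basic_task(task_function=scrape_page)  # scrape_page as defined in Example 1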
Example 2: Using MultiprocessScraper Class with URL List Function
from loopies_scraper.multiprocess_scraper import MultiprocessScraper
from selenium.webdriver.common.by import By

def list_function(scraper, data_manager):
    scraper.navigate_to_page('http://example.com')
    urls = scraper.find_elements(By.CSS_SELECTOR, "a")
    for url in urls:
        data_manager.queue.put(url.get_attribute('href'))

def scrape_page(scraper, url):
    scraper.navigate_to_page(url)
    data = scraper.extract_data()
    return data

def main():
    scraper = MultiprocessScraper(driver_name='chrome', proxy='default', processes_count=3)
    scraper.list_task(task_function=scrape_page, list_function=list_function)

if __name__ == "__main__":
    main()
In this example:
- A MultiprocessScraper instance is created with 3 processes;
- The list_function parameter extracts a list of URLs and saves them in data_manager.queue;
- The scrape_page function is executed in parallel, navigating through the saved URLs and extracting data;
- The collected data is saved to the ./data.json file.
Example 3: Using Scraper Class
import time

from loopies_scraper.scraper import Scraper
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def main():
    scraper = Scraper(driver_name='chrome', proxy='default')
    scraper.driver.get('https://www.basketball-reference.com/')
    international_leagues = WebDriverWait(scraper.driver, 30).until(
        EC.presence_of_element_located((By.XPATH, './/div[@id="leagues_primary"]//h2//a'))
    )
    scraper.scroll_to_element(international_leagues)
    scraper.click_on_element(international_leagues)
    time.sleep(60)

if __name__ == "__main__":
    main()
In this example:
- A Scraper instance is created with the Chrome browser and default proxy settings;
- The scraper navigates to the URL https://www.basketball-reference.com/;
- It waits until the link element is available;
- The scrolling function is used to scroll to the link element;
- The clicking function is used to click on the element.
Example 4: Using MultiprocessScraper Class with Tor Proxy
import time

from loopies_scraper.multiprocess_scraper import MultiprocessScraper
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def task(scraper: object = None):
    scraper.driver.get('https://check.torproject.org/')
    link = WebDriverWait(scraper.driver, 30).until(
        EC.presence_of_element_located((By.XPATH, ".//a[contains(text(), 'Tor website')]"))
    )
    scraper.scroll_to_element(link)
    scraper.click_on_element(link)
    time.sleep(60)

def main():
    scraper = MultiprocessScraper(driver_name='firefox', proxy='tor', processes_count=2)
    scraper.basic_task(task)

if __name__ == "__main__":
    main()
In this example:
- A task function is defined, which will be executed by the web scraper:
- Navigates to the specified website https://check.torproject.org/.
- Waits until the DOM element with the text "Tor website" is available.
- Uses the scrolling function to scroll to the found element.
- Uses the clicking function to click on the found element.
- A MultiprocessScraper instance is created with the Firefox browser, using Tor proxy and 2 processes.
- The basic_task function is initiated, which will execute the task function in parallel.
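Since the Tor proxy mode depends on a locally configured Tor service (see the requirements above), it may be worth checking that Tor is reachable before starting the scraper. A minimal sketch; the 127.0.0.1:9050 SOCKS address is an assumption about a default Tor setup and is not part of this component's API:

import socket

# Hypothetical pre-check: confirm something is listening on the default Tor SOCKS
# port before launching the scraper. Adjust host/port to match your Tor configuration.
def tor_socks_reachable(host: str = '127.0.0.1', port: int = 9050) -> bool:
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

if not tor_socks_reachable():
    raise RuntimeError("Tor SOCKS service is not reachable; start and configure Tor first.")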
Documentation
Scraper Class and MultiprocessScraper Class
The Scraper class enables the execution of individual web page navigation and data extraction tasks by inheriting functionality from WebDriverController.
Scraper Class
Constructor:
__init__(self, driver_name: str, proxy: str = None) -> None
- driver_name: The name of the browser to be used (options: "chrome", "firefox").
- proxy: Proxy settings.
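For reference, constructing and using a standalone Scraper follows the same pattern as Example 3 above; a minimal sketch:

from loopies_scraper.scraper import Scraper

# The underlying Selenium WebDriver is exposed as scraper.driver (see Example 3).
scraper = Scraper(driver_name='firefox', proxy='default')
scraper.driver.get('http://example.com')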
MultiprocessScraper Class
The MultiprocessScraper class enables parallel execution of data extraction tasks using multiple processes. It manages browsers, data collection, and saving.
Constructor:
__init__(self, driver_name: str = None, proxy: str = None, processes_count: int = 1, file_path: str = './data.json') -> None
- driver_name: The name of the browser to be used (options: "chrome", "firefox").
- proxy: Proxy settings.
- processes_count: Number of processes.
- file_path: The file path to save the collected data.
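A construction sketch spelling out all four parameters (the values are illustrative):

from loopies_scraper.multiprocess_scraper import MultiprocessScraper

# All constructor parameters written out; they correspond to the list above.
scraper = MultiprocessScraper(
    driver_name='chrome',        # or 'firefox'
    proxy='default',             # or 'tor' (requires a running Tor service)
    processes_count=4,           # number of parallel browser processes
    file_path='./results.json',  # where the collected data is written
)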
Methods:
- process_target(self, task_function: object = None, list_function: object = None): Executes the main task function after the first process collects the list of links. Used to collect data from list-type pages whose links are collected via the provided list_function parameter.
- process_target_basic(self, task_function: object = None): Executes the main task using the provided function for collecting data from a single page. Used to perform single-type tasks provided by the task_function parameter.
- start_processes(self, task_function: object = None, list_function: object = None): Starts the execution of the specified tasks using multiple processes.
- basic_task(self, task_function: object = None): Starts the main task without collecting a list of links.
- list_task(self, task_function: object = None, list_function: object = None): Starts the main task with link list collection.
Additional Classes
These classes are auxiliary and not directly accessible to the user, but they are used in the WebDriverController class for browser management.
WebDriverController Class
The WebDriverController class manages the browser instance and provides common functions for working with the browser.
Constructor:
__init__(self, driver_name: str, proxy: str = None) -> None
- driver_name: The name of the browser to be used.
- proxy: Proxy settings.
Methods:
- scroll_to_element(self, element): Scrolls to the specified element.
- click_on_element(self, element): Clicks on the specified element.
- hide_element(self, element): Hides the specified element.
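Because Scraper inherits from WebDriverController, these helpers are available directly on a Scraper instance. A minimal sketch; the h1 selector is only an illustration:

from loopies_scraper.scraper import Scraper
from selenium.webdriver.common.by import By

scraper = Scraper(driver_name='firefox', proxy='default')
scraper.driver.get('http://example.com')

# Locate an element with plain Selenium, then drive it with the helper methods.
heading = scraper.driver.find_element(By.CSS_SELECTOR, 'h1')
scraper.scroll_to_element(heading)  # bring the element into view
scraper.click_on_element(heading)   # click it
scraper.hide_element(heading)       # hide it in the rendered page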
GeckodriverProfile and ChromedriverOptions Classes
The GeckodriverProfile and ChromedriverOptions classes configure options for the Firefox and Chrome browsers. When the component is installed as source code, these classes can be used to define frequently used browser settings for reuse.