
A Selenium framework for automated actions on browsers


thatscrapper

Scrap more, write less.

thatscrapper is a Selenium adapter.

Selenium automates browsers. That's it! What you do with that power is entirely up to you.

Selenium's Python bindings provide a simple API to write functional/acceptance tests using Selenium WebDriver. Through the Selenium Python API you can access all functionality of Selenium WebDriver in an intuitive way.

Since so many websites are full of JavaScript, scraping pages with static parsers, such as those found in Beautiful Soup, becomes hard or impossible. With Selenium WebDriver you can get around this.

However, you often need to make sure the desired element meets certain conditions before interacting with it. For that you have to add explicit wait contexts, like:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # the waiting context with the selection of element by id
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

For a single-purpose scraping script that means only a few extra lines like these. For a larger testing or scraping project, however, it can become annoying. Using thatscrapper, the script above becomes:

import thatscrapper as ts

crawler = ts.Crawler().goto("http://somedomain/url_that_delays_loading")
element = crawler.element_id("myDynamicElement")
crawler.quit()

It is very important to quit the webdriver to avoid memory leaks or an overcrowded process table. Always call the quit() method when the job is done. Crawler also comes with a decorator that makes sure the webdriver quits if any exception is raised.
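The decorator is not described further here, but the same guarantee can be hand-rolled with try/finally. A minimal sketch, using only the Crawler methods shown above (the helper function itself is hypothetical, not part of thatscrapper):

import thatscrapper as ts

def scrape_dynamic_element(url):
    # Quit the driver even if navigation or the element lookup raises.
    crawler = ts.Crawler()
    try:
        # goto returns the crawler itself, as in the example above
        return crawler.goto(url).element_id("myDynamicElement")
    finally:
        crawler.quit()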

Installation

$ pip install thatscrapper

Usage

thatscrapper can be used to perform basic actions on web pages, such as clicking buttons, opening dropdown menus, pressing keyboard keys, sending text, and filling forms. It is also suitable for extracting data.

Instances of thatscrapper.Crawler are used to navigate pages, perform actions and select elements.

Run the webdriver:

import time
import thatscrapper

crawler = thatscrapper.Crawler()
# open page
crawler.goto("https://phptravels.com/demo/")
# wait long enough so you can check the result
time.sleep(5)
# always quit the driver
crawler.quit()

Alternatively, you can crawl pages without opening a browser window:

crawler = thatscrapper.Crawler(headless=True)
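
A headless crawler is used exactly like a windowed one. For instance, assuming (as in the introductory example) that element_id returns a standard Selenium WebElement, whose .text property holds the element's visible text:

import thatscrapper

crawler = thatscrapper.Crawler(headless=True)
crawler.goto("http://somedomain/url_that_delays_loading")
# .text is standard Selenium WebElement API
print(crawler.element_id("myDynamicElement").text)
crawler.quit()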

Choosing the webdriver

By default thatscrapper uses the Firefox webdriver (geckodriver), but other drivers can be selected. Make sure you have the driver of your choice installed and that its path is added to your environment variables. For Linux users, download the webdriver and put it in /usr/bin or /usr/local/bin (Windows users, check this out to see how to do that on your system).

Here's a list of supported browser drivers:

  • Chrome: https://sites.google.com/chromium.org/driver/
  • Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
  • Firefox: https://github.com/mozilla/geckodriver/releases
  • Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

To use another driver, pass its browser name to the Crawler class:

import time
import thatscrapper as ts

crawler = ts.Crawler(browser='chrome')

# extract data from a demo page
crawler.goto("https://www.techlistic.com/p/demo-selenium-practice.html")
# wait long enough so you can check the result
time.sleep(5)
# always quit the driver
crawler.quit()

Elements and their children

Elements can be selected with one of four methods:

  • element(value, by): selects an element based on the given attribute by, with value value. A list of attributes is given by thatscrapper.ATTR_SELECTOR.keys().
  • elements(value, by): selects all elements based on the given attribute by, with value value.
  • child_of(element, value, by): selects a child of element (a WebElement) based on the given attribute by, with value value.
  • children_of(element, value, by): selects all children of element based on the given attribute by, with value value.

Consider the following section of a page:

<div class="form">
    <input type="text" name="first_name" class="first_name input mb1" placeholder="First Name">
    <input type="text" name="last_name" class="last_name input mb1" placeholder="Last Name">
    <input type="text" name="business_name" class="business_name input mb1" placeholder="Business Name">
    ...
</div>

In order to make sure you select the input tags from that div with class="form", and not another input that the page may contain, first select the div and then select its children:

form_element = crawler.element("form", "class name")
fields = crawler.children_of(form_element, "input", "tag name")
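
Assuming children_of returns standard Selenium WebElement objects, the usual WebElement API applies to each selected field, for example:

# get_attribute is standard Selenium WebElement API
for field in fields:
    print(field.get_attribute("placeholder"))
# prints: First Name, Last Name, Business Name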

Sending keys

Sending input or keys to an element is done with one of two methods:

  • send(key, value, by): sends key to an element located by the given attribute by, with value value.
  • send_to_element(element, key): sends key to the previously selected element element.

Consider the same page section as above and the selected fields list. Sending a string to the element <input type="text" name="first_name" class="first_name input mb1" placeholder="First Name"> can be done with:

crawler.send_to_element(fields[0], "Vagner Bessa")

Sending keyboard keys works the same way. Check thatscrapper.Key or selenium.webdriver.common.keys.Keys for valid keys.
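
For example, typing a value and then pressing ENTER could look like the sketch below (the "name" selector string is our assumption, mirroring the "class name" and "tag name" selectors used earlier):

from selenium.webdriver.common.keys import Keys

# locate the last-name input by its name attribute and type into it
crawler.send("Bessa", "last_name", "name")
# press ENTER on the previously selected first field
crawler.send_to_element(fields[0], Keys.ENTER)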

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

TODOs

  • Test all methods (current coverage: 61%).
  • Implement data extraction modules and classes.
  • Link or adapt to a database handler.
  • Implement an API boilerplate builder; the API will serve the data extracted by the extractor modules and classes.

License

thatscrapper was created by Vagner Bessa. It is licensed under the terms of the MIT license.

Credits

thatscrapper was created with cookiecutter and the py-pkgs-cookiecutter template.

