A simple library to set up Selenium processes

seleniumprocessor

Description

This library lets you set up a Selenium-based process with little code. A process is described in a specific format, as a list of actions, which the library then executes in sequence through Selenium.

Installation

pip install seleniumprocessor

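Then install a Selenium web driver, e.g., the Chrome WebDriver, and note the path of the driver file: it is the webdriverfile argument of initiate_connection.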

Available methods

initiate_connection(webdriverfile, url, to, loginrequired=True, headless=False), returning a selenium.webdriver.chrome.webdriver.WebDriver object allowing browser control

  • webdriverfile is the path of the Selenium web driver file
  • url is the url to open
  • to is the timeout to wait for the page to load
  • loginrequired specifies if a manual login from the user is required (True) or not (False)
  • headless specifies if the browser has to be executed in headless mode (True) or not (False)
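
As a minimal sketch of how the parameters above fit together, the helper below opens a page in headless mode with no manual login. The helper name open_headless and its default values are hypothetical and not part of the library; only the initiate_connection signature is taken from the description above.

```python
def open_headless(url, driver_path='./chromedriver', timeout=3):
    """Open url in a headless browser without waiting for a manual login.

    Hypothetical convenience wrapper, not provided by seleniumprocessor.
    """
    # Imported lazily so that defining the helper does not require the package.
    import seleniumprocessor
    return seleniumprocessor.initiate_connection(
        driver_path,          # webdriverfile: path of the Selenium web driver file
        url,                  # url: the page to open
        timeout,              # to: timeout to wait for page loading
        loginrequired=False,  # no manual login from the user
        headless=True,        # run the browser in headless mode
    )

# Example call (needs the package and a web driver to be installed):
# brw = open_headless('https://github.com/auino')
```

Passing loginrequired and headless by keyword keeps the call readable, since both are booleans and easy to swap by mistake when given positionally.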

run_process(brw, url_home, to, p, backtohome_begin=True, backtohome_end=True, checkfilterpassed_callback=None), returning an object, as specified in the process p

  • brw is the selenium.webdriver.chrome.webdriver.WebDriver object used to control the browser
  • url_home is the home page url
  • to is the timeout used to wait for the home page to load
  • p is the list of actions in the current process
  • backtohome_begin specifies if the browser should be redirected to the home page at the beginning of the method (True) or not (False)
  • backtohome_end specifies if the browser should be redirected to the home page at the end of the method (True) or not (False)
  • checkfilterpassed_callback is a callback function used to check the filters defined in the process p, returning a boolean value (True if the filter is passed, False otherwise)
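
The filter callback can be sketched as follows. The callback signature (one filter string in, one boolean out) is an assumption drawn from the parameter description above, and the filter names and the ENABLED_FILTERS set are hypothetical.

```python
# Hypothetical set of filters that are currently considered "passed".
ENABLED_FILTERS = {'only_public'}

def check_filter_passed(filter_value):
    """Return True if the given filter string is passed, False otherwise.

    Assumed signature: the library is described as passing the action's
    'filter' string to this callback and expecting a boolean back.
    """
    return filter_value in ENABLED_FILTERS

# A process whose second action carries a filter string.
p = [
    {'class_name': 'UnderlineNav-item', 'index': 1, 'action': 'click'},
    {'class_name': 'wb-break-all', 'action': 'store_text',
     'action_parameters': 'name', 'filter': 'only_public'},
]

# The callback would then be handed to run_process, e.g.:
# data = seleniumprocessor.run_process(brw, URL_HOME, SLEEP_TO, p,
#                                      checkfilterpassed_callback=check_filter_passed)
```

Keeping the decision in a plain function makes it possible to test the filtering logic without opening a browser.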

Objects structure

The main process object is a list of actions that are executed sequentially. Each action is represented by a dictionary with the following fields:

  • name: the name identifying the DOM objects to find
  • class_name: the class name identifying the DOM objects to find
  • id: the id identifying the DOM object to find
  • index (optional): when several DOM objects share the same class (or when a DOM object other than the first one has to be considered), the index of the DOM object to use, within the list of DOM objects sharing that class
  • sleep (optional): the time to sleep after the action is performed
  • filter: a string passed to the checkfilterpassed_callback for filtering actions
  • action_parameters (optional): its definition depends on the action field
  • action: the action to execute:
    • click: to perform a click on the DOM object
    • click-repeated: to click the DOM object repeatedly, as long as the object is present (useful with sleep, e.g., for pages that load a list in portions, with a final button to load additional results); the optional action_parameters parameter represents the class name of the objects to count: when that count no longer changes, the repeated clicks stop
    • navigate: to navigate by clicking a specific sequence of objects, identified by their text value; the action_parameters parameter represents the navigation path, with steps separated by >
    • scroll_to: to scroll to the specific element
    • empty_value: to empty the value property of the DOM object
    • store_text: to store the text of the DOM object in the object returned by the run_process method; the action_parameters parameter represents the name of the property on that object
    • send_keys: to send a key input to a specific DOM object
    • select: to select a specific value of a specific combo-box DOM object, where the value is specified in the action_parameters parameter
    • foreach: to loop on all the DOM objects retrieved to execute repeated actions
  • context (optional): when the foreach action is used, all sub-items are searched for within the parent DOM object of the current loop iteration; to search the whole page instead, specify whole_page as context
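
Since a process is plain data, it can be checked before a browser is opened. The sketch below is not part of the library: validate_process and KNOWN_ACTIONS are hypothetical names, and the check covers only what the field list above states (the action names, and nested actions passed as a list in action_parameters for foreach, as in the samples).

```python
# Action names taken from the field list above.
KNOWN_ACTIONS = {
    'click', 'click-repeated', 'navigate', 'scroll_to', 'empty_value',
    'store_text', 'send_keys', 'select', 'foreach',
}

def validate_process(p, path='p'):
    """Return a list of human-readable problems found in process p."""
    if not isinstance(p, list):
        return ['{}: a process must be a list of actions'.format(path)]
    problems = []
    for i, step in enumerate(p):
        where = '{}[{}]'.format(path, i)
        if not isinstance(step, dict):
            problems.append('{}: each action must be a dictionary'.format(where))
            continue
        action = step.get('action')
        if action not in KNOWN_ACTIONS:
            problems.append('{}: unknown action {!r}'.format(where, action))
        elif action == 'foreach':
            # For foreach, the nested actions are given in action_parameters.
            nested = step.get('action_parameters')
            problems.extend(validate_process(nested, where + '.action_parameters'))
    return problems

# Example: a typo in an action name is reported with its position.
print(validate_process([{'class_name': 'source', 'action': 'clik'}]))
```

An empty list means no problem was found by these checks; it does not guarantee that the DOM objects exist on the page.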

Sample usage

Get all repositories of @auino

# import the library
import seleniumprocessor

# define initial variables
URL_HOME = 'https://github.com/auino'
SLEEP_TO = 3

# initiate a connection on auino GitHub page (not requiring a login)
brw = seleniumprocessor.initiate_connection('./chromedriver', URL_HOME, SLEEP_TO, loginrequired=False)

# define the process to be executed
p = [
	{'class_name':'UnderlineNav-item', 'index':1, 'action':'click', 'sleep':SLEEP_TO}, # clicking on the Repository tab, the second one, on top of the page
	{'class_name':'source', 'action':'foreach', 'action_parameters':[ # looping on all repositories
		{'class_name':'wb-break-all', 'action':'store_text', 'action_parameters':'name'}, # storing the repository name
		{'class_name':'color-text-secondary', 'action':'store_text', 'action_parameters':'description'} # storing the repository description
	]}
]

# run the process
data = seleniumprocessor.run_process(brw, URL_HOME, SLEEP_TO, p, backtohome_begin=False)

# showing resulting data
print(data)

Get all publications of a given user from Google Scholar

import seleniumprocessor

# define initial variables
USERPROFILE = 'UlbGEQwAAAAJ'
URL_HOME = 'https://scholar.google.com/citations?user={}'.format(USERPROFILE)
SLEEP_TO = 3

# initiate a connection on the Google Scholar profile page (not requiring a login)
brw = seleniumprocessor.initiate_connection('./chromedriver', URL_HOME, SLEEP_TO, loginrequired=False)

# define the process to be executed
p = [
    {'id':'gsc_prf_in', 'action':'store_text', 'action_parameters':'name'}, # storing researcher's name
    {'class_name':'gs_lbl', 'index':-1, 'action':'click-repeated', 'action_parameters':'gsc_a_tr', 'sleep':SLEEP_TO}, # clicking the button at the end of the page, to extend the list of publications
    {'class_name':'gsc_a_tr', 'action':'foreach', 'action_parameters':[ # looping on all publications
        {'class_name':'gsc_a_at', 'action':'store_text', 'action_parameters':'title'}, # storing the publication name
        {'class_name':'gs_gray', 'index':0, 'action':'store_text', 'action_parameters':'authors'}, # storing the authors of the publication
        {'class_name':'gs_gray', 'index':1, 'action':'store_text', 'action_parameters':'venue'}, # storing the venue of the publication
        {'class_name':'gsc_a_ac', 'action':'store_text', 'action_parameters':'citations'}, # storing the number of citations of the publication
        {'class_name':'gsc_a_h', 'action':'store_text', 'action_parameters':'year'}, # storing the year of the publication
    ]}
]

# run the process
data = seleniumprocessor.run_process(brw, URL_HOME, SLEEP_TO, p, backtohome_begin=False)

# showing resulting data
print(data)

TODO

  • Improve code readability
  • Extend supported objects structure

Contacts

You can find me on Twitter as @auino.

Download files

Source distribution: seleniumprocessor-0.1.5.tar.gz (4.8 kB)

Built distribution: seleniumprocessor-0.1.5-py3-none-any.whl (5.1 kB, Python 3)
