Skip to main content

A library with built-in methods that make web scraping easier for dynamically loading pages.

Project description

Description

A wrapper around requests, BeautifulSoup, and Selenium (Chrome) to facilitate web scraping.

Installation

pip install webmage

Basic Usage

Import WebSpell

from webmage import WebSpell

Initializing the WebSpell class

The WebSpell class returns either a StaticSpell or DynamicSpell depending on which method you select: static or dynamic. Choosing static will use the requests module while dynamic will use a selenium webdriver. Currently Chrome is the only supported browser.

spell = WebSpell('https://javascriptorian.com')

url: A URL.

driver_path (optional): A filepath for a chromedriver. If omitted, the class will automatically download one into your cache.

browser (optional): A string to choose which browser you want to use. Options include chrome | undetected_chrome | firefox

ghost (optional): A boolean for making the chromedriver headless.

Changing from one URL to another

When you initialize the WebSpell class, you provide a URL. You probably need to change the url if you are web scraping.

spell.change_url('https://wordcruncher.com')

url = A string of an HTTP(S) request.

Closing the webdriver

You can close the webdriver browser by using the close method.

spell.close()

Selecting an element

The select method returns a StaticRune or DynamicRune of the first element found on the webpage. It takes a css selector as its only argument. XPath is not implemented yet.

rune = spell.select('a')

css_selector: A string of a css selector.

Selecting multiple elements

The selectAll method returns a list of StaticRune or DynamicRune elements.

runes = spell.selectAll('a')

css_selector: A string of a css selector.

Seeing all attributes from element

All element attributes are added to rune.attributes.

attributes = rune.attributes

Getting tag attributes from element

You can use the dictionary-like format to get an attribute from an HTML tag. Unlike BeautifulSoup, webmage returns None instead of KeyError if no attribute is found.

url = rune['href']

Getting the text from element

text = rune.text

Getting the innerHTML or outerHTML from element

inner = rune.innerHTML
outer = rune.outerHTML

Clicking on an element

The click method lets you click on an element on the page. It will click on the first element it finds. It takes a css selector as its only argument.

spell.click(css_selector='button')

css_selector: A string of a css selector.

Clicking on multiple elements

The clickAll method lets you click on multiple elements. Generally, it's good to wait between clicks, so there's an optional wait_interval argument you can pass to this method.

spell.clickAll(css_selector='button', wait_interval=2)

css_selector: A string of a css selector.

wait_interval (optional): A float or integer of the amount of seconds to wait in between clicks.

Scrolling Abilities

WebSpell has 4 different scrolling methods depending on the nature of the website.

Limited Scroll

The limited scroll method will scroll down to the bottom of the page X amount of times. It always scrolls down to the bottom of the page immediately.

spell.scroll(wait_interval, scroll_count, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)

wait_interval: A float or integer of the amount of seconds to wait in between scrolls.

scroll_count: An integer of how many times you want to scroll down the page.

scroll_css_selector (optional): A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element.

callback (optional): A callback function to call after each scroll. The callback function must have one argument that contains the spell object.

verbose (optional): A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

Delicate Limited Scroll

The delicate limited scroll will scroll down X amount of pixels X amount of times. It scrolls down a specific amount of pixels, allowing you to grab content from the webpage that only is available if the page is scrolled down gradually.

spell.delicate_scroll(wait_interval, scroll_count, scroll_pixel_length=500, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)

wait_interval: A float or integer of the amount of seconds to wait in between scrolls.

scroll_count: An integer of how many times you want to scroll down the page.

scroll_pixel_length (optional): A float or integer of the amount of pixels you want to scroll down for each scroll. Defaults to 500 pixels

scroll_css_selector (optional): A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element.

callback (optional): A callback function to call after each scroll. The callback function must have one argument that contains the spell object.

verbose (optional): A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

Infinite Scroll

The infinite scroll is a special ability programmed into webmage. It will scroll down the page for as long as there is no more content added to the page.

spell.infinite_scroll(wait_interval, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)

wait_interval: A float or integer of the amount of seconds to wait in between scrolls.

scroll_pixel_length: A float or integer of the amount of pixels you want to scroll down for each scroll.

scroll_css_selector (otional): A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element.

callback (optional): A callback function to call after each scroll. The callback function must have one argument that contains the spell object.

verbose (optional): A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

Delicate Infinite Scroll

delicate_infinite_scroll(wait_interval, scroll_pixel_length=500, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)

wait_interval: A float or integer of the amount of seconds to wait in between scrolls. scroll_pixel_length: A float or integer of the amount of pixels you want to scroll down for each scroll. scroll_css_selector: A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element. callback: A callback function to call after each scroll. The callback function must have one argument that contains the spell object. verbose: A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

Destroying/Removing Elements

You may want to remove elements from the page, especially if you don't need that content or you are scrolling on a very long page. Below are a few methods that help you remove elements from a page

Destroy all tags on an HTML page.

spell.destroyAll('div')

Iterate through each tag and then destroy it.

divs = spell.selectAll('div')
for div in divs:
    div.destroy()

Select the a tag and then destroy the first of another tag.

body = spell.select('div.body')
body.destroy('sup')

Select a tag and then destroy all of another tag within the main tag.

body = spell.select('div.body')
body.destroyAll('sup')

Other Abilities

Pause Scraper

Same as time.sleep()

spell.wait(5)

time_interval: A float or integer of the amount of seconds to pause your scraper.

Get URL Slug

This returns the last part of the URL. It removes any hashes (#...) or queries (?...&...)

spell.get_slug()

Get Network Log

Another special ability of WebSpell. This returns a list of the network log. Useful for getting data that is only found in the network requests (like image or video URLs). It's important to note that the network log wipes the log each time you call this function, so you must save the data to a variable if you intend on getting the network log multiple times.

log = spell.network_log()

Executing JavaScript code

The cast_js method will execute JavaScript code on the web browser.

spell.cast_js('window.open("window.open("https://google.com", "_blank");')

Opening a New Tab Temporarily

Sometimes you need to open an image, video, tweet, etc. in a different tab, and you don't want to lose your progress on your main page. You can use the cast_in_discrete_tab method to open a new tab and do something there.

spell.cast_in_discrete_tab(url, callback=False, payload)

url: A string of the URL that you want to open in a new tab callback: A callback function that you want to run after opening the tab. payload: An object (e.g. dictionary or list) that you want available within the callback function.

Taking a Screenshot

Use the take_screenshot method if you want to take a picture of an image.

spell.take_screenshot(css_selector='img')

css_selector: A string of a CSS selector.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webmage-1.0.3.tar.gz (10.6 kB view details)

Uploaded Source

Built Distribution

webmage-1.0.3-py3-none-any.whl (9.1 kB view details)

Uploaded Python 3

File details

Details for the file webmage-1.0.3.tar.gz.

File metadata

  • Download URL: webmage-1.0.3.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.4

File hashes

Hashes for webmage-1.0.3.tar.gz
Algorithm Hash digest
SHA256 cb0f581b653c25cdca873c3316629a048e523d12125fcd00bcff85e892f6585c
MD5 950ff3fb57e25ed0015c81f9e9088325
BLAKE2b-256 7dcf7774c6bd2610bb534afdb64c9d6ba0b4be6267dbbddf722c20a096918dc8

See more details on using hashes here.

File details

Details for the file webmage-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: webmage-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 9.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.4

File hashes

Hashes for webmage-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9af0dbe9008eefa5f635cf622191e4dae5e1f3a7e4c8b0a23a7fdf42a61b939a
MD5 a8dfd792b9a8c32fe58ea5faa6d646ff
BLAKE2b-256 3589efd149d7bf8dcb3ca37f192f9ed9fa6d7736f7122691faa7098e6bf4e0fe

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page