A library with built-in methods that make web scraping easier for dynamically loading pages.
Description
A wrapper around requests, BeautifulSoup, and Selenium (Chrome) to facilitate web scraping.
Installation
pip install webmage
Basic Usage
Import WebSpell
from webmage import WebSpell
Initializing the WebSpell class
The WebSpell class returns either a StaticSpell or a DynamicSpell, depending on whether you select the static or dynamic method. The static method uses the requests module, while the dynamic method uses a Selenium webdriver. Currently Chrome is the only supported browser.
spell = WebSpell('https://javascriptorian.com')
url: A URL.
driver_path (optional): A filepath for a chromedriver. If omitted, the class will automatically download one into your cache.
browser (optional): A string to choose which browser you want to use. Options include chrome | undetected_chrome | firefox
ghost (optional): A boolean for making the chromedriver headless.
Changing from one URL to another
When you initialize the WebSpell class, you provide a URL. While scraping, you will usually need to move on to other URLs; use the change_url method to do so.
spell.change_url('https://wordcruncher.com')
url: A string containing an HTTP(S) URL.
Closing the webdriver
You can close the webdriver browser by using the close method.
spell.close()
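Because a dynamic spell holds a live browser, it can help to guarantee that close runs even when the scraping code raises an exception. The sketch below relies only on the select and close methods described in this README and on Python's standard contextlib.closing. The scrape_title helper and the 'title' selector are illustrative assumptions, not part of webmage.

```python
from contextlib import closing


def scrape_title(spell):
    """Return the text of the first <title> element, then close the spell.

    `spell` is assumed to be an already-initialized WebSpell-like object
    exposing select() and close() as described in this README.
    """
    # closing() calls spell.close() on exit, even if select() raises.
    with closing(spell):
        rune = spell.select('title')
        # Guard against a missing element (an assumption about select()).
        return rune.text if rune is not None else None
```

Called as scrape_title(WebSpell('https://javascriptorian.com')), this would leave no browser window behind whether or not the lookup succeeds.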
Selecting an element
The select method returns a StaticRune or DynamicRune for the first matching element found on the webpage. It takes a CSS selector as its only argument. XPath is not implemented yet.
rune = spell.select('a')
css_selector: A string of a css selector.
Selecting multiple elements
The selectAll method returns a list of StaticRune or DynamicRune elements.
runes = spell.selectAll('a')
css_selector: A string of a css selector.
Seeing all attributes from element
All element attributes are added to rune.attributes.
attributes = rune.attributes
Getting tag attributes from element
You can use dictionary-style access to get an attribute from an HTML tag. Unlike BeautifulSoup, webmage returns None instead of raising a KeyError if the attribute is not found.
url = rune['href']
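Since a missing attribute comes back as None rather than raising, a loop over many elements can filter on None instead of wrapping each lookup in try/except. The following is a minimal sketch under that assumption. The collect_links name is hypothetical, and it uses only selectAll and the dictionary-style lookup described above.

```python
def collect_links(spell, css_selector='a'):
    """Return the href of every matching element that has one.

    Assumes `spell` exposes selectAll() and that each rune supports
    rune['href'], returning None when the attribute is missing.
    """
    links = []
    for rune in spell.selectAll(css_selector):
        href = rune['href']  # None rather than KeyError if absent
        if href is not None:
            links.append(href)
    return links
```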
Getting the text from element
text = rune.text
Getting the innerHTML or outerHTML from element
inner = rune.innerHTML
outer = rune.outerHTML
Clicking on an element
The click method lets you click on an element on the page. It will click on the first element it finds. It takes a css selector as its only argument.
spell.click(css_selector='button')
css_selector: A string of a css selector.
Clicking on multiple elements
The clickAll method lets you click on multiple elements. Generally, it's good to wait between clicks, so there's an optional wait_interval argument you can pass to this method.
spell.clickAll(css_selector='button', wait_interval=2)
css_selector: A string of a css selector.
wait_interval (optional): A float or integer of the amount of seconds to wait in between clicks.
Scrolling Abilities
WebSpell has 4 different scrolling methods depending on the nature of the website.
Limited Scroll
The limited scroll method scrolls to the bottom of the page a fixed number of times (scroll_count). Each scroll jumps straight to the bottom of the page.
spell.scroll(wait_interval, scroll_count, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)
wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_count: An integer of how many times you want to scroll down the page.
scroll_css_selector (optional): A string of a CSS selector. If the website has a custom scrolling element, you must specify the CSS selector for that element. Defaults to the normal scrolling element.
callback (optional): A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose (optional): A boolean for whether you want it to dynamically print how many times it's scrolled down the page.
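The callback argument is a convenient place to harvest content as it appears. The sketch below builds a callback that takes the spell object as its single argument, as required above, and records the text of matching elements without duplicates. The make_text_collector name and the de-duplication by text are assumptions made for this example.

```python
def make_text_collector(css_selector):
    """Build a scroll callback that records the text of new matches.

    Returns (callback, texts). The callback takes the spell object as its
    single argument, as the scroll methods require.
    """
    texts = []
    seen = set()

    def callback(spell):
        for rune in spell.selectAll(css_selector):
            text = rune.text
            if text not in seen:  # skip items captured on earlier scrolls
                seen.add(text)
                texts.append(text)

    return callback, texts
```

After callback, texts = make_text_collector('h2'), a call such as spell.scroll(2, 5, callback=callback) would fill texts as the page grows.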
Delicate Limited Scroll
The delicate limited scroll scrolls down a fixed number of pixels (scroll_pixel_length) a fixed number of times (scroll_count). Scrolling in small steps lets you grab content that only becomes available when the page is scrolled down gradually.
spell.delicate_scroll(wait_interval, scroll_count, scroll_pixel_length=500, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)
wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_count: An integer of how many times you want to scroll down the page.
scroll_pixel_length (optional): A float or integer of the amount of pixels you want to scroll down for each scroll. Defaults to 500 pixels
scroll_css_selector (optional): A string of a CSS selector. If the website has a custom scrolling element, you must specify the CSS selector for that element. Defaults to the normal scrolling element.
callback (optional): A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose (optional): A boolean for whether you want it to dynamically print how many times it's scrolled down the page.
Infinite Scroll
The infinite scroll is a special ability programmed into webmage. It keeps scrolling down the page until no more content is added to the page.
spell.infinite_scroll(wait_interval, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)
wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_pixel_length: A float or integer of the amount of pixels you want to scroll down for each scroll.
scroll_css_selector (optional): A string of a CSS selector. If the website has a custom scrolling element, you must specify the CSS selector for that element. Defaults to the normal scrolling element.
callback (optional): A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose (optional): A boolean for whether you want it to dynamically print how many times it's scrolled down the page.
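The stopping rule behind an infinite scroll can be illustrated independently of any browser. This is a sketch of the general technique, not webmage's actual implementation: it scrolls, waits, and stops once the page height no longer changes. The get_height and scroll_to_bottom callables and the max_scrolls safety cap are hypothetical stand-ins introduced for this example.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, wait_interval, max_scrolls=1000):
    """Scroll until the page height stops growing; return the scroll count.

    get_height and scroll_to_bottom are hypothetical callables standing in
    for browser operations; max_scrolls is a safety cap added for this sketch.
    """
    scrolls = 0
    last_height = get_height()
    while scrolls < max_scrolls:
        scroll_to_bottom()
        scrolls += 1
        time.sleep(wait_interval)  # give new content time to load
        new_height = get_height()
        if new_height == last_height:  # nothing was added: stop
            break
        last_height = new_height
    return scrolls
```

A wait_interval that is too short can end the loop early, because the height is compared before slow content has finished loading.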
Delicate Infinite Scroll
The delicate infinite scroll combines the two previous approaches: it scrolls down a fixed number of pixels at a time and keeps going until no more content is added to the page.
spell.delicate_infinite_scroll(wait_interval, scroll_pixel_length=500, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)
wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_pixel_length (optional): A float or integer of the amount of pixels you want to scroll down for each scroll. Defaults to 500 pixels.
scroll_css_selector (optional): A string of a CSS selector. If the website has a custom scrolling element, you must specify the CSS selector for that element. Defaults to the normal scrolling element.
callback (optional): A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose (optional): A boolean for whether you want it to dynamically print how many times it's scrolled down the page.
Destroying/Removing Elements
You may want to remove elements from the page, especially if you don't need that content or you are scrolling on a very long page. Below are a few methods that help you remove elements from a page.
Destroy every element on the page that matches a selector.
spell.destroyAll('div')
Iterate through each tag and then destroy it.
divs = spell.selectAll('div')
for div in divs:
    div.destroy()
Select a tag and then destroy the first instance of another tag within it.
body = spell.select('div.body')
body.destroy('sup')
Select a tag and then destroy all of another tag within the main tag.
body = spell.select('div.body')
body.destroyAll('sup')
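One common use of these methods is cleaning a container before reading its text, for example stripping footnote markers. The sketch below combines select, destroyAll, and text as described in this README. The text_without name is hypothetical, and it assumes that a rune's text reflects the removal when read afterwards.

```python
def text_without(spell, container_selector, unwanted_selector):
    """Return a container's text after removing unwanted descendants.

    Assumes select() returns a rune exposing destroyAll() and text,
    and that text is read after the removal takes effect.
    """
    container = spell.select(container_selector)
    container.destroyAll(unwanted_selector)  # e.g. strip footnote markers
    return container.text
```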
Other Abilities
Pause Scraper
Same as time.sleep()
spell.wait(5)
time_interval: A float or integer of the amount of seconds to pause your scraper.
Get URL Slug
This returns the last part of the URL. It removes any hashes (#...) or queries (?...&...)
spell.get_slug()
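To make the behavior concrete, here is a standalone sketch of slug extraction using only the standard library. It is not webmage's own code. The url_slug name is hypothetical, and ignoring a trailing slash is an assumption of this example.

```python
from urllib.parse import urlsplit


def url_slug(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path  # the path excludes ?query and #fragment
    # Drop empty segments so a trailing slash does not yield an empty slug.
    segments = [s for s in path.split('/') if s]
    return segments[-1] if segments else ''


print(url_slug('https://example.com/blog/my-post?page=2#comments'))  # my-post
```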
Get Network Log
Another special ability of WebSpell. The network_log method returns the browser's network log as a list. It is useful for getting data that is only found in the network requests (like image or video URLs). Note that the log is wiped each time you call this method, so you must save the result to a variable if you intend to get the network log multiple times.
log = spell.network_log()
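Because the log is cleared on every call, one way to keep a full history is to append each batch to a list you own. The sketch below treats log entries as opaque values and assumes nothing about their structure. The drain_network_log name is hypothetical.

```python
def drain_network_log(spell, collected):
    """Append the current network log entries to `collected`.

    Each call to spell.network_log() is assumed to return only entries
    recorded since the previous call, so they are stored right away.
    Returns the number of entries added.
    """
    entries = spell.network_log()
    collected.extend(entries)
    return len(entries)
```

Calling this after each scroll or click, with the same collected list, accumulates the entries that would otherwise be lost.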
Executing JavaScript code
The cast_js method will execute JavaScript code on the web browser.
spell.cast_js('window.open("https://google.com", "_blank");')
Opening a New Tab Temporarily
Sometimes you need to open an image, video, tweet, etc. in a different tab, and you don't want to lose your progress on your main page. You can use the cast_in_discrete_tab method to open a new tab and do something there.
spell.cast_in_discrete_tab(url, callback=False, payload)
url: A string of the URL that you want to open in a new tab.
callback: A callback function that you want to run after opening the tab.
payload: An object (e.g. dictionary or list) that you want available within the callback function.
Taking a Screenshot
Use the take_screenshot method if you want to capture a screenshot of an element, such as an image.
spell.take_screenshot(css_selector='img')
css_selector: A string of a CSS selector.