Skip to main content

A wrapper around requests, BeautifulSoup, and Selenium (Chrome) to facilitate web scraping.

Project description

Description

A wrapper around requests, BeautifulSoup, and Selenium (Chrome) to facilitate web scraping.

Installation

pip install webmage

Basic Usage

Import WebSpell

from webmage import WebSpell

Initializing the WebSpell class

The WebSpell class returns either a StaticSpell or DynamicSpell depending on which method you select: static or dynamic. Choosing static will use the requests module while dynamic will use a selenium webdriver. Currently Chrome is the only supported browser. spell = WebSpell(url='https://javascriptorian.com', method='dynamic', driver_path='chromedriver.exe', ghost=False) method: 'dynamic' or 'static' driver_path (optional): A filepath for a chromedriver. If omitted, the class will automatically download one into your cache. ghost (optional): A boolean for making the chromedriver headless.

Changing from one URL to another

When you initialize the WebSpell class, you provide a URL. You probably need to change the url if you are web scraping. spell.change_url(url='https://wordcruncher.com') url = A string of an HTTP(S) request.

Closing the webdriver

You can close the webdriver browser by using the close method. spell.close()

Selecting an element

The select method returns a StaticRune or DynamicRune of the first element found on the webpage. It takes a css selector as its only argument. XPath is not implemented yet. rune = spell.select('a') css_selector: A string of a css selector.

Selecting multiple elements

The selectAll method returns a list of StaticRune or DynamicRune elements. runes = spell.selectAll('a') css_selector: A string of a css selector.

Seeing all attributes from element

All element attributes are added to rune.attributes. attributes = rune.attributes

Getting tag attributes from element

You can use the dictionary-like format to get an attribute from an HTML tag. Unlike BeautifulSoup, webmage returns None instead of KeyError if no attribute is found. url = rune['href']

Getting the text from element

text = rune.text

Getting the innerHTML or outerHTML from element

outer = rune.outerHTML```

## Clicking on an element
The click method lets you click on an element on the page. It will click on the first element it finds. It takes a css selector as its only argument.
```spell.click(css_selector='button')```
css_selector: A string of a css selector.

## Clicking on multiple elements
The clickAll method lets you click on multiple elements. Generally, it's good to wait between clicks, so there's an optional wait_interval argument you can pass to this method.
```spell.clickAll(css_selector='button', wait_interval=2)```
css_selector: A string of a css selector.
wait_interval (optional): A float or integer of the amount of seconds to wait in between clicks.

# Scrolling Abilities
WebSpell has 4 different scrolling methods depending on the nature of the website.

## Limited Scroll
The limited scroll method will scroll down to the bottom of the page X amount of times. It always scrolls down to the bottom of the page immediately.

spell.scroll(wait_interval, scroll_count, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)


wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_count: An integer of how many times you want to scroll down the page.
scroll_css_selector: A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element.
callback: A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose: A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

## Delicate Limited Scroll
The delicate limited scroll will scroll down X amount of pixels X amount of times. It scrolls down a specific amount of pixels, allowing you to grab content from the webpage that only is available if the page is scrolled down gradually.
```spell.delicate_scroll(wait_interval, scroll_count, scroll_pixel_length=500, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)```
wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_count: An integer of how many times you want to scroll down the page.
scroll_pixel_length: A float or integer of the amount of pixels you want to scroll down for each scroll.
scroll_css_selector: A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element.
callback: A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose: A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

## Infinite Scroll
The infinite scroll is a special ability programmed into webmage. It will scroll down the page for as long as there is no more content added to the page.
```spell.infinite_scroll(wait_interval, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)```
wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_pixel_length: A float or integer of the amount of pixels you want to scroll down for each scroll.
scroll_css_selector: A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element.
callback: A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose: A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

## Delicate Infinite Scroll
```delicate_infinite_scroll(wait_interval, scroll_pixel_length=500, scroll_css_selector="document.scrollingElement", callback=None, verbose=True)```
wait_interval: A float or integer of the amount of seconds to wait in between scrolls.
scroll_pixel_length: A float or integer of the amount of pixels you want to scroll down for each scroll.
scroll_css_selector: A string of a css selector. If the website has a custom scrolling element, you must specify the CSS selector for the css selector. Defaults to the normal scrolling element.
callback: A callback function to call after each scroll. The callback function must have one argument that contains the spell object.
verbose: A boolean for whether you want it to dynamically print how many times it's scrolled down the page.

# Other Abilities

## Pause Scraper
Same as time.sleep()
```spell.wait(5)```
time_interval: A float or integer of the amount of seconds to pause your scraper.

## Get URL Slug
This returns the last part of the URL. It removes any hashes (#...) or queries (?...&...)
```spell.get_slug()```

## Get Network Log
Another special ability of WebSpell. This returns a list of the network log. Useful for getting data that is only found in the network requests (like image or video URLs). It's important to note that the network log wipes the log each time you call this function, so you must save the data to a variable if you intend on getting the network log multiple times.
```log = spell.network_log()```

## Executing JavaScript code
The cast_js method will execute JavaScript code on the web browser.
```spell.cast_js('window.open("window.open("https://google.com", "_blank");')```

## Opening a New Tab Temporarily
Sometimes you need to open an image, video, tweet, etc. in a different tab, and you don't want to lose your progress on your main page. You can use the cast_in_discrete_tab method to open a new tab and do something there.
```spell.cast_in_discrete_tab(url, callback=False, payload)```
url: A string of the URL that you want to open in a new tab
callback: A callback function that you want to run after opening the tab.
payload: An object (e.g. dictionary or list) that you want available within the callback function.

## Taking a Screenshot
Use the take_screenshot method if you want to take a picture of an image.
```spell.take_screenshot(css_selector='img')```
css_selector: A string of a CSS selector.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webmage-0.0.22.tar.gz (10.2 kB view hashes)

Uploaded Source

Built Distribution

webmage-0.0.22-py3-none-any.whl (9.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page