
Screpe

High-level Python web scraping.

(Crepes not included)



Installation


Using pip

The Python package installer makes it easy to install screpe.

pip install screpe

Using git

Otherwise, clone this repository to your local machine with git, then install with Python.

git clone https://github.com/shanedrabing/screpe.git
cd screpe
python setup.py install

You can also simply download screpe.py and place it in your working directory.



Getting Started


Initializing Screpe

Import the module in Python, and initialize a Screpe object.

from screpe import Screpe

# do we want the scraper to remember previous responses?
scr = Screpe(is_caching=True)

All methods in this module live on the Screpe class, so there is no need to import anything else!


Requests and BeautifulSoup

If you are familiar with web scraping in Python, then you have probably used the requests and bs4 packages before. There are a couple of static methods that Screpe provides to make their usage even easier!

# a webpage we want to scrape
url = "https://www.wikipedia.org"

# returns None if status code is not 200
html = Screpe.get(url)

# can handle None as input, parses the HTML with `lxml`
soup = Screpe.cook(html)

# check to make sure we have a soup object, otherwise see bs4
if soup is not None:
    print(soup.select_one("h1"))

We can marry these two functions with the instance method Screpe.dine. Remember that we have the scr object from the section above.

# get and cook
soup = scr.dine(url)

Responses from Screpe.dine can be cached and adhere to rate-limiting (see next sections).
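
Once we have a soup, everything from bs4 is available as usual. As a small sketch (standard BeautifulSoup calls, nothing Screpe-specific), we might pull the page title and every link target:

# standard bs4 calls on the soup returned by Screpe.dine
if soup is not None:
    # the page title
    title = soup.select_one("title")
    if title is not None:
        print(title.get_text(strip=True))

    # every link target on the page
    for a in soup.select("a[href]"):
        print(a["href"])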


Downloading a Webpage or a File

Commonly, we just want to download an image, a webpage, or some other file. Let us see how to do this with Screpe!

# locator to file we want, local path to where we want it
url = "https://www.python.org/static/img/python-logo.png"
fpath = "logo.png"

# let us use our object to download the file
scr.download(url, fpath)

Note that the URL can point to pretty much any filetype, since the response is saved as binary; just make sure the local path has the right extension.
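
For example, the same call should also save a webpage, since the bytes are written as-is (a small sketch reusing our scr object):

# save the raw HTML of a page to disk
scr.download("https://www.wikipedia.org", "wikipedia.html")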


Downloading an HTML Table

Sometimes there is a nice HTML table on a webpage that we want in a more interoperable format. The pandas package handles this easily, and Screpe takes advantage of that.

# this webpage contains a table that we want to download
url = "https://www.multpl.com/cpi/table/by-year"

# we save the table as a CSV file
fpath = "table.csv"

# the `which` parameter selects which table on the page to save
scr.download_table(url, fpath, which=0)
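
Once the table is on disk, it is an ordinary CSV; as a quick sketch with the standard pandas API, we can read it back and inspect it:

import pandas as pd

# load the saved table into a DataFrame
df = pd.read_csv("table.csv")
print(df.head())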

Selenium

One of the most challenging tasks in web scraping is dealing with dynamic pages that require a web browser to render properly. Thankfully, the selenium package is pretty good at this, and Screpe removes the headaches surrounding Selenium.

# the homepage of Wikipedia has a search box
url = "https://www.wikipedia.org"

# let us open the page in a webdriver
scr.open(url)

# we can click on the input box
scr.click("input#searchInput")

# ...enter a search term
scr.send_keys("Selenium")

# ...and hit return to initiate the search
scr.bide(lambda: scr.send_enter())
# note that the `Screpe.bide` function takes a function as input, checks what
# page it is on, calls the function, and waits for the next page to load

# we can use bs4 once the next page loads!
soup = scr.source()

Caching does not apply to the Selenium-related functions; browsing is stateful, and we cannot simply restore an old webdriver state.


Asynchronous Requests

Screpe uses concurrent.futures to spawn multiple threads that work simultaneously to retrieve webpages.

# a collection of URLs
urls = ["https://www.wikipedia.org/wiki/Dog",
        "https://www.wikipedia.org/wiki/Cat",
        "https://www.wikipedia.org/wiki/Sheep"]

# we want soup objects for all
soups = scr.dine_many(urls)
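
The result is one soup per URL. As a sketch (assuming the soups come back in the same order as the URLs, with None for any failed request), we might print the first heading of each page:

# pair each URL with its soup
for url, soup in zip(urls, soups):
    if soup is not None:
        print(url, soup.select_one("h1"))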

Rate-Limiting

If a site is sensitive to how often you send requests, consider telling your Screpe object to halt between requests.

# the method takes a duration, which we can derive from a rate
rate_per_second = 2
duration_in_seconds = 1 / rate_per_second

# inform your scraper to not surpass the request interval
scr.halt_duration(duration_in_seconds)

Note that cached responses do not adhere to the rate limit. After all, we already have the response!
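
Putting rate-limiting together with the asynchronous requests from above, a polite bulk fetch might look like this (a sketch using only the methods shown so far; how the thread pool interleaves with the rate limit is up to Screpe):

# at most two requests per second
scr.halt_duration(0.5)

# fetch several pages; new requests are throttled, cached ones return at once
soups = scr.dine_many(["https://www.wikipedia.org/wiki/Dog",
                       "https://www.wikipedia.org/wiki/Cat",
                       "https://www.wikipedia.org/wiki/Sheep"])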


Caching

Sometimes, we have to request many pages. To avoid wasting bandwidth (or our rate limit), we can reuse cached responses. Note that caching is on by default; turn it off if you want real-time responses.

# turn caching on
scr.cache_on()

# ...or turn it off
scr.cache_off()

We can save and load the cache between sessions for even more greatness!

# where shall we save the cache? (binary file)
fpath = "cache.bin"

# save the cache
scr.cache_save(fpath)

# load the cache
scr.cache_load(fpath)

# clear the cache
scr.cache_clear()
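
A typical session might load the cache at startup and save it on the way out (a sketch assuming the cache file may not exist on the first run):

import os

from screpe import Screpe

scr = Screpe(is_caching=True)

# reuse responses from a previous session, if we have any
if os.path.exists("cache.bin"):
    scr.cache_load("cache.bin")

# scrape as usual; repeat requests hit the cache
soup = scr.dine("https://www.wikipedia.org")

# persist the cache for next time
scr.cache_save("cache.bin")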



License


MIT License

Copyright (c) 2022 Shane Drabing
