A framework and a utility that help you develop web scraping applications.
Project description
Scrape Academy
Scrape Academy provides a framework and a utility that help you develop web scraping applications.
Simple web page scraping
Scrape Academy helps you download the web pages you want to scrape.
# Download a page from https://www.python.jp
from bs4 import BeautifulSoup
from scrapeacademy import context, run
async def run_simple():
    page = await context.get("https://www.python.jp")
    soup = BeautifulSoup(page, features="html.parser")
    print(soup.title.text)
run(run_simple())
scrapeacademy.run() starts an asyncio event loop and runs a scraping function.
In the async function, you can use the context.get() method to download a page. context.get() throttles the requests sent to the server. By default, it waits 0.1 seconds between requests.
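The throttling behaviour described above can be pictured with a small standalone sketch. This is not Scrape Academy's implementation: the `Throttle` class and the `demo()` coroutine are hypothetical names used only for illustration, and no network request is made. Only the 0.1-second default interval is taken from the text.

```python
import asyncio
import time


class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(self, interval: float = 0.1) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._last = None  # monotonic time of the previous call, if any

    async def wait(self) -> None:
        # The lock serialises callers so concurrent tasks queue up in turn.
        async with self._lock:
            if self._last is not None:
                remaining = self.interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()


async def demo(count: int = 3, interval: float = 0.1) -> list:
    """Simulate `count` requests and return the time each one started."""
    throttle = Throttle(interval)  # created inside the running event loop
    starts = []
    for _ in range(count):
        await throttle.wait()
        starts.append(time.monotonic())  # a real download would happen here
    return starts


if __name__ == "__main__":
    times = asyncio.run(demo())
    gaps = [round(b - a, 2) for a, b in zip(times, times[1:])]
    print(gaps)
```

Each simulated request starts roughly `interval` seconds after the previous one, which is the effect the default 0.1-second wait is meant to have on a real server.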
Cache downloaded files
While developing a scraper, you usually need to inspect the HTML over and over. To help with this, you can save the downloaded files to the cache directory.
The context.get() method saves the downloaded file to the cache directory if the name parameter is supplied.
# Save https://www.python.jp
from scrapeacademy import context, run
async def save_index():
    # Passing name= stores the downloaded page in the cache under that name.
    page = await context.get("https://www.python.jp", name="python_jp_index")
run(save_index())
Later, you can load the saved HTML from the cache to scrape using another script.
# Parse the saved HTML file.
from bs4 import BeautifulSoup
from scrapeacademy import context
html = context.load("python_jp_index")
soup = BeautifulSoup(html, features="html.parser")
print(soup.title.text)
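To make the save-then-load workflow concrete, here is a minimal standalone sketch of a name-keyed file cache. It is an illustration only: `save_page()`, `load_page()`, the cache directory argument, and the `.html` suffix are all assumptions, since the text does not describe how Scrape Academy lays out its cache directory.

```python
import tempfile
from pathlib import Path


def cache_path(cache_dir: Path, name: str) -> Path:
    """Map a cache name to a file path (the .html suffix is an assumption)."""
    return cache_dir / f"{name}.html"


def save_page(cache_dir: Path, name: str, html: str) -> Path:
    """Write the HTML text under the given name and return the file path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, name)
    path.write_text(html, encoding="utf-8")
    return path


def load_page(cache_dir: Path, name: str) -> str:
    """Read back HTML text that was saved earlier under the given name."""
    return cache_path(cache_dir, name).read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cache"
        save_page(cache, "python_jp_index", "<title>example</title>")
        print(load_page(cache, "python_jp_index"))
```

Keying the cache by a short name rather than by URL keeps the later parsing script independent of where the page originally came from.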
Command line utility
Scrape Academy provides the scrapeacademy command to make development easier.
You can inspect a cached file in a web browser.
$ scrapeacademy open python_jp_index
Or you can open the file in the vi editor as follows.
$ vi `scrapeacademy path python_jp_index`