A framework and utility that help you develop web scraping applications.

Project description

Scrape Academy

Scrape Academy provides a framework and a utility that help you develop web scraping applications.

Simple web page scraping

Scrape Academy helps you download the web pages you want to scrape.

# Download a page from https://www.python.jp

from bs4 import BeautifulSoup
from scrapeacademy import context, run

async def run_simple():
    page = await context.get("https://www.python.jp")
    soup = BeautifulSoup(page, features="html.parser")
    print(soup.title.text)

run(run_simple())

scrapeacademy.run() starts an asyncio event loop and runs the scraping coroutine.

Inside the async function, use the context.get() method to download a page. context.get() throttles requests to the server; by default, it waits 0.1 seconds between requests.
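To illustrate what throttling means here, the following is a minimal sketch of one way to space out requests with asyncio. It is not Scrape Academy's implementation: the ThrottledFetcher class and the fake_fetch stand-in are hypothetical names made up for this example, and only the 0.1 second interval comes from the description above.

```python
import asyncio
import time


class ThrottledFetcher:
    """Hypothetical helper that spaces out calls to an async fetch function."""

    def __init__(self, fetch, interval=0.1):
        self._fetch = fetch          # any async callable that takes a URL
        self._interval = interval    # minimum seconds between request starts
        self._lock = asyncio.Lock()  # makes concurrent callers wait in turn
        self._last = None            # monotonic time of the previous request

    async def get(self, url):
        async with self._lock:
            if self._last is not None:
                wait = self._interval - (time.monotonic() - self._last)
                if wait > 0:
                    await asyncio.sleep(wait)
            self._last = time.monotonic()
        return await self._fetch(url)


async def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch runs offline.
    return f"<html><title>{url}</title></html>"


async def main():
    fetcher = ThrottledFetcher(fake_fetch, interval=0.1)
    urls = [f"https://example.com/{i}" for i in range(3)]
    start = time.monotonic()
    pages = await asyncio.gather(*(fetcher.get(u) for u in urls))
    return pages, time.monotonic() - start


pages, elapsed = asyncio.run(main())
print(len(pages), f"{elapsed:.2f}s")
```

Even though the three requests are started concurrently, the lock makes each one wait its turn, so the second and third start roughly 0.1 seconds after the one before.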

Cache downloaded files

While developing a scraper, you usually need to inspect the same HTML over and over. To make this easier, you can save downloaded files to the cache directory.

The context.get() method saves the downloaded file to the cache directory if the name parameter is supplied.

# Save https://www.python.jp

from scrapeacademy import context, run

async def save_index():
    page = await context.get("https://www.python.jp", name="python_jp_index")

run(save_index())

Later, you can load the saved HTML from the cache and scrape it in another script.

# Parse saved HTML file.

from bs4 import BeautifulSoup
from scrapeacademy import context

html = context.load("python_jp_index")
soup = BeautifulSoup(html, features="html.parser")
print(soup.title.text)
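The save-then-load workflow above can be pictured as a small name-keyed file cache. The sketch below is only an illustration under assumptions: the FileCache class, its method names, the ".html" suffix, and the temporary directory are all hypothetical, and Scrape Academy's actual cache layout may differ.

```python
import tempfile
from pathlib import Path


class FileCache:
    """Hypothetical name-keyed cache; not Scrape Academy's implementation."""

    def __init__(self, directory):
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def path(self, name):
        # Assumed layout: one file per name, stored as <name>.html.
        return self._dir / f"{name}.html"

    def save(self, name, text):
        self.path(name).write_text(text, encoding="utf-8")

    def load(self, name):
        return self.path(name).read_text(encoding="utf-8")


# Use a throwaway directory so the example leaves no files behind in the project.
cache_dir = tempfile.mkdtemp()
cache = FileCache(cache_dir)
cache.save("python_jp_index", "<html><title>demo</title></html>")
html = cache.load("python_jp_index")
print(html)
```

Keying the cache by a short name rather than by URL keeps the lookup readable in later scripts and on the command line.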

Command line utility

Scrape Academy provides the scrapeacademy command to make development easier.

You can inspect a cached file in a web browser.

$ scrapeacademy open python_jp_index

Or you can open the file in the vi editor as follows.

$ vi `scrapeacademy path python_jp_index`


Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

scape_academy-0.0.1-py2.py3-none-any.whl (3.2 kB)

Uploaded Python 2 Python 3
