Scrape Academy

Scrape Academy provides a framework and a utility that help you develop web scraping applications.

Simple web page scraping

Scrape Academy helps you download the web pages you want to scrape.

# Download a page from https://www.python.jp

from bs4 import BeautifulSoup
from scrapeacademy import context, run

async def run_simple():
    page = await context.get("https://www.python.jp")
    soup = BeautifulSoup(page, features="html.parser")
    print(soup.title.text)

run(run_simple())

scrapeacademy.run() starts an asyncio event loop and runs the scraping coroutine.

Inside the async function, use the context.get() method to download a page. context.get() throttles requests to the server; by default, it waits 0.1 seconds between requests.
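
To illustrate the kind of throttling described above, here is a minimal sketch of a rate limiter built on asyncio. It is not Scrape Academy's implementation: the Throttle class, its wait() method, and the demo() function are hypothetical names introduced only for this example, and the 0.1 second default simply mirrors the value mentioned above.

```python
import asyncio
import time


class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(self, interval: float = 0.1) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._last = None  # monotonic time of the previous call, if any

    async def wait(self) -> None:
        # The lock serializes callers, so concurrent tasks queue up in turn.
        async with self._lock:
            if self._last is not None:
                remaining = self.interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()


async def demo(n: int = 3, interval: float = 0.1) -> list:
    """Run n fake requests concurrently and return their start times."""
    throttle = Throttle(interval)
    stamps = []

    async def fake_request() -> None:
        await throttle.wait()            # blocks until the interval has passed
        stamps.append(time.monotonic())  # a real scraper would download here

    await asyncio.gather(*(fake_request() for _ in range(n)))
    return stamps
```

Calling asyncio.run(demo()) returns timestamps spaced roughly one interval apart, even though the fake requests were started concurrently.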

Cache downloaded files

While developing a scraper, you usually need to inspect the same HTML over and over. To make this easier, you can save downloaded files to the cache directory.

The context.get() method saves the downloaded file to the cache directory if the name parameter is supplied.

# Save https://www.python.jp

from scrapeacademy import context, run

async def save_index():
    page = await context.get("https://www.python.jp", name="python_jp_index")

run(save_index())

Later, you can load the saved HTML from the cache and scrape it in another script.

# Parse saved HTML file.

from bs4 import BeautifulSoup
from scrapeacademy import context

html = context.load("python_jp_index")
soup = BeautifulSoup(html, features="html.parser")
print(soup.title.text)
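
The save-then-load workflow above can be pictured as a small name-keyed file cache. The sketch below is an assumption for illustration only: cache_path(), save_page(), and load_page() are hypothetical helpers, and the one-file-per-name layout with an .html suffix is a guess, not Scrape Academy's actual on-disk format.

```python
import tempfile
from pathlib import Path


def cache_path(cache_dir: Path, name: str) -> Path:
    # Assumed layout: one file per name. The real layout may differ.
    return cache_dir / f"{name}.html"


def save_page(cache_dir: Path, name: str, html: str) -> Path:
    """Write the downloaded HTML under the given name and return its path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, name)
    path.write_text(html, encoding="utf-8")
    return path


def load_page(cache_dir: Path, name: str) -> str:
    """Read back HTML that was saved earlier under the same name."""
    return cache_path(cache_dir, name).read_text(encoding="utf-8")


def round_trip_demo() -> str:
    # Use a temporary directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = Path(tmp) / "cache"
        save_page(cache_dir, "python_jp_index", "<title>demo</title>")
        return load_page(cache_dir, "python_jp_index")
```

Keying the cache by a short, human-chosen name rather than by URL keeps the later lookup easy to type, both in a script and on the command line.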

Command line utility

Scrape Academy provides the scrapeacademy command to make development easier.

You can inspect a cached file in a web browser.

$ scrapeacademy open python_jp_index

Or you can open the file in the vi editor as follows.

$ vi `scrapeacademy path python_jp_index`


