A framework and a utility that help you develop web scraping applications.
Scrape Academy
Scrape Academy provides a framework and a utility that help you develop web scraping applications.
Simple web page scraping
Scrape Academy helps you to download web pages to scrape.
# Download a page from https://www.python.jp
from bs4 import BeautifulSoup
from scrapeacademy import context, run

async def run_simple():
    page = await context.get("https://www.python.jp")
    soup = BeautifulSoup(page, features="html.parser")
    print(soup.title.text)

run(run_simple())
scrapeacademy.run() starts an asyncio event loop and runs the scraping function.
In the async function, you can use the context.get() method to download a page. context.get() throttles requests to the server; by default, it waits 0.1 seconds between requests.
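The throttling behaviour described above can be pictured with a small standalone sketch. This is not Scrape Academy's implementation: the Throttle class and the fetch_all stand-in are hypothetical names, and no real HTTP request is made. It only shows one way to keep at least a fixed interval between the starts of successive requests in asyncio code.

```python
import asyncio
import time


class Throttle:
    """Hypothetical helper: keep at least `interval` seconds between calls."""

    def __init__(self, interval: float = 0.1) -> None:
        self.interval = interval
        # Create the throttle inside the running event loop so the lock
        # is bound to that loop on older Python versions.
        self._lock = asyncio.Lock()
        self._last = None  # monotonic time of the previous call, if any

    async def wait(self) -> None:
        # The lock makes concurrent callers queue up one after another.
        async with self._lock:
            if self._last is not None:
                remaining = self.interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()


async def fetch_all(urls, throttle):
    """Stand-in for real downloads: record when each request would start."""
    started = []

    async def fetch(url):
        await throttle.wait()
        started.append((url, time.monotonic()))

    await asyncio.gather(*(fetch(url) for url in urls))
    return started


async def main():
    throttle = Throttle(0.1)
    started = await fetch_all(["a", "b", "c"], throttle)
    first = started[0][1]
    for url, when in started:
        print(url, f"started after {when - first:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

Even though the three stand-in requests are launched together with asyncio.gather(), the shared throttle spaces them out, which is the effect the 0.1-second default is meant to have on the target server.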
Cache downloaded files
While developing a scraper, you usually need to inspect the HTML over and over. To make this easier, you can save downloaded files to the cache directory.
The context.get() method saves the downloaded file to the cache directory if the name parameter is supplied.
# Save https://www.python.jp
from scrapeacademy import context, run

async def save_index():
    page = await context.get("https://www.python.jp", name="python_jp_index")

run(save_index())
Later, you can load the saved HTML from the cache to scrape using another script.
# Parse the saved HTML file.
from bs4 import BeautifulSoup
from scrapeacademy import context

html = context.load("python_jp_index")
soup = BeautifulSoup(html, features="html.parser")
print(soup.title.text)
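To make the save-then-load round trip concrete, here is a minimal standalone sketch of a name-keyed file cache. It is an illustration only, not how Scrape Academy stores files: cache_path, save_page, load_page, the explicit cache_dir argument, and the ".html" suffix are all assumptions made for this example.

```python
import tempfile
from pathlib import Path


def cache_path(cache_dir: Path, name: str) -> Path:
    """Map a cache name such as "python_jp_index" to a file in cache_dir."""
    # The ".html" suffix is an assumption made for this sketch.
    return cache_dir / f"{name}.html"


def save_page(cache_dir: Path, name: str, html: str) -> Path:
    """Write downloaded HTML under the given name and return its path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, name)
    path.write_text(html, encoding="utf-8")
    return path


def load_page(cache_dir: Path, name: str) -> str:
    """Read back HTML that was saved earlier under the same name."""
    return cache_path(cache_dir, name).read_text(encoding="utf-8")


if __name__ == "__main__":
    # A temporary directory stands in for the real cache directory.
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cache"
        save_page(cache, "python_jp_index", "<title>Example</title>")
        print(load_page(cache, "python_jp_index"))
```

Keying the cache by a short name rather than by URL keeps the download script and the parsing script decoupled: the parser only needs to know the name.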
Command line utility
Scrape Academy provides the scrapeacademy command to make development easier.
You can inspect the cached files in a web browser.
$ scrapeacademy open python_jp_index
Or you can view the file with the vi editor as follows.
$ vi `scrapeacademy path python_jp_index`
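If you prefer to stay in Python, a rough equivalent of opening a cached file in the browser can be sketched with the standard library. This does not reproduce what the scrapeacademy command does internally; cached_file_uri and open_in_browser are hypothetical helper names, and the path is assumed to be one you already obtained, for example from the output of scrapeacademy path.

```python
import webbrowser
from pathlib import Path


def cached_file_uri(path) -> str:
    """Turn a local file path into a file:// URI that a browser accepts."""
    # as_uri() needs an absolute path, so resolve() is applied first.
    return Path(path).resolve().as_uri()


def open_in_browser(path) -> bool:
    """Ask the default browser to display the file; True if one was launched."""
    return webbrowser.open(cached_file_uri(path))
```

Calling open_in_browser() with the path of a cached page displays the saved HTML exactly as it was downloaded, which is handy when comparing what the parser sees with what the browser renders.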
File details
Details for the file scape_academy-0.0.1-py2.py3-none-any.whl.
File metadata
- Download URL: scape_academy-0.0.1-py2.py3-none-any.whl
- Upload date:
- Size: 3.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60699b43bb933207b72456b787a4a599e92c7cb379913f16e62fe22866954163 |
| MD5 | 0bbd7f52cb55bf79d7cbc263c32cac9b |
| BLAKE2b-256 | ba9cc8e3f3e16d4140a294e48083ed9d72c26f1d186ef9b14b51af0836fba28f |