Simple Python web scraper/page fetcher with cache
scrapesy
Easy and Pythonic way to get and parse a Web page
Usage
To get a Page object, use scrapesy.get(url). The Page object has two properties, page and request. page is a BeautifulSoup object. request is a Requests Response object.
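As a minimal sketch of the call described above, the helpers below read the HTTP status code from request and the title text from page. The names summarize and fetch_summary are illustrative and not part of Scrapesy. The sketch assumes the scrapesy package is installed and that page behaves like an ordinary BeautifulSoup object.

```python
def summarize(result):
    """Return (status_code, title_text) from a Page-like object.

    Assumes result.request is a Requests Response and result.page is a
    BeautifulSoup object, as described above.
    """
    title_tag = result.page.title  # None when the document has no <title>
    title_text = title_tag.string if title_tag is not None else None
    return result.request.status_code, title_text


def fetch_summary(url):
    """Fetch url with Scrapesy and summarize it (hypothetical helper)."""
    import scrapesy  # third-party; imported here so summarize() has no dependencies

    return summarize(scrapesy.get(url))
```

Calling fetch_summary("https://example.com") would perform a real request, or a cache lookup if the page was fetched before.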
However, if you only want the cache (see below) and do not need or want Beautiful Soup to parse the pages, pass parse=False to get(). This returns a Requests Response object instead of a Page object. Disabling both parsing and caching is possible but rather pointless, because parsing and caching are the main features of Scrapesy. If you find yourself always disabling both, just use Requests directly.
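Because get() can return two different kinds of object depending on parse, calling code may want one way to reach the underlying response. The helper below is a sketch under that assumption; response_of is a hypothetical name, not a Scrapesy function. It checks for the page attribute rather than request, because a Requests Response has a request attribute of its own (the prepared request), so testing for request would not tell the two apart.

```python
def response_of(result):
    """Return the Requests Response whether or not parsing was enabled.

    A Page exposes .page (the soup) and .request (the Response). With
    parse=False the result is assumed to be the Response itself.
    """
    return result.request if hasattr(result, "page") else result


# Intended use (requires scrapesy and network access):
#   import scrapesy
#   raw = scrapesy.get("https://example.com", parse=False)
#   print(response_of(raw).status_code)
```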
Caching
By default, Scrapesy caches pages, so repeat requests for a previously fetched page return almost instantly. The cache works automatically and is transparent to any code that does not interact with it directly, so you can use Scrapesy without understanding the cache at all.
To disable the cache, set scrapesy.caching = False. To re-enable it, set scrapesy.caching = True. To ignore the cache for a single call, add use_cache=False to your scrapesy.get() call.
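When the cache should be off for a block of code rather than for one call, the flag can be flipped and restored around that block. The context manager below is a sketch, not part of Scrapesy; caching_disabled is a hypothetical name. It assumes scrapesy.caching can be read as well as assigned.

```python
from contextlib import contextmanager


@contextmanager
def caching_disabled(module):
    """Temporarily set module.caching to False, then restore the old value."""
    previous = module.caching
    module.caching = False
    try:
        yield module
    finally:
        # Restore the earlier setting even if the body raised an exception.
        module.caching = previous


# Intended use (requires scrapesy and network access):
#   import scrapesy
#   with caching_disabled(scrapesy):
#       fresh = scrapesy.get("https://example.com")
#   # For a single call, use_cache=False is simpler:
#   fresh = scrapesy.get("https://example.com", use_cache=False)
```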
To empty the cache, call scrapesy.empty_cache().
To remove a single page from the cache, call scrapesy.uncache(url).
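One use of uncache() is to force a fresh copy of a page that is known to have changed. The helper below is a sketch; refresh is a hypothetical name, and it takes the scraper module as an argument so it can be tried with a stand-in object. How uncache() behaves for a URL that was never cached is not documented here, so the sketch assumes the URL was fetched before.

```python
def refresh(scraper, url):
    """Drop url from the cache, then fetch it again.

    scraper is expected to provide uncache(url) and get(url), as the
    scrapesy module does.
    """
    scraper.uncache(url)
    return scraper.get(url)


# Intended use (requires scrapesy and network access):
#   import scrapesy
#   scrapesy.get("https://example.com")            # first fetch, now cached
#   latest = refresh(scrapesy, "https://example.com")
```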
To enable selective caching, set scrapesy.cache_check to a function that takes url as an input and returns True if the page should be cached and False otherwise.
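As an example of such a function, the sketch below caches only URLs without a query string, on the assumption that search results and other parameterized pages change often. The name cache_without_query and the policy itself are illustrative.

```python
from urllib.parse import urlsplit


def cache_without_query(url):
    """Return True for URLs with no query string, False otherwise."""
    return urlsplit(url).query == ""


# Intended use (requires scrapesy):
#   import scrapesy
#   scrapesy.cache_check = cache_without_query
```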
Run demo.py for a demonstration of the impact of the cache.
Requirements
- Beautiful Soup 4
- Requests
- Python 3 (it may work on 2.7, but is not tested)
Note
This project was originally called PyScrape. If you find that name used anywhere in this repo, please report it as an issue!
Source Distribution
File details
Details for the file scrapesy-1.2.0.tar.gz.
File metadata
- Download URL: scrapesy-1.2.0.tar.gz
- Upload date:
- Size: 3.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | c9cb8c36b00972c44cf1b361dbb360059f94220e18c61ffae72c31b86c8b935a
MD5 | 55ea1bf9f90403b6cad7a374e5451e0f
BLAKE2b-256 | 522d96fa011fc8b70913b8041d45cc01efb56998c61ba09a89e803a2ce3be590