
Simple Python web scraper/page fetcher with cache


scrapesy


Easy and Pythonic way to get and parse a Web page

Usage

To get a Page object, call scrapesy.get(url). A Page object has two properties: page, which is a BeautifulSoup object, and request, which is a Requests Response object.
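A minimal sketch of that call, relying only on what this README states: scrapesy.get(url) returns an object whose page attribute is a BeautifulSoup object and whose request attribute is a Requests Response. The helper name page_title and the example URL are illustrative and not part of Scrapesy.

```python
def page_title(url):
    """Return (HTTP status code, <title> text or None) for url."""
    # Imported inside the function so this sketch loads even where Scrapesy
    # is not installed; a normal script would import it at the top.
    import scrapesy

    result = scrapesy.get(url)   # Page object, per the description above
    soup = result.page           # BeautifulSoup object
    response = result.request    # Requests Response object

    title_tag = soup.title       # first <title> tag, or None if absent
    title = title_tag.get_text(strip=True) if title_tag is not None else None
    return response.status_code, title


# Example call (performs a real network request):
#     page_title("https://example.com/")
```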

If you only want Scrapesy for its cache (see below) and do not need Beautiful Soup to parse the pages, pass parse=False to get(). The call then returns a Requests Response object instead of a Page object. Disabling both parsing and caching is possible but rather pointless, because those are the main features of Scrapesy. If you find yourself always disabling both, just use Requests directly.
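As a rough sketch of the unparsed mode, assuming the parse=False keyword behaves as described and the return value is an ordinary Requests Response (the helper name fetch_html is hypothetical):

```python
def fetch_html(url):
    """Fetch url through Scrapesy without Beautiful Soup parsing."""
    # Lazy import so the sketch loads without Scrapesy installed.
    import scrapesy

    # With parse=False the README says a Requests Response comes back
    # directly, rather than a Page object wrapping it.
    response = scrapesy.get(url, parse=False)
    response.raise_for_status()  # standard Requests call: raise on 4xx/5xx
    return response.text         # body decoded to str by Requests
```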

Caching

By default, Scrapesy implements a cache, so pages that have been requested previously are returned almost instantly. The cache operates automatically and is transparent to any code that does not specifically interact with it, so you can use Scrapesy without understanding the cache at all.

The cache can be disabled by setting scrapesy.caching = False; set scrapesy.caching = True to re-enable it. To bypass the cache for a single call, pass use_cache=False to your scrapesy.get() call.
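One way to keep the global switch from being left off by accident is a small context manager. This is a sketch, not part of Scrapesy: the name cache_disabled is hypothetical, and it assumes scrapesy.caching can be read as well as assigned.

```python
from contextlib import contextmanager


@contextmanager
def cache_disabled(module):
    """Temporarily set module.caching to False, then restore the old value."""
    previous = module.caching      # assumption: the flag is readable
    module.caching = False
    try:
        yield module
    finally:
        module.caching = previous  # restored even if the body raises


# Intended use (performs real requests):
#     import scrapesy
#     with cache_disabled(scrapesy):
#         fresh = scrapesy.get("https://example.com/")
#
# For a single call, the keyword documented above is simpler:
#     fresh = scrapesy.get("https://example.com/", use_cache=False)
```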

To empty the cache, call scrapesy.empty_cache().

To remove a single page from the cache, call scrapesy.uncache(url).
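A possible pattern built on these two calls is forcing a fresh copy of one page while leaving the rest of the cache alone. The helper name refetch is hypothetical, and this README does not say how uncache() behaves for a URL that was never cached, so treat that case as untested.

```python
def refetch(url):
    """Drop url from the cache, then fetch it again."""
    # Lazy import so the sketch loads without Scrapesy installed.
    import scrapesy

    scrapesy.uncache(url)     # remove just this page from the cache
    return scrapesy.get(url)  # fetched again and, by default, re-cached


# To discard everything instead of a single page:
#     scrapesy.empty_cache()
```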

To enable selective caching, set scrapesy.cache_check to a function that takes url as an input and returns True if the page should be cached and False otherwise.
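For illustration, a cache_check function might look like the sketch below. The policy itself (cache only http/https URLs that carry no query string) and the name should_cache are assumptions chosen for the example; only the url-in, bool-out contract comes from the text above.

```python
from urllib.parse import urlparse


def should_cache(url):
    """Return True for plain http(s) URLs, False for ones with a query string."""
    parts = urlparse(url)
    # Search results and similar parameterised pages tend to change often,
    # so this example policy leaves them out of the cache.
    return parts.scheme in ("http", "https") and not parts.query


# Wiring it in, using the attribute named above:
#     import scrapesy
#     scrapesy.cache_check = should_cache
```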

Run demo.py for a demonstration of the impact of the cache.
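If you want to measure the effect on your own URLs, a timing helper along these lines may be enough. It is not the contents of demo.py, and the name time_calls is hypothetical; it simply times repeated calls to whatever fetch function you hand it.

```python
import time


def time_calls(fetch, url, repeats=2):
    """Call fetch(url) `repeats` times and return each call's duration in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch(url)
        timings.append(time.perf_counter() - start)
    return timings


# Intended use (performs real requests); with caching enabled, the second
# timing would be expected to be much smaller than the first:
#     import scrapesy
#     print(time_calls(scrapesy.get, "https://example.com/"))
```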

Requirements

  • Beautiful Soup 4
  • Requests
  • Python 3 (it may work on 2.7, but is not tested)

Note

This project was originally called PyScrape. If you find that name used anywhere in this repo, please report it as an issue!
