
A basic but fast, persistent and thread-safe caching system

Project description

This package lets you efficiently retrieve pages from the Internet by caching the results of requests.

Basic commands

First, import the required module:

from webscrapetools import urlcaching

Initializing the cache:

urlcaching.set_cache_path('.wst_cache')

The optional _expiry_days_ parameter sets the cache expiry period; the default is 10 days.
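For example, to keep cached responses for only one day (this sketch assumes _expiry_days_ is passed as a keyword argument to set_cache_path, as described above):

urlcaching.set_cache_path('.wst_cache', expiry_days=1)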

Setting a cache path is a required step: without it, responses to URL calls are simply not cached. Cache data is stored in the specified folder, so re-using the same path makes the cache persistent. The folder is created on the fly if it does not exist. The following command empties the cache, making sure we start with no prior data:

urlcaching.empty_cache()

Opening a URL with the following command stores the response content behind the scenes, so that subsequent calls for the same URL will not hit the network.

urlcaching.open_url('http://www.google.com')
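Because the cache lives on disk, the same URL opened again, even from a new Python session, can be served from the cache folder instead of the network. A minimal sketch based on the persistence behaviour described above:

from webscrapetools import urlcaching

urlcaching.set_cache_path('.wst_cache')  # same folder as before, so previous entries are found
urlcaching.open_url('http://www.google.com')  # served from the cache, no network call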

Full example

from webscrapetools import urlcaching
import time

# Initializing the cache
urlcaching.set_cache_path('.wst_cache')

# Making sure we start from scratch
urlcaching.empty_cache()

# Demo with 5 identical calls... only the first one is delayed, all others hit the cache
for count_calls in range(1, 6):
    start_time = time.time()
    urlcaching.open_url('http://deelay.me/5000/http://www.google.com')
    duration = time.time() - start_time
    print('duration for call {}: {:0.2f}'.format(count_calls, duration))

# Cleaning up
urlcaching.empty_cache()

The code above outputs the following:

duration for call 1: 6.74
duration for call 2: 0.00
duration for call 3: 0.00
duration for call 4: 0.00
duration for call 5: 0.00

Example plugging in a custom client

The framework lets you customize the way you access the web, which makes it possible, for example, to drive a browser via Selenium (see the sketch after the example below).

from webscrapetools import urlcaching
urlcaching.set_cache_path('./output/tests', max_node_files=10, rebalancing_limit=100)

def dummy_client():
    # no real web client is needed here: the call function below builds its own content
    return None

def dummy_call(_, key):
    # the content is the numeric key repeated as many times as its value; the key is returned alongside it
    return '{:d}'.format(int(key)) * int(key), key

# simulating calls using the dummy client
keys = ('{:05d}'.format(count) for count in range(500))
for key in keys:
    urlcaching.open_url(key, init_client_func=dummy_client, call_client_func=dummy_call)

urlcaching.empty_cache()
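As a sketch of the same plug-in mechanism with a real browser, the snippet below assumes the (client, key) -> (content, key) contract shown by dummy_call above and uses Selenium's standard webdriver API; it illustrates the idea and is not part of webscrapetools itself.

from selenium import webdriver
from webscrapetools import urlcaching

urlcaching.set_cache_path('.wst_cache')

def selenium_client():
    # created once, then passed to every call as the first argument
    return webdriver.Firefox()

def selenium_call(driver, url):
    # load the page in the browser and return its rendered source together with the key
    driver.get(url)
    return driver.page_source, url

urlcaching.open_url('http://www.google.com', init_client_func=selenium_client, call_client_func=selenium_call)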

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webscrapetools-0.4.6.tar.gz (11.5 kB)

Uploaded Source

Built Distribution

webscrapetools-0.4.6-py3-none-any.whl (13.2 kB)

Uploaded Python 3

File details

Details for the file webscrapetools-0.4.6.tar.gz.

File metadata

  • Download URL: webscrapetools-0.4.6.tar.gz
  • Upload date:
  • Size: 11.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/28.8.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.0

File hashes

Hashes for webscrapetools-0.4.6.tar.gz
  • SHA256: 497c54177467f963b6507957de292cf6f25bfd4774d76f77844f6ee6ea05a029
  • MD5: 7ffb3410c3973d221bfb6e5dcfee811f
  • BLAKE2b-256: 8cb7e33d7f46b6a7208af1e59585367c4f38c8ecf16b412c1389e1faef8c3c3d


File details

Details for the file webscrapetools-0.4.6-py3-none-any.whl.

File metadata

  • Download URL: webscrapetools-0.4.6-py3-none-any.whl
  • Upload date:
  • Size: 13.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/28.8.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.0

File hashes

Hashes for webscrapetools-0.4.6-py3-none-any.whl
  • SHA256: 78c5dc33dc25c3d610fa7e98b7e15cc2de41a3f4d43a12179729aa4c024025af
  • MD5: 62d06c38fd416634e5b1b8d78ff904f8
  • BLAKE2b-256: d54156cd74a6b21434eb6f732e86b83d24d2133eced134cf3e9547b53ea2142a

