PickMeUp - Pick up failed list processing without re-processing

PickMeUp lets you pick up a failed run over a list of elements where it stopped, without re-processing the elements that already succeeded.

elements = [1,2,3,4,5,6,7,8]
with PickMeUp(elements, name="element_processing") as l:
    for e in l:
        result = process(e)
        save(result)

Suppose process fails for elements divisible by 4, so the script crashes at 4. You now have to fix process and re-run the script. Normally, you would re-process 1, 2, and 3, even though you already have results for those and process might be expensive. PickMeUp knows that the processing failed at 4 and only iterates over 4, 5, 6, 7, and 8, saving you from re-processing.

Without PickMeUp

  1. Process 1,2,3
  2. Fix process
  3. Process 1,2,3,4,5,6,7,8

Showcase GIF: Without PickMeUp

With PickMeUp

  1. Process 1,2,3
  2. Fix process
  3. Process 4,5,6,7,8

Showcase GIF: With PickMeUp
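
To make the comparison above concrete, here is a minimal sketch of how a resume mechanism of this kind could work. It is not PickMeUp's actual code: the class name ResumableList, the state_dir parameter, the state file name, and the use of pickle are all assumptions made for illustration. The only detail taken from this README is that state is kept by dumping the remaining elements to disk.

```python
import os
import pickle
import tempfile


class ResumableList:
    """Toy illustration of resumable iteration (hypothetical, not PickMeUp itself)."""

    def __init__(self, elements, name, state_dir=None):
        self._elements = list(elements)
        directory = state_dir or tempfile.gettempdir()
        # Assumed layout: one pickle file per processing name.
        self._state_path = os.path.join(directory, f"{name}.resume.pickle")
        self._remaining = []

    def __enter__(self):
        # If an earlier run left state behind, continue with what remained.
        if os.path.exists(self._state_path):
            with open(self._state_path, "rb") as f:
                self._remaining = pickle.load(f)
        else:
            self._remaining = list(self._elements)
        return self

    def __iter__(self):
        # An element is dropped only when the loop asks for the next one,
        # i.e. after the loop body finished it without raising.
        while self._remaining:
            yield self._remaining[0]
            self._remaining.pop(0)

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None and not self._remaining:
            # Everything was processed: clear the state.
            if os.path.exists(self._state_path):
                os.remove(self._state_path)
        else:
            # Something is left (crash or early exit): remember it for next time.
            with open(self._state_path, "wb") as f:
                pickle.dump(self._remaining, f)
        return False  # never swallow the exception


def process(e, fixed):
    """Stand-in for an expensive step that initially fails on multiples of 4."""
    if not fixed and e % 4 == 0:
        raise ValueError(f"cannot process {e}")
    return e * 10


with tempfile.TemporaryDirectory() as state_dir:
    first_run, second_run = [], []
    try:
        with ResumableList(range(1, 9), "element_processing", state_dir) as l:
            for e in l:
                process(e, fixed=False)
                first_run.append(e)
    except ValueError:
        pass  # the "crash" of the first script run

    # Second run, after process has been fixed.
    with ResumableList(range(1, 9), "element_processing", state_dir) as l:
        for e in l:
            process(e, fixed=True)
            second_run.append(e)

print(first_run)   # [1, 2, 3]
print(second_run)  # [4, 5, 6, 7, 8]
```

In this sketch the failing element stays at the front of the saved list, which is why the second run starts at 4 rather than 5.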

Installation

pip install pickmeup

Usage

Create a context from your list and a unique name, then iterate over the context variable instead of your list:

with PickMeUp(yourList, "some_unique_name") as l:
    for e in l:
        ...  # Do stuff with e

Example Use-Case: Scraping

Imagine you want to crawl content from a website. You have a list of 50 content pages, and you want to extract the src of one specific image from each page. You wrote the extractor (identifying the important image, extracting the src) by looking at the first 2 pages. It seems that the important image can be identified with the id=important-image:

import requests
from bs4 import BeautifulSoup


def scrape(url: str) -> str:
    """Returns the html of the given `url`"""
    resp = requests.get(url)
    # `resp.text` is the decoded body (str); `resp.content` would be raw bytes
    return resp.text

def extract_important_image_src(html: str) -> str:
    """Finds the important image in the given `html` and
    returns its `src` attribute"""
    # Name the parser explicitly so the result does not depend on installed parsers
    soup = BeautifulSoup(html, "html.parser")
    return soup.find('img', {'id': 'important-image'})['src']

URLS = ["example.com/content/1", "example.com/content/2", ...] 
for url in URLS:
    html = scrape(url)
    src = extract_important_image_src(html)
    print(f"{src} extracted!")

However, it just so happens that half of the content pages follow an old design, and those pages mark their image with id=important. Your extractor breaks: there is no element with id=important-image, so soup.find returns None and the subscript ['src'] fails:

> "example.com/content/1 extracted!"
> ...
> "example.com/content/25 extracted!"
> TypeError: 'NoneType' object is not subscriptable

So you enhance extract_important_image_src to handle the old design. But now you have to scrape the first 25 pages again or skip them manually. If you had used PickMeUp instead:

URLS = ["example.com/content/1", "example.com/content/2", ...]
with PickMeUp(URLS, name="url_processing") as l:
    for url in l:
        html = scrape(url)
        src = extract_important_image_src(html)
        print(f"{src} extracted!")

You could just update extract_important_image_src and re-run your script. It will pick up the scraping at the element that failed and proceed as if nothing happened. If there is another design change somewhere, the processing will fail again. But again, you can fix the issue and re-run the script without re-processing all the URLs that worked before!
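
The enhancement to extract_important_image_src mentioned above could look roughly like the sketch below. To keep it runnable without third-party packages, it uses the standard library's html.parser instead of BeautifulSoup; with BeautifulSoup, the same fallback would be a loop over soup.find calls for each known id. The names KNOWN_IMAGE_IDS and _ImageSrcFinder are hypothetical, and raising a ValueError when nothing matches is a design choice of this sketch, not something the original extractor does.

```python
from html.parser import HTMLParser

# Ids to try, in order of preference: new design first, then the old one.
KNOWN_IMAGE_IDS = ("important-image", "important")


class _ImageSrcFinder(HTMLParser):
    """Collects the `src` of the first <img> seen for each known id."""

    def __init__(self):
        super().__init__()
        self.src_by_id = {}

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        image_id = attributes.get("id")
        src = attributes.get("src")
        if image_id in KNOWN_IMAGE_IDS and src and image_id not in self.src_by_id:
            self.src_by_id[image_id] = src


def extract_important_image_src(html: str) -> str:
    """Returns the `src` of the important image for either page design."""
    finder = _ImageSrcFinder()
    finder.feed(html)
    finder.close()
    for image_id in KNOWN_IMAGE_IDS:
        if image_id in finder.src_by_id:
            return finder.src_by_id[image_id]
    # Fail with a clear message instead of an opaque TypeError.
    raise ValueError("no image with a known id found; is this yet another design?")
```

An explicit error on unknown designs keeps the failure easy to diagnose, so the next fix-and-re-run cycle starts from a clear message.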

Caveats

PickMeUp creates state for your processing by dumping the remaining elements to disk. This creates a few caveats:

  • If your elements are very large objects, the state will take up a lot of disk space.
  • If you pass a generator whose elements are expensive to create, PickMeUp will take the time to create them all and dump them to disk.
  • Changing the list elements between script runs without clearing the state results in undefined behavior!
  • List elements have to be serializable!
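
Two of these caveats can be checked before starting a long run. The helper below is a hypothetical pre-flight check, not part of PickMeUp: it materializes the input (so a generator is consumed once, up front) and verifies that every element can be serialized. It assumes pickle as the serialization format, which this README does not confirm.

```python
import pickle
from typing import Any, Iterable, List


def check_elements(elements: Iterable[Any]) -> List[Any]:
    """Materialize `elements` and verify that each one can be pickled.

    Hypothetical helper; pickle is an assumed serialization format.
    """
    materialized = list(elements)  # a generator is fully consumed here
    for index, element in enumerate(materialized):
        try:
            pickle.dumps(element)
        except Exception as error:
            raise TypeError(
                f"element {index} ({element!r}) cannot be pickled"
            ) from error
    return materialized


# Plain values pass and come back as a list.
urls = check_elements(f"example.com/content/{i}" for i in range(1, 4))
print(urls)
```

Passing the returned list on, rather than the original generator, avoids creating the elements a second time.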
