web/html scraping toolkit

whsk

whsk (pronounced "whisk") is a command line utility for web scraper authors.

It provides a set of utilities for inspecting HTML responses and applying selectors against them.

Installation

It is recommended you install whsk with uvx or pipx.

Running uvx whsk is the fastest way to get started with whsk.

whsk currently consists of two utilities:

whsk shell

whsk shell fetches a page, automatically parsing HTML, XML, or JSON responses. It then opens an IPython shell that lets you interact with the raw and parsed response.
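As an illustration of what "automatically parsing" can mean, the sketch below picks a parser from the response's content type. It is not whsk's actual logic: parse_body is a hypothetical helper, and the standard library's xml.etree.ElementTree stands in for lxml so the example has no dependencies.

```python
import json
import xml.etree.ElementTree as ET


def parse_body(content_type: str, text: str):
    """Return a parsed object for JSON or XML bodies, else the raw text.

    Hypothetical helper for illustration; whsk's own dispatch may differ.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";")[0].strip().lower()
    if mime == "application/json" or mime.endswith("+json"):
        return json.loads(text)
    if mime in ("application/xml", "text/xml") or mime.endswith("+xml"):
        return ET.fromstring(text)
    # HTML and anything unrecognised is returned untouched in this sketch;
    # whsk itself exposes HTML as an lxml.html.HtmlElement.
    return text


data = parse_body("application/json; charset=utf-8", '{"items": [1, 2]}')
feed = parse_body(
    "application/rss+xml",
    "<rss><channel><title>News</title></channel></rss>",
)
print(data["items"], feed.find("channel/title").text)
```

Keying on the content type rather than the URL is only one plausible design; a real tool might also sniff the body when the header is missing or wrong.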

When the command runs it will print a table of the variables it has loaded (which will depend on the type of page and particular flags passed):

$ uvx whsk shell https://example.com 
            variables
┌──────────┬───────────────────────┐
│ url      │ https://example.com   │
│ resp     │ <Response [200 OK]>   │
│ root     │ lxml.html.HtmlElement │
└──────────┴───────────────────────┘

In [1]:

In [1]: is the IPython prompt; the variables in the table are available for inspection and use.

If you pass a selector from the command line, that first query will be made for you:

$ uvx whsk shell https://example.com --xpath //p
            variables
┌──────────┬───────────────────────┐
│ url      │ https://example.com   │
│ resp     │ <Response [200 OK]>   │
│ root     │ lxml.html.HtmlElement │
│ selector │ //p                   │
│ selected │ 2 elements            │
└──────────┴───────────────────────┘

In [1]:
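Inside the shell, root is an lxml element, so further queries are calls such as root.xpath("//p"). The dependency-free sketch below mimics the selector / selected pair with the standard library's xml.etree.ElementTree as a stand-in. ElementTree supports only a subset of XPath (hence .//p rather than //p), and the sample markup is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a fetched page.
PAGE = """
<html>
  <body>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# With lxml (as in the whsk shell) this would be root.xpath("//p").
selector = ".//p"
selected = root.findall(selector)

print(f"{len(selected)} elements")
for element in selected:
    # itertext() walks nested tags, so the link text is included too.
    print("".join(element.itertext()).strip())

# Attribute values are read with .get() on the matched element.
links = [a.get("href") for a in root.iter("a")]
print(links)
```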

Options

 Usage: whsk shell [OPTIONS] URL                                                        
                                                                                        
 Launch an interactive Python shell for scraping                                        
                                                                                        
╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
│ *    url      TEXT  URL to scrape [default: None] [required]                         │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --ua                TEXT  User agent to make requests with                           │
│ --postdata  -p      TEXT  POST data (will make a POST instead of GET)                │
│ --header    -h      TEXT  Additional headers in format 'Name: Value'                 │
│ --css       -c      TEXT  css selector                                               │
│ --xpath     -x      TEXT  xpath selector                                             │
│ --help                    Show this message and exit.                                │
╰──────────────────────────────────────────────────────────────────────────────────────╯
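The --postdata and --header options map onto familiar HTTP concepts. As a rough sketch (not whsk's implementation, which may use a different HTTP client), the hypothetical build_request helper below uses urllib.request from the standard library to show how 'Name: Value' strings become headers and how supplying a body switches the method from GET to POST. No request is actually sent.

```python
from urllib.request import Request


def build_request(url, headers=(), postdata=None):
    """Build (but do not send) a request from CLI-style inputs."""
    parsed = {}
    for line in headers:
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'Name: Value', got {line!r}")
        parsed[name.strip()] = value.strip()
    # urllib issues a POST whenever a request body is present.
    data = postdata.encode() if postdata is not None else None
    return Request(url, data=data, headers=parsed)


get_req = build_request("https://example.com", headers=["Accept: text/html"])
post_req = build_request("https://example.com", postdata="q=whsk")
print(get_req.get_method(), post_req.get_method())
```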

whsk query

whsk query takes the same command line options as whsk shell, but instead of opening a shell it prints the results of the --css or --xpath selection and exits immediately.

As such, you must provide one of the two selector parameters.

This can be used for rapid testing of queries without opening the shell each time.
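One way to take advantage of this is to script several queries in a row. The sketch below only builds the command lines, using the flags documented here; build_query_command is a hypothetical helper, and actually running the commands (for example with subprocess.run) is left to the reader.

```python
import shlex


def build_query_command(url: str, selector: str, kind: str = "xpath") -> list:
    """Return an argv list for `uvx whsk query` with a single selector."""
    if kind not in ("css", "xpath"):
        raise ValueError("kind must be 'css' or 'xpath'")
    return ["uvx", "whsk", "query", url, f"--{kind}", selector]


candidates = [("xpath", "//p"), ("css", "h1"), ("css", "div > p")]
for kind, selector in candidates:
    argv = build_query_command("https://example.com", selector, kind)
    # shlex.join quotes arguments wherever the shell would need it.
    print(shlex.join(argv))
```

Passing an argv list rather than a single string avoids shell quoting problems with selectors that contain spaces or brackets.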

Options

Usage: whsk query [OPTIONS] URL                                                                       
                                                                                                       
 Run a one-off query against the URL                                                                   
                                                                                                       
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *    url      TEXT  URL to scrape [default: None] [required]                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────╮
│ --ua                TEXT  User agent to make requests with                                          │
│ --postdata  -p      TEXT  POST data (will make a POST instead of GET)                               │
│ --header    -h      TEXT  Additional headers in format 'Name: Value'                                │
│ --css       -c      TEXT  css selector                                                              │
│ --xpath     -x      TEXT  xpath selector                                                            │
│ --help                    Show this message and exit.                                               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯

Common Parameters

--ua

This parameter is provided as a shortcut to set common browser "User-Agent" headers.

It must be one of:

  • linux.chrome
  • linux.firefox
  • mac.chrome
  • mac.firefox
  • mac.safari
  • win.chrome
  • win.edge
  • win.firefox

Each name maps to a value in user_agents.py, which holds a relatively recent snapshot of a real user agent string for the browser in question.

If you need to set a custom user agent, use --header 'user-agent: whatever you need'
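The choice between the two flags can be sketched as follows. The preset names are copied from the list above; user_agent_args is a hypothetical helper for building a whsk command line and is not part of whsk itself.

```python
UA_PRESETS = frozenset({
    "linux.chrome", "linux.firefox",
    "mac.chrome", "mac.firefox", "mac.safari",
    "win.chrome", "win.edge", "win.firefox",
})


def user_agent_args(user_agent: str) -> list:
    """Return the whsk arguments that request the given user agent."""
    if user_agent in UA_PRESETS:
        # A known preset name goes straight to --ua.
        return ["--ua", user_agent]
    # Anything else is sent verbatim as a custom user-agent header.
    return ["--header", f"user-agent: {user_agent}"]


print(user_agent_args("mac.safari"))
print(user_agent_args("my-scraper/1.0"))
```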
