web/html scraping toolkit

whsk

whsk (pronounced "whisk") is a command line utility for web scraper authors.

It provides a set of utilities for inspecting HTML responses, and applying selectors against them.

Installation

It is recommended that you install whsk with uvx or pipx.

Running uvx whsk is the fastest way to get started.

whsk currently consists of two utilities:

whsk shell

whsk shell fetches a page, automatically parsing HTML, XML, or JSON responses. It then opens an IPython shell that lets you interact with the raw and parsed response.

When the command runs it will print a table of the variables it has loaded (which will depend on the type of page and particular flags passed):

$ uvx whsk shell https://example.com 
            variables
┌──────────┬───────────────────────┐
│ url      │ https://example.com   │
│ resp     │ <Response [200 OK]>   │
│ root     │ lxml.html.HtmlElement │
└──────────┴───────────────────────┘

In [1]:

The In [1]: is an IPython prompt; the variables in the table are available for inspection and use.

If you pass a selector from the command line, that first query will be made for you:

$ uvx whsk shell https://example.com --xpath //p
            variables
┌──────────┬───────────────────────┐
│ url      │ https://example.com   │
│ resp     │ <Response [200 OK]>   │
│ root     │ lxml.html.HtmlElement │
│ selector │ //p                   │
│ selected │ 2 elements            │
└──────────┴───────────────────────┘

In [1]:
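To give a feel for the kind of exploration the prompt is meant for, here is a minimal sketch of running an XPath-style query against a parsed tree and inspecting the matches. It is an illustration under stated assumptions, not whsk's own code: the inline HTML is made up, and the standard library's ElementTree stands in for lxml so the snippet runs without whsk installed. Inside the real shell, root is an lxml.html.HtmlElement, where the analogous call would be root.xpath("//p").

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (not the real contents of any site).
HTML = """<html><body><div>
<h1>Sample page</h1>
<p>First paragraph.</p>
<p><a href="https://example.com/more">More information</a></p>
</div></body></html>"""

# Stand-in for the `root` variable that whsk shell provides.
root = ET.fromstring(HTML)

# ElementTree supports a limited XPath subset; ".//p" finds every <p>
# below the root, much like "//p" does in lxml.
selected = root.findall(".//p")
print(f"{len(selected)} elements")

# Inspect the text of each match, including text inside child tags.
for el in selected:
    print("".join(el.itertext()).strip())

# Pull an attribute out of every link on the page.
links = [a.get("href") for a in root.iterfind(".//a")]
print(links)
```

The same loop-and-inspect pattern applies to the selected variable that the shell fills in when --xpath or --css is passed.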

Options

 Usage: whsk shell [OPTIONS] URL                                                        
                                                                                        
 Launch an interactive Python shell for scraping                                        
                                                                                        
╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
│ *    url      TEXT  URL to scrape [default: None] [required]                         │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --ua                TEXT  User agent to make requests with                           │
│ --postdata  -p      TEXT  POST data (will make a POST instead of GET)                │
│ --header    -h      TEXT  Additional headers in format 'Name: Value'                 │
│ --css       -c      TEXT  css selector                                               │
│ --xpath     -x      TEXT  xpath selector                                             │
│ --help                    Show this message and exit.                                │
╰──────────────────────────────────────────────────────────────────────────────────────╯
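As a rough illustration of what --postdata and --header mean for the outgoing request, the sketch below splits 'Name: Value' strings into header pairs and shows how supplying a body switches the method from GET to POST. This is an assumption-laden stand-in, not whsk's implementation: parse_header and build_request are hypothetical helpers, and the standard library's urllib.request.Request is used only because it makes the GET/POST switch easy to observe without any network access.

```python
from urllib.request import Request


def parse_header(raw):
    """Split a 'Name: Value' string into a (name, value) pair (hypothetical helper)."""
    # Split on the first colon only, so values such as URLs stay intact.
    name, sep, value = raw.partition(":")
    if not sep or not name.strip():
        raise ValueError(f"expected 'Name: Value', got {raw!r}")
    return name.strip(), value.strip()


def build_request(url, postdata=None, headers=()):
    """Build (but do not send) a request from shell-style options (hypothetical helper)."""
    data = postdata.encode() if postdata is not None else None
    parsed = dict(parse_header(h) for h in headers)
    # urllib treats a request with a body as a POST and one without as a GET.
    return Request(url, data=data, headers=parsed)


req = build_request(
    "https://example.com",
    postdata="q=1",
    headers=["Accept: text/html"],
)
print(req.get_method())                                  # POST
print(build_request("https://example.com").get_method())  # GET
```

Nothing is sent here; the request objects are only constructed so the effect of each option can be inspected.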

whsk query

whsk query takes the same command line options as whsk shell, but instead of opening a shell it prints the results of the --css or --xpath selection and exits immediately.

As such, you must provide one of the two selector parameters.

This can be used for rapid testing of queries without opening the shell each time.

Options

Usage: whsk query [OPTIONS] URL                                                                       
                                                                                                       
 Run a one-off query against the URL                                                                   
                                                                                                       
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *    url      TEXT  URL to scrape [default: None] [required]                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────╮
│ --ua                TEXT  User agent to make requests with                                          │
│ --postdata  -p      TEXT  POST data (will make a POST instead of GET)                               │
│ --header    -h      TEXT  Additional headers in format 'Name: Value'                                │
│ --css       -c      TEXT  css selector                                                              │
│ --xpath     -x      TEXT  xpath selector                                                            │
│ --help                    Show this message and exit.                                               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
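For rapid testing from a script, one possible approach is to assemble the whsk query command line in Python and refuse to build it when no selector is given, mirroring the requirement above. The query_command helper is hypothetical and only uses the flags listed in the help text. The source does not say what happens when both selectors are passed, so this sketch simply forwards whatever it receives.

```python
import shlex


def query_command(url, css=None, xpath=None):
    """Return the argv list for a one-off whsk query (hypothetical helper)."""
    # whsk query needs a selector, so fail early if neither is supplied.
    if css is None and xpath is None:
        raise ValueError("whsk query needs --css or --xpath")
    cmd = ["uvx", "whsk", "query", url]
    if css is not None:
        cmd += ["--css", css]
    if xpath is not None:
        cmd += ["--xpath", xpath]
    return cmd


cmd = query_command("https://example.com", xpath="//p")
# shlex.join quotes arguments where needed, giving a copy-pasteable command.
print(shlex.join(cmd))  # uvx whsk query https://example.com --xpath //p
```

The resulting list can be handed to subprocess.run(cmd) to execute the query; that step is left out so the sketch has no network dependency.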

Common Parameters

--ua

This parameter is provided as a shortcut to set common browser "User-Agent" headers.

It must be one of:

  • linux.chrome
  • linux.firefox
  • mac.chrome
  • mac.firefox
  • mac.safari
  • win.chrome
  • win.edge
  • win.firefox

These use the values in user_agents.py, each a relatively recent snapshot of a real user agent string for the browser in question.

If you need to set a custom user agent, use --header 'user-agent: whatever you need'
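The choice between the two options can be sketched as a small decision: a value from the list above goes to --ua, and anything else is sent as a user-agent header. The user_agent_args helper and the "my-scraper/1.0" string are hypothetical, introduced only for illustration. The shortcut names are copied from the list above, and the actual user agent strings live in whsk's user_agents.py, which is not reproduced here.

```python
# Shortcut names accepted by --ua, as listed in this document.
UA_SHORTCUTS = frozenset({
    "linux.chrome", "linux.firefox",
    "mac.chrome", "mac.firefox", "mac.safari",
    "win.chrome", "win.edge", "win.firefox",
})


def user_agent_args(value):
    """Return the whsk arguments that set the given user agent (hypothetical helper)."""
    if value in UA_SHORTCUTS:
        # Known shortcut: let whsk look up the real browser string itself.
        return ["--ua", value]
    # Anything else is treated as a literal, custom user agent.
    return ["--header", f"user-agent: {value}"]


print(user_agent_args("mac.safari"))
print(user_agent_args("my-scraper/1.0"))
```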
