
web/html scraping toolkit

whsk

whsk (pronounced "whisk") is a command line utility for web scraper authors.

It provides a set of utilities for inspecting HTML responses, and applying selectors against them.

Installation

Installing whsk with uvx or pipx is recommended.

Running uvx whsk is the fastest way to get started with whsk.

whsk currently consists of two utilities:

whsk shell

whsk shell fetches a page and automatically parses HTML, XML, or JSON responses. It then opens an IPython shell that lets you interact with the raw and parsed response.

When the command runs it will print a table of the variables it has loaded (which will depend on the type of page and particular flags passed):

$ uvx whsk shell https://example.com 
            variables
┌──────────┬───────────────────────┐
│ url      │ https://example.com   │
│ resp     │ <Response [200 OK]>   │
│ root     │ lxml.html.HtmlElement │
└──────────┴───────────────────────┘

In [1]:

The In [1]: is an IPython prompt; the variables in the table are available for inspection and use.
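As a rough illustration of what you might type at that prompt, the sketch below runs a few lookups against a root element. To stay dependency-free and runnable outside the shell, it builds root with the standard library's xml.etree.ElementTree from a made-up page (the PAGE string is an assumption, not real whsk output). lxml elements follow the same ElementTree API, so findall, iter, get, and .text should behave the same way on the root that whsk provides.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for a fetched page; inside whsk shell you would skip this
# and use the `root` variable that is already loaded.
PAGE = """<html><body>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples.</p>
<p><a href="https://www.iana.org/domains/example">More information</a></p>
</body></html>"""

root = ET.fromstring(PAGE)

# Find every <p> anywhere under the root.
paragraphs = root.findall(".//p")

# Text of the first paragraph and the href of every link.
first_text = paragraphs[0].text
links = [a.get("href") for a in root.iter("a")]

print(len(paragraphs))
print(first_text)
print(links)
```

Inside the real shell, root is an lxml element, so root.xpath("//p") is also available for full XPath expressions.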

If you pass a selector on the command line, that first query is run for you, and the table gains selector and selected entries:

$ uvx whsk shell https://example.com --xpath //p
            variables
┌──────────┬───────────────────────┐
│ url      │ https://example.com   │
│ resp     │ <Response [200 OK]>   │
│ root     │ lxml.html.HtmlElement │
│ selector │ //p                   │
│ selected │ 2 elements            │
└──────────┴───────────────────────┘

In [1]:

Options

 Usage: whsk shell [OPTIONS] URL                                                        
                                                                                        
 Launch an interactive Python shell for scraping                                        
                                                                                        
╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
│ *    url      TEXT  URL to scrape [default: None] [required]                         │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --ua                TEXT  User agent to make requests with                           │
│ --postdata  -p      TEXT  POST data (will make a POST instead of GET)                │
│ --header    -h      TEXT  Additional headers in format 'Name: Value'                 │
│ --css       -c      TEXT  css selector                                               │
│ --xpath     -x      TEXT  xpath selector                                             │
│ --help                    Show this message and exit.                                │
╰──────────────────────────────────────────────────────────────────────────────────────╯

whsk query

whsk query takes the same command line options as whsk shell, but instead of opening a shell it outputs the results of the --css or --xpath selection and then exits immediately.

As such, you must provide one of the two selector parameters.

This can be used for rapid testing of queries without opening the shell each time.
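If you want to drive whsk query from a script while iterating on selectors, something like the following sketch may help. build_query_argv and run_query are hypothetical helper names, not part of whsk. The sketch assumes uvx is on your PATH and that exactly one selector is passed per call; whether whsk accepts both selectors at once is not covered here.

```python
import subprocess


def build_query_argv(url, css=None, xpath=None):
    """Return an argv list for `uvx whsk query` (hypothetical helper)."""
    # Assumption: exactly one of the two selectors is given per call.
    if (css is None) == (xpath is None):
        raise ValueError("pass exactly one of css or xpath")
    flag, selector = ("--css", css) if css is not None else ("--xpath", xpath)
    return ["uvx", "whsk", "query", url, flag, selector]


def run_query(url, **selectors):
    """Run the query and return whatever whsk printed to stdout."""
    argv = build_query_argv(url, **selectors)
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


# Building the command does not touch the network; run_query would.
argv = build_query_argv("https://example.com", xpath="//p")
print(" ".join(argv))
```

Passing the arguments as a list avoids shell quoting problems with selectors that contain brackets or quotes.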

Options

 Usage: whsk query [OPTIONS] URL
                                                                                                       
 Run a one-off query against the URL                                                                   
                                                                                                       
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *    url      TEXT  URL to scrape [default: None] [required]                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────╮
│ --ua                TEXT  User agent to make requests with                                          │
│ --postdata  -p      TEXT  POST data (will make a POST instead of GET)                               │
│ --header    -h      TEXT  Additional headers in format 'Name: Value'                                │
│ --css       -c      TEXT  css selector                                                              │
│ --xpath     -x      TEXT  xpath selector                                                            │
│ --help                    Show this message and exit.                                               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯

Common Parameters

--ua

This parameter is provided as a shortcut to set common browser "User-Agent" headers.

It must be one of:

  • linux.chrome
  • linux.firefox
  • mac.chrome
  • mac.firefox
  • mac.safari
  • win.chrome
  • win.edge
  • win.firefox

Each option uses the corresponding value in user_agents.py, which holds a relatively recent snapshot of a real user agent string for the browser in question.

If you need to set a custom user agent, use --header 'user-agent: whatever you need'
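The choice between --ua and --header can be sketched as a small helper that picks the right flag for a given value. user_agent_args is a hypothetical name used only for illustration; the shortcut names are copied from the list above, and the fallback uses the 'Name: Value' header format shown in the options table.

```python
# Shortcut names accepted by --ua, copied from the list above.
UA_SHORTCUTS = {
    "linux.chrome",
    "linux.firefox",
    "mac.chrome",
    "mac.firefox",
    "mac.safari",
    "win.chrome",
    "win.edge",
    "win.firefox",
}


def user_agent_args(value):
    """Return whsk arguments for a shortcut name or a custom user agent."""
    if value in UA_SHORTCUTS:
        return ["--ua", value]
    # Anything else is sent as a literal User-Agent header.
    return ["--header", f"user-agent: {value}"]


print(user_agent_args("mac.safari"))
print(user_agent_args("my-scraper/1.0"))
```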
