# scrape-cli

A command-line tool to extract HTML elements using an XPath query or a CSS3 selector.

It is based on the great and simple scraping tool written by Jeroen Janssens.
## Installation

You can install scrape-cli using several methods:

### Using pipx (recommended for CLI tools)

```bash
pipx install scrape-cli
```

### Using uv (modern Python package manager)

```bash
# Install as a global CLI tool (recommended)
uv tool install scrape-cli

# Or install with uv pip
uv pip install scrape-cli

# Or run temporarily without installing
uvx scrape-cli --help
```

### Using pip

```bash
pip install scrape-cli
```

Or install from source:

```bash
git clone https://github.com/aborruso/scrape-cli
cd scrape-cli
pip install -e .
```
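Whichever method you use, a quick sanity check is to print the built-in help (the `--help` flag is the same one shown in the `uvx` example above):

```bash
# Confirm the executable is on PATH and responds
scrape --help
```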
## Requirements

- Python >= 3.6
- requests
- lxml
- cssselect
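If an installation seems broken, one quick check is that the three libraries above are importable (their module names match the package names listed here):

```bash
# Sanity check: all three dependencies should import cleanly
python -c "import requests, lxml, cssselect; print('dependencies OK')"
```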
## How does it work?

### Using the Test HTML File

In the `resources` directory you'll find a `test.html` file that you can use to test various scraping scenarios. Here are some examples:
- Extract all table data:

  ```bash
  # CSS
  scrape -e "table.data-table td" resources/test.html

  # XPath
  scrape -e "//table[contains(@class, 'data-table')]//td" resources/test.html
  ```

- Get all list items:

  ```bash
  # CSS
  scrape -e "ul.items-list li" resources/test.html

  # XPath
  scrape -e "//ul[contains(@class, 'items-list')]/li" resources/test.html
  ```

- Extract specific attributes (these compose well with ordinary Unix pipes; see the sketch after this list):

  ```bash
  # CSS
  scrape -e "a.external-link" -a href resources/test.html

  # XPath
  scrape -e "//a[contains(@class, 'external-link')]/@href" resources/test.html
  ```

- Check if an element exists:

  ```bash
  # CSS
  scrape -e "#main-title" --check-existence resources/test.html

  # XPath
  scrape -e "//h1[@id='main-title']" --check-existence resources/test.html
  ```

- Extract nested elements:

  ```bash
  # CSS
  scrape -e ".nested-elements p" resources/test.html

  # XPath
  scrape -e "//div[contains(@class, 'nested-elements')]//p" resources/test.html
  ```

- Get elements with specific attributes:

  ```bash
  # CSS
  scrape -e "[data-test]" resources/test.html

  # XPath
  scrape -e "//*[@data-test]" resources/test.html
  ```

- Additional XPath examples:

  ```bash
  # Get all links with an href attribute
  scrape -e "//a[@href]" resources/test.html

  # Get checked input elements
  scrape -e "//input[@checked]" resources/test.html

  # Get elements with multiple classes
  scrape -e "//div[contains(@class, 'class1') and contains(@class, 'class2')]" resources/test.html

  # Get the text content of a specific element
  scrape -e "//h1[@id='main-title']/text()" resources/test.html
  ```
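The attribute examples above pair naturally with standard Unix tools. As a small sketch, this pipeline de-duplicates the extracted hrefs (nothing beyond coreutils `sort` is assumed):

```bash
# Collect every external-link href from the test file,
# then de-duplicate the list
scrape -e "//a[contains(@class, 'external-link')]/@href" resources/test.html \
  | sort -u
```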
### General Usage Examples

A CSS selector query like this:

```bash
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -be 'table.wikitable > tbody > tr > td > b > a'
```

Note: when using the `-b` and `-e` options together, they must be specified in the order `-be` (body first, then expression). Using `-eb` will not work correctly.

or an XPath query like this one:

```bash
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -be "//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a"
```

gives you back:
```html
<html>
<head>
</head>
<body>
<a href="/wiki/Afghanistan" title="Afghanistan">
Afghanistan
</a>
<a href="/wiki/Albania" title="Albania">
Albania
</a>
<a href="/wiki/Algeria" title="Algeria">
Algeria
</a>
<a href="/wiki/Andorra" title="Andorra">
Andorra
</a>
<a href="/wiki/Angola" title="Angola">
Angola
</a>
<a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
Antigua and Barbuda
</a>
<a href="/wiki/Argentina" title="Argentina">
Argentina
</a>
<a href="/wiki/Armenia" title="Armenia">
Armenia
</a>
...
</body>
</html>
```
### Text Extraction

You can extract only the text content (without HTML tags) using the `-t` option, which is particularly useful for LLMs and text processing:

```bash
# Extract all text content from a page
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -t

# Extract text from specific elements
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -te 'table.wikitable td'

# Extract text from headings only
scrape -te 'h1, h2, h3' resources/test.html
```

The `-t` option automatically excludes text from `<script>` and `<style>` tags and cleans up whitespace for better readability.
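Because `-t` writes plain text to stdout, it chains into ordinary text tools. A small sketch, reusing the same page as above (only `wc` from coreutils is assumed beyond the commands already shown):

```bash
# Rough word count of the page's visible text
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
  | scrape -t \
  | wc -w
```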
Some notes on the commands:

- `-e` to set the query
- `-b` to add `<html>`, `<head>` and `<body>` tags to the HTML output
- `-t` to extract only text content (useful for LLMs and text processing)
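As the examples show, scrape reads HTML either from a file argument or from standard input. Assuming stdin redirection behaves like the pipe examples above, these two invocations should be equivalent:

```bash
# From a file argument
scrape -e "h1" resources/test.html

# From standard input
scrape -e "h1" < resources/test.html
```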
## Download files

- Source distribution: `scrape_cli-1.2.0.tar.gz`
- Built distribution: `scrape_cli-1.2.0-py3-none-any.whl`
### File details: scrape_cli-1.2.0.tar.gz

File metadata:

- Download URL: scrape_cli-1.2.0.tar.gz
- Upload date:
- Size: 7.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.2
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4ca5bea688018d94d70f5a1b680a4edf4acb9e71ee4ec087b777ac9c1c9b0ee4` |
| MD5 | `d9b06eb45f3c089ec0773f63c763a3b3` |
| BLAKE2b-256 | `0f8b77107e8d1ccaf902d4bf0de9af1e6ee600f15a7f13a2217eaaf4b39cf35f` |
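To check a downloaded archive against the SHA256 digest above, a standard `sha256sum` verification works (the file name and digest are taken from this table):

```bash
# Verify the source distribution against its published SHA256 digest
echo "4ca5bea688018d94d70f5a1b680a4edf4acb9e71ee4ec087b777ac9c1c9b0ee4  scrape_cli-1.2.0.tar.gz" \
  | sha256sum -c -
```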
### File details: scrape_cli-1.2.0-py3-none-any.whl

File metadata:

- Download URL: scrape_cli-1.2.0-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.2
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `55a329115fd258fe0f0cd4d84b1b6621a76ca70fc14296f2684b042391e76cbf` |
| MD5 | `3cfdfad673f8725369a27950554ea75c` |
| BLAKE2b-256 | `714b194431e178956d8ce7e5f7e9f0e341f29d203344a022b2673889fd2d9aab` |