scrape cli
It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector.
It's based on the great and simple scraping tool written by Jeroen Janssens.
Installation
You can install scrape-cli using pip:
pip install scrape-cli
Or install from source:
git clone https://github.com/[username]/scrape-cli
cd scrape-cli
pip install -e .
Requirements
- Python >=3.6
- requests
- lxml
- cssselect
How does it work?
A CSS selector query like this
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'
or an XPath query like this one:
curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be "//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a"
gives you back:
<html>
<head>
</head>
<body>
<a href="/wiki/Afghanistan" title="Afghanistan">
Afghanistan
</a>
<a href="/wiki/Albania" title="Albania">
Albania
</a>
<a href="/wiki/Algeria" title="Algeria">
Algeria
</a>
<a href="/wiki/Andorra" title="Andorra">
Andorra
</a>
<a href="/wiki/Angola" title="Angola">
Angola
</a>
<a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
Antigua and Barbuda
</a>
<a href="/wiki/Argentina" title="Argentina">
Argentina
</a>
<a href="/wiki/Armenia" title="Armenia">
Armenia
</a>
...
...
</body>
</html>
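To make the selection step concrete, here is a minimal sketch in Python. It is not the code scrape runs: the tool depends on lxml and cssselect, while this sketch uses only the standard library's ElementTree, which supports a limited XPath subset (no contains(), so the class is matched exactly). The SAMPLE markup and the select_links helper are hypothetical stand-ins for the Wikipedia page and the query.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for the Wikipedia page.
SAMPLE = """
<html><body>
<table class="wikitable"><tbody>
<tr><td><b><a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a></b></td></tr>
<tr><td><b><a href="/wiki/Albania" title="Albania">Albania</a></b></td></tr>
<tr><td><a href="/wiki/Other" title="Other">Other</a></td></tr>
</tbody></table>
</body></html>
"""


def select_links(markup):
    """Return the <a> elements nested as table.wikitable > tbody > tr > td > b > a."""
    root = ET.fromstring(markup.strip())
    # ElementTree has no contains(); an exact class match is used instead
    # (an assumption that only holds because the sample has a single class).
    path = ".//table[@class='wikitable']/tbody/tr/td/b/a"
    return root.findall(path)


links = select_links(SAMPLE)
for link in links:
    print(link.get("href"), link.text)
```

The third link is skipped because it is not wrapped in a b element, which mirrors how both queries above only keep links inside bold cells.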
Some notes on the commands:
- -e to set the query
- -b to add <html>, <head> and <body> tags to the HTML output
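The two options can be sketched with Python's argparse. This is an illustration under stated assumptions, not the tool's actual source: only the short flags -e and -b come from the notes above, while the dest names, build_parser and render are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the two options described above."""
    parser = argparse.ArgumentParser(prog="scrape")
    parser.add_argument("-e", dest="expression", required=True,
                        help="XPath query or CSS3 selector")
    parser.add_argument("-b", dest="body", action="store_true",
                        help="add <html>, <head> and <body> tags to the output")
    return parser


def render(fragments, body):
    """Join matched fragments, optionally wrapped in a minimal HTML skeleton."""
    if not body:
        return "\n".join(fragments)
    lines = ["<html>", "<head>", "</head>", "<body>", *fragments, "</body>", "</html>"]
    return "\n".join(lines)


# "-be" combines the two short flags, as in the examples above.
args = build_parser().parse_args(["-be", "table.wikitable > tbody > tr > td > b > a"])
print(render(['<a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a>'], args.body))
```

Without -b, render returns only the matched fragments, which is convenient when piping the result into another tool.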
How to use it in Linux
# go, for example, to the home folder
cd ~
# download scrape-cli
wget "https://github.com/aborruso/scrape-cli/releases/download/1.1/scrape-linux-x86_64"
# move it to a folder in your PATH, such as /usr/bin
sudo mv ./scrape-linux-x86_64 /usr/bin/scrape
# give it execute permission
sudo chmod +x /usr/bin/scrape
# use it
Please note: it does not seem to work on OSX (#8).
Note on building it
The original source is written in Python 2, so I built it in a Python 2 environment.
It requires two modules: install cssselect and then lxml, in this order, in that environment (using pip).
I built it using pyinstaller with this command: pyinstaller --onefile scrape.py.
Once built, it is a standalone executable that can be used in any environment.