Project description

scrape cli

It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector.

It's based on the great and simple scraping tool written by Jeroen Janssens.

Installation

You can install scrape-cli using pip:

pip install scrape-cli

Or install from source:

git clone https://github.com/[username]/scrape-cli
cd scrape-cli
pip install -e .

Requirements

  • Python >=3.6
  • requests
  • lxml
  • cssselect

How does it work?

A CSS selector query like this one:

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'

or an XPath query like this one:

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be '//table[contains(@class, "wikitable")]/tbody/tr/td/b/a'

gives you back:

<html>
 <head>
 </head>
 <body>
  <a href="/wiki/Afghanistan" title="Afghanistan">
   Afghanistan
  </a>
  <a href="/wiki/Albania" title="Albania">
   Albania
  </a>
  <a href="/wiki/Algeria" title="Algeria">
   Algeria
  </a>
  <a href="/wiki/Andorra" title="Andorra">
   Andorra
  </a>
  <a href="/wiki/Angola" title="Angola">
   Angola
  </a>
  <a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
   Antigua and Barbuda
  </a>
  <a href="/wiki/Argentina" title="Argentina">
   Argentina
  </a>
  <a href="/wiki/Armenia" title="Armenia">
   Armenia
  </a>
...
...
 </body>
</html>

Some notes on the commands:

  • -e sets the query (an XPath expression or a CSS selector)
  • -b wraps the output in <html>, <head> and <body> tags, so the result is a well-formed HTML document.
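As a rough illustration of what these options do, here is a minimal sketch in pure standard-library Python (the real tool is built on lxml and cssselect; the names `TagExtractor` and `scrape` below are illustrative, not the tool's actual API). It extracts every occurrence of a given tag, and optionally wraps the results in `<html>`/`<head>`/`<body>` tags like the -b flag:

```python
# Illustrative sketch only: extract all elements with a given tag name
# from an HTML string, mimicking `scrape -be '<tag>'`.
from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    """Collect and re-serialize every occurrence of one tag, e.g. all <a>."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # > 0 while inside a matching element
        self.pieces = []  # serialized fragments of matching elements

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        if self.depth:  # inside a match: re-serialize the tag with its attributes
            attr_str = "".join(f' {k}="{v}"' for k, v in attrs)
            self.pieces.append(f"<{tag}{attr_str}>")

    def handle_data(self, data):
        if self.depth:
            self.pieces.append(data)

    def handle_endtag(self, tag):
        if self.depth:
            self.pieces.append(f"</{tag}>")
        if tag == self.tag and self.depth:
            self.depth -= 1

def scrape(html, tag, body=False):
    parser = TagExtractor(tag)
    parser.feed(html)
    out = "".join(parser.pieces)
    if body:  # mimic -b: wrap the matches in html/head/body tags
        out = f"<html><head></head><body>{out}</body></html>"
    return out

html = '<td><b><a href="/wiki/Albania" title="Albania">Albania</a></b></td>'
print(scrape(html, "a", body=True))
# → <html><head></head><body><a href="/wiki/Albania" title="Albania">Albania</a></body></html>
```

The real tool does far more (full CSS3 selectors and XPath 1.0 via lxml), but the shape of the output is the same.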

How to use it in Linux

# for example, go to the home folder
cd ~
# download scrape-cli
wget "https://github.com/aborruso/scrape-cli/releases/download/1.1/scrape-linux-x86_64"
# move it to a folder in your PATH, such as /usr/bin
sudo mv ./scrape-linux-x86_64 /usr/bin/scrape
# give it execute permission
sudo chmod +x /usr/bin/scrape
# use it

Please note: on macOS it does not seem to work (#8).

Note on building it

The original source is written in Python 2, so I built it in a Python 2 environment.
There are two module requirements: in that environment, install cssselect and then lxml, in this order (using pip).

I built it using PyInstaller with this command: pyinstaller --onefile scrape.py.

Once built, it is a standalone executable that can be used on any machine with the same OS and architecture, without needing a Python installation.
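At runtime the tool pairs lxml (XPath) with cssselect (CSS selectors). As a hedged illustration of the XPath side, here is a standard-library sketch: ElementTree supports only a limited XPath subset (so predicates like contains(@class, ...) are unavailable, and the markup must be well-formed), while the real tool uses lxml with full XPath 1.0. The sample HTML below is a made-up fragment mirroring the Wikipedia example above:

```python
# Illustrative sketch only: XPath-style extraction with the stdlib's
# ElementTree, which supports a small XPath subset (paths, .//, no
# contains()). The real tool uses lxml for full XPath 1.0 support.
import xml.etree.ElementTree as ET

html = """<table class="wikitable"><tbody>
<tr><td><b><a href="/wiki/Afghanistan">Afghanistan</a></b></td></tr>
<tr><td><b><a href="/wiki/Albania">Albania</a></b></td></tr>
</tbody></table>"""

root = ET.fromstring(html)
# Same structural path as the README's XPath query, minus the class predicate:
links = root.findall(".//tbody/tr/td/b/a")
for a in links:
    print(a.get("href"), a.text)
# → /wiki/Afghanistan Afghanistan
# → /wiki/Albania Albania
```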

License

MIT

Download files

Download the file for your platform.

Source Distribution

scrape_cli-1.1.1.tar.gz (4.7 kB)


Built Distribution

scrape_cli-1.1.1-py3-none-any.whl (5.2 kB)


File details

Details for the file scrape_cli-1.1.1.tar.gz.

File metadata

  • Download URL: scrape_cli-1.1.1.tar.gz
  • Upload date:
  • Size: 4.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.2

File hashes

Hashes for scrape_cli-1.1.1.tar.gz
  • SHA256: e9d949069bd0db30c3e8c9b24aa7f6440b8a372f77131711f5f5224ac24c595b
  • MD5: ce8424dc68a77cd3330f4012d916f669
  • BLAKE2b-256: 03e161e5691ca0c22cfca583b095305402929d5b350b81d865e6964a4d7ad225

File details

Details for the file scrape_cli-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: scrape_cli-1.1.1-py3-none-any.whl
  • Upload date:
  • Size: 5.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.2

File hashes

Hashes for scrape_cli-1.1.1-py3-none-any.whl
  • SHA256: 475ab16f29a150c3217b70b273f8b0722cff4bbb04d70037a108767a140b1a2a
  • MD5: 723d80ca8d38f5eaf1056c806090feca
  • BLAKE2b-256: 4b889f6e031cae998049871f914044dcc989fe51fcb590789350dd1320445c0d
