URL Scrub
A tool for parsing the webpage at a URL into JSON + RDF.
Setup
Dependencies
- Python: 3.10
- geckodriver or chromedriver
Installation Process
- Install urlscrub with pip:

      python3.10 -m pip install urlscrub
- Install geckodriver
  - Download Firefox and install.
    - Linux (Ubuntu):

          sudo apt-get install firefox
  - Download the geckodriver archive for your platform.
  - Unzip the geckodriver / geckodriver.exe file into a preferred directory.
  - Append the directory containing geckodriver to your PATH variable (see Guides below).
- Install chromedriver
  - Download Google Chrome and install.
  - Find the version of Google Chrome you have installed.
  - Download the chromedriver.zip with the closest matching version number.
    - An exact version match is not required (e.g. chromedriver 102.0.5005.61 with Google Chrome 102.0.5005.115).
  - Unzip the chromedriver / chromedriver.exe file into a preferred directory.
  - Append the directory containing chromedriver to your PATH variable (see Guides below).
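The "closest matching version" rule above can be approximated in a few lines of Python. This is only a sketch: the function name `same_build` and the choice to compare the first three dot-separated fields are assumptions made for illustration, not part of urlscrub or chromedriver. The second driver version in the usage lines is likewise just an example of a mismatch.

```python
def same_build(chrome_version: str, driver_version: str, parts: int = 3) -> bool:
    """Return True when the first `parts` dot-separated fields are identical."""
    chrome_fields = chrome_version.strip().split(".")[:parts]
    driver_fields = driver_version.strip().split(".")[:parts]
    # Require that enough fields were present, then compare them as strings.
    return len(chrome_fields) == parts and chrome_fields == driver_fields


# Version pair from the example above: same 102.0.5005 prefix.
print(same_build("102.0.5005.115", "102.0.5005.61"))
# A driver from a different major release does not match.
print(same_build("102.0.5005.115", "101.0.4951.41"))
```

Comparing only the leading fields tolerates a differing patch number, which is the case the example in the list describes.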
Command Line Usage
- Command:

      urlscrub --skip-cookies --driver "chrome" -l "https://www.amazon.com/All-new-Kindle-Oasis-now-with-adjustable-warm-light/dp/B07GRSK3HC"
- Response:

      {
        "results": [
          {
            "type": "product",
            "productTitle": "Kindle Oasis \u2013 With adjustable warm light",
            "availability": "In Stock.",
            "rating": "19,734 ratings",
            "imageURL": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SX679_.jpg"
          }
        ]
      }
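Once the response is captured as text, it can be consumed with Python's standard `json` module. The sketch below uses the sample response shown above verbatim. The helper `rating_count` and the variable names are hypothetical and exist only for this illustration, and the field layout is assumed to match that single sample.

```python
import json

# Sample response copied from the example above (raw string keeps \u2013 as a JSON escape).
RESPONSE = r'''{ "results": [ { "type": "product", "productTitle": "Kindle Oasis \u2013 With adjustable warm light", "availability": "In Stock.", "rating": "19,734 ratings", "imageURL": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SX679_.jpg" } ] }'''


def rating_count(rating: str) -> int:
    """Turn a string such as '19,734 ratings' into an integer."""
    digits = rating.split()[0].replace(",", "")
    return int(digits)


data = json.loads(RESPONSE)
product = data["results"][0]

title = product["productTitle"]            # JSON escape decoded to an en dash
count = rating_count(product["rating"])    # numeric rating count
in_stock = product["availability"].startswith("In Stock")

print(title, count, in_stock)
```

Other page types may return different fields, so real code should check for missing keys before indexing.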
Guides
- Appending directories to your PATH environment variable
  - Windows Guide
  - Linux:
    - Append the path to your .bashrc / .zshrc:

          export PATH="<geckodriver_dir>/:$PATH"
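After editing PATH, it can help to confirm that the driver is actually discoverable from a new shell. A minimal check using Python's standard `shutil.which` might look like the following. The function name `find_driver` is an assumption for illustration and is not provided by urlscrub.

```python
import shutil


def find_driver(names=("geckodriver", "chromedriver"), path=None):
    """Map each driver name to its resolved location, or None if not found.

    `path` defaults to the current PATH environment variable.
    """
    return {name: shutil.which(name, path=path) for name in names}


for name, location in find_driver().items():
    print(f"{name}: {location or 'not found on PATH'}")
```

Only one of the two drivers needs to resolve, depending on which browser you intend to use.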