Tool for parsing the webpage at a given URL into JSON + RDF.
Project description
URL Scrub
Setup
Dependencies
- Python 3.10
- `geckodriver` or `chromedriver`
Installation Process
- Install `urlscrub` with `pip`: `python3.10 -m pip install urlscrub`
- Install `geckodriver`
  - Download Firefox and install.
    - Linux (Ubuntu): `sudo apt-get install firefox`
  - Unzip the `geckodriver`/`geckodriver.exe` file into a preferred directory.
  - Append the directory containing `geckodriver` to your `PATH` variable. (Guide)
- Install `chromedriver`
  - Download Google Chrome and install.
  - Find the version of Google Chrome you have installed.
  - Download the `chromedriver.zip` with the closest matching version number.
    - Exact version number not required (Ex: chromedriver `102.0.5005.61` w/ Google Chrome `102.0.5005.115`)
  - Unzip the `chromedriver`/`chromedriver.exe` file into a preferred directory.
  - Append the directory containing `chromedriver` to your `PATH` variable. (Guide)
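The chromedriver step says an exact version match is not required, and its example pairs two builds that differ only in the last component. A minimal sketch of that check follows, assuming that agreement on the first three components is what matters; the source gives only the one example, so treat this rule and the helper names `release_prefix` and `versions_compatible` as illustrative assumptions.

```python
def release_prefix(version, parts=3):
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(piece) for piece in version.split(".")[:parts])


def versions_compatible(browser_version, driver_version, parts=3):
    """Assumed rule: the versions match if their leading components are equal."""
    return release_prefix(browser_version, parts) == release_prefix(driver_version, parts)


if __name__ == "__main__":
    # The pairing given in the installation steps.
    print(versions_compatible("102.0.5005.115", "102.0.5005.61"))
```

Passing `parts=1` would loosen the comparison to the major version only, if that turns out to be sufficient for your browser and driver.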
Command Line Usage
- Command:
  `urlscrub --skip-cookies --driver "chrome" -l "https://www.amazon.com/All-new-Kindle-Oasis-now-with-adjustable-warm-light/dp/B07GRSK3HC"`
- Response:
  `{ "results": [ { "type": "product", "productTitle": "Kindle Oasis \u2013 With adjustable warm light", "availability": "In Stock.", "rating": "19,734 ratings", "imageURL": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SX679_.jpg" } ] }`
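The same command can be driven from a script. The sketch below builds the argument list from the flags shown in the command (`--skip-cookies`, `--driver`, `-l`) and parses a response shaped like the one shown. It assumes `urlscrub` writes only that JSON document to standard output, which the source does not state; `build_command`, `parse_results`, and `scrub` are hypothetical helper names, not part of the package.

```python
import json
import subprocess


def build_command(url, driver="chrome", skip_cookies=True):
    """Assemble the urlscrub argument list using the flags from the usage example."""
    command = ["urlscrub"]
    if skip_cookies:
        command.append("--skip-cookies")
    command += ["--driver", driver, "-l", url]
    return command


def parse_results(output_text):
    """Return the list stored under "results" in the JSON response."""
    return json.loads(output_text)["results"]


def scrub(url, driver="chrome"):
    """Run urlscrub and parse its output (assumes clean JSON on stdout)."""
    completed = subprocess.run(
        build_command(url, driver),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_results(completed.stdout)
```

Calling `scrub(url)` requires `urlscrub` and a matching browser driver to be installed and on `PATH`; each returned item is a dictionary with keys such as `type` and `productTitle`, as in the response above.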
Guides
- Appending directories to your `PATH` environment variable.
  - Windows Guide
  - Linux:
    - Append the path in your `.bashrc`/`.zshrc`: `export PATH="<geckodriver_dir>/:$PATH"`
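After editing `PATH`, it can help to confirm that the driver is actually discoverable from a new shell. A small sketch using the standard library's `shutil.which` follows; `locate_drivers` is a hypothetical helper name, and the executable names are the ones used in the installation steps.

```python
import shutil


def locate_drivers(names=("geckodriver", "chromedriver")):
    """Map each driver name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_drivers().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Only one of the two drivers needs to resolve, matching whichever browser you pass to `--driver`.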
Download files
Source Distribution
File details
Details for the file urlscrub-0.1.0.tar.gz.
File metadata
- Download URL: urlscrub-0.1.0.tar.gz
- Size: 8.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eb273932060d70d725fedbd916d55bae66ae5685741ba1911f4b44aab0f1c61a |
| MD5 | c75b63003a86afa4258b614b3acf16ce |
| BLAKE2b-256 | a211f662281213d63de4926d03877a45ecda2813ff14e7e6df57b2723211bbf2 |
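A downloaded copy of `urlscrub-0.1.0.tar.gz` can be checked against the SHA256 digest listed above. This is a minimal sketch using the standard library's `hashlib`; the helper names `sha256_of` and `matches_published_digest` and the local file path are assumptions for illustration.

```python
import hashlib

# SHA256 digest published for urlscrub-0.1.0.tar.gz in the table above.
PUBLISHED_SHA256 = "eb273932060d70d725fedbd916d55bae66ae5685741ba1911f4b44aab0f1c61a"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=PUBLISHED_SHA256):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_digest("urlscrub-0.1.0.tar.gz")` should return `True` for an unmodified download saved under that name in the current directory.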