URL Scrub

A tool for parsing the web page at a given URL into JSON and RDF.

Setup

Dependencies

- Python 3.10
- geckodriver (for Firefox) or chromedriver (for Google Chrome)

Installation Process

Install urlscrub with pip
```
python3.10 -m pip install urlscrub
```
Install geckodriver
- Download and install Firefox.
  - Linux (Ubuntu):
```
sudo apt-get install firefox
```
- Download geckodriver.zip.
- Unzip the geckodriver (or geckodriver.exe) file into a directory of your choice.
- Append the directory containing geckodriver to your PATH variable (see Guides).
Install chromedriver
- Download and install Google Chrome.
- Find the version of Google Chrome you have installed.
  - Open the Google Chrome web browser.
  - Click the three vertical dots at the top right.
  - At the bottom of the dropdown, select Help, then About Google Chrome.
  - Note the version number displayed (e.g. 102.0.5005.115).
- Download the chromedriver.zip whose version number most closely matches your Chrome version.
  - An exact match is not required (e.g. chromedriver 102.0.5005.61 works with Google Chrome 102.0.5005.115).
- Unzip the chromedriver (or chromedriver.exe) file into a directory of your choice.
- Append the directory containing chromedriver to your PATH variable (see Guides).
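
The version-matching step above can be checked with a few lines of Python. This is a minimal sketch, not part of urlscrub: the helper names are hypothetical, and the rule that the first three components (major.minor.build) must agree is an assumption drawn from the example versions given above.

```python
def version_prefix(version: str, parts: int = 3) -> tuple[int, ...]:
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(p) for p in version.split(".")[:parts])


def versions_compatible(chrome_version: str, driver_version: str) -> bool:
    """Assumed rule: compatible when major.minor.build agree; the last part may differ."""
    return version_prefix(chrome_version) == version_prefix(driver_version)


# The pair from the installation notes above.
print(versions_compatible("102.0.5005.115", "102.0.5005.61"))  # True
# A driver from a different major release.
print(versions_compatible("102.0.5005.115", "101.0.4951.41"))  # False
```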

Command Line Usage

Command:

urlscrub --skip-cookies --driver "chrome" -l "https://www.amazon.com/All-new-Kindle-Oasis-now-with-adjustable-warm-light/dp/B07GRSK3HC"

Response:

{
  "results": [
    {
      "type": "product",
      "productTitle": "Kindle Oasis \u2013 With adjustable warm light",
      "availability": "In Stock.",
      "rating": "19,734 ratings",
      "imageURL": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SX679_.jpg"
    }
  ]
}
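
If you want to drive the command from Python and work with the response, something like the following may help. It is a sketch under assumptions: `build_command` and `rating_count` are hypothetical helpers, only the flags shown in the command above are used, and the sample response is copied from the output above rather than produced by a live run.

```python
import json
import re
from typing import Optional


def build_command(url: str, driver: str = "chrome", skip_cookies: bool = True) -> list:
    """Assemble the urlscrub argument list using only the flags shown above."""
    cmd = ["urlscrub"]
    if skip_cookies:
        cmd.append("--skip-cookies")
    cmd += ["--driver", driver, "-l", url]
    return cmd


def rating_count(rating: str) -> Optional[int]:
    """Extract the leading number from a string such as '19,734 ratings'."""
    match = re.search(r"\d[\d,]*", rating)
    return int(match.group().replace(",", "")) if match else None


# Sample response copied from the documentation above (raw string keeps the JSON escape).
SAMPLE_RESPONSE = r"""
{
  "results": [
    {
      "type": "product",
      "productTitle": "Kindle Oasis \u2013 With adjustable warm light",
      "availability": "In Stock.",
      "rating": "19,734 ratings",
      "imageURL": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SX679_.jpg"
    }
  ]
}
"""

# The list could be passed to subprocess.run(..., capture_output=True, text=True),
# assuming the JSON is written to standard output.
print(build_command("https://example.com/item"))

data = json.loads(SAMPLE_RESPONSE)
for item in data["results"]:
    print(item["type"], "|", item["productTitle"], "|", rating_count(item["rating"]))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with URLs that contain characters such as `&` or `?`.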

Guides

Appending directories to your PATH environment variable.
- Windows Guide
- Linux:
  - Add the following line to your .bashrc or .zshrc, replacing <geckodriver_dir> with the directory that contains the driver:
```
export PATH="<geckodriver_dir>/:$PATH"
```
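
After updating PATH (and opening a new shell), you can confirm that the drivers are discoverable. The snippet below is a small sketch using the standard library's `shutil.which`; the `find_drivers` helper is a hypothetical name, not part of urlscrub.

```python
import shutil


def find_drivers(names=("geckodriver", "chromedriver")) -> dict:
    """Map each driver name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Only one of the two drivers needs to be found, depending on which browser you intend to use.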
Guide to install VcXsrv for running Firefox on WSL2

Project details

Version 0.1.0, released Jul 9, 2022.