Customizable web scraper that sends alerts when criteria are met on websites.
WebScraper
- This program can scrape data from websites using different scrapers, and send an email when matches or changes are found, depending on the scraper used
- There are 2 types of scrapers:
  - Generic: Can scrape any website, but might not be as exact
  - Specific: Can scrape only specific websites, but will be more exact
Generic Scrapers
- Text
- Diff
Specific Scrapers
- Cars.com
How to use
Text
- Set these specific env variables
SCRAPER=text # Scraper to use
URL=<URL> # URL to scrape
TEXT=<TEXT> # Text to look for
- Ensure all other required env variables are set
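The check the text scraper performs can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: the function names `page_contains` and `check_once` are hypothetical, and case-insensitive matching is an assumption.

```python
import urllib.request


def page_contains(html: str, text: str) -> bool:
    """Return True when `text` occurs in `html` (case-insensitive, by assumption)."""
    return text.casefold() in html.casefold()


def check_once(url: str, text: str) -> bool:
    """Fetch `url` once and report whether `text` appears in the response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return page_contains(html, text)
```

Under these assumptions, `check_once(os.environ["URL"], os.environ["TEXT"])` would be called once per scrape, with an email sent whenever it returns `True`.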
Diff
- Set these specific env variables
SCRAPER=diff # Scraper to use
URL=<URL> # URL to scrape
PERCENTAGE=<PERCENTAGE_DIFF> # Percentage difference to look for
- Ensure all other required env variables are set
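One way to read `PERCENTAGE` is as a threshold on how much the page has changed since the previous scrape. The sketch below measures that with the standard library's `difflib`. Both the metric and the names `percent_changed` and `should_alert` are assumptions for illustration; the project may compute the difference differently.

```python
import difflib


def percent_changed(old: str, new: str) -> float:
    """Return how much `new` differs from `old`, as a percentage from 0 to 100."""
    similarity = difflib.SequenceMatcher(None, old, new).ratio()
    return (1.0 - similarity) * 100.0


def should_alert(old: str, new: str, threshold: float) -> bool:
    """True when the change meets or exceeds the PERCENTAGE threshold."""
    return percent_changed(old, new) >= threshold
```

With `PERCENTAGE=5`, a page whose content is 25% different from the stored copy would trigger an alert, while an unchanged page would not.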
Cars.com
- Set these specific env variables
SCRAPER=cars_com # Scraper to use
URL=https://www.cars.com/shopping/results/ # URL to scrape, must be on the results page, for a specific search
- Ensure all other required env variables are set
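Because the URL has to point at a results page for a specific search, it can help to sanity-check it before starting the container. The helper below is a hypothetical sketch: `is_cars_com_results_url` is not part of the project, and treating "a specific search" as "has a query string" is an assumption.

```python
from urllib.parse import urlparse


def is_cars_com_results_url(url: str) -> bool:
    """Check that `url` is a Cars.com results page carrying search parameters."""
    parts = urlparse(url)
    return (
        parts.scheme == "https"
        and parts.netloc == "www.cars.com"
        and parts.path.rstrip("/") == "/shopping/results"
        and bool(parts.query)  # assumption: a specific search has query parameters
    )
```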
Required env variables
SLEEP_TIME_SEC= # Time to sleep between each scrape
SENDER_EMAIL= # Email to send from
FROM_EMAIL= # Display name and address to send from, e.g. '"Web Scraper" <no-reply@jstockley.com>'
RECEIVER_EMAIL= # Email to send to
PASSWORD= # Password for the sender's email
SMTP_SERVER= # SMTP server to use
SMTP_PORT= # SMTP port to use
TLS= # True/False to use TLS
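To show how the email variables above fit together, here is a minimal sketch using Python's standard `smtplib`. It is not the project's implementation: `build_alert` and `send_alert` are hypothetical names, and it assumes `TLS=True` means upgrading the connection with STARTTLS rather than connecting over implicit SSL.

```python
import smtplib
from email.message import EmailMessage
from typing import Mapping


def build_alert(env: Mapping[str, str], subject: str, body: str) -> EmailMessage:
    """Build the alert email from the FROM_EMAIL and RECEIVER_EMAIL settings."""
    message = EmailMessage()
    message["From"] = env["FROM_EMAIL"]
    message["To"] = env["RECEIVER_EMAIL"]
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_alert(env: Mapping[str, str], message: EmailMessage) -> None:
    """Send `message` through the configured SMTP server."""
    with smtplib.SMTP(env["SMTP_SERVER"], int(env["SMTP_PORT"]), timeout=30) as server:
        # Assumption: TLS=True means upgrading the connection with STARTTLS.
        if env.get("TLS", "").strip().lower() == "true":
            server.starttls()
        server.login(env["SENDER_EMAIL"], env["PASSWORD"])
        server.send_message(message)
```

Passing `os.environ` as `env` would read the values straight from the container's environment; `SLEEP_TIME_SEC` would then set the pause between scrapes.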
Running multiple of the same scraper
To run 2+ scrapers of the same type, e.g. 2 diff scrapers, make sure each one maps a different host folder to /app/data/
Ex:
diff-scraper-1:
  image: jnstockley/web-scraper:latest
  volumes:
    - ./diff-scraper-1-data/:/app/data/
  environment:
    - TZ=America/Chicago
    - SCRAPER=diff
    - URL=https://google.com
    - PERCENTAGE=5
    - SLEEP_TIME_SEC=21600
diff-scraper-2:
  image: jnstockley/web-scraper:latest
  volumes:
    - ./diff-scraper-2-data/:/app/data/
  environment:
    - TZ=America/Chicago
    - SCRAPER=diff
    - URL=https://yahoo.com
    - PERCENTAGE=5
    - SLEEP_TIME_SEC=21600