A dynamic web scraper designed for AngularJS websites that runs at specified time intervals and notifies you about specific updates on the site.
# 😁 Welcome!!

## Contents

- 😁 Welcome!!
- Contents
- 🌐 Dynamic Web Scraper
- 💡 Use case examples
- ✨ Features
- 📦 Installation
- 📲 Usage
- ✏ Manual Installation
- ❌ Common errors
- 👥 Contributing
## 🌐 Dynamic Web Scraper
💻 Windows and Linux compatible. 💻
This is a dynamic web scraper designed for websites that have to wait for certain elements to load (such as AngularJS sites). It runs at specified time intervals, so it can be used to monitor when a new element is added to a website instead of having to refresh it manually.

Instead of reading the page source directly, it waits until all elements have loaded before retrieving the results, using Selenium and a Firefox driver.

Whenever a new element is discovered, it notifies you and saves the element to a file, so you won't be notified again for that same element in the future.
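The "notify once per element" behaviour can be sketched as a simple set-difference against a JSON file of previously seen items. This is a minimal illustration, not the project's actual code; the function name and file layout are assumptions:

```python
import json
from pathlib import Path


def filter_new_items(items: list[str], seen_path: Path) -> list[str]:
    """Return only items not already recorded in seen_path, then persist the union."""
    seen = set(json.loads(seen_path.read_text())) if seen_path.exists() else set()
    new_items = [item for item in items if item not in seen]
    # Persist every item ever seen, so future runs stay quiet about them
    seen_path.write_text(json.dumps(sorted(seen | set(items))))
    return new_items
```

On the first run every scraped item counts as new; on subsequent runs only genuinely unseen items would trigger a notification.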
## 💡 Use case examples
This is useful, for example, to be notified when a certain keyword appears on a website, such as:
- New job on a job board
- New product on an online store
- New article on a blog
- ...
## ✨ Features
- Automated Scraping: Runs at user-defined intervals, extracting data without manual input.
- Notification System: Notifies users via Windows notifications when new data is found.
- Robust Parsing: Utilizes customizable search strings and regular expressions for data extraction.
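The parsing feature can be pictured like this: matches are collected with an optional regular expression, falling back to the literal search string when no pattern is given (mirroring the documented `--regex` default of `search_string`). A hypothetical sketch, not the package's actual implementation:

```python
import re
from typing import Optional


def extract_matches(page_text: str, search_string: str,
                    regex: Optional[str] = None) -> list[str]:
    """Collect all occurrences of `regex`, or of the literal search string by default."""
    # re.escape makes the fallback behave as a plain-text search
    pattern = regex if regex is not None else re.escape(search_string)
    return re.findall(pattern, page_text)
```

With a capturing group in the pattern, only the captured portion is stored, which is what "store the results nicely" suggests.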
## 📦 Installation

(See below for manual installation.)

### From PyPI

Requirements: `pip` or, recommended, `pipx`. (`pipx` is optional; you can use `pip` instead.)

With `pipx`:

```bash
pipx install dynamic-scraper
```

With `pip`:

```bash
pip install dynamic-scraper
```
You can also clone the repository and install:

```bash
git clone https://github.com/P-ict0/Dynamic-Web-Scraper.git
cd Dynamic-Web-Scraper
python -m pip install .
```
## 📲 Usage

For help:

```bash
dynamic-scraper --help
```

General usage:

```bash
dynamic-scraper -u "https://www.example.com" -s "search this text"
```

Also see common errors if you encounter any issues with the browser.
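Running at a fixed interval boils down to a fetch-then-sleep loop. A minimal sketch of that idea (function and parameter names are assumptions, not the package's API):

```python
import time
from typing import Callable, Optional


def run_periodically(task: Callable[[], None], interval_minutes: float,
                     max_runs: Optional[int] = None) -> int:
    """Call `task()` every `interval_minutes`, optionally stopping after `max_runs` calls."""
    runs = 0
    while max_runs is None or runs < max_runs:
        task()  # e.g. scrape the page and fire notifications
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # don't sleep after the final run
        time.sleep(interval_minutes * 60)
    return runs
```

With `max_runs=None` this loops forever, which matches the tool's default behaviour of running until interrupted.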
### Options

- `--url, -u`: Required. The URL of the webpage from which to fetch data.
- `--search_string, -s`: Required. The string you want to search for within the webpage.
- `--regex, -r`: Optional. The regular expression pattern used to store the results nicely. Default = `search_string`.
- `--interval, -i`: Optional. The interval in minutes at which the script should run repeatedly. Default = `5`.
- `--json_path, -j`: Optional. The file path where the found results will be saved as JSON, relative to the script location. Default = `data/results.json`.
- `--use-previous, -p`: Optional. Use results from previous runs, if present. Default = `False`.
- `--no-headless`: Optional. Disable headless mode for the webdriver and run maximized.
- `--verbose, -v`: Optional. Increase verbosity level (`-v`, `-vv`, etc.). `-v`: INFO, `-vv`: DEBUG. Default: WARNING.
- `--quiet, -q`: Optional. Suppress all notifications; only get output in the console.
### Advanced Options

- `--locator-type, -t`: Optional. Type of locator to wait for the element to load. Default = `xpath`. Options: `xpath`, `id`, `class_name`, `name`, `tag_name`, `link_text`, `partial_link_text`, `css_selector`.
- `--locator-value, -l`: Optional. Value of the locator to search for. Default = `//section[@class='list-item ng-scope']`.
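The default locator value is an XPath expression with an attribute predicate. The same predicate style is supported by Python's standard library, which is enough to illustrate what the scraper waits for and extracts; this is only a sketch on static markup, while the real package drives a live browser through Selenium:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup resembling an AngularJS-rendered list
html = """
<body>
  <section class="list-item ng-scope">Job A</section>
  <section class="other">Ignore me</section>
  <section class="list-item ng-scope">Job B</section>
</body>
"""

# ElementTree supports simple XPath predicates like [@class='...']
tree = ET.fromstring(html)
matches = [el.text for el in tree.findall(".//section[@class='list-item ng-scope']")]
```

On a dynamic site the elements above would not exist in the initial source at all, which is why the tool waits for the locator to match before reading anything.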
Note: The results will be appended to the specified JSON file, creating a historical data log if run repeatedly.
## ✏ Manual Installation

```bash
git clone https://github.com/P-ict0/AngularJS-Dynamic-Web-Scraper.git
```

It is recommended to use a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate  # Linux
venv\Scripts\activate     # Windows

pip install -r requirements.txt
```

You can now run:

```bash
python src/web_scraper/scraper.py [args]
```
## ❌ Common errors
You may also need to install the latest geckodriver from here and add it to your PATH.
## 👥 Contributing
Contributions are welcome! Please fork the repository and submit a pull request with your suggested changes.