Using BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, keeping documentation for educational purposes.
Project description
Archival Spider
An efficient means of documenting your project's info.
Inspired By
Inspired by archive.org and similar free archival and curation projects. Intended to serve a broader public with a larger objective.
About
A Python project that mainly uses BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, keeping a personal record of documentation you have studied. It is not meant to be used without the webmaster's permission; this is for learning purposes only. We do not encourage you to breach any website's terms.
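For context, here is a minimal sketch of the general approach (not this package's actual API): Selenium renders the page, BeautifulSoup parses the resulting HTML, and resource URLs are collected for archiving. It assumes chromedriver is installed and on your PATH; the URL and variable names are illustrative.

```python
# Sketch only: render a page with Selenium, parse it with BeautifulSoup,
# and collect the URLs of resources the page references.
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://example.com/docs/"  # illustrative target

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
finally:
    driver.quit()

# Collect stylesheet, script, and image URLs referenced by the page.
resources = set()
for tag, attr in (("link", "href"), ("script", "src"), ("img", "src")):
    for node in soup.find_all(tag):
        if node.get(attr):
            resources.add(urljoin(url, node[attr]))

print(f"{len(resources)} resources found on {url}")
```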
To-do
- Included files within the script should be able to:
  - Follow the principles of deduplicating filesystems such as Duplicacy (cloud backup tool), Borg (deduplicating archiver), and SDFS (deduplicating FS)
  - Permit elastic mapping, so that external scripts stay stored on a CDN and network bandwidth is used instead
  - Inline styles using Pynliner, a CSS-to-inline-styles conversion tool (see the sketch after this list)
- Follow principles of mind mapping and memory techniques
- Make decentralization possible by letting saved websites be browsed offline, stored per domain
- Add as a pip package
- Add a zipper to minimize manual operations through automation and streamlining
- Add a silent mode
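For the Pynliner item above, a small example of what CSS inlining looks like (the HTML snippet is made up for illustration):

```python
# Convert <style> rules into inline style attributes with Pynliner, so an
# archived page still renders correctly without its external stylesheets.
import pynliner

html = """
<style>p { color: red; }</style>
<p>Archived paragraph</p>
"""
# fromString() is Pynliner's convenience entry point; output is roughly:
# <p style="color: red">Archived paragraph</p>
print(pynliner.fromString(html))
```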
Troubleshooting
Common Issues:
Chrome not running!

If you hit errors like `selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash`, do the following:

Run `ps aux` and check whether multiple leftover processes are still running. On Linux, `killall -9 chromedriver` and `killall -9 chrome` free up those processes so you can run the app again. On Windows, the command is `taskkill /F /IM chrome.exe`. This is usually the result of a crash mid-run and is easily fixable.
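If you prefer to script the cleanup, here is a small sketch that runs the commands from this section for the current platform (the helper name is ours, not part of the package):

```python
# Run the platform-appropriate cleanup commands from this section.
import platform
import subprocess

def kill_leftover_chrome():
    if platform.system() == "Windows":
        commands = [["taskkill", "/F", "/IM", "chrome.exe"]]
    else:  # Linux (and other POSIX systems with killall)
        commands = [["killall", "-9", "chromedriver"], ["killall", "-9", "chrome"]]
    for cmd in commands:
        # check=False: it is fine if no matching process exists.
        subprocess.run(cmd, check=False)

kill_leftover_chrome()
```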
..."encodings\cp1252.py", line 19, in encode...
UnicodeEncodeError: 'charmap' codec can't encode characters in position XXXX-YYYY: character maps to
This is a windows encoding issue and it may be possible to fix by running the following commands before running the script:
```
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
```
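Alternatively, on Python 3.7+ the script itself can force UTF-8 output, which is similar in effect to setting PYTHONIOENCODING and avoids touching the environment each time; a minimal sketch:

```python
# Force UTF-8 on the standard streams (Python 3.7+), so characters outside
# cp1252 no longer raise UnicodeEncodeError when printed on Windows.
import sys

sys.stdout.reconfigure(encoding="utf-8")
sys.stderr.reconfigure(encoding="utf-8")
```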
Donate
Donate if you can spare a few bucks for pizza, coffee, or just general sustenance. I appreciate it.
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution: archival-web-spider-netrules-0.0.1.tar.gz
Built Distribution: archival_web_spider_netrules-0.0.1-py3-none-any.whl
File details
Details for the file archival-web-spider-netrules-0.0.1.tar.gz.
File metadata
- Download URL: archival-web-spider-netrules-0.0.1.tar.gz
- Upload date:
- Size: 8.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | b2893ebf554e10bd31fcb9a1b77edee2ae90b913b7ea2703c005a10449db1003
MD5 | d52b0cff27aa411efebb89b603dbd816
BLAKE2b-256 | 9ce1bb756f108dedbed2cd27fc80b00708925a8a4f610da7d8b1777b3f309a8d
File details
Details for the file archival_web_spider_netrules-0.0.1-py3-none-any.whl.
File metadata
- Download URL: archival_web_spider_netrules-0.0.1-py3-none-any.whl
- Upload date:
- Size: 11.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | f06adaeeecf5554e78b20fdfc47a7277f49188040f4bfb18c2fcfe9ffe4d59f6
MD5 | f6ade1b51b30f2cf91a69215311c2d83
BLAKE2b-256 | 416f85517cdbf570b03c8b7f1aa9903d4a742b546037f0ce403345a47ef7fa40
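If you want to check a download against the digests above, here is a minimal verification sketch in Python (assuming the wheel sits in the current directory; the expected digest is the SHA256 value from the table above):

```python
# Compute a file's SHA256 digest and compare it with the published value.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "f06adaeeecf5554e78b20fdfc47a7277f49188040f4bfb18c2fcfe9ffe4d59f6"
print(sha256_of("archival_web_spider_netrules-0.0.1-py3-none-any.whl") == expected)
```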