
Uses BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, keeping documentation for educational purposes.

Project description

Archival Spider

An efficient means of documenting your project's information.


Inspired By

Inspired by archive.org and other free archival and curation projects. Intended to serve a broader public with a larger objective.

About

A Python project that mainly uses BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, keeping a personal record of documentation studied. Not meant to be used without the webmaster's permission; this is for learning purposes only. We do not encourage you to breach the terms of any website.
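As a rough illustration of the crawl step, the sketch below collects link targets from a fetched page using only the standard library (the project itself uses BeautifulSoup and Selenium; class and variable names here are hypothetical):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags, roughly the pass a spider
    makes when deciding which resources to fetch next."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<a href="/docs/index.html">Docs</a> <a href="style.css">CSS</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/docs/index.html', 'style.css']
```

BeautifulSoup replaces the hand-rolled parser with `soup.find_all("a")`, and Selenium supplies the rendered HTML for pages that require JavaScript.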

To-do

Troubleshooting

Common Issues:

Chrome not running!

With issues like selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash, do the following:

Run ps aux to see whether multiple stale processes are still running. On Linux, killall -9 chromedriver and killall -9 chrome free up those processes so the app can run again. On Windows, the equivalent is taskkill /F /IM chrome.exe. This usually results from mid-run crashes and is easily fixed.
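The cleanup above can be scripted cross-platform; a minimal sketch, assuming the default chrome/chromedriver process names:

```python
import platform
import shutil
import subprocess

def cleanup_commands(system=None):
    """Return the commands that free stale Chrome/chromedriver processes
    left behind by a crashed run, per operating system."""
    system = system or platform.system()
    if system == "Windows":
        return [["taskkill", "/F", "/IM", "chrome.exe"],
                ["taskkill", "/F", "/IM", "chromedriver.exe"]]
    return [["killall", "-9", "chromedriver"],
            ["killall", "-9", "chrome"]]

if __name__ == "__main__":
    for cmd in cleanup_commands():
        # Only invoke the tool if it exists; it is fine if no process matches.
        if shutil.which(cmd[0]):
            subprocess.run(cmd, check=False)
```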

..."encodings\cp1252.py", line 19, in encode...

UnicodeEncodeError: 'charmap' codec can't encode characters in position XXXX-YYYY: character maps to

This is a Windows encoding issue. It can usually be fixed by setting these environment variables before running the script: set PYTHONIOENCODING=utf-8 and set PYTHONLEGACYWINDOWSSTDIO=utf-8.
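What those variables do can be reproduced in-process: wrap the output stream in a UTF-8 text wrapper instead of the console's cp1252 one. A minimal sketch using only the standard library (the helper name is hypothetical):

```python
import io

def utf8_writer(byte_stream):
    """Wrap a binary stream so text written to it is UTF-8 encoded,
    mirroring what PYTHONIOENCODING=utf-8 does for the console at startup."""
    return io.TextIOWrapper(byte_stream, encoding="utf-8", errors="replace")

raw = io.BytesIO()
writer = utf8_writer(raw)
writer.write("saved page title → archive")  # '→' would raise under cp1252
writer.flush()
assert "→".encode("utf-8") in raw.getvalue()
```

On Python 3.7+, sys.stdout.reconfigure(encoding="utf-8") achieves the same effect for the real console stream.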

Donate

Donate if you can spare a few bucks for pizza, coffee, or just general sustenance. I appreciate it.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

archival-web-spider-netrules-0.0.1.tar.gz (8.0 kB view details)

Uploaded Source

Built Distribution

archival_web_spider_netrules-0.0.1-py3-none-any.whl (11.1 kB view details)

Uploaded Python 3

File details

Details for the file archival-web-spider-netrules-0.0.1.tar.gz.

File metadata

  • Download URL: archival-web-spider-netrules-0.0.1.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.8.7

File hashes

Hashes for archival-web-spider-netrules-0.0.1.tar.gz
Algorithm Hash digest
SHA256 b2893ebf554e10bd31fcb9a1b77edee2ae90b913b7ea2703c005a10449db1003
MD5 d52b0cff27aa411efebb89b603dbd816
BLAKE2b-256 9ce1bb756f108dedbed2cd27fc80b00708925a8a4f610da7d8b1777b3f309a8d

See more details on using hashes here.
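A downloaded file can be checked against the published digests before installing; a minimal sketch using only hashlib (the path in the usage comment is a placeholder for wherever you saved the sdist):

```python
import hashlib

# SHA-256 published for archival-web-spider-netrules-0.0.1.tar.gz
EXPECTED_SHA256 = "b2893ebf554e10bd31fcb9a1b77edee2ae90b913b7ea2703c005a10449db1003"

def sha256_of(path, chunk_size=1 << 16):
    """Stream a file through SHA-256 in chunks so large archives
    are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage:
# assert sha256_of("archival-web-spider-netrules-0.0.1.tar.gz") == EXPECTED_SHA256
```

pip can enforce the same check automatically via hash-pinned requirements files (the --hash option).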

File details

Details for the file archival_web_spider_netrules-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: archival_web_spider_netrules-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 11.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.8.7

File hashes

Hashes for archival_web_spider_netrules-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f06adaeeecf5554e78b20fdfc47a7277f49188040f4bfb18c2fcfe9ffe4d59f6
MD5 f6ade1b51b30f2cf91a69215311c2d83
BLAKE2b-256 416f85517cdbf570b03c8b7f1aa9903d4a742b546037f0ce403345a47ef7fa40

See more details on using hashes here.
