
Uses BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, keeping a record of documentation for educational purposes.

Project description

Archival Spider

An efficient means of documenting your project's information.


Inspired By

Inspired by free archival and curation projects such as archive.org. Intended to serve a broader public with a larger objective.

About

A Python project that mainly uses BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, so you can keep a personal record of the documentation you study. Not meant to be used without the webmaster's permission; this is for learning purposes only. We do not encourage you to breach the terms of any website.
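
As a rough illustration of that workflow, the sketch below renders a page with Selenium, parses the rendered HTML with BeautifulSoup, and writes it to a local folder. The URL, output path, and archive_page helper are illustrative placeholders, not the project's actual API.

```python
# Minimal crawl-and-archive sketch (assumes Chrome + chromedriver and
# Selenium 4 are installed). URL, folder, and function names are placeholders.
from pathlib import Path

from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://example.com/docs"   # hypothetical page to archive
OUTPUT_DIR = Path("archive")       # hypothetical local archive folder


def archive_page(url: str, out_dir: Path) -> None:
    """Render a page with Selenium, parse it with BeautifulSoup, save the HTML."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        out_dir.mkdir(parents=True, exist_ok=True)
        out_file = out_dir / "page.html"
        out_file.write_text(soup.prettify(), encoding="utf-8")
        # Linked resources (e.g. images) could be queued for download here.
        images = [img["src"] for img in soup.find_all("img", src=True)]
        print(f"Saved {url} to {out_file}; found {len(images)} image URLs")
    finally:
        driver.quit()


if __name__ == "__main__":
    archive_page(URL, OUTPUT_DIR)
```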

To-do

Troubleshooting

Common Issues:

Chrome not running!

If you hit errors like selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash, do the following:

Run ps aux and check whether multiple stale processes are still running. On Linux, killall -9 chromedriver and killall -9 chrome free up those processes so the app can run again. On Windows, the command is taskkill /F /IM chrome.exe. This is usually the result of a crash mid-run and is easily fixed; a small helper that wraps these commands is sketched below.
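
For convenience, the same cleanup can be scripted. This is a minimal sketch that simply shells out to the killall/taskkill commands above; the helper name is hypothetical.

```python
# Sketch of a cross-platform cleanup helper for stale browser processes.
import platform
import subprocess


def kill_stale_browser_processes() -> None:
    """Force-kill leftover chrome/chromedriver processes after a crashed run."""
    if platform.system() == "Windows":
        commands = [["taskkill", "/F", "/IM", "chrome.exe"]]
    else:
        commands = [["killall", "-9", "chromedriver"], ["killall", "-9", "chrome"]]
    for cmd in commands:
        # check=False: it's fine if no matching process exists.
        subprocess.run(cmd, check=False)


if __name__ == "__main__":
    kill_stale_browser_processes()
```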

..."encodings\cp1252.py", line 19, in encode...

UnicodeEncodeError: 'charmap' codec can't encode characters in position XXXX-YYYY: character maps to

This is a Windows encoding issue. It may be fixable by running the following commands before running the script: set PYTHONIOENCODING=utf-8 and set PYTHONLEGACYWINDOWSSTDIO=utf-8. An in-script alternative is sketched below.
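
As an in-script alternative (a sketch assuming Python 3.7+), the standard streams can be reconfigured to UTF-8 before any scraped text is printed, which avoids the cp1252 'charmap' error without setting environment variables.

```python
# Reconfigure stdout/stderr to UTF-8 so printing scraped text does not raise
# UnicodeEncodeError on Windows consoles that default to cp1252 (Python 3.7+).
import sys

sys.stdout.reconfigure(encoding="utf-8", errors="replace")
sys.stderr.reconfigure(encoding="utf-8", errors="replace")

print("Characters outside cp1252 now print safely, e.g. → and ★")
```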

Donate

Donate if you can spare a few bucks for pizza, coffee or just general sustenance. I appreciate it.


