A library to create a bot / spider / crawler.
Exoskeleton
For my dissertation I downloaded hundreds of thousands of documents and fed them into a machine learning pipeline. Using a high-speed connection is helpful, but carries the risk of running an involuntary denial-of-service attack on the servers that provide those documents. This creates a need for a crawler / scraper that avoids putting too high a load on those servers and instead runs permanently and fault-tolerantly to ultimately download all files.
Exoskeleton is a Python framework that aims to help you build a similar bot. Its main functionalities are:
- Managing a download queue within a MariaDB database.
- Avoiding processing the same URL more than once.
- Working through that queue by either:
  - downloading files to disk,
  - storing a page's source code in a database table,
  - or making PDF copies of webpages.
- Managing already downloaded files:
  - Storing multiple versions of a specific file.
  - Assigning labels to downloads, so they can be found and grouped easily.
- Sending progress reports to the admin.
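The queue-and-deduplication idea above can be sketched roughly as follows. This is a toy illustration of the pattern, not exoskeleton's actual API: it uses an in-memory SQLite table in place of MariaDB, a caller-supplied fetch function in place of real downloads, and a hypothetical `PoliteQueue` class name.

```python
import sqlite3
import time


class PoliteQueue:
    """Toy sketch: a deduplicating download queue with throttling.

    Not exoskeleton's real API; SQLite stands in for MariaDB here.
    """

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE queue (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

    def add_url(self, url: str) -> None:
        # INSERT OR IGNORE makes re-adding a known URL a no-op,
        # so each URL is processed at most once.
        self.db.execute(
            "INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))

    def process(self, fetch) -> None:
        # Work through the queue, pausing between requests so the
        # crawler never hammers the remote server.
        rows = self.db.execute(
            "SELECT url FROM queue WHERE done = 0 ORDER BY rowid").fetchall()
        for (url,) in rows:
            fetch(url)
            self.db.execute(
                "UPDATE queue SET done = 1 WHERE url = ?", (url,))
            time.sleep(self.delay)


q = PoliteQueue(delay_seconds=0.01)
q.add_url("https://example.com/a")
q.add_url("https://example.com/a")  # duplicate, silently ignored
q.add_url("https://example.com/b")

fetched = []
q.process(fetched.append)
print(fetched)  # each URL appears exactly once
```

A real crawler would also persist the queue across restarts and record failures for retry, which is where a server-side database such as MariaDB pays off.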
Exoskeleton has extensive documentation.
Two other Python libraries were created as part of this project:
- userprovided: checks user input for validity and plausibility / converts input into better formats
- bote: sends messages (currently via a local or remote SMTP server)
Hashes for exoskeleton-0.9.2-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 603db5e8a1126be7de8863884dab704fdb71b5517b4bc9a749a781cfe85189e0
MD5 | d15be3811d2ddcf54ba34c264376c5e8
BLAKE2b-256 | 550e84f19cc599cf37d790e0919d0f8fa8989fdb37e9bdbe694a306914d641ec