A library to create a bot / spider / crawler.
Exoskeleton
For my dissertation I downloaded hundreds of thousands of documents and fed them into a machine learning pipeline. A high-speed connection is helpful, but carries the risk of running an involuntary denial-of-service attack on the servers that provide those documents. This creates a need for a crawler / scraper that limits the load it puts on those servers and instead runs continuously and fault-tolerantly until all files are downloaded.
Exoskeleton is a Python framework that aims to help you build a similar bot. Its main functionalities are:
- Managing a download queue within a MariaDB database.
- Avoiding processing the same URL more than once.
- Working through that queue by either
  - downloading files to disk,
  - storing the page source code in a database table,
  - or making PDF copies of web pages.
- Managing already downloaded files:
  - Storing multiple versions of a specific file.
  - Assigning labels to downloads so they can be found and grouped easily.
- Sending progress reports to the admin.
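The core ideas above (a persistent queue, URL de-duplication, throttled and fault-tolerant processing) can be sketched in a few lines. The following is an illustrative toy, not the exoskeleton API: the class and method names are invented, and SQLite stands in for MariaDB so the example is self-contained.

```python
import hashlib
import sqlite3
import time

class TinyQueue:
    """Minimal sketch of a persistent, de-duplicating download queue."""

    def __init__(self, path=":memory:", delay=0.0):
        self.delay = delay  # seconds to wait between fetches (politeness)
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " url_hash TEXT PRIMARY KEY,"  # enforces 'each URL only once'
            " url TEXT,"
            " done INTEGER DEFAULT 0)"
        )

    def add(self, url):
        url_hash = hashlib.sha256(url.encode()).hexdigest()
        # INSERT OR IGNORE: a URL already queued (or finished) is skipped.
        self.db.execute(
            "INSERT OR IGNORE INTO queue (url_hash, url) VALUES (?, ?)",
            (url_hash, url),
        )

    def process(self, fetch):
        """Work through pending URLs, calling fetch(url) for each."""
        pending = self.db.execute(
            "SELECT url_hash, url FROM queue WHERE done = 0 ORDER BY rowid"
        ).fetchall()
        for url_hash, url in pending:
            try:
                fetch(url)
            except Exception:
                continue  # fault tolerance: stays queued, retried next run
            self.db.execute(
                "UPDATE queue SET done = 1 WHERE url_hash = ?", (url_hash,)
            )
            time.sleep(self.delay)  # throttle to spare the remote server

q = TinyQueue()
q.add("https://example.com/a.pdf")
q.add("https://example.com/a.pdf")  # duplicate: ignored
q.add("https://example.com/b.pdf")
seen = []
q.process(seen.append)
print(seen)  # the duplicate URL is fetched only once
```

Because the queue lives in a database rather than in memory, a crashed or interrupted run can simply be restarted and will pick up where it left off, which is the fault-tolerance property described above.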
Download files
Source Distribution: exoskeleton-0.9.0.tar.gz (16.8 kB)
Built Distribution
Hashes for exoskeleton-0.9.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9a446dab3b916d2c469fc96b10f21b3389074be9067b03875b4cd6c794020470
MD5 | c2c632e6f8537305c21321042e3c866c
BLAKE2b-256 | 18ece1090190f0b4362452658f053979353b52e7d89cab5be11d2cad3357a6b1