A library to create a bot / spider / crawler.
Exoskeleton
For my dissertation I downloaded hundreds of thousands of documents and fed them into a machine learning pipeline. A high-speed connection is helpful for this, but carries the risk of running an involuntary denial-of-service attack on the servers that provide those documents. This creates a need for a crawler / scraper that avoids putting too high a load on those servers, and that instead runs continuously and fault-tolerantly until it has ultimately downloaded all files.
Exoskeleton is a Python framework that aims to help you build a similar bot. Its main features are:
- Managing a download queue within a MariaDB database.
- Avoiding processing the same URL more than once.
- Working through that queue by either:
  - downloading files to disk,
  - storing a page's source code in a database table,
  - or making PDF copies of webpages.
- Managing already downloaded files:
  - storing multiple versions of a specific file,
  - assigning labels to downloads so they can be found and grouped easily.
- Sending progress reports to the admin.
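The queue-and-deduplication idea above can be sketched without the library itself. The class below is a hypothetical, in-memory stand-in, not exoskeleton's actual API (exoskeleton stores its queue in a MariaDB database): it queues each URL at most once by remembering a hash of every URL it has seen, and pauses between items to keep the load on servers low.

```python
import hashlib
import time
from collections import deque

class DownloadQueue:
    """Illustrative sketch: a deduplicating, throttled download queue."""

    def __init__(self, delay_seconds: float = 0.0):
        self.queue = deque()
        self.seen_hashes = set()          # hashes of URLs already queued
        self.delay_seconds = delay_seconds

    @staticmethod
    def url_hash(url: str) -> str:
        # Hashing keeps the "already seen" set compact for large crawls.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def add_url(self, url: str) -> bool:
        """Queue a URL once; return False if it was seen before."""
        h = self.url_hash(url)
        if h in self.seen_hashes:
            return False
        self.seen_hashes.add(h)
        self.queue.append(url)
        return True

    def process(self, handler) -> int:
        """Work through the queue, pausing between items to limit load."""
        processed = 0
        while self.queue:
            handler(self.queue.popleft())
            processed += 1
            if self.queue and self.delay_seconds:
                time.sleep(self.delay_seconds)
        return processed
```

In a real crawler the handler would download the file, store the page source, or render a PDF; storing the seen-URL hashes in a database table rather than a Python set is what lets such a queue survive restarts.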