A library to create a bot / spider / crawler.

Exoskeleton

For my dissertation I downloaded hundreds of thousands of documents and fed them into a machine learning pipeline. A high-speed connection is helpful, but it carries the risk of running an involuntary denial-of-service attack on the servers that provide those documents. This creates a need for a crawler / scraper that avoids excessive load on the connection and instead runs permanently and fault-tolerantly until it has downloaded all files.

Exoskeleton is a Python framework that aims to help you build a similar bot. Its main functionalities are:

  • Managing a download queue within a MariaDB database.
  • Avoiding processing the same URL more than once.
  • Working through that queue by either
    • downloading files to disk,
    • storing the page source code into a database table,
    • or making PDF-copies of webpages.
  • Managing already downloaded files:
    • Storing multiple versions of a specific file.
    • Assigning labels to downloads, so they can be found and grouped easily.
  • Sending progress reports to the admin.
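
The queue, deduplication, and labeling ideas listed above can be sketched in a few lines. This is only an illustration of the concepts and not Exoskeleton's actual API: the class `DownloadQueue`, its methods, and the table layout are all hypothetical, and the standard-library `sqlite3` module stands in for MariaDB so the example runs without a database server.

```python
import hashlib
import sqlite3
import time
from typing import Callable, Iterable, List


class DownloadQueue:
    """Hypothetical toy queue; sqlite3 stands in for MariaDB."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.executescript(
            """
            CREATE TABLE IF NOT EXISTS queue (
                url_hash TEXT PRIMARY KEY,
                url      TEXT NOT NULL,
                done     INTEGER NOT NULL DEFAULT 0
            );
            CREATE TABLE IF NOT EXISTS labels (
                url_hash TEXT NOT NULL,
                label    TEXT NOT NULL,
                PRIMARY KEY (url_hash, label)
            );
            """
        )

    @staticmethod
    def _hash(url: str) -> str:
        # A fixed-length hash keeps the key short even for very long URLs.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def add(self, url: str, labels: Iterable[str] = ()) -> bool:
        """Queue a URL and attach labels; return False if it was already known."""
        url_hash = self._hash(url)
        known = self.db.execute(
            "SELECT 1 FROM queue WHERE url_hash = ?", (url_hash,)
        ).fetchone()
        if known is None:
            self.db.execute(
                "INSERT INTO queue (url_hash, url) VALUES (?, ?)", (url_hash, url)
            )
        self.db.executemany(
            "INSERT OR IGNORE INTO labels (url_hash, label) VALUES (?, ?)",
            [(url_hash, label) for label in labels],
        )
        self.db.commit()
        return known is None

    def process(self, fetch: Callable[[str], object], wait: float = 1.0) -> int:
        """Call fetch() once per pending URL, pausing between requests."""
        pending = self.db.execute(
            "SELECT url_hash, url FROM queue WHERE done = 0 ORDER BY rowid"
        ).fetchall()
        processed = 0
        for url_hash, url in pending:
            try:
                fetch(url)
            except Exception:
                pass  # Entry stays pending, so a later run retries it.
            else:
                self.db.execute(
                    "UPDATE queue SET done = 1 WHERE url_hash = ?", (url_hash,)
                )
                self.db.commit()
                processed += 1
            time.sleep(wait)  # Throttle to avoid overloading the server.
        return processed

    def urls_with_label(self, label: str) -> List[str]:
        """Return all queued URLs that carry the given label."""
        rows = self.db.execute(
            "SELECT q.url FROM queue q JOIN labels l ON q.url_hash = l.url_hash "
            "WHERE l.label = ? ORDER BY q.rowid",
            (label,),
        ).fetchall()
        return [row[0] for row in rows]
```

In this sketch, `fetch` is any callable that takes a URL, for example a small wrapper around `urllib.request.urlretrieve`. Because failed entries stay marked as pending and the state lives in the database, calling `process` again after a crash or network error resumes where the previous run stopped.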

Exoskeleton has extensive documentation.

Two other Python libraries were created as part of this project:

  • userprovided : check user input for validity and plausibility / convert input into better formats
  • bote : send messages (currently via a local or remote SMTP server)

Download files

Source Distribution

exoskeleton-0.9.2.tar.gz (17.6 kB)

Built Distribution

exoskeleton-0.9.2-py3-none-any.whl (22.0 kB)
